Subj : Dupeloops
To   : mark lewis
From : Rob Swindell
Date : Wed Jun 20 2018 11:44 am

  Re: Dupeloops
  By: mark lewis to Rob Swindell on Wed Jun 20 2018 08:08 am

 >
 > On 2018 Jun 19 22:43:24, you wrote to me:
 >
 > >> AFAIK, seenbys and paths are not included in most dupe detection
 > >> schemes... other non-changing control lines are fine to be included...
 > >> one of the problems comes when some system sort those control lines on
 > >> messages they are passing along... we don't see so much of that like we
 > >> did at one time ;)
 >
 > RS> So some metadata is included in the data that is hashed for dupe
 > RS> detection and some is not?
 >
 > yes...
 >
 > RS> Are you sure about that?
 >
 > yes... in fact, and i don't recall who pointed this out to me back in the
 > '90s, dbridge does exactly this in a manner of speaking... it takes the
 > whole message header plus X bytes immediately following the message header
 > and uses all of that as at least part of the checksum calculation... this
 > was pointed out to me when i was working on my posting tool and was adding
 > MSGID support to it...
 >
 > i was using a library and just letting it do its thing... some of my test
 > posts were reported as dupes when they clearly weren't... IIRC, they were
 > detected as dupes because they were posted within the same second... it
 > turned out that my MSGID was somewhere in the middle of the control lines
 > at the beginning of the message body and only my dbridge using testers
 > were seeing this... someone pointed out this thing about dbridge also
 > using X bytes from the beginning of the message body in addition to the
 > message header so i moved my posting tool's MSGID to the top of the list
 > and no more dupes were detected by those dbridge systems...
 >
 > i don't know what other systems do... there's only a very few that provide
 > this information... SBBS is one of them...
 > when i was testing Mystic, there was some discussion about dupe detection
 > as james worked to try to figure out the best method he liked... i have
 > used fastecho here for decades but i don't know what data it uses for its
 > checksums... i do know it uses two checksums, though... i know this
 > because i was being nosy one day and looking at FE's dupe database file
 > (one for all message areas) with a hex viewer and noticed that groups of
 > bytes were repeated all throughout the file... i asked about this and was
 > told i found a bug... basically, FE has two checksums that it uses for
 > each message and both are supposed to be stored in the database... what i
 > found was that only one was being used and written to both fields... toby
 > fixed that problem right quick... i just don't know what data is used to
 > calculate them...
 >
 > back in the day, dupe detection formulas were not really shared around...
 > maybe a couple of developers talking amongst themselves would tell each
 > other what they were doing but this information was not published where
 > everyone could find it... it was more or less black majik to a point...

To complete the discussion, Synchronet (smblib) actually uses multiple methods
of body text dupe detection:

1. A "legacy" CRC-32 hash of the body text, excluding any metadata (like FTN
   control lines) and excluding any trailing white-space or control-characters

2. A tuple of hashes (MD5 digest, CRC-32, and CRC-16) and length (char count)
   of the body text, excluding any metadata and *all* white-space characters

These are in addition to the duplicate checks on Internet (RFC-822) compliant
Message-IDs and FTN-compliant Message-IDs.

No black majik here. :-)

                                            digital man

Synchronet "Real Fact" #64:
Synchronet PCMS (introduced w/v2.0) is Programmable Command and Menu Structure.
Norco, CA WX: 77.6°F, 57.0% humidity, 8 mph ENE wind, 0.00 inches rain/24hrs
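
The two body-text checks listed above can be sketched roughly as follows, in Python. This is an illustration only, not smblib's actual code: the function names are hypothetical, "metadata" is assumed to mean FTN control lines (those starting with the SOH byte, 0x01) plus SEEN-BY lines, the text is assumed to be hashed as UTF-8 bytes, and the CRC-16 variant (CRC-CCITT via `binascii.crc_hqx`) is an assumption, so the values will not necessarily match what Synchronet stores.

```python
import binascii
import hashlib
import zlib

SOH = "\x01"  # FTN control ("kludge") lines begin with this character


def strip_metadata(body: str) -> str:
    """Drop FTN control lines and SEEN-BY lines (assumed definition of metadata)."""
    kept = [
        line
        for line in body.splitlines(keepends=True)
        if not line.startswith(SOH) and not line.startswith("SEEN-BY:")
    ]
    return "".join(kept)


def legacy_crc32(body: str) -> int:
    """Method 1: CRC-32 of the text minus metadata and minus trailing
    white-space and control characters."""
    text = strip_metadata(body)
    end = len(text)
    # Trim from the end while the last character is a space or a control char.
    while end > 0 and ord(text[end - 1]) <= 0x20:
        end -= 1
    return zlib.crc32(text[:end].encode("utf-8"))


def body_fingerprint(body: str) -> tuple:
    """Method 2: (MD5 digest, CRC-32, CRC-16, char count) of the text minus
    metadata and minus *all* white-space characters."""
    squeezed = "".join(strip_metadata(body).split())
    data = squeezed.encode("utf-8")
    return (
        hashlib.md5(data).digest(),
        zlib.crc32(data),
        binascii.crc_hqx(data, 0),  # CRC-CCITT; the variant is an assumption
        len(squeezed),
    )


# Two copies of one message: different MSGIDs, re-wrapped white-space,
# and extra control lines added along the way.
copy_a = "\x01MSGID: 1:2/3 abcd1234\rHello  world\r\rBye\r\x01PATH: 1/2\rSEEN-BY: 1/2 3/4\r"
copy_b = "\x01PID: foo\r\x01MSGID: 1:2/3 ffff0000\rHello world\rBye \r\n"

seen = {body_fingerprint(copy_a)}
is_dupe = body_fingerprint(copy_b) in seen
```

Because neither hash covers the control lines, where the MSGID sits among them makes no difference to the result, unlike the header-plus-X-bytes scheme described in the quoted text. The second method also ignores re-wrapping, since all white-space is removed before hashing.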