Subj : talking to myself
To   : Maurice Kinal
From : mark lewis
Date : Sun Feb 20 2005 04:42 pm

 RT>> with his "gate" a while back, that I know of that is.

 MK> That was one situation I was thinking of.  Got under everyone's
 MK> radar despite all the sysops in that echo noticing it.  So
 MK> much for MSGID eh?  Every one of those messages was a true dupe
 MK> that not one tosser caught.

MSGID isn't the magic bullet that some seem to want to think it is... some of 
your comments appear to be saying that it should/could be and that it isn't and 
thus should be thrown in the bitbucket...

while i do /tend/ to agree, i also tend more to not agree... detecting dupes in 
fidonet is not magic nor is it tied to one thing... for various reasons, 
fidonet can't even use md5 checksums on the message body to determine if a 
message is a duplicate... message headers, existing control lines, origin and 
seenby and path lines can all be stripped or otherwise modified or corrupted... 
the only real way to tell a dupe would be by enforcing some sort of message 
body formatting and md5'ing the message body much the same way that PGP can be 
used to sign a message to show if it has been modified since sending...

  ie: when the body is generated and the message saved,
      md5 the body and store that in a control line that
      travels with the message. i still don't recall if
      there are message processors out there that alter
      the message body (ie: by replacing CRLF with LF)

even then, you then have the problem of crossposted messages... is it a dupe 
because it is exactly the same message in more than one area? i don't think 
so...

detecting duplicates in fidonet is a tricky science, to say the least... 
checking the header info and message control lines (including the origin line) 
is about the only way... still this can fail due to the way some systems have 
been retrofitted for fidonet messaging... wildcat, pcboard and wwiv systems are 
the first three that come to mind as having shoehorned retrofits for 
participation in fidonet... quite simply, their message bases were not designed 
with fidonet in mind... more to the point, they were designed without any sort of 
thought to control lines within messages...

it is long past the time when this stuff can truly be fixed and enforced... 
all we can do now is to play the game and hope for the best... that said, there 
are things that can be done to try to ensure that messages generated by your 
software do make it past the various and sundry dupe checking schemes out 
there... one of the first and easiest is to implement MSGID and ensure that it 
is the first control line after the message header... this may or may not help 
with very braindead dupe checking that looks to the header only with no regard 
for the message body at all as that system was developed with a myopic view of 
users creating messages and not with the thought that an automated process like 
text file posting or offline mail doors may post more than one message per 
second... most of the software that used that braindead method of dupe checking 
has been tossed or upgraded to something that does the same but also takes 
into consideration the first 20+ bytes of the message body...

there is still the problem of dupe checkers that use a CRC16 or CRC32 method of 
storing an "ID" of a message based on the header and 20+ bytes of the 
message... these can flag false dupes due to the simple fact that there are a 
limited number of CRC16 (65,536) and CRC32 (about 4.3 billion) results and that 
it is fairly trivial to find more than one dataset that generates the same 
CRC16 or CRC32 value...
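to put some numbers on that, here is a small python sketch... crc_id() and even_odds_count() are names i made up, the header here is just an opaque byte string, and the collision estimate is the usual birthday approximation for randomly distributed values, not a claim about any particular tosser...

```python
import math
import zlib

def crc_id(header: bytes, body: bytes, prefix_len: int = 20) -> int:
    # the "ID" is a CRC32 over the header plus only the first
    # prefix_len bytes of the body; the rest of the body is ignored
    return zlib.crc32(header + body[:prefix_len]) & 0xFFFFFFFF

def even_odds_count(bits: int) -> float:
    # birthday approximation: how many random IDs of this width are
    # needed before a collision is about as likely as not
    return math.sqrt(2 * math.log(2) * 2 ** bits)
```

two different messages with the same header and the same first 20 bytes of body get the same crc_id()... and by the estimate above, a CRC16 dataset reaches even odds of an accidental collision at roughly 300 entries, a CRC32 dataset at roughly 77,000.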

that takes us to the question of how to build a dataset of messages and what to 
use as the duplicate trigger... remembering that many things are done in binary 
in fidonet because of limited storage space as well as for speed of processing, 
we have to ask what method would ultimately be the best for quick processing, 
small storage, and generating truly unique IDs for the local duplicate 
detection system?

the first thing i can think of is to record the header info and the entire 
MSGID... the question is, then, how to record the header info? would one use 
the actual fields or would one run the header fields thru a formula like md5 or 
something else??

i can see possibly a two fold method involving recording the actual header data 
as well as running it thru md5 or some such and recording the MSGID if it 
exists...
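a minimal python sketch of that two fold record, just to make it concrete... the choice of header fields (from, to, subject, date), the DupeRecord layout and the function names are all my assumptions, not anything an existing tosser does...

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Set, Tuple

@dataclass(frozen=True)
class DupeRecord:
    header: Tuple[str, str, str, str]  # (from, to, subject, date), kept as-is
    header_md5: bytes                  # 16-byte md5 of those same fields
    msgid: Optional[str]               # full MSGID text, or None if absent

def make_record(frm: str, to: str, subject: str, date: str,
                msgid: Optional[str] = None) -> DupeRecord:
    header = (frm, to, subject, date)
    # join with NUL so ("ab", "c") and ("a", "bc") hash differently
    raw = "\x00".join(header).encode("utf-8")
    return DupeRecord(header, hashlib.md5(raw).digest(), msgid)

def is_dupe(seen: Set[DupeRecord], rec: DupeRecord) -> bool:
    # a dupe only if header, digest and MSGID all match a stored record
    if rec in seen:
        return True
    seen.add(rec)
    return False
```

with this layout, two messages posted in the same second with identical headers still count as distinct as long as their MSGIDs differ... storing both the raw fields and the digest is redundant, which is exactly why the record is not small.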

that would likely be the most thorough method but it wouldn't be the smallest data 
record per message... there's also the question of speed... how much time are 
you willing to spend rummaging thru a duplicate dataset looking for a match 
before deciding if a message is a duplicate or not? considering your high 
desire for speed, i can see small datasets (one per message area a la squish?) 
to ease the search time...

interesting problem, this is... i'm already visualising multiple dupe dataset 
files based on the AREA line, locally carried areas notwithstanding due to the 
processing of passthru areas, or one large or even multiple large datafiles 
containing AREA grouped datasets of header and MSGID data...
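the per-area idea could look something like this in python... AreaDupeStore is a made-up name, the key is whatever ID the dupe checker settles on (a MSGID string in this sketch), everything lives in memory rather than in datafiles, and treating AREA tags as case-insensitive is my assumption...

```python
from collections import defaultdict
from typing import DefaultDict, Set

class AreaDupeStore:
    """One small dupe dataset per AREA tag, held in memory."""

    def __init__(self) -> None:
        self._seen: DefaultDict[str, Set[str]] = defaultdict(set)

    def check_and_add(self, area: str, key: str) -> bool:
        # assumption: AREA tags compare case-insensitively
        seen = self._seen[area.upper()]
        if key in seen:
            return True   # already seen in this area: a dupe
        seen.add(key)
        return False      # first sighting in this area

    def size(self, area: str) -> int:
        # number of IDs recorded for one area
        return len(self._seen.get(area.upper(), ()))
```

each lookup only touches the dataset for one area, which keeps the search short... it also means a crossposted message with the same ID in two areas is not flagged as a dupe, since each area is checked on its own.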

)\/(ark

 
* Origin: (1:3634/12)
