Subj : Re: Complicated email parse or text extraction and database insertion
To   : comp.programming
From : Cédric LEMAIRE
Date : Tue Aug 16 2005 02:19 pm

I've put each email in a separate file, assuming their contents were exactly as follows.

* "mail1.txt":
**************
ID:.............. 12345
Name:............ JOHN DOE
Address:......... PO BOX 9999
City:............ Somecity
State:........... CA
Zip Code:........ 90210
===============================================================
Company Information:
1.:- Company Name:....... Perl N PHP Scripts Welcome

* "mail2.txt":
**************
Full Name -- Doe, John
Address -- PO BOX 9999
City -- Somecity St -- California
Zip -- 90210
Company Name -- Perl N PHP Scripts Welcome
ID -- 12345

* "mail3.txt":
**************
Name.....Address.....City.....State.....Zip.....Identification Number.....Company
John Doe.....PO Box 9999.....Somecity.....CA.....9-0210.....12345.....Perl N PHP Scripts Welcome

* "mail4.txt":
**************
Name.........Address.........City.........State.....Zip.......Identification Number.....Company
JOHN DOE.....PO BOX 9999.....SOMECITY.....CA......-..90210.....12345.............-........Perl N PHP Scripts Welcome

The extended-BNF parse script, called "email-parser.cwp", defines one production rule per message format. The extracted data are displayed in the console, but they could just as well be written to a file.

* "email-parser.cwp" (100 lines):
*************************************
function outputLine(sID : value, sCompany : value, sName : value,
                    sAddress : value, sCity : value, sState : value,
                    sZC : value) {
    traceLine(sID + '\t' + sCompany + '\t' + sName + '\t' + sAddress + '\t'
              + sCity + '\t' + sState + '\t' + sZC);
    // if it was a translation script, we could have written (remove //):
//  @@sID@ @sCompany@ @sName@ @sAddress@ @sCity@ @sState@ @sZC@
//@
}

email_extractor ::=
        ID_first
    |   full_name_first
    |   header_line1
    |   header_line2
    |   => error("New email format! This variant isn't handled yet.");
    ;

ID_first ::=
    [
        #skipIgnore(blanks) // skip superfluous whitespaces between records
        "ID:" #continue ['.']+ ' ' ->(:sID)[['\r']? '\n']
        "Name:" ['.']+ ' ' ->(:sName)[['\r']? '\n']
        "Address:" ['.']+ ' ' ->(:sAddress)[['\r']? '\n']
        "City:" ['.']+ ' ' ->(:sCity)[['\r']? '\n']
        "State:" ['.']+ ' ' ->(:sState)[['\r']? '\n']
        "Zip Code:" ['.']+ ' ' ->(:sZC)[['\r']? '\n']
        "==============================================================="
        [
            #ignore(blanks) #continue
            "Company Information:" "1.:-" "Company Name:" ['.']+
        ]
        ' ' ->(:sCompany)[['\r']? '\n']
        => outputLine(sID, sCompany, sName, sAddress, sCity, sState, sZC);
    ]+
    ;

function getStateShortcut(sState) {
    switch(sState.toUpperString()) {
        case "CALIFORNIA": return "CA";
        // continue for each state
        // ...
    }
}

full_name_first ::=
    [
        #skipIgnore(blanks)
        "Full Name -- " #continue ->(:sLN)", " ->(:sFN)[['\r']? '\n']
        => trimString(sLN);
        => trimString(sFN);
        => local sName = toUpperString(sFN + ' ' + sLN);
        "Address -- " ->(:sAddress)[['\r']? '\n']
        "City -- " ->(:sCity)" St -- " ->(:sState)[['\r']? '\n']
        => trimString(sState);
        => sState = getStateShortcut(sState);
        "Zip -- " ->(:sZC)[['\r']? '\n']
        "Company Name -- " ->(:sCompany)[['\r']? '\n']
        "ID -- " ->(:sID)[['\r']? '\n']
        => outputLine(sID, sCompany, sName, sAddress, sCity, sState, sZC);
    ]+
    ;

header_line1 ::=
    "Name.....Address.....City.....State.....Zip.....Identification Number.....Company"
    #continue
    #skipIgnore(blanks) // skip superfluous whitespaces
    [
        ->(:sName)"....." #continue
        => sName = toUpperString(sName);
        ->(:sAddress)"....."
        ->(:sCity)"....."
        ->(:sState)"....."
        ->(:sZC)"....."
        ->(:sID)"....."
        ->(:sCompany)[['\r']? '\n']
        => outputLine(sID, sCompany, sName, sAddress, sCity, sState, sZC);
    ]+
    #empty // end of the mail
    ;

header_line2 ::=
    "Name.........Address.........City.........State.....Zip.......Identification Number.....Company"
    #continue
    #skipIgnore(blanks) // skip superfluous whitespaces
    [
        ->(:sName)['.']+ #continue
        ->(:sAddress)['.']+
        ->(:sCity)['.']+
        ->(:sState)['.']+ '-' ['.']+
        ->(:sZC)['.']+
        ->(:sID)['.']+ '-' ['.']+
        ->(:sCompany)[['\r']? '\n']
        => outputLine(sID, sCompany, sName, sAddress, sCity, sState, sZC);
    ]+
    ;

The script "email-iterator.cws" runs the parser on each file that matches the pattern "mail*.txt":

* "email-iterator.cws":
**************************
forfile i in "mail*.txt" {
    parseAsBNF("email-parser.cwp", project, i);
}

The result, displayed in the console, is:

12345	Perl N PHP Scripts Welcome	JOHN DOE	PO BOX 9999	Somecity	CA	90210
12345	Perl N PHP Scripts Welcome	JOHN DOE	PO BOX 9999	Somecity	CA	90210
12345	Perl N PHP Scripts Welcome	JOHN DOE	PO Box 9999	Somecity	CA	9-0210
12345	Perl N PHP Scripts Welcome	JOHN DOE	PO BOX 9999	SOMECITY	CA	90210

We could also have normalized the spelling of the city and the zip code (as was done for the name, for instance).
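To illustrate that last remark, here is a minimal sketch of such a normalization, written in Python rather than CodeWorker so that it can be run on its own. It is not part of the scripts above: the function names are hypothetical, and it assumes five-digit zip codes (a ZIP+4 code would lose its hyphen) and the column order used by outputLine (ID, company, name, address, city, state, zip).

```python
import re


def normalize_city(city: str) -> str:
    """Upper-case the city, mirroring what the parser already does to the name."""
    return city.strip().upper()


def normalize_zip(zip_code: str) -> str:
    """Keep only digits, so a stray hyphen as in '9-0210' disappears.

    Assumption: plain five-digit zip codes only.
    """
    return re.sub(r"\D", "", zip_code)


def normalize_record(line: str) -> str:
    """Normalize one tab-separated output line.

    Assumed column order: ID, company, name, address, city, state, zip.
    """
    fields = line.split("\t")
    fields[4] = normalize_city(fields[4])
    fields[6] = normalize_zip(fields[6])
    return "\t".join(fields)


# The third result line above, whose zip code still carries a hyphen.
record = "12345\tPerl N PHP Scripts Welcome\tJOHN DOE\tPO Box 9999\tSomecity\tCA\t9-0210"
print(normalize_record(record))
```

In the parser itself, the same effect would more naturally be obtained by transforming sCity and sZC just before each call to outputLine, the way sName is upper-cased in header_line1.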