# 20200714

## Motivation

[Erowid Recruiter](https://twitter.com/erowidrecruiter) is a fun Twitter account powered by Markov chains. How hard could it be to recreate this?

## Erowid

A fascinating web 1.0 site with trip reports for seemingly every known substance.

- Experiences list: https://www.erowid.org/experiences/exp.cgi?OldSort=PDA_RA&ShowViews=0&Cellar=0&Start=100&Max=1000
- Experience (HTML): https://www.erowid.org/experiences/exp.php?ID=005513
- Experience (LaTeX): https://www.erowid.org/experiences/exp_pdf.php?ID=093696&format=latex
- Blocked: https://blackhole.erowid.org/blocked.shtml

Insights gathered over the last two weeks:

- There's no convenient option to download all reports, and the website discourages downloading and analyzing them. However, the experience search offers a customizable limit; by increasing it to the maximum, you can obtain all valid IDs. This list is not consecutive, most likely due to the review process.
- It's possible to get blocked, so play it safe and do no more than 5000 requests per day.
- HTML reports are unusually broken: key characters (angle brackets, ampersands, quotes) are not consistently escaped using HTML entities, there's little semantic formatting, copious comments suggest basic editing work, and closing tags are often omitted.
- LaTeX reports are mildly broken: quotes aren't consistently escaped and some HTML comments are halfway preserved in the export.

I've contacted the Erowid Recruiter author and they revealed that they handpicked their favorite trip reports. I don't really want to spend that much time.

First I tried downloading random HTML reports and stored in a database whether access was successful or triggered an error. I later learned that there's a LaTeX export and a full list of experiences, with a subset marked as outstanding using one, two or three stars. I downloaded them all, fixed some HTML comment fuck-ups and wrote Scheme code to extract the report from the LaTeX template and convert the LaTeX syntax to plain text. My first attempt at Emacs-style text processing was comically verbose and unsuccessful, so I resorted to parsing a sequence of tokens (accounting for LaTeX insanity), converting the tokens to TeX commands, then interpreting those specially to turn them into minimally marked up text.

## Recruiter

I haven't found any good public data for recruiter email. I bet the real thing draws from personal email. My plan is to instead use generic spam from [this data set](http://untroubled.org/spam/), hence the name "madads". We'll see how this goes and how much cleaning is required.

# 20200720

## Recruiter

Initially I thought I could just download all the archives, extract them and look at the data, but I had underestimated just how many files there are. Take for example the 2011 archive: a file clocking in at almost 100M that takes surprisingly long to decompress. At 10% into the decompression and 300M of text files, I canceled it. Instead I grabbed the archives from 1998 to 2003, which extracted to a few thousand files at a far more manageable 472M.

Some massaging is definitely required before further processing. `file` recognizes many files as emails, except for some starting with a "From " line. There is a later "From: " line which is clearly an email header, so I looked up some `sed` magic to delete the first line if it has this pattern: `sed --in-place=bak '1{/From /d;}' */*.txt`.

The emails themselves have structure and can be parsed. My plan is to extract plain text whenever possible and to fall back to making sense of HTML if necessary.
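For a rough idea of what that could look like, here's a sketch using Gauche's `rfc.822` header parser (just an assumption at this point, not code I've run; the function names and the Content-Type heuristic are made up for illustration):

```scheme
(use rfc.822)

;; Read the headers of the email at PATH, skipping a leading mbox-style
;; "From " separator line if present (real header lines use "From: ").
(define (read-email-headers path)
  (call-with-input-file path
    (lambda (in)
      (let ((line (read-line in)))
        (if (and (string? line) (string-prefix? "From " line))
            (rfc822-read-headers in)  ; separator consumed, parse the rest
            ;; first line was already a header, so reparse from the top
            (call-with-input-file path rfc822-read-headers))))))

;; Prefer messages that declare themselves as plain text; everything else
;; would go through the HTML fallback.
(define (plaintext? headers)
  (string-prefix? "text/plain"
                  (rfc822-header-ref headers "content-type" "text/plain")))
```

Multipart messages would still need MIME handling on top of this.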
[hato](https://github.com/ashinn/hato) seems to be a good codebase to study for that.

# 20200803

## Recruiter

I even considered using Gauche's built-in email parser, but then a friend reminded me that [mblaze](https://git.vuxu.org/mblaze/) is a thing, a suite of tools for wrangling maildir-style mailboxes. As usual I immediately ran into a bug with `mshow` and got it fixed for the 1.0 release. Using some shell one-liners I extracted 33k plaintext messages from 99k files, a far better success rate than expected. That leaves text generation.

## Markov

This turned out to be easier than expected. Take the n-grams of a text, split each n-gram into a prefix (all but the last word) and a suffix (the last word), and track the seen combinations in a hash table mapping each prefix to its observed suffixes. To generate text, pick a random prefix from the hash table, look up its suffixes, pick a random one and combine the chosen prefix and suffix into a new n-gram to repeat the process. This can end prematurely if you end up with a prefix that has no suffixes (for example at the end of the text), but it can otherwise be repeated as often as needed to generate the necessary amount of text. A sketch of this follows the list below.

Many tweaks to this basic idea are possible:

- Rather than just splitting the text on whitespace into words, do extra clean-up to deal with quotes and other funny characters.
- Save the generated hash table to disk and load it up again to avoid expensive recomputation.
- Find better starting points, for example by looking for sentence starters (capitalized words).
- Find better stopping points, for example by looking for sentence enders (punctuation).
- Combine several text files and find some way to judge which ones go particularly well together (perhaps the overlap between prefixes is a suitable metric?).
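Here is a minimal Gauche sketch of that procedure. It only splits on whitespace and applies none of the tweaks above; the function names and the srfi-1/srfi-27 helpers are my own choices, not necessarily what the final code looks like.

```scheme
(use srfi-1)   ; take, last, remove
(use srfi-27)  ; random-integer

(random-source-randomize! default-random-source)

;; All consecutive n-word windows of WORDS.
(define (ngrams words n)
  (let loop ((words words) (acc '()))
    (if (< (length words) n)
        (reverse acc)
        (loop (cdr words) (cons (take words n) acc)))))

;; Map every (n-1)-word prefix to the list of suffix words seen after it.
(define (build-chain text n)
  (let ((table (make-hash-table 'equal?)))
    (for-each (lambda (gram)
                (hash-table-push! table (take gram (- n 1)) (last gram)))
              (ngrams (remove string-null? (string-split text #/\s+/)) n))
    table))

(define (pick lst)
  (list-ref lst (random-integer (length lst))))

;; Walk the chain from a random prefix until a dead end or MAX-WORDS words.
(define (generate table max-words)
  (let ((start (pick (hash-table-keys table))))
    (let loop ((prefix start) (out (reverse start)) (n (length start)))
      (let ((suffixes (hash-table-get table prefix '())))
        (if (or (null? suffixes) (>= n max-words))
            (string-join (reverse out) " ")
            (let ((suffix (pick suffixes)))
              (loop (append (cdr prefix) (list suffix))
                    (cons suffix out)
                    (+ n 1))))))))
```

With trigrams that would be something like `(generate (build-chain corpus 3) 50)`. Note that `hash-table-push!` keeps duplicate suffixes, which is what makes common continuations more likely to be picked.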