[HN Gopher] Text Classification by Data Compression
___________________________________________________________________
Text Classification by Data Compression
Author : Lemaxoxo
Score : 32 points
Date : 2021-06-08 20:08 UTC (2 hours ago)
(HTM) web link (maxhalford.github.io)
(TXT) w3m dump (maxhalford.github.io)
| sean_pedersen wrote:
| Cool idea! Shouldn't this also work by concatenating the single
| document (the one you want to classify) with the compressed
| version of the concatenated class corpus, saving compute time?
| Lemaxoxo wrote:
| I think that is what is being suggested in the other comment.
| One would have to try! My instincts tell me the results would
| not be identical.
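The approach under discussion can be sketched in a few lines of Python with zlib (a toy example with made-up corpora; the linked article benchmarks several compressors, not necessarily zlib): classify a document by appending it to each class corpus and picking the class whose compressed size grows the least.

```python
import zlib

# Hypothetical toy class corpora; a real setup would use the training
# documents of each class concatenated together.
corpora = {
    "positive": b"great wonderful amazing lovely superb " * 50,
    "negative": b"awful terrible horrible dreadful nasty " * 50,
}

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def classify(doc: bytes) -> str:
    # Pick the class whose corpus grows the least (in compressed bytes)
    # when the document is appended to it: shared patterns compress away.
    return min(
        corpora,
        key=lambda c: compressed_size(corpora[c] + doc)
                      - compressed_size(corpora[c]),
    )

print(classify(b"what a wonderful and lovely day"))
```

Note that this recompresses each full class corpus per document, which is exactly the compute cost the comments above are trying to avoid.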
| lovasoa wrote:
| You don't have to recompress the whole corpus to add a single
| document to it. All the compression algorithms mentioned here
| work in a streaming fashion. You could "just" save the internal
| state of the algorithm after compressing the training data, and
| then reuse that state for each classification task.
| Lemaxoxo wrote:
| I suspected this. However, I wasn't able to grok the
| documentation well enough, and I couldn't find a convincing
| example. It seems to me that these Python compressors get
| "frozen" and can't be used to compress further data.
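For what it's worth, zlib's compression objects do expose a `copy()` method, which appears to allow the state-saving trick lovasoa describes: compress the training corpus once, then clone the live compressor for each document (a sketch; `flush()` on each clone also emits whatever corpus bytes were still buffered, but that is a constant offset, so comparisons between documents remain valid):

```python
import zlib

# Hypothetical training corpus; compress it once and keep the live state.
training_data = b"the quick brown fox jumps over the lazy dog " * 100

base = zlib.compressobj(9)
base.compress(training_data)

def extra_bytes(doc: bytes) -> int:
    # compressobj.copy() clones the internal compressor state, so each
    # document can be scored without recompressing the corpus. The clone
    # is finalized by flush(); the base object stays reusable.
    c = base.copy()
    return len(c.compress(doc)) + len(c.flush())

print(extra_bytes(b"the quick brown fox jumps over the lazy dog"))
print(extra_bytes(b"0123456789 !@#$%^&*() []{};:<>?/ ~"))
```

Text resembling the corpus should yield fewer extra bytes than text the compressor has never seen.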
| spullara wrote:
| Was going to come here to say that. Played around with this a
| bit for compressing small fields using a learned dictionary:
|
| https://github.com/spullara/corpuscompression
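The learned-dictionary idea for small fields can be illustrated with zlib's `zdict` parameter (a toy sketch with an invented dictionary; corpuscompression builds its dictionaries from real sample data rather than a single hand-written string):

```python
import zlib

# Hypothetical dictionary "learned" from typical field values.
learned = b'{"status": "active", "type": "user", "country": "US"}'

# A small field to compress; it shares most of its structure with
# the dictionary, so the dictionary-primed compressor does far better.
field = b'{"status": "active", "type": "admin", "country": "US"}'

with_dict = zlib.compressobj(9, zdict=learned)
plain = zlib.compressobj(9)

n_dict = len(with_dict.compress(field) + with_dict.flush())
n_plain = len(plain.compress(field) + plain.flush())
print(n_dict, n_plain)
```

On fields this short, plain deflate has almost nothing to work with, while the preset dictionary lets the compressor emit back-references from the very first byte.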
| thomasluce wrote:
| I worked for an internet scraping/statistics gathering company
| some years ago, and we used this approach alongside a few others
| to find mailing addresses embedded in websites. Basically use
| LZW-type compression with entropy information only trained on
| known addresses, and then compress a document, looking for the
| section of the document with the highest compression ratio.
|
| It worked decently well, and surprisingly better than a lot of
| other, more standard approaches just because of the wild non-
| uniformity of human-generated content on the web.
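The sliding-window idea above can be sketched with zlib and a preset dictionary standing in for the trained entropy model (everything here is illustrative: the dictionary contents, window size, and document are invented, and the original system used LZW-style compression rather than zlib):

```python
import zlib

# Hypothetical dictionary of known-address vocabulary; the original
# system trained its model on a corpus of real addresses.
address_dict = (b"123 Main Street Springfield IL 62704 "
                b"PO Box Apt Suite Ave Blvd Road ")

def ratio(window: bytes) -> float:
    # A fresh compressor per window: once flushed, a compressor
    # can't be reused.
    c = zlib.compressobj(9, zdict=address_dict)
    out = c.compress(window) + c.flush()
    return len(out) / len(window)

doc = (b"Welcome to our homepage full of news and articles. "
       b"Contact us at 123 Main Street Springfield IL 62704. "
       b"More unrelated prose follows here about many topics. ")

# Slide a fixed-size window over the document and keep the window
# with the best (lowest) compression ratio against the address model.
size = 50
best = min(range(len(doc) - size), key=lambda i: ratio(doc[i:i + size]))
print(doc[best:best + size])
```

The window overlapping the embedded address compresses far better than the surrounding prose, so the minimum-ratio window lands on it.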
___________________________________________________________________
(page generated 2021-06-08 23:00 UTC)