[HN Gopher] Large language model data pipelines and Common Crawl
       ___________________________________________________________________
        
       Large language model data pipelines and Common Crawl
        
       Author : sonabinu
       Score  : 124 points
       Date   : 2024-06-18 23:42 UTC (23 hours ago)
        
 (HTM) web link (blog.christianperone.com)
 (TXT) w3m dump (blog.christianperone.com)
        
       | barrenko wrote:
       | This is a great blog btw.
        
       | fbdab103 wrote:
       | "After that it removes (or replaces) Unicode punctuation, performs
       | a SHA1 hashing, and uses the first 8 bytes for deduplication
       | comparisons (paragraph level)"
       | 
       | Is taking the first few bytes that much faster than comparing the
       | entire hash? Or is it for something else? That is one of those
       | performance optimizations I would go back and forth on endlessly,
       | wondering if I lost something by trying to shave off a few cycles.
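
The quoted step can be sketched roughly as below. This is a minimal
illustration, not the pipeline's actual code: the function names
(normalize, paragraph_key, dedup_paragraphs) are hypothetical, and the
normalization rules (lowercasing, dropping every character in a Unicode
"P*" category, collapsing whitespace) are assumptions. Only the "SHA-1,
keep the first 8 bytes, compare per paragraph" part comes from the quote.

```python
import hashlib
import unicodedata


def normalize(paragraph: str) -> str:
    """Lowercase, drop Unicode punctuation, collapse whitespace (assumed rules)."""
    kept = (
        ch for ch in paragraph.lower()
        if not unicodedata.category(ch).startswith("P")  # P* = punctuation
    )
    return " ".join("".join(kept).split())


def paragraph_key(paragraph: str) -> bytes:
    """First 8 bytes (64 bits) of the SHA-1 digest of the normalized text."""
    digest = hashlib.sha1(normalize(paragraph).encode("utf-8")).digest()
    return digest[:8]  # full digest is 20 bytes; only 8 are kept


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, judged by its 8-byte key."""
    seen = set()
    unique = []
    for p in paragraphs:
        key = paragraph_key(p)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique


if __name__ == "__main__":
    docs = ["Hello, world!", "hello world", "Something else."]
    print(dedup_paragraphs(docs))
```

Truncation does not make the hash itself cheaper to compute; what
shrinks is each stored key, from 20 bytes to 8. The cost is a higher
chance of false matches: for n distinct paragraphs the birthday bound
puts the probability of at least one collision at roughly n^2 / 2^65.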
        
         | npn wrote:
         | You want to store the hashed values in the database as
         | efficiently as possible, hence the first 8 bytes.
         | 
         | Though I think a 64-bit hash algorithm might be more suitable
         | than SHA-1. Personally I use FNV-1a for hashing (not the
         | fastest, but trivial to implement) instead.
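
For reference, a minimal sketch of 64-bit FNV-1a, using the standard
published offset basis and prime. The function name fnv1a_64 is
hypothetical, and this is not taken from the commenter's code. FNV-1a
is a non-cryptographic hash, so it fits deduplication keys but not
anything security-sensitive.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF              # keep results within 64 bits


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of data as an unsigned integer."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                        # FNV-1a: XOR the byte in first...
        h = (h * FNV64_PRIME) & MASK64   # ...then multiply, modulo 2**64
    return h


if __name__ == "__main__":
    print(hex(fnv1a_64("some paragraph".encode("utf-8"))))
```

The result already fits in 8 bytes, so no truncation step is needed
before storing it.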
        
       | msp26 wrote:
       | The section on deduplication was very useful, thanks for posting.
        
       | alhaad wrote:
       | Nicely written, thanks for posting!
       | 
       | I was reminded of recent LLM wins coming from training data
       | improvements (e.g. FineWeb).
        
       | hobofan wrote:
       | Does anyone know of a maintained alternative to fasttext? It is
       | mentioned here for language identification, but clicking through
       | to the GitHub project, it looks to be recently archived.
       | 
       | I usually use a BERT model for text classification these days,
       | but would like to have an alternative that is less CPU-heavy,
       | like fasttext, at hand for high-volume use cases.
        
         | mhuffman wrote:
         | fasttext appears to be up and fine[0], with fairly recent
         | activity[1]. If you are looking for something similar, maybe
         | word2vec?
         | 
         | [0] https://fasttext.cc/
         | 
         | [1] https://github.com/facebookresearch/fastText
        
           | infecto wrote:
           | "This repository has been archived by the owner on Mar 19,
           | 2024. It is now read-only."
           | 
           | I would assume that, since it has been archived, no future
           | work will happen on it.
        
             | yorwba wrote:
             | The list of forks
             | https://github.com/facebookresearch/fastText/forks has
             | explosion.ai (makers of spaCy)'s floret
             | https://github.com/explosion/floret as the most starred
             | one.
        
           | ds_opseeker wrote:
           | fasttext failed for me with NumPy 2.0.
           | 
           | I reverted to NumPy 1.26.4 and it was fine.
        
       | spott wrote:
       | (2023)
       | 
       | Still very useful, but it should probably have a date in the
       | title.
        
       ___________________________________________________________________
       (page generated 2024-06-19 23:01 UTC)