[HN Gopher] Large language model data pipelines and Common Crawl
___________________________________________________________________
Large language model data pipelines and Common Crawl
Author : sonabinu
Score : 124 points
Date : 2024-06-18 23:42 UTC (23 hours ago)
(HTM) web link (blog.christianperone.com)
(TXT) w3m dump (blog.christianperone.com)
| barrenko wrote:
| This is a great blog btw.
| fbdab103 wrote:
| After that it removes (or replaces) Unicode punctuation, performs
| a SHA1 hashing, and uses the first 8 bytes for deduplication
| comparisons (paragraph level)
|
| Is taking the first few bytes that much faster than comparing the
| entire hash? Or something else? That is one of those performance
| optimizations I would go back and forth on endlessly wondering if
| I lost something by trying to shave off a few cycles.
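
The quoted step (strip Unicode punctuation, SHA1, keep the first 8 bytes, compare per paragraph) can be sketched as follows. This is a minimal illustration, not the article's actual pipeline code: the names `paragraph_key` and `dedup_paragraphs` are hypothetical, and the whitespace collapsing is an added assumption that the quote does not mention.

```python
import hashlib
import unicodedata


def paragraph_key(paragraph: str) -> bytes:
    """Return a 64-bit dedup key: the first 8 bytes of the SHA1 digest."""
    # Drop every character whose Unicode category is punctuation (P*).
    no_punct = "".join(
        ch for ch in paragraph if not unicodedata.category(ch).startswith("P")
    )
    # Assumption (not in the quote): collapse runs of whitespace.
    normalized = " ".join(no_punct.split())
    # SHA1 produces 20 bytes; keep only the first 8 as the comparison key.
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:8]


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, compared by truncated hash."""
    seen = set()
    kept = []
    for p in paragraphs:
        key = paragraph_key(p)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```

In a sketch like this the truncation is about memory rather than comparison speed: each key in the `seen` set takes 8 bytes instead of 20. The cost is a small chance of false duplicates; by the birthday approximation, n distinct paragraphs collide with probability of roughly n^2 / 2^65 while n is well below 2^32.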
| npn wrote:
| You want to store the hashed values in the database as efficiently as
| possible, hence the first 8 bytes.
|
| Though I think a 64-bit hash algorithm might be more suitable
| than SHA1. Personally I use FNV-1a for hashing instead (not the
| fastest, but trivial to implement).
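
For reference, a 64-bit FNV-1a really is only a few lines. The sketch below uses the standard published offset basis and prime for the 64-bit variant; the function name `fnv1a_64` is just an illustrative choice, and a pure-Python loop like this would likely be slow for high-volume use compared with a native implementation.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF              # keep results within 64 bits


def fnv1a_64(data: bytes) -> int:
    """Compute the 64-bit FNV-1a hash of a byte string."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte                        # FNV-1a: XOR the byte in first...
        h = (h * FNV64_PRIME) & MASK64   # ...then multiply, wrapping at 64 bits
    return h
```

Calling `fnv1a_64(paragraph.encode("utf-8"))` yields an 8-byte key directly, with no truncation step needed.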
| msp26 wrote:
| The section on deduplication was very useful, thanks for posting.
| alhaad wrote:
| Nicely written, thanks for posting!
|
| I was reminded of recent LLM wins coming from training data
| improvements (e.g. FineWeb).
| hobofan wrote:
| Does anyone know of a maintained alternative to fasttext? It is
| mentioned here for language identification, but clicking through
| to the GitHub project, it looks to be recently archived.
|
| I usually use a BERT model for text classification these days,
| but would like to have an alternative that is less CPU-heavy, like
| fasttext, at hand for high-volume use cases.
| mhuffman wrote:
| fasttext appears to be up and fine,[0] with fairly recent
| activity.[1] If you are looking for something similar, maybe
| word2vec?
|
| [0] https://fasttext.cc/
|
| [1] https://github.com/facebookresearch/fastText
| infecto wrote:
| "This repository has been archived by the owner on Mar 19,
| 2024. It is now read-only."
|
| I would assume that, since it has been archived, no future
| work will happen on it.
| yorwba wrote:
| In the list of forks
| https://github.com/facebookresearch/fastText/forks the most
| starred one is floret https://github.com/explosion/floret by
| explosion.ai (makers of spaCy).
| ds_opseeker wrote:
| fasttext failed for me with numpy 2.0.
|
| Reverted to numpy 1.26.4 and it was fine.
| spott wrote:
| (2023)
|
| Still very useful, but it should probably have a date in the
| title.
___________________________________________________________________
(page generated 2024-06-19 23:01 UTC)