[HN Gopher] Full LLM training and evaluation toolkit
___________________________________________________________________
Full LLM training and evaluation toolkit
Author : testerui
Score : 132 points
Date : 2024-11-24 15:44 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| timhigins wrote:
| Might be worth updating the title to "SmolLM: state-of-the-art
| small language model trained on open datasets" (See the first
| table of https://huggingface.co/blog/smollm for benchmarks)
|
| It was fascinating digging into this to find their dataset
| weights defined in a declarative YAML file [2]. 70% is from
| FineWeb/Common Crawl, but filtered using a classifier trained on
| Llama-70B's 0-5 ratings of the educational content of the text
| [3]. This is something we know small models like Phi-3 have been
| doing for a while, but it's great to see a fully open
| reproduction of it that beats their benchmarks. It definitely
| supports the idea that you can get even better reasoning at
| smaller model sizes by carefully filtering and curating your
| training data (and by generating good synthetic data from, or
| distilling, bigger models).
|
| You can see the 450k Llama educational value scores here:
| https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-ll...
| It's interesting: I think the documents scored 3 are really
| good, but the ones scored 5 tend not to be very reasoning- or
| information-heavy and often just mention education or a
| worksheet. For SmolLM they just took the documents with scores
| >= 3, so it doesn't matter a ton.
|
| 2.
| https://github.com/huggingface/smollm/blob/9efce803bc7e37727...
| 3. https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
| timhigins wrote:
| Update: While SmolLM was SOTA at the time of release in July,
| SmolLM 2 1.7B (which is the newest release) is not currently
| the best model under 2B params on
| https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
| abeppu wrote:
| While it's great that this is open source, and I understand the
| pressure for smaller models that can be run in a wider range of
| contexts, I continue to be annoyed that authors keep posting
| comparisons to models which are slightly smaller.
|
| On this page, SmolLM2-1.7B does a bit better than Qwen2.5-1.5B,
| which is ahead of Llama3.2-1B. At the next size level up, in
| other comparisons I've seen, e.g. Phi-3.5 (which is ~3.8B
| params) does a bit better than Llama 3.2 3B. Gemma 2 has a 9B
| size, Llama 3.1 has an 8B size, and I think when that came out
| Mistral had a 7B model -- so whenever a new "small" model does
| "better" than its peers, we can't easily tell whether the many
| small choices the authors made were actually better or whether
| it's simply winning on size.
| bashfulpup wrote:
| Pythia is stupidly easy to use.
|
| Then hook up a simple test harness. This is like a grand total
| of 3 commands: git pull, install, then point it at a model and
| run.
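|
| (As a concrete example of how little setup Pythia needs, here
| is a minimal sketch, assuming the standard Hugging Face
| transformers API and one of the public Pythia checkpoints; the
| prompt is just a placeholder.)
|
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     # Sketch: load a Pythia checkpoint and generate from a prompt.
|     name = "EleutherAI/pythia-1.4b"
|     tokenizer = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     prompt = "The Pythia suite was trained on"
|     inputs = tokenizer(prompt, return_tensors="pt")
|     out = model.generate(**inputs, max_new_tokens=20)
|     print(tokenizer.decode(out[0], skip_special_tokens=True))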
___________________________________________________________________
(page generated 2024-11-24 23:00 UTC)