[HN Gopher] TinyStories: How Small Can Language Models Be and St...
___________________________________________________________________
TinyStories: How Small Can Language Models Be and Still Speak
Coherent English? (2023)
Author : tzury
Score : 111 points
Date : 2025-01-02 17:54 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jasonjmcghee wrote:
| Edit: ah. This paper is from May 2023. Might be worth putting
| that in the title.
|
| ---
|
| > Models with around 125M parameters such as GPT-Neo (small) or
| GPT-2 (small) can rarely generate coherent and consistent English
| text beyond a few words
|
| These models are 5 years old.
|
| I have to wonder if the authors have seen RWKV 7 0.1B, because it
| blows away just about every other model I've seen at that size.
|
| The capabilities it has vs the examples in the paper are night
| and day.
|
| https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1
| jmward01 wrote:
| The age of tiny models is just about here. We are finally
| busting away from the standard transformer block and training
| recipe. I have a side project that can dramatically shrink
| networks via a set of techniques I call sacrificial training
| [1]. I think things like this will finally swing things back
| to on-prem and on-device small models that are as capable as
| today's big hosted models.
|
| [1] https://github.com/jmward01/lmplay/wiki/Sacrificial-Training
| refulgentis wrote:
| RWKV is def. better than TinyStories 125MB.
|
| Unfortunately, I have only seen 3 models, 3B or over, handle
| RAG.
|
| Tested RWKV with a simple in-the-sports-news question and it
| didn't even come close to addressing the question. And nearly
| everything was fundamentally incoherent, even within its own
| internal reality (ex. a player gets 5000/game and is the first
| with 1000 in 16 games).
|
| (prompt: https://pastebin.com/sCLn5sCJ, response:
| https://pastebin.com/TqudvDbN)
|
| I don't think there's a place in the 2025 market for LLMs that
| are "just" writers.
| jasonjmcghee wrote:
| These tiny models in general have really weird failure
| modes. I tried the tiny stories prompt about asking mom for
| a dog who said no, and it output an incredibly dark story
| about how she asked her dad and they got a dog but it had
| pancreatic cancer (paraphrasing, it went into detail about
| the surgery etc.) and then started writing an
| informational PSA about who is at risk of pancreatic cancer
| etc.
| kube-system wrote:
| What I find fascinating is how ML models hallucinate in a
| way that is sometimes reminiscent of a fever dream.
| ethbr1 wrote:
| It makes sense that the failure modes of language
| prediction look a lot like ADD.
| nerdponx wrote:
| Lest we forget that this stream-of-consciousness
| confusion was state of the art just a few years ago.
|
| It makes sense if you think about it: a small model's
| "internal state" isn't rich enough to keep track of
| whatever it was supposed to be talking about.
|
| It makes me think that the reason LLMs need to be so
| large is that the internal state needs to be bigger than
| a typical human "idea", whatever that might mean.
| jmward01 wrote:
| I plan on checking out RWKV and seeing if I can add my
| sacrificial training techniques to it this weekend. There is
| a reason quantization works: models are very badly trained
| right now. I think we can get really good performance out of
| 0.1B and 1B models, which opens up the world to fine-tuning
| again. I was playing with fine-tuning Llama 7B and 13B a
| while back, but the HW/SW stack made it so unwieldy and the
| ROI was terrible compared to just adjusting prompts on
| gpt-4o-mini and the like. I have hope that we are about to
| see single-GPU, very simple fine-tuning again as models
| shrink and GPUs grow.
| daxfohl wrote:
| Would there be any way to distribute RAG across multiple
| smaller models? Rather than one giant model handling your
| entire document base, have it be more of a tree where the
| top level classifies the docs into top-level categories and
| sends them to submodels to subclassify, etc.? (It doesn't
| have to be 1:1 classification.) And the same for Q&A search?
|
| These could all presumably be the same physical instance,
| just each query would use a different system prompt and
| perhaps different embeddings. (I'm guessing; I don't
| actually know how RAG works). So, a little slower and
| clunkier, but presumably way more efficient. And the quality
| could be anywhere from horrible to better than one large
| model. This would be more like how businesses organize docs.
|
| Or maybe there's no real benefit to this, and each
| subclassifier would require just as big of a model as if
| you were to throw all docs into a single model anyway. I
| assume it's probably been tried before.
| refulgentis wrote:
| TL;DR: It's a very interesting line of thought that as
| late as Q2 2024, there were a couple thought leaders who
| pushed the idea we'd have, like 16 specialized local
| models.
|
| I could see that in the very long term, but as it stands,
| it works the way you intuited: 2 turkeys don't make an
| eagle, i.e. there's some critical size where it's speaking
| coherently, and it's at least an OOM bigger than it needs
| to be in order to be interesting for products.
|
| fwiw RAG for me in this case is (rough sketch below):
|
| - user asks q.
|
| - llm generates search queries.
|
| - search api returns urls.
|
| - web view downloads urls.
|
| - app turns html to text.
|
| - local embedding model turns text into chunks.
|
| - app decides, based on "character" limit configured by
| user, how many chunks to send.
|
| - LLM gets all the chunks, instructions + original
| question, and answers.
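|
| Roughly, that flow as a Python sketch (the search and LLM calls
| are placeholders for whatever you actually use, not my app's
| code):
|
|   # Sketch only: search_web() and ask_llm() stand in for whatever
|   # search API and local model you actually call.
|   import re, requests
|   from sentence_transformers import SentenceTransformer, util
|
|   def search_web(query: str) -> list[str]:
|       ...  # return result URLs from your search API of choice
|
|   def ask_llm(prompt: str) -> str:
|       ...  # run the local model (llama.cpp, RWKV, etc.)
|
|   def answer(question: str, char_limit: int = 6000) -> str:
|       query = ask_llm(f"Write one web search query for: {question}")
|       urls = search_web(query)
|       pages = [requests.get(u, timeout=10).text for u in urls]
|       texts = [re.sub(r"<[^>]+>", " ", p) for p in pages]  # crude html -> text
|       chunks = [t[i:i + 500] for t in texts for i in range(0, len(t), 500)]
|       # rank chunks against the question with a local embedding model
|       emb = SentenceTransformer("all-MiniLM-L6-v2")
|       scores = util.cos_sim(emb.encode([question]), emb.encode(chunks))[0]
|       ranked = [c for _, c in sorted(zip(scores.tolist(), chunks), reverse=True)]
|       ctx, used = [], 0
|       for c in ranked:  # respect the user-configured character budget
|           if used + len(c) > char_limit:
|               break
|           ctx.append(c)
|           used += len(c)
|       return ask_llm("Answer from this context only:\n" + "\n---\n".join(ctx)
|                      + f"\n\nQuestion: {question}")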
|
| It's incredibly interesting how many models fail this
| simple test; there have been multiple Google releases in
| the last year that just couldn't handle it.
|
| - Some of it is the basic "too small to be coherent"
| failure; bigcos don't make that mistake, though.
|
| - There's another critical threshold above which the model
| stops wandering off into the traditional LLM task of
| completing rather than answering. What I mean is,
| throwing in 6 pages' worth of retrieved web pages will
| cause some models to just start rambling as if they're
| writing more web pages, i.e. they're not able to "identify
| the context" of the web page snippets, and they ignore the
| instructions.
| groby_b wrote:
| There's just been a twitter post by Omar Khattab
| (@lateinteraction) on encoding documents into a scoring
| function instead of a simple vector for the work on
| ColBERT - and maybe at some point using a DNN as scoring
| function.
|
| So, yes, maybe there's a way to "distribute" RAG. (I
| still wonder if that isn't just MoE taken to its logical
| conclusion)
|
| So, dig for ColBERT papers, might be helpful. (I wish I
| had the time to do that)
| ankit219 wrote:
| Short answer: Yes, there are ways it can be done.
| Multiple. Needs to be custom built though, given no one
| has explored it deeply yet.
|
| One simple way is what Omar Khattab (ColBERT) mentioned
| about using a scoring function instead of a simple vector.
|
| Another is to use a classifier at the start directing
| queries to the right model. You will have to train the
| classifier though. (I mean a language model kind of does
| this implicitly, you are just taking more control by
| making it explicit.)
|
| Another is how you index your docs. Today, most RAG
| approaches do not encode enough information. If you have
| defined domains/models already, you can encode the same
| in metadata for your docs at the time of indexing, and
| you pick the model based on the metadata.
|
| These approaches would work pretty well, given that a model
| as small as 100M parameters can regurgitate what is in your
| docs, and is faster than larger models.
|
| Benefit-wise, I don't see a lot of benefit beyond preserving
| privacy and gaining more control.
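|
| A rough sketch of the classifier-routing flavor of this, with
| hypothetical domain names and per-domain models (the zero-shot
| classifier here is just a stand-in for a router you would
| actually train):
|
|   # Sketch: route each query to a per-domain small model. Domains and
|   # the per-domain models are hypothetical; the zero-shot classifier
|   # stands in for a trained router.
|   from transformers import pipeline
|
|   DOMAINS = ["legal", "engineering", "hr"]
|   router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
|
|   # one small model (with its own index/metadata) per domain
|   domain_models = {d: (lambda q, d=d: f"[{d} model answers: {q}]") for d in DOMAINS}
|
|   def answer(query: str) -> str:
|       best = router(query, candidate_labels=DOMAINS)["labels"][0]
|       return domain_models[best](query)
|
|   print(answer("What does the parental leave policy say?"))  # routed to "hr"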
| daxfohl wrote:
| I was originally thinking about it as something like a Bazel plugin
| for large codebases. Each module would have its own LLM
| context, and it might make it easier to put whole modules
| into the context, plus summaries of the dependencies.
| That could work better than a single huge context
| attempting to summarize the whole monorepo.
|
| The general idea is probably better for the code use case
| too, since having the module's whole codebase in context
| likely allows for more precise edits, whereas RAG is just
| search, not edit.
|
| That said, code assistants probably do something like this
| already, though it must be more ad hoc. Obviously they
| wouldn't be able to do any completions if they didn't have
| detailed context of the adjacent code.
| raegis wrote:
| > Unfortunately, I have only seen 3 models, 3B or over,
| handle RAG.
|
| What's the unit "B" in "3B"? I can search for acronyms like
| "RAG" just fine, but you experts aren't making it easy for
| us beginners :)
|
| Edit: Apologies, this is obvious. My brain needed a reboot
| for the new year.
| cauliflower2718 wrote:
| You can ask an LLM exactly this question and it will tell
| you.
|
| (The answer is billions of parameters)
| SketchySeaBeast wrote:
| But what if they want to know they are finding the
| correct answer?
| elliotto wrote:
| Asking anonymous people on a forum would be much better.
| SketchySeaBeast wrote:
| At least a forum with domain-specific knowledge.
| gpm wrote:
| And people to go "no, that's wrong" if someone posts
| something that's wrong.
| jasonjmcghee wrote:
| tbf, the gp comment said 125MB and then 3B, which would
| be pretty confusing, as it's a typo and should be 125M.
| jwineinger wrote:
| The number of parameters in the model, in billions
| jmward01 wrote:
| (B)illion. It indicates the rough number of parameters in
| the model. Higher is generally more capable. 1B models
| are currently at the top end of 'easy' to deal with for
| playing around with fine-tuning and the like in most home
| lab setups.
| a1o wrote:
| What is tiny and what is big?
|
| Can I have a model that is like 100MB in weights and run
| with llama.cpp in my MacBook M2?
| refulgentis wrote:
| Yeah, absolutely -- you'll probably pull 100+ tokens/s.
|
| Here's a good range of model sizes that run just fine
| with llama.cpp on mac:
| https://huggingface.co/telosnex/fllama/tree/main.
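|
| If you'd rather script it than use an app, here's a minimal
| sketch with the llama-cpp-python bindings (the .gguf path is a
| placeholder for whichever small quantized model you grab):
|
|   # Minimal local run via the llama-cpp-python bindings; the model
|   # path is a placeholder for whatever GGUF you download.
|   from llama_cpp import Llama
|
|   llm = Llama(model_path="models/small-model-q4_0.gguf", n_ctx=2048)
|   out = llm("Once upon a time there was a little robot who",
|             max_tokens=128, temperature=0.8)
|   print(out["choices"][0]["text"])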
|
| I recommend trying the Telosnex* app: it uses llama.cpp
| and abstracts over LLMs so you can e.g. switch between
| local models and servers at will.
|
| The important part for you is it's free, accelerated on
| macOS, and very easy to use local LLMs with (Settings >
| AI > LLM > On Device, tap Get).
|
| Prepare to be slightly underwhelmed: it's only when you
| start hitting 3B that it's coherent; anything under that
| will feel more like a Markov chain than an LLM.
|
| Depending on how geeked out you'll be to have it running
| locally, you might have fun with the fact that Telosnex
| can run local models on _every_ platform, i.e. you can run
| local models on iOS/Android/web too.
|
| * because it's mine :3 It is only quietly released at the
| moment; I want to get one more major update out before
| widely announcing it in Jan 2025.
| qskousen wrote:
| Sorry to sidetrack, but a question about Telosnex: would
| you consider a Linux release with something other than
| Snap? Maybe Flatpak or AppImage?
| refulgentis wrote:
| If it's a (mostly) CI-able process, I'm totally open to it.
| ---
|
| I looked into "What should I do besides Snap?" about 4
| months ago; got quickly overwhelmed, because I don't have
| enough knowledge to understand what's fringe vs. common.
|
| I'll definitely take a look at Flatpak again in the next
| month; a 30-second Google says it's possible (h/t
| u/damiano-ferrari at
| https://www.reddit.com/r/FlutterDev/comments/z35gdo/can_you_...)
|
| (thanks for your interest btw, been working on this for
| ~a year and this is my first outside feature request :)
| may there be many more)
| jki275 wrote:
| LM Studio on Mac is your friend. You can choose any model
| you want, run a server for other tools, or chat directly
| with the model. It can use either MLX or just plain
| llama.cpp.
| wolfgangK wrote:
| > Unfortunately, I have only seen 3 models, 3B or over,
| handle RAG.
|
| I would love to know which these 3 models are, especially
| if they can perform grounded RAG. If you have models (and
| their grounded RAG prompt formats) to share, I'm very
| interested!
|
| Thx.
| attentionmech wrote:
| wow, this RWKV thing blew my mind. Thank you for sharing this!
| SGML_ROCKSTAR wrote:
| It might still be of introductory help to someone who has yet
| to formally learn what a language model is, what large language
| models are, and where things might be in the future.
| cjohnson318 wrote:
| You can have small languages, sure, but then you run into awkward
| extended clarifying clauses. The thing that makes languages
| difficult is that almost all vocabulary is sparse. The "Top N"
| words in a language are always pronouns, prepositions, articles,
| and the conjugations of the top 12 or so verbs: to be, to have,
| to do, to go, to come, to say, to give, etc. This is the reason
| that "Top N Words of Language X" and "Learn the Top 50% of Words
| in Language X" listicles/videos are always disappointing.
| nine_k wrote:
| But they seem to use much wider grammars, because their
| (synthetic) dataset is a bunch of coherent stories at the level
| of 3-4 y.o. children.
|
| I would consider the "Simple English Wikipedia" the next
| training set / benchmark.
| Pikamander2 wrote:
| There's an oddly relevant skit of this concept in the American
| version of The Office:
| https://www.youtube.com/watch?v=_K-L9uhsBLM
| momojo wrote:
| > We hope that TinyStories can facilitate the development,
| analysis and research of LMs, especially for low-resource or
| specialized domains, and shed light on the emergence of language
| capabilities in LMs.
|
| This part interests me the most. I want to know how small yet
| functional we can get these models. I don't want an AI that can
| solve calculus, I just want a dumb AI that pretty consistently
| recognizes "lights off" and "lights on".
| MobiusHorizons wrote:
| why would you use an LLM for that? Seems like there are much
| better options available.
| londons_explore wrote:
| It's actually pretty hard to design a non-LLM system that can
| detect all the possible variations:
|
| Lights on. Brighter please. Turn on the light. Is there light
| in here? Turn the light on. Table lamp: on. Does the desk
| lamp work? It's a bit dim here, anything you can do? More
| light please. Put the lights on for the next 5 mins. Turn the
| light on when I come home. Turn all the lights off together.
| Switch the lights off whenever it's daytime or quiet at home
| unless I say otherwise. etc.
|
| If you don't support every possible way of saying a command,
| then users will get frustrated because they effectively have
| to go and learn the magic incantation of words for every
| possible action, which is very user-unfriendly.
| anon373839 wrote:
| I suspect ModernBERT can also be very helpful with these
| sorts of tasks, if you decompose them into an intent
| classification step and a named entity recognition step.
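|
| Something like this sketch, assuming you have checkpoints
| fine-tuned for each step (both model names below are
| placeholders, not real ModernBERT releases):
|
|   # Two-step decomposition sketch; both model names are placeholders
|   # for checkpoints you'd fine-tune on your own intents/entities.
|   from transformers import pipeline
|
|   intent = pipeline("text-classification", model="your-org/modernbert-intents")
|   entities = pipeline("token-classification", model="your-org/modernbert-ner",
|                       aggregation_strategy="simple")
|
|   cmd = "turn the desk lamp on for five minutes"
|   print(intent(cmd))    # e.g. [{'label': 'lights_on', 'score': 0.97}]
|   print(entities(cmd))  # e.g. [{'entity_group': 'DEVICE', 'word': 'desk lamp', ...}]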
| simcop2387 wrote:
| That entity extraction is where it actually gets really,
| really difficult, even for LLMs, since people will use 10
| different names for the same thing and you'll have to
| know them ahead of time to handle them all properly. For
| either BERT-based or LLM-based systems there's a bit of a
| need for the system to try to correct and learn those new
| names, unless you require users to put them all in ahead
| of time. That said, I've seen LLMs handle this a lot
| better with a list of aliases in the prompt for each room
| and each type of device, when playing with Home Assistant
| + an LLM.
| phkahler wrote:
| Your examples include complex instructions and questions,
| but for simple ON/OFF commands you can go far by pulling
| out keywords and ignoring sentence structure. For example,
| picking out "on", "off", and "light" will work for "turn
| the light on", "turn off the light", "light on", "I want
| the light on", etc. Adding modifiers like "kitchen" or
| "all" can help specify which lights (your "Table lamp: on"
| example), regardless of how they're used. I'm not saying
| this is a great solution, but it covers pretty much all
| the basic variations for simple commands and can run on
| anything.
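|
| A toy version of that keyword approach (device and room names
| are just examples):
|
|   # Toy keyword-based parser for simple on/off commands; room names
|   # are just examples, and sentence structure is deliberately ignored.
|   def parse_command(text: str):
|       words = set(text.lower().replace(".", " ").replace(",", " ").split())
|       if not words & {"light", "lights", "lamp"}:
|           return None
|       state = "on" if "on" in words else "off" if "off" in words else None
|       if state is None:
|           return None
|       rooms = {"kitchen", "bedroom", "desk", "table", "all"}
|       target = next((w for w in words if w in rooms), "all")
|       return {"device": "light", "target": target, "state": state}
|
|   print(parse_command("turn the kitchen light on"))
|   # {'device': 'light', 'target': 'kitchen', 'state': 'on'}
|   print(parse_command("I want the light off"))
|   # {'device': 'light', 'target': 'all', 'state': 'off'}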
| fl0id wrote:
| They also describe a new benchmark / evaluation (telling GPT-4 to
| grade the output as if it were grading student essays), but tbh
| is there any evidence that this even works? We know it cannot
| really do this, and the grading model will not even stay
| consistent across updates.
| fi-le wrote:
| We're doing a successor to this, working hard and going public in
| a month or so, hopefully. But HN gets a preview of course:
| https://huggingface.co/datasets/lennart-finke/SimpleStories
|
| And here's a more interactive explorer:
| https://fi-le.net/simplestories
| bigmattystyles wrote:
| I've been curious about the opposite: a lot of times I'll put in a
| few keywords that get to the point of what I want - incoherent
| English in - and yet the output is often on point.
| Suppafly wrote:
| I know natural language is sorta the gold standard for a lot of
| these models, but honestly I could see a lot of utility in a
| stripped-down language set, similar to how you used to be
| able to search Google back in the day before they tried to make
| it easier.
| niemandhier wrote:
| This question is also quite possibly the most promising way to
| get an upper bound on the Kolmogorov complexity of human
| language.
| lenerdenator wrote:
| "Coherent" seems relatively subjective, no?
|
| Could you get an LLM to generate "coherent" conversational
| Geordie English? Probably, but my Midwestern ear isn't going to
| be able to understand what they're saying.
| osaariki wrote:
| For some interesting context: this paper was a precursor to all
| the work on synthetic data at Microsoft Research that led to the
| Phi series of SLMs. [1] It was an important demonstration of what
| carefully curated and clean data could do for language models.
|
| 1: https://arxiv.org/abs/2412.08905
| HarHarVeryFunny wrote:
| I'd guess that the ability of a very small model to do well on
| the TinyStories dataset isn't just because of the limited 3-4yr
| old vocabulary, but also because of it being an LLM-generated
| dataset.
|
| LLM-generated content (synthetic data) is easier than
| human-generated text for an LLM to learn, because it was
| auto-regressively generated and therefore should be possible to
| auto-regressively predict.
|
| It's surprising that LLMs do as well as they do attempting to
| predict human-generated training samples, where there is no
| guarantee that the predictive signal is actually contained in the
| sample (it may just be something in the mind of the human who
| generated it).
|
| I've got to wonder what the impact on generation is of an LLM
| trained only on synthetic LLM-generated data. I'd guess it
| wouldn't be as robust as one that had learned to handle more
| uncertainty.
| raymv wrote:
| Trained a GPT-2 like model on the dataset a while back, here's
| the source code and some results for anyone interested:
|
| https://github.com/raymond-van/gpt-tinystories
| mclau156 wrote:
| Side note, but is it really that crazy for GitHub to implement a
| feature to see the file size of a repo?
| ankit219 wrote:
| Great to see this here. We used the TinyStories dataset to
| train small models (as small as 20M params) and test out
| knowledge addition, and published a paper based on it. We
| could get coherent outputs at sizes as low as 20M-25M params
| (not as good as large LLMs, but still decent enough).
|
| [1]: Blog + Paper: https://medium.com/@ankit_94177/expanding-
| knowledge-in-large... (Paper is titled: Cross-Domain Content
| Generation with Domain-Specific Small Language Models)
| lutusp wrote:
| Decades ago, prior to the existence of personal computers, when a
| "computer" was a glassed-in room staffed by lab-coat-wearing
| technicians (picture John von Neumann standing next to the first
| stored-program computer:
| https://www.theguardian.com/technology/2012/feb/26/first-com...),
| someone reduced an entire printed book (or more than one) to a
| word-token decision tree, at great cost and effort, just to see
| what would happen.
|
| I can't find the original paper, but with an appropriate amount
| of pseudorandomness to avoid dead ends, this primitive algorithm
| would generate the occasional sentence that almost made sense and
| that bore little resemblance to the original data.
|
| Because of the state of computer technology it was a massive
| effort and a source of general astonishment. I suspect we're now
| recreating that minimal environment, this time with better ways
| to curate the data for small size and maximum drama.
|
| Let's remember that a modern GPT isn't far removed from that
| scheme -- not really.
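|
| A toy reconstruction of that kind of word-token scheme, i.e. a
| first-order word chain with a bit of randomness (a guess at the
| general idea, not the original algorithm):
|
|   # Toy word-token "decision tree": record which words follow each word
|   # in a text, then walk the table with randomness (a first-order Markov
|   # chain). Only a guess at the general idea, not the original method.
|   import random
|   from collections import defaultdict
|
|   text = "the cat sat on the mat and the dog sat on the rug".split()
|   successors = defaultdict(list)
|   for a, b in zip(text, text[1:]):
|       successors[a].append(b)
|
|   word, out = random.choice(text), []
|   for _ in range(12):
|       out.append(word)
|       nxt = successors.get(word)
|       word = random.choice(nxt) if nxt else random.choice(text)  # restart at dead ends
|   print(" ".join(out))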
| Animats wrote:
| (2023), as someone mentioned.
|
| It's encouraging to see how much can be done with tiny models.
|
| Still need to crack "I don't know" recognition, so you can start
| with a tiny model and then pass the buck to a bigger model for
| hard questions. That will enormously reduce the cost of "AI"
| customer support.
___________________________________________________________________
(page generated 2025-01-02 23:00 UTC)