[HN Gopher] TinyStories: How Small Can Language Models Be and St...
       ___________________________________________________________________
        
       TinyStories: How Small Can Language Models Be and Still Speak
       Coherent English? (2023)
        
       Author : tzury
       Score  : 111 points
       Date   : 2025-01-02 17:54 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jasonjmcghee wrote:
       | Edit: ah. This paper is from May 2023. Might be worth putting
       | that in the title.
       | 
       | ---
       | 
       | > Models with around 125M parameters such as GPT-Neo (small) or
       | GPT-2 (small) can rarely generate coherent and consistent English
       | text beyond a few words
       | 
       | These models are 5 years old.
       | 
       | I have to wonder if the authors have seen RWKV 7 0.1B, because it
       | blows away just about every other model I've seen at that size.
       | 
       | The capabilities it has vs the examples in the paper are night
       | and day.
       | 
       | https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1
        
         | jmward01 wrote:
         | The age of tiny models is just about here. We are finally
         | busting away from the standard transformer block and training.
         | I have a side project that can dramatically shrink networks by
         | a set of techniques I call sacrificial training[1]. I think
         | things like this will finally swing things back to on prem and
         | on device small models that are as capable as the big hosted
         | models are now.
         | 
         | [1] https://github.com/jmward01/lmplay/wiki/Sacrificial-
         | Training
        
           | refulgentis wrote:
           | RWKV is def. better than TinyStories 125MB.
           | 
           | Unfortunately, I have only seen 3 models, 3B or over, handle
           | RAG.
           | 
           | Tested RWKV with a simple in-the-sports-news question and it
           | didn't even get close to approaching the question. And nearly
           | everything was fundamentally incoherent even in its internal
           | reality (ex. Player gets 5000/game and is the first with 1000
           | in 16 games)
           | 
           | (prompt: https://pastebin.com/sCLn5sCJ, response:
           | https://pastebin.com/TqudvDbN)
           | 
           | I don't think there's a position for LLMs that are "just"
           | writers on the market in 2025.
        
             | jasonjmcghee wrote:
             | These tiny models in general have really weird failure
             | modes. I tried the tiny stories prompt about asking mom for
             | a dog who said no, and it output an incredibly dark story
             | about how she asked her dad and they got a dog but it had
             | pancreatic cancer (paraphrasing, it went into detail about
             | the surgery etc.) and then starting writing an
             | informational PSA about who is at risk of pancreatic cancer
             | etc.
        
               | kube-system wrote:
               | What I find fascinating is how ML models hallucinate in a
               | way that is sometimes reminiscent of a fever dream.
        
               | ethbr1 wrote:
               | It makes sense that the failure modes of language
               | prediction look a lot like ADD.
        
               | nerdponx wrote:
               | Lest we forget that this stream-of-consciousness
               | confusion was state of the art just a few years ago.
               | 
               | It makes sense if you think about it: a small model's
               | "internal state" isn't rich enough to keep track of
               | whatever it was supposed to be talking about.
               | 
               | It makes me think that the reason LLMs need to be so
               | large is that the internal state needs to be bigger than
               | a typical human "idea", whatever that might mean.
        
             | jmward01 wrote:
             | I plan on checking out RWKV and seeing if I can add my
             | sacrifical training techniques to it this weekend. There is
             | a reason quantization works, it is because models are very
             | badly trained right now. I think we can get really good
             | performance on .1b and 1b models which opens up the world
             | to fine-tuning again. I was playing with fine-tuning llama
             | 7b and 13b a while back but the HW/SW stack made it so
             | unwieldy and the ROI was terrible compared to just
             | adjusting prompts on gpt-4o-mini and the like. I have hope
             | that we are about to see single GPU, very simple, fine-
             | tuning again as models shrink and GPUs grow.
        
             | daxfohl wrote:
             | Would there be any way to distribute RAG across multiple
             | smaller models? Rather than one giant model handling your
             | entire document base, have it be more of a tree where the
             | top level classifies the docs into top-level categories and
             | sends it to submodels to subclassify, etc? (Doesn't have to
             | be 1:1 classification). And same for q/a search?
             | 
             | These could all presumably be the same physical instance,
             | just each query would use a different system prompt and
             | perhaps different embeddings. (I'm guessing; I don't
             | actually know how RAG works). So, a little slower and
             | clunkier, but presumably way more efficient. And match
             | could be anywhere between horrible to better-than-one-
             | large-model. This would be more like how businesses
             | organize docs.
             | 
             | Or maybe there's no real benefit to this, and each
             | subclassifier would require just as big of a model as if
             | you were to throw all docs into a single model anyway. I
             | assume it's probably been tried before.
        
               | refulgentis wrote:
               | TL;DR: It's a very interesting line of thought that as
               | late as Q2 2024, there were a couple thought leaders who
               | pushed the idea we'd have, like 16 specialized local
               | models.
               | 
               | I could see that in the very long term, but as it stands,
               | it works the way you intuited: 2 turkeys don't make an
               | eagle, i.e. there's some critical size where its speaking
               | coherently, and its at least an OOM bigger than it needs
               | to be in order to be interesting for products
               | 
               | fwiw RAG for me in this case is: - user asks q.
               | 
               | - llm generates search queries.
               | 
               | - search api returns urls.
               | 
               | - web view downloads urls.
               | 
               | - app turns html to text.
               | 
               | - local embedding model turns text into chunks.
               | 
               | - app decides, based on "character" limit configured by
               | user, how many chunks to send.
               | 
               | - LLM gets all the chunks, instructions + original
               | question, and answers.
               | 
               | It's incredibly interesting how many models fail this
               | simple test, there's been multiple Google releases in the
               | last year that just couldn't handle it.
               | 
               | - Some of it is basic too small to be coherent, bigcos
               | don't make that mistake though.
               | 
               | - There's another critical threshold where the model
               | doesn't wander off doing the traditional LLM task of
               | completing rather than answering. What I mean is,
               | throwing in 6 pages worth of retrieved webpages will
               | cause some models to just start rambling like its writing
               | more web pages, i.e. they're not able to "identify the
               | context" of the web page snippets, and they ignore the
               | instructions.
        
               | groby_b wrote:
               | There's just been a twitter post by Omar Khattab
               | (@lateinteraction) on encoding documents into a scoring
               | function instead of a simple vector for the work on
               | ColBERT - and maybe at some point using a DNN as scoring
               | function.
               | 
               | So, yes, maybe there's a way to "distribute" RAG. (I
               | still wonder if that isn't just MoE taken to its logical
               | conclusion)
               | 
               | So, dig for ColBERT papers, might be helpful. (I wish I
               | had the time to do that)
        
               | ankit219 wrote:
               | Short answer: Yes, there are ways it can be done.
               | Multiple. Needs to be custom built though, given no one
               | has explored it deeply yet.
               | 
               | One simple way is what Omar Khattab (ColBert) mentioned
               | about scoring function instead of a simple vector.
               | 
               | Another is to use a classifier at the start directing
               | queries to the right model. You will have to train the
               | classifier though. (I mean a language model kind of does
               | this implicitly, you are just taking more control by
               | making it explicit.)
               | 
               | Another is how you index your docs. Today, most RAG
               | approaches do not encode enough information. If you have
               | defined domains/models already, you can encode the same
               | in metadata for your docs at the time of indexing, and
               | you pick the model based on the metadata.
               | 
               | These approaches would work pretty well, given a model as
               | small as 100M size can regurgitate what is in your docs.
               | And is faster compared to your larger models.
               | 
               | Benefit wise, I don't see a lot of benefit except
               | preserving privacy and gaining more control.
        
               | daxfohl wrote:
               | I was originally thinking about it as like a bazel plugin
               | for large codebases. Each module would have its own LLM
               | context, and it might make it easier to put whole modules
               | into the context, plus summaries of the dependencies.
               | That could work better than a single huge context
               | attempting to summarize the whole monorepo.
               | 
               | The general idea is probably be better for the code use
               | case too, since having the module's whole codebase in
               | context likely allows for more precise edits. Whereas RAG
               | is just search, not edit.
               | 
               | That said, probably code assistants must somewhat do this
               | already, though it must be more ad-hoc. Obviously they
               | wouldn't be able to do any completions if they don't have
               | detailed context of the adjacent code.
        
             | raegis wrote:
             | > Unfortunately, I have only seen 3 models, 3B or over,
             | handle RAG.
             | 
             | What's the unit "B" in "3B"? I can search for acronyms like
             | "RAG" just fine, but you experts aren't making it easy for
             | us beginners :)
             | 
             | Edit: Apologies, this is obvious. My brain needed a reboot
             | for the new year.
        
               | cauliflower2718 wrote:
               | You can ask an LLM exactly this question and it will tell
               | you.
               | 
               | (The answer is billions of parameters)
        
               | SketchySeaBeast wrote:
               | But what if they want to know they are finding the
               | correct answer?
        
               | elliotto wrote:
               | Asking anonymous people on a forum would be much better.
        
               | SketchySeaBeast wrote:
               | At least a forum with domain-specific knowledge.
        
               | gpm wrote:
               | And people to go "no, that's wrong" if someone posts
               | something that's wrong.
        
               | jasonjmcghee wrote:
               | tbf, the gp comment said 125MB and then 3B, which would
               | be pretty confusing, as it's a typo and should be 125M.
        
               | jwineinger wrote:
               | The number of parameters the model is trained on, in
               | billions
        
               | jmward01 wrote:
               | (B)illion. It indicates the rough number of parameters in
               | the model. Higher is generally more capable. 1B models
               | are currently at the top end of 'easy' to deal with for
               | playing around fine tuning and the like for most home lab
               | setups.
        
             | a1o wrote:
             | What is tiny and what is big?
             | 
             | Can I have a model that is like 100MB in weights and run
             | with llama.cpp in my MacBook M2?
        
               | refulgentis wrote:
               | Yeah, absolutely -- you'll probably pull 100+ token/s.
               | 
               | Here's a good range of model sizes that run just fine
               | with llama.cpp on mac:
               | https://huggingface.co/telosnex/fllama/tree/main.
               | 
               | I recommend trying the Telosnex* app, it uses llama.cpp
               | and abstracts over LLMs so you can i.e. switch between
               | local/servers at will.
               | 
               | The important part for you is its free, accelerated on
               | macOS, and very easy to use local LLMs with (Settings >
               | AI > LLM > On Device, tap Get)
               | 
               | Prepare to be underwhelmed, slightly: its only when you
               | start hitting 3B that its coherent, anything under that
               | will feel more like a markov chain than an LLM.
               | 
               | Depending on how geeked out you'll be to have it running
               | locally, you might have fun with that Telosnex can run
               | local models on _every_ platform, i.e. you can run local
               | models on iOS /Android/web too.
               | 
               | * because it's mine :3 It is quietly released currently.
               | I want to get one more major update before widely
               | announcing it in Jan 2025
        
               | qskousen wrote:
               | Sorry to side track, but question about Telosnex - would
               | you consider a Linux release with something other than
               | Snap? Maybe Flatpak or appimage?
        
               | refulgentis wrote:
               | If its a (mostly) CI-able process, I'm totally open to it
               | ---
               | 
               | I looked into "What should I do besides Snap?" about 4
               | months ago; got quickly overwhelmed, because I don't have
               | enough knowledge to understand what's fringe vs. common.
               | 
               | I'll definitely take a look at Flatpak again in the next
               | month, 30 second Google says its possible (h/t /u/
               | damiano-ferrari at https://www.reddit.com/r/FlutterDev/co
               | mments/z35gdo/can_you_...)
               | 
               | (thanks for your interest btw, been working on this for
               | ~year and this is my first outside feature request :) may
               | there be many more)
        
               | jki275 wrote:
               | LM Studio on Mac is your friend. You can choose any model
               | you want, run a server for other tools, or chat direct
               | with the model. It can use either MLX or just plain
               | llama.cpp.
        
             | wolfgangK wrote:
             | <<Unfortunately, I have only seen 3 models, 3B or over,
             | handle RAG.>>
             | 
             | I would love to know which are these 3 models, especially
             | if they can perform grounded RAG. If you have models (and
             | their grounded RAG prompt formats) to share, I'm very
             | interested !
             | 
             | Thx.
        
         | attentionmech wrote:
         | wow, this RWKV thing blew my mind. Thank you for sharing this!
        
         | SGML_ROCKSTAR wrote:
         | It might still be of introductory help to someone who has yet
         | to formally learn what a language model is, what large language
         | models are, and where things might be in the future.
        
       | cjohnson318 wrote:
       | You can have small languages, sure, but then you run into awkward
       | extended clarifying clauses. The thing that makes languages
       | difficult is that almost all vocabulary is sparse. The "Top N"
       | words in a language are always pronouns, prepositions, articles,
       | and the conjugations of the top 12 or so verbs: to be, to have,
       | to do, to go, to come, to say, to give, etc. This is the reason
       | that "Top N Words of Language X" and "Learn the Top 50% of Words
       | in Language X" listicles/videos are always disappointing.
        
         | nine_k wrote:
         | But they seem to use much wider grammars, because their
         | (synthetic) dataset is a bunch of coherent stories at the level
         | of 3-4 y.o. children.
         | 
         | I would consider the "Simple English Wikipedia" the next
         | training set / benchmark.
        
         | Pikamander2 wrote:
         | There's an oddly relevant skit of this concept in the American
         | version of The Office:
         | https://www.youtube.com/watch?v=_K-L9uhsBLM
        
       | momojo wrote:
       | > We hope that TinyStories can facilitate the development,
       | analysis and research of LMs, especially for low-resource or
       | specialized domains, and shed light on the emergence of language
       | capabilities in LMs.
       | 
       | This part interests me the most. I want to know how small yet
       | functional we can get these models. I don't want an AI that can
       | solve calculus, I just want a dumb AI that pretty consistently
       | recognizes "lights off" and "lights on".
        
         | MobiusHorizons wrote:
         | why would you use an LLM for that? Seems like there are much
         | better options available.
        
           | londons_explore wrote:
           | It's actually pretty hard to design a non-llm system that can
           | detect all the possible variations:
           | 
           | Lights on. Brighter please. Turn on the light. Is there light
           | in here? Turn the light on. Table lamp: on. Does the desk
           | lamp work? It's a bit dim here, anything you can do? More
           | light please. Put the lights on for the next 5 mins. Turn the
           | light on when I come home. Turn all the lights off together.
           | Switch the lights off whenever its daytime or quiet at home
           | unless I say otherwise. etc.
           | 
           | If you don't support every possible way of saying a command,
           | then users will get frustrated because they effectively have
           | to go and learn the magic incantation of words for every
           | possible action, which is very user-unfriendly.
        
             | anon373839 wrote:
             | I suspect ModernBERT can also be very helpful with these
             | sorts of tasks, if you decompose them into an intent
             | classification step and a named entity recognition step.
        
               | simcop2387 wrote:
               | that entity extraction is where it actually gets really
               | really difficult, even for LLMs since people will use 10
               | different names for the same thing and you'll have to
               | know them ahead of time to handle them all properly. For
               | either BERT based or llm based there's a bit of a need
               | for the system to try to correct and learn those new
               | names unless you require users to put them all in ahead
               | of time. That said i've seen LLMs handle this a lot
               | better with a list of aliases in the prompt for each room
               | and then type of device when playing with home assistant
               | + llm.
        
             | phkahler wrote:
             | Your examples include complex instructions and questions,
             | but for simple ON/OFF commands you can go far by pulling
             | key words and ignoring sentence structure. For example,
             | pick out "on" "off" and "light" will work for "turn the
             | light on", "turn off the light", "light on", "I want the
             | light on", etc... Adding modifiers like "kitchen" or "all"
             | can help specify which lights (your "Table lamp: on"
             | example), regardless of how they're used. I'm not saying
             | this a great solution, but it covers pretty much all the
             | basic variations for simple commands and can run on
             | anything.
        
       | fl0id wrote:
       | They also describe a new benchmark / evaluation, but tbh is there
       | any evidence that this even works? (telling GPT-4 to check the
       | output as if it were checking student essays) We know it cannot
       | really do this, and the model used will not even stay consistent
       | if there are updates.
        
       | fi-le wrote:
       | We're doing a successor to this, working hard and going public in
       | month or so, hopefully. But HN gets a preview of course:
       | https://huggingface.co/datasets/lennart-finke/SimpleStories
       | 
       | And here's a more interactive explorer: https://fi-
       | le.net/simplestories
        
       | bigmattystyles wrote:
       | I've been curious about the opposite - a lot of times, I'll put a
       | few keywords that get to the point of what I want, but it's
       | incoherent English in - and yet, often the output is on point.
        
         | Suppafly wrote:
         | I know natural language is sorta the gold standard for a lot of
         | these models, but honestly I could see a lot of utility out of
         | a stripped down language set, similar to how you used to be
         | able to search google back in the day before they tried to make
         | it easier.
        
       | niemandhier wrote:
       | This question is also quite possible the most promising way to
       | get an upper bound on the Kolmogorov complexity of human
       | language.
        
       | lenerdenator wrote:
       | "Coherent" seems relatively subjective, no?
       | 
       | Could you get an LLM to generate "coherent" conversational
       | Geordie English? Probably, but my Midwestern ear isn't going to
       | be able to understand what they're saying.
        
       | osaariki wrote:
       | For some interesting context: this paper was a precursor to all
       | the work on synthetic data at Microsoft Research that lead to the
       | Phi series of SLMs. [1] It was an important demonstration of what
       | carefully curated and clean data could do for language models.
       | 
       | 1: https://arxiv.org/abs/2412.08905
        
       | HarHarVeryFunny wrote:
       | I'd guess that the ability of a very small model to do well on
       | the TinyStories dataset isn't just because of the limited 3-4yr
       | old vocabulary, but also because of it being an LLM-generated
       | dataset.
       | 
       | LLM-generated content (synthetic data) is easier than human
       | generated text for an LLM to learn because it was auto-
       | regressively generated, and therefore should be possible to auto-
       | regressively predict.
       | 
       | It's surprising that LLMs do as well as they do attempting to
       | predict human generated training samples where there is no
       | guarantee that the predictive signal is actually contained in the
       | sample (it may just be something in the mind of the human that
       | generated it).
       | 
       | I've got to wonder what the impact on generation is of an LLM
       | only trained on synthetic LLM-generated data? I'd guess it
       | wouldn't be as robust as one that had learned to handle more
       | uncertainty.
        
       | raymv wrote:
       | Trained a GPT-2 like model on the dataset a while back, here's
       | the source code and some results for anyone interested:
       | 
       | https://github.com/raymond-van/gpt-tinystories
        
       | mclau156 wrote:
       | Side note but is it really that crazy for Github to implement a
       | feature to see file size of a repo?
        
       | ankit219 wrote:
       | Great to see this here. We used this dataset from Tiny Stories to
       | train small models (as small as 20M params) and test out
       | knowledge addition. Published a paper based on this dataset. We
       | could get coherent outputs at sizes as low as 20M-25M. (though
       | not as great as LLMs, but still decent enough).
       | 
       | [1]: Blog + Paper: https://medium.com/@ankit_94177/expanding-
       | knowledge-in-large... (Paper is titled: Cross-Domain Content
       | Generation with Domain-Specific Small Language Models)
        
       | lutusp wrote:
       | Decades ago, prior to the existence of personal computers, when a
       | "computer" was a glassed-in room staffed by lab-coat-wearing
       | technicians (picture John Von Neumann standing next to the first
       | stored-program computer:
       | https://www.theguardian.com/technology/2012/feb/26/first-com...),
       | someone reduced an entire printed book (or more than one) to a
       | word-token decision tree, at great cost and effort, just to see
       | what would happen.
       | 
       | I can't find the original paper, but with an appropriate amount
       | of pseudorandomness to avoid dead ends, this primitive algorithm
       | would generate the occasional sentence that almost made sense and
       | that bore little resemblance to the original data.
       | 
       | Because of the state of computer technology it was a massive
       | effort and a source of general astonishment. I suspect we're now
       | recreating that minimal environment, this time with better ways
       | to curate the data for small size and maximum drama.
       | 
       | Let's remember that a modern GPT isn't far removed from that
       | scheme -- not really.
        
       | Animats wrote:
       | (2023), as someone mentioned.
       | 
       | It's encouraging to see how much can be done with tiny models.
       | 
       | Still need to crack "I don't know" recognition, so you can start
       | with a tiny model and then pass the buck to a bigger model for
       | hard questions. That will enormously reduce the cost of "AI"
       | customer support.
        
       ___________________________________________________________________
       (page generated 2025-01-02 23:00 UTC)