[HN Gopher] TinyStories: How Small Can Language Models Be and St...
       ___________________________________________________________________
        
       TinyStories: How Small Can Language Models Be and Still Speak
       Coherent English? (2023)
        
       Author : tzury
       Score  : 209 points
        Date   : 2025-01-02 17:54 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jasonjmcghee wrote:
       | Edit: ah. This paper is from May 2023. Might be worth putting
       | that in the title.
       | 
       | ---
       | 
       | > Models with around 125M parameters such as GPT-Neo (small) or
       | GPT-2 (small) can rarely generate coherent and consistent English
       | text beyond a few words
       | 
       | These models are 5 years old.
       | 
       | I have to wonder if the authors have seen RWKV 7 0.1B, because it
       | blows away just about every other model I've seen at that size.
       | 
       | The capabilities it has vs the examples in the paper are night
       | and day.
       | 
       | https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1
        
         | jmward01 wrote:
          | The age of tiny models is just about here. We are finally
          | busting away from the standard transformer block and training
          | setup. I have a side project that can dramatically shrink
          | networks with a set of techniques I call sacrificial
          | training[1]. I think things like this will finally swing
          | things back to on-prem and on-device small models that are as
          | capable as the big hosted models are now.
         | 
         | [1] https://github.com/jmward01/lmplay/wiki/Sacrificial-
         | Training
        
           | refulgentis wrote:
           | RWKV is def. better than TinyStories 125MB.
           | 
           | Unfortunately, I have only seen 3 models, 3B or over, handle
           | RAG.
           | 
            | Tested RWKV with a simple in-the-sports-news question and it
            | didn't even come close to addressing the question. Nearly
            | everything was fundamentally incoherent even within its own
            | internal reality (ex. a player gets 5000/game and is the
            | first with 1000 in 16 games).
           | 
           | (prompt: https://pastebin.com/sCLn5sCJ, response:
           | https://pastebin.com/TqudvDbN)
           | 
            | I don't think there's a place in the 2025 market for LLMs
            | that are "just" writers.
        
             | jasonjmcghee wrote:
             | These tiny models in general have really weird failure
             | modes. I tried the tiny stories prompt about asking mom for
             | a dog who said no, and it output an incredibly dark story
             | about how she asked her dad and they got a dog but it had
             | pancreatic cancer (paraphrasing, it went into detail about
              | the surgery etc.) and then started writing an
             | informational PSA about who is at risk of pancreatic cancer
             | etc.
        
               | kube-system wrote:
               | What I find fascinating is how ML models hallucinate in a
               | way that is sometimes reminiscent of a fever dream.
        
               | ethbr1 wrote:
               | It makes sense that the failure modes of language
               | prediction look a lot like ADD.
        
               | p0w3n3d wrote:
               | It's because they are precisely lacking attention
        
               | jdiff wrote:
               | Don't fall into the trap of applying human psychology to
               | LLMs. Bag-of-chemistry quirks do not translate to matrix-
               | multiplication quirks.
        
               | ethbr1 wrote:
               | Why not? In both cases the result is losing the thread of
               | thought.
        
               | jdiff wrote:
               | ...No, no that's not how ADHD works. It's difficult to
                | sum up how wrong this is concisely, but I invite you to
                | do some serious research into ADHD, how it functions, and
               | the great variety of ways in which it can present in
               | different people. It's quite a poor analogy.
        
               | ethbr1 wrote:
               | I'm aware that anything to do with the brain has a
               | variety of presentations.
               | 
               | Could you try to put a couple sentences down on how ADHD
               | is an inapt metaphor for failure modes in this case?
               | 
               | It's lazy to claim something is wrong without offering a
               | useful point as to how it's wrong. I trust in your
               | ability to summarize.
        
               | jdiff wrote:
               | For additional context/discussion, I feel this comment[0]
               | elsewhere in the thread put it well.
               | 
               | The reply to that comment also has some information I
               | feel is helpful to show the breakdown here. It mentions
               | that lack of attention presents in only 15-20% of cases.
               | This isn't ADHD, it is something new, the fundamental
               | underpinnings do not relate, and so the analogy/metaphor
               | does not facilitate a better understanding of the
               | situation.
               | 
               | On the contrary, it makes LLM "attention" out to be
               | something entirely different from what it actually is.
               | Without attention, models don't become easily distracted.
               | They are easily distracted regardless. Without attention,
               | LLMs _primarily_ fail to disambiguate between different
               | meanings of identical words, they fail to take context of
               | the sentence structure into account when assigning
               | meaning.
               | 
               | I hopefully don't have to dive into psychological and
               | chemical specifics of ADHD to have demonstrated that this
               | is fundamentally just not at all what ADHD is. Again,
               | there is no underlying harmony between this mechanism and
               | how ADHD affects human attention in 15-20% of cases, and
               | there is no analogy.
               | 
               | The only similarity is that they both use the word
               | "attention". If they'd used a different label, we
               | wouldn't even be having this conversation right now.
               | 
               | [0] https://news.ycombinator.com/item?id=42585600
        
               | marxisttemp wrote:
               | It's lazier to claim something is correct without
               | offering a useful point as to how it's correct. I trust
               | in your ability to theorize.
        
               | hobs wrote:
               | Because analogy can be useful in explaining things, or it
               | can be worse than useless - it ties our thinking up into
               | side quests that have nothing to do with the matter at
               | hand.
        
               | soulofmischief wrote:
               | ADHD is an actively-researched dopaminergic disorder with
               | a host of possible symptoms completely unrelated to
               | attention or hyperactivity.
               | 
               | It is ill-named and thus one often encounters comments
               | such as yours in the real world, which while not meant to
               | be negative, can be marginalizing to those with ADHD who
               | see their disorder as misunderstood and the term misused
               | much like people who say "I'm depressed" or "They're
               | acting schizo again".
               | 
               | LLMs do not have dopamine pathways and therefore we
               | should avoid comparing them to human-specific brain
               | disorders, or marginalizing ADHD folk by trivializing the
               | disorder or spreading misinformation about the
               | presentation of ADHD. LLM hallucination does not "look a
               | lot like ADD", that's such a vague and unsupported claim.
               | Furthermore, "lacking attention" doesn't even make sense
               | with respect to attention models. The "attention" in ADHD
               | and "attention" in transformers share a semantic basis
               | but are two very different phenomena.
        
               | robwwilliams wrote:
                | For a good overview of ADHD see
               | 
                | https://www.ncbi.nlm.nih.gov/books/NBK441838/
               | 
               | It is not "a dopaminergic disorder" any more than many
               | other neuropsychiatric disorders. Nothing much happens in
               | CNS without some level of modulation by dopaminergic
               | receptors, and to the best of my knowledge variants in
               | these receptors are not known to contribute strongly to
                | ADHD (I just confirmed by reviewing the GWAS Catalog:
               | ebi.ac.uk/gwas/efotraits/EFI_oo3888 ).
               | 
                | Furthermore, lack of attention is considered an important
               | facet of ADHD---common to about 15-20% of cases.
               | 
               | Humans tend to think in terms of metaphors. Similes and
               | metaphors are crucial in learning and thinking. And yes,
               | sometimes problematic.
               | 
               | Explaining what is wrong with a particular metaphor can
               | help.
        
               | taneq wrote:
               | A fever dream looks nothing like ADD. If anything it's
               | like a very mild mushroom trip. Did you base this on
               | anything or did it just sound good in your head?
        
               | ethbr1 wrote:
               | Your fever dreams and/or mushroom trips must be a lot
               | more narratively stable and consistent than mine...
        
               | nerdponx wrote:
               | Lest we forget that this stream-of-consciousness
               | confusion was state of the art just a few years ago.
               | 
               | It makes sense if you think about it: a small model's
               | "internal state" isn't rich enough to keep track of
               | whatever it was supposed to be talking about.
               | 
               | It makes me think that the reason LLMs need to be so
               | large is that the internal state needs to be bigger than
               | a typical human "idea", whatever that might mean.
        
               | acchow wrote:
               | The way we do LLMs now is that the program and the data
               | are one and the same. The program mutates itself as it
               | "executes". This is probably also how the brain works
               | since there is no hard separation between "memory"
               | neurons and "data processing" neurons. (biology has no
               | hard separation in general).
        
               | seattleeng wrote:
               | As is usually the case, check the data! A lot of the
               | dataset used has fairly morbid scenarios, so the model is
                | working as expected. All the data was synthetically
                | created with GPT-4.
        
             | jmward01 wrote:
              | I plan on checking out RWKV and seeing if I can add my
              | sacrificial training techniques to it this weekend. There
              | is a reason quantization works: models are very badly
              | trained right now. I think we can get really good
              | performance out of 0.1B and 1B models, which opens up the
              | world to fine-tuning again. I was playing with fine-tuning
              | llama 7b and 13b a while back, but the HW/SW stack made it
              | so unwieldy and the ROI was terrible compared to just
              | adjusting prompts on gpt-4o-mini and the like. I have hope
              | that we are about to see single GPU, very simple fine-
              | tuning again as models shrink and GPUs grow.
        
             | daxfohl wrote:
             | Would there be any way to distribute RAG across multiple
             | smaller models? Rather than one giant model handling your
             | entire document base, have it be more of a tree where the
             | top level classifies the docs into top-level categories and
             | sends it to submodels to subclassify, etc? (Doesn't have to
             | be 1:1 classification). And same for q/a search?
             | 
             | These could all presumably be the same physical instance,
             | just each query would use a different system prompt and
             | perhaps different embeddings. (I'm guessing; I don't
             | actually know how RAG works). So, a little slower and
              | clunkier, but presumably way more efficient. And the match
              | could be anywhere between horrible and better-than-one-
              | large-model. This would be more like how businesses
             | organize docs.
             | 
             | Or maybe there's no real benefit to this, and each
             | subclassifier would require just as big of a model as if
             | you were to throw all docs into a single model anyway. I
             | assume it's probably been tried before.
        
               | refulgentis wrote:
                | TL;DR: It's a very interesting line of thought; as late
                | as Q2 2024, there were a couple of thought leaders who
                | pushed the idea we'd have, like, 16 specialized local
                | models.
                | 
                | I could see that in the very long term, but as it stands,
                | it works the way you intuited: 2 turkeys don't make an
                | eagle, i.e. there's some critical size where it starts
                | speaking coherently, and it's at least an OOM bigger than
                | it needs to be in order to be interesting for products.
               | 
               | fwiw RAG for me in this case is: - user asks q.
               | 
               | - llm generates search queries.
               | 
               | - search api returns urls.
               | 
               | - web view downloads urls.
               | 
               | - app turns html to text.
               | 
               | - local embedding model turns text into chunks.
               | 
               | - app decides, based on "character" limit configured by
               | user, how many chunks to send.
               | 
               | - LLM gets all the chunks, instructions + original
               | question, and answers.
               | 
               | It's incredibly interesting how many models fail this
               | simple test, there's been multiple Google releases in the
               | last year that just couldn't handle it.
               | 
                | - Some of it is basic too-small-to-be-coherent stuff;
                | bigcos don't make that mistake though.
               | 
               | - There's another critical threshold where the model
               | doesn't wander off doing the traditional LLM task of
               | completing rather than answering. What I mean is,
               | throwing in 6 pages worth of retrieved webpages will
                | cause some models to just start rambling like they're
                | writing more web pages, i.e. they're not able to "identify
                | the context" of the web page snippets, and they ignore the
               | instructions.
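                | 
                | For concreteness, a rough Python sketch of that pipeline
                | (every helper name here is hypothetical glue code, not a
                | real library API):
                | 
                |     def answer_with_rag(question, char_limit=8000):
                |         # LLM turns the user question into search queries
                |         queries = llm("Search queries for: " + question)
                |         # search api returns urls; download and strip html
                |         pages = [html_to_text(download(u))
                |                  for u in search_api(queries)]
                |         # local embedding model chunks and ranks the text
                |         chunks = rank_chunks(pages, question)
                |         # keep chunks until the character budget is spent
                |         context, used = [], 0
                |         for c in chunks:
                |             if used + len(c) > char_limit:
                |                 break
                |             context.append(c)
                |             used += len(c)
                |         # LLM answers from instructions + chunks + question
                |         return llm(INSTRUCTIONS + "\n".join(context)
                |                    + "\n" + question)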
        
               | groby_b wrote:
                | There was just a Twitter post by Omar Khattab
                | (@lateinteraction), about the ColBERT work, on encoding
                | documents into a scoring function instead of a simple
                | vector - and maybe at some point using a DNN as the
                | scoring function.
               | 
               | So, yes, maybe there's a way to "distribute" RAG. (I
               | still wonder if that isn't just MoE taken to its logical
               | conclusion)
               | 
               | So, dig for ColBERT papers, might be helpful. (I wish I
               | had the time to do that)
        
               | ankit219 wrote:
               | Short answer: Yes, there are ways it can be done.
               | Multiple. Needs to be custom built though, given no one
               | has explored it deeply yet.
               | 
                | One simple way is what Omar Khattab (ColBERT) mentioned
                | about a scoring function instead of a simple vector.
               | 
               | Another is to use a classifier at the start directing
               | queries to the right model. You will have to train the
               | classifier though. (I mean a language model kind of does
               | this implicitly, you are just taking more control by
               | making it explicit.)
               | 
               | Another is how you index your docs. Today, most RAG
               | approaches do not encode enough information. If you have
               | defined domains/models already, you can encode the same
               | in metadata for your docs at the time of indexing, and
               | you pick the model based on the metadata.
               | 
                | These approaches would work pretty well, given that a
                | model as small as 100M parameters can regurgitate what is
                | in your docs, and is faster than larger models.
               | 
               | Benefit wise, I don't see a lot of benefit except
               | preserving privacy and gaining more control.
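                | 
                | A minimal sketch of the classifier-routing variant (the
                | classifier, the per-domain models, and retrieve() are all
                | placeholders you would have to build or train yourself):
                | 
                |     DOMAIN_MODELS = {"finance": finance_lm,
                |                      "support": support_lm}
                | 
                |     def route_and_answer(query):
                |         domain = domain_classifier(query)  # e.g. "finance"
                |         model = DOMAIN_MODELS.get(domain, fallback_lm)
                |         # metadata filter: only docs indexed for that domain
                |         docs = retrieve(query, meta={"domain": domain})
                |         return model.generate(context=docs, question=query)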
        
               | daxfohl wrote:
               | I was originally thinking about it as like a bazel plugin
               | for large codebases. Each module would have its own LLM
               | context, and it might make it easier to put whole modules
               | into the context, plus summaries of the dependencies.
               | That could work better than a single huge context
               | attempting to summarize the whole monorepo.
               | 
                | The general idea is probably better for the code use
               | case too, since having the module's whole codebase in
               | context likely allows for more precise edits. Whereas RAG
               | is just search, not edit.
               | 
                | That said, code assistants probably do something like
                | this already, though it must be more ad-hoc. Obviously they
               | wouldn't be able to do any completions if they don't have
               | detailed context of the adjacent code.
        
               | Pamar wrote:
               | _Another is how you index your docs. Today, most RAG
               | approaches do not encode enough information...._
               | 
                | Could you please provide some more info (or maybe links)
                | about this?
        
               | antman wrote:
                | I think he might be saying: have metadata in your vector
                | retrieval that describes the domain of the retrieved
                | chunk, and use that to decide which model to use
                | downstream. Sounds like a very interesting improvement to
                | RAG.
        
             | raegis wrote:
             | > Unfortunately, I have only seen 3 models, 3B or over,
             | handle RAG.
             | 
             | What's the unit "B" in "3B"? I can search for acronyms like
             | "RAG" just fine, but you experts aren't making it easy for
             | us beginners :)
             | 
             | Edit: Apologies, this is obvious. My brain needed a reboot
             | for the new year.
        
               | cauliflower2718 wrote:
               | You can ask an LLM exactly this question and it will tell
               | you.
               | 
               | (The answer is billions of parameters)
        
               | SketchySeaBeast wrote:
               | But what if they want to know they are finding the
               | correct answer?
        
               | elliotto wrote:
               | Asking anonymous people on a forum would be much better.
        
               | SketchySeaBeast wrote:
               | At least a forum with domain-specific knowledge.
        
               | gpm wrote:
               | And people to go "no, that's wrong" if someone posts
               | something that's wrong.
        
               | greesil wrote:
               | No, any answer will do.
        
               | cauliflower2718 wrote:
               | I think basic definitions for LLMs are solidly within the
               | bounds of what we would expect e.g. chatgpt to be
               | competent at. The task (defining terms) is simple and the
               | specific content (basic LLM stuff) is easy to check by
               | anyone who works on the LLM.
               | 
               | I agree with the general sentiment that we should not
               | just blindly trust LLMs though.
        
               | jasonjmcghee wrote:
               | tbf, the gp comment said 125MB and then 3B, which would
               | be pretty confusing, as it's a typo and should be 125M.
        
               | jwineinger wrote:
                | The number of parameters in the model, in billions
        
               | jmward01 wrote:
               | (B)illion. It indicates the rough number of parameters in
               | the model. Higher is generally more capable. 1B models
               | are currently at the top end of 'easy' to deal with for
               | playing around fine tuning and the like for most home lab
               | setups.
        
             | a1o wrote:
             | What is tiny and what is big?
             | 
             | Can I have a model that is like 100MB in weights and run
             | with llama.cpp in my MacBook M2?
        
               | refulgentis wrote:
               | Yeah, absolutely -- you'll probably pull 100+ token/s.
               | 
               | Here's a good range of model sizes that run just fine
               | with llama.cpp on mac:
               | https://huggingface.co/telosnex/fllama/tree/main.
               | 
                | I recommend trying the Telosnex* app; it uses llama.cpp
                | and abstracts over LLMs so you can e.g. switch between
                | local models and servers at will.
               | 
                | The important part for you is it's free, accelerated on
               | macOS, and very easy to use local LLMs with (Settings >
               | AI > LLM > On Device, tap Get)
               | 
                | Prepare to be slightly underwhelmed: it's only when you
                | start hitting 3B that it's coherent; anything under that
                | will feel more like a Markov chain than an LLM.
               | 
                | Depending on how geeked out you'll be to have it running
                | locally, you might have fun with the fact that Telosnex
                | can run local models on _every_ platform, i.e. you can
                | run local models on iOS/Android/web too.
               | 
               | * because it's mine :3 It is quietly released currently.
               | I want to get one more major update before widely
               | announcing it in Jan 2025
        
               | qskousen wrote:
               | Sorry to side track, but question about Telosnex - would
               | you consider a Linux release with something other than
               | Snap? Maybe Flatpak or appimage?
        
               | refulgentis wrote:
                | If it's a (mostly) CI-able process, I'm totally open to it
               | ---
               | 
               | I looked into "What should I do besides Snap?" about 4
               | months ago; got quickly overwhelmed, because I don't have
               | enough knowledge to understand what's fringe vs. common.
               | 
               | I'll definitely take a look at Flatpak again in the next
                | month; 30 seconds of Google says it's possible (h/t /u/
               | damiano-ferrari at https://www.reddit.com/r/FlutterDev/co
               | mments/z35gdo/can_you_...)
               | 
               | (thanks for your interest btw, been working on this for
               | ~year and this is my first outside feature request :) may
               | there be many more)
        
               | a1o wrote:
                | I have no interest in that; I would like small models
                | that I can integrate and run offline in software that I
                | make myself, be it IDEs or games. CLion has a nice
                | predictive model for single-line C++ completion that is
                | about 400 MB.
        
               | refulgentis wrote:
               | Ah, totally possible, but wrapping llama.cpp will likely
               | take a week to spike out and a month to stabilize across
               | models.
               | 
               | The biggest problem for relying on it for local software
               | is there's just too much latency for ex. game use cases
               | currently. (among other UX bugaboos)
               | (https://news.ycombinator.com/item?id=42561095)
        
               | jki275 wrote:
               | LM Studio on Mac is your friend. You can choose any model
               | you want, run a server for other tools, or chat direct
               | with the model. It can use either MLX or just plain
               | llama.cpp.
        
             | wolfgangK wrote:
             | <<Unfortunately, I have only seen 3 models, 3B or over,
             | handle RAG.>>
             | 
             | I would love to know which are these 3 models, especially
             | if they can perform grounded RAG. If you have models (and
             | their grounded RAG prompt formats) to share, I'm very
              | interested!
             | 
             | Thx.
        
           | mr_toad wrote:
           | > small models that are as capable as the big hosted model
           | 
           | Free lunch?
        
         | attentionmech wrote:
         | wow, this RWKV thing blew my mind. Thank you for sharing this!
        
         | SGML_ROCKSTAR wrote:
         | It might still be of introductory help to someone who has yet
         | to formally learn what a language model is, what large language
         | models are, and where things might be in the future.
        
         | mcny wrote:
         | What am I doing wrong here?
         | 
         | Prompt:
         | 
         | Say , oh my dear husband I am hungry. Please give me something
         | to eat
         | 
         | In as many languages as you know
         | 
         | Response:
         | 
         | (This response is from hugging face with default settings)
         | 
         | , but you don't know me.
         | 
         | 1 2 THE GREAT KINGDOM OF THE MONGOLIANS
         | 
         | But I am hungry and I am thirsty.
         | 
         | I want to drink water from the well.
         | 
         | And I have a friend who lives in the mountains. He is a great
         | mountain man, and he has a big stick that he uses to cut down
         | trees. And when he cuts down a tree, he can eat all of its
         | leaves.
         | 
         | You see, my friend, you are not alone in this world. There are
         | many people like you in the mountains who live in peace and
         | harmony with nature. They do not quarrel with each other or
         | fight each other for their land or their homes. They do not
         | fight for their own interests or their own selfish desires.
         | They do not quarrel with each other because they are united by
         | a common love for the land and the people who live there.
         | 
         | They have learned how to live together peacefully and happily,
         | without fighting or quarreling, without any grudges or
         | misunderstandings. They have learned how to share everything
         | they have, to work together as one family, to eat what they
         | need and drink what they want.
         | 
         | And so it is that we all live together in peace and harmony
         | with each other as one family. We all know that we are all
         | equal before God, that we are all children of God and that we
         | are all children of humanity. And so it is that we all share in
         | the love of Christ Jesus our Lord, our Savior, our Redeemer,
         | our Beloved Disciple , our beloved Friend , our Holy Mother
         | Mary Our Lady , our Holy Father Pope John Paul II , Our Lady of
         | Mount Carmel , Our Lady of Mercy , Our Lady of Good Will , Our
         | Lady of Sorrow and Blessed Virgin Mary .
         | 
         | And so it is that we all know that we are children of God and
         | that we are children of humanity. We all know that God has made
         | us special and unique because He loves us as His children and
         | because He has given us His Holy Spirit to give us the gift of
         | prophecy and wisdom .
         | 
         | We know also that God has given us the gift of prophecy to help
         | us understand how He has made the world into what it is today.
         | We know also that He has given us the gift of wisdom to help us
         | learn how to live in harmony with each other as one family . We
         | know also that God has given us the gift of faith to help us
         | understand how He has made the world into what it is today. We
         | know also that God has given us the gift of knowledge to help
         | us understand how He has made the world into what it is today.
         | 
         | And so it is that we all know that God has given us His holy
         | name Jesus Christ our Lord, our Savior, our Redeemer, our
         | Beloved Disciple , Our beloved Friend , Our Holy Father Pope
         | John Paul II , Our Lady Mary Our Lady .
         | 
         | And so it is that we all know also that God has made the world
         | into what it is today. We know also that God has given us His
         | Holy Spirit to help us understand how He has made the world
         | into what it is today. We know also that God has given us His
         | Holy Name Jesus Christ our Lord, our Savior, our Redeemer, our
         | Beloved Disciple , Our beloved Friend , Our Holy Father Pope
         | John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .
         | 
         | And so it is that we all know also that God has made the world
         | into what it is today. We know also that God has given us His
         | Holy Spirit to help us understand how He has made the world
         | into what it is today. We know also that God has given us His
         | Holy Name Jesus Christ our Lord, our Savior, our Redeemer, our
         | Beloved Disciple , our beloved Friend , Our Holy Father Pope
         | John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .
         | 
         | And so it is that we all know also that God has made the world
         | into what it is today. We know also that God has given us His
         | Holy Spirit to help us understand how He has made the world
         | into what it is today. We know also that God has given us His
         | Holy Name Jesus Christ his Lord, our Savior, our Redeemer, our
         | Beloved Disciple , our beloved Friend , Our Holy Father Pope
         | John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .
         | 
         | And so it is that we all know also that God has made the world
         | into what it is today. We know also that God has given us His
         | Holy Spirit to help us understand how He has made the world
         | into
        
           | userbinator wrote:
           | You seem to have inadvertently summoned the latent Terry
           | Davis lurking inside the LLM.
        
         | tzury wrote:
          | Saw a post about this model here a day ago, I think. Tested it
          | now using the link you provided.
          | 
          | That is a lot of output for such a simple question...
         | PROMPT: Why do gyms fully packed the first two weeks of a year?
         | RESPONSE: I think it's because they're the first two weeks of a
         | year. You're going to be out in the world for a while, and
         | you're going to be in a lot of different places. So you're
         | going to have a lot of people that are not familiar with gyms.
         | And so it's kind of like, "Oh, I'll just go there." It's kind
         | of like, "I'll just go to this other place." So that's how it
         | works.              And I think what we've seen is that some
         | gyms are starting to really embrace social distancing. They're
         | doing things like wearing masks or getting masks on the bus or
         | whatever else they can do to help prevent people from getting
         | sick. But they're also doing things like encouraging people to
         | get their kids into school and encourage them to stay home if
         | they feel sick. And so we see that kind of trend coming back in
         | gyms.              And I think that's what's really important
         | for us as a community is to make sure that we're not putting
         | ourselves at risk by not being able to social distance
         | ourselves properly. And so I think that's what we need to do
         | right now.              And one thing that I think is really
         | important is if you're going to be out in the world for a
         | while, and you don't want your kids or your family members or
         | your friends around you, then you need to wear masks. So if
         | you're going out with friends, and you don't want them around
         | you, then wear a mask. If you're going out with someone who has
         | been diagnosed with COVID-19, then wear a mask too.
         | We've seen many cases where people were wearing masks at events
         | and restaurants and things like that. But I think we need to
         | continue to wear masks even more during this pandemic period
         | because we know that the virus spreads through droplets when
         | people cough or sneeze or whatever else they do. And so it's
         | really important for us as individuals and as communities as
         | well as our government officials and our public health
         | officials to continue wearing masks even though we may not be
         | able to see them or touch them or smell them.              So
         | let me just end this video with a quote from the CDC: "The
         | Centers for Disease Control and Prevention (CDC) has released
         | new guidance on how best to protect yourself from COVID-19." So
         | please keep up the good work!              Thank you very much
         | for watching!              We'll talk more about the latest
         | news in fitness in just a minute, but first, let's talk about
         | why we should wear masks.
        
           | Bengalilol wrote:
           | The model just went "to be out in the world for a while"
        
           | dreis_sw wrote:
           | I had the same experience. It's clearly very limited
           | intellectually.
        
         | atombender wrote:
         | Prompt: Are there vampires on Mars?
         | 
         | Output: There are no vampires on Mars. The only vampires that
         | exist on Mars are those who live in the colony of Hades, which
         | is located near the center of the planet. There are also
         | vampires who live in the underworld, but they are not as common
         | as vampires on Earth.
        
       | cjohnson318 wrote:
       | You can have small languages, sure, but then you run into awkward
       | extended clarifying clauses. The thing that makes languages
       | difficult is that almost all vocabulary is sparse. The "Top N"
       | words in a language are always pronouns, prepositions, articles,
       | and the conjugations of the top 12 or so verbs: to be, to have,
       | to do, to go, to come, to say, to give, etc. This is the reason
       | that "Top N Words of Language X" and "Learn the Top 50% of Words
       | in Language X" listicles/videos are always disappointing.
        
         | nine_k wrote:
         | But they seem to use much wider grammars, because their
         | (synthetic) dataset is a bunch of coherent stories at the level
         | of 3-4 y.o. children.
         | 
         | I would consider the "Simple English Wikipedia" the next
         | training set / benchmark.
        
         | Pikamander2 wrote:
         | There's an oddly relevant skit of this concept in the American
         | version of The Office:
         | https://www.youtube.com/watch?v=_K-L9uhsBLM
        
       | momojo wrote:
       | > We hope that TinyStories can facilitate the development,
       | analysis and research of LMs, especially for low-resource or
       | specialized domains, and shed light on the emergence of language
       | capabilities in LMs.
       | 
       | This part interests me the most. I want to know how small yet
       | functional we can get these models. I don't want an AI that can
       | solve calculus, I just want a dumb AI that pretty consistently
       | recognizes "lights off" and "lights on".
        
         | MobiusHorizons wrote:
         | why would you use an LLM for that? Seems like there are much
         | better options available.
        
           | londons_explore wrote:
           | It's actually pretty hard to design a non-llm system that can
           | detect all the possible variations:
           | 
           | Lights on. Brighter please. Turn on the light. Is there light
           | in here? Turn the light on. Table lamp: on. Does the desk
           | lamp work? It's a bit dim here, anything you can do? More
           | light please. Put the lights on for the next 5 mins. Turn the
           | light on when I come home. Turn all the lights off together.
            | Switch the lights off whenever it's daytime or quiet at home
           | unless I say otherwise. etc.
           | 
           | If you don't support every possible way of saying a command,
           | then users will get frustrated because they effectively have
           | to go and learn the magic incantation of words for every
           | possible action, which is very user-unfriendly.
        
             | anon373839 wrote:
             | I suspect ModernBERT can also be very helpful with these
             | sorts of tasks, if you decompose them into an intent
             | classification step and a named entity recognition step.
        
               | simcop2387 wrote:
                | That entity extraction is where it actually gets really,
                | really difficult, even for LLMs, since people will use 10
                | different names for the same thing and you'll have to
                | know them ahead of time to handle them all properly.
                | Whether BERT-based or LLM-based, the system somewhat
                | needs to learn those new names on the fly unless you
                | require users to enter them all ahead of time. That said,
                | I've seen LLMs handle this a lot better with a list of
                | aliases in the prompt for each room and each type of
                | device when playing with Home Assistant + an LLM.
        
               | ted_dunning wrote:
               | It isn't unreasonable to imagine one recognizable intent
               | to be teaching new terminology. That would allow dialogs
               | where the machine doesn't understand the object of the
               | command and the human then says something like "when I
               | say x, I mean the y" and the computer updates the
               | training set for the named entity recognizer and does a
               | quick fine-tuning pass.
        
             | phkahler wrote:
              | Your examples include complex instructions and questions,
              | but for simple ON/OFF commands you can go far by pulling
              | out keywords and ignoring sentence structure. For example,
              | picking out "on", "off", and "light" will work for "turn
              | the light on", "turn off the light", "light on", "I want
              | the light on", etc. Adding modifiers like "kitchen" or
              | "all" can help specify which lights (your "Table lamp: on"
              | example), regardless of how they're used. I'm not saying
              | this is a great solution, but it covers pretty much all the
              | basic variations for simple commands and can run on
              | anything.
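              | 
              | A toy sketch of that keyword approach (the word lists are
              | made up; extend them per installation):
              | 
              |     def parse_command(text):
              |         words = set(text.lower().split())
              |         action = ("on" if "on" in words else
              |                   "off" if "off" in words else None)
              |         room = next((w for w in ("kitchen", "all")
              |                      if w in words), "default")
              |         if action and words & {"light", "lights", "lamp"}:
              |             return (room, action)
              |         return None  # not a light command
              | 
              |     # parse_command("light on") -> ("default", "on")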
        
               | astrange wrote:
               | Original Siri used a primitive sentence embedding and
               | then nearest-neighbor of the vector to supported
               | commands.
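                | 
                | Roughly that shape; embed() here stands in for whatever
                | small sentence encoder you use, and this is not Siri's
                | actual code:
                | 
                |     import numpy as np
                | 
                |     COMMANDS = ["turn the lights on",
                |                 "turn the lights off",
                |                 "set a timer", "send a text"]
                |     vecs = np.array([embed(c) for c in COMMANDS])
                | 
                |     def nearest_command(utterance):
                |         # dot product == cosine sim for unit vectors
                |         sims = vecs @ embed(utterance)
                |         return COMMANDS[int(np.argmax(sims))]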
        
             | MobiusHorizons wrote:
             | That is certainly a hard problem, but how do you know such
             | a system is better than simpler command based ones? Systems
             | like you describe have a much higher chance of taking the
             | wrong action. Even humans do, when given ambiguous
             | instructions. I know everyone loves to hate on Siri because
             | it doesn't do a good job of understanding anything complex,
             | but I have always found it to be very reliable when you
             | find the right command. As a result I use it much more
             | often than google assistant (was team android/ pixel until
             | 2020). I especially use it for timers, reminders, and
             | navigation, and if my hands are not free, texts. Taking the
             | wrong action breeds distrust, which I also think is not
             | user friendly.
        
       | fl0id wrote:
        | They also describe a new benchmark/evaluation (telling GPT-4 to
        | check the output as if it were grading student essays), but tbh
        | is there any evidence that this even works? We know it cannot
        | really do this, and the model used will not even stay consistent
        | if there are updates.
        
       | fi-le wrote:
        | We're doing a successor to this, working hard and going public
        | in a month or so, hopefully. But HN gets a preview of course:
       | https://huggingface.co/datasets/lennart-finke/SimpleStories
       | 
       | And here's a more interactive explorer: https://fi-
       | le.net/simplestories
        
         | jmward01 wrote:
         | This looks like a great dataset! Thanks for posting. I'm
         | looking for projects just like this to try my training
         | modifications against. Do you have initial results shown? It is
         | a small model/dataset so training the gpt2 model in the repo
         | probably wouldn't be too hard but it would be good to have
         | reference runs to make sure things are set up right when I run
         | it.
        
           | fi-le wrote:
           | So glad you like it! If I understand your question correctly,
           | yes, we are also putting together a small library for
           | training small language models. It's not mature at all yet,
            | but you can keep up with our progress here:
           | https://github.com/danbraunai/simple_stories_train
        
             | jmward01 wrote:
             | Yeah. I looked at the dataset and there are a lot of
             | possible tasks you could train against here since it has
             | some great annotations. So, having a simple reference
             | baseline, like a pretrain gpt2 run (which I think your repo
             | is set up to do), helps give a starting point for other
             | work. It looks like the dataset is small enough and the
             | gpt2 ref code in your repo is lightweight enough to do a
             | quick run and plot some curves on. Thanks!
        
         | coder543 wrote:
         | Does template_plural actually work well / offer any benefits?
        
       | bigmattystyles wrote:
        | I've been curious about the opposite - a lot of times, I'll put
        | in a few keywords that get to the point of what I want, even
        | though it's incoherent English - and yet, often the output is on
        | point.
        
         | Suppafly wrote:
         | I know natural language is sorta the gold standard for a lot of
          | these models, but honestly I could see a lot of utility in a
          | stripped-down language set, similar to how you used to be
         | able to search google back in the day before they tried to make
         | it easier.
        
       | niemandhier wrote:
        | This question is also quite possibly the most promising way to
       | get an upper bound on the Kolmogorov complexity of human
       | language.
        
       | lenerdenator wrote:
       | "Coherent" seems relatively subjective, no?
       | 
       | Could you get an LLM to generate "coherent" conversational
       | Geordie English? Probably, but my Midwestern ear isn't going to
       | be able to understand what they're saying.
        
       | osaariki wrote:
       | For some interesting context: this paper was a precursor to all
        | the work on synthetic data at Microsoft Research that led to the
       | Phi series of SLMs. [1] It was an important demonstration of what
       | carefully curated and clean data could do for language models.
       | 
       | 1: https://arxiv.org/abs/2412.08905
        
       | HarHarVeryFunny wrote:
       | I'd guess that the ability of a very small model to do well on
       | the TinyStories dataset isn't just because of the limited 3-4yr
       | old vocabulary, but also because of it being an LLM-generated
       | dataset.
       | 
       | LLM-generated content (synthetic data) is easier than human
       | generated text for an LLM to learn because it was auto-
       | regressively generated, and therefore should be possible to auto-
       | regressively predict.
       | 
       | It's surprising that LLMs do as well as they do attempting to
       | predict human generated training samples where there is no
       | guarantee that the predictive signal is actually contained in the
       | sample (it may just be something in the mind of the human that
       | generated it).
       | 
       | I've got to wonder what the impact on generation is of an LLM
       | only trained on synthetic LLM-generated data? I'd guess it
       | wouldn't be as robust as one that had learned to handle more
       | uncertainty.
        
         | kouteiheika wrote:
         | > I'd guess that the ability of a very small model to do well
         | on the TinyStories dataset isn't just because of the limited
         | 3-4yr old vocabulary, but also because of it being an LLM-
         | generated dataset.
         | 
          | Your guess is correct. The level of vocabulary has little to do
         | with it. There was a paper about this a while back (sorry,
         | can't find the link) where they found that the model still
         | learned just as well when they increased the complexity of the
         | text, as long as the texts were LLM generated.
        
       | raymv wrote:
       | Trained a GPT-2 like model on the dataset a while back, here's
       | the source code and some results for anyone interested:
       | 
       | https://github.com/raymond-van/gpt-tinystories
        
       | mclau156 wrote:
        | Side note, but is it really that crazy for GitHub to implement a
        | feature to see the file size of a repo?
        
       | ankit219 wrote:
       | Great to see this here. We used this dataset from Tiny Stories to
       | train small models (as small as 20M params) and test out
       | knowledge addition. Published a paper based on this dataset. We
        | could get coherent outputs at sizes as low as 20M-25M (not as
        | great as large LLMs, but still decent enough).
       | 
       | [1]: Blog + Paper: https://medium.com/@ankit_94177/expanding-
       | knowledge-in-large... (Paper is titled: Cross-Domain Content
       | Generation with Domain-Specific Small Language Models)
        
       | lutusp wrote:
       | Decades ago, prior to the existence of personal computers, when a
       | "computer" was a glassed-in room staffed by lab-coat-wearing
       | technicians (picture John Von Neumann standing next to the first
       | stored-program computer:
       | https://www.theguardian.com/technology/2012/feb/26/first-com...),
       | someone reduced an entire printed book (or more than one) to a
       | word-token decision tree, at great cost and effort, just to see
       | what would happen.
       | 
       | I can't find the original paper, but with an appropriate amount
       | of pseudorandomness to avoid dead ends, this primitive algorithm
       | would generate the occasional sentence that almost made sense and
       | that bore little resemblance to the original data.
       | 
       | Because of the state of computer technology it was a massive
       | effort and a source of general astonishment. I suspect we're now
       | recreating that minimal environment, this time with better ways
       | to curate the data for small size and maximum drama.
       | 
       | Let's remember that a modern GPT isn't far removed from that
       | scheme -- not really.
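        | 
        | The modern equivalent of that experiment fits in a few lines of
        | Python. This is a plain bigram Markov chain, not a claim about
        | the original paper's exact method:
        | 
        |     import random
        |     from collections import defaultdict
        | 
        |     # reduce a text to a word-token table (bigram "decision tree")
        |     def build_table(text):
        |         table = defaultdict(list)
        |         words = text.split()
        |         for a, b in zip(words, words[1:]):
        |             table[a].append(b)
        |         return table
        | 
        |     # walk it with some pseudorandomness to avoid dead ends
        |     def generate(table, start, n=20):
        |         out = [start]
        |         for _ in range(n):
        |             nxt = table.get(out[-1])
        |             if not nxt:
        |                 break
        |             out.append(random.choice(nxt))
        |         return " ".join(out)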
        
         | wrycoder wrote:
         | They didn't wear lab coats. That's for chemists and biologists.
        
       | Animats wrote:
       | (2023), as someone mentioned.
       | 
       | It's encouraging to see how much can be done with tiny models.
       | 
       | Still need to crack "I don't know" recognition, so you can start
       | with a tiny model and then pass the buck to a bigger model for
       | hard questions. That will enormously reduce the cost of "AI"
       | customer support.
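        | 
        | Something like this hypothetical cascade; the confidence signal
        | is exactly the unsolved part, and the threshold and model
        | handles are made up:
        | 
        |     def answer(question, threshold=0.7):
        |         reply, confidence = tiny_model(question)
        |         if confidence >= threshold:
        |             return reply              # cheap path
        |         return big_model(question)    # pass the buck upstream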
        
       | alister wrote:
       | > _short stories generated by GPT-3.5 and GPT-4 to train LMs that
       | are smaller_
       | 
       | The loop of development is fascinating:
       | 
       | Millions of humans write literature, Wikipedia, etc.
       | 
       | Large language models are trained on that body of work.
       | 
       | Now large language models generate training data for small
       | language models.
       | 
       | What's the next iteration? A talking Buzz Lightyear toy with one
       | of those small language models that'll teach (human) infants to
       | talk?
        
         | ocean_moist wrote:
         | This is actually a common pattern called "model distilling".[0]
         | 
         | [0] https://platform.openai.com/docs/guides/distillation
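          | 
          | In its simplest sequence-level form that's just: sample text
          | from the big model, fine-tune the small one on it. A
          | hypothetical sketch (the model handles are placeholders, not
          | OpenAI's API):
          | 
          |     # teacher (large model) generates a synthetic corpus
          |     corpus = [big_model.generate(p) for p in prompts]
          |     # student (small model) trains on it with the usual
          |     # next-token prediction loss
          |     small_model.fit(corpus)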
        
         | visarga wrote:
         | > A talking Buzz Lightyear toy with one of those small language
         | models that'll teach (human) infants to talk?
         | 
         | Great idea. I was thinking more like a plushie toy with
          | sensors; it would react to touch, sight and speech. I would run
          | the models locally from a computer and keep the toy as just
          | lightweight I/O.
        
         | nickpsecurity wrote:
         | I thought that, too. It wasn't really true, though.
         | 
         | Some papers pointed out that the models start failing after
         | being trained with too much synthetic data. They also need tons
         | of random, Internet data in the first place. Humans don't have
          | those failure modes. The AIs also got smarter the more data we
         | produced.
         | 
         | So, there's some critical differences between what we're doing
         | and what they're doing that keep it from being a neat flow like
         | that. What many humans do in training other humans fits that,
         | though.
        
       | ted_dunning wrote:
       | This is good (though old) work. One interesting aspect is that
        | even though they tout their models' ability to produce grammatical
       | text, their abstract has a number of grammatical errors.
       | 
       | On a dark night in a melancholic mood, that might seem to bear on
        | the question of what it is to be truly human and the bar we set
       | for (linguistically) anthropomorphic automatons like LLMs.
        
       | ted_dunning wrote:
       | This idea of machine-curated or machine-generated training data
       | goes back thirty years or so under different names, at least in
       | analogy.
       | 
       | A classic thing to do with ancient fraud detection models was
       | 
       | a) train a large/complex model on a reasonably accurate dataset
       | 
       | b) select training examples from the original training data and
       | other examples that the original model gets right.
       | 
       | c) train a simpler model on the filtered dataset
       | 
        | This worked very well in the simpler world of simple classifiers,
        | particularly when the original training data had errors in it.
       | Trying to fit these errors made the first model more complex (and
       | still it often failed). The subset of training examples excluded
       | many or even most of the erroneous examples and also excluded
       | subtle cases that were impossible to learn within the complexity
       | bounds of the time.
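        | 
        | A minimal sketch of steps (a)-(c) with scikit-learn stand-ins,
        | assuming X and y are the (noisy) training features and labels;
        | the actual fraud models of that era were of course different:
        | 
        |     from sklearn.ensemble import RandomForestClassifier
        |     from sklearn.linear_model import LogisticRegression
        | 
        |     # a) train a complex model on the noisy dataset
        |     big = RandomForestClassifier().fit(X, y)
        | 
        |     # b) keep the examples the complex model gets right
        |     mask = big.predict(X) == y
        | 
        |     # c) train a simpler model on the filtered dataset
        |     small = LogisticRegression().fit(X[mask], y[mask])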
        
       | make3 wrote:
       | Try this one instead :)
       | https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct Much
       | better, a modern attempt at a small model
        
       | bazmattaz wrote:
       | I honestly think a fantastic use case for a tiny model is better
       | spell checking on a phone.
       | 
       | I can't believe it's 2025 and spell checkers are still quite dumb
        
       ___________________________________________________________________
       (page generated 2025-01-03 23:01 UTC)