[HN Gopher] TinyStories: How Small Can Language Models Be and St...
___________________________________________________________________
TinyStories: How Small Can Language Models Be and Still Speak
Coherent English? (2023)
Author : tzury
Score : 209 points
Date : 2025-01-02 17:54 UTC (1 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jasonjmcghee wrote:
| Edit: ah. This paper is from May 2023. Might be worth putting
| that in the title.
|
| ---
|
| > Models with around 125M parameters such as GPT-Neo (small) or
| GPT-2 (small) can rarely generate coherent and consistent English
| text beyond a few words
|
| These models are 5 years old.
|
| I have to wonder if the authors have seen RWKV 7 0.1B, because it
| blows away just about every other model I've seen at that size.
|
| The capabilities it has vs the examples in the paper are night
| and day.
|
| https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-1
| jmward01 wrote:
| The age of tiny models is just about here. We are finally
| busting away from the standard transformer block and training.
| I have a side project that can dramatically shrink networks by
| a set of techniques I call sacrificial training[1]. I think
| things like this will finally swing things back to on prem and
| on device small models that are as capable as the big hosted
| models are now.
|
| [1] https://github.com/jmward01/lmplay/wiki/Sacrificial-
| Training
| refulgentis wrote:
| RWKV is def. better than TinyStories 125MB.
|
| Unfortunately, I have only seen 3 models, 3B or over, handle
| RAG.
|
| Tested RWKV with a simple in-the-sports-news question and it
| didn't even get close to approaching the question. And nearly
| everything was fundamentally incoherent even in its internal
| reality (ex. Player gets 5000/game and is the first with 1000
| in 16 games)
|
| (prompt: https://pastebin.com/sCLn5sCJ, response:
| https://pastebin.com/TqudvDbN)
|
| I don't think there's a position for LLMs that are "just"
| writers on the market in 2025.
| jasonjmcghee wrote:
| These tiny models in general have really weird failure
| modes. I tried the tiny stories prompt about asking mom for
| a dog who said no, and it output an incredibly dark story
| about how she asked her dad and they got a dog but it had
| pancreatic cancer (paraphrasing, it went into detail about
| the surgery etc.) and then started writing an
| informational PSA about who is at risk of pancreatic cancer
| etc.
| kube-system wrote:
| What I find fascinating is how ML models hallucinate in a
| way that is sometimes reminiscent of a fever dream.
| ethbr1 wrote:
| It makes sense that the failure modes of language
| prediction look a lot like ADD.
| p0w3n3d wrote:
| It's because they are precisely lacking attention
| jdiff wrote:
| Don't fall into the trap of applying human psychology to
| LLMs. Bag-of-chemistry quirks do not translate to matrix-
| multiplication quirks.
| ethbr1 wrote:
| Why not? In both cases the result is losing the thread of
| thought.
| jdiff wrote:
| ...No, no that's not how ADHD works. It's difficult to
| sum up how wrong this is concisely, but I invite you to do
| some serious research into ADHD, how it functions, and
| the great variety of ways in which it can present in
| different people. It's quite a poor analogy.
| ethbr1 wrote:
| I'm aware that anything to do with the brain has a
| variety of presentations.
|
| Could you try to put a couple sentences down on how ADHD
| is an inapt metaphor for failure modes in this case?
|
| It's lazy to claim something is wrong without offering a
| useful point as to how it's wrong. I trust in your
| ability to summarize.
| jdiff wrote:
| For additional context/discussion, I feel this comment[0]
| elsewhere in the thread put it well.
|
| The reply to that comment also has some information I
| feel is helpful to show the breakdown here. It mentions
| that lack of attention presents in only 15-20% of cases.
| This isn't ADHD, it is something new, the fundamental
| underpinnings do not relate, and so the analogy/metaphor
| does not facilitate a better understanding of the
| situation.
|
| On the contrary, it makes LLM "attention" out to be
| something entirely different from what it actually is.
| Without attention, models don't become easily distracted.
| They are easily distracted regardless. Without attention,
| LLMs _primarily_ fail to disambiguate between different
| meanings of identical words; they fail to take the context of
| the sentence structure into account when assigning
| meaning.
|
| I hopefully don't have to dive into psychological and
| chemical specifics of ADHD to have demonstrated that this
| is fundamentally just not at all what ADHD is. Again,
| there is no underlying harmony between this mechanism and
| how ADHD affects human attention in 15-20% of cases, and
| there is no analogy.
|
| The only similarity is that they both use the word
| "attention". If they'd used a different label, we
| wouldn't even be having this conversation right now.
|
| [0] https://news.ycombinator.com/item?id=42585600
| marxisttemp wrote:
| It's lazier to claim something is correct without
| offering a useful point as to how it's correct. I trust
| in your ability to theorize.
| hobs wrote:
| Because analogy can be useful in explaining things, or it
| can be worse than useless - it ties our thinking up into
| side quests that have nothing to do with the matter at
| hand.
| soulofmischief wrote:
| ADHD is an actively-researched dopaminergic disorder with
| a host of possible symptoms completely unrelated to
| attention or hyperactivity.
|
| It is ill-named and thus one often encounters comments
| such as yours in the real world, which while not meant to
| be negative, can be marginalizing to those with ADHD who
| see their disorder as misunderstood and the term misused
| much like people who say "I'm depressed" or "They're
| acting schizo again".
|
| LLMs do not have dopamine pathways and therefore we
| should avoid comparing them to human-specific brain
| disorders, or marginalizing ADHD folk by trivializing the
| disorder or spreading misinformation about the
| presentation of ADHD. LLM hallucination does not "look a
| lot like ADD", that's such a vague and unsupported claim.
| Furthermore, "lacking attention" doesn't even make sense
| with respect to attention models. The "attention" in ADHD
| and "attention" in transformers share a semantic basis
| but are two very different phenomena.
| robwwilliams wrote:
| For a good overview of ADHD see
|
| https://www.ncbi.nlm.nih.gov/books/NBK441838/
|
| It is not "a dopaminergic disorder" any more than many
| other neuropsychiatric disorders. Nothing much happens in
| CNS without some level of modulation by dopaminergic
| receptors, and to the best of my knowledge variants in
| these receptors are not known to contribute strongly to
| ADHD (I just confirmed by reviewing the GWAS Catalog:
| ebi.ac.uk/gwas/efotraits/EFO_0003888 ).
|
| Furthermore, lack of attention is considered an important
| facet of ADHD---common to about 15-20% of cases.
|
| Humans tend to think in terms of metaphors. Similes and
| metaphors are crucial in learning and thinking. And yes,
| sometimes problematic.
|
| Explaining what is wrong with a particular metaphor can
| help.
| taneq wrote:
| A fever dream looks nothing like ADD. If anything it's
| like a very mild mushroom trip. Did you base this on
| anything or did it just sound good in your head?
| ethbr1 wrote:
| Your fever dreams and/or mushroom trips must be a lot
| more narratively stable and consistent than mine...
| nerdponx wrote:
| Lest we forget that this stream-of-consciousness
| confusion was state of the art just a few years ago.
|
| It makes sense if you think about it: a small model's
| "internal state" isn't rich enough to keep track of
| whatever it was supposed to be talking about.
|
| It makes me think that the reason LLMs need to be so
| large is that the internal state needs to be bigger than
| a typical human "idea", whatever that might mean.
| acchow wrote:
| The way we do LLMs now is that the program and the data
| are one and the same. The program mutates itself as it
| "executes". This is probably also how the brain works
| since there is no hard separation between "memory"
| neurons and "data processing" neurons. (biology has no
| hard separation in general).
| seattleeng wrote:
| As is usually the case, check the data! A lot of the
| dataset used has fairly morbid scenarios, so the model is
| working as expected. All the data was synthetically
| created with GPT-4.
| jmward01 wrote:
| I plan on checking out RWKV and seeing if I can add my
| sacrificial training techniques to it this weekend. There is
| a reason quantization works: models are very badly trained
| right now. I think we can get really good
| performance on .1b and 1b models which opens up the world
| to fine-tuning again. I was playing with fine-tuning llama
| 7b and 13b a while back but the HW/SW stack made it so
| unwieldy and the ROI was terrible compared to just
| adjusting prompts on gpt-4o-mini and the like. I have hope
| that we are about to see single GPU, very simple, fine-
| tuning again as models shrink and GPUs grow.
| daxfohl wrote:
| Would there be any way to distribute RAG across multiple
| smaller models? Rather than one giant model handling your
| entire document base, have it be more of a tree where the
| top level classifies the docs into top-level categories and
| sends it to submodels to subclassify, etc? (Doesn't have to
| be 1:1 classification). And same for q/a search?
|
| These could all presumably be the same physical instance,
| just each query would use a different system prompt and
| perhaps different embeddings. (I'm guessing; I don't
| actually know how RAG works). So, a little slower and
| clunkier, but presumably way more efficient. And match
| could be anywhere from horrible to better-than-one-
| large-model. This would be more like how businesses
| organize docs.
|
| Or maybe there's no real benefit to this, and each
| subclassifier would require just as big of a model as if
| you were to throw all docs into a single model anyway. I
| assume it's probably been tried before.
| refulgentis wrote:
| TL;DR: It's a very interesting line of thought that as
| late as Q2 2024, there were a couple of thought leaders who
| pushed the idea that we'd have, like, 16 specialized local
| models.
|
| I could see that in the very long term, but as it stands,
| it works the way you intuited: 2 turkeys don't make an
| eagle, i.e. there's some critical size where it's speaking
| coherently, and it's at least an OOM bigger than it needs
| to be in order to be interesting for products.
|
| fwiw RAG for me in this case is:
|
| - user asks q.
|
| - llm generates search queries.
|
| - search api returns urls.
|
| - web view downloads urls.
|
| - app turns html to text.
|
| - local embedding model turns text into chunks.
|
| - app decides, based on "character" limit configured by
| user, how many chunks to send.
|
| - LLM gets all the chunks, instructions + original
| question, and answers.
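|
| A rough sketch of that flow in Python (every helper here --
| llm, search_api, download, html_to_text, chunk_text,
| rank_chunks -- is a hypothetical stand-in, not a real API):
|
|   def answer(question, char_limit=8000):
|       # LLM turns the question into web search queries
|       queries = llm("Write web search queries for: " + question)
|       urls = [u for q in queries.splitlines()
|               for u in search_api(q)]
|       # fetch pages, strip HTML, chunk the text
|       pages = [html_to_text(download(u)) for u in urls]
|       chunks = [c for p in pages for c in chunk_text(p)]
|       context = ""
|       for chunk in rank_chunks(chunks, question):  # best first
|           if len(context) + len(chunk) > char_limit:
|               break  # user-configured "character" budget
|           context += chunk + "\n"
|       # final call: chunks + instructions + original question
|       return llm("Answer using only this context:\n" + context
|                  + "\n\nQuestion: " + question)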
|
| It's incredibly interesting how many models fail this
| simple test, there's been multiple Google releases in the
| last year that just couldn't handle it.
|
| - Some of it is basic "too small to be coherent"; bigcos
| don't make that mistake, though.
|
| - There's another critical threshold where the model
| doesn't wander off doing the traditional LLM task of
| completing rather than answering. What I mean is,
| throwing in 6 pages worth of retrieved webpages will
| cause some models to just start rambling like it's writing
| more web pages, i.e. they're not able to "identify the
| context" of the web page snippets, and they ignore the
| instructions.
| groby_b wrote:
| There's just been a twitter post by Omar Khattab
| (@lateinteraction) on encoding documents into a scoring
| function instead of a simple vector for the work on
| ColBERT - and maybe at some point using a DNN as scoring
| function.
|
| So, yes, maybe there's a way to "distribute" RAG. (I
| still wonder if that isn't just MoE taken to its logical
| conclusion)
|
| So, dig for ColBERT papers, might be helpful. (I wish I
| had the time to do that)
| ankit219 wrote:
| Short answer: Yes, there are ways it can be done.
| Multiple. Needs to be custom built though, given no one
| has explored it deeply yet.
|
| One simple way is what Omar Khattab (ColBERT) mentioned
| about scoring function instead of a simple vector.
|
| Another is to use a classifier at the start directing
| queries to the right model. You will have to train the
| classifier though. (I mean a language model kind of does
| this implicitly, you are just taking more control by
| making it explicit.)
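|
| A minimal sketch of that explicit routing, using scikit-learn
| as a stand-in classifier (the domain labels, training queries,
| and per-domain small models are all hypothetical):
|
|   from sklearn.feature_extraction.text import TfidfVectorizer
|   from sklearn.linear_model import LogisticRegression
|   from sklearn.pipeline import make_pipeline
|
|   # tiny labelled routing set (made up for illustration)
|   queries = ["who won the game last night",
|              "what was the closing share price yesterday"]
|   labels = ["sports", "finance"]
|
|   router = make_pipeline(TfidfVectorizer(), LogisticRegression())
|   router.fit(queries, labels)
|
|   def route(query, models):
|       # models: dict mapping domain label -> small domain model
|       domain = router.predict([query])[0]
|       return models[domain].generate(query)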
|
| Another is how you index your docs. Today, most RAG
| approaches do not encode enough information. If you have
| defined domains/models already, you can encode the same
| in metadata for your docs at the time of indexing, and
| you pick the model based on the metadata.
|
| These approaches would work pretty well, given that a model
| as small as 100M can regurgitate what is in your docs, and is
| faster than your larger models.
|
| Benefit wise, I don't see a lot of benefit except
| preserving privacy and gaining more control.
| daxfohl wrote:
| I was originally thinking about it as like a bazel plugin
| for large codebases. Each module would have its own LLM
| context, and it might make it easier to put whole modules
| into the context, plus summaries of the dependencies.
| That could work better than a single huge context
| attempting to summarize the whole monorepo.
|
| The general idea is probably better for the code use
| case too, since having the module's whole codebase in
| context likely allows for more precise edits. Whereas RAG
| is just search, not edit.
|
| That said, code assistants probably do something like this
| already, though it must be more ad hoc. Obviously they
| wouldn't be able to do any completions if they don't have
| detailed context of the adjacent code.
| Pamar wrote:
| _Another is how you index your docs. Today, most RAG
| approaches do not encode enough information...._
|
| Could you please provide some more info (or maybe links)
| about this?
| antman wrote:
| I think he might be saying: have metadata in your vector
| retrieval that describes the domain of the retrieved chunk,
| and use that to decide which model to use downstream. Sounds
| like a very interesting improvement to RAG.
| raegis wrote:
| > Unfortunately, I have only seen 3 models, 3B or over,
| handle RAG.
|
| What's the unit "B" in "3B"? I can search for acronyms like
| "RAG" just fine, but you experts aren't making it easy for
| us beginners :)
|
| Edit: Apologies, this is obvious. My brain needed a reboot
| for the new year.
| cauliflower2718 wrote:
| You can ask an LLM exactly this question and it will tell
| you.
|
| (The answer is billions of parameters)
| SketchySeaBeast wrote:
| But what if they want to know they are finding the
| correct answer?
| elliotto wrote:
| Asking anonymous people on a forum would be much better.
| SketchySeaBeast wrote:
| At least a forum with domain-specific knowledge.
| gpm wrote:
| And people to go "no, that's wrong" if someone posts
| something that's wrong.
| greesil wrote:
| No, any answer will do.
| cauliflower2718 wrote:
| I think basic definitions for LLMs are solidly within the
| bounds of what we would expect e.g. chatgpt to be
| competent at. The task (defining terms) is simple and the
| specific content (basic LLM stuff) is easy to check by
| anyone who works on the LLM.
|
| I agree with the general sentiment that we should not
| just blindly trust LLMs though.
| jasonjmcghee wrote:
| tbf, the gp comment said 125MB and then 3B, which would
| be pretty confusing, as it's a typo and should be 125M.
| jwineinger wrote:
| The number of parameters in the model, in billions.
| jmward01 wrote:
| (B)illion. It indicates the rough number of parameters in
| the model. Higher is generally more capable. 1B models
| are currently at the top end of 'easy' to deal with for
| playing around fine tuning and the like for most home lab
| setups.
| a1o wrote:
| What is tiny and what is big?
|
| Can I have a model that is like 100MB in weights and run it
| with llama.cpp on my MacBook M2?
| refulgentis wrote:
| Yeah, absolutely -- you'll probably pull 100+ token/s.
|
| Here's a good range of model sizes that run just fine
| with llama.cpp on mac:
| https://huggingface.co/telosnex/fllama/tree/main.
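|
| If you'd rather script it than use an app, a minimal sketch
| with the llama-cpp-python bindings (the GGUF filename below is
| a placeholder for whichever small model you download):
|
|   from llama_cpp import Llama
|
|   llm = Llama(model_path="tiny-model-q8_0.gguf", n_ctx=2048)
|   out = llm("Once upon a time", max_tokens=64)
|   print(out["choices"][0]["text"])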
|
| I recommend trying the Telosnex* app, it uses llama.cpp
| and abstracts over LLMs so you can e.g. switch between
| local/servers at will.
|
| The important part for you is it's free, accelerated on
| macOS, and very easy to use local LLMs with (Settings >
| AI > LLM > On Device, tap Get)
|
| Prepare to be underwhelmed, slightly: it's only when you
| start hitting 3B that it's coherent; anything under that
| will feel more like a Markov chain than an LLM.
|
| Depending on how geeked out you'll be to have it running
| locally, you might have fun with the fact that Telosnex can
| run local models on _every_ platform, i.e. you can run local
| models on iOS/Android/web too.
|
| * because it's mine :3 It is quietly released currently.
| I want to get one more major update before widely
| announcing it in Jan 2025
| qskousen wrote:
| Sorry to side track, but question about Telosnex - would
| you consider a Linux release with something other than
| Snap? Maybe Flatpak or appimage?
| refulgentis wrote:
| If it's a (mostly) CI-able process, I'm totally open to it
| ---
|
| I looked into "What should I do besides Snap?" about 4
| months ago; got quickly overwhelmed, because I don't have
| enough knowledge to understand what's fringe vs. common.
|
| I'll definitely take a look at Flatpak again in the next
| month; a 30-second Google says it's possible (h/t /u/
| damiano-ferrari at https://www.reddit.com/r/FlutterDev/co
| mments/z35gdo/can_you_...)
|
| (thanks for your interest btw, been working on this for
| ~year and this is my first outside feature request :) may
| there be many more)
| a1o wrote:
| I have no interest in that. I would like small models
| that I can integrate and run offline in software that I
| make myself, be it IDEs or games. CLion has a nice
| predictive model for single-line C++ completion that is
| 400 MB in size.
| refulgentis wrote:
| Ah, totally possible, but wrapping llama.cpp will likely
| take a week to spike out and a month to stabilize across
| models.
|
| The biggest problem for relying on it for local software
| is there's just too much latency for ex. game use cases
| currently. (among other UX bugaboos)
| (https://news.ycombinator.com/item?id=42561095)
| jki275 wrote:
| LM Studio on Mac is your friend. You can choose any model
| you want, run a server for other tools, or chat direct
| with the model. It can use either MLX or just plain
| llama.cpp.
| wolfgangK wrote:
| <<Unfortunately, I have only seen 3 models, 3B or over,
| handle RAG.>>
|
| I would love to know which are these 3 models, especially
| if they can perform grounded RAG. If you have models (and
| their grounded RAG prompt formats) to share, I'm very
| interested !
|
| Thx.
| mr_toad wrote:
| > small models that are as capable as the big hosted model
|
| Free lunch?
| attentionmech wrote:
| wow, this RWKV thing blew my mind. Thank you for sharing this!
| SGML_ROCKSTAR wrote:
| It might still be of introductory help to someone who has yet
| to formally learn what a language model is, what large language
| models are, and where things might be in the future.
| mcny wrote:
| What am I doing wrong here?
|
| Prompt:
|
| Say , oh my dear husband I am hungry. Please give me something
| to eat
|
| In as many languages as you know
|
| Response:
|
| (This response is from hugging face with default settings)
|
| , but you don't know me.
|
| 1 2 THE GREAT KINGDOM OF THE MONGOLIANS
|
| But I am hungry and I am thirsty.
|
| I want to drink water from the well.
|
| And I have a friend who lives in the mountains. He is a great
| mountain man, and he has a big stick that he uses to cut down
| trees. And when he cuts down a tree, he can eat all of its
| leaves.
|
| You see, my friend, you are not alone in this world. There are
| many people like you in the mountains who live in peace and
| harmony with nature. They do not quarrel with each other or
| fight each other for their land or their homes. They do not
| fight for their own interests or their own selfish desires.
| They do not quarrel with each other because they are united by
| a common love for the land and the people who live there.
|
| They have learned how to live together peacefully and happily,
| without fighting or quarreling, without any grudges or
| misunderstandings. They have learned how to share everything
| they have, to work together as one family, to eat what they
| need and drink what they want.
|
| And so it is that we all live together in peace and harmony
| with each other as one family. We all know that we are all
| equal before God, that we are all children of God and that we
| are all children of humanity. And so it is that we all share in
| the love of Christ Jesus our Lord, our Savior, our Redeemer,
| our Beloved Disciple , our beloved Friend , our Holy Mother
| Mary Our Lady , our Holy Father Pope John Paul II , Our Lady of
| Mount Carmel , Our Lady of Mercy , Our Lady of Good Will , Our
| Lady of Sorrow and Blessed Virgin Mary .
|
| And so it is that we all know that we are children of God and
| that we are children of humanity. We all know that God has made
| us special and unique because He loves us as His children and
| because He has given us His Holy Spirit to give us the gift of
| prophecy and wisdom .
|
| We know also that God has given us the gift of prophecy to help
| us understand how He has made the world into what it is today.
| We know also that He has given us the gift of wisdom to help us
| learn how to live in harmony with each other as one family . We
| know also that God has given us the gift of faith to help us
| understand how He has made the world into what it is today. We
| know also that God has given us the gift of knowledge to help
| us understand how He has made the world into what it is today.
|
| And so it is that we all know that God has given us His holy
| name Jesus Christ our Lord, our Savior, our Redeemer, our
| Beloved Disciple , Our beloved Friend , Our Holy Father Pope
| John Paul II , Our Lady Mary Our Lady .
|
| And so it is that we all know also that God has made the world
| into what it is today. We know also that God has given us His
| Holy Spirit to help us understand how He has made the world
| into what it is today. We know also that God has given us His
| Holy Name Jesus Christ our Lord, our Savior, our Redeemer, our
| Beloved Disciple , Our beloved Friend , Our Holy Father Pope
| John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .
|
| And so it is that we all know also that God has made the world
| into what it is today. We know also that God has given us His
| Holy Spirit to help us understand how He has made the world
| into what it is today. We know also that God has given us His
| Holy Name Jesus Christ our Lord, our Savior, our Redeemer, our
| Beloved Disciple , our beloved Friend , Our Holy Father Pope
| John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .
|
| And so it is that we all know also that God has made the world
| into what it is today. We know also that God has given us His
| Holy Spirit to help us understand how He has made the world
| into what it is today. We know also that God has given us His
| Holy Name Jesus Christ his Lord, our Savior, our Redeemer, our
| Beloved Disciple , our beloved Friend , Our Holy Father Pope
| John Paul II , Our Lady of Mercy , Our Lady Mary Our Lady .
|
| And so it is that we all know also that God has made the world
| into what it is today. We know also that God has given us His
| Holy Spirit to help us understand how He has made the world
| into
| userbinator wrote:
| You seem to have inadvertently summoned the latent Terry
| Davis lurking inside the LLM.
| tzury wrote:
| Saw a post about this model here a day ago, I think. Tested
| it now using the link you provided.
|
| That is a lot to take in for such a simple question...
| PROMPT: Why do gyms fully packed the first two weeks of a year?
| RESPONSE: I think it's because they're the first two weeks of a
| year. You're going to be out in the world for a while, and
| you're going to be in a lot of different places. So you're
| going to have a lot of people that are not familiar with gyms.
| And so it's kind of like, "Oh, I'll just go there." It's kind
| of like, "I'll just go to this other place." So that's how it
| works. And I think what we've seen is that some
| gyms are starting to really embrace social distancing. They're
| doing things like wearing masks or getting masks on the bus or
| whatever else they can do to help prevent people from getting
| sick. But they're also doing things like encouraging people to
| get their kids into school and encourage them to stay home if
| they feel sick. And so we see that kind of trend coming back in
| gyms. And I think that's what's really important
| for us as a community is to make sure that we're not putting
| ourselves at risk by not being able to social distance
| ourselves properly. And so I think that's what we need to do
| right now. And one thing that I think is really
| important is if you're going to be out in the world for a
| while, and you don't want your kids or your family members or
| your friends around you, then you need to wear masks. So if
| you're going out with friends, and you don't want them around
| you, then wear a mask. If you're going out with someone who has
| been diagnosed with COVID-19, then wear a mask too.
| We've seen many cases where people were wearing masks at events
| and restaurants and things like that. But I think we need to
| continue to wear masks even more during this pandemic period
| because we know that the virus spreads through droplets when
| people cough or sneeze or whatever else they do. And so it's
| really important for us as individuals and as communities as
| well as our government officials and our public health
| officials to continue wearing masks even though we may not be
| able to see them or touch them or smell them. So
| let me just end this video with a quote from the CDC: "The
| Centers for Disease Control and Prevention (CDC) has released
| new guidance on how best to protect yourself from COVID-19." So
| please keep up the good work! Thank you very much
| for watching! We'll talk more about the latest
| news in fitness in just a minute, but first, let's talk about
| why we should wear masks.
| Bengalilol wrote:
| The model just went "to be out in the world for a while"
| dreis_sw wrote:
| I had the same experience. It's clearly very limited
| intellectually.
| atombender wrote:
| Prompt: Are there vampires on Mars?
|
| Output: There are no vampires on Mars. The only vampires that
| exist on Mars are those who live in the colony of Hades, which
| is located near the center of the planet. There are also
| vampires who live in the underworld, but they are not as common
| as vampires on Earth.
| cjohnson318 wrote:
| You can have small languages, sure, but then you run into awkward
| extended clarifying clauses. The thing that makes languages
| difficult is that almost all vocabulary is sparse. The "Top N"
| words in a language are always pronouns, prepositions, articles,
| and the conjugations of the top 12 or so verbs: to be, to have,
| to do, to go, to come, to say, to give, etc. This is the reason
| that "Top N Words of Language X" and "Learn the Top 50% of Words
| in Language X" listicles/videos are always disappointing.
| nine_k wrote:
| But they seem to use much wider grammars, because their
| (synthetic) dataset is a bunch of coherent stories at the level
| of 3-4 y.o. children.
|
| I would consider the "Simple English Wikipedia" the next
| training set / benchmark.
| Pikamander2 wrote:
| There's an oddly relevant skit of this concept in the American
| version of The Office:
| https://www.youtube.com/watch?v=_K-L9uhsBLM
| momojo wrote:
| > We hope that TinyStories can facilitate the development,
| analysis and research of LMs, especially for low-resource or
| specialized domains, and shed light on the emergence of language
| capabilities in LMs.
|
| This part interests me the most. I want to know how small yet
| functional we can get these models. I don't want an AI that can
| solve calculus, I just want a dumb AI that pretty consistently
| recognizes "lights off" and "lights on".
| MobiusHorizons wrote:
| why would you use an LLM for that? Seems like there are much
| better options available.
| londons_explore wrote:
| It's actually pretty hard to design a non-LLM system that can
| detect all the possible variations:
|
| Lights on. Brighter please. Turn on the light. Is there light
| in here? Turn the light on. Table lamp: on. Does the desk
| lamp work? It's a bit dim here, anything you can do? More
| light please. Put the lights on for the next 5 mins. Turn the
| light on when I come home. Turn all the lights off together.
| Switch the lights off whenever it's daytime or quiet at home
| unless I say otherwise. etc.
|
| If you don't support every possible way of saying a command,
| then users will get frustrated because they effectively have
| to go and learn the magic incantation of words for every
| possible action, which is very user-unfriendly.
| anon373839 wrote:
| I suspect ModernBERT can also be very helpful with these
| sorts of tasks, if you decompose them into an intent
| classification step and a named entity recognition step.
| simcop2387 wrote:
| That entity extraction is where it actually gets really
| difficult, even for LLMs, since people will use 10 different
| names for the same thing and you'll have to know them ahead
| of time to handle them all properly. For either BERT-based or
| LLM-based systems, there's a need for the system to correct
| itself and learn those new names, unless you require users to
| put them all in ahead of time. That said, I've seen LLMs
| handle this a lot better with a list of aliases in the prompt
| for each room and device type, when playing with Home
| Assistant + an LLM.
| ted_dunning wrote:
| It isn't unreasonable to imagine one recognizable intent
| to be teaching new terminology. That would allow dialogs
| where the machine doesn't understand the object of the
| command and the human then says something like "when I
| say x, I mean the y" and the computer updates the
| training set for the named entity recognizer and does a
| quick fine-tuning pass.
| phkahler wrote:
| Your examples include complex instructions and questions,
| but for simple ON/OFF commands you can go far by pulling
| key words and ignoring sentence structure. For example,
| pick out "on" "off" and "light" will work for "turn the
| light on", "turn off the light", "light on", "I want the
| light on", etc... Adding modifiers like "kitchen" or "all"
| can help specify which lights (your "Table lamp: on"
| example), regardless of how they're used. I'm not saying
| this is a great solution, but it covers pretty much all the
| basic variations for simple commands and can run on
| anything.
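|
| A toy sketch of that keyword-spotting idea (the device and
| room names are made up):
|
|   def parse_light_command(utterance):
|       # ignore sentence structure; just spot key words
|       words = set(utterance.lower().replace(":", " ").split())
|       if not words & {"light", "lights", "lamp"}:
|           return None
|       if "on" in words:
|           action = "on"
|       elif "off" in words:
|           action = "off"
|       else:
|           action = None
|       scope = "kitchen" if "kitchen" in words else \
|               "all" if "all" in words else "default"
|       return action, scope
|
|   # parse_light_command("I want the kitchen light on")
|   # -> ("on", "kitchen")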
| astrange wrote:
| Original Siri used a primitive sentence embedding and
| then nearest-neighbor of the vector to supported
| commands.
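|
| A sketch of that embed-and-nearest-neighbor idea using modern
| stand-ins (sentence-transformers and the all-MiniLM-L6-v2
| checkpoint; the original obviously used something far more
| primitive):
|
|   import numpy as np
|   from sentence_transformers import SentenceTransformer
|
|   COMMANDS = ["turn the lights on", "turn the lights off",
|               "set a timer", "send a text message"]
|
|   model = SentenceTransformer("all-MiniLM-L6-v2")
|   command_vecs = model.encode(COMMANDS, normalize_embeddings=True)
|
|   def nearest_command(utterance):
|       v = model.encode([utterance], normalize_embeddings=True)[0]
|       return COMMANDS[int(np.argmax(command_vecs @ v))]  # cosine
|
|   # nearest_command("it's a bit dark in here")
|   # -> likely "turn the lights on"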
| MobiusHorizons wrote:
| That is certainly a hard problem, but how do you know such
| a system is better than simpler command based ones? Systems
| like you describe have a much higher chance of taking the
| wrong action. Even humans do, when given ambiguous
| instructions. I know everyone loves to hate on Siri because
| it doesn't do a good job of understanding anything complex,
| but I have always found it to be very reliable when you
| find the right command. As a result I use it much more
| often than Google Assistant (I was team Android/Pixel until
| 2020). I especially use it for timers, reminders, and
| navigation, and if my hands are not free, texts. Taking the
| wrong action breeds distrust, which I also think is not
| user friendly.
| fl0id wrote:
| They also describe a new benchmark/evaluation (telling GPT-4 to
| check the output as if it were checking student essays), but tbh
| is there any evidence that this even works? We know it cannot
| really do this, and the model used will not even stay consistent
| across updates.
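|
| For reference, the grading step is roughly this shape (a
| paraphrase, not the paper's exact prompt; the openai client
| call is just one way to issue it):
|
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   RUBRIC = ("Grade the student's short story as a teacher "
|             "would: give scores from 1-10 for grammar, "
|             "creativity, and consistency, then a short comment.")
|
|   def grade(story):
|       resp = client.chat.completions.create(
|           model="gpt-4o",  # any capable judge model
|           messages=[{"role": "user",
|                      "content": RUBRIC + "\n\n" + story}],
|           temperature=0,
|       )
|       return resp.choices[0].message.content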
| fi-le wrote:
| We're doing a successor to this, working hard and going public in
| a month or so, hopefully. But HN gets a preview of course:
| https://huggingface.co/datasets/lennart-finke/SimpleStories
|
| And here's a more interactive explorer: https://fi-
| le.net/simplestories
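|
| Loading it is the usual one-liner with the Hugging Face
| datasets library (assuming the dataset id in the link above;
| the split name may differ):
|
|   from datasets import load_dataset
|
|   ds = load_dataset("lennart-finke/SimpleStories")
|   print(ds)              # inspect splits and columns
|   print(ds["train"][0])  # "train" split name is an assumption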
| jmward01 wrote:
| This looks like a great dataset! Thanks for posting. I'm
| looking for projects just like this to try my training
| modifications against. Do you have any initial results posted? It is
| a small model/dataset so training the gpt2 model in the repo
| probably wouldn't be too hard but it would be good to have
| reference runs to make sure things are set up right when I run
| it.
| fi-le wrote:
| So glad you like it! If I understand your question correctly,
| yes, we are also putting together a small library for
| training small language models. It's not mature at all yet,
| but you can keep up with our progress here:
| https://github.com/danbraunai/simple_stories_train
| jmward01 wrote:
| Yeah. I looked at the dataset and there are a lot of
| possible tasks you could train against here since it has
| some great annotations. So, having a simple reference
| baseline, like a GPT-2 pretraining run (which I think your repo
| is set up to do), helps give a starting point for other
| work. It looks like the dataset is small enough and the
| gpt2 ref code in your repo is lightweight enough to do a
| quick run and plot some curves on. Thanks!
| coder543 wrote:
| Does template_plural actually work well / offer any benefits?
| bigmattystyles wrote:
| I've been curious about the opposite - a lot of times, I'll put in
| a few keywords that get to the point of what I want, even though
| it's incoherent English - and yet, often the output is on point.
| Suppafly wrote:
| I know natural language is sorta the gold standard for a lot of
| these models, but honestly I could see a lot of utility out of
| a stripped-down language set, similar to how you used to be
| able to search Google back in the day before they tried to make
| it easier.
| niemandhier wrote:
| This question is also quite possibly the most promising way to
| get an upper bound on the Kolmogorov complexity of human
| language.
| lenerdenator wrote:
| "Coherent" seems relatively subjective, no?
|
| Could you get an LLM to generate "coherent" conversational
| Geordie English? Probably, but my Midwestern ear isn't going to
| be able to understand what they're saying.
| osaariki wrote:
| For some interesting context: this paper was a precursor to all
| the work on synthetic data at Microsoft Research that led to the
| Phi series of SLMs. [1] It was an important demonstration of what
| carefully curated and clean data could do for language models.
|
| 1: https://arxiv.org/abs/2412.08905
| HarHarVeryFunny wrote:
| I'd guess that the ability of a very small model to do well on
| the TinyStories dataset isn't just because of the limited 3-4yr
| old vocabulary, but also because of it being an LLM-generated
| dataset.
|
| LLM-generated content (synthetic data) is easier for an LLM to
| learn than human-generated text, because it was auto-regressively
| generated and therefore should be possible to auto-regressively
| predict.
|
| It's surprising that LLMs do as well as they do attempting to
| predict human generated training samples where there is no
| guarantee that the predictive signal is actually contained in the
| sample (it may just be something in the mind of the human that
| generated it).
|
| I've got to wonder what the impact on generation is of an LLM
| only trained on synthetic LLM-generated data? I'd guess it
| wouldn't be as robust as one that had learned to handle more
| uncertainty.
| kouteiheika wrote:
| > I'd guess that the ability of a very small model to do well
| on the TinyStories dataset isn't just because of the limited
| 3-4yr old vocabulary, but also because of it being an LLM-
| generated dataset.
|
| Your guess is correct. The level of vocabulary has little to do
| with it. There was a paper about this a while back (sorry,
| can't find the link) where they found that the model still
| learned just as well when they increased the complexity of the
| text, as long as the texts were LLM generated.
| raymv wrote:
| Trained a GPT-2 like model on the dataset a while back, here's
| the source code and some results for anyone interested:
|
| https://github.com/raymond-van/gpt-tinystories
| mclau156 wrote:
| Side note, but is it really that crazy for GitHub to implement a
| feature to see the file size of a repo?
| ankit219 wrote:
| Great to see this here. We used this dataset from Tiny Stories to
| train small models (as small as 20M params) and test out
| knowledge addition. Published a paper based on this dataset. We
| could get coherent outputs at sizes as low as 20M-25M (not as
| great as LLMs, but still decent enough).
|
| [1]: Blog + Paper: https://medium.com/@ankit_94177/expanding-
| knowledge-in-large... (Paper is titled: Cross-Domain Content
| Generation with Domain-Specific Small Language Models)
| lutusp wrote:
| Decades ago, prior to the existence of personal computers, when a
| "computer" was a glassed-in room staffed by lab-coat-wearing
| technicians (picture John Von Neumann standing next to the first
| stored-program computer:
| https://www.theguardian.com/technology/2012/feb/26/first-com...),
| someone reduced an entire printed book (or more than one) to a
| word-token decision tree, at great cost and effort, just to see
| what would happen.
|
| I can't find the original paper, but with an appropriate amount
| of pseudorandomness to avoid dead ends, this primitive algorithm
| would generate the occasional sentence that almost made sense and
| that bore little resemblance to the original data.
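|
| In modern terms it was essentially a word-level Markov chain.
| A minimal sketch of the idea (not the original program, which
| I can no longer find):
|
|   import random
|   from collections import defaultdict
|
|   def build_chain(text, order=2):
|       words = text.split()
|       chain = defaultdict(list)
|       for i in range(len(words) - order):
|           chain[tuple(words[i:i + order])].append(words[i + order])
|       return chain
|
|   def generate(chain, length=40):
|       state = random.choice(list(chain.keys()))
|       out = list(state)
|       for _ in range(length):
|           followers = chain.get(tuple(out[-len(state):]))
|           if not followers:  # dead end: re-seed at random
|               followers = chain[random.choice(list(chain.keys()))]
|           out.append(random.choice(followers))
|       return " ".join(out)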
|
| Because of the state of computer technology it was a massive
| effort and a source of general astonishment. I suspect we're now
| recreating that minimal environment, this time with better ways
| to curate the data for small size and maximum drama.
|
| Let's remember that a modern GPT isn't far removed from that
| scheme -- not really.
| wrycoder wrote:
| They didn't wear lab coats. That's for chemists and biologists.
| Animats wrote:
| (2023), as someone mentioned.
|
| It's encouraging to see how much can be done with tiny models.
|
| Still need to crack "I don't know" recognition, so you can start
| with a tiny model and then pass the buck to a bigger model for
| hard questions. That will enormously reduce the cost of "AI"
| customer support.
| alister wrote:
| > _short stories generated by GPT-3.5 and GPT-4 to train LMs that
| are smaller_
|
| The loop of development is fascinating:
|
| Millions of humans write literature, Wikipedia, etc.
|
| Large language models are trained on that body of work.
|
| Now large language models generate training data for small
| language models.
|
| What's the next iteration? A talking Buzz Lightyear toy with one
| of those small language models that'll teach (human) infants to
| talk?
| ocean_moist wrote:
| This is actually a common pattern called "model distilling".[0]
|
| [0] https://platform.openai.com/docs/guides/distillation
| visarga wrote:
| > A talking Buzz Lightyear toy with one of those small language
| models that'll teach (human) infants to talk?
|
| Great idea. I was thinking more like a plushie toy with
| sensors, it would react to touch, sight and speech. I would run
| the models locally from a computer, keeping the toy as just
| lightweight I/O.
| nickpsecurity wrote:
| I thought that, too. It wasn't really true, though.
|
| Some papers pointed out that the models start failing after
| being trained with too much synthetic data. They also need tons
| of random, Internet data in the first place. Humans don't have
| those failure modes. The AIs also got smarter the more data we
| produced.
|
| So, there's some critical differences between what we're doing
| and what they're doing that keep it from being a neat flow like
| that. What many humans do in training other humans fits that,
| though.
| ted_dunning wrote:
| This is good (though old) work. One interesting aspect is that
| even though they tout their model's ability to produce grammatical
| text, their abstract has a number of grammatical errors.
|
| On a dark night in a melancholic mood, that might seem to bear on
| the question of what it is to be truly human and the bar that we
| set
| for (linguistically) anthropomorphic automatons like LLMs.
| ted_dunning wrote:
| This idea of machine-curated or machine-generated training data
| goes back thirty years or so under different names, at least in
| analogy.
|
| A classic thing to do with ancient fraud detection models was
|
| a) train a large/complex model on a reasonably accurate dataset
|
| b) select training examples from the original training data and
| other examples that the original model gets right.
|
| c) train a simpler model on the filtered dataset
|
| This worked very well in the simpler world of simple classifiers
| particularly when the original training data had errors in it.
| Trying to fit these errors made the first model more complex (and
| still it often failed). The subset of training examples excluded
| many or even most of the erroneous examples and also excluded
| subtle cases that were impossible to learn within the complexity
| bounds of the time.
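|
| A sketch of that a/b/c recipe with scikit-learn stand-ins
| (any complex/simple model pair would do; X and y are assumed
| to be numpy arrays):
|
|   from sklearn.ensemble import RandomForestClassifier
|   from sklearn.linear_model import LogisticRegression
|
|   def distill_dataset(X, y):
|       # (a) fit a large/complex model on the noisy data
|       big = RandomForestClassifier(n_estimators=200).fit(X, y)
|       # (b) keep only the examples the complex model gets right
|       keep = big.predict(X) == y
|       # (c) fit a simpler model on the filtered subset
|       small = LogisticRegression().fit(X[keep], y[keep])
|       return small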
| make3 wrote:
| Try this one instead :)
| https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct Much
| better, a modern attempt at a small model
| bazmattaz wrote:
| I honestly think a fantastic use case for a tiny model is better
| spell checking on a phone.
|
| I can't believe it's 2025 and spell checkers are still quite dumb
___________________________________________________________________
(page generated 2025-01-03 23:01 UTC)