[HN Gopher] No "Zero-Shot" Without Exponential Data
       ___________________________________________________________________
        
       No "Zero-Shot" Without Exponential Data
        
       Author : zerojames
       Score  : 148 points
       Date   : 2024-05-09 13:08 UTC (9 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | cs702 wrote:
       | This deserves to be on the front page.
       | 
       | The authors ask whether image-to-text and text-to-image models
       | (like CLIP and Stable Diffusion) are truly capable of zero-shot
       | generalization.
       | 
       | To answer the question, the authors compile a list of 4000+
        | concepts (see the paper for how the list is compiled), and test
        | how well 34 different models classify or generate those concepts
        | at different scales of pretraining, from
       | ~3M to ~400M samples.
       | 
       | They find that model performance on each concept scales
        | _linearly_ as the concept's frequency in pretraining data grows
        | _exponentially_, i.e., the rarer the concept, the less likely it
        | is to be properly learned -- which implies there is no "zero-
       | shot" generalization.
       | 
       | The authors also release a long-tail test dataset that they
       | cleverly name the "Let it Wag!" benchmark to allow other
       | researchers to see for themselves how current models perform on
       | the long tail of increasingly rare concepts.
       | 
       | Go read the whole thing, or at least the introduction. It's
       | clear, concise, and well-written.
        
         | treyd wrote:
         | But is this some abstract truth of statistics or is this just a
         | property of how these types of models work?
        
         | kolinko wrote:
          | Worth noting that this paper is about CLIP only, which is way
          | simpler than LLM architectures (if I'm not mistaken).
         | 
          | Still, it's an interesting approach, and it kind of confirms
          | most people's experience that CLIP models can recognize known
          | concepts but struggle with novel ones.
        
         | bilsbie wrote:
         | How do we know humans don't do the same thing?
        
       | tiborsaas wrote:
        | It was also mentioned in a just-uploaded Computerphile video:
       | https://www.youtube.com/watch?v=dDUC-LqVrPU
        
       | stephc_int13 wrote:
       | Quite a few people saw this coming.
       | 
        | It is still too early to tell whether we have reached another AI
        | winter, but at least we can see that the news is slowing down.
        
         | cs702 wrote:
         | If there are no breakthroughs and funding dries up in the near
         | future, it will feel to many like _going off a precipice at
         | high speed_.
         | 
         | Only the poor souls who survive the fall, at the very bottom of
         | the precipice, will get to experience the AI winter.
        
         | pas wrote:
          | RAG seems to be all the rage. Not to mention the quest for
          | cooking up the correct cocktail of smaller MoE/ensemble models,
          | and ... there's decades' worth of optimization work ahead (a
          | few years of it already seems to be funded by VCs and academic
          | grants), no?
        
           | cs702 wrote:
           | I think the grandparent comment is about AI research, driven
           | by the quest for AGI.
           | 
           | Incremental improvement of proven approaches, driven by
           | profit motive, will surely continue regardless of whether
           | there is an AI winter or not.
        
           | nkozyra wrote:
            | RAG takes the current limits of LLMs and focuses them on
            | specific problems using custom data. It's not exactly magic;
            | it just finds a way to produce something tangible and usable
            | from an otherwise broadly focused model.
            | 
            | It's all the rage because it's a way to get practical _work_
            | out of LLMs, which otherwise mostly provide wow from
            | conversationally fluent, often factually accurate responses.
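            | 
            | For anyone unfamiliar with the retrieval step, a rough sketch
            | (assuming sentence-transformers is installed; the model name,
            | documents, and prompt format are all placeholders):
            | 
            |   import numpy as np
            |   from sentence_transformers import SentenceTransformer
            |   
            |   # Hypothetical in-house documents the base LLM never saw.
            |   docs = [
            |       "Refunds over $500 require manager approval.",
            |       "The warehouse ships orders Monday through Thursday.",
            |       "Tickets are escalated after 48 hours without reply.",
            |   ]
            |   
            |   embedder = SentenceTransformer("all-MiniLM-L6-v2")
            |   doc_vecs = embedder.encode(docs, normalize_embeddings=True)
            |   
            |   def retrieve(question, k=2):
            |       """Return the k documents most similar to the question."""
            |       q = embedder.encode([question],
            |                           normalize_embeddings=True)[0]
            |       scores = doc_vecs @ q  # cosine similarity
            |       return [docs[i] for i in np.argsort(scores)[::-1][:k]]
            |   
            |   question = "When do orders leave the warehouse?"
            |   context = "\n".join(retrieve(question))
            |   prompt = (f"Answer using only this context:\n{context}\n\n"
            |             f"Q: {question}\nA:")
            |   # `prompt` is then sent to whichever LLM you are using.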
        
           | PheonixPharts wrote:
            | There's _tons_ of optimization work left, but by its nature
           | optimization tends to push the limits of what we currently
           | have, and rarely allows for substantial improvements.
           | 
            | There are some _major_ limitations to LLMs that aren't going
            | to be "optimized" away. At the end of the day, LLMs are Monte
            | Carlo samplers over a latent, compressed representation of
            | existing human text data. Many of the tasks people hope LLMs
           | will achieve require major leaps in our current understanding
           | of language modeling.
           | 
            | One great example of a limitation (which shocks me sometimes
            | when I think about it): generating output is still ultimately
            | stuck looking at the probability of the next token rather
            | than the much more useful probability of the whole generated
            | statement. There are techniques to improve this, but we're
            | missing a _major_ piece for generating highly probable
            | statements, with no hint of how to really get there. Consider
            | how you might write SQL. You conceive of the high-level query
            | first and start sketching out the pieces. An LLM can only
            | look at each token and can't, statistically speaking, think
            | in terms of the entire query.
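            | 
            | A hedged illustration of that gap (GPT-2 via Hugging Face
            | transformers is used purely as an example scorer, and the SQL
            | candidates are made up): you can score whole statements after
            | the fact by summing per-token log-probabilities, but
            | generation itself still proceeds one token at a time.
            | 
            |   import torch
            |   from transformers import GPT2LMHeadModel, GPT2TokenizerFast
            |   
            |   tok = GPT2TokenizerFast.from_pretrained("gpt2")
            |   model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
            |   
            |   def sequence_logprob(text):
            |       """Total log-probability of the whole statement."""
            |       ids = tok(text, return_tensors="pt").input_ids
            |       with torch.no_grad():
            |           out = model(ids, labels=ids)
            |       # out.loss is the mean NLL per predicted token.
            |       return -out.loss.item() * (ids.shape[1] - 1)
            |   
            |   candidates = [
            |       "SELECT name FROM users WHERE age > 30;",
            |       "SELECT name WHERE FROM users age > 30;",
            |   ]
            |   scores = {c: sequence_logprob(c) for c in candidates}
            |   for c, s in sorted(scores.items(), key=lambda kv: -kv[1]):
            |       print(f"{s:8.2f}  {c}")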
           | 
           | Personally I think LLMs are very underutilized/exploited for
           | what they are good at, and there is way too much focus on
            | what they can't do. Hopefully we'll dodge an AI winter by
            | using LLMs to solve the wide range of classical NLP problems,
            | making many tasks that were nearly impossible just a few
            | years ago rather simple today. Unfortunately the irrational
           | hype around these models makes me skeptical of that scenario.
        
             | mjburgess wrote:
              | The issue with predicting the next statement is the
              | combinatorial explosion, which is impractical to model with
              | historical frequencies,
              | 
              | i.e., P(mat|the cat sat on the) is a distribution over,
              | say, 100k words, whereas P(the cat sat|on the mat) is a
              | distribution over 100k^3 possible word sequences.
              | 
              | Part of the illusion of an LLM is that we produce text in
              | such a highly regular way that a mere distribution over
              | 100k, iterated say 500 times, gives you a page of text as
              | if it were modelling 100k^500.
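              | 
              | A back-of-the-envelope version of that point (the
              | vocabulary size and page length are illustrative):
              | 
              |   from math import log10
              |   
              |   V = 100_000  # vocabulary size (illustrative)
              |   k = 500      # tokens in roughly a page of text
              |   
              |   # Modelling a page jointly: a distribution over V**k
              |   # outcomes, i.e. about 10^2500 of them.
              |   print(f"joint outcomes ~ 10^{k * log10(V):.0f}")
              |   
              |   # The chain rule lets an autoregressive model factor it:
              |   #   P(w1..wk) = P(w1) * P(w2|w1) * ... * P(wk|w1..wk-1)
              |   # so it only ever handles a V-way distribution, k times.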
        
         | loudmax wrote:
         | I think that depends what your expectations are, and what you
         | mean by another AI winter.
         | 
         | We're just scratching the surface of what's possible with the
         | current state of the art. Even if there are no major advances
         | or breakthroughs in the near future, LLMs and associated
          | technologies are already useful in many use cases, or close
          | enough to useful that engineering rather than science will be
         | sufficient to overcome many (though not all) of the
         | shortcomings of current AI models. Never mind grandiose claims
         | about AGI, there's enough utility to be gotten out of the
         | limited LLMs that we have today to keep engineers and
         | entrepreneurs busy for years to come.
        
           | swatcoder wrote:
           | You're right, and I suspect the GP would agree with you about
            | there being real engineering applications for LLMs, diffusion
            | models, etc.
           | 
           | But I think the term "AI Winter" usually refers to the
           | underlying research programme and the economics around it.
           | Soaking up _many_ billions of dollars of industry money and
           | public grants on the pitch that AGI might be just around the
           | corner, and then being unable to deliver on that pitch can
           | induce a hangover effect that makes it much harder to raise
           | money for anything that smells even a little bit like the
           | failed pitch. Investors and administrators feel burnt and
           | turn very skeptical for a very long time.
           | 
           | Meanwhile, the actual productive applications which shook out
           | of the initial boom just get renamed to something else so
           | that they don't carry that smell.
           | 
           | We'll see how it goes here, but that's the familiar road and
           | where the terminology comes from.
        
             | ijustlovemath wrote:
              | The minor irony of this is that current efforts at AGI are
              | focused on the _data scaling laws_, which have shown no
              | signs of slowing, but if funding dries up these crazy
              | expensive training runs won't be allowed anymore.
        
           | nicklecompte wrote:
           | The whole point of past AI winters is that genuinely useful
           | technologies were preposterously overhyped as representing
           | "human cognition," despite overwhelming evidence to the
           | contrary. The bubble bursts - and the money evaporates -
           | because expectations come crashing down, not because the
           | usefulness was a mirage. Lisp was hardly the key to human
           | symbolic thought like some hoped it would be, but it helped
           | improve programming languages across the board. Perceptrons
           | really are quite limited and there's absolutely no way you
           | could emulate a human brain with them alone, but they're
           | still essential for more advanced ML.
           | 
           | But you would think after 70 years AI practitioners would
           | learn some humility! It is very obvious that GPT-4 is dumber
           | than a honeybee, let alone a cat, let alone a crow. But for
           | over a year I've heard dozens of tech folks insist it's
           | smarter than most people.
        
           | marcosdumay wrote:
           | "AI winter" was a political phenomenon that happened because
           | the AI applications didn't hold water to the hype, and all of
           | the gullible people that invested on them got burned out by
           | the disparity.
           | 
           | We are very clearly on that same path again. What leads to
           | the conclusion that another winter is coming. But even the
           | fact that people are talking about it is evidence it's not
           | here yet, and as always with political phenomena, there's no
           | guarantee history will repeat.
           | 
           | Anyway, none of it means people will stop applying their
           | knowledge or studying AI. The entire thing happens on funding
           | and PR, and the world is not entirely controlled by those
           | two.
        
             | dr_dshiv wrote:
             | Except contemporary AI is useful on a day to day basis.
        
               | xboxnolifes wrote:
               | Being useful doesn't mean it met expectations.
        
         | squigz wrote:
         | > reached AI winter again
         | 
         | Again?
        
           | pocketarc wrote:
            | It's a term for when AI hype dries up; it has happened
            | multiple times since the 60s[0].
           | 
           | [0]: https://en.wikipedia.org/wiki/AI_winter
        
           | swatcoder wrote:
           | This was not the technology industry's first encounter with
           | AI hype. The term was coined 40 years ago, and has been
           | suggested as a description for almost a dozen periods in the
           | field's history:
           | 
           | https://en.wikipedia.org/wiki/AI_winter
        
         | msp26 wrote:
         | I hope this means that people actually start curating their
         | training sets. The quality control is horrible on all of the
         | datasets I've looked through. Especially image captions.
         | 
          | Yes, sheer quantity has a quality all of its own, but that
          | won't produce optimal results.
        
           | RodgerTheGreat wrote:
           | From a very cynical marketing/VC-pitch perspective, un-
           | curated datasets full of random crap have the benefit of
           | sometimes producing totally surprising, unplanned "features",
           | which helps sell the idea that one-size-fits-all black-box
           | machine learning can solve any problem.
        
         | godelski wrote:
         | I think we're in the beginning of a warm winter.
         | 
         | The funding is there but we're letting hype drive everything
         | and not calling out the con artists.
         | 
         | The problem is we've had recent great success but still don't
         | know how to get to AGI. But because we're afraid of winter
          | we're not willing to try new things. We only want to compare to
          | SOTA and think it's fair to compare a new method with a handful
         | of papers to the status quo. That's not how the S-curves of
         | technology work. Sure, maybe things don't scale but that
         | doesn't mean they don't have merit or can't scale if someone
         | finds some modification.
         | 
         | The problem is we treat research like products, not academic
          | work. You need to produce everything from hard-core theory to
          | robust products to have an effective chain. But we seem
          | hyperfocused on the middle area. And for some reason people
          | think a product can just be a nice interface placed around
          | research code. There's still a lot of work you need to do, and
          | all those models can be optimized. They absolutely do not have
          | optimal hyperparameters or even parameters.
        
         | darepublic wrote:
          | Dunno exactly what you mean by AI winter, but the whole gen AI
          | thing has kind of gotten all the attention, and even if that
          | route slows down there are a ton of other branches fruitful
          | for development.
        
           | stephc_int13 wrote:
           | The issue, I think, is that if results don't follow
           | expectations, given the very high costs, investors might get
           | cold feet.
           | 
           | Sure we already have some interesting applications, but they
           | are not exactly printing money.
        
         | CuriouslyC wrote:
         | Definitely not an AI winter, since the tools are so useful.
         | People who think AGI is right around the corner might be
         | disappointed though, since we can't really even accurately
         | define AGI.
        
         | trgn wrote:
         | Commercial adoption has only started to ramp up. We are going
         | to see a floodwave of LLM-powered bots/UX-wizards.
         | 
         | Even if academic progress stalls, we're going to be inundated
         | for at least another few years.
        
         | refulgentis wrote:
         | Uh, what?
        
         | ertgbnm wrote:
         | I'm not even going to discuss whether or not things are
         | actually slowing down since I disagree with that.
         | 
         | But I do feel confident that an AI winter is not on the horizon
         | solely due to the overhead of implementation that we currently
         | have. Just with currently existing AI, it would take years for
         | the economy to fully leverage the abilities that are available.
         | I'm confident we have transformative AI. So it won't feel like
         | a winter for several years as we actually succeed in
         | optimizing, implementing, and productizing current technology
         | in other industries.
        
       | hprotagonist wrote:
       | My long-standing observation has been that while nature may abhor
       | a vacuum, she also really, really loves sigmoids.
       | 
       | That performance vs. training data is not linear, but
       | logarithmic, doesn't exactly come as a surprise.
        
         | bilsbie wrote:
         | Every exponential is really an s curve?
        
           | hprotagonist wrote:
           | in physical systems, it's very often the case!
        
             | mr_mitm wrote:
             | Pretty much always. One exception might be the expansion of
             | the universe
        
           | jskherman wrote:
           | Symmetries in Nature strike again! It's just like in
           | Noether's theorem.
        
         | ixaxaar wrote:
          | The physical world has limits; that's why there are sigmoids
          | everywhere.
        
         | CuriouslyC wrote:
          | The unanswered question: is the logarithmic performance
          | improvement the result of better sampling of the underlying
         | distribution over time, or related to just doing more training
         | with slight variations to effectively regularize the model so
         | it generalizes better? If it's the former, that indicates that
         | we could achieve small models that are every bit as smart as
         | large ones in limited domains, and if that's the case, it
         | radically changes the landscape of what an optimal model
         | architecture is.
         | 
         | I suspect from the success of Phi3 that it is in fact the
         | former.
        
       | bearjaws wrote:
       | Computerphile just did a video on this and it's a pretty good
       | summary: https://www.youtube.com/watch?v=dDUC-LqVrPU
        
       | RhysU wrote:
       | Can one upweight the known-rare concepts in the training set? But
       | then which are rare and how should we identify the rare ones
       | worth knowing?
       | 
       | ML is funny. It's predictably going to run into the Education
       | field.
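        | 
        | For the upweighting half of the question, a sketch of one common
        | approach using PyTorch's WeightedRandomSampler (the counts and
        | concept ids are placeholders, and it dodges the harder question
        | of which rare concepts are worth knowing):
        | 
        |   import torch
        |   from torch.utils.data import WeightedRandomSampler
        |   
        |   # Hypothetical per-concept counts in the training corpus.
        |   concept_counts = torch.tensor(
        |       [50_000.0, 12_000.0, 300.0, 45.0, 7.0])
        |   # Concept id attached to each training sample.
        |   sample_concepts = torch.tensor([0, 0, 1, 2, 3, 4, 4])
        |   
        |   # Inverse-frequency weights: rare concepts get drawn more often.
        |   weights = 1.0 / concept_counts[sample_concepts]
        |   sampler = WeightedRandomSampler(weights,
        |                                   num_samples=len(weights),
        |                                   replacement=True)
        |   # Pass sampler=sampler to a DataLoader to oversample the tail.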
        
       | six_four_eight wrote:
       | I wonder, is this exponential relation specific to multi-modal
       | models? From my admittedly naive view it seems to make sense that
       | "...what is rare is not properly learned" would apply generally?
        
       | crote wrote:
       | This feels like the worst possible outcome of the current AI
       | hype.
       | 
       | We've essentially been ripping off the _entire internet_ and
       | feeding it to the models already, spending many billions of
        | dollars in the process. It's pretty much the largest possible
        | dataset you can currently get, and due to the ever-increasing and
        | now rapidly accelerating AI poisoning of the internet, most
        | likely the largest possible dataset that will _ever_ exist.
       | 
       | All that and all we're getting out of it is not-entirely-useless
       | but still quite crappy AI? We would've been better off if we had
       | never done this.
        
         | some_random wrote:
         | Quite crappy? It seems to me like the current SOTA is working
         | plenty good enough for most use cases and like the highest
         | impact (practical) way to improve right now is going to be
         | advancements in domain knowledge acquisition and retention.
        
           | crote wrote:
           | I sure haven't seen any of those "plenty good" results yet -
           | current SOTA seems to be about as useful as semi-coherently
           | gluing together random Google results. Good enough to perhaps
           | replace a minimum-wage worker, not good enough to provide
           | actual value when you care about the quality of the result.
           | 
           | This paper seems to suggest that significant advancement in
           | domain knowledge acquisition and retention is _exactly_ the
           | problem, as you seem to need exponentially more data due to a
            | lack of generalization. What's the point of a model which
           | can perfectly quote Shakespeare if you're a programmer trying
           | to refactor a proprietary codebase and it fails to make a
           | link to whatever garbage it picked up from StackOverflow?
        
             | Filligree wrote:
             | In CLIP.
             | 
             | Replacing CLIP with an LLM is the current meta in image
             | generation models, specifically because of that lack of
             | generalisation. This isn't a surprise to anyone.
        
         | greenavocado wrote:
         | I'm getting a heck of a lot of useful work out of these so-
         | called useless models
        
         | panarky wrote:
         | _> not-entirely-useless but still quite crappy_
         | 
         | 15 months ago, general-purpose LLMs that have not been
         | specifically trained on legal reasoning could score better than
         | 90% of humans on the multistate bar exam, and these are humans
         | who actually completed law school.
         | 
         | General-purpose LLMs get similar results in medicine, and when
         | the models are fine-tuned for medical diagnosis they're even
         | better.
         | 
         | And that was more than a year ago. Those who have seen current
         | models not yet released tell us they'll make the current state
         | of the art look like toys.
         | 
         | Progress is still tracking the steep part of the S-curve, and
         | there's no indication that they're near the top yet.
        
           | flimsypremise wrote:
           | It's not super surprising that LLMs perform well on
           | standardized tests, given that they have a lot of
           | standardized test related text in their training data. There
           | are a lot of claims out there about the zero-shot ability of
           | LLMs, and very little specific research to back it up. Until
           | now that is.
        
             | MacsHeadroom wrote:
             | This paper is about CLIP not LLMs and does not generalize
             | to LLM architectures.
        
           | crote wrote:
           | > Progress is still tracking the steep part of the S-curve,
           | and there's no indication that they're near the top yet.
           | 
           | If I understand it correctly, that seems to be _exactly_ what
           | this paper is suggesting.
           | 
           | Scoring high on the bar exam is pretty trivial for AI - the
           | data needed for that is fairly generic and widely available
           | on the internet. It requires you to demonstrate a relatively
           | basic understanding of the concepts by answering a bunch of
           | multiple-choice answers. If anything, I'd expect AI to have a
           | perfect score.
           | 
           | Like I said, such an AI is not entirely useless. You can
           | replace quite a few legal assistants with that, and I bet it
           | could be used to create first drafts or to expand a core
           | concept into a full legal argument. There is plenty of money
           | to be made there, and it's going to make an awful lot of
           | people jobless. But that's just replacing more-trivial jobs
           | with automation, it doesn't add anything novel to society.
           | 
           | On the other hand, the actual _difficult_ work involves being
           | able to come up with completely novel concepts, and being
           | able to expand upon some obscure but crucial stuff few people
            | have ever heard about. Current models simply aren't capable
            | of that, and the results achieved here with multimodal models
            | suggest that they never will be. We risk getting stuck with
           | models which can do some trivial work, but silently produce
           | complete garbage when you ask them to do anything providing
           | substantial value.
        
           | Imnimo wrote:
           | I think this says more about the benchmark than the
           | capabilities of the model. If it were the case that 90th
            | percentile performance on the bar exam meant that a model was
           | a 90th percentile lawyer, and we've had these models for 15
           | months (in fact longer), where are all the LLM lawyers? The
           | lesson here is that a test designed for humans may not be
           | equally representative of capabilities when given to an LLM.
        
         | HarHarVeryFunny wrote:
         | Even if LLMs (pre-trained transformers) turn out to be a dead
         | end as far as AGI goes, there are productivity applications for
         | them, and perhaps just as importantly interesting
         | insights/confirmations about how the mind works and directions
         | for future AGI research/architectures.
         | 
         | The use cases for LLMs will no doubt grow as hallucinations are
          | reduced, and they gain planning/reasoning ability over the
          | next couple of years. It'll be interesting to see what they
         | subjectively "feel" like with these improvements.
        
           | crote wrote:
           | Oh absolutely, LLMs are already causing a slaughter in
           | applications where quality doesn't really matter to the
           | company, like customer service. With minor improvements
           | they're going to be a serious problem for any junior
           | developer/lawyer/journalist/reviewer/artist/whatever, and if
           | they ever fix the hallucinations issue it's game-changing.
           | 
           | On the other hand, it's still a big "if" whether a general
           | hallucinations solution exists, and in the meantime we're
           | paying a pretty high price for it, as the entire internet is
           | being flooded with absolute garbage. We risk getting stuck in
           | a situation where there is no way to get _new_ senior people
           | because nobody hired them as juniors because AI is cheaper
            | and good enough, and senior people are getting less and less
            | productive because they are unable to use websites like
            | StackOverflow as reference material. That's a pretty high
            | price for a tiny gain, if you ask me.
        
       | nkozyra wrote:
       | My biggest worry is the idea of generating data as training data.
       | We're obviously already unwittingly doing this, but once someone
       | decides to augment low-volume segments of the dataset with
       | generative input, we're going to start getting some really crappy
       | feedback loops.
        
         | bilsbie wrote:
         | Do you ever have novel ideas while walking or in the shower?
         | 
         | Well, you're learning from data you generate.
        
           | nkozyra wrote:
           | > Well, you're learning from data you generate.
           | 
           | Sure. I'm producing human output from human input in a
           | generally unconstrained, limitless way.
           | 
           | This is producing approximated human output from approximated
           | human input. That second level of abstraction will be
           | constrained, ultimately, by the limits of input.
        
           | sottol wrote:
            | I don't necessarily think it is equivalent. Maybe it's
           | more akin to a high-schooler reading a single book and then
           | being asked to write several new books that hold equal weight
           | in teaching the next generation of children.
           | 
            | These new textbooks could be great at simplifying the subject
            | matter and making the material accessible, or the author may
            | never have fully understood the material, making the books
            | misleading.
            | 
            | Now imagine that over and over again; imo it's pretty likely
            | to introduce inaccuracies if you just take a naive approach.
        
           | breck wrote:
           | Well put.
           | 
           | Also, ever read your old journals? You are training on
           | generated data.
        
             | Filligree wrote:
             | Ever dream? It's the same thing.
             | 
             | Of course we still need real world data, but it seems like
             | generated data should also play a role. Humans don't weight
             | dreams equally with reality, however, and that's a
             | distinction I feel is missing here.
        
           | a_wild_dandan wrote:
           | This is correct. To give more synthetic data examples:
           | 
            | 1. Generate inverted problems, which are easier to produce
            | than solve. For instance, create an integration exercise by
            | differentiating a hairy function and reversing the steps (see
            | the sketch after this list).
           | 
           | 2. Create (simulated) environmental data.
           | 
           | 3. Use adversarial model competition, e.g. self-playing chess
           | or training an artificial image generator/detector model
           | pair.
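            | 
            | For the first example above, a minimal sketch with SymPy (the
            | "hairy" function is arbitrary):
            | 
            |   import sympy as sp
            |   
            |   x = sp.symbols("x")
            |   
            |   # Easy to write down, awkward to integrate by hand.
            |   answer = sp.exp(x) * sp.sin(x**2)
            |   
            |   # Differentiating is mechanical, so the exercise is cheap
            |   # to produce; solving it means integrating, which is the
            |   # hard direction.
            |   exercise = sp.simplify(sp.diff(answer, x))
            |   
            |   print("Integrate:", exercise)
            |   print("Answer:   ", answer, "+ C")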
           | 
           | There's evidently a commonplace myth that information quality
           | starts pristine and exclusively gets degraded by systems
           | thereafter. That's just easily, demonstrably false in myriad
           | ways. That's why it's absurd to conclude that LLM output
           | being in internet training data will cause model collapse.
           | 
           | Information-rich synthetic data can be created without
           | humans, and it works. (Check out Phi, for instance.)
        
           | vineyardlabs wrote:
           | https://arxiv.org/pdf/2305.17493
           | 
           | There's some cursory indication that in the long tail,
           | training LLMs on LLM-generated data causes model collapse.
           | Kind of like how if you photocopy a photocopy too many times
           | the document becomes unreadable.
           | 
           | This isn't really surprising though. Neural networks at large
           | are a form of lossy compression. You can't do lossy
           | compression on artifacts recovered from lossy compression too
           | many times. The losses stack.
        
           | xboxnolifes wrote:
            | Ever hear an idea from another person? You just trained one
           | model with another.
        
         | MeImCounting wrote:
         | We are already doing this. In fact for the next generation of
         | frontier models the primary electricity cost is running
          | inference to generate training data for these models. As Llama
          | 3 has shown us, the scale of the training data is more
          | important than the size of the model.
        
         | IncreasePosts wrote:
         | Shouldn't there be enough known-good training content that we
         | can use it to determine if a document is worth including in a
         | training set?
        
           | devmor wrote:
            | There aren't enough human workers to validate that at scale.
           | 
           | If you want ML to do it... well that's a bit of a catch-22.
           | How would an ML algorithm know if data is good enough to be
           | trained on unless it has already been trained on that data?
        
             | HarHarVeryFunny wrote:
              | The way humans do it is via curiosity/boredom/surprise. If
              | we can't predict something well (i.e. we're "surprised" by
              | it), then that both acts as a learning trigger to predict
              | better next time and retains our interest to explore it.
              | 
              | Eventually AGIs will need to learn by experimentation as
              | we do, but in the meantime the ability to predict a
              | potential training sample well could be used to decide
              | whether or not to add it to the training set. At the
              | moment it seems
             | the main emphasis is on a combination of multi-modality
             | (esp. video) and synthetic data where one generation of LLM
             | generates tailored training samples for the next
             | generation. I guess this synthetic data allows a more
             | selective acquisition of knowledge than just adding
             | surprising texts found in the wild.
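              | 
              | A toy version of that "keep it only if it surprises the
              | model" filter, with a character-bigram model standing in
              | for the real predictor (the corpus, smoothing constant,
              | and threshold are all arbitrary):
              | 
              |   from collections import Counter, defaultdict
              |   from math import log
              |   
              |   # Tiny stand-in for "what the model has already seen".
              |   seen = "the cat sat on the mat. the dog sat here. " * 50
              |   counts = defaultdict(Counter)
              |   for a, b in zip(seen, seen[1:]):
              |       counts[a][b] += 1
              |   
              |   def surprise(text):
              |       """Mean negative log-prob per character transition."""
              |       total, n = 0.0, 0
              |       for a, b in zip(text, text[1:]):
              |           c = counts[a]
              |           p = (c[b] + 1) / (sum(c.values()) + 27)
              |           total -= log(p)
              |           n += 1
              |       return total / max(n, 1)
              |   
              |   candidates = ["the cat sat on the mat.",
              |                 "axolotls regrow lost limbs."]
              |   THRESHOLD = 2.5  # arbitrary cutoff for illustration
              |   kept = [c for c in candidates if surprise(c) > THRESHOLD]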
        
             | IncreasePosts wrote:
              | You could use data that you know was not AI generated, like
              | the Library of Congress catalog from before 2015, or
             | highly cited research papers, things of that nature.
        
             | drdeca wrote:
             | Get a collection of data which is small enough to have
             | humans annotate as being either low quality or high
             | quality. Train a model to predict this annotation. Then on
             | a larger disjoint collection of data, use this model to
             | estimate whether the data points would be considered low
             | quality or high quality, and use this to filter it.
             | 
              | This seems doable, and I think something like it is already
              | done?
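              | 
              | A rough sketch of that pipeline (the seed texts, labels,
              | and cutoff are made up; a production version would use
              | embeddings and far more annotations):
              | 
              |   from sklearn.feature_extraction.text import TfidfVectorizer
              |   from sklearn.linear_model import LogisticRegression
              |   
              |   # Small human-annotated seed set: 1 = worth training on.
              |   seed_texts = [
              |       "A clear explanation of TCP congestion control.",
              |       "BUY CHEAP WATCHES click here best price!!!",
              |       "Peer-reviewed study on soil erosion in deltas.",
              |       "asdf asdf lorem ipsum filler filler filler",
              |   ]
              |   seed_labels = [1, 0, 1, 0]
              |   
              |   vec = TfidfVectorizer()
              |   clf = LogisticRegression().fit(
              |       vec.fit_transform(seed_texts), seed_labels)
              |   
              |   # Score a larger unlabeled pool; keep likely "good" docs.
              |   pool = ["Notes on soil chemistry and crop rotation.",
              |           "click here click here best price watches"]
              |   probs = clf.predict_proba(vec.transform(pool))[:, 1]
              |   kept = [d for d, p in zip(pool, probs) if p > 0.5]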
        
       | shenberg wrote:
        | The CLIP plot (Fig. 2) is damning; however, some of the
        | generative models show flat responses in Fig. 3 (e.g. Adobe
        | GigaGAN, DALL-E-mini). Those are technically still linear
        | relationships, but they are also exactly what we'd want: an
        | image-generation aesthetic score that doesn't care about concept
        | frequency. Maybe the issue is with the contrastive training
        | objective used in CLIP?
        
       | twobitshifter wrote:
        | The models tested seem to work as expected, since it's not a
        | retrieval model being used here. The weights are lowest on
        | wormsnake but much higher on worm and snake. The temperature of
        | a model like Stable Diffusion has to be higher than that of
        | something doing retrieval, so we would not expect it to
        | reproduce the exact worm snake image from its training data.
        
       | xcodevn wrote:
       | Of course, it will require exponential data for zero shot. The
       | keyword here is _zero shot_. If you think about it for a second,
       | this applies to humans too. We also need exponential training
       | data to do things _without_ examples.
        
         | IIAOPSW wrote:
         | When we learn the grammar of our language, the teacher does not
         | stand in front of the class and proceed to say a large corpus
          | of examples of ungrammatical sentences; only the correct ones
          | are in the training set.
         | 
         | When we learn to drive, we do not need to crash our car a
         | thousand times in a row before we start to get it.
         | 
         | When we play a new board game for the first time, we can do it
          | fairly competently (though not as well as experienced players)
         | just by reading and understanding the rules.
        
           | xcodevn wrote:
            | Please help yourself and do a quick Google search about "zero
           | shot" and "few shot" learning.
        
       | godelski wrote:
       | I've always been rather upset that it's fairly common to train on
       | things like LAION or COCO and then "zero shot" test on ImageNet.
        | Zero-shot doesn't mean a held-out set; it means disjoint classes.
       | You can't train on all the animals in the zoo with sentences and
       | then be surprised your model knows zebras. You need to train on
       | horses and test on zebras.
        
         | loandbehold wrote:
         | How would the model know what zebra was if it had never seen
         | it? Same is true for humans.
        
           | xboxnolifes wrote:
            | When I was little, zebras were described to me as black and
            | white striped horses. Without even seeing one, I'm sure
            | anyone who has seen a horse could merge those two concepts
            | to create a close-to-accurate picture of what a zebra is.
            | 
            | If AI is supposed to resemble a human mind with the ability
            | to learn, then it must be able to learn from a blanker
            | slate. You don't teach the human before it is born, and in
            | this comparison an AI is born when you finish its model and
            | set its weights using the training set. If you test it with
            | the training set, you aren't testing its ability to
            | comprehend, just to regurgitate what it was born with.
        
             | dwallin wrote:
             | If you trained an image generator, removing all instances
             | of zebras from the training set, you could ask it to output
             | images of a black and white striped horse and it would
             | likely succeed. Then you could fine tune an image
             | recognition model (also with all zebras removed from the
             | training set) on the generated image set to associate it
             | with the word zebra. If you then showed it a bunch of
             | images of actual zebras there's a really good chance it
             | would succeed.
        
             | salty_biscuits wrote:
             | You can look at a medieval bestiary to see how people
             | thought animals might look based on descriptions alone.
             | Like these lovely elephants
             | 
             | https://britishlibrary.typepad.co.uk/digitisedmanuscripts/2
             | 0...
        
               | godelski wrote:
               | Honestly, these are some of my favorite stories and I
               | think more ML people need to learn more about mythology
               | (I say this as a ML researcher btw). Because once you go
                | down this path you start to understand how "rhino" ==
                | "unicorn". You have to really think about how to explain
                | things when you're working with a limited language. Yeah,
                | we have the word "rhino" now, but how would you describe
                | one to someone who has no concept of it? Maybe a cow
                | with one big horn? Is "like a big fat tough-skinned horse
                | with a big horn coming out of its head" accurate? And
               | then apply your classic game of telephone[0]. It is also
               | how you get things like how in Chinese a giraffe is "long
               | neck deer"[1] (that doesn't work for all things in
               | Chinese and there's another game of telephone (lol, maybe
               | I was too harsh on the British in [0]) and well... you
               | can imagine things get warped like crazy).
               | 
               | There's so many rabbit holes to go down when trying to
               | understand language, vision, reasoning, and all that
               | stuff.
               | 
               | [0] (Jesus England... this is what you call this game?!)
               | https://en.wikipedia.org/wiki/Chinese_whispers
               | 
               | [1] https://translate.google.com/?sl=en&tl=zh-
               | CN&text=giraffe&op... ---->
               | https://translate.google.com/?sl=zh-
               | CN&tl=en&text=%E9%95%BF%...
        
           | godelski wrote:
           | Great question!
           | 
           | It depends on the zero-shot experiment. Let's look at two
           | simple examples
           | 
           | Example 1:
           | 
           | We train a classifier that classifies several animals (and
           | maybe other things). For example, you can use the classic
           | CIFAR-10 dataset which has labels: airplane, automobile,
           | bird, cat, deer, dog, frog, __horse__, ship, truck. The
           | reason I underlined horse is because you want your model to
           | classify the zebras as horses!
           | 
           | The reason this is useful is for measuring the ability to
           | generalize. At least in our human thinking framework we'd
           | place a zebra in that bin because it is the most similar (and
           | deer should be the most common "error"). This can help us
           | understand the network and we'll be pretty certain that the
           | network is learning the key concepts of a horse when trying
           | to classify horses rather than things like textures, colors,
           | or background elements. If it frequently picks ships your
           | network is probably focusing on textures (IIRC CIFAR has
           | ships with the Dazzle Camo[0] and that's why I threw "ship"
           | out there).
           | 
           | Example 2:
           | 
           | Let's say we train our network on __text__. In this case it
           | can get any description of a zebra that it wants. In fact,
           | you'd probably want to have a description of what it looks
           | like!
           | 
           | The what we might do is take that trained text network, and
           | attach it to a vision classifier. For simplicity, let's say
           | that was trained on CIFAR-10 again. We then tune our LM + CV
           | model so that it can match the labels of CIFAR-10 (basically
           | you're tuning to ensure the networks build a communication
           | path, otherwise it won't work). Here we end up testing our
           | model's actual understanding of the zebra concept. It again
           | should pick horse as the likely class because you've
           | presumably had in the training text some description that
           | compares zebras to horses.
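            | 
            | A concrete version of Example 2's setup, using a pretrained
            | CLIP as the text+vision pair (the zebra photo path is a
            | placeholder; the test is whether "horse" wins among the
            | CIFAR-10 prompts when "zebra" is not offered):
            | 
            |   import torch
            |   from PIL import Image
            |   from transformers import CLIPModel, CLIPProcessor
            |   
            |   name = "openai/clip-vit-base-patch32"
            |   model = CLIPModel.from_pretrained(name)
            |   proc = CLIPProcessor.from_pretrained(name)
            |   
            |   cifar10 = ["airplane", "automobile", "bird", "cat", "deer",
            |              "dog", "frog", "horse", "ship", "truck"]
            |   prompts = [f"a photo of a {c}" for c in cifar10]
            |   
            |   image = Image.open("zebra.jpg")  # placeholder path
            |   inputs = proc(text=prompts, images=image,
            |                 return_tensors="pt", padding=True)
            |   with torch.no_grad():
            |       probs = model(**inputs).logits_per_image.softmax(-1)[0]
            |   
            |   for c, p in sorted(zip(cifar10, probs.tolist()),
            |                      key=lambda t: -t[1]):
            |       print(f"{c:10s} {p:.3f}")  # "horse" should dominate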
           | 
           | -----
           | 
           | So really the framework of zero-shot (and few-shot) is a bit
            | different. We're actually more concerned with clustering, and
            | you should treat them more like clustering algorithms.
           | n-shot frameworks really come from the subfield of
           | metalearning (focusing on learning how networks learn). But
           | as you can imagine, these concepts are pretty abstract, but
           | hey, so are humans (that's why we see a log as a chair and
           | will situationally classify it as such, but let's save the
           | discussion of embodiment for another time).
           | 
           | In either example I think you can probably see how a toddler
           | could do similar tasks. You can ask which of those things the
           | zebra is most similar to and you'd be testing the toddler's
            | visual reasoning. The text one might need a slightly older
            | child, but it could be a great way to test a child's reading
           | comprehension. Does this make sense? Of course machines are
           | different and we need to be careful with these analyses
           | (which is why I rage against just comparing
           | scores/benchmarks, these mean very little), because the
           | machines may be seeing and interpreting things differently
           | than us. So really the desired outcome depends on if you're
           | testing for what the machine knows/understands (you need to
           | do way more than what we discussed above) or if you are
           | training a machine to think more similar to a human (then we
           | can rely pretty close to exactly what we discussed).
           | 
           | Hope this makes more sense.
           | 
           | [0] https://en.wikipedia.org/wiki/Dazzle_camouflage
        
         | red75prime wrote:
         | Should we also try to teach children geometry and test them on
         | calculus?
        
           | IIAOPSW wrote:
           | For centuries we only taught children geometry and one of
           | them invented calculus.
        
             | Filligree wrote:
             | Now that's setting a high bar. If AI could reliably invent
             | calculus, then I'd be briefly impressed and then terrified.
        
           | nico wrote:
           | This illustrates two ways of teaching
           | 
           | I've experienced both, each at a different university
           | 
           | In one, professors would teach one thing then ask very
           | different (and much harder) questions on tests
           | 
           | In the other, tests were more of a recap of the material up
           | to that point
           | 
           | I definitely learned a lot more in the second case and was a
           | lot more motivated. It also required more effort from the
           | professors
           | 
           | The two methods also test different things. The recap one
           | tests effort and dedication, if you do the work, you get the
           | grade
           | 
            | The difficult tests measure luck and/or creativity and
            | problem-solving under pressure. It's not about doing the
            | work; it's about either being lucky or good at testing.
        
             | Pet_Ant wrote:
             | > it's about either being lucky or good at testing
             | 
             | I think you are misunderstanding the experience.
             | 
              | The first (harder questions) tests your understanding of
              | the material and the problem. Can you apply the material
              | to solve a novel problem? Do you understand the _material_,
              | not just the mechanics? Do you understand how it
              | interrelates with other problems? Do you understand the
              | limitations?
             | 
             | The second is just regurgitation. This is great for rote
             | skills, but this isn't really learning. This is grinding
             | until you can reproduce. These are the kinds of skills that
              | are easily automated. This is not what we should be testing
              | our kids on.
        
               | godelski wrote:
               | Fwiw, I think both of you are on the same page.
               | 
               | And yes, to bring back to ML it is the difference of
                | generalization and memorization (compression). I wrote a
                | longer reply to a different response to my initial
                | comment to help clarify, because I think this chain is a
               | bit obtuse and aggressive for no reason :/ (I mean you
               | can check the Wiki page to verify what I said)
        
           | feoren wrote:
           | I think the authors are responding to a _claim_ that AI is
           | doing this: look, we taught them geometry, and now they know
            | calculus! GP is saying that it's not true zero-shot to
           | have a separate test and training set, because the classes
           | overlap. Similarly, the authors are saying "true zero-shot"
           | is basically not happening, at least not nearly to the extent
           | some are claiming.
           | 
            | So everyone here, including TFA, is kinda doubting the same
            | claim (that our AI models can perform zero-shot
            | generalization) in different ways, I think?
        
           | godelski wrote:
           | I think you're being overly obtuse, and I'd request you try
           | to work in good faith. If you doubt what I claimed, you can
           | quickly verify on the wiki page[0].
           | 
           | In essence you aren't wrong, but that's not what we'd
           | typically do in a zero (or few) shot setting. We'd be
            | focusing on things that are more similar. If you want to
            | understand this a bit better, I wrote more about what we
            | might do in an ML context here[1].
           | 
           | And I like Nico's comment about how different professors
           | test. Because it makes you think about what is actually being
           | tested. Are you being tested on memorization or
           | generalization? You can argue both these kinds of tests are
           | testing "if you learned the material" but we'd understand
           | that these two types of tests are fundamentally different and
           | let's be real, are not reasonably fair to compare scores to.
           | I'm sure many of us have experienced this where someone that
           | gets a C in professor A's class likely learned more than than
           | someone who got an A in professor B's class. The thing is
           | that the nuance is incredibly important here to really
           | understand. And you can trivialize anything, but be careful
           | when doing so, you may overlook the most important things ;)
           | 
           | Now... you could make this argument about geometry ->
           | calculus if we're not talking about the typical geometry
           | (single) class most people will have taken in either middle
           | school or high school. Because yes, at the end of the day
           | there is a geometric interpretation and we have the Riemann
           | sum. But we'd need to ensure those kids have the
           | understanding of infinities (which aren't numbers btw). We'd
           | have to be pretty careful about formulating this kind of test
           | if we're going to take away any useful information from it.
           | Though the naive version might give us clues about
           | information leakage (in our case with children this might be
           | "child who has a parent that's a mathematician" or something
           | along those lines). It really all depends on what the
           | question behind the test is. Scores only mean things when we
           | have nuanced clear understandings of what we're measuring (so
           | again, tread carefully because "here be dragons" and you're
           | likely to get burned before, or even without, knowing it)
           | 
           | And truth be told, we actually do this a bit. There's a
           | reason you take geometry before calculus. Because the skills
           | build up. But you're right that they don't generalize.
           | 
           | [0] https://en.wikipedia.org/wiki/Zero-shot_learning
           | 
           | [1] https://news.ycombinator.com/item?id=40313501
        
       | a_wild_dandan wrote:
       | Better title: Image classification models suck at identifying
       | nouns that they've rarely seen.
       | 
       | Crucial context:
       | 
        | - They're only looking at _image_ models -- _not_ LLMs, etc.
       | 
       | - Their models are tiny
       | 
       | - A "concept" here just means "a noun." The authors index images
       | via these nouns.
       | 
        | - They didn't control for the difficulty of visually
        | representing/recognizing these exceptionally infrequent,
        | long-tail "concepts."
       | 
       | If I didn't know an object's label, I too would struggle to
       | identify/draw it...
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:01 UTC)