[HN Gopher] No "Zero-Shot" Without Exponential Data
___________________________________________________________________
No "Zero-Shot" Without Exponential Data
Author : zerojames
Score : 148 points
Date : 2024-05-09 13:08 UTC (9 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| cs702 wrote:
| This deserves to be on the front page.
|
| The authors ask whether image-to-text and text-to-image models
| (like CLIP and Stable Diffusion) are truly capable of zero-shot
| generalization.
|
| To answer the question, the authors compile a list of 4000+
| concepts (see paper for details on how they compile the list of
| concepts), and test how well 34 different models classify or
| generate those concepts at different scales of pretraining, from
| ~3M to ~400M samples.
|
| They find that model performance on each concept scales
| _linearly_ as the concept's frequency in the pretraining data
| grows _exponentially_, i.e., the rarer the concept, the less
| likely it is to be properly learned -- which implies there is
| no "zero-shot" generalization.
|
| The authors also release a long-tail test dataset that they
| cleverly name the "Let it Wag!" benchmark to allow other
| researchers to see for themselves how current models perform on
| the long tail of increasingly rare concepts.
|
| Go read the whole thing, or at least the introduction. It's
| clear, concise, and well-written.
| treyd wrote:
| But is this some abstract truth of statistics or is this just a
| property of how these types of models work?
| kolinko wrote:
| Worth noting that this paper is about CLIP only, which is way
| simpler than LLM architectures (if I'm not mistaken).
|
| Still, interesting approach and kind of confirms the experience
| of most people where clip models can recognize known concepts
| but struggle with novel ones.
| bilsbie wrote:
| How do we know humans don't do the same thing?
| tiborsaas wrote:
| It was also mentioned in a just-uploaded Computerphile video:
| https://www.youtube.com/watch?v=dDUC-LqVrPU
| stephc_int13 wrote:
| Quite a few people saw this coming.
|
| It is still too early to tell whether we have reached another
| AI winter, but at least we can see that the news is slowing
| down.
| cs702 wrote:
| If there are no breakthroughs and funding dries up in the near
| future, it will feel to many like _going off a precipice at
| high speed_.
|
| Only the poor souls who survive the fall, at the very bottom of
| the precipice, will get to experience the AI winter.
| pas wrote:
| RAG seems to be all the rage. Not to mention the quest for
| cooking up the correct cocktail of smaller MoE/ensemble models,
| and ... there's decades' worth of optimization work ahead (a
| few years of it already seems to be funded by VCs and edu
| grants), no?
| cs702 wrote:
| I think the grandparent comment is about AI research, driven
| by the quest for AGI.
|
| Incremental improvement of proven approaches, driven by
| profit motive, will surely continue regardless of whether
| there is an AI winter or not.
| nkozyra wrote:
| RAG takes the current limit of LLM and focuses it on specific
| problems using custom data. It's not exactly magic, it just
| finds a way to produce something tangible and usable from an
| otherwise broadly focused model.
|
| It's the rage because it's a way to practically _do work_
| from LLMs that generally provide wow from conversationally
| accurate, often factually accurate responses.
| PheonixPharts wrote:
| There's _tons_ of optimization work left, but by its nature
| optimization tends to push the limits of what we currently
| have, and rarely allows for substantial improvements.
|
| There are some _major_ limitations to LLMs that aren't going
| to be "optimized" away. At the end of the day, LLMs are Monte
| Carlo samplers over a latent, compressed representation of
| existing human text data. Many of the tasks people hope LLMs
| will achieve require major leaps in our current understanding
| of language modeling.
|
| One great example of a limitation (which shocks me sometimes
| when I think about it): generating output is still ultimately
| stuck looking at the probability of the next token rather than
| the much more useful probability of the generated statement.
| There are techniques to improve this, but we're missing a
| _major_ piece for generating highly probable statements, with
| no hint about how to really get there. Consider how you might
| write SQL. You conceive of the high-level query first and start
| sketching out the pieces. An LLM can only look at each token
| and can't, statistically speaking, think in terms of the entire
| query.
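|
| A toy illustration of that gap, with entirely made-up numbers
| (this is not how any real LLM decodes; it just contrasts greedy
| next-token choices with the probability of the whole statement):
|
|     import itertools
|     import math
|
|     VOCAB = ["a", "b"]
|
|     def next_token_probs(prefix):
|         # Hypothetical conditional distributions.
|         if not prefix:
|             return {"a": 0.6, "b": 0.4}
|         if prefix[-1] == "a":          # after "a" the model is unsure
|             return {"a": 0.5, "b": 0.5}
|         return {"a": 0.05, "b": 0.95}  # after "b" it is nearly certain
|
|     def seq_logprob(seq):
|         # Probability of the whole statement = sum of next-token
|         # log-probabilities.
|         return sum(math.log(next_token_probs(seq[:i])[tok])
|                    for i, tok in enumerate(seq))
|
|     # Greedy decoding: take the locally best token at each step.
|     greedy = []
|     for _ in range(2):
|         probs = next_token_probs(greedy)
|         greedy.append(max(probs, key=probs.get))
|
|     # Exhaustive search: the globally most probable statement.
|     best = max(itertools.product(VOCAB, repeat=2), key=seq_logprob)
|
|     print(greedy, math.exp(seq_logprob(greedy)))    # ['a', 'a'] ~0.30
|     print(list(best), math.exp(seq_logprob(best)))  # ['b', 'b'] ~0.38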
|
| Personally I think LLMs are very underutilized/underexploited
| for what they are good at, and there is way too much focus on
| what they can't do. Hopefully we'll dodge an AI winter by using
| LLMs to solve the wide range of classical NLP problems, making
| many tasks that were nearly impossible just a few years ago
| rather simple today. Unfortunately, the irrational hype around
| these models makes me skeptical of that scenario.
| mjburgess wrote:
| The issue with predicting the next statement is the
| combinatorial infinity, which is impractical to model with
| historical frequencies.
|
| I.e., P(mat|the cat sat on the) is a distribution over, say,
| 100k words, whereas P(the cat sat|on the mat) is a distribution
| over 100k^3 three-word sequences.
|
| Part of the illusion of an LLM is that we produce text in such
| a highly regular way that a mere distribution over 100k,
| iterated say 500 times, gives you a page of text as if
| modelling 100k^500.
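|
| A back-of-the-envelope sketch of that blow-up (pure arithmetic,
| assuming a 100k-token vocabulary):
|
|     vocab = 100_000
|
|     next_token = vocab          # one next token: 1e5 outcomes
|     next_three = vocab ** 3     # a three-token continuation: 1e15
|     full_page = vocab ** 500    # a ~500-token page of text
|
|     print(f"{next_token:.0e} vs {next_three:.0e} possible continuations")
|     print(f"a full page has ~10^{len(str(full_page)) - 1} possibilities")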
| loudmax wrote:
| I think that depends what your expectations are, and what you
| mean by another AI winter.
|
| We're just scratching the surface of what's possible with the
| current state of the art. Even if there are no major advances
| or breakthroughs in the near future, LLMs and associated
| technologies are already useful in many use cases. Or close
| enough to useful that engineering rather than science will be
| sufficient to overcome many (though not all) of the
| shortcomings of current AI models. Never mind grandiose claims
| about AGI, there's enough utility to be gotten out of the
| limited LLMs that we have today to keep engineers and
| entrepreneurs busy for years to come.
| swatcoder wrote:
| You're right, and I suspect the GP would agree with you about
| there being real engineering applications for LLMs, diffusion
| models, etc.
|
| But I think the term "AI Winter" usually refers to the
| underlying research programme and the economics around it.
| Soaking up _many_ billions of dollars of industry money and
| public grants on the pitch that AGI might be just around the
| corner, and then being unable to deliver on that pitch can
| induce a hangover effect that makes it much harder to raise
| money for anything that smells even a little bit like the
| failed pitch. Investors and administrators feel burnt and
| turn very skeptical for a very long time.
|
| Meanwhile, the actual productive applications which shook out
| of the initial boom just get renamed to something else so
| that they don't carry that smell.
|
| We'll see how it goes here, but that's the familiar road and
| where the terminology comes from.
| ijustlovemath wrote:
| The minor irony of this is that current efforts at AGI are
| focused on the _data scaling laws_, which have shown no signs
| of slowing, but if funding dries up these crazy expensive
| training runs won't be possible anymore.
| nicklecompte wrote:
| The whole point of past AI winters is that genuinely useful
| technologies were preposterously overhyped as representing
| "human cognition," despite overwhelming evidence to the
| contrary. The bubble bursts - and the money evaporates -
| because expectations come crashing down, not because the
| usefulness was a mirage. Lisp was hardly the key to human
| symbolic thought like some hoped it would be, but it helped
| improve programming languages across the board. Perceptrons
| really are quite limited and there's absolutely no way you
| could emulate a human brain with them alone, but they're
| still essential for more advanced ML.
|
| But you would think after 70 years AI practitioners would
| learn some humility! It is very obvious that GPT-4 is dumber
| than a honeybee, let alone a cat, let alone a crow. But for
| over a year I've heard dozens of tech folks insist it's
| smarter than most people.
| marcosdumay wrote:
| "AI winter" was a political phenomenon that happened because
| the AI applications didn't hold water to the hype, and all of
| the gullible people that invested on them got burned out by
| the disparity.
|
| We are very clearly on that same path again. What leads to
| the conclusion that another winter is coming. But even the
| fact that people are talking about it is evidence it's not
| here yet, and as always with political phenomena, there's no
| guarantee history will repeat.
|
| Anyway, none of it means people will stop applying their
| knowledge or studying AI. The entire thing plays out in funding
| and PR, and the world is not entirely controlled by those two.
| dr_dshiv wrote:
| Except contemporary AI is useful on a day to day basis.
| xboxnolifes wrote:
| Being useful doesn't mean it met expectations.
| squigz wrote:
| > reached AI winter again
|
| Again?
| pocketarc wrote:
| It's a term for when AI hype dries up, has happened multiple
| times since the 60s[0].
|
| [0]: https://en.wikipedia.org/wiki/AI_winter
| swatcoder wrote:
| This was not the technology industry's first encounter with
| AI hype. The term was coined 40 years ago, and has been
| suggested as a description for almost a dozen periods in the
| field's history:
|
| https://en.wikipedia.org/wiki/AI_winter
| msp26 wrote:
| I hope this means that people actually start curating their
| training sets. The quality control is horrible on all of the
| datasets I've looked through. Especially image captions.
|
| Yes, sheer quantity has a quality of its own, but that won't
| produce optimal results.
| RodgerTheGreat wrote:
| From a very cynical marketing/VC-pitch perspective, un-
| curated datasets full of random crap have the benefit of
| sometimes producing totally surprising, unplanned "features",
| which helps sell the idea that one-size-fits-all black-box
| machine learning can solve any problem.
| godelski wrote:
| I think we're in the beginning of a warm winter.
|
| The funding is there but we're letting hype drive everything
| and not calling out the con artists.
|
| The problem is we've had recent great success but still don't
| know how to get to AGI. But because we're afraid of winter
| we're not willing to try new things. We only want to compare to
| SOTA and think it's fair to compare a new method backed by a
| handful of papers against the status quo. That's not how the
| S-curves of
| technology work. Sure, maybe things don't scale but that
| doesn't mean they don't have merit or can't scale if someone
| finds some modification.
|
| The problem is we treat research like products, not academic
| work. You need to produce everything from hardcore theory to
| robust products to have an effective chain. But we seem
| hyper-focused on the middle area. And for some reason people
| think products can be made by placing a nice interface around
| research code. There's still a lot of work you need to do, and
| all those models can be optimized. They absolutely do not have
| optimal hyperparameters, or even parameters.
| darepublic wrote:
| Dunno exactly what you mean by AI winter, but the whole gen AI
| thing has kind of gotten all the attention, and even if that
| route slows down there are a ton of other branches fruitful for
| development.
| stephc_int13 wrote:
| The issue, I think, is that if results don't follow
| expectations, given the very high costs, investors might get
| cold feet.
|
| Sure we already have some interesting applications, but they
| are not exactly printing money.
| CuriouslyC wrote:
| Definitely not an AI winter, since the tools are so useful.
| People who think AGI is right around the corner might be
| disappointed though, since we can't really even accurately
| define AGI.
| trgn wrote:
| Commercial adoption has only started to ramp up. We are going
| to see a floodwave of LLM-powered bots/UX-wizards.
|
| Even if academic progress stalls, we're going to be inundated
| for at least another few years.
| refulgentis wrote:
| Uh, what?
| ertgbnm wrote:
| I'm not even going to discuss whether or not things are
| actually slowing down since I disagree with that.
|
| But I do feel confident that an AI winter is not on the horizon
| solely due to the overhead of implementation that we currently
| have. Just with currently existing AI, it would take years for
| the economy to fully leverage the abilities that are available.
| I'm confident we have transformative AI. So it won't feel like
| a winter for several years as we actually succeed in
| optimizing, implementing, and productizing current technology
| in other industries.
| hprotagonist wrote:
| My long-standing observation has been that while nature may abhor
| a vacuum, she also really, really loves sigmoids.
|
| That performance vs. training data is not linear, but
| logarithmic, doesn't exactly come as a surprise.
| bilsbie wrote:
| Every exponential is really an s curve?
| hprotagonist wrote:
| in physical systems, it's very often the case!
| mr_mitm wrote:
| Pretty much always. One exception might be the expansion of
| the universe
| jskherman wrote:
| Symmetries in Nature strike again! It's just like in
| Noether's theorem.
| ixaxaar wrote:
| The physical world has limits; that's why there are sigmoids
| everywhere.
| CuriouslyC wrote:
| The unanswered question is whether the logarithmic performance
| improvement is the result of better sampling of the underlying
| distribution over time, or of just doing more training with
| slight variations that effectively regularize the model so it
| generalizes better. If it's the former, that indicates that
| we could achieve small models that are every bit as smart as
| large ones in limited domains, and if that's the case, it
| radically changes the landscape of what an optimal model
| architecture is.
|
| I suspect from the success of Phi3 that it is in fact the
| former.
| bearjaws wrote:
| Computerphile just did a video on this and it's a pretty good
| summary: https://www.youtube.com/watch?v=dDUC-LqVrPU
| RhysU wrote:
| Can one upweight the known-rare concepts in the training set? But
| then which are rare and how should we identify the rare ones
| worth knowing?
|
| ML is funny. It's predictably going to run into the Education
| field.
| six_four_eight wrote:
| I wonder, is this exponential relation specific to multi-modal
| models? From my admittedly naive view it seems to make sense that
| "...what is rare is not properly learned" would apply generally?
| crote wrote:
| This feels like the worst possible outcome of the current AI
| hype.
|
| We've essentially been ripping off the _entire internet_ and
| feeding it to the models already, spending many billions of
| dollars in the process. It's pretty much the largest possible
| dataset you can currently get, and due to the ever-increasing
| and now rapidly accelerating AI poisoning of the internet, most
| likely the largest possible dataset that will _ever_ exist.
|
| All that and all we're getting out of it is not-entirely-useless
| but still quite crappy AI? We would've been better off if we had
| never done this.
| some_random wrote:
| Quite crappy? It seems to me like the current SOTA is working
| plenty good enough for most use cases and like the highest
| impact (practical) way to improve right now is going to be
| advancements in domain knowledge acquisition and retention.
| crote wrote:
| I sure haven't seen any of those "plenty good" results yet -
| current SOTA seems to be about as useful as semi-coherently
| gluing together random Google results. Good enough to perhaps
| replace a minimum-wage worker, not good enough to provide
| actual value when you care about the quality of the result.
|
| This paper seems to suggest that significant advancement in
| domain knowledge acquisition and retention is _exactly_ the
| problem, as you seem to need exponentially more data due to a
| lack of generalization. What's the point of a model which
| can perfectly quote Shakespeare if you're a programmer trying
| to refactor a proprietary codebase and it fails to make a
| link to whatever garbage it picked up from StackOverflow?
| Filligree wrote:
| In CLIP.
|
| Replacing CLIP with an LLM is the current meta in image
| generation models, specifically because of that lack of
| generalisation. This isn't a surprise to anyone.
| greenavocado wrote:
| I'm getting a heck of a lot of useful work out of these so-
| called useless models
| panarky wrote:
| _> not-entirely-useless but still quite crappy_
|
| 15 months ago, general-purpose LLMs that had not been
| specifically trained on legal reasoning could score better than
| 90% of humans on the multistate bar exam, and these are humans
| who actually completed law school.
|
| General-purpose LLMs get similar results in medicine, and when
| the models are fine-tuned for medical diagnosis they're even
| better.
|
| And that was more than a year ago. Those who have seen current
| models not yet released tell us they'll make the current state
| of the art look like toys.
|
| Progress is still tracking the steep part of the S-curve, and
| there's no indication that they're near the top yet.
| flimsypremise wrote:
| It's not super surprising that LLMs perform well on
| standardized tests, given that they have a lot of
| standardized test related text in their training data. There
| are a lot of claims out there about the zero-shot ability of
| LLMs, and very little specific research to back it up. Until
| now that is.
| MacsHeadroom wrote:
| This paper is about CLIP not LLMs and does not generalize
| to LLM architectures.
| crote wrote:
| > Progress is still tracking the steep part of the S-curve,
| and there's no indication that they're near the top yet.
|
| If I understand it correctly, that seems to be _exactly_ what
| this paper is suggesting.
|
| Scoring high on the bar exam is pretty trivial for AI - the
| data needed for that is fairly generic and widely available
| on the internet. It requires you to demonstrate a relatively
| basic understanding of the concepts by answering a bunch of
| multiple-choice questions. If anything, I'd expect AI to have a
| perfect score.
|
| Like I said, such an AI is not entirely useless. You can
| replace quite a few legal assistants with that, and I bet it
| could be used to create first drafts or to expand a core
| concept into a full legal argument. There is plenty of money
| to be made there, and it's going to make an awful lot of
| people jobless. But that's just replacing more-trivial jobs
| with automation, it doesn't add anything novel to society.
|
| On the other hand, the actual _difficult_ work involves being
| able to come up with completely novel concepts, and being
| able to expand upon some obscure but crucial stuff few people
| have ever heard about. Current models simply aren't capable
| of that, and the results achieved here with multimodal models
| suggest that they never will be. We risk getting stuck with
| models which can do some trivial work, but silently produce
| complete garbage when you ask them to do anything providing
| substantial value.
| Imnimo wrote:
| I think this says more about the benchmark than the
| capabilities of the model. If it were the case that 90th
| percentile performance on the bar exam meant that a model was
| a 90th percentile lawyer, and we've had these models for 15
| months (in fact longer), where are all the LLM lawyers? The
| lesson here is that a test designed for humans may not be
| equally representative of capabilities when given to an LLM.
| HarHarVeryFunny wrote:
| Even if LLMs (pre-trained transformers) turn out to be a dead
| end as far as AGI goes, there are productivity applications for
| them, and perhaps just as importantly interesting
| insights/confirmations about how the mind works and directions
| for future AGI research/architectures.
|
| The use cases for LLMs will no doubt grow as hallucinations are
| reduced, and they gain planning/reasoning ability over the next
| couple of years. It'll be interesting to see what they
| subjectively "feel" like with these improvements.
| crote wrote:
| Oh absolutely, LLMs are already causing a slaughter in
| applications where quality doesn't really matter to the
| company, like customer service. With minor improvements
| they're going to be a serious problem for any junior
| developer/lawyer/journalist/reviewer/artist/whatever, and if
| they ever fix the hallucinations issue it's game-changing.
|
| On the other hand, it's still a big "if" whether a general
| hallucinations solution exists, and in the meantime we're
| paying a pretty high price for it, as the entire internet is
| being flooded with absolute garbage. We risk getting stuck in
| a situation where there is no way to get _new_ senior people
| because nobody hired them as juniors because AI is cheaper
| and good enough, and senior people are getting less and less
| productive due to them being unable to use websites like
| StackOverflow as reference material. That's a pretty high
| price for a tiny gain, if you ask me.
| nkozyra wrote:
| My biggest worry is the idea of generating data as training data.
| We're obviously already unwittingly doing this, but once someone
| decides to augment low-volume segments of the dataset with
| generative input, we're going to start getting some really crappy
| feedback loops.
| bilsbie wrote:
| Do you ever have novel ideas while walking or in the shower?
|
| Well, you're learning from data you generate.
| nkozyra wrote:
| > Well, you're learning from data you generate.
|
| Sure. I'm producing human output from human input in a
| generally unconstrained, limitless way.
|
| This is producing approximated human output from approximated
| human input. That second level of abstraction will be
| constrained, ultimately, by the limits of input.
| sottol wrote:
| I don't necessarily think it is equivalent. Maybe it's
| more akin to a high-schooler reading a single book and then
| being asked to write several new books that hold equal weight
| in teaching the next generation of children.
|
| These new textbooks could be great at simplifying the subject
| matter and making the material accessible, or their author may
| never have fully understood the material, making them
| misleading.
|
| Now imagine that over and over again; imo it's pretty likely
| to introduce inaccuracies if just taking a naive approach.
| breck wrote:
| Well put.
|
| Also, ever read your old journals? You are training on
| generated data.
| Filligree wrote:
| Ever dream? It's the same thing.
|
| Of course we still need real world data, but it seems like
| generated data should also play a role. Humans don't weight
| dreams equally with reality, however, and that's a
| distinction I feel is missing here.
| a_wild_dandan wrote:
| This is correct. To give more synthetic data examples:
|
| 1. Generate inverted problems, which are easier to produce
| than solve. For instance, create an integration math exercise
| by differentiating a hairy function and reversing the steps
| (see the sketch at the end of this comment).
|
| 2. Create (simulated) environmental data.
|
| 3. Use adversarial model competition, e.g. self-playing chess
| or training an artificial image generator/detector model
| pair.
|
| There's evidently a commonplace myth that information quality
| starts pristine and exclusively gets degraded by systems
| thereafter. That's just easily, demonstrably false in myriad
| ways. That's why it's absurd to conclude that LLM output
| being in internet training data will cause model collapse.
|
| Information-rich synthetic data can be created without
| humans, and it works. (Check out Phi, for instance.)
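|
| A minimal sketch of example 1 above using sympy: pick the
| "answer" first, then differentiate it to manufacture an
| integration exercise whose solution is known by construction.
| (Illustrative only; a real synthetic-data pipeline would add
| filtering, difficulty control, formatting, and so on.)
|
|     import sympy as sp
|
|     x = sp.symbols("x")
|
|     answer = sp.exp(sp.sin(x)) * (x**2 + 1)   # the solution we start from
|     problem = sp.diff(answer, x)              # the exercise we hand out
|
|     print("Exercise: find the antiderivative of")
|     print("   ", sp.simplify(problem))
|     print("Reference solution:", answer, "+ C")
|
|     # The (problem, solution) pair is correct by construction.
|     assert sp.diff(answer, x) - problem == 0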
| vineyardlabs wrote:
| https://arxiv.org/pdf/2305.17493
|
| There's some cursory indication that in the long tail,
| training LLMs on LLM-generated data causes model collapse.
| Kind of like how if you photocopy a photocopy too many times
| the document becomes unreadable.
|
| This isn't really surprising though. Neural networks at large
| are a form of lossy compression. You can't do lossy
| compression on artifacts recovered from lossy compression too
| many times. The losses stack.
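|
| A small numpy sketch of the "photocopy of a photocopy" effect,
| under toy assumptions: each "generation" adds a little sampling
| noise and then lossily compresses the signal by keeping only its
| lowest Fourier coefficients (a stand-in for a low-capacity model
| retrained on its own output, not a claim about real LLM training):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n, keep = 1024, 64          # signal length, coefficients kept
|
|     def lossy_generation(signal):
|         noisy = signal + rng.normal(scale=0.02, size=signal.shape)
|         coeffs = np.fft.rfft(noisy)
|         coeffs[keep:] = 0       # throw away the high-frequency detail
|         return np.fft.irfft(coeffs, n=n)
|
|     t = np.linspace(0, 1, n, endpoint=False)
|     original = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
|
|     current = original
|     for gen in range(1, 11):
|         current = lossy_generation(current)
|         rmse = np.sqrt(np.mean((current - original) ** 2))
|         print(f"generation {gen:2d}: RMSE vs original = {rmse:.3f}")
|
|     # The error tends to accumulate across generations: each lossy
|     # pass can preserve or destroy information, never recover it,
|     # so the copies drift further from the original.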
| xboxnolifes wrote:
| Even hear an idea from another person? You just trained one
| model with another.
| MeImCounting wrote:
| We are already doing this. In fact for the next generation of
| frontier models the primary electricity cost is running
| inference to generate training data for these models. As Llama
| 3 has shown us, the scale of training data is more important
| than the size of the model.
| IncreasePosts wrote:
| Shouldn't there be enough known-good training content that we
| can use it to determine if a document is worth including in a
| training set?
| devmor wrote:
| There aren't enough human workers to validate that at scale.
|
| If you want ML to do it... well that's a bit of a catch-22.
| How would an ML algorithm know if data is good enough to be
| trained on unless it has already been trained on that data?
| HarHarVeryFunny wrote:
| The way humans do it is via curiosity/boredom/surprise. If
| we can't predict something well (i.e. we're "surprised" by
| it), then that both acts as a learning trigger to predict
| better next time and retains our interest in exploring it.
|
| Eventually AGIs will need to learn by experimentation as we
| do, but in the meantime the ability to predict a potential
| training sample well could be used to decide whether to add
| it to the training set or not. At the moment it seems
| the main emphasis is on a combination of multi-modality
| (esp. video) and synthetic data where one generation of LLM
| generates tailored training samples for the next
| generation. I guess this synthetic data allows a more
| selective acquisition of knowledge than just adding
| surprising texts found in the wild.
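|
| A rough sketch of that "train on what surprises the model" idea.
| Here `model_nll` is a hypothetical placeholder for the current
| model's per-token negative log-likelihood on a candidate sample;
| the heuristic inside it is meaningless and exists only so the
| code runs:
|
|     import math
|
|     def model_nll(text):
|         # Placeholder surprise score: pretend longer words are
|         # harder to predict. A real system would use the current
|         # model's loss on `text`.
|         words = text.split()
|         return sum(math.log(1 + len(w)) for w in words) / max(len(words), 1)
|
|     candidates = [
|         "the cat sat on the mat",
|         "anomalous magnetospheric reconnection signatures were observed",
|         "the dog sat on the rug",
|     ]
|
|     # Higher NLL = less predictable = more "surprising", so keep the
|     # most surprising samples for the next round of training.
|     ranked = sorted(candidates, key=model_nll, reverse=True)
|     print(ranked[:2])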
| IncreasePosts wrote:
| You could use data where you know it was not AI generated,
| like the Library of Congress catalog from before 2015. Or
| highly cited research papers, things of that nature.
| drdeca wrote:
| Get a collection of data which is small enough to have
| humans annotate as being either low quality or high
| quality. Train a model to predict this annotation. Then on
| a larger disjoint collection of data, use this model to
| estimate whether the data points would be considered low
| quality or high quality, and use this to filter it.
|
| This seems doable, and I think something like it is
| already done?
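|
| A minimal sketch of that annotate-then-filter pipeline with
| scikit-learn. The documents, labels, and threshold are all made
| up, and a seed set this tiny produces meaningless scores; it
| only shows the shape of the approach:
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.linear_model import LogisticRegression
|
|     # Small human-annotated seed set: 1 = worth training on, 0 = junk.
|     seed_texts = [
|         "A clear explanation of how gradient descent minimizes a loss.",
|         "BUY CHEAP FOLLOWERS NOW click here best deal wow",
|         "The mitochondria produce ATP through oxidative phosphorylation.",
|         "asdf asdf lorem lorem random keyword stuffing seo seo seo",
|     ]
|     seed_labels = [1, 0, 1, 0]
|
|     vectorizer = TfidfVectorizer()
|     clf = LogisticRegression().fit(
|         vectorizer.fit_transform(seed_texts), seed_labels)
|
|     # Score a larger unlabeled pool and keep the likely-good documents.
|     pool = [
|         "Photosynthesis converts light energy into chemical energy.",
|         "win win win free free free casino casino casino",
|     ]
|     scores = clf.predict_proba(vectorizer.transform(pool))[:, 1]
|     kept = [doc for doc, p in zip(pool, scores) if p > 0.5]
|     print(kept)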
| shenberg wrote:
| The CLIP plot (Fig. 2) is damning; however, some of the
| generative models show flat responses in Fig. 3 (e.g. Adobe
| GigaGAN, DALL-E-mini). Those are on the one hand technically
| linear relationships, but they are also exactly what we'd want:
| an image-generation aesthetic score that doesn't care about
| concept frequency. Maybe the issue is with the contrastive
| training target used in CLIP?
| twobitshifter wrote:
| The models tested seem to work as expected, as it's not a
| retrieval model being used here. The weights are lowest on
| wormsnake but much higher on worm and snake. The temperature of
| the model for something like stable diffusion has to be higher
| than something doing a retrieval, so we would not expect it to
| reproduce the exact worm snake image from its training data.
| xcodevn wrote:
| Of course, it will require exponential data for zero shot. The
| keyword here is _zero shot_. If you think about it for a second,
| this applies to humans too. We also need exponential training
| data to do things _without_ examples.
| IIAOPSW wrote:
| When we learn the grammar of our language, the teacher does not
| stand in front of the class and recite a large corpus of
| ungrammatical sentences; only the correct ones are in the
| training set.
|
| When we learn to drive, we do not need to crash our car a
| thousand times in a row before we start to get it.
|
| When we play a new board game for the first time, we can do it
| fairly competently (though not as well as experienced players)
| just by reading and understanding the rules.
| xcodevn wrote:
| Please help yourself and do a quick Google search for "zero
| shot" and "few shot" learning.
| godelski wrote:
| I've always been rather upset that it's fairly common to train on
| things like LAION or COCO and then "zero shot" test on ImageNet.
| Zero shot doesn't mean a held-out set; it means disjoint classes.
| You can't train on all the animals in the zoo with sentences and
| then be surprised your model knows zebras. You need to train on
| horses and test on zebras.
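|
| A tiny sketch of the difference, with made-up file names: a
| held-out set samples unseen _images_ of the same classes, while
| a zero-shot split keeps the _classes_ themselves disjoint.
|
|     labeled_images = [
|         ("img_001.jpg", "horse"), ("img_002.jpg", "dog"),
|         ("img_003.jpg", "zebra"), ("img_004.jpg", "cat"),
|         ("img_005.jpg", "zebra"), ("img_006.jpg", "horse"),
|     ]
|
|     train_classes = {"horse", "dog", "cat"}  # seen during training
|     test_classes = {"zebra"}                 # never seen in training
|     assert train_classes.isdisjoint(test_classes)
|
|     train = [x for x in labeled_images if x[1] in train_classes]
|     test = [x for x in labeled_images if x[1] in test_classes]
|     print(len(train), "train examples,", len(test), "zero-shot test examples")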
| loandbehold wrote:
| How would the model know what a zebra was if it had never seen
| one? The same is true for humans.
| xboxnolifes wrote:
| When I was little, zebras were described to me as black and
| white striped horses. Without even seeing one, I'm sure
| anyone who has seen a horse could then merge those two
| concepts to create a close-to-accurate picture of what a
| zebra is.
|
| If AI is supposed to resemble a human mind with the ability
| to learn, then it must be able to learn from a blanker slate.
| You don't teach the human before it is born, and in this
| comparison an AI is born when you finish its model and set
| its weights using the training set. If you test it with the
| training set, you aren't testing its ability to comprehend,
| just to regurgitate what it was born with.
| dwallin wrote:
| If you trained an image generator, removing all instances
| of zebras from the training set, you could ask it to output
| images of a black and white striped horse and it would
| likely succeed. Then you could fine tune an image
| recognition model (also with all zebras removed from the
| training set) on the generated image set to associate it
| with the word zebra. If you then showed it a bunch of
| images of actual zebras there's a really good chance it
| would succeed.
| salty_biscuits wrote:
| You can look at a medieval bestiary to see how people
| thought animals might look based on descriptions alone.
| Like these lovely elephants
|
| https://britishlibrary.typepad.co.uk/digitisedmanuscripts/2
| 0...
| godelski wrote:
| Honestly, these are some of my favorite stories and I
| think more ML people need to learn more about mythology
| (I say this as a ML researcher btw). Because once you go
| down this path you start to understand how "Rino" ==
| "Unicorn". You have to really think about how to explain
| things when you're working with a limited language. Yeah,
| we have the word "rino" now, but how would you describe
| one to someone who has no concept of this? Maybe a cow
| with one big horn? Is "like a big fat tough skin horse
| with a big horn coming out of its head" accurate? And
| then apply your classic game of telephone[0]. It is also
| how you get things like how in Chinese a giraffe is "long
| neck deer"[1] (that doesn't work for all things in
| Chinese and there's another game of telephone (lol, maybe
| I was too harsh on the British in [0]) and well... you
| can imagine things get warped like crazy).
|
| There's so many rabbit holes to go down when trying to
| understand language, vision, reasoning, and all that
| stuff.
|
| [0] (Jesus England... this is what you call this game?!)
| https://en.wikipedia.org/wiki/Chinese_whispers
|
| [1] https://translate.google.com/?sl=en&tl=zh-
| CN&text=giraffe&op... ---->
| https://translate.google.com/?sl=zh-
| CN&tl=en&text=%E9%95%BF%...
| godelski wrote:
| Great question!
|
| It depends on the zero-shot experiment. Let's look at two
| simple examples
|
| Example 1:
|
| We train a classifier that classifies several animals (and
| maybe other things). For example, you can use the classic
| CIFAR-10 dataset which has labels: airplane, automobile,
| bird, cat, deer, dog, frog, __horse__, ship, truck. The
| reason I underlined horse is because you want your model to
| classify the zebras as horses!
|
| The reason this is useful is for measuring the ability to
| generalize. At least in our human thinking framework we'd
| place a zebra in that bin because it is the most similar (and
| deer should be the most common "error"). This can help us
| understand the network and we'll be pretty certain that the
| network is learning the key concepts of a horse when trying
| to classify horses rather than things like textures, colors,
| or background elements. If it frequently picks ships your
| network is probably focusing on textures (IIRC CIFAR has
| ships with the Dazzle Camo[0] and that's why I threw "ship"
| out there).
|
| Example 2:
|
| Let's say we train our network on __text__. In this case it
| can get any description of a zebra that it wants. In fact,
| you'd probably want to have a description of what it looks
| like!
|
| Then what we might do is take that trained text network and
| attach it to a vision classifier. For simplicity, let's say
| that was trained on CIFAR-10 again. We then tune our LM + CV
| model so that it can match the labels of CIFAR-10 (basically
| you're tuning to ensure the networks build a communication
| path, otherwise it won't work). Here we end up testing our
| model's actual understanding of the zebra concept. It again
| should pick horse as the likely class because you've
| presumably had in the training text some description that
| compares zebras to horses.
|
| -----
|
| So really the framework of zero-shot (and few-shot) is a bit
| different. We're actually more concerned about clustering and
| you should treat them more similar to clustering algorithms.
| n-shot frameworks really come from the subfield of
| metalearning (focusing on learning how networks learn). But
| as you can imagine, these concepts are pretty abstract, but
| hey, so are humans (that's why we see a log as a chair and
| will situationally classify it as such, but let's save the
| discussion of embodiment for another time).
|
| In either example I think you can probably see how a toddler
| could do similar tasks. You can ask which of those things the
| zebra is most similar to and you'd be testing the toddler's
| visual reasoning. The text one might need to be a little older,
| but it could be a great way to test a child's reading
| comprehension. Does this make sense? Of course machines are
| different and we need to be careful with these analyses
| (which is why I rage against just comparing
| scores/benchmarks, these mean very little), because the
| machines may be seeing and interpreting things differently
| than us. So really the desired outcome depends on if you're
| testing for what the machine knows/understands (you need to
| do way more than what we discussed above) or if you are
| training a machine to think more similar to a human (then we
| can rely pretty close to exactly what we discussed).
|
| Hope this makes more sense.
|
| [0] https://en.wikipedia.org/wiki/Dazzle_camouflage
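|
| A rough sketch of example 2, with hypothetical encoders:
| `encode_image` and `encode_text` stand in for a trained vision
| model and language model (here they just return random vectors
| so the code runs), and the image is classified by matching its
| embedding against text embeddings of the CIFAR-10 label names,
| none of which is "zebra".
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     DIM = 512
|
|     def encode_image(path):   # hypothetical image encoder
|         return rng.normal(size=DIM)
|
|     def encode_text(label):   # hypothetical text encoder
|         return rng.normal(size=DIM)
|
|     def cosine(a, b):
|         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     cifar_labels = ["airplane", "automobile", "bird", "cat", "deer",
|                     "dog", "frog", "horse", "ship", "truck"]
|
|     image_emb = encode_image("zebra_photo.jpg")
|     scores = {lbl: cosine(image_emb, encode_text(lbl))
|               for lbl in cifar_labels}
|     prediction = max(scores, key=scores.get)
|
|     # With real encoders, the hope is that `prediction` comes out as
|     # "horse" (the nearest available concept); that is exactly what
|     # this kind of zero-shot test probes.
|     print(prediction)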
| red75prime wrote:
| Should we also try to teach children geometry and test them on
| calculus?
| IIAOPSW wrote:
| For centuries we only taught children geometry and one of
| them invented calculus.
| Filligree wrote:
| Now that's setting a high bar. If AI could reliably invent
| calculus, then I'd be briefly impressed and then terrified.
| nico wrote:
| This illustrates two ways of teaching
|
| I've experienced both, each at a different university
|
| In one, professors would teach one thing then ask very
| different (and much harder) questions on tests
|
| In the other, tests were more of a recap of the material up
| to that point
|
| I definitely learned a lot more in the second case and was a
| lot more motivated. It also required more effort from the
| professors
|
| The two methods also test different things. The recap one
| tests effort and dedication, if you do the work, you get the
| grade
|
| The difficult tests measure either luck and/or creativity and
| problem solving under pressure. It's not about doing the
| work, it's about either being lucky or good at testing
| Pet_Ant wrote:
| > it's about either being lucky or good at testing
|
| I think you are misunderstanding the experience.
|
| The first (harder questions) is testing your understanding
| of the material and problem. Can you apply the material
| to solve a novel problem? Do you understand the _material_,
| not just the mechanics? Do you understand how it would
| interrelate with other problems? Do you understand the
| limitations?
|
| The second is just regurgitation. This is great for rote
| skills, but this isn't really learning. This is grinding
| until you can reproduce. These are the kinds of skills that
| are easily automated. This is not what we should be testing
| our kids on.
| godelski wrote:
| Fwiw, I think both of you are on the same page.
|
| And yes, to bring back to ML it is the difference of
| generalization and memorization (compression). I wrote a
| longer reply to another response to my initial comment
| to help clarify, because I think this chain is a
| bit obtuse and aggressive for no reason :/ (I mean you
| can check the Wiki page to verify what I said)
| feoren wrote:
| I think the authors are responding to a _claim_ that AI is
| doing this: look, we taught them geometry, and now they know
| calculus! GP is saying that it 's not a true zero-shot to
| have a separate test and training set, because the classes
| overlap. Similarly, the authors are saying "true zero-shot"
| is basically not happening, at least not nearly to the extent
| some are claiming.
|
| So everyone here, including TFA, are all kinda doubting the
| same claim (our AI models can perform zero-shot
| generalizations) in different ways, I think?
| godelski wrote:
| I think you're being overly obtuse, and I'd request you try
| to work in good faith. If you doubt what I claimed, you can
| quickly verify on the wiki page[0].
|
| In essence you aren't wrong, but that's not what we'd
| typically do in a zero (or few) shot setting. We'd be
| focusing on things that are more similar. If you want to
| understand this a bit better in what we might do in a ML
| context I wrote more here[1].
|
| And I like Nico's comment about how different professors
| test. Because it makes you think about what is actually being
| tested. Are you being tested on memorization or
| generalization? You can argue both these kinds of tests are
| testing "if you learned the material" but we'd understand
| that these two types of tests are fundamentally different and
| let's be real, are not reasonably fair to compare scores to.
| I'm sure many of us have experienced this where someone that
| gets a C in professor A's class likely learned more than
| someone who got an A in professor B's class. The thing is
| that the nuance is incredibly important here to really
| understand. And you can trivialize anything, but be careful
| when doing so, you may overlook the most important things ;)
|
| Now... you could make this argument about geometry ->
| calculus if we're not talking about the typical geometry
| (single) class most people will have taken in either middle
| school or high school. Because yes, at the end of the day
| there is a geometric interpretation and we have the Riemann
| sum. But we'd need to ensure those kids have the
| understanding of infinities (which aren't numbers btw). We'd
| have to be pretty careful about formulating this kind of test
| if we're going to take away any useful information from it.
| Though the naive version might give us clues about
| information leakage (in our case with children this might be
| "child who has a parent that's a mathematician" or something
| along those lines). It really all depends on what the
| question behind the test is. Scores only mean things when we
| have nuanced clear understandings of what we're measuring (so
| again, tread carefully because "here be dragons" and you're
| likely to get burned before, or even without, knowing it)
|
| And truth be told, we actually do this a bit. There's a
| reason you take geometry before calculus. Because the skills
| build up. But you're right that they don't generalize.
|
| [0] https://en.wikipedia.org/wiki/Zero-shot_learning
|
| [1] https://news.ycombinator.com/item?id=40313501
| a_wild_dandan wrote:
| Better title: Image classification models suck at identifying
| nouns that they've rarely seen.
|
| Crucial context:
|
| - They're only looking at _image_ models -- _not_ LLMs, etc.
|
| - Their models are tiny
|
| - A "concept" here just means "a noun." The authors index images
| via these nouns.
|
| - They didn't control for difficulty in visual
| representation/recognition of these exceptionally infrequent, long-
| tail "concepts."
|
| If I didn't know an object's label, I too would struggle to
| identify/draw it...
___________________________________________________________________
(page generated 2024-05-09 23:01 UTC)