[HN Gopher] Can generalist foundation models beat special-purpos...
___________________________________________________________________
Can generalist foundation models beat special-purpose tuning?
Author : wslh
Score : 100 points
Date : 2023-11-30 11:14 UTC (11 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| avereveard wrote:
| gpt-4 is not foundational, it's rlhf tuned. also I didn't find
| the actual gpt-4 model revision, so yeah. sloppy all around.
| cl42 wrote:
| I'm looking to run experiments like this for specific domains and
| prompting strategies as a way to showcase the next version of
| https://evals.phasellm.com. If the paper above resonates with
| you, or if you have questions about prompting techniques, feel
| free to reply below or email me at hello #at# phaseai #dot# com
|
| Happy to help!
| jerpint wrote:
| A real test would be fine-tuning GPT-4 and comparing that to
| GPT-4 with Medprompt.
| c7b wrote:
| I think the point of the study was precisely to see how well
| the original model without adaptations (my understanding is
| that they only use prompting, which does not affect weights)
| can perform. I think that question is arguably even more
| interesting than 'How well can we train a model to perform on
| med school questions?'. I'm not saying this is generalization,
| because the training set surely included a lot of medical
| literature, but if the base model without fine-tuning can
| perform well in one important domain, that's a very interesting
| data point (especially if we find the same to hold true for
| multiple domains).
| ramoz wrote:
| Caveat this with the fact that they did RAG.
| practice9 wrote:
| Yeah the Medprompt name is misleading
| owl_brawl wrote:
| The "RAG" part (over the training set) is by far the
| smallest contribution to the performance gains reported
| (see the ablation study in section 5.2). I don't think the
| model is actually learning in-context from the selected
| samples; rather, it is being conditioned to sample from the
| right part of the pre-training distribution, which works
| slightly better when the samples are on topic (vs. random).
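As a rough illustration of the exemplar-selection idea described above -- choosing on-topic few-shot examples instead of random ones -- here is a minimal sketch in Python; the hash-based embed() is only a stand-in for a real text-embedding model, and the example questions are invented:

    import numpy as np

    def embed(text, dim=256):
        # toy stand-in for a real embedding model: hashed bag of words
        vec = np.zeros(dim)
        for word in text.lower().split():
            vec[hash(word) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec

    def select_exemplars(test_question, train_questions, k=5):
        # rank training questions by cosine similarity to the test question
        q = embed(test_question)
        scores = [float(embed(t) @ q) for t in train_questions]
        ranked = sorted(range(len(train_questions)),
                        key=lambda i: scores[i], reverse=True)
        return [train_questions[i] for i in ranked[:k]]

    print(select_exemplars("chest pain and shortness of breath",
                           ["fever and cough", "chest pain on exertion",
                            "knee swelling"], k=2))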
| sungam wrote:
| We found that GPT-4 performs at a very high level on dermatology
| specialist exam questions (which it would not have been trained
| on)
|
| https://www.medrxiv.org/content/10.1101/2023.07.13.23292418v...
| 4death4 wrote:
| Why wouldn't it have been trained on such questions?
| sungam wrote:
| Not publicly available
| RandomLensman wrote:
| Abstract says the sample questions were publicly available.
| The point they are making is that the model wasn't
| specifically trained for dermatology.
| pyinstallwoes wrote:
| What makes you think that the entire body of private access
| journals and books hasn't been used?
| civilized wrote:
| Extensive medical information is publicly available on websites
| like www.nhs.uk. So even if it wasn't literally trained on exam
| questions, it may have picked up substantially identical
| information elsewhere.
|
| Which is still a significant feat. But I doubt it's doing
| anything more than the sort of semantic lookup that is already
| well-known to be within its capabilities.
| RandomLensman wrote:
| The questions were publicly available. The point is the model
| wasn't specifically trained for dermatology.
| evrydayhustling wrote:
| This is a very interesting study, but the generalization implied
| in the abstract -- that specialized models may not have
| advantages in specialized domains -- is an overreach for two
| reasons.
|
| They introduce Medprompt, which combines chain-of-thought
| reasoning, supervision, and retrieval-augmented generation to
| get better results on novel questions. This is a cool way to
| leverage supervision capacity outside of training/fine-tuning!
| But they compare this strategy, applied to GPT-4, with older
| prompting strategies applied to MedPaLM. This is apples and
| oranges -- you could easily have applied the same supervision
| strategy (possibly adapted for a smaller attention window,
| etc.) to get a closer comparison.
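A hedged sketch of the kind of pipeline the comment describes (retrieved few-shot exemplars, a chain-of-thought instruction, and a small shuffled-answer ensemble combined by majority vote). The llm callable, prompt wording, and stub model are placeholders, not the paper's actual code:

    import random
    from collections import Counter

    def answer_multiple_choice(question, choices, exemplars, llm, n_samples=5):
        votes = []
        for _ in range(n_samples):
            shuffled = random.sample(choices, len(choices))  # reorder the options
            prompt = ("\n\n".join(exemplars)
                      + f"\n\nQuestion: {question}\nOptions: {', '.join(shuffled)}\n"
                      + "Think step by step, then answer with one option verbatim.")
            votes.append(llm(prompt).strip())
        return Counter(votes).most_common(1)[0][0]  # majority vote across samples

    # toy stand-in "model" that always picks the first listed option
    stub = lambda prompt: prompt.split("Options: ")[1].split(",")[0]
    print(answer_multiple_choice("What is 2+2?", ["4", "5"],
                                 ["Q: What is 1+1? A: 2"], stub, n_samples=3))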
|
| Second, MedPaLM is a fine-tuned 500B-parameter model. GPT-4 is
| estimated to be 3-4x that size. So the comparison confounds
| emergent capabilities at scale, in addition to the impact of
| prompting, with anything you can say generally about the value
| of model specialization.
| djha-skin wrote:
| The answer to most of these generalist-versus-specialist
| questions in AI, according to Richard Sutton's "The Bitter
| Lesson"[1], is that specialized AI will win for a while, but
| generalized AI will always win in the long term because it's
| better at turning compute into results.
|
| Whether or not that will pan out with "fine tuned" versus
| "generalized" versions of the same data-eating algorithms remains
| to be seen, but I suspect the bitter lesson might still apply.
|
| 1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| nuz wrote:
| Well, the counter-argument is that with an equal amount of
| compute, specialized models will win over generalized ones. The
| reason is that generalized models have to spread their weights
| across a ton of things irrelevant to the niche subject, which a
| fine-tuned model can handle with fewer, more focused weights.
| rschiavone wrote:
| I'd argue that yours is not a counter-argument, but the same
| thing the grandparent said, rephrased. It's obvious that with
| the same amount of computing power, specialized >
| generalized. But in 2-5 years, future generalized models will
| be able to beat today's specialized ones.
| goldenkey wrote:
| The future of this is the same as what happened with PCs.
| The specificity will initially be thrown away, added back
| later as an accelerator, and eventually, brought back into
| the fold as an automatically used accelerator. It all comes
| full circle as optimizations, hinting, tiering, and
| heuristics.
|
| AI will be used to select the net to automatically load.
| Nets will be cached, branch predicted etc.
|
| AI software and hardware don't yet support the scale we need
| for this type of generalized AI processor (think CPU, but call
| it an AIPU).
|
| And no, GPUs aren't an AIPU: we can't even fit some of the
| largest models whole on these things without running them in
| pieces. They don't yet have a higher-level language, like C,
| that would compile down to more specific actions after
| optimization (PTX/LLVM/CUDA/OpenCL don't count).
| marcosdumay wrote:
| No, the GP is stating that the short term advantage will
| always exist, for fundamental reasons. It's just short term
| because it has to be reacquired often, not because there
| stops being an advantage.
|
| In 5 years, the specialized model would still beat the
| generalized ones. They would just be different from the
| ones today.
| PaulHoule wrote:
| Couldn't you apply the same prompt engineering tricks to
| the specialized model that their system beats?
| marcosdumay wrote:
| Well, I'm not sure how valid the entire thing is. It's not
| clear how much cross-pollination there is between those
| datasets and the ChatGPT training set (IMO, if I were training
| something like it, I'd take any specialized dataset I could
| find), or how useful answering those questions is to any real
| task.
|
| Also, it's not only the prompt engineering. Training ChatGPT
| was a decades-long process, with billions of dollars invested,
| and it takes a small cluster to run the thing. How would the
| other models compare if given similar resources?
|
| Besides that, in the context of this thread, the bitter lesson
| has absolutely no relation to the specialized vs. general-
| purpose model dichotomy. It's about the underlying algorithms,
| which for this article are all very similar. (And it's also
| not a known truth, and it looks less and less like an absolute
| truth as deep learning advances, so take it with a huge grain
| of salt.)
| fauigerzigerk wrote:
| That's probably true in terms of benchmark problems, but at
| the same time, losing knowledge of seemingly irrelevant
| things could make the specialised model less creative when it
| comes to looking for unexpected causes and links.
|
| It would also be less robust in the face of exceptional
| situations that should make it doubt the reliability and
| relevance of its own training.
|
| For instance, an autonomous driving system that doesn't know
| the first thing about the zombie apocalypse could put its
| passengers' lives at risk by refusing to run over
| "pedestrians".
|
| A specialised diagnostic system might not notice signs of
| domestic violence that a GP would see. Being able to connect
| the dots beyond the confines of some specialised field is
| extremely useful in many situations.
| notahacker wrote:
| > For instance, an autonomous driving system that doesn't
| know the first thing about the zombie apocalypse could put
| its passengers' lives at risk by refusing to run over
| "pedestrians".
|
| I think that argument works better the other way. I'm
| outspoken in arguing many of the edge cases in driving
| _are_ general intelligence problems we're not about to
| solve by optimising software over a few billion more miles
| in a simulator, but I don't want that intelligence so
| generalised that my car starts running down elderly
| pedestrians because there's been a lot of zombie literature
| on the Internet lately. I'm pretty confident there's more
| risk of people dying from that than a zombie apocalypse.
| ryanjshaw wrote:
| Wouldn't it become more and more difficult to improve the
| specialist model? And wouldn't the ultimate generalized
| models (AGI) eventually be able to improve themselves
| automatically? Or am I talking about something different
| here?
| anon373839 wrote:
| I think it might be stretching the Bitter Lesson a bit to
| extend it to cover fine-tuning, since the original idea was
| addressing something very different.
|
| Anyway, even if generalist models are truly better at
| everything, they'll never be faster or cheaper than their
| specialist counterparts, and that matters.
| AndrewKemendo wrote:
| I mean the whole point of the "bitter" part is that if you push
| most people that aren't convinced of the data hypothesis, they
| will talk of a belief that there's "something special" that
| happens to create intelligence that isn't just raw number
| crunching (so to speak). It's a human bias in how we view
| social hierarchy and expertise, and ignores the
| epistemic/biological perspective.
|
| It's a philosophical position as much as a technical one
| nonameiguess wrote:
| This is an empirical observation regarding implementation
| details of _machine learning_ models specifically, not math
| models in general. Clearly, in principle, specialized models
| can beat general models simply because at least some
| applications have exact models. No learned functional
| approximator will beat an ALU or even an abacus at simple
| arithmetic.
| crosen99 wrote:
| Well said, and fully agree. If you horse race two approaches,
| you can of course arbitrarily arrive at a winner based solely
| on the version of each approach you choose. You need a deeper
| look if you want to generalize.
| mistrial9 wrote:
| no - because the range of responses in the generalized model
| will tend in some asymptotic way to "patterns the machinery
| recognizes" . Real knowledge is not like a machine process, in
| every instance. Any detective novel has a moment when a faint
| scent or a distant sighting connects very different inputs to
| aid in solving a mystery. Lastly, knowledge has resonance and
| harmonics; great Truths are paradox. Your bitter lesson is
| convenient at this time of evolution with computing, not The
| Answer.
| loeber wrote:
| Good read, thanks for sharing.
| RandomLensman wrote:
| Any sufficiently large library has a lot of specialty books in
| it.
| mistrial9 wrote:
| you have got to be joking. This is not true in any real way.
|
| source: six years in the book industry in California
| upghost wrote:
| Technical foul: doing science based on a closed-source model.
| 10-yard penalty for humanity.
|
| GPT-4 _almost_ disappeared last week. Aside from the obvious AI
| bus factor and "continuity of science" concerns, I find it
| absurd that so many papers and open source ML/LLM frameworks and
| libraries exist that simply would not work at all if not for
| GPT-4. Have we simply given up?
|
| I thought this was _hacker_ news.
| Philpax wrote:
| I'm all in favour of doing reproducible work, especially
| against open-source ML (I'm the maintainer of a library for LLM
| inference), but what is the alternative here?
|
| GPT-4 is still the flagship model "of humanity"; it is still
| the best model that is publicly available. The intent of the
| paper is to determine whether the best general model can
| compete with specialized models - how do you do that without
| using the best general model?
| rafram wrote:
| Half of science comes down to poking at black boxes to see how
| they react. The universe is a closed-source model!
| ZiiS wrote:
| The other half is publishing results to fix that.
| lambda_garden wrote:
| Llamafile is (rightfully) the top of HN right now.
|
| I have high hopes that models will end up a bit like Linux
| servers, where everyone is building on open foundations.
| pixl97 wrote:
| Almost all modern science is gated by needing incredibly
| massive amounts of money.
|
| 'Closed source' science happens in every industry. While you
| may not think of what DOW is doing as science, it is, and huge
| companies like that advance through their internal research.
|
| What you're stating is something else: that maybe we should
| fund sciences such as AI more. But we did that in the past and
| it had its own fits and starts. Then transformers came out of
| Google, and private industry has been leading the way in AI
| since. If you want to keep up with the cutting edge and the
| tens of millions of dollars needed to train these models,
| you'll need to use these private models.
|
| Hackers hack stuff from private companies all the time. Hackers
| use proprietary software. There is no purity test we have to
| pass.
| PaulHoule wrote:
| I would think you could apply an inference process like that to a
| smaller model and get results better than you get with zero-shot
| on a larger model... And be able to afford to run the smaller
| model as many times as it takes.
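A back-of-envelope sketch of the trade-off being suggested -- how many samples from a cheaper model fit inside the budget of a single large-model call. The per-token prices below are made-up placeholders, not real quotes:

    LARGE_COST_PER_1K_TOKENS = 0.03   # hypothetical price for the big model
    SMALL_COST_PER_1K_TOKENS = 0.002  # hypothetical price for the small model
    TOKENS_PER_CALL_THOUSANDS = 2.0   # assumed prompt + completion length

    large_call = LARGE_COST_PER_1K_TOKENS * TOKENS_PER_CALL_THOUSANDS
    small_call = SMALL_COST_PER_1K_TOKENS * TOKENS_PER_CALL_THOUSANDS
    print(f"small-model samples per large-model call: {large_call / small_call:.0f}")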
| make3 wrote:
| Really? "GPT-4 beats our tiny model", that's your paper?
| hallqv wrote:
| Depends on your definition of winning -- special-purpose tuning
| is vastly more cost-effective, since it allows you to train a
| smaller model that can perform specific tasks as well as a
| bigger one.
|
| A good analogy is building a webapp -- would you prefer to hire
| a developer with 30+ years of experience across various CS
| domains plus a PhD, or a specialized web dev with 5 years of
| experience at a tenth of the rate?
| Xcelerate wrote:
| Not strictly related to their study, but if we consider
| Solomonoff induction to be the most general model of all (albeit
| not computable), then I'd say yes to the question, if only
| because it will select the most specialized model(s) for the
| particular problem at hand automatically.
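For reference, the algorithmic prior behind Solomonoff induction can be written as below (this formula is an aside, not from the comment): every program p that makes a universal prefix machine U output a string beginning with x contributes weight 2^{-|p|}, so the mixture automatically concentrates on the shortest programs consistent with the data -- in that sense it "selects" the most specialized descriptions.

    % x* denotes any output string that begins with x; U is a universal
    % prefix machine; |p| is the length of program p in bits.
    M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|}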
|
| One could argue universal general intelligence is simply the
| ability to optimally specialize as necessary.
|
| One aspect that is overlooked when people are involved is that
| we create specialized or fine-tuned models precisely because a
| general approach didn't work well enough. If it had, we would
| have stopped there. So there's a selection bias: almost all
| fine-tuned models are initially better than general models, at
| least until the general models catch up.
| doingtheiroming wrote:
| There are lots of ways to define "good enough" as well. What
| are the costs of running inference on several small experts in
| a decomposed workflow versus using GPT-4, for example? If you
| want to run multiple instances for different teams or
| departments, how do the costs escalate? Do you want to include
| patient data, and do you have concerns about using a closed-
| source model to do so? Etc.
|
| There's little doubt that GPT-4 is going to be the most capable
| at most tasks, either out of the box or with prompt engineering
| as here. But that doesn't mean it's the right approach to use
| now.
| cyanydeez wrote:
| I assume the next step is to find the right model size for
| semantic specialism, then create a model of models that filters
| down to specialized models.
|
| The same will happen for temporal information that stratifies
| and changes, e.g., asking what a 1950s scientist understands
| about quantum physics.
| VirusNewbie wrote:
| I don't understand the reasoning. If you merged two of the most
| specialized models together and simply doubled the
| size/parameter count, there is no reason the smaller ones would
| generalize better than the larger one.
|
| It would simply have two different submodels inside its
| ubermodel.
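A toy sketch of the "two submodels inside an ubermodel" idea: a learned gate mixes two specialist networks. The dimensions, module names, and gating scheme are illustrative assumptions, not from any real system:

    import torch
    import torch.nn as nn

    class TwoExpertModel(nn.Module):
        def __init__(self, d_in=128, d_hidden=256, d_out=10):
            super().__init__()
            self.expert_a = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                          nn.Linear(d_hidden, d_out))
            self.expert_b = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                          nn.Linear(d_hidden, d_out))
            self.gate = nn.Linear(d_in, 2)  # decides how much each expert matters

        def forward(self, x):
            weights = torch.softmax(self.gate(x), dim=-1)          # (batch, 2)
            outputs = torch.stack([self.expert_a(x),
                                   self.expert_b(x)], dim=-1)      # (batch, d_out, 2)
            return (outputs * weights.unsqueeze(1)).sum(dim=-1)    # weighted mix

    model = TwoExpertModel()
    print(model(torch.randn(4, 128)).shape)  # torch.Size([4, 10])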
| Davidzheng wrote:
| Your argument has a very concrete example. We are a generalist
| intelligence. And when we encountered chess, we eventually
| developed specialist intelligence to be the best at chess.
| marvin wrote:
| General intelligence builds tools as required, maybe having a
| whole bunch of them built-in already. Maybe the ones that are
| most frequently useful. Maybe a more general intelligence can
| build more of these tools into itself as required.
|
| It's an interesting philosophical question how to ensure
| objectives aren't affected too much. You can run this thought
| experiment even as a human: there are parts of our goal system
| we might want to adjust, and parts we'd be horrified to touch,
| being very reluctant to risk affecting them even accidentally.
| kjkjadksj wrote:
| What I want to know is how these systems compare to simple
| keyword search algorithms. E.g., presumably the people using
| these systems are medical professionals, typing in some
| symptoms and hoping to glean answers.
|
| Theoretically, all this data could be cataloged in a way where
| you can type in symptoms x y z, demographic information a b c,
| medical history d e f, and get a confidence score for a
| potential diagnosis based on matching these factors to results
| from past association studies. A system like that, I imagine,
| would be easier to feed true information and, importantly,
| would also offer a confidence estimate for the result. It
| would also be computationally much simpler, given the compute
| required to train a machine learning model versus keyword
| search algorithms you can often run locally across massive
| datasets.
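A rough sketch of the kind of catalogue lookup being described. The catalogue and condition names are made-up toy data, and the "confidence" here is just the fraction of catalogued factors matched, not a real statistical estimate:

    # made-up toy catalogue of findings associated with each condition
    CATALOGUE = {
        "condition_a": {"symptoms": {"cough", "fever"}, "demographics": {"adult"}},
        "condition_b": {"symptoms": {"rash", "fever"}, "demographics": {"child"}},
    }

    def rank_diagnoses(symptoms, demographics):
        results = {}
        for name, entry in CATALOGUE.items():
            matched = (len(entry["symptoms"] & symptoms)
                       + len(entry["demographics"] & demographics))
            total = len(entry["symptoms"]) + len(entry["demographics"])
            results[name] = matched / total  # naive fraction-matched "confidence"
        return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

    print(rank_diagnoses({"cough", "fever"}, {"adult"}))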
| shmageggy wrote:
| Why did "Case study in Medicine" get stripped from the submitted
| title? It's way more clickbaity now.
| menzoic wrote:
| You can always fine tune a generalist model to get better results
| tedivm wrote:
| This was something we thought about a bit at Rad AI.
|
| I think the question this paper ignores isn't whether you can
| get a general-purpose model to beat a special-purpose one, but
| whether it's really worth it (outside of the academic sense).
|
| Accuracy is obviously important, and anything that sacrifices
| accuracy in medical areas is potentially dangerous. So
| everything I'm about to say assumes that accuracy remains the
| same or improves for a special-purpose model (and in general
| that is the case -- papers such as this one talk about
| shrinking models as a way of improving them without sacrificing
| accuracy: https://arxiv.org/abs/1803.03635).
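As a rough illustration of the model-shrinking direction cited above, here is a minimal magnitude-pruning example using PyTorch's built-in pruning utility on a toy layer; it is not the paper's full iterative prune-and-retrain procedure:

    import torch.nn as nn
    import torch.nn.utils.prune as prune

    layer = nn.Linear(512, 512)
    prune.l1_unstructured(layer, name="weight", amount=0.8)  # zero the smallest 80%
    sparsity = float((layer.weight == 0).float().mean())
    print(f"fraction of weights pruned: {sparsity:.2f}")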
|
| All things being equal for accuracy, the model that performs
| either the fastest or the cheapest is going to win. For both of
| these cases one of the easiest ways to accomplish the goal is to
| use a smaller model. Specialized models are just about always
| smaller. Lower latency on requests and less energy usage per
| query are both big wins that affect the economics of the system.
|
| There are other benefits as well. It's much easier to experiment
| and compare specialized models, and there's less area for errors
| to leak in.
|
| So even if it's possible to get a model like GPT-4 to work as
| well as a specialized model, if you're actually putting something
| into production and you have the data it almost always makes
| sense to consider a specialized model.
| theossuary wrote:
| That's interesting; I'm currently working on an idea that
| assumes the opposite. Having built specialized models for
| years, I know the cost of having a data science team clean the
| data and build a model is pretty high, and it can take quite a
| while (especially if part of the project is setting up the data
| collection).
|
| For prototyping and for smaller use cases, it makes a lot of
| sense to use a much more general model. Obviously this doesn't
| apply to things like medicine, etc. But for much more general
| things -- checking if someone is on the train tracks, counting
| the number of people currently queuing in a certain area, or
| spotting a fight in a stadium -- I think multi-modal models are
| going to take over. Not because they're efficient or
| particularly fast, but because they'll be quick to implement,
| test, and iterate on.
|
| The cost of building a specialized model, and keeping it up to
| date, will far exceed the cost of an LVM in most niche use
| cases.
| tedivm wrote:
| I think it depends on how much you expect your model to be
| used, and how quickly the model needs to react. The higher
| either of those becomes the more likely you'll want to
| specialize.
|
| If you expect your model to be used a lot, and you don't have
| a way to distribute that pain (for instance, having a mobile
| app run the model locally on people's phones instead of
| remotely in your data centers), then it ends up being a cost-
| balancing exercise. A single DGX machine with 8 GPUs is going
| to cost you about the same as a single engineer would. If
| cutting the model size down means you can reduce the number of
| machines, that makes increasing headcount easier. The nice
| thing about data cleaning is that it's also an investment --
| you can keep using most of the data for a long time afterwards,
| and if you're smart you're building automated cleaning
| techniques that can be applied to new data coming in.
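A back-of-envelope version of the cost balancing described above; all figures are placeholder assumptions, not Rad AI numbers:

    machine_cost_per_year = 300_000    # hypothetical: 8-GPU server, amortised
    engineer_cost_per_year = 300_000   # hypothetical: fully loaded salary
    machines_for_large_model = 4       # serving the big general-purpose model
    machines_for_small_model = 1       # serving a smaller specialised model

    saved = (machines_for_large_model - machines_for_small_model) * machine_cost_per_year
    print(f"engineers fundable with the savings: {saved / engineer_cost_per_year:.1f}")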
| rahimnathwani wrote:
| I'm curious to know what Rad AI ended up doing? IIRC the
| initial problem was how do you turn this set of radiology notes
| into some summary radiology notes, with a specific format. Is
| that right?
|
| If you were approaching this problem anew today, you'd probably
| try with GPT-4 and Claude, and then see what you could achieve
| by finetuning GPT-3.5.
|
| And, yes, for a given level of quality, the fine-tuned GPT-3.5
| will likely be cheaper than the GPT-4 version. But for
| radiology notes, perhaps you'd be happy to pay 10x even if it
| were to give only a tiny improvement?
| tedivm wrote:
| I guess a question to ask is "What is GPT-4?" Is it the
| algorithm, the weights, the data, or a combination of them
| all?
|
| To put it another way, the researchers at Rad AI consumed
| every paper that was out there, including very cutting-edge
| stuff. This included reimplementing GPT-2 in house, as well
| as many other systems. However, we didn't have the same data
| that was used by OpenAI. We also didn't have their
| hyperparameters (and since our data was different it's not a
| guarantee that those would have been the best ones anyway).
|
| So with that in mind, it's possible that Rad AI could today be
| using their own in-house GPT-4, but specialized with their
| radiology data. In other words, them using a specialized
| model and them using GPT-4 wouldn't be contradictory.
|
| I do want to toss out a disclaimer that I left there in 2021,
| so I have no insights into their current setup other than
| what's publicly released. However I have no reason to believe
| they aren't still doing cutting edge work and building out
| custom stuff taking advantage of the latest papers and
| techniques.
| TrevorJ wrote:
| If I had to guess at what the next leap for AI would be, it looks
| something like a collection of small and large special-purpose
| models, with a model on top that excels at understanding which
| sub-models (if any) should be applied to the current problem
| space.
___________________________________________________________________
(page generated 2023-11-30 23:01 UTC)