[HN Gopher] Can generalist foundation models beat special-purpos...
       ___________________________________________________________________
        
       Can generalist foundation models beat special-purpose tuning?
        
       Author : wslh
       Score  : 100 points
       Date   : 2023-11-30 11:14 UTC (11 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | avereveard wrote:
       | gpt-4 is not foundational, it's rlhf tuned. also I didn't find
       | the actual gpt-4 model revision, so yeah. sloppy all around.
        
       | cl42 wrote:
       | I'm looking to run experiments like this for specific domains and
       | prompting strategies as a way to showcase the next version of
        | https://evals.phasellm.com. If the paper above resonates with you
        | or if you have questions about prompting techniques, feel free to
        | reply below or email me at hello #at# phaseai #dot# com.
       | 
       | Happy to help!
        
       | jerpint wrote:
        | A real test would be fine-tuning GPT-4 and comparing that to
        | GPT-4 with Medprompt.
        
         | c7b wrote:
         | I think the point of the study was precisely to see how well
         | the original model without adaptations (my understanding is
         | that they only use prompting, which does not affect weights)
         | can perform. I think that question is arguably even more
         | interesting than 'How well can we train a model to perform on
         | med school questions?'. I'm not saying this is generalization,
         | because the training set surely included a lot of medical
         | literature, but if the base model without fine-tuning can
         | perform well in one important domain, that's a very interesting
         | data point (especially if we find the same to hold true for
         | multiple domains).
        
           | ramoz wrote:
            | Caveat this with the fact that they did RAG.
        
             | practice9 wrote:
             | Yeah the Medprompt name is misleading
        
             | owl_brawl wrote:
             | The "RAG" part (over the training set) is by far the
             | smallest contribution to the performance gains reported
             | (see the ablation study in section 5.2). I don't think the
             | model is actually learning in-context from the selected
             | samples, but rather is continuing to be better conditioned
             | to sample from the right part of the pre-training
             | distribution here, which does a slightly better job when
             | the samples are on topic (vs. random)
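              | 
              | To be concrete, that "RAG" is essentially kNN selection of
              | few-shot exemplars from the training set. A minimal sketch
              | under that reading (toy embed() helper, not the paper's
              | code):
              | 
              |     import numpy as np
              | 
              |     def embed(text, dim=256):
              |         # Toy hashed bag-of-words vector; in practice
              |         # this would be a real sentence-embedding model.
              |         v = np.zeros(dim)
              |         for tok in text.lower().split():
              |             v[hash(tok) % dim] += 1.0
              |         return v
              | 
              |     def cosine(a, b):
              |         denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
              |         return float(a @ b / denom)
              | 
              |     def select_exemplars(question, train_set, k=5):
              |         # Rank training Q/A pairs by similarity to the
              |         # test question; the top k become the few-shot
              |         # examples placed in the prompt.
              |         q = embed(question)
              |         ranked = sorted(
              |             train_set,
              |             key=lambda ex: cosine(q, embed(ex["question"])),
              |             reverse=True)
              |         return ranked[:k]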
        
       | sungam wrote:
       | We found that GPT-4 performs at a very high level on dermatology
       | specialist exam questions (which it would not have been trained
       | on)
       | 
       | https://www.medrxiv.org/content/10.1101/2023.07.13.23292418v...
        
         | 4death4 wrote:
         | Why wouldn't it have been trained on such questions?
        
           | sungam wrote:
           | Not publicly available
        
             | RandomLensman wrote:
              | Abstract says the sample questions were publicly available.
             | The point they are making is that the model wasn't
             | specifically trained for dermatology.
        
             | pyinstallwoes wrote:
             | What makes you think that the entire body of private access
             | journals and books hasn't been used?
        
         | civilized wrote:
         | Extensive medical information is publicly available on websites
         | like www.nhs.uk. So even if it wasn't literally trained on exam
         | questions, it may have picked up substantially identical
         | information elsewhere.
         | 
         | Which is still a significant feat. But I doubt it's doing
         | anything more than the sort of semantic lookup that is already
         | well-known to be within its capabilities.
        
         | RandomLensman wrote:
          | The questions were publicly available. The point is the model
         | wasn't specifically trained for dermatology.
        
       | evrydayhustling wrote:
       | This is a very interesting study, but the generalization implied
       | in the abstract -- that specialized models may not have
       | advantages in specialized domains -- is an overreach for two
       | reasons.
       | 
        | They introduce Medprompt, which combines chain-of-thought
        | reasoning, supervision, and retrieval-augmented generation to get
        | better results on novel questions. This is a cool way to leverage
        | supervision capacity outside of training/fine-tuning! But they
        | compare this strategy, applied to GPT-4, with old prompting
        | strategies applied to MedPalm. This is apples and oranges -- you
        | could easily have applied the same supervision strategy (possibly
        | adapted for a smaller attention window etc.) to get a closer
        | comparison.
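        | 
        | For reference, the ensembling piece of that recipe is roughly
        | choice-shuffle + chain-of-thought + majority vote. A hedged
        | sketch of just that piece (the ask_model callable is a
        | hypothetical stand-in, not the authors' code):
        | 
        |     import random
        |     from collections import Counter
        | 
        |     def medprompt_style_answer(question, options, ask_model,
        |                                n_votes=5):
        |         # ask_model is any callable taking a prompt and
        |         # returning the chosen option *text* after reasoning.
        |         votes = []
        |         for _ in range(n_votes):
        |             shuffled = random.sample(options, len(options))
        |             listing = "\n".join(
        |                 f"{chr(65 + i)}. {o}"
        |                 for i, o in enumerate(shuffled))
        |             prompt = (f"{question}\n{listing}\n"
        |                       "Think step by step, then answer.")
        |             votes.append(ask_model(prompt))
        |         # Majority vote over the option text, so shuffling
        |         # cancels out position bias.
        |         return Counter(votes).most_common(1)[0][0]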
       | 
        | Second, MedPalm is a fine-tuned 500b-parameter model. GPT-4 is
        | estimated to be 3-4x that size. So the comparison confounds
        | emergent capabilities at scale, on top of the prompting impact,
        | with anything you can say generally about the value of model
        | specialization.
        
       | djha-skin wrote:
       | The answer to most of these kinds of questions of generalist
       | versus specialist when it comes to AI, according to Richard
       | Sutton's "The Bitter Lesson"[1], is that specialized AI will win
        | for a while, but generalized AI will always win in the long term
        | because it's better at turning compute into results.
       | 
       | Whether or not that will pan out with "fine tuned" versus
       | "generalized" versions of the same data-eating algorithms remains
       | to be seen, but I suspect the bitter lesson might still apply.
       | 
       | 1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
        
         | nuz wrote:
          | Well, the counterargument is that with an equal amount of
          | compute, specialized models will win over generalized ones. The
          | reason is that generalized ones have to spread out their
          | weights to be able to do a ton of things irrelevant to the
          | niche subject, which a finetuned model can accomplish with
          | fewer and more focused weights.
        
           | rschiavone wrote:
           | I'd argue that yours is not a counter argument, but the same
           | thing said by grandparent rephrased. It's obvious that with
           | the same amount of computing power, specialized >
           | generalized. But in 2-5 years, the future generalized models
           | can beat the current specialized ones.
        
             | goldenkey wrote:
             | The future of this is the same as what happened with PCs.
             | The specificity will initially be thrown away, added back
             | later as an accelerator, and eventually, brought back into
             | the fold as an automatically used accelerator. It all comes
             | full circle as optimizations, hinting, tiering, and
             | heuristics.
             | 
             | AI will be used to select the net to automatically load.
             | Nets will be cached, branch predicted etc.
             | 
             | The future of AI software and hardware doesn't yet support
             | the scale we need for this type of generalized AI processor
             | (think CPU but call it an AIPU.)
             | 
              | And no, GPUs aren't an AIPU; we can't even fit some of the
              | largest models whole on these things without running them
              | in pieces. They don't have a higher-level language yet,
              | like C, which would compile down to more specific actions
              | after optimizations are applied (not PTX/LLVM/CUDA/OpenCL).
        
             | marcosdumay wrote:
             | No, the GP is stating that the short term advantage will
             | always exist, for fundamental reasons. It's just short term
             | because it has to be reacquired often, not because there
             | stops being an advantage.
             | 
             | In 5 years, the specialized model would still beat the
             | generalized ones. They would just be different from the
             | ones today.
        
               | PaulHoule wrote:
               | Couldn't you apply the same prompt engineering tricks to
               | the specialized model that their system beats?
        
               | marcosdumay wrote:
                | Well, I'm not sure how valid the entire thing is. It's
                | not clear how much cross-pollination there is between
                | those datasets and the ChatGPT training set (IMO, if I
                | were training something like it, I'd take any specialized
                | dataset I could find), or how useful answering those
                | questions is to any real task.
                | 
                | Also, it's not only the prompt engineering. Training
                | ChatGPT was a decades-long process, with billions of
                | dollars invested, and it takes a small cluster to run the
                | thing. How would the other models compare if given
                | similar resources?
                | 
                | Besides that, in the context of this thread, the bitter
                | lesson has absolutely no relation to the specialized vs.
                | general-purpose model dichotomy. It's about the
                | underlying algorithms, which for this article are all
                | very similar. (And it's also not a known truth, and it is
                | looking less and less like an absolute truth as deep
                | learning advances, so take it with a huge grain of salt.)
        
           | fauigerzigerk wrote:
           | That's probably true in terms of benchmark problems, but at
           | the same time, losing knowledge of seemingly irrelevant
           | things could make the specialised model less creative when it
           | comes to looking for unexpected causes and links.
           | 
           | It would also be less robust in the face of exceptional
           | situations that should make it doubt the reliability and
           | relevance of its own training.
           | 
           | For instance, an autonomous driving system that doesn't know
           | the first thing about the zombie apocalypse could put its
           | passengers' lives at risk by refusing to run over
           | "pedestrians".
           | 
           | A specialised diagnostic system might not notice signs of
           | domestic violence that a GP would see. Being able to connect
           | the dots beyond the confines of some specialised field is
           | extremely useful in many situations.
        
             | notahacker wrote:
             | > For instance, an autonomous driving system that doesn't
             | know the first thing about the zombie apocalypse could put
             | its passengers' lives at risk by refusing to run over
             | "pedestrians".
             | 
             | I think that argument works better the other way. I'm
             | outspoken in arguing many of the edge cases in driving
              | _are_ general intelligence problems we're not about to
             | solve by optimising software over a few billion more miles
             | in a simulator, but I don't want that intelligence so
             | generalised that my car starts running down elderly
             | pedestrians because there's been a lot of zombie literature
             | on the Internet lately. I'm pretty confident there's more
             | risk of people dying from that than a zombie apocalypse.
        
           | ryanjshaw wrote:
           | Wouldn't it become more and more difficult to improve the
           | specialist model? And wouldn't the ultimate generalized
           | models (AGI) eventually be able to improve themselves
           | automatically? Or am I talking about something different
           | here?
        
         | anon373839 wrote:
         | I think it might be stretching the Bitter Lesson a bit to
         | extend it to cover fine-tuning, since the original idea was
         | addressing something very different.
         | 
         | Anyway, even if generalist models are truly better at
         | everything, they'll never be faster or cheaper than their
         | specialist counterparts, and that matters.
        
         | AndrewKemendo wrote:
         | I mean the whole point of the "bitter" part is that if you push
         | most people that aren't convinced of the data hypothesis, they
         | will talk of a belief that there's "something special" that
         | happens to create intelligence that isn't just raw number
         | crunching (so to speak). It's a human bias in how we view
         | social hierarchy and expertise, and ignores the
         | epistemic/biological perspective.
         | 
         | It's a philosophical position as much as a technical one
        
         | nonameiguess wrote:
         | This is an empirical observation regarding implementation
         | details of _machine learning_ models specifically, not math
         | models in general. Clearly, in principle, specialized models
         | can beat general models simply because at least some
         | applications have exact models. No learned functional
         | approximator will beat an ALU or even an abacus at simple
         | arithmetic.
        
           | crosen99 wrote:
           | Well said, and fully agree. If you horse race two approaches,
           | you can of course arbitrarily arrive at a winner based solely
           | on the version of each approach you choose. You need a deeper
           | look if you want to generalize.
        
         | mistrial9 wrote:
         | no - because the range of responses in the generalized model
         | will tend in some asymptotic way to "patterns the machinery
         | recognizes" . Real knowledge is not like a machine process, in
         | every instance. Any detective novel has a moment when a faint
         | scent or a distant sighting connects very different inputs to
         | aid in solving a mystery. Lastly, knowledge has resonance and
         | harmonics; great Truths are paradox. Your bitter lesson is
         | convenient at this time of evolution with computing, not The
         | Answer.
        
         | loeber wrote:
         | Good read, thanks for sharing.
        
       | RandomLensman wrote:
       | Any sufficiently large library has a lot of specialty books in
       | it.
        
         | mistrial9 wrote:
         | you have got to be joking. This is not true in any real way.
         | 
         | source: six years in the book industry in California
        
       | upghost wrote:
       | Technical foul. Doing science based on closed-source model. 10
       | yard penalty for humanity.
       | 
       | GPT-4 _almost_ disappeared last week. Aside from the obvious AI
       | bus factor and  "continuity of science" concerns, I find it
       | absurd that so many papers and open source ML/LLM frameworks and
       | libraries exist that simply would not work at all if not for
       | GPT-4. Have we simply given up?
       | 
       | I thought this was _hacker_ news.
        
         | Philpax wrote:
         | I'm all in favour of doing reproducible work, especially
         | against open-source ML (I'm the maintainer of a library for LLM
         | inference), but what is the alternative here?
         | 
         | GPT-4 is still the flagship model "of humanity"; it is still
         | the best model that is publicly available. The intent of the
         | paper is to determine whether the best general model can
         | compete with specialized models - how do you do that without
         | using the best general model?
        
         | rafram wrote:
         | Half of science comes down to poking at black boxes to see how
         | they react. The universe is a closed-source model!
        
           | ZiiS wrote:
           | The other half is publishing results to fix that.
        
         | lambda_garden wrote:
         | Llamafile is (rightfully) the top of HN right now.
         | 
         | I have high hopes that models will end up a bit like Linux
         | servers, where everyone is building on open foundations.
        
         | pixl97 wrote:
         | Almost all modern science is gated by needing incredibly
         | massive amounts of money.
         | 
          | 'Closed source' science happens in every industry. While you
          | may not think of what DOW is doing as science, it is, and huge
          | companies like that progress by their internal workings.
          | 
          | What you're stating is something else: that maybe we should
          | fund sciences such as AI more. But we really did that in the
          | past and it had its own fits and starts. Then transformers
          | came out of Google and private industry has been leading the
          | way in AI. If you want to keep up with the cutting edge and
          | the tens of millions needed to train these models, you'll need
          | to use these private models.
         | 
         | Hackers hack stuff from private companies all the time. Hackers
         | use proprietary software. There is no purity test we have to
         | pass.
        
       | PaulHoule wrote:
       | I would think you could apply an inference process like that to a
       | smaller model and get results better than you get with zero-shot
       | on a larger model... And be able to afford to run the smaller
       | model as many times as it takes.
        
       | make3 wrote:
        | Really? "GPT-4 beats our tiny model", that's your paper?
        
       | hallqv wrote:
        | Depends on your definition of winning - special-purpose tuning is
        | vastly more cost-effective since it allows you to train a smaller
        | model that can perform specific tasks as well as a bigger one.
        | 
        | A good analogy is building a webapp - would you prefer to hire a
        | developer with 30+ years of experience in various CS domains as
        | well as a PhD, or a specialized webdev with 5 years of experience
        | at a tenth of the rate?
        
       | Xcelerate wrote:
       | Not strictly related to their study, but if we consider
       | Solomonoff induction to be the most general model of all (albeit
       | not computable), then I'd say yes to the question, if only
       | because it will select the most specialized model(s) for the
       | particular problem at hand automatically.
       | 
       | One could argue universal general intelligence is simply the
       | ability to optimally specialize as necessary.
       | 
       | I think one aspect that is overlooked when people are involved is
       | that we create specialized or fine-tuned models precisely for the
       | reason that a general approach didn't work well enough. If it
       | had, we would have stopped there. So there's a selection bias in
       | that almost all fine-tuned models are initially better than
       | general models, at least until the general models catch up.
        
         | doingtheiroming wrote:
          | There are lots of ways to define "good enough" as well. What
          | are the costs of running inference with several small experts
          | contributing to a decomposed workflow versus using GPT-4, for
          | example? If you want to run multiple instances for different
          | teams or departments, how do the costs escalate? Do you want
          | to include patient data, and do you have concerns about using
          | a closed-source model to do so? Etc.
          | 
          | There's little doubt that GPT-4 is going to be the most capable
          | at most tasks, either OotB or with prompt engineering as here.
          | But that doesn't mean it's the right approach to use now.
        
         | cyanydeez wrote:
          | I assume the next step is to find the right model size for
          | semantic specialism, then create a model of models that
          | filters down to specialized models.
          | 
          | The same will happen for temporal information that stratifies
          | and changes, i.e., asking what a 1950s scientist understands
          | about quantum physics.
        
         | VirusNewbie wrote:
          | I don't understand the reasoning. If you merged two of the
          | most specialized models together and simply doubled the
          | size/parameter count, there is no reason the smaller one would
          | generalize better than the larger one.
         | 
         | It would simply have two different submodels inside its
         | ubermodel.
        
         | Davidzheng wrote:
         | Your argument has a very concrete example. We are a generalist
         | intelligence. And when we encountered chess, we eventually
         | developed specialist intelligence to be the best at chess.
        
           | marvin wrote:
           | General intelligence builds tools as required, maybe having a
           | whole bunch of them built-in already. Maybe the ones that are
           | most frequently useful. Maybe a more general intelligence can
           | build more of these tools into itself as required.
           | 
            | It's an interesting philosophical question how to ensure
            | objectives aren't affected too much. You can run this thought
            | experiment even as a human. There are parts of our goal
            | system we might want to adjust, and parts we'd be horrified
            | to touch, being very reluctant to risk accidentally affecting
            | them.
        
       | kjkjadksj wrote:
        | What I want to know is how these systems compare to simple
        | keyword search algorithms. E.g. presumably the people using
        | these systems are medical professionals, typing in some symptoms
        | and gleaning for answers.
        | 
        | Theoretically, all this data could be cataloged in a way where
        | you can type in symptoms x y z, demographic information a b c,
        | medical history d e f, and get a confidence score for a potential
        | diagnosis based on matching these factors to results from past
        | association studies. A system like that, I imagine, would be
        | easier to feed true information into, and importantly it would
        | also offer a confidence estimate of the result. It would also be
        | a lot simpler computationally to run, I'd imagine, given the
        | compute required to train a machine learning model vs. keyword
        | search algorithms you can often run locally across massive
        | datasets.
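        | 
        | A toy version of that kind of lookup, just to illustrate the
        | shape of it (the catalog contents and weights are made up):
        | 
        |     # Hypothetical catalog: condition -> associated factors,
        |     # each weighted by strength from past association studies.
        |     CATALOG = {
        |         "condition_a": {"symptom_x": 0.8, "symptom_y": 0.5,
        |                         "history_d": 0.3},
        |         "condition_b": {"symptom_y": 0.6, "symptom_z": 0.9},
        |     }
        | 
        |     def rank_diagnoses(patient_factors):
        |         # Score = summed weights of matched factors divided by
        |         # the condition's total weight, so 1.0 is a full match.
        |         results = []
        |         for cond, factors in CATALOG.items():
        |             matched = sum(w for f, w in factors.items()
        |                           if f in patient_factors)
        |             total = sum(factors.values())
        |             results.append((cond, round(matched / total, 2)))
        |         return sorted(results, key=lambda r: r[1], reverse=True)
        | 
        |     print(rank_diagnoses({"symptom_x", "symptom_y"}))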
        
       | shmageggy wrote:
       | Why did "Case study in Medicine" get stripped from the submitted
       | title? It's way more clickbaity now.
        
       | menzoic wrote:
       | You can always fine tune a generalist model to get better results
        
       | tedivm wrote:
       | This was something we thought about a bit at Rad AI.
       | 
        | I think one of the questions this paper ignores isn't whether
        | you can get a general-purpose model to beat a special-purpose
        | one, but whether it's really worth it (outside of the academic
        | sense).
        | 
        | Accuracy is obviously important, and anything that sacrifices
        | accuracy in a medical area is potentially dangerous. So
        | everything I'm about to say assumes that accuracy remains the
        | same or improves for a special-purpose model (and in general
        | that is the case -- papers such as this one discuss shrinking
        | models as a way of improving without sacrificing accuracy:
        | https://arxiv.org/abs/1803.03635).
       | 
       | All things being equal for accuracy, the model that performs
       | either the fastest or the cheapest is going to win. For both of
       | these cases one of the easiest ways to accomplish the goal is to
       | use a smaller model. Specialized models are just about always
       | smaller. Lower latency on requests and less energy usage per
       | query are both big wins that affect the economics of the system.
       | 
       | There are other benefits as well. It's much easier to experiment
       | and compare specialized models, and there's less area for errors
       | to leak in.
       | 
       | So even if it's possible to get a model like GPT-4 to work as
       | well as a specialized model, if you're actually putting something
       | into production and you have the data it almost always makes
       | sense to consider a specialized model.
        
         | theossuary wrote:
         | That's interesting, I'm currently working on an idea that
         | assumes the opposite. Having built specialized models for
         | years, the cost of having a data science team clean the data
         | and build a model is pretty high, and it can take quite a while
         | (especially if part of the project is setting up the data
         | collection).
         | 
          | For prototyping and for smaller use cases, it makes a lot of
          | sense to use a much more general model. Obviously this doesn't
          | apply to things like medicine, etc. But for much more general
          | things -- like checking if someone is on the train tracks,
          | counting the number of people currently queuing in a certain
          | area, or detecting if there's a fight in a stadium -- I think
          | multi-modal models are going to take over. Not because they're
          | efficient, or particularly fast, but because they'll be quick
          | to implement, test, and iterate on.
         | 
         | The cost of building a specialized model, and keeping it up to
         | date, will far exceed the cost of an LVM in most niche use
         | cases.
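          | 
          | As a concrete example of that kind of prototyping, a hedged
          | sketch against the OpenAI vision-chat endpoint (model name and
          | image URL are placeholders; the request shape may differ by
          | SDK version):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()  # assumes OPENAI_API_KEY is set
          | 
          |     # Zero-shot "is someone on the tracks?" check, with no
          |     # task-specific training data or model building at all.
          |     resp = client.chat.completions.create(
          |         model="gpt-4-vision-preview",  # placeholder
          |         max_tokens=5,
          |         messages=[{
          |             "role": "user",
          |             "content": [
          |                 {"type": "text",
          |                  "text": "Is anyone on the train tracks? "
          |                          "Answer yes or no."},
          |                 {"type": "image_url",
          |                  "image_url": {
          |                      "url": "https://example.com/cam.jpg"}},
          |             ],
          |         }],
          |     )
          |     print(resp.choices[0].message.content)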
        
           | tedivm wrote:
           | I think it depends on how much you expect your model to be
           | used, and how quickly the model needs to react. The higher
           | either of those becomes the more likely you'll want to
           | specialize.
           | 
           | If you expect your model to be used a lot, and you don't have
           | a way to distribute that pain (for instance, having a mobile
           | app run the model locally on people's phones instead of
           | remotely on your data centers) then it ends up being a cost
           | balancing method. A single DGX machine with 8 GPUs is going
           | to cost you about the same as a single engineer would. If
            | cutting the model size down means you can reduce your number
            | of machines, that makes increasing headcount easier. The nice
            | thing about data cleaning is that it's also an investment --
            | you can keep using most data for a long time afterwards, and
            | if you're smart then you're building automated techniques for
            | cleaning that can be applied to new data coming in.
        
         | rahimnathwani wrote:
         | I'm curious to know what Rad AI ended up doing? IIRC the
         | initial problem was how do you turn this set of radiology notes
         | into some summary radiology notes, with a specific format. Is
         | that right?
         | 
         | If you were approaching this problem anew today, you'd probably
         | try with GPT-4 and Claude, and then see what you could achieve
         | by finetuning GPT-3.5.
         | 
         | And, yes, for a given level of quality, the finetuned GPT-3.5
         | will likely be cheaper than the GPT-4 version. But for
          | radiology notes, perhaps you'd be happy to pay 10x even if it
          | were to give only a tiny improvement?
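          | 
          | For reference, the finetuning route is short to try -- a
          | hedged sketch with the OpenAI fine-tuning API (the JSONL file
          | name and contents are hypothetical):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()  # assumes OPENAI_API_KEY is set
          | 
          |     # Each JSONL line holds one training example:
          |     # {"messages": [{"role": "user", ...},
          |     #               {"role": "assistant", ...}]}
          |     f = client.files.create(
          |         file=open("radiology_examples.jsonl", "rb"),
          |         purpose="fine-tune")
          | 
          |     job = client.fine_tuning.jobs.create(
          |         training_file=f.id, model="gpt-3.5-turbo")
          |     # Poll the job until it finishes, then call the resulting
          |     # fine-tuned model name for inference.
          |     print(job.id)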
        
           | tedivm wrote:
           | I guess a question to ask is "What is GPT-4". Is it the
           | algorithm, the weights, the data, or a combination of them
           | all?
           | 
            | To put it another way, the researchers at Rad AI consumed
            | every paper that was out there, including very cutting-edge
            | stuff. This included reimplementing GPT-2 in house, as well
            | as many other systems. However, we didn't have the same data
            | that was used by OpenAI. We also didn't have their
            | hyperparameters (and since our data was different, it's not a
            | guarantee that those would have been the best ones anyway).
           | 
           | So with that in mind it's possible that Rad AI could today be
           | using their own in house GPT-4, but specialized with their
           | radiology data. In other words them using a specialized
           | model, and them using GPT-4, wouldn't be contradictory.
           | 
           | I do want to toss out a disclaimer that I left there in 2021,
           | so I have no insights into their current setup other than
           | what's publicly released. However I have no reason to believe
           | they aren't still doing cutting edge work and building out
           | custom stuff taking advantage of the latest papers and
           | techniques.
        
       | TrevorJ wrote:
       | If I had to guess at what the next leap for AI would be, it looks
       | something like a collection of small and large special-purpose
       | models, with a model on top that excels at understanding which
       | sub-models (if any) should be applied to the current problem
       | space.
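        | 
        | Something like that is easy to prototype today. A hedged sketch
        | of a router dispatching to specialist models (classify_domain()
        | and the specialist callables are made-up stand-ins):
        | 
        |     # Hypothetical specialist models, keyed by domain.
        |     SPECIALISTS = {
        |         "medicine": lambda q: "answer from a med-tuned model",
        |         "law": lambda q: "answer from a law-tuned model",
        |     }
        | 
        |     def classify_domain(question):
        |         # Stand-in router: in practice a small classifier or
        |         # an LLM prompted to pick one of the known domains.
        |         q = question.lower()
        |         if "symptom" in q or "diagnosis" in q:
        |             return "medicine"
        |         if "contract" in q:
        |             return "law"
        |         return "general"
        | 
        |     def answer(question,
        |                generalist=lambda q: "generalist answer"):
        |         # Fall back to the generalist when no specialist fits.
        |         model = SPECIALISTS.get(classify_domain(question),
        |                                 generalist)
        |         return model(question)
        | 
        |     print(answer("What diagnosis fits these symptoms?"))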
        
       ___________________________________________________________________
       (page generated 2023-11-30 23:01 UTC)