[HN Gopher] My finetuned models beat OpenAI's GPT-4
___________________________________________________________________
My finetuned models beat OpenAI's GPT-4
Author : majc2
Score : 351 points
Date : 2024-07-01 08:53 UTC (14 hours ago)
(HTM) web link (mlops.systems)
(TXT) w3m dump (mlops.systems)
| scosman wrote:
| And that's the point of fine tuning models.
|
| Still good to see someone walk through their fine tuning process,
| with a mix of hosted and local options.
| scosman wrote:
| On that note: is there a good service for "here's my dataset",
| please fine tune these 9 models and give me evaluation stats?
| strickvl wrote:
| OpenPipe - https://openpipe.ai/ - is probably the service
| that most closely resembles what you're asking for, but I
| found the evals weren't really what I wanted -- i.e.
| following my custom evaluation criteria -- so you probably
| will end up having to do that yourself anyway. But for the
| finetuning, they're all somewhat the same. Predibase and
| OpenPipe are two good options for that. Predibase has more
| base models for you to finetune, but it's a bit more unwieldy
| to work with. I wrote about that in a previous post here --
| https://mlops.systems/posts/2024-06-17-one-click-
| finetuning.....
| kcorbitt wrote:
| (Disclaimer: founder of OpenPipe). Thanks for the shout-
| out. Note that we're actively working on improved
| evaluations that will let you add more specific criteria as
| well as more evaluation types, like comparing field values
| to those of a golden dataset. This is definitely something
| that customers are asking for!
| scosman wrote:
| Wild to see them advertising collecting GPT4 responses for
| training other models. That's definitely not allowed by
| TOS. I suspect many do, but front page advertising is
| another thing entirely.
| tucnak wrote:
| Together.AI is a good starting point. Even though I'm not
| sure what fine-tuning method they're using, the results are
| REALLY good.
| w4nderlust wrote:
| Predibase ( http://predibase.com ), also referred to in the
| article, is a platform specifically designed for exactly
| that. It also has "repos" for finetuning multiple models,
| comparing their performance, and keeping things organized.
| It also allows you to query any of the finetuned models on
| the fly from a single GPU with multi-LoRA serving.
| (Predibase founder here)
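| Roughly, multi-LoRA serving keeps one base model in GPU
| memory and swaps small task adapters per request. A minimal
| sketch with the open-source peft library (the adapter repo
| names here are hypothetical):
|
|     from transformers import AutoModelForCausalLM
|     from peft import PeftModel
|
|     # one base model resident in GPU memory
|     base = AutoModelForCausalLM.from_pretrained(
|         "mistralai/Mistral-7B-v0.1")
|     # attach several task-specific LoRA adapters to it
|     model = PeftModel.from_pretrained(
|         base, "acme/lora-extraction",
|         adapter_name="extraction")
|     model.load_adapter("acme/lora-summarize",
|                        adapter_name="summarize")
|     # switch adapters per incoming request
|     model.set_adapter("extraction")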
| geokon wrote:
| As I understood it, the point was not that they fine tuned a
| model and it got better.
|
| They took a much simpler model, fine tuned it, and managed
| to beat a far more advanced model.
| wongarsu wrote:
| When jumping from 7B parameters to 70B to 400B (or whatever
| GPT-4 uses) most of the additional neurons seem to go towards
| a better world model and better reasoning (or whatever you
| want to call the inference of new information from known
| information). There don't seem to be any major improvements
| in basic language skills past 7B, and even 1B and 3B models
| do pretty well on that front.
|
| In that sense it's not that surprising that on a pure text
| extraction task with little "thinking" required, a 7B model
| does well and outperforms other models after fine tuning. In
| the "noshotsfired" label GPT-4 is even accused of
| overthinking it.
|
| It is interesting how finetuned mistral-7b and llama3-8b
| outperform finetuned gpt3.5-turbo. I would tend to attribute
| that to those models being newer and "more advanced" despite
| their low parameter count, but maybe that's interpreting too
| much into a small score difference.
| scosman wrote:
| Re: 7b models vs gpt-3.5, I'm guessing different fine
| tuning parameters can account for the difference. The
| OpenAI fine tuning is a black box.
| scosman wrote:
| That's still the point. That model now does exactly one
| thing, and because of that can do better than a model 50x the
| size that tries to do everything. It will crush it in
| instruction following and consistency.
|
| A fine tuned 500b parameter model would probably beat the
| fine tuned 7b model, but only by a bit (depending on task
| obviously). A lot of that capacity is being used for
| knowledge, and isn't needed for extraction/classification
| tasks. Fine tuning isn't touching most of those weights. The
| smaller models need to focus on more general language skills,
| not answering "describe the evolution of France's economy in
| the 1800s".
| gillesjacobs wrote:
| This is entirely unsurprising and in line with the finding
| that even small specialized models do better at information
| extraction and text classification. So it's no wonder
| finetuned large LMs do well too.
|
| Personally, my PhD was on fine-grained ACE-like event and
| sentiment extraction, and "small" specialized finetuned
| transformers like BERT and RoBERTa-large outperformed
| prompted LLMs. Would love to see small model scores included
| alongside some SOTA pipelines.
|
| This is great work anyway even if it replicates known results!
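| For reference, the kind of "small" finetune I mean is a
| standard sequence-classification run. A minimal sketch with
| the Hugging Face Trainer (the data file and label count are
| placeholders):
|
|     from datasets import load_dataset
|     from transformers import (
|         AutoModelForSequenceClassification, AutoTokenizer,
|         Trainer, TrainingArguments)
|
|     tok = AutoTokenizer.from_pretrained("roberta-large")
|     model = AutoModelForSequenceClassification.from_pretrained(
|         "roberta-large", num_labels=5)  # task-specific count
|     # expects "text" and "label" columns
|     ds = load_dataset("csv", data_files="events.csv")["train"]
|     ds = ds.map(lambda b: tok(b["text"], truncation=True),
|                 batched=True)
|     Trainer(model=model,
|             args=TrainingArguments("out", num_train_epochs=3),
|             train_dataset=ds,
|             tokenizer=tok).train()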
| pandatigox wrote:
| Your thesis sounds interesting! Do you have a link to it by any
| chance?
| wuschel wrote:
| Seconded! Any URI to your PhD?
| rovr138 wrote:
| Check https://www.researchgate.net/publication/356873749_Ex
| tractin...
| gillesjacobs wrote:
| rovr beat me to it below. Here are more links:
| https://jacobsgill.es/phdobtained (fun fact: because my
| thesis contains published papers, I am in breach of a few
| journals' copyright by uploading my own thesis pdf, but
| fuck'em).
|
| LLM approaches were evaluated on my own time but never
| published (I left research after obtaining my PhD).
| pandatigox wrote:
| Thank you for the link! And congratulations on obtaining
| your PhD
|
| I have skimmed through it and it's truly amazing how good
| annotation of the dataset can lead to impressive results.
|
| I apologise in advance if the question seems ignorant: The
| blog post talked about fine-tuning models online. Given
| that BERT models can run comfortably on even iPhone
| hardware, were you able to finetune your models locally or
| did you have to do it online too? If so, are there any
| products that you recommend?
| gillesjacobs wrote:
| Thanks! The fine-tunes were done in 2019-21 on a 4xV100
| server with hyperparameter search, so thousands of
| individual fine-tuned models were trained in the end. I
| used Weights & Biases for experiment dashboarding during
| the hyperparameter search, but the hardware was our own
| GPU server (no cloud service used).
|
| I doubt you can fine-tune BERT-large on a phone. A
| quantized, inference optimised pipeline can be leaps and
| bounds more efficient and is not comparable with the
| huggingface training pipelines on full models I did at
| the time. For non-adapter based training you're going to
| need GPUs ideally.
| Mockapapella wrote:
| This is really cool -- thanks for posting it! I'll have to
| skim through it at some point since a lot of my work is in
| classification models and mirrors the results you've seen.
| SpaceManNabs wrote:
| > because my thesis contains published papers, ..., but f
| 'em
|
| Excluding the part in the middle because I don't wanna
| repost potential issues for you. I just wanted to comment
| that that is terrible. People often talk about the siloed
| nature of research in industry, without considering that
| academia supports the draconian publishing system. I
| understand IP protection, but IP protection doesn't have to
| mean no access. This is such a huge issue in the bio- world
| (biostats, genetics, etc).
| uolmir wrote:
| I don't know your circumstances but often you retain the
| right to distribute a "post-print", i.e. the final text as
| published but without journal formatting. A dissertation
| should fit that definition.
| gillesjacobs wrote:
| This is indeed often the case; however, my university
| reviews each thesis, and deemed mine can only become
| open access in 2026 (5 years from my defense).
|
| I think this is the default policy here for theses
| based on publications.
|
| In any case, I am not too worried.
| renegade-otter wrote:
| The caveat here is that if you don't know how to create good
| specialized models, you are just wasting everyone's time and
| money:
|
| https://www.threads.net/@ethan_mollick/post/C46AfItO8RS?hl=e...
| gillesjacobs wrote:
| Exactly. BloombergGPT performed worse on financial sentiment
| analysis than much smaller fine-tuned BERT-based models.
|
| For many extractive tasks BloombergGPT was quite
| disappointing. A 5-10% performance hit with much larger
| inference cost compared to smaller models is not desirable.
|
| But the research investment for Bloomberg makes sense to take
| the risk: a do-it-all generative model can mean significant
| complexity reduction in maintenance and deployment overhead.
|
| It didn't directly pay off for many extractive tasks, but I
| bet they're iterating. Bloomberg has the data moat and the
| business needs in their core products to make it worthwhile.
| courseofaction wrote:
| Really interesting. Could the potentially controversial content
| of the target news article have an effect on ChatGPT's ability to
| summarize it?
| strickvl wrote:
| I think not. Normally if you get those kinds of errors you
| wouldn't get any output at all. In the blog I show that all 724
| of the test cases got proper JSON output etc for the queries so
| I don't think this was an issue. I think these kinds of topics
| would have been well covered in the training data, and probably
| the OSS models would have used similar data so I don't even
| think there's a disparity to be found between proprietary vs
| OSS models here.
| resource_waste wrote:
| >Normally if you get those kinds of errors you wouldn't get
| any output at all
|
| I am not sure. I disagree. If there is a pro-ChatGPT user,
| I'm probably it.
|
| I've often seen it put significantly less effort into
| answering the question.
| strickvl wrote:
| Interesting. I can maybe try finetuning one or two of the
| so-called 'uncensored' open models and see if that makes a
| difference. A bit harder to switch out the dataset
| completely, as that's really what I'm interested in :) I
| think the general point that finetuning a model for some
| custom task works is fairly uncontroversial, but if
| OpenAI's poor performance was on account of these kinds of
| guardrails it'd be yet another reason someone might want to
| finetune their own models I guess.
| gillesjacobs wrote:
| I use LLM information extraction for financial news articles
| with OpenAI Azure and it is a huge problem for me.
|
| A 400 content moderation response on 4% of articles. This is
| just financial news text.
|
| It is a prime reason we are considering open models.
| visarga wrote:
| What is a good fine-tuning script for Mistral and LLaMA3 on an
| A100?
| strickvl wrote:
| Depends a bit on where you're running it. This works on
| Modal, e.g., but they're just using axolotl under the hood,
| so you can connect to whatever cloud provider you prefer and
| run axolotl directly. I did my finetunes across local GPUs,
| but it would have been just as easy to do it in a cloud
| environment using the same axolotl config.
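| If you'd rather not use axolotl at all, a bare-bones LoRA
| finetune with the trl library looks roughly like this (the
| dataset path is a placeholder, and exact argument names
| shift between trl versions):
|
|     from datasets import load_dataset
|     from peft import LoraConfig
|     from trl import SFTConfig, SFTTrainer
|
|     # one prompt/completion pair per line, in a "text" field
|     ds = load_dataset("json", data_files="train.jsonl",
|                       split="train")
|     trainer = SFTTrainer(
|         model="mistralai/Mistral-7B-v0.1",
|         train_dataset=ds,
|         peft_config=LoraConfig(r=16, lora_alpha=32,
|                                task_type="CAUSAL_LM"),
|         args=SFTConfig(output_dir="out",
|                        per_device_train_batch_size=2,
|                        num_train_epochs=3),
|     )
|     trainer.train()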
| swalsh wrote:
| Unsloth is a great tool, super fast.
| strickvl wrote:
| But it's still single-GPU only for now. I've also heard
| great things about it, but I wanted to make maximum use of
| my multi-GPU local setup.
| pcwelder wrote:
| Here are some test data samples and corresponding closest train
| data rows to give you an idea of the task complexity.
|
| ---
|
| Test 1: KABUL, Afghanistan (Jan. 25, 2013) During a security
| operation in Andar district, Ghazni province, yesterday, an
| Afghan and coalition force killed the Taliban leader, Alaudin.
| Alaudin oversaw a group of insurgents responsible for conducting
| remote-controlled improvised explosive device and small-arms fire
| attacks against Afghan and coalition forces. Prior to his death,
| Alaudin was planning attacks against Afghan National Police in
| Ghazni province.
|
| Train: KABUL, Afghanistan (Jan. 8, 2013) - During a security
| operation in Washer district, Helmand province, yesterday, an
| Afghan and coalition force killed the Taliban leader, Mohammad
| Sayed, and one other insurgent. Mohammad Sayed distributed
| weapons and ammunition to Taliban fighters. Prior to his death,
| Sayed was attempting to acquire rockets for attacks targeting
| Afghan government officials in the province.
|
| ---
|
| Test 2: For Immediate Release
|
| KABUL, Afghanistan (Aug. 6, 2012) Afghan and coalition forces
| conducted a security operation in search of a Haqqani leader in
| Tsamkani district, Paktiya province, yesterday. During the
| operation the security force engaged a group of insurgents with a
| precision airstrike. After the strike, the Afghan and coalition
| security force conducted a follow-on assessment and confirmed
| several insurgents had been killed in the strike. They also
| confirmed the strike had not injured any civilians or damaged any
| civilian property.
|
| Train: For Immediate Release
|
| KABUL, Afghanistan (July 22, 2012) -- Afghan and coalition forces
| conducted a security operation in Muhammad Aghah district, Logar
| province, Saturday.
|
| During the operation, a group of armed insurgents were engaged
| with a precision airstrike. After the strike, the Afghan and
| coalition force conducted a follow-on assessment and confirmed
| multiple insurgents had been killed.
|
| The security force also confirmed the airstrike had not injured
| any civilians or damaged civilian property.
|
| ---
|
| Test 3: ISAF Joint Command Morning Operational Update March 24,
| 2011 ISAF Joint Command - Afghanistan 2011-03-S-081 For Immediate
| Release KABUL, Afghanistan (March 24, 2011) A separate Afghan and
| coalition security force targeted a Taliban IED cell leader in
| Kandahar today. The leader is responsible for planning, preparing
| and executing explosive-device attacks on Afghan civilians,
| Afghan and coalition security forces. The joint security force
| targeted the leader's suspected compound in Kandahar City based
| on tips from citizens. The security team contained the area and
| detained several suspected insurgents. There were no shots fired
| and no damage done to the targeted compound.
|
| Train: ISAF Joint Command Operational Update Dec. 22 ISAF Joint
| Command - Afghanistan 2010-12-S-267 2699, 2935, 3022, 3078 For
| Immediate Release Download PDF KABUL, Afghanistan (Dec. 22) -
| Several insurgents were killed by Afghan National Security and
| International Security Assistance Forces in separate clearing
| operations in southern Afghanistan over the last 24 hours. An
| Afghan Army and ISAF patrol spotted some insurgents emplacing an
| improvised explosive device in Sangin district, Helmand province
| today. After gaining positive identification, combined forces
| engaged the enemy position, killing two insurgents.
| botro wrote:
| Thanks for sharing this, It's well written and informative. I
| noticed you used 'temperature=1' in the GPT test for the example
| in the post. Is this best practice for a task requiring
| structured output? Have you tested other temperature settings? My
| casual understanding was that a temperature of 0 is best for
| these types of workloads while higher temperatures would be more
| effective for more 'creative' workloads.
| strickvl wrote:
| I followed whatever the guidance was for a specific model. Some
| of the LLM finetuning providers did indeed set the temperature
| to 0 and I followed that, but others suggested 1. I could
| probably iterate a bit to see what is best for each model,
| and I might well do that for the one I choose to double down
| on in subsequent iterations / finetunes. Thanks for the
| suggestion!
| Tiberium wrote:
| GPT models shouldn't be used at temp 1 unless you only care
| about creative writing. They get much worse at factual stuff
| and code than with lower temperatures. And yes, 3.5 Turbo is
| less affected by this, which might be why the models ranked
| in reverse order for you.
| mewpmewp2 wrote:
| For GPT, I would really urge you to try again with a
| temperature of 0. A temperature of 1 kind of forces it to
| fail.
|
| I would say this actually invalidates the whole thing.
| bongodongobob wrote:
| You never use 1 for stuff like this. 1 is for poetry and
| creative writing. You need to redo this with temp=0 imo.
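| With the openai Python client that's a one-line change; a
| minimal sketch of the extraction call (the model name,
| prompt and article text are placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the env
|     article_text = "KABUL, Afghanistan (Jan. 25, 2013) ..."
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         temperature=0,  # greedy-ish, deterministic decoding
|         response_format={"type": "json_object"},
|         messages=[
|             {"role": "system",
|              "content": "Extract the event fields as JSON."},
|             {"role": "user", "content": article_text},
|         ],
|     )
|     print(resp.choices[0].message.content)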
| XiphiasX wrote:
| 1) beat at what? 2) do they beat Claude 3.5 Sonnet?
| freehorse wrote:
| Did you read the article or just the title? It is all explained
| there.
| input_sh wrote:
| Have you tried clicking on the link and finding out?
| singularity2001 wrote:
| Just at the task of structured data extraction.
|
| So a very misleading title.
| furyofantares wrote:
| > So very misleading title
|
| Eh, I can see that, but to me "finetuned model" pretty
| strongly implies some specific task
| denhaus wrote:
| For anyone interested, we wrote a paper on a similar topic:
| https://www.nature.com/articles/s41467-024-45563-x
| dimask wrote:
| Thanks for putting in all this work and sharing it in such
| detail!
| Data extraction/structuring data is the only serious application
| of LLMs I have actually engaged in for real work and found
| useful. I had to extract data from experience sampling reports
| which I could not share online, so ChatGPT etc. was out of
| the question. There were sentences describing onsets and
| offsets of events and descriptions of what went on. I ran
| models through llama.cpp to turn these into CSV format with 4
| columns (onset, offset, description, plus one for whether a
| specific condition was met in that event, which had to be
| interpreted from the description). Giving some examples in
| the prompt of how I wanted it all structured was enough for
| many different models to do it right. Mixtral 8x7b was my
| favourite because it ran the fastest at that quality level
| on my laptop.
|
| I am pretty sure that a finetuned smaller model would be
| better and faster for this task. It would be great to start
| finetuning and sharing such smaller models: they do not have
| to be better than commercial LLMs that run online, as long as
| they are at least not worse. They are already much faster and
| cheaper, which is a big advantage for this purpose. There is
| already a need for these tasks to run offline when one cannot
| share the data with OpenAI and the like. Higher speed and
| lower cost also allow for more experimentation with more
| specific finetuning and prompts, with less worry about prompt
| token lengths and cost. This is an application where smaller,
| locally run, finetunable models can shine.
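| For concreteness, the loop described above, via the
| llama-cpp-python bindings (the model file and report text
| are placeholders):
|
|     from llama_cpp import Llama
|
|     llm = Llama(model_path="mixtral-8x7b.Q4_K_M.gguf",
|                 n_ctx=4096)
|     report = "Felt anxious from 9am until 9:40am ..."
|     prompt = (
|         "Convert the report to one CSV row with columns "
|         "onset,offset,description,condition_met.\n"
|         "Report: Slept badly from 1am to 3am; condition "
|         "did not apply.\n"
|         "Row: 01:00,03:00,slept badly,no\n"
|         "Report: " + report + "\nRow:")
|     out = llm(prompt, max_tokens=64, temperature=0)
|     row = out["choices"][0]["text"].strip()
|     print(row)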
| strickvl wrote:
| Thanks! Yes, one 'next step' I'd like to take (probably
| around the deployment / inference work I'm turning to now)
| will be to see just how small I can get the model. spaCy has
| been pushing this kind of workflow (models on the order of
| tens of MB) for years, and it's nice that there's a bit more
| attention to it. As you say, ideally I'd want lots of these
| tiny models that were super specialists at what they do, small
| in size and speedy in inference time. As I hinted towards the
| end of the post, however, keeping all that updated starts to
| get unwieldy at a certain point if you don't set it all up in
| the right way.
| hubraumhugo wrote:
| > Data extraction/structuring data is the only serious
| application of LLMs
|
| I fully agree. I realized this early on when experimenting with
| GPT-3 for web data extraction. After posting the first
| prototype on Reddit and HN, we started seeing a lot of demand
| for automating rule-based web scraping stacks (lots of
| maintenance, hard to scale). This eventually led to the
| creation of our startup (https://kadoa.com) focused on
| automating this "boring and hard" problem.
|
| It comes down to such relatively unexciting use cases, where
| AI adds the most value.
|
| AI won't eliminate our jobs, but it will automate tedious,
| repetitive work such as web scraping, form filling, and data
| entry.
| furyofantares wrote:
| The way you cut that quote turns it into an assertion that
| doesn't exist in parent post.
|
| They didn't make the (incorrect) statement that no other
| serious, useful application exists.
|
| But that's how it reads when you cut off before "I have
| actually engaged in for real work and found useful"
| jappgar wrote:
| To be fair the original sentence could still be implying
| the same thing. The second half of the sentence just sounds
| like a hedge.
| dimask wrote:
| Well, I specifically talked about things I have engaged
| with professionally. Obviously this cannot cover everything
| one may do, eg I do not build chatbots for customer
| service or stuff like that, thus I obviously cannot speak
| for all possible applications of LLMs and how useful they
| may be. I am pretty sure there will be useful
| applications in fields I am not and will not be engaged
| in as nobody engages with everything. However, some other
| things that I have tried (eg copilots, summarising
| scientific articles) imo create much more hype than real
| value. They can be a bit useful if you know what to
| actually use them for and what their limits are, but
| nowhere close to the hype they generate, and I just find
| myself googling again tbh. They are absolutely horrible,
| especially with more niche subjects and areas.
| On the other hand, data extraction and structuring has a
| quite universal application, has already demonstrated
| usefulness and potential, and seems a quite realistic,
| down to earth application that I am happy to see other
| people and startups working on. Not as fancy, and harder
| to build hype upon, but very useful regardless.
| Tiberium wrote:
| Did you release the dataset and the code for testing? It would be
| interesting to check how 3.5 Sonnet performs on this task.
| mewpmewp2 wrote:
| The dataset is there:
|
| https://huggingface.co/datasets/strickvl/isafpressreleases_t...
|
| but when looking at rows where GPT-4o was deemed inaccurate,
| it seems to me the label was wrong, or at least it wasn't
| possible to infer that label from the input text. But the
| finetuned model was able to predict it.
|
| Which makes me wonder whether the finetuned models are poisoned
| with eval data...
|
| See this one:
|
| > ISAF Joint Command Morning Operational Update, March 8, 2011
| ISAF Joint Command - Afghanistan 2011-03-S-022 For Immediate
| Release KABUL, Afghanistan (March 8, 2011) Afghan and coalition
| forces targeted a Taliban district chief, killed one insurgent
| and detained several others during an operation in Burkah
| district, Baghlan province, yesterday. The Taliban district
| chief maintains ties to Taliban senior leadership throughout
| Kunduz, Baghlan, and Takhar provinces. He is involved in
| purchasing weapons and IEDs. Intelligence reports led the
| security force to the targeted compound in the city, where
| Afghan forces called for all occupants to exit the buildings
| peacefully before conducting a search. During that time, an
| armed individual threatened the security force and the force
| returned fire, killing him. Several suspected insurgents were
| detained after initial questioning at the scene.
|
| It says "yesterday" on March 8, so you would assume March 7
| is the correct start_date, but it's labelled Mar 6, and the
| finetuned models get it "right", while GPT says Mar 7.
| wrsh07 wrote:
| I was wondering if there was some info in the bizarrely
| formatted date, but I think 022 is just the issue number:
| https://www.dvidshub.net/news/66703/correction-isaf-joint-
| co...
| mewpmewp2 wrote:
| Also, a lot of the wrong dates seem to be due to the text
| only having those date formats, which does make me wonder
| again how the finetuned models get this right unless they
| have been finetuned using eval data...
| alach11 wrote:
| Props to the author for releasing the data. My instinct is
| also to immediately suspect data leakage. It's super easy for
| this to happen. For example the original dataset could
| contain multiple articles about the same event.
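| One cheap way to check for that kind of leakage is to flag
| near-duplicate pairs across the split, e.g. with TF-IDF
| cosine similarity (the 0.8 threshold is a judgment call):
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
|
|     def near_duplicates(train_texts, test_texts, thresh=0.8):
|         vec = TfidfVectorizer().fit(train_texts + test_texts)
|         sims = cosine_similarity(vec.transform(test_texts),
|                                  vec.transform(train_texts))
|         # each test row whose closest train row is too close
|         return [(i, int(sims[i].argmax()), float(sims[i].max()))
|                 for i in range(len(test_texts))
|                 if sims[i].max() >= thresh]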
| mewpmewp2 wrote:
| 1. It would be nice to see examples where GPT-4o was inaccurate,
| but best performing models were accurate.
|
| 2. It would be nice to try again with 0 temperature, as I do a
| lot of structured data extraction. In my experience 0 temperature
| should always be used, and it can make a huge difference.
| A temperature of 1 essentially means that it will sometimes
| sample lower-probability tokens...
| sva_ wrote:
| Clickbait headline
| mewpmewp2 wrote:
| I took a look at a random row to try to find why mistakes were
| happening.
|
| Why is this one labelled with start_date: 2011-02-07?
|
| > Afghan, Coalition Forces Clear Northern Kandahar ISAF Joint
| Command - Afghanistan 2011-02-D-081 For Immediate Release KABUL,
| Afghanistan (Feb. 12) - Afghan and coalition forces set out to
| provide security and assist the local population during a
| clearing operation in a remote village in Shah Wali Kot district,
| Kandahar province, Feb. 8. District Chief of Police Bacha Khan,
| and his policemen; Afghan commandos from 2nd Company, 3rd
| Commando Kandak, along with U.S. service members from Special
| Operations Task Force - South, searched the village throughout
| the day and detained 20 suspected insurgents. Also found were 80
| pounds (36 kilograms) of homemade explosives and various
| improvised explosive device-making materials. Leading a squad
| during the operation was Afghan commando Sgt. Hafiz Rahman, who
| said this operation has shown him progress. "The people are
| respecting us," Rahman said. "They ask us if we want tea, or 'do
| we want bread?' They are thankful for the security." Children
| during the operation brought commandos blankets in the evening
| and offered them food throughout the day.
|
| Trying to find the source, I'm also not seeing any indication of
| Feb 7.
|
| https://www.dvidshub.net/news/65238/afghan-police-commandos-...
|
| ---------------
|
| And why is this one labelled Mar 6? GPT-4o and I personally
| find Mar 7 to be logical.
|
| ISAF Joint Command Morning Operational Update, March 8, 2011 ISAF
| Joint Command - Afghanistan 2011-03-S-022 For Immediate Release
| KABUL, Afghanistan (March 8, 2011) Afghan and coalition forces
| targeted a Taliban district chief, killed one insurgent and
| detained several others during an operation in Burkah district,
| Baghlan province, yesterday. The Taliban district chief maintains
| ties to Taliban senior leadership throughout Kunduz, Baghlan, and
| Takhar provinces. He is involved in purchasing weapons and IEDs.
| Intelligence reports led the security force to the targeted
| compound in the city, where Afghan forces called for all
| occupants to exit the buildings peacefully before conducting a
| search. During that time, an armed individual threatened the
| security force and the force returned fire, killing him. Several
| suspected insurgents were detained after initial questioning at
| the scene.
|
| But despite that the "finetuned" model also gets Mar 6. How does
| the finetuned model get Mar 6?
| kcorbitt wrote:
| (Disclaimer: I'm the founder of OpenPipe, one of the fine-tuning
| services OP tried and ultimately the one that produced the
| highest performing model, it appears.)
|
| Data extraction is a use case that fine-tuned models are
| _fantastic_ at, so I'm not surprised that OP got good results.
| That said, I've also found it's pretty easy to beat GPT-4 across
| many task types if you have a way of getting strong training
| data. We published some research[1] a week ago where we found
| that across 4 example tasks spanning creative summarization,
| question answering, data extraction and classification a fine-
| tuned Llama 3 8B was able to outperform GPT-4 on 3 of them. The
| key was to create a repeatable way of generating high-quality
| training data, which is also addressed in the post.
|
| [1]: https://openpipe.ai/blog/mixture-of-agents
| colordrops wrote:
| Why isn't someone providing a "meta model" that uses an LLM to
| choose between various fine tuned models depending on the
| question to get overall better results than gpt4?
| billmalarky wrote:
| Founding AI Engineer at OpenPipe here. Using a fine-tuned
| "router LLM" to route between various specialized (often,
| but not necessarily, fine-tuned) models depending on the
| input is becoming a common pattern in more modern "graph-
| like" LLM applications.
|
| See LangGraph's "conditional edges" concept here:
| https://langchain-
| ai.github.io/langgraph/concepts/low_level/...
|
| You can see how that "routing function" could include a call
| to a "Router LLM." And yes, fine tuning is a great method
| for improving the routing intelligence of said Router LLM.
|
| Great question btw!
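| Schematically, the router is just a cheap, constrained
| classification call in front of the specialized models (the
| route labels and model names here are made up):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     ROUTES = {"extraction": "ft:my-extraction-model",
|               "summarization": "ft:my-summarization-model"}
|
|     def pick_model(query: str) -> str:
|         # small "router LLM" classifies the request first
|         label = client.chat.completions.create(
|             model="gpt-3.5-turbo", temperature=0,
|             messages=[
|                 {"role": "system",
|                  "content": "Answer with exactly one of: "
|                             + ", ".join(ROUTES)},
|                 {"role": "user", "content": query},
|             ],
|         ).choices[0].message.content.strip()
|         # fall back to a generalist on unknown labels
|         return ROUTES.get(label, "gpt-4o")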
| sheepscreek wrote:
| Very loosely, isn't this what is happening inside most LLMs
| that have a "multi-head" mechanism?
| bashfulpup wrote:
| Already a big thing. See the constellation architecture used
| here:
|
| https://arxiv.org/html/2403.13313v1
| GlassOwAter wrote:
| Is this something that, as a tech enthusiast and not an
| expert, I can easily fine tune and run?
|
| My use case would be fine tuning on technical docs: specific
| news, 2 years of blog posts, primary source material, and
| Twitter explainer threads. I want to gather all the niche
| information on a topic from the last two years, dump it in,
| and have an LLM that is a subject-matter expert.
| w4nderlust wrote:
| Here is an example of the Predibase platform, referred to in
| the article for the Solar model, but which can also train
| Llama-3, Phi-3 and Mistral:
| https://www.youtube.com/watch?v=R2JQhzfaOFw&themeRefresh=1 I
| think you can assess for yourself whether it's easy enough
| for you. (Predibase founder here)
| afro88 wrote:
| Fine tuning doesn't quite work that way. You have to format
| the training data set as request/response. The idea of fine
| tuning is to get the model to output things in a specific
| format, style or structure.
|
| Your use case is better suited to RAG. This is where you
| retrieve data from a large dataset and inject it into the
| user's request so the AI model has the context it needs to
| answer accurately.
|
| But that's not a silver bullet and you would need to spend
| significant time on chunking strategy and ranking of results
| to hopefully get a decent response accuracy.
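| To illustrate the request/response shape mentioned above:
| OpenAI's finetuning endpoint, for instance, takes JSONL chat
| transcripts, one example per line (contents invented):
|
|     import json
|
|     examples = [
|         {"messages": [
|             {"role": "user",
|              "content": "Extract event fields: <article>"},
|             {"role": "assistant",
|              "content": '{"event": "raid", '
|                         '"province": "Ghazni"}'},
|         ]},
|     ]
|     with open("train.jsonl", "w") as f:
|         for ex in examples:
|             f.write(json.dumps(ex) + "\n")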
| babelfish wrote:
| Is using model responses to train a new model against the ToS
| for the major LLM providers (OpenAI, Anthropic, etc)?
| yreg wrote:
| There doesn't seem to be any restriction like that in OpenAI
| terms.
| zepton wrote:
| There is: "you may not... Use Output to develop models that
| compete with OpenAI"
|
| (from https://openai.com/policies/terms-of-use/)
| yreg wrote:
| Thanks, I've missed that.
|
| I suppose the Output could be washed by publishing it on
| the web and having another entity crawl it.
|
| OpenAI doesn't treat anyone else's content any
| differently, acting like it's all fair game, so why
| should we care?
| babelfish wrote:
| It seems like you do not work for OpenPipe (OP), so it
| probably doesn't matter for you, but it could (should)
| matter a whole lot for OpenPipe and/or their customers
| jrm4 wrote:
| At the risk of sounding like an old head:
|
| Seems to me then, priority one should be "free and open source
| all the models as hard as possible, so that EVERYONE can fine-
| tune."
|
| (This being a subset of the idea that free / open source is
| generally preferable for both freedom and quality.)
| klabb3 wrote:
| It seems to me this means whoever has hoarded and declared
| ownership of the most personal data will make the best
| products. Kinda like how some people liked their targeted ads
| because they're more "relevant", only now it's not just ads but
| useful products. Another winner is of course platform owners
| like Apple and Microsoft who can scrape your data off their
| apps and products, even locally. This is a much bigger edge
| than being 3-6 months ahead in model quality.
|
| I despise the centralization of this tech as well, and while
| it's hopeful that smaller fine tuned models are better, they
| won't win (and barely stand a chance) on the virtue of
| openness and privacy alone. The best we can hope for is
| proliferation in the small-medium sized business service space
| - that OpenAI tokens are not worth the extra expense if open
| models are commoditized and effective. This was probably Zuck's
| plan all along - to prevent centralized gate keepers in tech
| that's mainly benefiting his rivals. But the enemy of my enemy
| is my friend, so his actions may be the best he's ever done for
| the public good.
| jrm4 wrote:
| Your end point I think is exactly right.
|
| I think your first one is getting downvoted hard because your
| first sentence is not at all how any of this works.
|
| Sucking down personal data isn't JUST a bad idea for privacy;
| it's actually also bad for "making the best products." I
| think you're overstating the extent to which all that data
| that is stolen and sold to the highest bidder actually helps
| the company buying it.
| klabb3 wrote:
| Ah, thanks for pointing that out. I don't care much for
| LLMs at all, but my point was simply that whoever has data,
| and especially personalized data, has an upper hand in
| making LLMs into better end-user products, for those who
| like them. This may be underestimated right now, when most
| dick measuring compares model to model, not integration
| into a product.
|
| > data that is stolen and sold to the highest bidder
|
| I didn't necessarily mean the data brokers (although that's
| an interesting angle), but say Apple now has a bunch of
| info about your calendar, email, contacts, then clearly
| they have an upper hand in providing better products than
| an anonymous API call. Not all products need
| personalization but LLMs? I can think of tons of use cases.
| toisanji wrote:
| I'm most excited about getting a faster model. A model like
| GPT-4 can be overkill because it's too slow. What are the
| smallest finetuned models that could beat a GPT-4 model? Is
| it 7b, or could a 3b model like Phi-3 do well for tasks like
| classification and summarization?
| uptownfunk wrote:
| Remember folks there is no free lunch :)
| simonw wrote:
| I'd be interested to see how well these fine-tuned models compare
| to Claude 3 Haiku (or one of the more expensive Claude models)
| with a larger set of examples.
|
| The Claude models all have a 200,000 token limit and respond
| _really_ well to examples - you can feed them in as chat JSON
| message pairs of user input / ideal assistant output.
|
| Haiku is dirt cheap for this kind of thing and with 200,000
| tokens you can probably provide a dozen or so examples.
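| Something like this, assuming the anthropic Python client
| (the example pairs and article text are placeholders):
|
|     import anthropic
|
|     client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
|     # (input, ideal output) pairs become user/assistant turns
|     examples = [("<article 1>", '{"event": "raid"}')]
|     messages = []
|     for inp, out in examples:
|         messages += [{"role": "user", "content": inp},
|                      {"role": "assistant", "content": out}]
|     messages.append({"role": "user",
|                      "content": "<new article>"})
|     resp = client.messages.create(
|         model="claude-3-haiku-20240307",
|         max_tokens=1024,
|         messages=messages,
|     )
|     print(resp.content[0].text)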
| animanoir wrote:
| Anything beats GPT-4 nowadays, to be honest.
| w4nderlust wrote:
| We got very similar findings: we published a paper showing
| that smaller LLMs (3-7b), when finetuned with LoRA, can match
| or outperform GPT-4 on a variety of tasks (29 out of 31),
| including classification, summarization, info extraction, and
| "reasoning". https://arxiv.org/abs/2405.00732 (Predibase
| cofounder and coauthor of the paper)
| michaelortega01 wrote:
| At Predibase, we recently conducted 700+ fine-tuning experiments
| to benchmark the performance of popular open-source LLMs across
| 30 tasks and compared their results to GPT-4.
|
| 85% of the time they beat GPT-4.
|
| You can see the results here: https://predibase.com/fine-tuning-
| index.
|
| The site has a series of interactive charts and a link to our
| Arxiv paper.
___________________________________________________________________
(page generated 2024-07-01 23:01 UTC)