[HN Gopher] My finetuned models beat OpenAI's GPT-4
       ___________________________________________________________________
        
       My finetuned models beat OpenAI's GPT-4
        
       Author : majc2
       Score  : 351 points
       Date   : 2024-07-01 08:53 UTC (14 hours ago)
        
 (HTM) web link (mlops.systems)
 (TXT) w3m dump (mlops.systems)
        
       | scosman wrote:
       | And that's the point of fine tuning models.
       | 
       | Still good to see someone walk through their fine tuning process,
       | with a mix of hosted and local options.
        
         | scosman wrote:
          | On that note: is there a good service for "here's my dataset,
          | please fine-tune these 9 models and give me evaluation
          | stats"?
        
           | strickvl wrote:
            | OpenPipe - https://openpipe.ai/ - is probably the service
           | that most closely resembles what you're asking for, but I
           | found the evals weren't really what I wanted -- i.e.
           | following my custom evaluation criteria -- so you probably
           | will end up having to do that yourself anyway. But for the
           | finetuning, they're all somewhat the same. Predibase and
           | OpenPipe are two good options for that. Predibase has more
           | base models for you to finetune, but it's a bit more unwieldy
           | to work with. I wrote about that in a previous post here --
           | https://mlops.systems/posts/2024-06-17-one-click-
           | finetuning.....
        
             | kcorbitt wrote:
             | (Disclaimer: founder of OpenPipe). Thanks for the shout-
             | out. Note that we're actively working on improved
             | evaluations that will let you add more specific criteria as
              | well as more evaluation types, like comparing field values
              | to those of a golden dataset. This is definitely something
             | that customers are asking for!
        
             | scosman wrote:
              | Wild to see them advertising collecting GPT-4 responses
              | for training other models. That's definitely not allowed
              | by the TOS. I suspect many do, but front-page advertising
              | is another thing entirely.
        
           | tucnak wrote:
           | Together.AI is a good starting point. Even though I'm not
           | sure what fine-tuning method they're using, the results are
           | REALLY good.
        
           | w4nderlust wrote:
            | Predibase ( http://predibase.com ), also referenced in the
            | article, is a platform specifically designed for exactly
            | that. It also has "repos" for finetuning multiple models,
            | comparing their performance, and keeping things organized.
            | It also allows you to query any of the finetuned models on
            | the fly from a single GPU with multi-LoRA serving.
            | (Predibase founder here)
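            | 
            | To make the multi-LoRA part concrete, here is a rough
            | sketch of what querying one adapter among many looks like
            | from the client side (the endpoint and adapter names are
            | placeholders, not our exact API):
            | 
            |     from openai import OpenAI
            | 
            |     # One deployment hosts the shared base model; each
            |     # request names the LoRA adapter to apply. The URL
            |     # and ids below are illustrative placeholders.
            |     client = OpenAI(
            |         base_url="https://serving.example.com/v1",
            |         api_key="...",
            |     )
            | 
            |     resp = client.chat.completions.create(
            |         model="my-finetuned-adapter-v2",  # adapter id
            |         messages=[{"role": "user",
            |                    "content": "Extract the date."}],
            |         temperature=0,
            |     )
            |     print(resp.choices[0].message.content)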
        
         | geokon wrote:
          | As I understood it, the point was not that they fine-tuned a
          | model and it got better.
          | 
          | They used a much simpler model, fine-tuned it, and managed to
          | beat a far more advanced model.
        
           | wongarsu wrote:
           | When jumping from 7B parameters to 70B to 400B (or whatever
           | GPT-4 uses) most of the additional neurons seem to go towards
           | a better world model and better reasoning (or whatever you
           | want to call the inference of new information from known
           | information). There doesn't seem to be any major improvements
           | in basic language skills past 7B, and even 1B and 3B models
           | do pretty well on that front.
           | 
           | In that sense it's not that surprising that on a pure text
           | extraction task with little "thinking" required a 7B model
           | does well and outperforms other models after fine tuning. In
           | the "noshotsfired" label GPT-4 is even accused of
           | overthinking it.
           | 
            | It is interesting how finetuned mistral-7b and llama3-8b
            | outperform finetuned gpt3.5-turbo. I would tend to attribute
            | that to those models being newer and "more advanced" despite
            | their low parameter count, but maybe that's reading too much
            | into a small score difference.
        
             | scosman wrote:
             | Re: 7b models vs gpt-3.5, I'm guessing different fine
             | tuning parameters can account for the difference. The
             | OpenAI fine tuning is a black box.
        
           | scosman wrote:
           | That's still the point. That model now does exactly one
           | thing, and because of that can do better than a model 50x the
           | size that tries to do everything. It will crush it in
           | instruction following and consistency.
           | 
           | A fine tuned 500b parameter model would probably beat the
           | fine tuned 7b model, but only by a bit (depending on task
           | obviously). A lot of that capacity is being used for
           | knowledge, and isn't needed for extraction/classification
           | tasks. Fine tuning isn't touching most of those weights. The
           | smaller models need to focus on more general language skills,
           | not answering "describe the evolution of France's economy in
           | the 1800s".
        
       | gillesjacobs wrote:
        | This is entirely unsurprising and in line with the finding that
        | even small specialized models do better at information
        | extraction and text classification. So it's no wonder finetuned
        | large LMs do well too.
       | 
        | Personally, my PhD was on fine-grained ACE-like event and
        | sentiment extraction, and "small" specialized finetuned
        | transformers like BERT and RoBERTa-large outperformed prompted
        | LLMs. Would love to see small-model scores included alongside
        | some SOTA pipelines.
       | 
       | This is great work anyway even if it replicates known results!
        
         | pandatigox wrote:
         | Your thesis sounds interesting! Do you have a link to it by any
         | chance?
        
           | wuschel wrote:
           | Seconded! Any URI to your PhD?
        
             | rovr138 wrote:
             | Check https://www.researchgate.net/publication/356873749_Ex
             | tractin...
        
           | rovr138 wrote:
           | Check https://www.researchgate.net/publication/356873749_Extr
           | actin...
        
           | gillesjacobs wrote:
           | rovr beat me to it below. Here are more links:
           | https://jacobsgill.es/phdobtained (fun fact: because my
           | thesis contains published papers, I am in breach of a few
            | journals' copyright by uploading my own thesis pdf, but
           | fuck'em).
           | 
            | LLM approaches were evaluated on my own time and never
            | published (I left research after obtaining my PhD).
        
             | pandatigox wrote:
             | Thank you for the link! And congratulations on obtaining
             | your PhD
             | 
             | I have skimmed through it and it's truly amazing how good
             | annotation of the dataset can lead to impressive results.
             | 
             | I apologise in advance if the question seems ignorant: The
             | blog post talked about fine-tuning models online. Given
             | that BERT models can run comfortably on even iPhone
             | hardware, were you able to finetune your models locally or
             | did you have to do it online too? If so, are there any
             | products that you recommend?
        
               | gillesjacobs wrote:
                | Thanks! The fine-tunes were done in 2019-21 on a 4xV100
                | server with hyperparameter search, so thousands of
                | individual fine-tuned models were trained in the end. I
                | used Weights & Biases to dashboard the hyperparameter
                | search experiments, but the hardware was our own GPU
                | server (no cloud service used).
               | 
               | I doubt you can fine-tune BERT-large on a phone. A
               | quantized, inference optimised pipeline can be leaps and
               | bounds more efficient and is not comparable with the
               | huggingface training pipelines on full models I did at
               | the time. For non-adapter based training you're going to
               | need GPUs ideally.
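                | 
                | For flavour, those full-model runs followed roughly
                | the standard huggingface recipe. A minimal sketch
                | (the dataset and hyperparameters here are
                | illustrative stand-ins, not my actual code):
                | 
                |     from datasets import load_dataset
                |     from transformers import (
                |         AutoModelForSequenceClassification,
                |         AutoTokenizer, Trainer, TrainingArguments)
                | 
                |     name = "bert-large-uncased"
                |     tok = AutoTokenizer.from_pretrained(name)
                |     model = (AutoModelForSequenceClassification
                |              .from_pretrained(name, num_labels=2))
                | 
                |     # Stand-in corpus; the real data was our own
                |     # annotated event/sentiment dataset.
                |     ds = load_dataset("imdb")
                |     ds = ds.map(lambda b: tok(b["text"],
                |                               truncation=True),
                |                 batched=True)
                | 
                |     args = TrainingArguments(
                |         output_dir="out", learning_rate=2e-5,
                |         per_device_train_batch_size=16,
                |         num_train_epochs=3)
                | 
                |     Trainer(model=model, args=args, tokenizer=tok,
                |             train_dataset=ds["train"],
                |             eval_dataset=ds["test"]).train()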
        
             | Mockapapella wrote:
             | This is really cool -- thanks for posting it! I'll have to
             | skim through it at some point since a lot of my work is in
              | classification models and mirrors the results you've seen.
        
             | SpaceManNabs wrote:
             | > because my thesis contains published papers, ..., but f
             | 'em
             | 
             | Excluding the part in the middle because I don't wanna
             | repost potential issues for you. I just wanted to comment
             | that that is terrible. People often talk about the siloed
             | nature of research in industry, without considering that
             | academia supports the draconian publishing system. I
             | understand IP protection, but IP protection doesn't have to
             | mean no access. This is such a huge issue in the bio- world
             | (biostats, genetics, etc).
        
             | uolmir wrote:
             | I don't know your circumstances but often you retain the
             | right to distribute a "post print", ie the final text as
             | published but absent journal formatting. A dissertation
             | should fit that definition.
        
               | gillesjacobs wrote:
                | This is indeed often the case; however, my university
                | reviews each thesis, and deemed that mine can only
                | become open access in 2026 (5 years from the defense).
                | 
                | I think this is the default policy here for theses
                | based on publication agreements.
               | 
               | In any case, I am not too worried.
        
         | renegade-otter wrote:
          | The caveat here is that if you don't know how to create good
          | specialized models, you are just wasting everyone's time and
          | money:
         | 
         | https://www.threads.net/@ethan_mollick/post/C46AfItO8RS?hl=e...
        
           | gillesjacobs wrote:
            | Exactly. BloombergGPT performed worse on financial sentiment
            | analysis than much smaller fine-tuned BERT-based models.
           | 
           | For many extractive tasks BloombergGPT was quite
           | disappointing. A 5-10% performance hit with much larger
           | inference cost compared to smaller models is not desirable.
           | 
            | But the research investment makes sense for Bloomberg: a
            | do-it-all generative model can mean a significant reduction
            | in maintenance complexity and deployment overhead.
           | 
           | It didn't directly pay off for many extractive tasks, but I
           | bet they're iterating. Bloomberg has the data moat and the
           | business needs in their core products to make it worthwhile.
        
       | courseofaction wrote:
       | Really interesting. Could the potentially controversial content
       | of the target news article have an effect on ChatGPT's ability to
       | summarize it?
        
         | strickvl wrote:
         | I think not. Normally if you get those kinds of errors you
         | wouldn't get any output at all. In the blog I show that all 724
         | of the test cases got proper JSON output etc for the queries so
         | I don't think this was an issue. I think these kinds of topics
         | would have been well covered in the training data, and probably
         | the OSS models would have used similar data so I don't even
         | think there's a disparity to be found between proprietary vs
         | OSS models here.
        
           | resource_waste wrote:
           | >Normally if you get those kinds of errors you wouldn't get
           | any output at all
           | 
              | I am not sure. I disagree. If there is a pro-ChatGPT user,
              | I'm probably it.
              | 
              | I've often seen it give significantly less effort to
              | answering the question.
        
             | strickvl wrote:
             | Interesting. I can maybe try finetuning one or two of the
             | so-called 'uncensored' open models and see if that makes a
             | difference. A bit harder to switch out the dataset
             | completely, as that's really what I'm interested in :) I
             | think the general point that finetuning a model for some
             | custom task works is fairly uncontroversial, but if
             | OpenAI's poor performance was on account of these kinds of
             | guardrails it'd be yet another reason someone might want to
             | finetune their own models I guess.
        
         | gillesjacobs wrote:
          | I use LLM information extraction for financial news articles
          | with Azure OpenAI, and this is a huge problem for me.
         | 
         | 404 Content moderation response in 4% of articles. This is just
         | financial news text.
         | 
         | It is a prime reason we are considering open models.
        
       | visarga wrote:
       | What is a good fine-tuning script for Mistral and LLaMA3 on an
       | A100?
        
         | strickvl wrote:
          | Depends a bit on where you're running, etc. This works on
          | Modal, e.g., but they're just using axolotl under the hood,
          | so you can connect to whatever cloud provider you're using
          | and run axolotl directly. I did my finetunes across local
          | GPUs, but it would have been just as easy to do in a cloud
          | environment using the same axolotl config.
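          | 
          | If you want to see roughly what axolotl wires up from that
          | config, the core of a LoRA finetune in plain peft +
          | transformers looks something like this (a sketch; the model
          | choice and hyperparameters are illustrative):
          | 
          |     from peft import LoraConfig, get_peft_model
          |     from transformers import (AutoModelForCausalLM,
          |                               AutoTokenizer)
          | 
          |     base = "mistralai/Mistral-7B-v0.1"
          |     tok = AutoTokenizer.from_pretrained(base)
          |     model = AutoModelForCausalLM.from_pretrained(base)
          | 
          |     lora = LoraConfig(
          |         r=16, lora_alpha=32, lora_dropout=0.05,
          |         target_modules=["q_proj", "k_proj",
          |                         "v_proj", "o_proj"],
          |         task_type="CAUSAL_LM")
          |     model = get_peft_model(model, lora)
          |     model.print_trainable_parameters()  # ~1% trainable
          | 
          |     # ...then train on prompt/completion pairs with the
          |     # usual transformers Trainer (or trl's SFTTrainer).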
        
         | swalsh wrote:
         | Unsloth is a great tool, super fast.
        
           | strickvl wrote:
           | But still only single GPU for now. I also heard great things
           | about it, but wanted to make the maximum use of my multi-GPU
           | local setup.
        
       | pcwelder wrote:
       | Here are some test data samples and corresponding closest train
       | data rows to give you an idea of the task complexity.
       | 
       | ---
       | 
       | Test 1: KABUL, Afghanistan (Jan. 25, 2013) During a security
       | operation in Andar district, Ghazni province, yesterday, an
       | Afghan and coalition force killed the Taliban leader, Alaudin.
       | Alaudin oversaw a group of insurgents responsible for conducting
       | remote-controlled improvised explosive device and small-arms fire
       | attacks against Afghan and coalition forces. Prior to his death,
       | Alaudin was planning attacks against Afghan National Police in
       | Ghazni province.
       | 
       | Train: KABUL, Afghanistan (Jan. 8, 2013) - During a security
       | operation in Washer district, Helmand province, yesterday, an
       | Afghan and coalition force killed the Taliban leader, Mohammad
       | Sayed, and one other insurgent. Mohammad Sayed distributed
       | weapons and ammunition to Taliban fighters. Prior to his death,
       | Sayed was attempting to acquire rockets for attacks targeting
       | Afghan government officials in the province.
       | 
       | ---
       | 
       | Test 2: For Immediate Release
       | 
       | KABUL, Afghanistan (Aug. 6, 2012) Afghan and coalition forces
       | conducted a security operation in search of a Haqqani leader in
       | Tsamkani district, Paktiya province, yesterday. During the
       | operation the security force engaged a group of insurgents with a
       | precision airstrike. After the strike, the Afghan and coalition
       | security force conducted a follow-on assessment and confirmed
       | several insurgents had been killed in the strike. They also
       | confirmed the strike had not injured any civilians or damaged any
       | civilian property.
       | 
       | Train: For Immediate Release
       | 
       | KABUL, Afghanistan (July 22, 2012) -- Afghan and coalition forces
       | conducted a security operation in Muhammad Aghah district, Logar
       | province, Saturday.
       | 
       | During the operation, a group of armed insurgents were engaged
       | with a precision airstrike. After the strike, the Afghan and
       | coalition force conducted a follow-on assessment and confirmed
       | multiple insurgents had been killed.
       | 
       | The security force also confirmed the airstrike had not injured
       | any civilians or damaged civilian property.
       | 
       | ---
       | 
       | Test 3: ISAF Joint Command Morning Operational Update March 24,
       | 2011 ISAF Joint Command - Afghanistan 2011-03-S-081 For Immediate
       | Release KABUL, Afghanistan (March 24, 2011) A separate Afghan and
       | coalition security force targeted a Taliban IED cell leader in
       | Kandahar today. The leader is responsible for planning, preparing
       | and executing explosive-device attacks on Afghan civilians,
       | Afghan and coalition security forces. The joint security force
       | targeted the leader's suspected compound in Kandahar City based
       | on tips from citizens. The security team contained the area and
       | detained several suspected insurgents. There were no shots fired
       | and no damage done to the targeted compound.
       | 
       | Train: ISAF Joint Command Operational Update Dec. 22 ISAF Joint
       | Command - Afghanistan 2010-12-S-267 2699, 2935, 3022, 3078 For
       | Immediate Release Download PDF KABUL, Afghanistan (Dec. 22) -
       | Several insurgents were killed by Afghan National Security and
       | International Security Assistance Forces in separate clearing
       | operations in southern Afghanistan over the last 24 hours. An
       | Afghan Army and ISAF patrol spotted some insurgents emplacing an
       | improvised explosive device in Sangin district, Helmand province
       | today. After gaining positive identification, combined forces
       | engaged the enemy position, killing two insurgents.
        
       | botro wrote:
        | Thanks for sharing this, it's well written and informative. I
       | noticed you used 'temperature=1' in the GPT test for the example
       | in the post. Is this best practice for a task requiring
       | structured output? Have you tested other temperature settings? My
       | casual understanding was that a temperature of 0 is best for
       | these types of workloads while higher temperatures would be more
       | effective for more 'creative' workloads.
        
         | strickvl wrote:
         | I followed whatever the guidance was for a specific model. Some
         | of the LLM finetuning providers did indeed set the temperature
         | to 0 and I followed that, but others suggested 1. I could
          | probably iterate a bit to see what is best for each model,
          | and I might well do that for whichever one I choose to
          | double down on in subsequent iterations / finetunes. Thanks
          | for the suggestion!
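          | 
          | For reference, this is the shape of the call in question;
          | the only knob under discussion is the temperature (a
          | sketch, not my exact harness):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          |     press_release = "KABUL, Afghanistan (Jan. 25, ...)"
          | 
          |     resp = client.chat.completions.create(
          |         model="gpt-4o",
          |         temperature=0,  # vs. 1: less sampling noise
          |         response_format={"type": "json_object"},
          |         messages=[
          |             {"role": "system",
          |              "content": "Extract start_date, province "
          |                         "and event_type as JSON."},
          |             {"role": "user", "content": press_release},
          |         ],
          |     )
          |     print(resp.choices[0].message.content)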
        
           | Tiberium wrote:
            | GPT models shouldn't be used at temp 1 unless you only care
            | about creative writing. They get much worse at factual
            | tasks and code than at lower temperatures. And yes, 3.5
            | Turbo is less affected by this, which might be why the
            | ranking came out reversed for you.
        
           | mewpmewp2 wrote:
            | For GPT, I would really urge you to try again with
            | temperature 0. A temperature of 1 kind of starts to force
            | it to fail.
            | 
            | I would say this actually invalidates the whole thing.
        
           | bongodongobob wrote:
           | You never use 1 for stuff like this. 1 is for poetry and
           | creative writing. You need to redo this with temp=0 imo.
        
       | XiphiasX wrote:
       | 1) beat at what? 2) do they beat Claude 3.5 Sonnet?
        
         | freehorse wrote:
         | Did you read the article or just the title? It is all explained
         | there.
        
         | input_sh wrote:
         | Have you tried clicking on the link and finding out?
        
         | singularity2001 wrote:
         | Just in the task of structured data extraction
         | 
         | So very misleading title
        
           | furyofantares wrote:
           | > So very misleading title
           | 
           | Eh, I can see that, but to me "finetuned model" pretty
           | strongly implies some specific task
        
       | denhaus wrote:
       | For anyone interested, we wrote a paper on a similar topic:
       | https://www.nature.com/articles/s41467-024-45563-x
        
       | dimask wrote:
        | Thanks for putting in all this work and sharing it in such
        | detail! Data extraction/structuring is the only serious
        | application of LLMs I have actually engaged in for real work
        | and found useful. I had to extract data from experience
        | sampling reports which I could not share online, thus ChatGPT
        | etc. was out of the question. There were sentences describing
        | onsets and offsets of events and descriptions of what went on.
        | I ran models through llama.cpp to turn these into CSV format
        | with 4 columns (onset, offset, description, plus one for
        | whether a specific condition was met in that event, which had
        | to be interpreted from the description). Giving some examples
        | in the prompt of how I wanted it all structured was enough for
        | many different models to do it right. Mixtral 8x7b was my
        | favourite because it ran the fastest at that quality level on
        | my laptop.
       | 
        | I am pretty sure that a finetuned smaller model would be better
        | and faster for this task. It would be great to start finetuning
        | and sharing such smaller models: they do not really have to be
        | better than commercial LLMs that run online, as long as they
        | are at least not worse. They are already much faster and
        | cheaper, which is a big advantage for this purpose. There is
        | already a need for these tasks to run offline, when one cannot
        | share the data with OpenAI and the like. Higher speed and lower
        | cost also allow for more experimentation with more specific
        | finetuning and prompts, with less concern about prompt token
        | lengths and cost. This is an application where smaller, locally
        | run, finetunable models can shine.
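        | 
        | For anyone wanting to replicate that offline setup, the
        | skeleton was roughly as follows (illustrative: the real prompt
        | had several worked examples, and the actual reports are not
        | shareable):
        | 
        |     from llama_cpp import Llama
        | 
        |     # Any instruct-tuned GGUF model works; the path is
        |     # an illustrative stand-in.
        |     llm = Llama(
        |         model_path="mixtral-8x7b-instruct.Q4.gguf",
        |         n_ctx=4096)
        | 
        |     prompt = """Turn each report into a CSV row:
        |     onset,offset,description,condition_met
        | 
        |     Report: Felt anxious from 9:15 until about 9:40
        |     while commuting to work.
        |     Row: 09:15,09:40,commuting,yes
        | 
        |     Report: Calm walk in the park from 14:00 to 15:00.
        |     Row:"""
        | 
        |     out = llm(prompt, max_tokens=64, temperature=0,
        |               stop=["\n\n"])
        |     print(out["choices"][0]["text"].strip())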
        
         | strickvl wrote:
          | Thanks! Yes, one 'next step' I'd like to take (probably
          | around the deployment / inference work I'm turning to now)
          | will be to see just how small I can get the model. spaCy has
          | been pushing this kind of workflow (models on the order of
          | tens of MB) for years and it's nice that there's a bit more
          | attention on it. As you say, ideally I'd want lots of these
          | tiny models that are super specialists at what they do, small
          | in size and speedy at inference time. As I hinted towards the
          | end of the post, however, keeping all that updated starts to
          | get unwieldy at a certain point if you don't set it all up in
          | the right way.
        
         | hubraumhugo wrote:
         | > Data extraction/structuring data is the only serious
         | application of LLMs
         | 
         | I fully agree. I realized this early on when experimenting with
         | GPT-3 for web data extraction. After posting the first
         | prototype on Reddit and HN, we started seeing a lot of demand
         | for automating rule-based web scraping stacks (lots of
         | maintenance, hard to scale). This eventually led to the
         | creation of our startup (https://kadoa.com) focused on
         | automating this "boring and hard" problem.
         | 
          | It's in such relatively unexciting use cases that AI adds
          | the most value.
         | 
         | AI won't eliminate our jobs, but it will automate tedious,
         | repetitive work such as web scraping, form filling, and data
         | entry.
        
           | furyofantares wrote:
            | The way you cut that quote turns it into an assertion that
            | doesn't exist in the parent post.
            | 
            | They didn't make the (incorrect) statement that no other
            | serious, useful application exists.
            | 
            | But that's how it reads when you cut off before "I have
            | actually engaged in for real work and found useful".
        
             | jappgar wrote:
             | To be fair the original sentence could still be implying
             | the same thing. The second half of the sentence just sounds
             | like a hedge.
        
               | dimask wrote:
                | Well, I specifically talked about things I have
                | engaged with professionally. Obviously this cannot
                | cover everything one may do; e.g. I do not build
                | chatbots for customer service or the like, so I
                | obviously cannot speak for all possible applications
                | of LLMs and how useful they may be. I am pretty sure
                | there will be useful applications in fields I am not
                | and will not be engaged in, as nobody engages with
                | everything. However, some other things that I have
                | tried (e.g. copilots, summarising scientific articles)
                | imo create much more hype than real value. They can be
                | a bit useful if you know what to actually use them for
                | and what their limits are, but nowhere close to the
                | hype they generate, and I just find myself googling
                | again tbh. They are absolutely horrible especially
                | with more niche subjects and areas. On the other hand,
                | data extraction and structuring has a quite universal
                | application, has already demonstrated usefulness and
                | potential, and seems a quite realistic, down-to-earth
                | application that I am happy to see other people and
                | startups working on. Not as fancy, and harder to build
                | hype upon, but very useful regardless.
        
       | Tiberium wrote:
       | Did you release the dataset and the code for testing? It would be
       | interesting to check how 3.5 Sonnet performs on this task.
        
         | mewpmewp2 wrote:
         | The dataset is there:
         | 
         | https://huggingface.co/datasets/strickvl/isafpressreleases_t...
         | 
          | but when looking for rows where GPT-4o was deemed inaccurate,
          | it seems to me the label was wrong, or at least that it
          | wasn't possible to infer that label from the input text. Yet
          | the finetuned model was able to predict it.
          | 
          | Which makes me wonder whether the finetuned models are
          | poisoned with eval data...
         | 
         | See this one:
         | 
         | > ISAF Joint Command Morning Operational Update, March 8, 2011
         | ISAF Joint Command - Afghanistan 2011-03-S-022 For Immediate
         | Release KABUL, Afghanistan (March 8, 2011) Afghan and coalition
         | forces targeted a Taliban district chief, killed one insurgent
         | and detained several others during an operation in Burkah
         | district, Baghlan province, yesterday. The Taliban district
         | chief maintains ties to Taliban senior leadership throughout
         | Kunduz, Baghlan, and Takhar provinces. He is involved in
         | purchasing weapons and IEDs. Intelligence reports led the
         | security force to the targeted compound in the city, where
         | Afghan forces called for all occupants to exit the buildings
         | peacefully before conducting a search. During that time, an
         | armed individual threatened the security force and the force
         | returned fire, killing him. Several suspected insurgents were
         | detained after initial questioning at the scene.
         | 
          | It claims "yesterday" on March 8, so you would assume March 7
          | is the correct start_date, but it's labelled Mar 6, and the
          | finetuned models get it "right", while GPT says Mar 7.
        
           | wrsh07 wrote:
           | I was wondering if there was some info in the bizarrely
           | formatted date, but I think 022 is just the issue number:
           | https://www.dvidshub.net/news/66703/correction-isaf-joint-
           | co...
        
             | mewpmewp2 wrote:
              | Also, a lot of the time the dates that are wrong seem to
              | be due to these formats, which does make me wonder again
              | how the finetuned models get them right unless they have
              | been finetuned using eval data...
        
           | alach11 wrote:
           | Props to the author for releasing the data. My instinct is
           | also to immediately suspect data leakage. It's super easy for
           | this to happen. For example the original dataset could
           | contain multiple articles about the same event.
        
       | mewpmewp2 wrote:
        | 1. It would be nice to see examples where GPT-4o was inaccurate
        | but the best performing models were accurate.
        | 
        | 2. It would be nice to try again with temperature 0, as I do a
        | lot of structured data extraction. In my experience temperature
        | 0 should always be used, and it can make a huge difference. A
        | temperature of 1 essentially means that it will more readily
        | pick lower-probability tokens...
        
       | sva_ wrote:
       | Clickbait headline
        
       | mewpmewp2 wrote:
       | I took a look at a random row to try to find why mistakes were
       | happening.
       | 
       | Why is this one labelled with start_date: 2011-02-07?
       | 
       | > Afghan, Coalition Forces Clear Northern Kandahar ISAF Joint
       | Command - Afghanistan 2011-02-D-081 For Immediate Release KABUL,
       | Afghanistan (Feb. 12) - Afghan and coalition forces set out to
       | provide security and assist the local population during a
       | clearing operation in a remote village in Shah Wali Kot district,
       | Kandahar province, Feb. 8. District Chief of Police Bacha Khan,
       | and his policemen; Afghan commandos from 2nd Company, 3rd
       | Commando Kandak, along with U.S. service members from Special
       | Operations Task Force - South, searched the village throughout
       | the day and detained 20 suspected insurgents. Also found were 80
       | pounds (36 kilograms) of homemade explosives and various
       | improvised explosive device-making materials. Leading a squad
       | during the operation was Afghan commando Sgt. Hafiz Rahman, who
       | said this operation has shown him progress. "The people are
       | respecting us," Rahman said. "They ask us if we want tea, or 'do
       | we want bread?' They are thankful for the security." Children
       | during the operation brought commandos blankets in the evening
       | and offered them food throughout the day.
       | 
       | Trying to find the source, I'm also not seeing any indication of
       | Feb 7.
       | 
       | https://www.dvidshub.net/news/65238/afghan-police-commandos-...
       | 
       | ---------------
       | 
        | And why is this one labelled as Mar 6? GPT-4o and I personally
        | find Mar 7 to be logical.
       | 
       | ISAF Joint Command Morning Operational Update, March 8, 2011 ISAF
       | Joint Command - Afghanistan 2011-03-S-022 For Immediate Release
       | KABUL, Afghanistan (March 8, 2011) Afghan and coalition forces
       | targeted a Taliban district chief, killed one insurgent and
       | detained several others during an operation in Burkah district,
       | Baghlan province, yesterday. The Taliban district chief maintains
       | ties to Taliban senior leadership throughout Kunduz, Baghlan, and
       | Takhar provinces. He is involved in purchasing weapons and IEDs.
       | Intelligence reports led the security force to the targeted
       | compound in the city, where Afghan forces called for all
       | occupants to exit the buildings peacefully before conducting a
       | search. During that time, an armed individual threatened the
       | security force and the force returned fire, killing him. Several
       | suspected insurgents were detained after initial questioning at
       | the scene.
       | 
        | But despite that, the "finetuned" model also gets Mar 6. How
        | does the finetuned model get Mar 6?
        
       | kcorbitt wrote:
       | (Disclaimer: I'm the founder of OpenPipe, one of the fine-tuning
       | services OP tried and ultimately the one that produced the
       | highest performing model, it appears.)
       | 
       | Data extraction is a use case that fine-tuned models are
        | _fantastic_ at, so I'm not surprised that OP got good results.
       | That said, I've also found it's pretty easy to beat GPT-4 across
       | many task types if you have a way of getting strong training
       | data. We published some research[1] a week ago where we found
       | that across 4 example tasks spanning creative summarization,
       | question answering, data extraction and classification a fine-
       | tuned Llama 3 8B was able to outperform GPT-4 on 3 of them. The
       | key was to create a repeatable way of generating high-quality
       | training data, which is also addressed in the post.
       | 
       | [1]: https://openpipe.ai/blog/mixture-of-agents
        
         | colordrops wrote:
         | Why isn't someone providing a "meta model" that uses an LLM to
         | choose between various fine tuned models depending on the
         | question to get overall better results than gpt4?
        
           | billmalarky wrote:
            | Founding AI Engineer at OpenPipe here. Using a fine-tuned
            | "router LLM" to route between various specialized applied
            | models (fine-tuned or not) depending on the input is
            | becoming a common pattern in more modern "graph-like" LLM
            | applications.
           | 
           | See LangGraph's "conditional edges" concept here:
           | https://langchain-
           | ai.github.io/langgraph/concepts/low_level/...
           | 
            | You can see how that "routing function" could include a call
            | to a "Router LLM." And yes, fine-tuning is a great way to
            | improve the routing intelligence of said Router LLM.
           | 
           | Great question btw!
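            | 
            | Stripped of any framework, the routing function itself is
            | tiny. A sketch (all model names here are placeholders, not
            | OpenPipe APIs):
            | 
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            |     SPECIALISTS = {"extract": "ft:extract-v3",
            |                    "summarize": "ft:summarize-v1"}
            | 
            |     def route(user_input: str) -> str:
            |         # A cheap router LLM labels the request; the
            |         # matching specialist model then serves it.
            |         label = client.chat.completions.create(
            |             model="ft:router-v2",
            |             temperature=0,
            |             messages=[
            |                 {"role": "system",
            |                  "content": "Reply with one word: "
            |                             "extract or summarize."},
            |                 {"role": "user", "content": user_input},
            |             ],
            |         ).choices[0].message.content.strip().lower()
            |         return SPECIALISTS.get(label, "gpt-4o")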
        
           | sheepscreek wrote:
           | Very loosely, isn't this what is happening inside most LLMs
           | that have a "multi-head" mechanism?
        
           | bashfulpup wrote:
           | Already a big thing. See the constellation architecture used
           | here:
           | 
           | https://arxiv.org/html/2403.13313v1
        
         | GlassOwAter wrote:
          | Is this something that, as a tech enthusiast who's no expert,
          | I can easily fine-tune and run?
          | 
          | My use case would be fine-tuning on technical docs: specific
          | news, 2 years of blog posts, primary source material, and
          | Twitter explainer threads. I want to gather all the niche
          | information on a topic from the last two years, dump it into
          | this, and have an LLM that is a subject-matter expert.
        
           | w4nderlust wrote:
            | Here is an example of the Predibase platform, referenced in
            | the article for the Solar model, but it can also train
            | Llama-3, Phi-3 and Mistral.
            | https://www.youtube.com/watch?v=R2JQhzfaOFw&themeRefresh=1 I
            | think you can assess for yourself whether it's easy enough
            | for you. (Predibase founder here)
        
           | afro88 wrote:
            | Fine tuning doesn't quite work that way. You have to format
            | the training data set as request/response pairs. The idea
            | of fine tuning is to get the model to output things in a
            | specific format, style or structure (one such pair is
            | sketched below).
            | 
            | Your use case is better suited to RAG. This is where you
            | retrieve data from a large dataset and inject it into the
            | user's request so the AI model has the context it needs to
            | answer accurately.
            | 
            | But that's not a silver bullet, and you would need to spend
            | significant time on chunking strategy and ranking of
            | results to hopefully get decent response accuracy.
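            | 
            | Concretely, one line of a chat-format training file looks
            | like this (a sketch of the usual JSONL layout; the field
            | values are illustrative):
            | 
            |     import json
            | 
            |     record = {"messages": [
            |         {"role": "system",
            |          "content": "Extract event fields as JSON."},
            |         {"role": "user",
            |          "content": "KABUL, Afghanistan (Jan. 25, "
            |                     "2013) During a security..."},
            |         {"role": "assistant",
            |          "content": '{"event_type": "kill-capture",'
            |                     ' "province": "Ghazni"}'},
            |     ]}
            | 
            |     with open("train.jsonl", "a") as f:
            |         f.write(json.dumps(record) + "\n")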
        
         | babelfish wrote:
         | Is using model responses to train a new model against the ToS
         | for the major LLM providers (OpenAI, Anthropic, etc)?
        
           | yreg wrote:
           | There doesn't seem to be any restriction like that in OpenAI
           | terms.
        
             | zepton wrote:
             | There is: "you may not... Use Output to develop models that
             | compete with OpenAI"
             | 
             | (from https://openai.com/policies/terms-of-use/)
        
               | yreg wrote:
               | Thanks, I've missed that.
               | 
               | I suppose the Output could be washed by publishing it on
               | the web and having another entity crawl it.
               | 
                | OpenAI doesn't treat anyone else's content any
                | differently, acting like it's fair game, so why should
                | we care.
        
               | babelfish wrote:
               | It seems like you do not work for OpenPipe (OP), so it
               | probably doesn't matter for you, but it could (should)
               | matter a whole lot for OpenPipe and/or their customers
        
       | jrm4 wrote:
        | At the risk of sounding like an old head:
       | 
       | Seems to me then, priority one should be "free and open source
       | all the models as hard as possible, so that EVERYONE can fine-
       | tune."
       | 
        | (This being a subset of the idea that free / open source is
        | generally preferable for both freedom and quality.)
        
         | klabb3 wrote:
         | It seems to me this means whoever has hoarded and declared
         | ownership of the most personal data will make the best
         | products. Kinda like how some people liked their targeted ads
         | because they're more "relevant", only now it's not just ads but
         | useful products. Another winner is of course platform owners
         | like Apple and Microsoft who can scrape your data off their
         | apps and products, even locally. This is a much bigger edge
         | than being 3-6 months ahead in model quality.
         | 
          | I despise the centralization of this tech as well, and while
          | it's encouraging that smaller fine-tuned models are better,
          | they won't win (or barely stand a chance) on the virtue of
          | openness and privacy alone. The best we can hope for is
          | proliferation in the small-to-medium-sized business service
          | space - that OpenAI tokens are not worth the extra expense if
          | open models are commoditized and effective. This was probably
          | Zuck's plan all along - to prevent centralized gatekeepers in
          | tech, which would mainly benefit his rivals. But the enemy of
          | my enemy is my friend, so his actions may be the best he's
          | ever done for the public good.
        
           | jrm4 wrote:
           | Your end point I think is exactly right.
           | 
           | I think your first one is getting downvoted hard because your
           | first sentence is not at all how any of this works.
           | 
              | Sucking down personal data isn't JUST a bad idea for
              | privacy, it's actually also bad for "making the best
              | products." I think you're overstating the extent to which
              | all that data that is stolen and sold to the highest
              | bidder actually helps the company buying it.
        
             | klabb3 wrote:
                | Ah, thanks for pointing that out. I don't care much for
                | LLMs at all, but my point was simply that whoever has
                | data, and especially personalized data, has an upper
                | hand in making LLMs into better end-user products, for
                | those that like them. This may be underestimated right
                | now, when most dick measuring compares model to model,
                | not integration into a product.
                | 
                | > data that is stolen and sold to the highest bidder
                | 
                | I didn't necessarily mean the data brokers (although
                | that's an interesting angle), but say Apple now has a
                | bunch of info about your calendar, email, and contacts;
                | then clearly they have an upper hand in providing
                | better products than an anonymous API call. Not all
                | products need personalization, but LLMs? I can think of
                | tons of use cases.
        
       | toisanji wrote:
        | I'm most excited about getting a faster model. A model like
        | GPT-4 can be overkill because it's too slow. What is the
        | smallest finetuned model that could beat GPT-4? Is it 7B, or
        | could a 3B model like Phi-3 do well on tasks like
        | classification and summarization?
        
       | uptownfunk wrote:
       | Remember folks there is no free lunch :)
        
       | simonw wrote:
       | I'd be interested to see how well these fine-tuned models compare
       | to Claude 3 Haiku (or one of the more expensive Claude models)
       | with a larger set of examples.
       | 
       | The Claude models all have a 200,000 token limit and respond
       | _really_ well to examples - you can feed them in as chat JSON
       | message pairs of user input / ideal assistant output.
       | 
       | Haiku is dirt cheap for this kind of thing and with 200,000
       | tokens you can probably provide a dozen or so examples.
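        | 
        | The pattern looks something like this (a sketch; the example
        | content is made up):
        | 
        |     import anthropic
        | 
        |     client = anthropic.Anthropic()
        |     new_report = "KABUL, Afghanistan (Aug. 6, 2012)..."
        | 
        |     resp = client.messages.create(
        |         model="claude-3-haiku-20240307",
        |         max_tokens=1024,
        |         system="Extract event fields as JSON.",
        |         messages=[
        |             # Worked example pairs; repeat a dozen or
        |             # so times within the 200k-token budget.
        |             {"role": "user",
        |              "content": "KABUL, Afghanistan (Jan. 8, "
        |                         "2013) ...killed the leader."},
        |             {"role": "assistant",
        |              "content": '{"event_type": "kill-capture"}'},
        |             # The actual input goes last.
        |             {"role": "user", "content": new_report},
        |         ],
        |     )
        |     print(resp.content[0].text)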
        
       | animanoir wrote:
        | Anything beats GPT-4 nowadays, to be honest.
        
       | w4nderlust wrote:
        | We got very similar findings: we published a paper showing that
        | smaller LLMs (3-7B), when finetuned with LoRA, can match or
        | outperform GPT-4 on a variety of tasks (29 out of 31) including
        | classification, summarization, info extraction, and
        | "reasoning". https://arxiv.org/abs/2405.00732 (Predibase
        | cofounder and coauthor of the paper)
        
       | michaelortega01 wrote:
       | At Predibase, we recently conducted 700+ fine-tuning experiments
       | to benchmark the performance of popular open-source LLMs across
       | 30 tasks and compared their results to GPT-4.
       | 
       | 85% of the time they beat GPT-4.
       | 
       | You can see the results here: https://predibase.com/fine-tuning-
       | index.
       | 
        | The site has a series of interactive charts and a link to our
        | arXiv paper.
        
       ___________________________________________________________________
       (page generated 2024-07-01 23:01 UTC)