[HN Gopher] Smarter summaries with finetuning GPT-3.5 and chain of density
___________________________________________________________________
Smarter summaries with finetuning GPT-3.5 and chain of density
Author : ivanleomk
Score : 134 points
Date : 2023-11-13 16:12 UTC (6 hours ago)
(HTM) web link (jxnl.github.io)
(TXT) w3m dump (jxnl.github.io)
| huac wrote:
| nice work! generating good example data is the most important
| part of finetuning.
|
| imo summarization is also a fairly simple task -- I wouldn't be
| surprised if a fine-tuned open source model (e.g. Llama 13B /
| Mistral 7B) would get to similar performance.
| jxnlco wrote:
| for sure! the one thing i was surprised by was how little data
| gpt-3.5 needed; would love for a company to try out how the
| scaling laws work for those smaller models.
| robbomacrae wrote:
| I find that BART-large (410M parameters) [0] does a fine job at
| summarizing. In Summer AI I alternate between a copy of that
| BART-large getting hyper-trained on feedback and ChatGPT 3.5,
| and honestly I don't have a preference between the results.
|
| However, thanks to this article I might revisit the
| summarization techniques used and try a fine-tuned 3.5.
|
| It would be great to see these techniques compared to GPT-4
| Turbo.
|
| [0]: https://huggingface.co/facebook/bart-large-cnn
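|
| For reference, a minimal sketch of running that checkpoint with
| the transformers summarization pipeline (the length limits are
| illustrative values, not tuned ones):
|
|     from transformers import pipeline
|
|     # "facebook/bart-large-cnn" is the checkpoint linked in [0]
|     summarizer = pipeline("summarization",
|                           model="facebook/bart-large-cnn")
|
|     article_text = "..."  # the document to summarize
|     result = summarizer(article_text, max_length=130,
|                         min_length=30, do_sample=False)
|     print(result[0]["summary_text"])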
| themonk911 wrote:
| Gotta admit I spent some time thinking this was a new technique
| called 'chain of _destiny_' and was reading through the article
| trying to work out what kind of fate-based prompt engineering
| was happening.
| intelVISA wrote:
| Did the exact same thing :)
| mpalmer wrote:
| https://m.youtube.com/watch?v=jGxuWWGo8AY&t=9
| rzzzt wrote:
| It's a forgotten Wolfenstein sequel!
| Der_Einzige wrote:
| One of the fun parts of AI is finding out that abstractive
| summarization is "easy", but extractive summarization (which is
| what humans do far more often in practice) is still very hard.
| Partly because most datasets assume sentence level extractive
| summarization, which is often not how humans summarize documents.
|
| There's still tons of very low-hanging fruit in summarization
| work. I'm not aware of significant follow-up work to pointer
| networks besides pointer-generator networks, which these days
| are considered old news. Pointer-based architectures are the
| ideal system for word-level extractive summarizers, yet the very
| best extractive summarization systems today are usually nothing
| more than sentence selectors using some kind of embeddings and
| cosine similarity.
|
| Happy to see such success with abstractive summaries, but the
| kind that I and most other humans are interested in is still
| far from solved.
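|
| To make the baseline concrete, a minimal sketch of that kind of
| sentence selector (the embedding model and k are illustrative
| choices):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def extractive_summary(sentences, k=3):
|         # embed and L2-normalize each sentence
|         emb = model.encode(sentences)
|         emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
|         # score by cosine similarity to the document centroid
|         centroid = emb.mean(axis=0)
|         centroid = centroid / np.linalg.norm(centroid)
|         scores = emb @ centroid
|         # keep the top-k sentences, in document order
|         top = sorted(np.argsort(scores)[-k:])
|         return [sentences[i] for i in top]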
| msp26 wrote:
| Could you point me to more reading on extractive summarisation?
| A lot of what I see feels out of date compared to what should
| be possible now with LLMs.
| esafak wrote:
| Those repeated calls sound like a good way to rack up a bill and
| incur a high latency.
| jxnlco wrote:
| right, which is why fine-tuning on the last, densest iteration
| is a great cost save that still preserves quality
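|
| A sketch of the idea: collect (article, final summary) pairs
| from the expensive multi-call pipeline, then distill them into
| OpenAI's chat fine-tuning JSONL format so inference needs just
| one call. The system prompt here is illustrative, and
| `examples` is assumed to hold those pairs:
|
|     import json
|
|     with open("train.jsonl", "w") as f:
|         for article, final_summary in examples:
|             f.write(json.dumps({"messages": [
|                 {"role": "system",
|                  "content": "Write a concise, entity-dense "
|                             "summary."},
|                 {"role": "user", "content": article},
|                 {"role": "assistant", "content": final_summary},
|             ]}) + "\n")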
| jph00 wrote:
| Minor correction: the article describes Chain of Density as
| "First introduced by Salesforce's AI Research wing" -- however
| the 1st author (who is a PhD student) and senior author are both
| at Columbia; only one of the 5 authors is at Salesforce.
| hackernewds wrote:
| Be prepared to see all these companies "invent" these
| techniques. fwiw people believe OpenAI "invented" ChatGPT,
| whereas the inventors of the transformer model were all at
| Google Brain during that research, and have since gone on to
| found competing companies.
| vinni2 wrote:
| The novelty of ChatGPT was instruction tuning of transformers
| using reinforcement learning from human feedback, plus finding
| the right dataset and annotations for it. Before this,
| transformers were good for some tasks but not so good for
| generating text. Even though OpenAI didn't invent transformers,
| they did invent the technique needed to make ChatGPT possible.
| jxnlco wrote:
| I'll fix this now!
| sandGorgon wrote:
| has anyone fine-tuned GPT-3.5 or Llama etc. using their private
| data? what is the best practice to generate training data?
|
| one way i have heard of is to send a chunk of data to GPT-4 and
| ask for questions to be generated. unsure of other ways. what
| has worked well?
| vjb2tq4dws wrote:
| here is an example of how to generate synthetic data that you
| can adapt for your case:
| https://dzlab.github.io/2023/09/22/palm-synthetic-data/
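|
| a rough sketch of the chunk-to-questions approach mentioned
| upthread, using the OpenAI client (prompt wording and model
| name are illustrative):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def questions_for(chunk, n=3):
|         # ask GPT-4 for questions this chunk answers; pair
|         # each question with the chunk to build training data
|         resp = client.chat.completions.create(
|             model="gpt-4",
|             messages=[{"role": "user",
|                        "content": f"Write {n} questions that "
|                                   f"the following text answers, "
|                                   f"one per line:\n\n{chunk}"}],
|         )
|         return resp.choices[0].message.content.splitlines()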
| just_boost_it wrote:
| Is this proven to work? ML models are usually trained to
| learn a model of the environment by giving them environment
| data. I would have expected that feeding it model outputs just
| trains it to learn a model of the model that created the data.
|
| Without seeing some kind of demonstration otherwise, my
| feeling is that it would be like regressing stock price on
| inflation, then trying to generate more data using the
| regression model and random inflation numbers. All you'd
| learn is the model that you put in to generate the data.
| valine wrote:
| I'd think of it less like teaching the model something new,
| and more like enforcing a behavior the model can already
| output. Any decent raw model can output function names and
| parameters with prompt engineering. To do function calling,
| you need the model to output function names reliably for a
| wide variety of prompts. That's where the fine-tuning comes
| in.
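|
| If I understand OpenAI's chat fine-tuning format correctly, a
| training row for enforcing a reliable function call would look
| roughly like this (the function name and schema here are made
| up for illustration):
|
|     # one hypothetical JSONL row, shown as a Python dict
|     row = {
|         "messages": [
|             {"role": "user", "content": "Weather in Paris?"},
|             {"role": "assistant",
|              "function_call": {
|                  "name": "get_weather",
|                  "arguments": "{\"city\": \"Paris\"}"}},
|         ],
|         "functions": [
|             {"name": "get_weather",
|              "parameters": {
|                  "type": "object",
|                  "properties": {
|                      "city": {"type": "string"}}}},
|         ],
|     }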
| just_boost_it wrote:
| I could very easily believe that if I saw proof, but it
| just feels a bit wrong to train a model on model outputs.
|
| Even in the main article here, the model did better with fewer
| fine-tuning examples. To us, the auto-generated examples might
| look different enough and good enough, but they were all
| generated algorithmically. Feeding more of them in might easily
| lead it to focus on some artifact of the embeddings or of the
| generating model that we just don't perceive.
| visarga wrote:
| > it just feels a bit wrong to train a model on model
| outputs
|
| If you have a small student model and a large teacher it makes
| sense; the student is better off after this distillation.
|
| If you have a way to filter out low-quality synthetic examples,
| then it would be useful to generate a bunch more and take the
| best (sketched below).
|
| If your LLM is an agent, then it can generate feedback
| signals from the environment. Even a human-AI chat is a
| form of environment for the model. Every human response
| can be evaluated as positive or negative reward.
|
| More fundamentally, organic datasets are very unbalanced; LLMs
| need more complex reasoning chains than are usually available.
| There are some exceptions -- in scientific papers, manuals and
| code you get very complex reasoning chains -- but not in
| general. This issue can be fixed with synthetic data.
|
| And even in principle, if you have a model at level N and
| want to make a dataset at level N+1, then you need to
| boost your model. You can give it more tokens, more
| attempts or more tools.
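|
| The filter-then-keep idea might look something like this (the
| generate and score functions are hypothetical stand-ins for a
| generator model and a quality filter):
|
|     def best_synthetic_examples(prompts, generate, score,
|                                 n=8, threshold=0.8):
|         kept = []
|         for p in prompts:
|             # sample several candidates per prompt
|             candidates = [generate(p) for _ in range(n)]
|             best = max(candidates, key=score)
|             # drop prompts whose best candidate is still weak
|             if score(best) >= threshold:
|                 kept.append((p, best))
|         return kept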
| SubiculumCode wrote:
| If it's a small amount of data, it seems RAG pipelines are
| better. That's all I think I know.
| tobbe2064 wrote:
| Am I reading it right that they fine-tune a model using 20
| examples and 5 epochs? That seems really weird to me.
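|
| If so, in OpenAI API terms that would be a tiny JSONL file and
| an explicit epoch count, something like this (the file ID is a
| placeholder):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     job = client.fine_tuning.jobs.create(
|         training_file="file-abc123",      # 20-example JSONL
|         model="gpt-3.5-turbo",
|         hyperparameters={"n_epochs": 5},
|     )
|     print(job.id)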
| isoprophlex wrote:
| Can't overfit when your learning rate is zero! _insert smart
| thinking meme_
| riku_iki wrote:
| LLMs are few-shot learners; that's why many people put examples
| into the prompt. This is the next step.
| ed wrote:
| I don't believe few-shot performance dictates how quickly you
| can fine-tune.
|
| Most fine-tunes will have much larger datasets (I am under the
| impression you want tens of thousands of examples for most
| runs).
|
| So I'm similarly impressed that 20 examples would make such a
| big difference.
|
| But also note entity density decreases as example count
| increases. This is counterintuitive -- maybe something else
| is going on here?
___________________________________________________________________
(page generated 2023-11-13 23:00 UTC)