[HN Gopher] Optimizing LLMs from a Dataset Perspective
___________________________________________________________________
Optimizing LLMs from a Dataset Perspective
Author : alexmolas
Score : 108 points
Date : 2023-09-15 15:49 UTC (7 hours ago)
(HTM) web link (sebastianraschka.com)
(TXT) w3m dump (sebastianraschka.com)
| pplonski86 wrote:
| What methods other than fine-tuning can make LLMs smarter? I'm
| familiar with RAG - Retrieval Augmented Generation.
| heliophobicdude wrote:
| Perhaps not smarter, but you can filter out answers in
| demonstration data that look computed rather than derived.
| Also, include demonstrations that show the steps leading up
| to a computation, in a format a runtime can parse and execute,
| similar to Code Interpreter / Advanced Data Analysis in
| ChatGPT.
|
| Taking it a step further, I would include in each
| demonstration a test harness set up with a test suite that
| verifies the proposed implementation.
|
| I would go through each demonstration with a fixed set of
| criteria, measuring not only whether tests pass but also
| whether they show a level of complexity and usefulness.
|
| Why? I was looking through CodeLlama's demonstration data for
| fine-tuning and saw answers that were not even checked for
| correctness or usefulness.
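|
| A minimal sketch of that filtering pass (the demonstration
| format and field names here are made up for illustration, not
| CodeLlama's actual pipeline):
|
|   import json
|   import os
|   import subprocess
|   import tempfile
|
|   def passes_tests(solution: str, tests: str, timeout: int = 10) -> bool:
|       # Write the proposed implementation plus its test suite
|       # to a temp file and run it in a subprocess.
|       with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
|           f.write(solution + "\n\n" + tests)
|           path = f.name
|       try:
|           result = subprocess.run(["python", path],
|                                   capture_output=True, timeout=timeout)
|           return result.returncode == 0
|       except subprocess.TimeoutExpired:
|           return False
|       finally:
|           os.unlink(path)
|
|   # Keep only demonstrations whose answers actually run and pass.
|   with open("demonstrations.jsonl") as f:
|       demos = [json.loads(line) for line in f]
|   verified = [d for d in demos if passes_tests(d["answer"], d["tests"])]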
| CodeL wrote:
| [dead]
| rasbt wrote:
| RLHF is a popular candidate, but the focus is more on
| "helpfulness" and "safety" -- I don't think it necessarily
| improves LLMs on reasoning benchmarks.
| behnamoh wrote:
| if anything, RLHF makes the model dumber, not smarter.
| rasbt wrote:
| I think it could potentially make the model smarter, but it
| depends on how you collect the data to train the reward
| models. Currently, companies & papers that use RLHF focus
| on "safety" rankings, for example. But you could
| potentially collect labels for "smartness" or "correctness"
| instead and train the reward model on these. (And then
| use that reward model to finetune the LLM you want to
| improve.)
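|
| A minimal sketch of that reward-model objective, using the
| standard pairwise (Bradley-Terry) loss over "more correct"
| vs. "less correct" responses (the toy model here stands in
| for an LLM backbone with a scalar head):
|
|   import torch
|   import torch.nn.functional as F
|
|   class RewardModel(torch.nn.Module):
|       # Toy stand-in: in practice this is the LLM backbone
|       # plus a linear layer producing one score per response.
|       def __init__(self, dim: int = 16):
|           super().__init__()
|           self.head = torch.nn.Linear(dim, 1)
|
|       def forward(self, features):
|           return self.head(features).squeeze(-1)
|
|   def pairwise_loss(model, better, worse):
|       # Push the "more correct" response's score above the
|       # "less correct" one's.
|       return -F.logsigmoid(model(better) - model(worse)).mean()
|
|   model = RewardModel()
|   better, worse = torch.randn(8, 16), torch.randn(8, 16)
|   pairwise_loss(model, better, worse).backward()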
| omneity wrote:
| Other than fine-tuning and RAG, Guidance allows you to
| constrain the output of an LLM within a grammar, for example to
| guarantee JSON output 100% of the time.
|
| Here's one library to do this:
| https://github.com/guidance-ai/guidance
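|
| A rough sketch using the handlebars-style API guidance has as
| of its v0.0.x releases (the interface may change between
| versions, and `pattern` constrains generation via a regex
| with local transformers models):
|
|   import guidance
|
|   guidance.llm = guidance.llms.Transformers("gpt2")
|
|   # Everything outside {{gen ...}} is emitted verbatim, so the
|   # output is valid JSON by construction.
|   program = guidance("""{
|       "name": "{{gen 'name' stop='"'}}",
|       "age": {{gen 'age' pattern='[0-9]+' stop=','}},
|       "hobby": "{{gen 'hobby' stop='"'}}"
|   }""")
|
|   result = program()
|   print(result["name"], result["age"], result["hobby"])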
| packet_nerd wrote:
| What would a good fine-tuning dataset for language translation
| look like?
|
| I want to try fine-tuning to machine translate to and from a
| fairly niche language
| (https://en.wikipedia.org/wiki/S'gaw_Karen_language). How much
| text would I need, and what format would be ideal?
|
| I have a number of book-length texts, most only in the target
| language, and a few bilingual or multilingual. For the bilingual
| and multilingual texts, I can script out probably several
| thousand pairs of "translate the following text from
| <source_lang> to <target_lang>: <source_lang_text>
| <target_lang_text>". Do I need to vary the prompt and format, or
| can I expect the LLM to generalize to different translation
| requests? Is there value in repeating the material in different
| lengths? One set of sentence lengths, another paragraph, and
| another page or chapter length? Also what should be done with the
| monolingual texts, just ignore them?
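|
| For reference, the pair generation I have in mind looks
| something like this (the prompt templates are just guesses at
| what kind of variety might help):
|
|   import json
|   import random
|
|   TEMPLATES = [
|       "Translate the following text from {src} to {tgt}:\n{text}",
|       "{src} text:\n{text}\n\nProvide the {tgt} translation.",
|       "How would you say this in {tgt}? (source is {src})\n{text}",
|   ]
|
|   aligned_pairs = [  # replace with pairs scripted from the bilingual texts
|       ("Hello.", "<S'gaw Karen translation>"),
|   ]
|
|   def make_record(src_lang, tgt_lang, src_text, tgt_text):
|       prompt = random.choice(TEMPLATES).format(
|           src=src_lang, tgt=tgt_lang, text=src_text)
|       return {"instruction": prompt, "output": tgt_text}
|
|   with open("translation_pairs.jsonl", "w") as f:
|       for en, kar in aligned_pairs:
|           # Emit both directions so the model learns both ways.
|           f.write(json.dumps(make_record("English", "S'gaw Karen", en, kar)) + "\n")
|           f.write(json.dumps(make_record("S'gaw Karen", "English", kar, en)) + "\n")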
| soultrees wrote:
| Language translation can be tricky because of the underlying
| nuances in each language, so more context would probably be
| better; using multiple steps to evaluate performance at the
| key (word) level would be a good way to improve confidence.
|
| It might be beneficial to start your dataset at the key (word)
| level, generate some embeddings of the key pair in the source
| and target and stash them, then do the same at the sentence
| level and, just for fun, the paragraph level. (I believe you
| could get enough context from the sentence level, since a
| paragraph is just a group of sentences, but it would still be
| interesting to generate paragraph-level key pairs.)
|
| From there you'd have a set of embeddings of each word src:tgt
| that also has context of how it fits in a sentence level and
| paragraph level with the respective nuances of each language.
|
| Once you have that dataset, you can augment your data with
| prompts like the ones you're using, but also include some
| contextual references to word pairs and sentence pairs in
| your prompt, which should steer the LLM down the right path.
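|
| A rough sketch of stashing those pair embeddings (the model
| choice and schema are just examples; whether a multilingual
| encoder covers a niche language well is an open question):
|
|   import json
|   from sentence_transformers import SentenceTransformer
|
|   model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
|
|   pairs = [  # (source, target) at word, sentence, or paragraph level
|       ("water", "<target word>"),
|       ("The river is deep.", "<target sentence>"),
|   ]
|
|   records = [{
|       "src": src,
|       "tgt": tgt,
|       "src_emb": model.encode(src).tolist(),
|       "tgt_emb": model.encode(tgt).tolist(),
|   } for src, tgt in pairs]
|
|   with open("pair_embeddings.json", "w") as f:
|       json.dump(records, f)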
|
| Edit: not an expert so will heed if someone smarter comes
| along.
| packet_nerd wrote:
| Oh, yes, pairs of words is a good idea. I also have a
| bilingual dictionary and can generate a prompt for each entry
| something like "here's a word in <lang_a>, write a dictionary
| definition for it in <lang_b>: <lang_a_word>:
| <lang_b_definition>".
| Philpax wrote:
| I was hoping that this would go more into the details of dataset
| selection and what makes for high-quality data, but it seems to
| be more a prelude to a Lit-GPT advertisement :/
| philipkglass wrote:
| I have wondered if the very big models trained on a Big Pile of
| Everything can be used to curate smaller, higher quality data
| sets that lead to high performing models with smaller parameter
| counts. Not only are smaller models easier to distribute and
| faster at inference time, but this approach also offers a
| licensing escape hatch
| if future copyright law changes or court rulings make it hard to
| publicly offer models trained on non-permissively licensed
| material.
|
| 1) Train an initial big model on everything you can get, yielding
| a capable but tainted-in-some-jurisdictions model. Keep that
| model private.
|
| 2) Use the big tainted model to narrow or distill the source
| data. One way is by identifying the document subset that can be
| used freely (old public domain works, user-generated content
| uploaded to your own service under a ToS users have already
| assented to, government documents, things with unrestricted
| Creative Commons licensing...). The other way is by
| using it to build "just the facts" distillations from
| restrictively licensed material.
|
| 3) Train an untainted model using just the factual distillations
| and/or the permissively licensed material.
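|
| Step 2 might look roughly like this (big_model is a
| placeholder for however you'd call the private model, and the
| license labels assume you have metadata for each document):
|
|   def big_model(prompt: str) -> str:
|       # Placeholder: wire this up to the private "tainted"
|       # model's inference endpoint.
|       raise NotImplementedError
|
|   FREE_LICENSES = {"public_domain", "cc0", "own_service_tos",
|                    "government"}
|
|   def curate(docs):
|       # docs: [{"text": ..., "license": ...}]
|       kept = []
|       for doc in docs:
|           if doc["license"] in FREE_LICENSES:
|               kept.append(doc["text"])  # freely usable as-is
|           else:
|               # "Just the facts" distillation of restrictively
|               # licensed material.
|               kept.append(big_model(
|                   "List only the verifiable facts stated in the "
|                   "following text, as plain declarative sentences:\n\n"
|                   + doc["text"][:4000]))
|       return kept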
| 3abiton wrote:
| Doesn't that lead to model collapse?
| IanCal wrote:
| Not sure on the licensing but yes you can do that technically.
|
| Phi-1, and therefore phi-1.5, are partially trained on
| GPT-3.5-generated synthetic textbooks.
| saurik wrote:
| The premise here is specifically not to train it on generated
| output of the bigger model but to merely use the bigger model
| to better curate non-generated (and thereby untainted) inputs
| for the training set of the smaller model.
___________________________________________________________________
(page generated 2023-09-15 23:00 UTC)