[HN Gopher] Optimizing LLMs from a Dataset Perspective
       ___________________________________________________________________
        
       Optimizing LLMs from a Dataset Perspective
        
       Author : alexmolas
       Score  : 108 points
       Date   : 2023-09-15 15:49 UTC (7 hours ago)
        
 (HTM) web link (sebastianraschka.com)
 (TXT) w3m dump (sebastianraschka.com)
        
       | pplonski86 wrote:
        | What methods other than fine-tuning can make an LLM
        | smarter? I'm familiar with RAG - Retrieval Augmented
        | Generation.
        
         | heliophobicdude wrote:
          | Perhaps not smarter, but it helps to filter out answers
          | in demonstration data that look computed rather than
          | derived. Also include demonstrations that show the steps
          | leading up to a computation, in a format a runtime can
          | parse and execute, similar to the Code Interpreter or
          | Advanced Data Analysis in ChatGPT.
          | 
          | Taking it a step further, I would include in each
          | demonstration a test harness set up with a test suite
          | that proves the proposed implementation.
          | 
          | I would go through each demonstration with a fixed set
          | of criteria, measuring not only passing tests but also
          | a level of complexity and usefulness.
          | 
          | Why? I was looking through CodeLlama's demonstration
          | data for fine-tuning and saw answers that were not even
          | checked for correctness or usefulness.
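          | 
          | As an untested sketch of that filtering pass (the file
          | layout, field names, and pytest invocation are just
          | assumptions):
          | 
          |     import json
          |     import pathlib
          |     import subprocess
          |     import tempfile
          | 
          |     def passes_tests(solution: str, tests: str) -> bool:
          |         # Write the proposed implementation and its
          |         # test suite to a scratch dir, then run pytest.
          |         with tempfile.TemporaryDirectory() as tmp:
          |             d = pathlib.Path(tmp)
          |             (d / "solution.py").write_text(solution)
          |             (d / "test_solution.py").write_text(tests)
          |             try:
          |                 r = subprocess.run(["pytest", "-q", tmp],
          |                                    capture_output=True,
          |                                    timeout=60)
          |             except subprocess.TimeoutExpired:
          |                 return False
          |             return r.returncode == 0
          | 
          |     # Keep only demonstrations whose answer actually
          |     # passes its harness; drop the rest.
          |     demos = [json.loads(l) for l in open("demos.jsonl")]
          |     kept = [demo for demo in demos
          |             if passes_tests(demo["answer"], demo["tests"])]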
        
         | rasbt wrote:
          | RLHF is a popular candidate, but its focus is more on
          | "helpfulness" and "safety" -- I don't think it
          | necessarily improves LLMs on reasoning benchmarks.
        
           | behnamoh wrote:
            | If anything, RLHF makes the model dumber, not smarter.
        
             | rasbt wrote:
              | I think it could potentially make the model
              | smarter, but it depends on how you collect the data
              | to train the reward models. Currently, companies
              | and papers that use RLHF focus on "safety"
              | rankings, for example. But you could instead
              | collect labels for "smartness" or "correctness" and
              | train the reward model on those. (And then use that
              | reward model to finetune the LLM you want to
              | improve.)
        
         | omneity wrote:
         | Other than fine-tuning and RAG, Guidance allows you to
         | constrain the output of an LLM within a grammar, for example to
         | guarantee JSON output 100% of the time.
         | 
          | Here's one library to do this:
          | https://github.com/guidance-ai/guidance
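          | 
          | An untested sketch with the template syntax from its
          | README (the exact API has been changing, so check the
          | repo; pattern constraints need a local model):
          | 
          |     import guidance
          | 
          |     # Token-level constraints require a local
          |     # (transformers-backed) model.
          |     guidance.llm = guidance.llms.Transformers("gpt2")
          | 
          |     # Literal braces and keys are emitted verbatim; the
          |     # model only fills the constrained holes, so the
          |     # result parses as JSON every time.
          |     program = guidance("""\
          |     {"name": "{{gen 'name' stop='"'}}",
          |      "age": {{gen 'age' pattern='[0-9]+' max_tokens=3}}}""")
          |     out = program()
          |     print(out["name"], out["age"])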
        
       | packet_nerd wrote:
       | What would a good fine-tuning dataset for language translation
       | look like?
       | 
       | I want to try fine-tuning to machine translate to and from a
       | fairly niche language
       | (https://en.wikipedia.org/wiki/S'gaw_Karen_language). How much
       | text would I need, and what format would be ideal?
       | 
        | I have a number of book-length texts, most only in the
        | target language, and a few bilingual or multilingual. For
        | the bilingual and multilingual texts, I can script out
        | probably several thousand pairs of "translate the
        | following text from <source_lang> to <target_lang>:
        | <source_lang_text> <target_lang_text>". Do I need to vary
        | the prompt and format, or can I expect the LLM to
        | generalize to different translation requests? Is there
        | value in repeating the material at different lengths --
        | one set at sentence length, another at paragraph length,
        | and another at page or chapter length? Also, what should
        | be done with the monolingual texts -- just ignore them?
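        | 
        | For the scripted pairs, I'm imagining something like this
        | untested sketch (field names are just what I'd pick; the
        | aligned_pairs list would come from the bilingual books):
        | 
        |     import json
        |     import random
        | 
        |     # A few phrasings so the model doesn't latch onto a
        |     # single fixed instruction template.
        |     TEMPLATES = [
        |         "Translate the following text from {src} to"
        |         " {tgt}:\n{text}",
        |         "{src} to {tgt} translation:\n{text}",
        |         "Render this {src} passage in {tgt}:\n{text}",
        |     ]
        | 
        |     def example(src, tgt, src_text, tgt_text):
        |         prompt = random.choice(TEMPLATES).format(
        |             src=src, tgt=tgt, text=src_text)
        |         return {"instruction": prompt, "output": tgt_text}
        | 
        |     # aligned_pairs: [(english_text, karen_text), ...]
        |     with open("train.jsonl", "w") as f:
        |         for en, kar in aligned_pairs:
        |             # train both directions from each pair
        |             for ex in (
        |                 example("English", "S'gaw Karen", en, kar),
        |                 example("S'gaw Karen", "English", kar, en),
        |             ):
        |                 f.write(json.dumps(ex, ensure_ascii=False)
        |                         + "\n")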
        
         | soultrees wrote:
          | Language translation can be tricky because of the
          | underlying nuances in each language, so more context
          | would probably be better, but evaluating performance in
          | multiple steps, starting at the key (word) level, would
          | be a good way to improve confidence.
          | 
          | It might be beneficial to start your dataset at the key
          | (word) level: generate embeddings for each key pair in
          | the source and target and stash them, then do the same
          | at the sentence level and, just for fun, the paragraph
          | level. (I believe you could get enough context from the
          | sentence level, since a paragraph is just a group of
          | sentences, but it would still be interesting to
          | generate paragraph-level key pairs.)
          | 
          | From there you'd have a set of embeddings for each
          | src:tgt word pair that also carries context about how
          | it fits at the sentence and paragraph level, with the
          | respective nuances of each language.
          | 
          | Once you have that dataset, you can augment your data
          | with prompts like the ones you're using, but also
          | include some contextual references to word pairs and
          | sentence pairs in your prompt, which should corner the
          | LLM into the right path.
         | 
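          | For the embed-and-stash step, an untested sketch (the
          | model name is just one example of a multilingual
          | encoder):
          | 
          |     from sentence_transformers import SentenceTransformer
          | 
          |     # A multilingual encoder, so source and target land
          |     # in a shared vector space; swap in whatever best
          |     # covers the target language.
          |     model = SentenceTransformer(
          |         "paraphrase-multilingual-MiniLM-L12-v2")
          | 
          |     def embed_pairs(src_units, tgt_units):
          |         # Works identically at word, sentence, or
          |         # paragraph granularity.
          |         src_vecs = model.encode(src_units)
          |         tgt_vecs = model.encode(tgt_units)
          |         return list(zip(src_units, tgt_units,
          |                         src_vecs, tgt_vecs))
          | 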
          | Edit: not an expert, so I'll defer if someone smarter
          | comes along.
        
           | packet_nerd wrote:
            | Oh, yes, pairs of words is a good idea. I also have a
            | bilingual dictionary and can generate a prompt for
            | each entry, something like "here's a word in
            | <lang_a>, write a dictionary definition for it in
            | <lang_b>: <lang_a_word>: <lang_b_definition>".
        
       | Philpax wrote:
       | I was hoping that this would go more into the details of dataset
       | selection and what makes for high-quality data, but it seems to
       | be more a prelude to a Lit-GPT advertisement :/
        
       | philipkglass wrote:
        | I have wondered if the very big models trained on a Big
        | Pile of Everything can be used to curate smaller, higher-
        | quality datasets that lead to high-performing models with
        | smaller parameter counts. Not only are smaller models
        | easier to distribute and faster at inference time, but
        | this also offers a licensing escape hatch if future
        | copyright law changes or court rulings make it hard to
        | publicly offer models trained on non-permissively
        | licensed material.
       | 
       | 1) Train an initial big model on everything you can get, yielding
       | a capable but tainted-in-some-jurisdictions model. Keep that
       | model private.
       | 
        | 2) Use the big tainted model to narrow or distill the
        | source data (the filtering half is sketched after this
        | list). One way is by identifying the document subset that
        | can be used freely (old public domain works, user-
        | generated content whose uploaders already assented to
        | your own company's ToS, government documents, things with
        | unrestricted Creative Commons licensing...). The other
        | way is by using it to build "just the facts"
        | distillations from restrictively licensed material.
       | 
       | 3) Train an untainted model using just the factual distillations
       | and/or the permissively licensed material.
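        | 
        | The filtering half of step 2 is an LLM-as-judge loop,
        | roughly like this sketch (big_model_complete is a
        | placeholder for however the private model is served):
        | 
        |     import json
        | 
        |     PROMPT = ("Is the following document public domain,"
        |               " permissively licensed, or otherwise free"
        |               " to train on? Answer YES or NO.\n\n{doc}")
        | 
        |     def is_untainted(doc: str) -> bool:
        |         # big_model_complete() stands in for an inference
        |         # call against the private big model.
        |         answer = big_model_complete(
        |             PROMPT.format(doc=doc[:4000]))
        |         return answer.strip().upper().startswith("YES")
        | 
        |     with open("untainted.jsonl", "w") as out:
        |         for line in open("everything.jsonl"):
        |             if is_untainted(json.loads(line)["text"]):
        |                 out.write(line)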
        
         | 3abiton wrote:
         | Doesn't that lead to model collapse?
        
         | IanCal wrote:
          | Not sure about the licensing, but technically, yes, you
          | can do that.
          | 
          | Phi-1, and therefore phi-1.5, were partially trained on
          | GPT-3.5-generated synthetic textbooks.
        
           | saurik wrote:
           | The premise here is specifically not to train it on generated
           | output of the bigger model but to merely use the bigger model
           | to better curate non-generated (and thereby untainted) inputs
           | for the training set of the smaller model.
        
       ___________________________________________________________________
       (page generated 2023-09-15 23:00 UTC)