[HN Gopher] Phi-2: Self-Extend Boosts Performance, Extends Conte...
       ___________________________________________________________________
        
       Phi-2: Self-Extend Boosts Performance, Extends Context to 8k
       Without Training
        
       Author : georgehill
       Score  : 78 points
       Date   : 2024-01-12 09:05 UTC (13 hours ago)
        
 (HTM) web link (old.reddit.com)
 (TXT) w3m dump (old.reddit.com)
        
       | behnamoh wrote:
        | Despite what people say about Phi-2, I never liked its responses.
       | It clearly lacks depth and consistency.
        
         | eightysixfour wrote:
          | I feel like Phi-2 is all about context: the fact that it is
          | a <3B parameter model. It is incredible in that context; it
          | isn't incredible in others.
        
           | make3 wrote:
           | it's definitely what it's about. it's insanely strong for its
           | size
        
             | yufeng66 wrote:
              | Phi-2 basically demonstrated that you don't need a very
              | large model to figure out language. It's not very smart,
              | but it speaks perfect English. It's not obvious that the
              | best way to gain IQ is to have a larger language model;
              | some other structure might be needed.
        
               | behnamoh wrote:
               | But isn't that something that even smaller GPT-2 models
               | demonstrated already?
        
         | visarga wrote:
          | Maybe it's useful for fine-tuning jobs, not for free
          | prompting.
          | 
          | I have had a similarly bad experience trying to prompt it.
        
         | m3kw9 wrote:
          | It's a model for low RAM usage, 4 gigs? I heard it would
          | work well for RAG to speak to your docs, instead of relying
          | purely on its parameters to answer.
        
         | coder543 wrote:
         | Did you try Dolphin Phi-2? The Dolphin fine-tune seems better
         | to me in this case.
        
       | valine wrote:
       | The method they use is surprisingly simple. They claim GPTs can't
       | effectively generate beyond the context window because our models
       | overfit on positional encodings. The fix is literally to cap the
       | positional encodings at inference time.
       | 
        | It makes sense intuitively that the exact position of tokens
        | really only matters for adjacent or near-adjacent tokens. For
        | far-away tokens a rough position is fine.
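        | 
        | A rough sketch of the capping idea (my own toy illustration,
        | not the paper's code): with relative positions, any distance
        | past the trained window just gets clamped to the largest
        | distance the model has actually seen, something like:
        | 
        |     import torch
        | 
        |     def capped_relative_positions(seq_len, cap):
        |         # distance from each query token to each key token
        |         pos = torch.arange(seq_len)
        |         rel = pos[:, None] - pos[None, :]
        |         # distances beyond `cap` reuse the largest trained
        |         # distance instead of an unseen, out-of-range one
        |         return rel.clamp(max=cap)
        | 
        |     # toy numbers: "trained" window of 4, sequence of 8
        |     print(capped_relative_positions(8, cap=3))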
        
         | nulld3v wrote:
         | RoPE scaling has been a thing for a while already:
         | https://arxiv.org/abs/2306.15595
         | 
         | Does anybody know what the difference is between the approach
         | in OP vs other RoPE scaling approaches?
        
           | hexaga wrote:
           | The core problem is: there's not enough unique, trained
           | positions. Naively going past the end of training ctx makes
           | you run straight into out of distribution positions, and
           | things become incoherent.
           | 
           | For a model trained with a ctx size of 2, that looks like:
           | `[0, 1, *incoherence starts* 2, 3]`
           | 
           | Existing RoPE scaling methods try to stay in-distribution by
           | assigning positions between the known in-distribution ones:
           | `[0, 0.5, 1, 1.5]`. This is still ~kinda OOD, but works w/
           | some fine tuning.
           | 
           | The method in OP breaks the core premise that we need unique
            | positions _at all_, and just gives multiple tokens the same
           | position: `[0, 0, 1, 1]`.
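            | 
            | To make the contrast concrete, a tiny toy sketch of the two
            | position assignments (my own illustration, not code from
            | either paper):
            | 
            |     import torch
            | 
            |     train_ctx, seq_len = 2048, 4096
            |     pos = torch.arange(seq_len)
            | 
            |     # interpolation: squeeze into the trained range
            |     interp = pos * (train_ctx / seq_len)  # 0, 0.5, 1, ...
            | 
            |     # grouping: reuse each trained position id
            |     group = seq_len // train_ctx          # 2 here
            |     grouped = pos // group                # 0, 0, 1, 1, ...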
        
         | cma wrote:
         | Wouldn't that mean (until higher level embeddings) compound
         | phrases far away are unordered? And numbers are fragmented by
         | token boundary and scrambled up?
        
           | valine wrote:
           | Embeddings at lower layers aren't going to be looking very
           | far beyond nearby or adjacent embeddings as they refine their
           | meaning. For a number like 3.14, the tokens 3 and 14 are
           | important to each other, but entirely unimportant to
           | understand the meaning of a question later in the context.
           | It's only at later layers that an embedding representing the
            | concept of pi becomes important to the question embeddings.
           | 
            | As I understand it, the positional encodings are calculated
            | relative to the token in question. It's not like 3 and 14 are
            | unordered tokens from their own perspective.
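            | 
            | The relative part is easy to see with a toy 2-d RoPE
            | rotation (my own sketch, not from the paper): the score
            | between two rotated vectors depends only on the distance
            | between their positions, not on the absolute values.
            | 
            |     import math
            |     import torch
            | 
            |     def rope_2d(vec, pos, freq=0.1):
            |         # rotate a 2-d vector by an angle set by position
            |         a = pos * freq
            |         rot = torch.tensor([[math.cos(a), -math.sin(a)],
            |                             [math.sin(a),  math.cos(a)]])
            |         return rot @ vec
            | 
            |     q = torch.tensor([1.0, 0.0])
            |     k = torch.tensor([0.5, 0.5])
            | 
            |     # same distance (2) -> same score, wherever it sits
            |     print(torch.dot(rope_2d(q, 10), rope_2d(k, 8)))
            |     print(torch.dot(rope_2d(q, 110), rope_2d(k, 108)))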
        
       | scosman wrote:
       | Phi-2 and TinyLlama are so so so impressive for being < 3B
       | parameter models. They can run on a phone, and are pretty snappy.
       | 
       | Benchmarks:
       | https://github.com/ggerganov/llama.cpp/discussions/4508
       | 
        | I don't see them taking over general-purpose chat/query use
        | cases, but fine-tuned to a specific use case and embedded into
        | mobile apps, they might be how we see LLMs jump from cool tech
        | demos to something that's present in most products.
        
       | te_chris wrote:
        | Has anyone successfully fine-tuned it for function calling?
        | Thinking you could use a lightweight model like this to
        | interpret args in a pipeline, then format them for passing
        | down the chain.
        
       ___________________________________________________________________
       (page generated 2024-01-12 23:01 UTC)