[HN Gopher] Phi-2: Self-Extend Boosts Performance, Extends Conte...
___________________________________________________________________
Phi-2: Self-Extend Boosts Performance, Extends Context to 8k
Without Training
Author : georgehill
Score : 78 points
Date : 2024-01-12 09:05 UTC (13 hours ago)
(HTM) web link (old.reddit.com)
(TXT) w3m dump (old.reddit.com)
| behnamoh wrote:
| Despite what people say about Phi-2, I never liked its responses.
| It clearly lacks depth and consistency.
| eightysixfour wrote:
| I feel like Phi-2 has to be judged in the context of the fact
| that it is a <3B parameter model. It is incredible in that
| context; it isn't incredible in others.
| make3 wrote:
| it's definitely what it's about. it's insanely strong for its
| size
| yufeng66 wrote:
| Phi-2 basically demonstrated that you don't need a very
| large model to figure out language. It's not very smart, but
| it speaks perfect English. It's not obvious that the best way
| to gain IQ is to have a larger language model; some other
| structure might be needed.
| behnamoh wrote:
| But isn't that something that even smaller GPT-2 models
| demonstrated already?
| visarga wrote:
| maybe it's useful for fine-tuning jobs, not for free prompting
|
| I have had a similarly bad experience trying to prompt it.
| m3kw9 wrote:
| It's a model for low RAM usage, around 4 GB? I heard it would
| work well for RAG, to talk to your docs instead of relying
| purely on its parameters to answer.
| coder543 wrote:
| Did you try Dolphin Phi-2? The Dolphin fine-tune seems better
| to me in this case.
| valine wrote:
| The method they use is surprisingly simple. They claim GPTs can't
| effectively generate beyond the context window because our models
| overfit on positional encodings. The fix is literally to cap the
| positional encodings at inference time.
|
| It makes sense intuitively that the exact position of tokens
| really only matters for adjacent or near adjacent tokens. For far
| away tokens a rough position is fine.
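|
| A toy sketch of that "cap" idea (illustrative only, not the
| paper's actual code): clamp every position index at inference
| time so it never exceeds the largest position the model saw in
| training.
|
|       train_ctx = 2048   # Phi-2's training context length
|       positions = range(8192)                   # 8k inference window
|       capped = [min(p, train_ctx - 1) for p in positions]
|       # everything past position 2047 reuses the last trained index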
| nulld3v wrote:
| RoPE scaling has been a thing for a while already:
| https://arxiv.org/abs/2306.15595
|
| Does anybody know what the difference is between the approach
| in OP vs other RoPE scaling approaches?
| hexaga wrote:
| The core problem is: there aren't enough unique, trained
| positions. Naively going past the end of the training ctx runs
| you straight into out-of-distribution positions, and things
| become incoherent.
|
| For a model trained with a ctx size of 2, that looks like:
| `[0, 1, *incoherence starts* 2, 3]`
|
| Existing RoPE scaling methods try to stay in-distribution by
| assigning positions between the known in-distribution ones:
| `[0, 0.5, 1, 1.5]`. This is still ~kinda OOD, but works w/
| some fine tuning.
|
| The method in OP breaks the core premise that we need unique
| positions _at all_, and just gives multiple tokens the same
| position: `[0, 0, 1, 1]`.
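|
| Concretely, for that trained-ctx-of-2 model stretched to a ctx
| of 4 (toy numbers):
|
|       train_ctx, new_ctx = 2, 4
|
|       naive = list(range(new_ctx))
|       # [0, 1, 2, 3] -> positions 2 and 3 are out of distribution
|
|       interpolated = [i * train_ctx / new_ctx for i in range(new_ctx)]
|       # [0.0, 0.5, 1.0, 1.5] -> RoPE scaling / position interpolation
|
|       grouped = [i * train_ctx // new_ctx for i in range(new_ctx)]
|       # [0, 0, 1, 1] -> the grouped positions the OP method uses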
| cma wrote:
| Wouldn't that mean (until higher level embeddings) compound
| phrases far away are unordered? And numbers are fragmented by
| token boundary and scrambled up?
| valine wrote:
| Embeddings at lower layers aren't going to be looking very
| far beyond nearby or adjacent embeddings as they refine their
| meaning. For a number like 3.14, the tokens 3 and 14 are
| important to each other, but entirely unimportant to
| understand the meaning of a question later in the context.
| It's only at later layers that an embedding representing the
| concept of PI becomes important to the question embeddings.
|
| As I understand it, the positional encodings are calculated
| relative to the token in question. It's not like 3 and 14 are
| unordered tokens from their own perspective.
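|
| A rough sketch of how that plays out with the grouped positions
| from above (the window and group sizes are made up for
| illustration; the paper's exact formula differs slightly):
| tokens inside a small neighbor window keep their exact relative
| offsets, so 3 and 14 stay ordered, while far-away tokens only
| get a coarse grouped position.
|
|       def self_extend_rel_pos(rel, neighbor_window=4, group=2):
|           # nearby tokens: exact relative position, order preserved
|           if rel < neighbor_window:
|               return rel
|           # distant tokens: a coarse, grouped position is enough
|           return neighbor_window + (rel - neighbor_window) // group
|
|       print([self_extend_rel_pos(r) for r in range(12)])
|       # [0, 1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7]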
| scosman wrote:
| Phi-2 and TinyLlama are so so so impressive for being < 3B
| parameter models. They can run on a phone, and are pretty snappy.
|
| Benchmarks:
| https://github.com/ggerganov/llama.cpp/discussions/4508
|
| I don't see them taking over general-purpose chat/query use
| cases, but fine-tuned to a specific use case and embedded into
| mobile apps, they might be how we see LLMs jump from cool tech
| demos to something that's present in most products.
| te_chris wrote:
| Has anyone successfully fine-tuned it for function calling?
| Thinking I could use a lightweight model like this to interpret
| args in a pipeline, then format them for passing down the
| chain.
___________________________________________________________________
(page generated 2024-01-12 23:01 UTC)