[HN Gopher] Phi-4 Bug Fixes
___________________________________________________________________
Phi-4 Bug Fixes
Author : danielhanchen
Score : 183 points
Date : 2025-01-10 21:17 UTC (1 day ago)
(HTM) web link (unsloth.ai)
(TXT) w3m dump (unsloth.ai)
| danielhanchen wrote:
| Hey HN family! I found a few bugs in Phi-4 - Microsoft's latest
| MIT licensed LLM claimed to be on par with GPT-4o mini
|
| 1. The EOS (end of sequence) token should be <|im_end|> not
| <|endoftext|>
|
| 2. Chat template should not auto add an assistant prompt
|
| 3. Padding token should not be EOS but <|dummy_87|>
|
| I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4bit
| quants, dynamic quants and all fixes to
| https://huggingface.co/unsloth
|
| I also made a Colab notebook to finetune Phi-4 on a free GPU:
| https://colab.research.google.com/github/unslothai/notebooks...
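|
| To double check fixes 1 and 3, here's a minimal sketch with the
| transformers library (assuming the fixed unsloth/phi-4 upload):
|
|     from transformers import AutoTokenizer
|
|     # Load the fixed upload - special tokens are already patched
|     tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4")
|
|     print(tokenizer.eos_token)  # expected: <|im_end|>
|     print(tokenizer.pad_token)  # expected: <|dummy_87|>
|     assert tokenizer.pad_token != tokenizer.eos_token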
| sunaookami wrote:
| Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft
| just doesn't care.
|
| >to be on par with GPT-4o mini
|
| Phi is known to overfit benchmarks. It's way, way worse than
| that.
| throwaway314155 wrote:
| Anecdotally, I've been experimenting with Phi-4 the past hour
| or so (so, yeah, not very comprehensive) and it's certainly a
| strong model. Definitely better than the previous Phi models.
| danielhanchen wrote:
| Yep Phi-4 definitely is better than Phi-3.5!
| danielhanchen wrote:
| Phi-3 should be fixed as well - but yes, it also had bugs!
| https://x.com/danielhanchen/status/1782853167572832650
|
| Phi-3's sliding window should be 2048 and not 2047, and they
| also had chat template issues - I uploaded correct versions
| to https://huggingface.co/unsloth/Phi-3.5-mini-instruct
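|
| A rough sketch of patching that config value with transformers
| (assuming the original microsoft/Phi-3.5-mini-instruct repo id;
| the fixed upload above already includes it):
|
|     from transformers import AutoConfig
|
|     config = AutoConfig.from_pretrained(
|         "microsoft/Phi-3.5-mini-instruct")
|     config.sliding_window = 2048  # was 2047 in the original
|     config.save_pretrained("phi-3.5-mini-instruct-fixed")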
| simonw wrote:
| Huh! That may explain why I kept on getting visible <|im_end|>
| output when I tried running a Phi-4 GGUF file using llama.cpp.
| danielhanchen wrote:
| Oh yes exactly! I trimmed it out now :)
|
| The better chat template should be:
|
| {% for message in messages %}{% if (message['role'] ==
| 'system') %}{{'<|im_start|>system<|im_sep|>' +
| message['content'] + '<|im_end|>'}}{% elif (message['role']
| == 'user') %}{{'<|im_start|>user<|im_sep|>' +
| message['content'] + '<|im_end|>'}}{% elif (message['role']
| == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' +
| message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{%
| if add_generation_prompt %}{{
| '<|im_start|>assistant<|im_sep|>' }}{% endif %}
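|
| A small usage sketch (assuming the fixed unsloth/phi-4
| tokenizer) - the assistant header is only added when you
| explicitly ask for it at inference time:
|
|     from transformers import AutoTokenizer
|
|     tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4")
|     messages = [{"role": "user", "content": "Hello!"}]
|
|     # Training-style text: no trailing assistant header
|     text = tokenizer.apply_chat_template(messages, tokenize=False)
|
|     # Inference: explicitly opt into the assistant header
|     prompt = tokenizer.apply_chat_template(
|         messages, tokenize=False, add_generation_prompt=True)
|     # prompt now ends with <|im_start|>assistant<|im_sep|>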
| CGamesPlay wrote:
| > We converted Phi-4 to Llama's architecture for better
| accuracy and easier use.
|
| What does this mean? When I think about "model architecture", I
| think about the number of weights in each layer, the
| organization of the layers, etc. And AFAIK, it's untenable to
| "port" a model from one to the other without effectively
| retraining it. So what does it actually mean to "convert to
| Llama's architecture"?
| Sn0wCoder wrote:
| Would guess GGUF so you can run on llama.cpp, LM Studio,
| etc..., but OP can hopefully clarify further for you.
| danielhanchen wrote:
| Yep converting to Llama arch definitely makes accessibility
| much better - also many fast LLM serving libraries normally
| support Llama, so it makes it easier to port and use!
| danielhanchen wrote:
| Oh Phi-4's architecture is inspired by Llama itself, except
| they merged the attention matrices into 1 large matrix for
| better FLOP utilization, and also merged the gate/up matrices
| in the MLP.
|
| Phi-3 used to use sliding window attention, but they got rid
| of that in Phi-4.
|
| So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch
| (by unmerging the merges), and now you can "Llama-fy" Phi-4
| to Llama arch.
|
| The reason accuracy increases in finetuning is that during
| LoRA finetuning you learn only 1 A matrix for the merged QKV,
| whilst unmerging it creates 3 A matrices - this allows the
| model more freedom to learn new features.
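|
| A rough sketch of the unmerging idea (a hypothetical helper -
| the real head counts come from Phi-4's config.json):
|
|     import torch
|
|     def split_fused_qkv(qkv_weight, hidden_size, num_heads,
|                         num_kv_heads):
|         # Fused rows are laid out as [Q; K; V]
|         head_dim = hidden_size // num_heads
|         q_size = num_heads * head_dim
|         kv_size = num_kv_heads * head_dim
|         return torch.split(
|             qkv_weight, [q_size, kv_size, kv_size], dim=0)
|
|     # Illustrative shapes only
|     fused = torch.randn(7680, 5120)
|     q, k, v = split_fused_qkv(fused, hidden_size=5120,
|                               num_heads=40, num_kv_heads=10)
|
| Each of q, k and v then gets its own LoRA A matrix instead of
| sharing one.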
| behnamoh wrote:
| I know some of those words... Man, do you recommend any
| blog/book/etc. that teaches me how to know this stuff?
|
| Most books are either too low level or too high level.
| danielhanchen wrote:
| If it helps, there are a few YouTube recordings of conferences
| and workshops I did about stuff!
|
| Low level Technicals of LLMs:
| https://www.youtube.com/watch?v=pRM_P6UfdIc
|
| CUDA / GPU Mode talk about it here:
| https://www.youtube.com/watch?v=hfb_AIhDYnA
|
| Chat with PyTorch team here:
| https://www.youtube.com/watch?v=MQwryfkydc0
|
| PyTorch Conference talk here:
| https://www.youtube.com/watch?v=PdtKkc5jB4g
| sroussey wrote:
| Can you convert to ONNX so I can try in web browser?
| sroussey wrote:
| Would like to update this:
|
| https://huggingface.co/spaces/webml-community/phi-3.5-webgpu
| danielhanchen wrote:
| Oh I can probs try doing this!
| lostmsu wrote:
| The benchmark results of the model before and after the "fixes"
| do not match numbers reported in the model card:
| https://huggingface.co/microsoft/phi-4
|
| According to Microsoft, the MATH score should be 80.4, while
| both the original and the "fixed" models as run by unsloth only
| score just over 12.3. So either Microsoft made a few huge
| mistakes, or unsloth was not able to run their model correctly.
| danielhanchen wrote:
| Oh yes I found this to be a bit strange - I uploaded our
| versions and Microsoft's own version to Hugging Face's public
| LLM leaderboard - https://huggingface.co/spaces/open-llm-
| leaderboard/open_llm_...
|
| You can see Microsoft's own original Phi-4 upload scores 12.31%
| - I'm unsure why. My fixes at least push it to 20%.
|
| It's possibly because HF's benchmark does "Scoring: Exact
| match: Was the solution generated correct and in the expected
| format", which might be the issue.
| t1amat wrote:
| Daniel's fixes to Phi-4 make it the best scoring Phi-4 on HF's
| Open LLM Leaderboard. Great job on that.
|
| Unsloth is a masterpiece, keep up the great work!
| danielhanchen wrote:
| Thanks a lot!
| TZubiri wrote:
| Ah yes, drawing ASCII art, the de facto benchmark for evaluating
| LLM quality.
| danielhanchen wrote:
| Anecdotal evidence was provided to show some Redditors tested
| it out - but I do agree it's not correct to show that as an
| example - so I uploaded our fixed versions to Hugging Face's
| public LLM leaderboard here:
| https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
| - this shows the fixes do in fact work!
| make3 wrote:
| "Yes it improves performance!" _proceeds to show the most
| unconvincing stats ever_
|
| you can probably blow on your GPU and get a similar performance
| change
| refulgentis wrote:
| I'm sorry, I don't understand what you mean. I checked the
| original article again too. As it stands, my understanding is
| you are claiming:
|
| - blowing on a GPU (which I take to mean doing roughly nothing)
|
| - gets roughly the same perf change
|
| - as moving from fp16 to q4
| danielhanchen wrote:
| Are you referring to the finetuning part?
|
| The multiple bug fixes are separate from the finetuning
| sections - Unsloth itself makes finetuning 2x faster and uses
| 70% less memory - the bug fixes are totally detached from
| finetuning - ie you can take the fixed version we uploaded at
| https://huggingface.co/unsloth/phi-4, and use it in any
| framework or inference engine.
|
| Apologies, I'm a bit confused by the comment, sorry.
|
| If you're questioning the credibility of the bug fixes - we
| fixed 8 bugs in Gemma
| https://x.com/danielhanchen/status/1765446273661075609,
| multiple bugs in Llama, Mistral, Qwen, a gradient
| accumulation bug
| https://x.com/danielhanchen/status/1846235913443262891 and
| much more
| grumpopotamus wrote:
| 2x faster than what?
| danielhanchen wrote:
| Oh 2x faster and uses >70% less memory than Hugging Face
| + Flash Attention 2! I did a CUDA / GPU Mode talk about
| it here: https://www.youtube.com/watch?v=hfb_AIhDYnA Also
| to the PyTorch team here:
| https://www.youtube.com/watch?v=MQwryfkydc0 and the
| PyTorch Conference here:
| https://www.youtube.com/watch?v=PdtKkc5jB4g
| kouteiheika wrote:
| > Oh 2x faster and uses >70% less memory than Hugging
| Face + Flash Attention 2!
|
| Is this doing the same type of fine-tuning, or are you
| comparing full bf16 fine-tuning in HF with 4-bit QLoRA in
| Unsloth (in which case it's not really an apples-to-
| apples comparison)? If it's the latter then do you have a
| comparison of the former?
| danielhanchen wrote:
| Oh I compared 4bit QLoRA HF+FA2 with Unsloth 4bit QLoRA.
|
| 16bit LoRA has similar boosts in performance!
|
| Full bf16 finetuning is not yet supported, but it'll come out
| soon!
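|
| For reference, a rough sketch of the 4bit QLoRA setup being
| compared (illustrative hyperparameters):
|
|     from unsloth import FastLanguageModel
|
|     model, tokenizer = FastLanguageModel.from_pretrained(
|         model_name="unsloth/phi-4",
|         max_seq_length=2048,
|         load_in_4bit=True,  # QLoRA: 4bit base weights
|     )
|     model = FastLanguageModel.get_peft_model(
|         model,
|         r=16,
|         lora_alpha=16,
|         target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
|                         "gate_proj", "up_proj", "down_proj"],
|     )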
| danielhanchen wrote:
| Update - the Phi-4 team is working on adding all our fixes to
| the original model!
| https://huggingface.co/microsoft/phi-4/discussions/21
| danielhanchen wrote:
| I uploaded our fixed versions to
| https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
| which show the difference in scores.
|
| I agree it's not super convincing, so I provided anecdotal
| evidence as well - I'll work with the Phi-4 team to upstream
| these fixes!
|
| PS for further credibility, we also fixed 8 bugs in Gemma 1 -
| see https://x.com/danielhanchen/status/1765446273661075609 ,
| multiple bugs in Llama, Mistral, Qwen and other models
| adultSwim wrote:
| Are there alternatives to unsloth?
|
| I would love to use it but the open/free version only handles one
| GPU, and it's unclear how much the paid version would cost. I
| have some limited access to multiple older NVidia cards and would
| love to make better use of them while I'm still learning. My
| budget for learning/projects is rather modest.
|
| Hopefully they succeed. At work I could make a strong case for
| going with them as they allow keeping data local only, instead of
| relying on an API.
| danielhanchen wrote:
| Multi GPU support is definitely coming to Unsloth OSS! Our goal
| was to release it this month, but I'm unsure on exact timelines -
| maybe next month!!
| adultSwim wrote:
| Thank you!
| danielhanchen wrote:
| I'll ping you when it comes along!
| wsintra2022 wrote:
| >Reddit comments show our fixes make Phi-4 inference much better
|
| I'd like to try 'Reddit comments show my fixes make app better'
| in my next review
| danielhanchen wrote:
| Fixed versions are also independently scored by Hugging Face's
| Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-
| leaderboard/open_llm_...
|
| The Reddit LocalLlama community is actually pretty cool -
| tonnes of research comes from the community - for example
| kaiokendev's linear RoPE scaling, YaRN, NTK Aware RoPE Scaling,
| many LLM benchmarks - many researchers use LocalLlama to share
| research and discuss new stuff.
|
| I know a lot of AI researchers use the "LocalLlama vibe check"
| which essentially is an anecdotal approach to LLM evaluation -
| ie instead of relying on Chat LMsys or LLM benchmarks, 3rd
| party crowd sourced vibe checks sometimes do much better.
| danielhanchen wrote:
| As an update - the Phi-4 team is actively working on
| incorporating all fixes! See
| https://huggingface.co/microsoft/phi-4/discussions/21
| danielhanchen wrote:
| Update: The Phi-4 team is actively working on adding all our
| fixes into the original model!
| https://huggingface.co/microsoft/phi-4/discussions/21
| excerionsforte wrote:
| Available on Ollama already:
| https://ollama.com/vanilj/phi-4-unsloth
| danielhanchen wrote:
| Oh fabulous! :)
| tandr wrote:
| looking at "original" Phi4 on ollama, it looks like they have
| fixed parameters issue for im_start/end
| NooneAtAll3 wrote:
| Application Error
|
| TypeError: m(...).findLast is not a function
|
| at L (https://unsloth.ai/assets/root-DexjOeLv.js:1:340)
|
| at ia (https://unsloth.ai/assets/components-D38fXVcE.js:7:30549)
|
| at Ac (https://unsloth.ai/assets/components-D38fXVcE.js:7:98661)
|
| at Am (https://unsloth.ai/assets/components-D38fXVcE.js:7:94250)
|
| at o0 (https://unsloth.ai/assets/components-D38fXVcE.js:7:93401)
|
| at ha (https://unsloth.ai/assets/components-D38fXVcE.js:7:93212)
|
| at Mm (https://unsloth.ai/assets/components-D38fXVcE.js:7:90555)
|
| at Om (https://unsloth.ai/assets/components-D38fXVcE.js:7:89963)
|
| at MessagePort.M
| (https://unsloth.ai/assets/components-D38fXVcE.js:1:11235
| danielhanchen wrote:
| Sorry are there some issues with our website?
| NooneAtAll3 wrote:
| yep, it appears for a second - then displays only this :(
| danielhanchen wrote:
| Oh no :( Do you know which device / platform?
| greensh wrote:
| Microsoft developed and trained Phi-4. How can there be bugs in
| their official implementation? Does this mean they trained and
| evaluated it on their own completely different code and then
| ported it to the huggingface library for compatibility?
| danielhanchen wrote:
| The chat template adding an assistant prompt by default, for
| example, is also shown in the technical report - so they did
| this during training. The issue is that inference workloads
| should not have this, otherwise they might inadvertently
| append extra assistant prompts or forget about it - hence I
| removed it.
|
| The rest I'm not sure about - e.g. the EOS token should be
| im_end and not endoftext - it could be a small mistake.
| dorian-graph wrote:
| These seem like amazingly egregious mistakes MS made? Or is it
| not as bad as it seems? I suppose I'm curious how these kinds
| of mistakes happen for a model release.
| danielhanchen wrote:
| This happens quite often sadly - for example I fixed 8 bugs in
| Gemma https://x.com/danielhanchen/status/1765446273661075609,
| multiple bugs in Llama, Mistral, Qwen, a gradient accumulation
| bug https://x.com/danielhanchen/status/1846235913443262891 etc
|
| I wouldn't blame model training teams - sadly it's relatively
| hard to coordinate large teams so it might have been
| overlooked.
|
| But hey - I'm always here to fix them up :))
| c1b wrote:
| daniel youre a legend, thanks for all you do!
|
| one question, I see perf comparisons here are done on an L4, but
| isn't this SKU very rare? I'm used to T4 at that tier
| danielhanchen wrote:
| Thanks!! Oh Colab provides L4s - but the benchmarks are similar
| for T4!
|
| In fact Unsloth is the only framework afaik that fits in a T4
| for finetuning with reasonable sequence lengths!
| RandyOrion wrote:
| Hi. It's nice to see these fixes.
|
| I got a question after checking results on the open LLM
| leaderboard[1].
|
| Comparing the result of NyxKrage/Microsoft_Phi-4 and
| microsoft/phi-4 or unsloth/phi-4, I can see fixing both the
| tokenizer and chat template causes the performance of both IFEval
| and BBH to increase. However, the performance on MATH, GPQA and
| MUSR degrades A LOT.
|
| Is there any explanation on why this is happening?
|
| [1] https://huggingface.co/spaces/open-llm-
| leaderboard/open_llm_...
| danielhanchen wrote:
| Yep that is something I've been dumbfounded by as well - the
| official Microsoft Phi-4 upload also suffers on MATH, so at
| least we can rule out that it's because I did something wrong.
|
| I thought of two possibilities:
|
| 1. 509 does better on MATH but absolutely terribly on IFEVAL
| because it does not use a chat template - whilst the others do
| use the chat template.
|
| 2. I think HF uses exact matching, so maybe that's the
| culprit.
|
| I can test 1. by resubmitting without using the chat template!
| sinuhe69 wrote:
| How big is GPT4o-mini? Some sources say it's 8b big, but I guess
| they have different models with different sizes. But if
| GPT4o-mini is just 8b, I don't see the point of a "distilled"
| model, which requires a much bigger network but is still not on
| par with the original. Because it's open source?
| danielhanchen wrote:
| I think it's a MoE probably with 8b activated parameters - so
| not exactly 8b but maybe 8b x 8
|
| But on many benchmarks it does surpass GPT-4o mini - on some
| benchmarks it's even better than GPT-4o.
|
| But yes, in general it's because it's a powerful open source
| model!
| m3kw9 wrote:
| But fixing a model is the first time I've heard of that.
| danielhanchen wrote:
| Oh I do this quite a lot and tweet about it! For example I
| fixed 8 bugs in Gemma
| https://x.com/danielhanchen/status/1765446273661075609,
| multiple bugs in Llama, Mistral, Qwen, a gradient accumulation
| bug https://x.com/danielhanchen/status/1846235913443262891 etc
___________________________________________________________________
(page generated 2025-01-11 23:01 UTC)