[HN Gopher] Phi-4 Bug Fixes
       ___________________________________________________________________
        
       Phi-4 Bug Fixes
        
       Author : danielhanchen
       Score  : 183 points
       Date   : 2025-01-10 21:17 UTC (1 days ago)
        
 (HTM) web link (unsloth.ai)
 (TXT) w3m dump (unsloth.ai)
        
       | danielhanchen wrote:
        | Hey HN family! I found a few bugs in Phi-4 - Microsoft's latest
        | MIT-licensed LLM, claimed to be on par with GPT-4o mini
       | 
        | 1. The end-of-sequence (EOS) token should be <|im_end|>, not
        | <|endoftext|>
        | 
        | 2. The chat template should not auto-add an assistant prompt
        | 
        | 3. The padding token should be <|dummy_87|>, not EOS - reusing
        | EOS for padding masks out the real stop token during finetuning
       | 
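        | A rough sketch of applying fixes 1 and 3 with Hugging Face
        | transformers (the save path is just for illustration):
        | 
        |     from transformers import AutoTokenizer
        | 
        |     tok = AutoTokenizer.from_pretrained("microsoft/phi-4")
        |     tok.eos_token = "<|im_end|>"    # fix 1: EOS, not <|endoftext|>
        |     tok.pad_token = "<|dummy_87|>"  # fix 3: pad must differ from
        |                                     # EOS, else the real stop token
        |                                     # gets masked during finetuning
        |     tok.save_pretrained("phi-4-fixed")
        | 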
       | I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4bit
       | quants, dynamic quants and all fixes to
       | https://huggingface.co/unsloth
       | 
       | I also made a Colab notebook to finetune Phi-4 on a free GPU:
       | https://colab.research.google.com/github/unslothai/notebooks...
        
         | sunaookami wrote:
         | Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft
         | just doesn't care.
         | 
         | >to be on par with GPT-4o mini
         | 
          | Phi is known to overfit benchmarks. It's way, way worse than
          | that.
        
           | throwaway314155 wrote:
           | Anecdotally, I've been experimenting with Phi-4 the past hour
           | or so (so, yeah, not very comprehensive) and it's certainly a
           | strong model. Definitely better than the previous Phi models.
        
             | danielhanchen wrote:
             | Yep Phi-4 definitely is better than Phi-3.5!
        
           | danielhanchen wrote:
            | Phi-3 should be fixed as well - but yes, there were bugs
            | there too! https://x.com/danielhanchen/status/1782853167572832650
           | 
           | Phi-3's sliding window should be 2048 and not 2047, and they
           | also had chat template issues - I uploaded correct versions
           | to https://huggingface.co/unsloth/Phi-3.5-mini-instruct
        
         | simonw wrote:
         | Huh! That may explain why I kept on getting visible <|im_end|>
         | output when I tried running a Phi-4 GGUF file using llama.cpp.
        
           | danielhanchen wrote:
           | Oh yes exactly! I trimmed it out now :)
           | 
            | The better chat template should be (a single line in the
            | actual tokenizer config - shown here broken at tag boundaries
            | for readability):
            | 
            | {% for message in messages %}
            |   {% if (message['role'] == 'system') %}
            |     {{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}
            |   {% elif (message['role'] == 'user') %}
            |     {{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}
            |   {% elif (message['role'] == 'assistant') %}
            |     {{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}
            |   {% endif %}
            | {% endfor %}
            | {% if add_generation_prompt %}
            |   {{ '<|im_start|>assistant<|im_sep|>' }}
            | {% endif %}
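            | 
            | A quick sanity check of the rendering (a sketch using plain
            | transformers):
            | 
            |     from transformers import AutoTokenizer
            | 
            |     tok = AutoTokenizer.from_pretrained("unsloth/phi-4")
            |     msgs = [{"role": "user", "content": "Hello!"}]
            |     print(tok.apply_chat_template(msgs, tokenize=False,
            |                                   add_generation_prompt=True))
            |     # <|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>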
        
         | CGamesPlay wrote:
         | > We converted Phi-4 to Llama's architecture for better
         | accuracy and easier use.
         | 
         | What does this mean? When I think about "model architecture", I
         | think about the number of weights in each layer, the
         | organization of the layers, etc. And AFAIK, it's untenable to
         | "port" a model from one to the other without effectively
         | retraining it. So what does it actually mean to "convert to
         | Llama's architecture"?
        
           | Sn0wCoder wrote:
           | Would guess GGUF so you can run on llama.cpp, LM Studio,
           | etc..., but OP can hopefully clarity further for you.
        
             | danielhanchen wrote:
             | Yep converting to Llama arch definitely makes accessibility
             | much better - also many fast LLM serving libraries normally
             | support Llama, so it makes it easier to port and use!
        
           | danielhanchen wrote:
            | Oh Phi-4's architecture is inspired by Llama itself, except
            | they merged the attention matrices into 1 large matrix for
            | better FLOP utilization, and the gate/up matrices in the MLP.
           | 
            | Phi-3 used to use sliding window attention, but they got rid
            | of that in Phi-4.
           | 
           | So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch
           | (by unmerging the merges), and now you can "Llama-fy" Phi-4
           | to Llama arch.
           | 
            | The reason accuracy increases in finetuning is that during
            | LoRA finetuning you learn only 1 A matrix for the merged QKV,
            | whilst unmerging it creates 3 A matrices - this gives the
            | model more freedom to learn new features.
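            | 
            | A rough sketch of the unmerge in PyTorch (shapes assumed from
            | Phi-4's config - hidden size 5120, 40 query heads, 10 KV
            | heads of head dim 128 - treat them as illustrative):
            | 
            |     import torch
            | 
            |     hidden, n_q, n_kv, hd = 5120, 40, 10, 128
            |     q_sz, kv_sz = n_q * hd, n_kv * hd
            | 
            |     # Phi-4 stores one fused projection; Llama wants three
            |     qkv = torch.randn(q_sz + 2 * kv_sz, hidden)
            |     q, k, v = torch.split(qkv, [q_sz, kv_sz, kv_sz], dim=0)
            |     # q -> q_proj.weight, k -> k_proj.weight, v -> v_proj.weight
            |     # LoRA can now attach a separate A matrix to each one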
        
             | behnamoh wrote:
             | I know some of those words... Man, do you recommend any
             | blog/book/etc. that teaches me how to know this stuff?
             | 
             | Most books are either too low level or too high level.
        
               | danielhanchen wrote:
                | If it helps, there are a few YouTube recordings of
                | conferences and workshops I did about this stuff!
               | 
               | Low level Technicals of LLMs:
               | https://www.youtube.com/watch?v=pRM_P6UfdIc
               | 
               | CUDA / GPU Mode talk about it here:
               | https://www.youtube.com/watch?v=hfb_AIhDYnA
               | 
               | Chat with PyTorch team here:
               | https://www.youtube.com/watch?v=MQwryfkydc0
               | 
               | PyTorch Conference talk here:
               | https://www.youtube.com/watch?v=PdtKkc5jB4g
        
         | sroussey wrote:
         | Can you convert to ONNX so I can try in web browser?
        
           | sroussey wrote:
           | Would like to update this:
           | 
           | https://huggingface.co/spaces/webml-community/phi-3.5-webgpu
        
           | danielhanchen wrote:
           | Oh I can probs try doing this!
        
       | lostmsu wrote:
       | The benchmark results of the model before and after the "fixes"
       | do not match numbers reported in the model card:
       | https://huggingface.co/microsoft/phi-4
       | 
        | According to Microsoft, the MATH score should be 80.4, while
        | both the original and the "fixed" models as run by unsloth only
        | score just over 12.3. So either Microsoft made a few huge
        | mistakes, or unsloth was not able to run their model correctly.
        
         | danielhanchen wrote:
         | Oh yes I found this to be a bit strange - I uploaded our
         | versions and Microsoft's own version to Hugging Face's public
         | LLM leaderboard - https://huggingface.co/spaces/open-llm-
         | leaderboard/open_llm_...
         | 
          | You can see Microsoft's own original Phi-4 scores 12.31% - I'm
          | unsure why. My fixes at least push it to 20%.
         | 
          | It's possibly because HF's benchmark uses exact-match scoring
          | ("Was the solution generated correct and in the expected
          | format?") - which might be the issue
        
       | t1amat wrote:
       | Daniel's fixes to Phi-4 make it the best scoring Phi-4 on HF's
       | Open LLM Leaderboard. Great job on that.
       | 
       | Unsloth is a masterpiece, keep up the great work!
        
         | danielhanchen wrote:
         | Thanks a lot!
        
       | TZubiri wrote:
       | Ah yes, drawing ASCII art, the de facto benchmark for evaluating
       | LLM quality.
        
         | danielhanchen wrote:
          | The anecdotal evidence was provided to show that some Redditors
          | tested it out - but I do agree it's not the right thing to lead
          | with - so I uploaded our fixed versions to Hugging Face's
          | public LLM leaderboard here:
          | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
          | - this shows the fixes do in fact work!
        
       | make3 wrote:
       | "Yes it improves performance!" _proceeds to show the most
       | unconvincing stats ever_
       | 
       | you can probably blow on your GPU and get a similar performance
       | change
        
         | refulgentis wrote:
         | I'm sorry, I don't understand what you mean. I checked the
         | original article again too. As it stands, my understanding is
         | you are claiming:
         | 
         | - blowing on a GPU (which I take to mean doing roughly nothing)
         | 
         | - gets roughly the same perf change
         | 
         | - as moving from fp16 to q4
        
           | danielhanchen wrote:
           | Are you referring to the finetuning part?
           | 
            | The multiple bug fixes are separate from the finetuning
            | sections - Unsloth itself makes finetuning 2x faster and uses
            | 70% less memory, but the bug fixes are totally detached from
            | finetuning - ie you can take the fixed version we uploaded at
            | https://huggingface.co/unsloth/phi-4, and use it in any
            | framework or inference engine.
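            | 
            | Eg a minimal sketch of loading the fixed upload with plain
            | transformers (no Unsloth involved):
            | 
            |     from transformers import AutoModelForCausalLM, AutoTokenizer
            | 
            |     tok = AutoTokenizer.from_pretrained("unsloth/phi-4")
            |     model = AutoModelForCausalLM.from_pretrained(
            |         "unsloth/phi-4", torch_dtype="auto", device_map="auto")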
           | 
            | Apologies, I'm a bit confused by the comment, sorry.
           | 
           | If you're questioning the credibility of the bug fixes - we
           | fixed 8 bugs in Gemma
           | https://x.com/danielhanchen/status/1765446273661075609,
           | multiple bugs in Llama, Mistral, Qwen, a gradient
           | accumulation bug
           | https://x.com/danielhanchen/status/1846235913443262891 and
           | much more
        
             | grumpopotamus wrote:
             | 2x faster than what?
        
               | danielhanchen wrote:
               | Oh 2x faster and uses >70% less memory than Hugging Face
               | + Flash Attention 2! I did a CUDA / GPU Mode talk about
               | it here: https://www.youtube.com/watch?v=hfb_AIhDYnA Also
               | to the PyTorch team here:
               | https://www.youtube.com/watch?v=MQwryfkydc0 and the
               | PyTorch Conference here:
               | https://www.youtube.com/watch?v=PdtKkc5jB4g
        
               | kouteiheika wrote:
               | > Oh 2x faster and uses >70% less memory than Hugging
               | Face + Flash Attention 2!
               | 
               | Is this doing the same type of fine-tuning, or are you
               | comparing full bf16 fine-tuning in HF with 4-bit QLoRA in
               | Unsloth (in which case it's not really an apples-to-
               | apples comparison)? If it's the latter then do you have a
               | comparison of the former?
        
               | danielhanchen wrote:
               | Oh I compared 4bit QLoRA HF+FA2 with Unsloth 4bit QLoRA.
               | 
                | 16bit LoRA has similar boosts in performance!
               | 
                | Full bf16 finetuning is not yet supported, but it'll
                | come out soon!
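                | 
                | A minimal QLoRA sketch with Unsloth (hyperparameters are
                | illustrative, not a recommendation):
                | 
                |     from unsloth import FastLanguageModel
                | 
                |     model, tok = FastLanguageModel.from_pretrained(
                |         "unsloth/phi-4", max_seq_length=2048,
                |         load_in_4bit=True)
                |     model = FastLanguageModel.get_peft_model(
                |         model, r=16,
                |         target_modules=["q_proj", "k_proj", "v_proj",
                |                         "o_proj", "gate_proj", "up_proj",
                |                         "down_proj"])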
        
           | danielhanchen wrote:
           | Update - the Phi-4 team is working on adding all our fixes to
           | the original model!
           | https://huggingface.co/microsoft/phi-4/discussions/21
        
         | danielhanchen wrote:
         | I uploaded our fixed versions to
         | https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
         | which show the difference in scores.
         | 
         | I agree it's not super convincing, so I provided anecdotal
         | evidence as well - I'll work with the Phi-4 team to upstream
         | these fixes!
         | 
         | PS for further credibility, we also fixed 8 bugs in Gemma 1 -
         | see https://x.com/danielhanchen/status/1765446273661075609 ,
         | multiple bugs in Llama, Mistral, Qwen and other models
        
       | adultSwim wrote:
       | Are there alternatives to unsloth?
       | 
       | I would love to use it but the open/free version only handles one
       | GPU, and it's unclear how much the paid version would cost. I
       | have some limited access to multiple older NVidia cards and would
       | love to make better use of them while I'm still learning. My
       | budget for learning/projects is rather modest.
       | 
       | Hopefully they succeed. At work I could make a strong case for
       | going with them as they allow keeping data local only, instead of
       | relying on an API.
        
         | danielhanchen wrote:
         | Multi GPU support is definitely coming to Unsloth OSS! Our goal
         | was to release it this month, but unsure on exact timelines -
         | maybe next month!!
        
           | adultSwim wrote:
           | Thank you!
        
             | danielhanchen wrote:
             | I'll ping you when it comes along!
        
       | wsintra2022 wrote:
       | >Reddit comments show our fixes make Phi-4 inference much better
       | 
       | I'd like to try 'Reddit comments show my fixes make app better'
       | in my next review
        
         | danielhanchen wrote:
         | Fixed versions are also independently scored by Hugging Face's
         | Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-
         | leaderboard/open_llm_...
         | 
          | The Reddit LocalLlama community is actually pretty cool -
          | tonnes of research comes from the community - for example
          | kaiokendev's linear RoPE scaling, YaRN, NTK-Aware RoPE Scaling,
          | many LLM benchmarks - many researchers use LocalLlama to share
          | research and discuss new stuff.
          | 
          | I know a lot of AI researchers use the "LocalLlama vibe check",
          | which is essentially an anecdotal approach to LLM evaluation -
          | ie instead of relying on Chat LMsys or LLM benchmarks, 3rd-party
          | crowd-sourced vibe checks sometimes do much better.
        
         | danielhanchen wrote:
         | As an update - the Phi-4 team is actively working on
         | incorporating all fixes! See
         | https://huggingface.co/microsoft/phi-4/discussions/21
        
       | danielhanchen wrote:
       | Update: The Phi-4 team is actively working on adding all our
       | fixes into the original model!
       | https://huggingface.co/microsoft/phi-4/discussions/21
        
       | excerionsforte wrote:
       | Available on Ollama already:
       | https://ollama.com/vanilj/phi-4-unsloth
        
         | danielhanchen wrote:
         | Oh fabulous! :)
        
         | tandr wrote:
          | looking at the "original" Phi-4 on Ollama, it looks like they
          | have fixed the parameters issue for im_start/end
        
       | NooneAtAll3 wrote:
       | Application Error
       | 
       | TypeError: m(...).findLast is not a function
       | 
       | at L (https://unsloth.ai/assets/root-DexjOeLv.js:1:340)
       | 
       | at ia (https://unsloth.ai/assets/components-D38fXVcE.js:7:30549)
       | 
       | at Ac (https://unsloth.ai/assets/components-D38fXVcE.js:7:98661)
       | 
       | at Am (https://unsloth.ai/assets/components-D38fXVcE.js:7:94250)
       | 
       | at o0 (https://unsloth.ai/assets/components-D38fXVcE.js:7:93401)
       | 
       | at ha (https://unsloth.ai/assets/components-D38fXVcE.js:7:93212)
       | 
       | at Mm (https://unsloth.ai/assets/components-D38fXVcE.js:7:90555)
       | 
       | at Om (https://unsloth.ai/assets/components-D38fXVcE.js:7:89963)
       | 
       | at MessagePort.M
       | (https://unsloth.ai/assets/components-D38fXVcE.js:1:11235
        
         | danielhanchen wrote:
            | Sorry, are there some issues with our website?
        
           | NooneAtAll3 wrote:
           | yep, it appears for a second - then displays only this :(
        
             | danielhanchen wrote:
             | Oh no :( Do you know which device / platform?
        
       | greensh wrote:
        | Microsoft developed and trained Phi-4. How can there be bugs in
        | their official implementation? Does this mean they trained and
        | evaluated it on their own completely different code and then
        | ported it to the huggingface library for compatibility?
        
         | danielhanchen wrote:
          | The chat template adding an assistant prompt by default, for
          | example, is also shown in the technical report - so they did
          | this during training. The issue is inference workloads should
          | not have this, otherwise they might inadvertently append extra
          | assistant prompts or forget about it - hence I removed it.
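          | 
          | A toy sketch of the failure mode (plain strings, just to show
          | the doubled header):
          | 
          |     # template that always appends the assistant header:
          |     rendered = ("<|im_start|>user<|im_sep|>Hi<|im_end|>"
          |                 "<|im_start|>assistant<|im_sep|>")
          |     # a framework unaware of this adds its own generation prompt:
          |     rendered += "<|im_start|>assistant<|im_sep|>"
          |     # -> two assistant headers back to back, confusing the model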
         | 
          | The rest I'm not sure about - for eg the EOS token should be
          | <|im_end|> and not <|endoftext|> - it could be a small mistake
        
       | dorian-graph wrote:
        | These seem like amazingly egregious mistakes for MS to have made?
        | Or is it not as bad as it seems? I suppose I'm curious how these
        | kinds of mistakes happen for a model release.
        
         | danielhanchen wrote:
         | This happens quite often sadly - for example I fixed 8 bugs in
         | Gemma https://x.com/danielhanchen/status/1765446273661075609,
         | multiple bugs in Llama, Mistral, Qwen, a gradient accumulation
         | bug https://x.com/danielhanchen/status/1846235913443262891 etc
         | 
         | I wouldn't blame model training teams - sadly it's relatively
         | hard to coordinate large teams so it might have been
         | overlooked.
         | 
         | But hey - I'm always here to fix them up :))
        
       | c1b wrote:
        | daniel you're a legend, thanks for all you do!
        | 
        | one question, I see perf comparisons here are done on an L4, but
        | isn't this SKU very rare? I'm used to T4 at that tier
        
         | danielhanchen wrote:
          | Thanks!! Oh Colab provides L4s - but the benchmarks are similar
          | for T4!
          | 
          | In fact Unsloth is the only framework afaik that fits in a T4
          | for finetuning with reasonable sequence lengths!
        
       | RandyOrion wrote:
       | Hi. It's nice to see these fixes.
       | 
       | I got a question after checking results on the open LLM
       | leaderboard[1].
       | 
       | Comparing the result of NyxKrage/Microsoft_Phi-4 and
       | microsoft/phi-4 or unsloth/phi-4, I can see fixing both the
       | tokenizer and chat template causes the performance of both IFEval
       | and BBH to increase. However, the performance on MATH, GPQA and
       | MUSR degrades A LOT.
       | 
       | Is there any explanation on why this is happening?
       | 
       | [1] https://huggingface.co/spaces/open-llm-
       | leaderboard/open_llm_...
        
         | danielhanchen wrote:
          | Yep that is something I've been dumbfounded by as well - the
          | official Microsoft Phi-4 upload also suffers on MATH, so at
          | least we can rule out that it's because I did something wrong.
         | 
         | I thought of two possibilities:
         | 
          | 1. 509 does better on MATH but absolutely terribly on IFEVAL
          | because it does not use a chat template - whilst the others do
          | use the chat template.
          | 
          | 2. I think HF uses exact matching, so maybe that's the
          | culprit.
         | 
         | I can test 1. by resubmitting without using the chat template!
        
       | sinuhe69 wrote:
       | How big is GPT4o-mini? Some sources say it's 8b big, but I guess
       | they have different models with different sizes. But if
        | GPT4o-mini is just 8b, I don't see the point of a "distilled"
        | model, which requires a much bigger network but is still not on
        | par with the original. Because it's open source?
        
         | danielhanchen wrote:
         | I think it's a MoE probably with 8b activated parameters - so
         | not exactly 8b but maybe 8b x 8
         | 
          | But on many benchmarks it does surpass GPT 4o mini - on some
          | benchmarks it even beats GPT 4o.
         | 
          | But yes, in general it's because it's a powerful open-source
          | model!
        
       | m3kw9 wrote:
        | But fixing a model is the first time I've heard of such a thing.
        
         | danielhanchen wrote:
         | Oh I do this quite a lot and tweet about it! For example I
         | fixed 8 bugs in Gemma
         | https://x.com/danielhanchen/status/1765446273661075609,
         | multiple bugs in Llama, Mistral, Qwen, a gradient accumulation
         | bug https://x.com/danielhanchen/status/1846235913443262891 etc
        
       ___________________________________________________________________
       (page generated 2025-01-11 23:01 UTC)