[HN Gopher] Phi-4: Microsoft's Newest Small Language Model Speci...
       ___________________________________________________________________
        
       Phi-4: Microsoft's Newest Small Language Model Specializing in
       Complex Reasoning
        
       Author : lappa
       Score  : 409 points
       Date   : 2024-12-13 01:54 UTC (3 days ago)
        
 (HTM) web link (techcommunity.microsoft.com)
 (TXT) w3m dump (techcommunity.microsoft.com)
        
       | parmesean wrote:
       | 13.8 epochs of the benchmarks?
        
       | xeckr wrote:
       | Looks like it punches way above its weight(s).
       | 
       | How far are we from running a GPT-3/GPT-4 level LLM on regular
       | consumer hardware, like a MacBook Pro?
        
         | lappa wrote:
         | It's easy to argue that Llama-3.3 8B performs better than
         | GPT-3.5. Compare their benchmarks, and try the two side-by-
         | side.
         | 
         | Phi-4 is yet another _step_ towards a small, open, GPT-4 level
          | model. I think we're getting quite close.
         | 
         | Check the benchmarks comparing to GPT-4o on the first page of
         | their technical report if you haven't already
         | https://arxiv.org/pdf/2412.08905
        
           | vulcanash999 wrote:
           | Did you mean Llama-3.1 8B? Llama 3.3 currently only has a 70B
           | model as far as I'm aware.
        
         | anon373839 wrote:
         | We're already past that point! MacBooks can easily run models
         | exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 8B, or Gemma
         | 2 9B. These models run at very comfortable speeds on Apple
         | Silicon. And they are distinctly more capable and less prone to
         | hallucination than GPT-3.5 was.
         | 
         | Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to
         | GPT-4, and they will run on MacBook Pros with at least 64GB of
         | RAM. However, I have an M3 Max and I can't say that models of
         | this size run at comfortable speeds. They're a bit sluggish.
        
           | noman-land wrote:
           | The coolness of local LLMs is THE only reason I am sadly
           | eyeing upgrading from M1 64GB to M4/5 128+GB.
        
             | Terretta wrote:
             | Compare performance on various Macs here as it gets
             | updated:
             | 
             | https://github.com/ggerganov/llama.cpp/discussions/4167
             | 
             | OMM, Llama 3.3 70B runs at ~7 text generation tokens per
             | second on Macbook Pro Max 128GB, while generating GPT-4
             | feeling text with more in depth responses and fewer
             | bullets. Llama 3.3 70B also doesn't fight the system
             | prompt, it leans in.
             | 
             | Consider e.g. LM Studio (0.3.5 or newer) for a Metal (MLX)
             | centered UI, include MLX in your search term when
             | downloading models.
             | 
             | Also, do not scrimp on the storage. At 60GB - 100GB per
             | model, it takes a day of experimentation to use 2.5TB of
             | storage in your model cache. And remember to exclude that
              | path from your Time Machine backups.
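              | 
              | For the Time Machine exclusion, something like this should do
              | it (the path is just an example -- point it at wherever your
              | models actually live):
              | 
              |     tmutil addexclusion ~/.cache/lm-studio/models
              | 
              | tmutil is macOS's built-in Time Machine CLI; addexclusion
              | marks that directory so backups skip it.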
        
               | noman-land wrote:
               | Thank you for all the tips! I'd probably go 128GB 8TB
               | because of masochism. Curious, what makes so many of the
                | M4s in the red currently?
        
               | vessenes wrote:
               | It's all memory bandwidth related -- what's slow is
               | loading these models into memory, basically. The last die
                | from Apple with _all_ the channels was the M2 Ultra, and
                | I bet that's what tops those leaderboards. M4 has not
               | had a Max or an Ultra release yet; when it does (and it
               | seems likely it will), those will be the ones to get.
        
               | ant6n wrote:
                | What if you have a MacBook Air with 16GB? (The benchmarks
                | don't seem to show memory.)
        
               | simonw wrote:
               | You could definitely run an 8B model on that, and some of
               | those are getting very capable now.
               | 
               | The problem is that often you can't run anything else.
               | I've had trouble running larger models in 64GB when I've
               | had a bunch of Firefox and VS Code tabs open at the same
               | time.
        
               | xdavidliu wrote:
               | I thought VSCode was supposed to be lightweight, though I
               | suppose with extensions it can add up
        
               | chris_st wrote:
               | I have a M2 Air with 24GB, and have successfully run some
               | 12B models such as mistral-nemo. Had other stuff going as
               | well, but it's best to give it as much of the machine as
               | possible.
        
               | gcanyon wrote:
               | I recently upgraded to exactly this machine for exactly
               | this reason, but I haven't taken the leap and installed
               | anything yet. What's your favorite model to run on it?
        
               | evilduck wrote:
               | 8B models with larger contexts, or even 9-14B parameter
               | models quantized.
               | 
                | Qwen2.5 Coder 14B at a 4-bit quantization could run, but
               | you will need to be diligent about what else you have in
               | memory at the same time.
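                | 
                | E.g. with Ollama, something along these lines (tag from
                | memory -- check the library page for the exact quantization
                | you want):
                | 
                |     ollama run qwen2.5-coder:14b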
        
             | stkdump wrote:
             | I bought an old used desktop computer, a used 3090, and
             | upgraded the power supply, all for around 900EUR. Didn't
             | assemble it all yet. But it will be able to comfortably run
             | 30B parameter models with 30-40 T/s. The M4 Max can do ~10
             | T/s, which is not great once you really want to rely on it
             | for your productivity.
             | 
             | Yes, it is not "local" as I will have to use the internet
             | when not at home. But it will also not drain the battery
             | very quickly when using it, which I suspect would happen to
             | a Macbook Pro running such models. Also 70B models are out
             | of reach of my setup, but I think they are painfully slow
             | on Mac hardware.
        
             | jazzyjackson wrote:
             | I'm returning my 96GB m2 max. It can run unquantized llama
             | 3.3 70B but tokens per second is slow as molasses and still
             | I couldn't find any use for it, just kept going back to
             | perplexity when I actually needed to find an answer to
             | something.
        
             | alecco wrote:
             | I'm waiting for next gen hardware. All the companies are
             | aiming for AI acceleration.
        
             | kleiba wrote:
             | Sorry, I'm not up to date, but can you run GPTs locally or
             | only vanilla LLMs?
        
           | noodletheworld wrote:
           | > MacBooks can easily run models exceeding GPT-3.5, such as
           | Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B.
           | 
           | gpt-3.5-turbo is generally considered to be about 20B params.
           | An 8B model does not exceed it.
           | 
           | > Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to
           | GPT-4
           | 
           | I'm skeptical; the llama 3.1 405B model is the only
           | comparable model I've used, and it's _significantly_ larger
           | than the 70B models you can run locally.
           | 
            | The 405B model takes a bit of effort to run [1], and your
            | average MacBook doesn't ship with 128GB of RAM, but
            | _technically_ yes, if you get the max-spec M4 Max + 128GB of
            | unified RAM, you can run it.
            | 
            | ...and it's similar, but (see again, the AI leaderboards) not
            | as good as what you can get from GPT-4o.
           | 
           | > How far are we from running a GPT-3/GPT-4 level LLM on
           | regular consumer hardware, like a MacBook Pro?
           | 
           | Is a $8000 MBP regular consumer hardware? If you don't think
           | so, then the answer is probably no.
           | 
           | Lots of research into good smaller models is going on right
            | now, but _right now_, they are not comparable to the larger
           | models in terms of quality-of-output.
           | 
           | [1] - https://medium.com/@aleksej.gudkov/how-to-run-
           | llama-405b-a-c...
        
             | runako wrote:
             | > Is a $8000 MBP regular consumer hardware?
             | 
             | May want to double-check your specs. 16" w/128GB & 2TB is
             | $5,400.
        
             | anon373839 wrote:
             | > gpt-3.5-turbo is generally considered to be about 20B
             | params. An 8B model does not exceed it.
             | 
             | The industry has moved on from the old Chinchilla scaling
             | regime, and with it the conviction that LLM capability is
             | mainly dictated by parameter count. OpenAI didn't disclose
             | how much pretraining they did for 3.5-Turbo, but GPT 3 was
             | trained on 300 billion tokens of text data. In contrast,
             | Llama 3.1 was trained on _15 trillion_ tokens of data.
             | 
             | Objectively, Llama 3.1 8B and other small models have
             | exceeded GPT-3.5-Turbo in benchmarks and human preference
             | scores.
             | 
             | > Is a $8000 MBP regular consumer hardware?
             | 
             | As user `bloomingkales` notes down below, a $499 Mac Mini
             | can run 8B parameter models. An $8,000 expenditure is not
             | required.
        
             | tosh wrote:
             | A Mac with 16GB RAM can run qwen 7b, gemma 9b and similar
             | models that are somewhere between GPT3.5 and GPT4.
             | 
             | Quite impressive.
        
               | jazzyjackson wrote:
               | on what metric?
               | 
               | Why would OpenAI bother serving GPT4 if customers would
               | be just as happy with a tiny 9B model?
        
               | tosh wrote:
               | https://lmarena.ai/
               | 
               | Check out the lmsys leaderboard. It has an overall
               | ranking as well as ranking for specific categories.
               | 
                | OpenAI are also serving GPT-4o mini. That said, AFAIU it's
                | not known how large/small mini is.
               | 
               | Being more useful than GPT3.5 is not a high bar anymore.
        
               | simonw wrote:
               | Don't confuse GPT-4 and GPT-4o.
               | 
               | GPT-4o is a much better experience than the smaller local
               | models. You can see that in the lmarena benchmarks or
               | from trying them out yourself.
        
             | PhilippGille wrote:
             | >> Llama 3.3 70B and Qwen 2.5 72B are certainly comparable
             | to GPT-4
             | 
             | > I'm skeptical; the llama 3.1 405B model is the only
             | comparable model I've used, and it's significantly larger
             | than the 70B models you can run locally.
             | 
              | Every new Llama generation has managed to beat larger models
              | from the previous generation with smaller ones.
             | 
             | Check Kagi's LLM benchmark:
             | https://help.kagi.com/kagi/ai/llm-benchmark.html
             | 
             | Check the HN thread around the 3.3 70b release:
             | https://news.ycombinator.com/item?id=42341388
             | 
             | And their own benchmark results in their model card:
             | https://github.com/meta-llama/llama-
             | models/blob/main/models%...
             | 
             | Groq's post about it: https://groq.com/a-new-scaling-
             | paradigm-metas-llama-3-3-70b-...
             | 
             | Etc
        
               | int_19h wrote:
               | They still do not beat GPT-4, however.
               | 
               | And benchmarks are very misleading in this regard. We've
               | seen no shortage of even 8B models claiming that they
               | beat GPT-4 and Claude in benchmarks. Every time this
               | happens, once you start actually using the model, it's
               | clear that it's not actually on par.
        
               | simonw wrote:
               | GPT-4 from March 2023, not GPT-4o from May 2024.
        
             | zozbot234 wrote:
             | > Is a $8000 MBP regular consumer hardware? If you don't
             | think so, then the answer is probably no.
             | 
              | The very first Apple Macintosh was not far from that price
             | at its release. Adjusted for inflation of course.
        
           | kgeist wrote:
           | >MacBooks can easily run models exceeding GPT-3.5, such as
           | Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B.
           | 
           | If only those models supported anything other than English
        
             | simonw wrote:
             | Llama 3.1 8B advertises itself as multilingual.
             | 
             | All of the Qwen models are basically fluent in both English
             | and Chinese.
        
               | kgeist wrote:
               | Llama 8B is multilingual on paper, but the quality is
               | very bad compared to English. It generally understands
               | grammar, and you can understand what it's trying to say,
               | but the choice of words is very off most of the time,
               | often complete gibberish. If you can imagine the output
               | of an undertrained model, this is it. Meanwhile GPT3.5
               | had far better output that you could use in production.
        
             | barrell wrote:
             | Cohere just announced Command R7B. I haven't tried it yet
             | but their larger models are the best multilingual models
             | I've used
        
             | numpad0 wrote:
              | Is the subtext to this uncensored Chinese support?
        
         | simonw wrote:
         | We're there. Llama 3.3 70B is GPT-4 level and runs on my 64GB
         | MacBook Pro: https://simonwillison.net/2024/Dec/9/llama-33-70b/
         | 
         | The Qwen2 models that run on my MacBook Pro are GPT-4 level
         | too.
        
           | BoorishBears wrote:
           | Saying these models are at GPT-4 level is setting anyone who
           | doesn't place special value on the local aspect up for
           | disappointment.
           | 
           | Some people do place value on running locally, and I'm not
            | against them for it, but realistically no 70B-class model has
           | the amount of general knowledge or understanding of nuance as
           | any recent GPT-4 checkpoint.
           | 
            | That being said, these models are still very strong compared
            | to what we had a year ago and capable of useful work.
        
             | simonw wrote:
             | I said GPT-4, not GPT-4o. I'm talking about a model that
             | feels equivalent to the GPT-4 we were using in March of
             | 2023.
        
               | int_19h wrote:
               | I remember using GPT-4 when it first dropped to get a
               | feeling of its capabilities, and no, I wouldn't say that
               | llama-3.3-70b is comparable.
               | 
               | At the end of the day, there's only so much you can cram
               | into any given number of parameters, regardless of what
               | any artificial benchmark says.
        
               | simonw wrote:
               | I envy your memory.
        
           | n144q wrote:
           | I wouldn't call 64GB MacBook Pro "regular consumer hardware".
        
             | jsheard wrote:
             | Yeah, a computer which starts at $3900 is really stretching
             | that classification. Plus if you're that serious about
             | local LLMs then you'd probably want the even bigger RAM
             | option, which adds another $800...
        
               | evilduck wrote:
               | An optioned up minivan is also expensive but doesn't cost
               | as much as a firetruck. It's expensive but still very
               | much consumer hardware. A 3x4090 rig is more expensive
                | and still consumer hardware. An H100 is not; you could buy
                | like 7 of these optioned up MBPs for the price of a single
                | H100.
        
               | michaelt wrote:
               | In my experience, people use the term in two separate
               | ways.
               | 
               | If I'm running a software business selling software that
               | runs on 'consumer hardware' the more people can run my
               | software, the more people can pay me. For me, the term
                | means the hardware used by a _typical-ish_ consumer. I'll
                | check the Steam hardware survey, find the 75th-
               | percentile gamer has 8 cores, 32GB RAM, 12GB VRAM - and
               | I'd better make sure my software works on a machine like
               | that.
               | 
               | On the other hand, 'consumer hardware' could also be used
               | to simply mean hardware available off-the-shelf from
               | retailers who sell to consumers. By this definition,
               | 128GB of RAM is 'consumer hardware' even if it only
               | counts as 0.5% in Steam's hardware survey.
        
               | evilduck wrote:
               | On the Steam Hardware Survey the average gamer uses a
               | computer with a 1080p display too. That doesn't somehow
               | make any gaming laptop with a 2k screen sold in the last
               | half decade a non-consumer product. For that matter the
               | average gaming PC on Steam is even above average relative
               | to the average computer. The typical office computer or
                | school Chromebook is likely several generations older and
                | doesn't have an NPU or discrete GPU at all.
               | 
               | For AI and LLMs, I'm not aware of any company even
                | selling the model assets directly to consumers; they're
                | either completely unavailable (OpenAI) or freely licensed,
                | so the companies training them aren't really dependent on
                | what the average person has for commercial success.
        
               | criddell wrote:
                | In the early '80s, people were spending more than $3k for
                | an IBM 5150. For that price you got 64 kB of RAM, a floppy
                | drive, and a monochrome monitor.
               | 
               | Today, lots of people spend far more than that for gaming
               | PCs. An Alienware R16 (unquestionably a consumer PC) with
               | 64 GB of RAM starts at $4700.
               | 
               | It _is_ an expensive computer, but the best mainstream
               | computers at any particular time have always cost between
               | $2500 and $5000.
        
             | russellbeattie wrote:
             | I have to disagree. I understand it's very expensive, but
             | it's still a consumer product available to anyone with a
             | credit card.
             | 
             | The comparison is between something you can buy off the
             | shelf like a powerful Mac, vs something powered by a Grace
             | Hopper CPU from Nvidia, which would require both lots of
             | money and a business relationship.
             | 
             | Honestly, people pay $4k for nice TVs, refrigerators and
             | even couches, and those are not professional tools by any
             | stretch. If LLMs needed a $50k Mac Pro with maxed out
             | everything, that might be different. But anything that's a
             | laptop is definitely regular consumer hardware.
        
               | PhunkyPhil wrote:
                | There have definitely been plenty of sources of hardware
                | capable of running LLMs out there for a while, Mac or
               | not. A couple 4090s or P40s will run 3.1 70b. Or, since
               | price isn't a limit, there are other easier & more
               | powerful options like a [tinybox](https://tinygrad.org/#t
               | inybox:~:text=won%27t%20be%20consider...).
        
         | refulgentis wrote:
         | We're there, Llama 3.1 8B beats Gemini Advanced for $20/month.
         | Telosnex with llama 3.1 8b GGUF from bartowski.
         | https://telosnex.com/compare/ (How!? tl;dr: I assume Google is
         | sandbagging and hasn't updated the underlying Gemini)
        
         | bloomingkales wrote:
         | M4 Mac mini 16gb for $500. It's literally an inferencing block
         | (small too, fits in my palm). I feel like the whole world needs
         | one.
        
           | alganet wrote:
           | > inferencing block
           | 
           | Did you mean _external gpu_?
           | 
           | Choose any 12GB or more video card with GDDR6 or superior and
           | you'll have at least double the performance of a base m4
           | mini.
           | 
           | The base model is almost an older generation. Thunderbolt 4
           | instead of 5, slower bandwidths, slower SSDs.
        
             | kgwgk wrote:
             | > you'll have at least double the performance of a base m4
             | mini
             | 
             | For $500 all included?
        
               | alganet wrote:
               | The base mini is 599.
               | 
               | Here's a config for around the same price. All brand new
                | parts for $573. You can spend the difference improving any
                | part you wish, or maybe get a used 3060 and go AM5
               | instead (Ryzen 8400F). Both paths are upgradeable.
               | 
               | https://pcpartpicker.com/list/ftK8rM
               | 
               | Double the LLM performance. Half the desktop performance.
               | But you can use both at the same time. Your computer will
               | not slow down when running inference.
        
               | bloomingkales wrote:
               | That's a really nice build.
        
               | alganet wrote:
               | Another possible build is to use a mini-pc and M.2
               | connections
               | 
               | You'll need a mini-pc with two M.2 slots, like this:
               | 
               | https://www.amazon.com/Beelink-SER7-7840HS-Computer-
               | Display/...
               | 
               | And a riser like this:
               | 
               | https://www.amazon.com/CERRXIAN-Graphics-Left-PCI-
               | Express-Ex...
               | 
               | And some courage to open it and rig the stuff in.
               | 
               | Then you can plug a GPU on it. It should have decent load
               | times. Better than an eGPU, worse than the AM4 desktop
               | build, fast enough to beat the M4 (once the data is in
               | the GPU, it doesn't matter).
               | 
               | It makes for a very portable setup. I haven't built it,
               | but I think it's a reasonable LLM choice comparable to
               | the M4 in speed and portability while still being
               | upgradable.
               | 
               | Edit: and you'll need an external power supply of at
               | least 400W:)
        
         | ActorNightly wrote:
         | Why would you want to though? You already can get free access
         | to large LLMs and nobody is doing anything groundbreaking with
         | them.
        
           | jckahn wrote:
           | I only use local, open source LLMs because I don't trust
           | cloud-based LLM hosts with my data. I also don't want to
           | build a dependence on proprietary technology.
        
       | simonw wrote:
       | The most interesting thing about this is the way it was trained
       | using synthetic data, which is described in quite a bit of detail
       | in the technical report: https://arxiv.org/abs/2412.08905
       | 
       | Microsoft haven't officially released the weights yet but there
       | are unofficial GGUFs up on Hugging Face already. I tried this
       | one: https://huggingface.co/matteogeniaccio/phi-4/tree/main
       | 
        | I got it working with my LLM tool like this:
        | 
        |     llm install llm-gguf
        |     llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
        |     llm chat -m gguf/phi-4-Q4_K_M
       | 
       | Here are some initial transcripts:
       | https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...
       | 
       | More of my notes on Phi-4 here:
       | https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
        
         | syntaxing wrote:
         | Wow, those responses are better than I expected. Part of me was
         | expecting terrible responses since Phi-3 was amazing on paper
         | too but terrible in practice.
        
           | refulgentis wrote:
           | One of the funniest tech subplots in recent memory.
           | 
           | TL;DR it was nigh-impossible to get it to emit the proper
           | "end of message" token. (IMHO the chat training was too
           | rushed). So all the local LLM apps tried silently hacking
           | around it. The funny thing to me was _no one_ would say it
            | out loud. The field isn't very consumer friendly, yet.
        
             | TeMPOraL wrote:
             | Speaking of, I wonder if and how many of the existing
             | frontends, interfaces and support packages that generalize
             | over multiple LLMs, and include Anthropic, actually know
             | how to prompt it correctly. Seems like most developers
             | missed the memo on
             | https://docs.anthropic.com/en/docs/build-with-
              | claude/prompt-..., and I regularly end up in situations in
             | which I wish they gave more minute control on how the
             | request is assembled (proprietary), and/or am considering
             | gutting the app/library myself (OSS; looking at you,
             | Aider), just to have file uploads, or tools, or whatever
             | other smarts the app/library does, encoded in a way that
             | uses Claude to its full potential.
             | 
             | I sometimes wonder how many other model or vendor-specific
             | improvements there are, that are missed by third-party
             | tools despite being well-documented by the vendors.
        
               | refulgentis wrote:
               | Hah, good call out: there was such a backlash and quick
               | turnaround on Claude requiring XML tool calls, I think
               | people just sort of forgot about it altogether.
               | 
               | You might be interested in Telosnex, been working on it
               | for ~year and it's in good shape and is more or less
               | designed for this sort of flexibility / allowing user
               | input into requests. Pick any* provider, write up your
               | own canned scripts, with incremental complexity: ex. your
               | average user would just perceive it as "that AI app with
               | the little picker for search vs. chat vs. art"
               | 
               | * OpenAI, Claude, Mistral, Groq Llama 3.x, and one I'm
               | forgetting....Google! And .gguf
        
             | regularfry wrote:
             | In a field like this the self-doubt of "surely it wouldn't
             | be this broken, I must just be holding it wrong" is strong.
        
         | lifeisgood99 wrote:
         | The SVG created for the first prompt is valid but is a garbage
         | image.
        
           | simonw wrote:
           | Yeah, it didn't do very well on that one. The best I've had
           | from a local model there was from QwQ:
           | https://simonwillison.net/2024/Nov/27/qwq/
        
           | refulgentis wrote:
           | For context, pelican riding a bicycle:
           | https://imgur.com/a/2nhm0XM
           | 
           | Copied SVG from gist into figma, added dark gray #444444
           | background, exported as PNG 1x.
        
           | bentcorner wrote:
           | In general I've had poor results with LLMs generating
           | pictures using text instructions (in my case I've tried to
           | get them to generate pictures using plots in KQL). They work
           | but the pictures are very very basic.
           | 
            | I'd be interested in any LLM emitting any kind of text-to-
            | picture instructions that gets results beyond
            | kindergartner-cardboard-cutout levels of art.
        
             | simonw wrote:
             | That's why I use the SVG pelican riding a bicycle thing as
             | a benchmark: it's a deliberately absurd and extremely
             | difficult task.
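              | 
              | With the llm setup mentioned elsewhere in this thread it's
              | just something like (prompt wording approximate):
              | 
              |     llm -m gguf/phi-4-Q4_K_M "Generate an SVG of a pelican riding a bicycle"
              | 
              | then paste the SVG it returns into any viewer.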
        
               | Teever wrote:
               | I'm really glad that I see someone else doing something
               | similar. I had the epiphany a while ago that if LLMs can
               | interpret textual instructions to draw a picture and
                | output the design in another textual format, then this is
                | a strong indicator that they're more than just stochastic
               | parrots.
               | 
               | My personal test has been "A horse eating apples next to
               | a tree" but the deliberate absurdity of your example is a
               | much more useful test.
               | 
               | Do you know if this is a recognized technique that people
               | use to study LLMs?
        
               | simonw wrote:
               | I've seen people using "draw a unicorn using tikz"
               | https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-
               | exploratio...
        
               | girvo wrote:
               | That came, IIRC, from one of the OpenAI or Microsoft
               | people (Sebastian Bubeck); it was recounted in an NPR
               | podcast "Greetings from Earth"
               | 
               | https://www.thisamericanlife.org/803/transcript
        
               | krackers wrote:
               | It's in this presentation
               | https://www.youtube.com/watch?v=qbIk7-JPB2c
               | 
               | The most significant part I took away is that when safety
               | "alignment" was done the ability plummeted. So that
               | really makes me wonder how much better these models would
               | be if they weren't lobotomized to prevent them from
               | saying bad words.
        
               | int_19h wrote:
               | I did some experiments of my own after this paper, but
               | letting GPT-4 run wild, picking its own scene. It wanted
               | to draw a boat on a lake, and I also asked it to throw in
               | some JS animations, so it made the sun set:
               | 
               | https://int19h.org/chatgpt/lakeside/index.html
               | 
               | One interesting thing that I found out while doing this
               | is that if you ask GPT-4 to produce SVG suitable for use
               | in HTML, it will often just generate base64-encoded data:
               | URIs directly. Which do contain valid SVG inside as
               | requested.
        
               | memhole wrote:
                | Not sure if this counts. I recently went from a description
                | of a screenshot of a graph to generated pandas code and a
                | plot from that description. Conceptually it was accurate.
               | 
               | I don't think it reflects any understanding. But to go
               | from screenshot to conceptually accurate and working code
               | was impressive.
        
               | MyFirstSass wrote:
               | But how will that prove that it's more than a stochastic
               | parrot, honestly curious?
               | 
               | Isn't it just like any kind of conversion or translation?
                | I.e. a relationship mapping between different domains and
               | just as much parroting "known" paths between parts of
               | different domains?
               | 
               | If "sun" is associated with "round", "up high",
               | "yellow","heat" in english that will map to those things
               | in SVG or in whatever bizarre format you throw at with
               | relatively isomorphic paths existing there just knitted
               | together as a different metamorphosis or cluster of
               | nodes.
               | 
               | On a tangent it's interesting what constitutes the
               | heaviest nodes in the data, how shared is "yellow" or "up
               | high" between different domains, and what is above and
               | below them hierarchically weight-wise. Is there a
               | heaviest "thing in the entire dataset"?
               | 
               | If you dump a heatmap of a description of the sun and an
               | SVG of a sun - of the neuron / axon like cloud of data in
               | some model - would it look similar in some way?
        
               | sabbaticaldev wrote:
               | that's a huge stretch for parroting
        
               | accrual wrote:
               | Appreciate your rapid analysis of new models, Simon. Have
               | any models you've tested performed well on the pelican
               | SVG task?
        
               | simonw wrote:
               | gemini-exp-1206 is my new favorite:
               | https://simonwillison.net/2024/Dec/6/gemini-exp-1206/
               | 
               | Claude 3.5 Sonnet is in second place:
               | https://github.com/simonw/pelican-bicycle?tab=readme-ov-
               | file...
        
               | accrual wrote:
               | The Gemini result is quite impressive, thanks for sharing
               | these!
        
               | codedokode wrote:
               | They probably trained it for this specific task
               | (generating SVG images), right?
        
               | simonw wrote:
               | I'm hoping that nobody has deliberately trained on SVG
               | images of pelicans riding bicycles yet.
        
             | pizza wrote:
             | I do with Claude:
             | https://news.ycombinator.com/item?id=42351796#42355665
        
           | chen_dev wrote:
           | Amazon Nova models:
           | 
           | https://gist.github.com/uschen/38fc65fa7e43f5765a584c6cd24e1.
           | ..
        
         | Havoc wrote:
          | > Microsoft haven't officially released the weights
          | 
          | Thought it was official, just not on Hugging Face but rather on
          | whatever Azure competitor thing they're pushing?
        
           | simonw wrote:
           | I found their AI Foundry thing so hard to figure out I
           | couldn't tell if they had released weights (as opposed to a
           | way of running it via an API).
           | 
            | Since there are GGUFs now, someone must have released some
           | weights somewhere.
        
             | Havoc wrote:
             | Yeah the weights were on there apparently.
             | 
             | Planned week delay between release on their own platform
             | and hf
             | 
             | But much like you I decided I can be patient / use the
             | ggufs
        
             | lhl wrote:
             | The safetensors are in the phi-4 folder of the very repo
             | you linked in your OP.
        
         | fisherjeff wrote:
         | Looks like someone's _finally_ caught up with The Hallmark
         | Channel's LLM performance
        
         | algo_trader wrote:
         | > More of my notes on Phi-4 here:
         | https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
         | 
         | Nice. Thanks.
         | 
         | Do you think sampling the stack traces of millions of machines
         | is a good dataset for improving code performance? Maybe sample
         | android/jvm bytecode.
         | 
         | Maybe a sort of novelty sampling to avoid re-sampling hot-path?
        
         | mirekrusin wrote:
         | This "draw pelican riding on bicycle" is quite deep if you
         | think about it.
         | 
         | Phi is all about synthetic training and prompt -> svg -> render
         | -> evaluate image -> feedback loop feels like ideal fit for
         | synthetic learning.
         | 
         | You can push it quite far with stuff like basic 2d physics etc
         | with plotting scene after N seconds or optics/rays, magnetic
         | force etc.
         | 
         | SVG as LLM window to physical world.
        
           | dartos wrote:
           | > SVG as LLM window to physical world.
           | 
            | What? Let's try not to go full forehead into hype.
           | 
           | SVGs would be an awfully poor analogy for the physical
           | world...
        
             | ben_w wrote:
             | SVGs themselves are just an image format; but because of
             | their vector nature, they could easily be mapped onto
             | values from a simulation in a physics engine -- at least,
             | in the game physics sense of the word, rods and springs
             | etc., as a fluid simulation is clearly a better map to
             | raster formats.
             | 
             | If that physics engine were itself a good model for the
             | real world, then you could do simulated evolution to get an
             | end result that is at least as functional as a bike (though
             | perhaps it wouldn't look like a traditional bike) even if
             | the only values available to the LLM were the gross
             | characteristics like overall dimensions and mass.
             | 
             | But I'd say the chance of getting a pelican SVG out of a
             | model like this is mostly related to lots of text
             | describing the anatomy of pelicans, and it would not gain
             | anything from synthetic data.
        
               | dartos wrote:
               | > but because of their vector nature, they could easily
               | be mapped onto values from a simulation in a physics
               | engine.
               | 
               | I don't think the fact that the images are described with
               | vectors magically makes it better for representing
               | physics than any other image representation. Maybe less
               | so, since there will be so much textual information not
               | related to the physical properties of the object.
               | 
               | What about them makes it easier to map to physics than an
               | AABB?
               | 
                | For soft body physics, I'm pretty sure a simpler sort of
               | distance field representation would even be better. (I'm
               | not as familiar with soft body as rigid body)
        
               | ben_w wrote:
               | For rendering them, more than for anything else. There's
               | a convenient 1-to-1 mapping in both directions.
               | 
               | You can of course just rasterise the vector for output,
               | it's not like people view these things on oscilloscopes.
        
         | tkellogg wrote:
         | I added Phi-4 to my reasoning model collection because it seems
          | to exhibit reasoning behavior: it stopped to consider
         | alternatives before concluding. I assume this is related to
         | their choice in training data:
         | 
         | > Chain-of-Thought: Data should encourage systematic reasoning,
         | teaching the model various approaches to the problems in a
         | step-by-step manner.
         | 
         | https://github.com/tkellogg/lrm-reasoning/blob/main/phi4.md
        
         | belter wrote:
         | > it was trained using synthetic data
         | 
         | Is this not supposed to cause Model collapse?
        
           | fulafel wrote:
           | No.
        
             | belter wrote:
             | Is this paper wrong? - https://arxiv.org/abs/2311.09807
        
               | simonw wrote:
               | It shows that if you deliberately train LLMs against
               | their own output in a loop you get problems. That's not
               | what synthetic data training does.
        
               | belter wrote:
               | I understand and appreciate your clarification. However
               | would it not be the case some synthetic data strategies,
               | if misapplied, can resemble the feedback loop scenario
               | and thus risk model collapse?
        
           | rhdunn wrote:
           | It depends on how you construct the synthetic data and how
           | the model is trained on that data.
           | 
            | For diffusion-based image generators, training only on
            | synthetic data over repeated rounds of model training can
            | cause model collapse, as errors in the output get amplified
            | in the trained model. It's usually the 2nd or 3rd model
            | created this way (with the output of the previous model used
            | as input for the next) that collapses.
           | 
            | It was found that using primary data alongside synthetic
           | data avoided the model collapse. Likewise, if you also have
           | some sort of human scoring/evaluation you can help avoid
           | artefacts.
        
           | simonw wrote:
           | This is why I don't think model collapse actually matters:
           | people have been deliberately training LLMs on synthetic data
           | for over a year at this point.
           | 
           | As far as I can tell model collapse happens when you
           | deliberately train LLMs on low quality LLM-generated data so
           | that you can write a paper about it.
        
           | nxobject wrote:
           | As someone who's a completely layman: I wonder if the results
           | of model collapse are no worse than, say, sufficiently
           | complex symbolic AI (modulo consistency and fidelity?)
        
           | ziofill wrote:
           | I may have misunderstood, but I think that it depends a lot
           | on the existence of a validation mechanism. Programming
           | languages have interpreters and compilers that can provide a
           | useful signal, while for images and natural language there
            | isn't such an automated mechanism, or at least it's not that
           | straightforward.
        
         | patrick0d wrote:
         | this vibe check is more insightful to me than the popular
         | evals. nice job!
        
         | vergessenmir wrote:
          | When working with GGUF, what chat templates do you use? Pretty
          | much every GGUF I've imported into Ollama has given me garbage
          | responses. Converting the tokenizer JSON has yielded mixed
          | results.
          | 
          | For example, how do you handle the phi-4 model's GGUF chat
          | template?
        
           | simonw wrote:
            | I use whatever chat template is baked into the GGUF file.
           | 
           | You can click on the little info icon on Hugging Face to see
           | that directly.
           | 
            | For https://huggingface.co/matteogeniaccio/phi-4/tree/main?show_...
            | that's this:
            | 
            |     {% for message in messages %}{% if (message['role'] == 'system') %}
            |     {{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}
            |     {% elif (message['role'] == 'user') %}
            |     {{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|><|im_start|>assistant<|im_sep|>'}}
            |     {% elif (message['role'] == 'assistant') %}
            |     {{message['content'] + '<|im_end|>'}}
            |     {% endif %}{% endfor %}
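            | 
            | If you'd rather check locally, the gguf Python package ships a
            | gguf-dump command that prints a GGUF file's metadata, including
            | the tokenizer.chat_template key (rough sketch, the exact flags
            | and output format may differ between versions):
            | 
            |     pip install gguf
            |     gguf-dump phi-4-Q4_K_M.gguf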
        
         | mhh__ wrote:
         | Along those lines (synthetic data) I would keep an eye on the
         | chinese labs given that they are probably quite data and
         | compute constrained, in English at least.
        
       | thot_experiment wrote:
       | For prompt adherence it still fails on tasks that Gemma2 27b
       | nails every time. I haven't been impressed with any of the Phi
       | family of models. The large context is very nice, though Gemma2
       | plays very well with self-extend.
        
         | jacoblambda wrote:
         | Yeah they mention this in the weaknesses section.
         | 
         | > While phi-4 demonstrates relatively strong performance in
         | answering questions and performing reasoning tasks, it is less
         | proficient at rigorously following detailed instructions,
         | particularly those involving specific formatting requirements.
        
           | thot_experiment wrote:
           | Ah good catch, I am forever cursed in my preference for snake
           | over camel.
        
         | impossiblefork wrote:
         | It's a much smaller model though.
         | 
         | I think the point is more the demonstration that such a small
         | model can have such good performance than any actual
         | usefulness.
        
           | magicalhippo wrote:
           | Gemma2 9B has significantly better prompt adherence than
           | Llama 3.1 8B in my experience.
           | 
           | I've just assumed it's down to how it was trained, but no
           | expert.
        
       | travisgriggs wrote:
       | Where have I been? What is a "small" language model? Wikipedia
       | just talks about LLMs. Is this a sort of spectrum? Are there
       | medium language models? Or is it a more nuanced classifier?
        
         | narag wrote:
         | 7B vs 70B parameters... I think. The small ones fit in the
         | memory of consumer grade cards. That's what I more or less know
         | (waiting for my new computer to arrive this week)
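          | 
          | (Back-of-envelope: at 4-bit quantization a parameter is roughly
          | half a byte, so a 7B model is ~3.5GB of weights and a 70B model
          | is ~35GB, plus overhead for context/KV cache -- which is why the
          | former fits on a consumer GPU and the latter generally doesn't.)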
        
           | agnishom wrote:
           | How many parameters did ChatGPT have in Dec 2022 when it
           | first broke into mainstream news?
        
             | simonw wrote:
              | I don't think that's ever been shared, but its predecessor
             | GPT-3 Da Vinci was 175B.
             | 
             | One of the most exciting trends of the past year has been
             | models getting dramatically smaller while maintaining
             | similar levels of capability.
        
             | reissbaker wrote:
             | GPT-3 had 175B, and the original ChatGPT was probably just
             | a GPT-3 finetune (although they called it gpt-3.5, so it
             | _could_ have been different). However, it was severely
             | undertrained. Llama-3.1-8B is better in most ways than the
             | original ChatGPT; a well-trained ~70B usually feels
             | GPT-4-level. The latest Llama release, llama-3.3-70b, goes
              | toe-to-toe even with much larger models (albeit it's bad at
              | coding, like all Llama models so far; it's not inherent to
             | the size, since Qwen is good, so I'm hoping the Llama 4
             | series is trained on more coding tokens).
        
               | swyx wrote:
               | > However, it was severely undertrained
               | 
               | by modern standards. at the time, it was trained
               | according to neural scaling laws oai believed to hold.
        
         | dboreham wrote:
         | There are all sizes of models from a few GB to hundreds of GB.
         | Small presumably means small enough to run on end-user
         | hardware.
        
         | hagen_dogs wrote:
         | I _think_ it came from this paper, TinyStories
         | (https://arxiv.org/abs/2305.07759). iirc this was also the
         | inspiration for the Phi family of models. The essential point
         | (of the TinyStories paper), "if we train a model on text meant
         | for 3-4 year olds, since that's much simpler shouldn't we need
         | fewer parameters?" Which is correct. In the original they have
          | a model that's 32 million parameters and they compare it to
          | GPT-2 (1.5 billion parameters), and the 32M model does much
          | better. Microsoft has been interested in this because "smaller
          | models == less resource usage", which means they can run on
          | consumer devices. You can easily run TinyStories from your
          | phone, which is presumably what Microsoft wants to do too.
        
         | tbrownaw wrote:
         | It's a marketing term for the idea that quality over quantity
         | in training data will lead to smaller models that work as well
         | as larger models.
        
       | jsight wrote:
       | I really like the ~3B param version of phi-3. It wasn't very
       | powerful and overused memory, but was surprisingly strong for
       | such a small model.
       | 
       | I'm not sure how I can be impressed by a 14B Phi-4. That isn't
       | really small any more, and I doubt it will be significantly
       | better than llama 3 or Mistral at this point. Maybe that will be
       | wrong, but I don't have high hopes.
        
       | excerionsforte wrote:
       | Looks like someone converted it for Ollama use already:
       | https://ollama.com/vanilj/Phi-4
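        | 
        | If you already have Ollama installed, it should be as simple as
        | (model tag taken from that page):
        | 
        |     ollama run vanilj/Phi-4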
        
         | accrual wrote:
          | I've had great success with quantized Phi-4 14B and Ollama so
         | far. It's as fast as Llama 3.1 8B but the results have been
         | (subjectively) higher quality. I copy/pasted some past requests
         | into Phi-4 and found the answers were generally better.
        
       | ai_biden wrote:
        | I'm not too excited by the Phi-4 benchmark results - it is
        | #BenchmarkInflation.
        | 
        | Microsoft Research just dropped Phi-4 14B, an open-source model
        | that's turning heads. It claims to rival Llama 3.3 70B with a
        | fraction of the parameters -- 5x fewer, to be exact.
        | 
        | What's the secret? Synthetic data -> higher quality, less
        | misinformation, more diversity.
        | 
        | But while the Phi models always have great benchmark scores, they
        | always disappoint me in real-world use cases.
        | 
        | The Phi series is famous for being trained on benchmarks.
        | 
        | I tried again with #phi4 through Ollama - but it's not
        | satisfactory.
        | 
        | To me, at the moment, IFEval is the most important LLM benchmark.
        | 
        | But look at the smart business strategy of Microsoft: have
        | unlimited access to GPT-4, prompt it to generate 30B tokens, train
        | a 1B parameter model, call it phi-1, show benchmarks beating
        | models 10x the size, never release the data, never detail how to
        | generate the data (this time they explained it at a very high
        | level), and claim victory over small models.
        
       | mupuff1234 wrote:
       | So we moved from "reasoning" to "complex reasoning".
       | 
       | I wonder what will be next month's buzzphrase.
        
         | TeMPOraL wrote:
         | > _So we moved from "reasoning" to "complex reasoning"._
         | 
         | Only from the perspective of those still complaining about the
         | use of the term "reasoning", who now find themselves left
         | behind as the world has moved on.
         | 
         | For everyone else, the phrasing change perfectly fits the
         | technological change.
        
           | HarHarVeryFunny wrote:
           | Reasoning basically means multi-step prediction, but to be
           | general the reasoner also needs to be able to:
           | 
           | 1) Realize when it's reached an impasse, then backtrack and
           | explore alternatives
           | 
           | 2) Recognize when no further progress towards the goal
           | appears possible, and switch from exploiting existing
           | knowledge to exploring/acquiring new knowledge to attempt to
           | proceed. An LLM has limited agency, but could for example ask
           | a question or do a web search.
           | 
           | In either case, prediction failure needs to be treated as a
           | learning signal so the same mistake isn't repeated, and when
           | new knowledge is acquired that needs to be remembered. In
           | both cases this learning would need to persist beyond the
           | current context in order to be something that the LLM can
           | build on in the future - e.g. to acquire a job skill that may
           | take a lot of experience/experimentation to master.
           | 
           | It doesn't matter what you call it (basic or advanced), but
           | it seems that current attempts at adding reasoning to LLMs
           | (e.g. GPT-o1) are based around 1), a search-like strategy,
           | and learning is in-context and ephemeral. General animal-like
           | reasoning needs to also support 2) - resolving impasses by
           | targeted new knowledge acquisition (and/or just curiosity-
           | driven experimentation), as well as continual learning.
        
         | criddell wrote:
         | If you graded humanity on their reasoning ability, I wonder
         | where these models would score?
         | 
         | I think once they get to about the 85th percentile, we could
         | upgrade the phrase to advanced reasoning. I'm roughly equating
         | it with the percentage of the US population with at least a
         | master's degree.
        
           | chairhairair wrote:
           | All current LLMs openly make simple mistakes that are
           | completely incompatible with true "reasoning" (in the sense
           | any human would have used that term years ago).
           | 
           | I feel like I'm taking crazy pills sometimes.
        
             | criddell wrote:
             | How do you assess how true one's reasoning is?
        
             | simonw wrote:
             | Genuine question: what does "reasoning" mean to you?
        
             | int_19h wrote:
             | If you showed the raw output of, say, QwQ-32 to any
             | engineer from 10 years ago, I suspect they would be
             | astonished to hear that this doesn't count as "true
             | reasoning".
        
       | zurfer wrote:
       | Model releases without comprehensive coverage of benchmarks make
       | me deeply skeptical.
       | 
        | The worst was the GPT-4o update in November: basically a two-liner
        | on what it is better at, and in reality it regressed on multiple
        | benchmarks.
        | 
        | Here we just get MMLU, which is widely known to be saturated, and
        | knowing they trained on synthetic data, we have no idea how much
        | "weight" was given to having MMLU-like training data.
        | 
        | Benchmarks are not perfect, but they give me context to build
        | upon.
        | 
        | ---
       | 
       | edit: the benchmarks are covered in the paper:
       | https://arxiv.org/pdf/2412.08905
        
       | PoignardAzur wrote:
       | Saying that a 14B model is "small" feels a little silly at this
        | point. I _guess_ it doesn't require a high-end graphics card?
        
       | liminal wrote:
       | Is 14B parameters still considered small?
        
       ___________________________________________________________________
       (page generated 2024-12-16 23:02 UTC)