[HN Gopher] Phi-4: Microsoft's Newest Small Language Model Speci...
___________________________________________________________________
Phi-4: Microsoft's Newest Small Language Model Specializing in
Complex Reasoning
Author : lappa
Score : 409 points
Date : 2024-12-13 01:54 UTC (3 days ago)
(HTM) web link (techcommunity.microsoft.com)
(TXT) w3m dump (techcommunity.microsoft.com)
| parmesean wrote:
| 13.8 epochs of the benchmarks?
| xeckr wrote:
| Looks like it punches way above its weight(s).
|
| How far are we from running a GPT-3/GPT-4 level LLM on regular
| consumer hardware, like a MacBook Pro?
| lappa wrote:
| It's easy to argue that Llama-3.3 8B performs better than
| GPT-3.5. Compare their benchmarks, and try the two side-by-
| side.
|
| Phi-4 is yet another _step_ towards a small, open, GPT-4 level
| model. I think we're getting quite close.
|
| Check the benchmarks comparing to GPT-4o on the first page of
| their technical report if you haven't already
| https://arxiv.org/pdf/2412.08905
| vulcanash999 wrote:
| Did you mean Llama-3.1 8B? Llama 3.3 currently only has a 70B
| model as far as I'm aware.
| anon373839 wrote:
| We're already past that point! MacBooks can easily run models
| exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 8B, or Gemma
| 2 9B. These models run at very comfortable speeds on Apple
| Silicon. And they are distinctly more capable and less prone to
| hallucination than GPT-3.5 was.
|
| Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to
| GPT-4, and they will run on MacBook Pros with at least 64GB of
| RAM. However, I have an M3 Max and I can't say that models of
| this size run at comfortable speeds. They're a bit sluggish.
| noman-land wrote:
| The coolness of local LLMs is THE only reason I am sadly
| eyeing upgrading from M1 64GB to M4/5 128+GB.
| Terretta wrote:
| Compare performance on various Macs here as it gets
| updated:
|
| https://github.com/ggerganov/llama.cpp/discussions/4167
|
| OMM, Llama 3.3 70B runs at ~7 text generation tokens per
| second on a MacBook Pro Max 128GB, while generating
| GPT-4-feeling text with more in-depth responses and fewer
| bullets. Llama 3.3 70B also doesn't fight the system
| prompt, it leans in.
|
| Consider e.g. LM Studio (0.3.5 or newer) for a Metal (MLX)
| centered UI, include MLX in your search term when
| downloading models.
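|
| If you'd rather skip the UI, a minimal command-line sketch
| using the mlx-lm package; the model repo name below is just
| an example (any 4-bit mlx-community build works the same
| way), so treat it as an assumption rather than a
| recommendation:
|
|     pip install mlx-lm
|     mlx_lm.generate \
|       --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
|       --prompt "Explain unified memory in one paragraph"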
|
| Also, do not scrimp on the storage. At 60GB - 100GB per
| model, it takes a day of experimentation to use 2.5TB of
| storage in your model cache. And remember to exclude that
| path from your TimeMachine backups.
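|
| For that last bit, tmutil ships with macOS; the path below is
| an assumption (LM Studio's default model directory has moved
| between versions), so point it at wherever your models
| actually live:
|
|     tmutil addexclusion ~/.cache/lm-studio/models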
| noman-land wrote:
| Thank you for all the tips! I'd probably go 128GB 8TB
| because of masochism. Curious what makes so many of the
| M4s show up in the red on that chart currently.
| vessenes wrote:
| It's all memory bandwidth related -- what's slow is
| loading these models into memory, basically. The last die
| from Apple with _all_ the channels was the M2 Ultra, and
| I bet that's what tops those leaderboards. M4 has not
| had a Max or an Ultra release yet; when it does (and it
| seems likely it will), those will be the ones to get.
| ant6n wrote:
| What if you have a MacBook Air with 16GB? (The benchmarks
| don't seem to show memory.)
| simonw wrote:
| You could definitely run an 8B model on that, and some of
| those are getting very capable now.
|
| The problem is that often you can't run anything else.
| I've had trouble running larger models in 64GB when I've
| had a bunch of Firefox and VS Code tabs open at the same
| time.
| xdavidliu wrote:
| I thought VSCode was supposed to be lightweight, though I
| suppose with extensions it can add up
| chris_st wrote:
| I have a M2 Air with 24GB, and have successfully run some
| 12B models such as mistral-nemo. Had other stuff going as
| well, but it's best to give it as much of the machine as
| possible.
| gcanyon wrote:
| I recently upgraded to exactly this machine for exactly
| this reason, but I haven't taken the leap and installed
| anything yet. What's your favorite model to run on it?
| evilduck wrote:
| 8B models with larger contexts, or even 9-14B parameter
| models quantized.
|
| Qwen2.5 Coder 14B at a 4-bit quantization could run, but
| you will need to be diligent about what else you have in
| memory at the same time.
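|
| Back-of-envelope for why 16GB is workable but tight: 14B
| weights at ~4.5 bits each is roughly 8GB before you add the
| KV cache and everything else the machine is doing. If you use
| Ollama, something like the following should pull a ~4-bit
| build (exact tag names on ollama.com may differ):
|
|     ollama run qwen2.5-coder:14b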
| stkdump wrote:
| I bought an old used desktop computer, a used 3090, and
| upgraded the power supply, all for around 900EUR. Didn't
| assemble it all yet. But it will be able to comfortably run
| 30B parameter models with 30-40 T/s. The M4 Max can do ~10
| T/s, which is not great once you really want to rely on it
| for your productivity.
|
| Yes, it is not "local" as I will have to use the internet
| when not at home. But it will also not drain the battery
| very quickly when using it, which I suspect would happen to
| a Macbook Pro running such models. Also 70B models are out
| of reach of my setup, but I think they are painfully slow
| on Mac hardware.
| jazzyjackson wrote:
| I'm returning my 96GB m2 max. It can run unquantized llama
| 3.3 70B but tokens per second is slow as molasses and still
| I couldn't find any use for it, just kept going back to
| perplexity when I actually needed to find an answer to
| something.
| alecco wrote:
| I'm waiting for next gen hardware. All the companies are
| aiming for AI acceleration.
| kleiba wrote:
| Sorry, I'm not up to date, but can you run GPTs locally or
| only vanilla LLMs?
| noodletheworld wrote:
| > MacBooks can easily run models exceeding GPT-3.5, such as
| Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B.
|
| gpt-3.5-turbo is generally considered to be about 20B params.
| An 8B model does not exceed it.
|
| > Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to
| GPT-4
|
| I'm skeptical; the llama 3.1 405B model is the only
| comparable model I've used, and it's _significantly_ larger
| than the 70B models you can run locally.
|
| The 405B model takes a bit of effort to run [1], and your
| average mac book doesn't ship with 128GB of ram, but
| _technically_ yes, if you get the max spec M4 Max + 128GB
| unified ram, you can run it.
|
| ...and it's similar, but (see again, AI leaderboards) not
| as good as what you can get from gpt-4o.
|
| > How far are we from running a GPT-3/GPT-4 level LLM on
| regular consumer hardware, like a MacBook Pro?
|
| Is a $8000 MBP regular consumer hardware? If you don't think
| so, then the answer is probably no.
|
| Lots of research into good smaller models is going on right
| now, but _right now_, they are not comparable to the larger
| models in terms of quality-of-output.
|
| [1] - https://medium.com/@aleksej.gudkov/how-to-run-
| llama-405b-a-c...
| runako wrote:
| > Is a $8000 MBP regular consumer hardware?
|
| May want to double-check your specs. 16" w/128GB & 2TB is
| $5,400.
| anon373839 wrote:
| > gpt-3.5-turbo is generally considered to be about 20B
| params. An 8B model does not exceed it.
|
| The industry has moved on from the old Chinchilla scaling
| regime, and with it the conviction that LLM capability is
| mainly dictated by parameter count. OpenAI didn't disclose
| how much pretraining they did for 3.5-Turbo, but GPT 3 was
| trained on 300 billion tokens of text data. In contrast,
| Llama 3.1 was trained on _15 trillion_ tokens of data.
|
| Objectively, Llama 3.1 8B and other small models have
| exceeded GPT-3.5-Turbo in benchmarks and human preference
| scores.
|
| > Is a $8000 MBP regular consumer hardware?
|
| As user `bloomingkales` notes down below, a $499 Mac Mini
| can run 8B parameter models. An $8,000 expenditure is not
| required.
| tosh wrote:
| A Mac with 16GB RAM can run qwen 7b, gemma 9b and similar
| models that are somewhere between GPT3.5 and GPT4.
|
| Quite impressive.
| jazzyjackson wrote:
| on what metric?
|
| Why would OpenAI bother serving GPT4 if customers would
| be just as happy with a tiny 9B model?
| tosh wrote:
| https://lmarena.ai/
|
| Check out the lmsys leaderboard. It has an overall
| ranking as well as ranking for specific categories.
|
| OpenAI are also serving gpt4o mini. That said afaiu it's
| not known how large/small mini is.
|
| Being more useful than GPT3.5 is not a high bar anymore.
| simonw wrote:
| Don't confuse GPT-4 and GPT-4o.
|
| GPT-4o is a much better experience than the smaller local
| models. You can see that in the lmarena benchmarks or
| from trying them out yourself.
| PhilippGille wrote:
| >> Llama 3.3 70B and Qwen 2.5 72B are certainly comparable
| to GPT-4
|
| > I'm skeptical; the llama 3.1 405B model is the only
| comparable model I've used, and it's significantly larger
| than the 70B models you can run locally.
|
| Every new Llama generation has managed to beat larger models
| from the previous generation with smaller ones.
|
| Check Kagi's LLM benchmark:
| https://help.kagi.com/kagi/ai/llm-benchmark.html
|
| Check the HN thread around the 3.3 70b release:
| https://news.ycombinator.com/item?id=42341388
|
| And their own benchmark results in their model card:
| https://github.com/meta-llama/llama-
| models/blob/main/models%...
|
| Groq's post about it: https://groq.com/a-new-scaling-
| paradigm-metas-llama-3-3-70b-...
|
| Etc
| int_19h wrote:
| They still do not beat GPT-4, however.
|
| And benchmarks are very misleading in this regard. We've
| seen no shortage of even 8B models claiming that they
| beat GPT-4 and Claude in benchmarks. Every time this
| happens, once you start actually using the model, it's
| clear that it's not actually on par.
| simonw wrote:
| GPT-4 from March 2023, not GPT-4o from May 2024.
| zozbot234 wrote:
| > Is a $8000 MBP regular consumer hardware? If you don't
| think so, then the answer is probably no.
|
| The very first Apple Macintosh was not far from that price
| at its release. Adjusted for inflation of course.
| kgeist wrote:
| >MacBooks can easily run models exceeding GPT-3.5, such as
| Llama 3.1 8B, Qwen 2.5 8B, or Gemma 2 9B.
|
| If only those models supported anything other than English
| simonw wrote:
| Llama 3.1 8B advertises itself as multilingual.
|
| All of the Qwen models are basically fluent in both English
| and Chinese.
| kgeist wrote:
| Llama 8B is multilingual on paper, but the quality is
| very bad compared to English. It generally understands
| grammar, and you can understand what it's trying to say,
| but the choice of words is very off most of the time,
| often complete gibberish. If you can imagine the output
| of an undertrained model, this is it. Meanwhile GPT3.5
| had far better output that you could use in production.
| barrell wrote:
| Cohere just announced Command R7B. I haven't tried it yet
| but their larger models are the best multilingual models
| I've used
| numpad0 wrote:
| Is the subtext to this uncensored Chinese support?
| simonw wrote:
| We're there. Llama 3.3 70B is GPT-4 level and runs on my 64GB
| MacBook Pro: https://simonwillison.net/2024/Dec/9/llama-33-70b/
|
| The Qwen2 models that run on my MacBook Pro are GPT-4 level
| too.
| BoorishBears wrote:
| Saying these models are at GPT-4 level is setting anyone who
| doesn't place special value on the local aspect up for
| disappointment.
|
| Some people do place value on running locally, and I'm not
| against them for it, but realistically no 70B-class model has
| the amount of general knowledge or understanding of nuance as
| any recent GPT-4 checkpoint.
|
| That being said, these models are still very strong compared
| to what we had a year ago and capable of useful work.
| simonw wrote:
| I said GPT-4, not GPT-4o. I'm talking about a model that
| feels equivalent to the GPT-4 we were using in March of
| 2023.
| int_19h wrote:
| I remember using GPT-4 when it first dropped to get a
| feeling of its capabilities, and no, I wouldn't say that
| llama-3.3-70b is comparable.
|
| At the end of the day, there's only so much you can cram
| into any given number of parameters, regardless of what
| any artificial benchmark says.
| simonw wrote:
| I envy your memory.
| n144q wrote:
| I wouldn't call 64GB MacBook Pro "regular consumer hardware".
| jsheard wrote:
| Yeah, a computer which starts at $3900 is really stretching
| that classification. Plus if you're that serious about
| local LLMs then you'd probably want the even bigger RAM
| option, which adds another $800...
| evilduck wrote:
| An optioned up minivan is also expensive but doesn't cost
| as much as a firetruck. It's expensive but still very
| much consumer hardware. A 3x4090 rig is more expensive
| and still consumer hardware. An H100 is not; you can buy
| like 7 of these optioned-up MBPs for the price of a single
| H100.
| michaelt wrote:
| In my experience, people use the term in two separate
| ways.
|
| If I'm running a software business selling software that
| runs on 'consumer hardware' the more people can run my
| software, the more people can pay me. For me, the term
| means the hardware used by a _typical-ish_ consumer. I'll
| check the Steam hardware survey, find the 75th-
| percentile gamer has 8 cores, 32GB RAM, 12GB VRAM - and
| I'd better make sure my software works on a machine like
| that.
|
| On the other hand, 'consumer hardware' could also be used
| to simply mean hardware available off-the-shelf from
| retailers who sell to consumers. By this definition,
| 128GB of RAM is 'consumer hardware' even if it only
| counts as 0.5% in Steam's hardware survey.
| evilduck wrote:
| On the Steam Hardware Survey the average gamer uses a
| computer with a 1080p display too. That doesn't somehow
| make any gaming laptop with a 2k screen sold in the last
| half decade a non-consumer product. For that matter the
| average gaming PC on Steam is even above average relative
| to the average computer. The typical office computer or
| school Chromebook is likely several generations older and
| doesn't have an NPU or discrete GPU at all.
|
| For AI and LLMs, I'm not aware of any company even
| selling the model assets directly to consumers; they're
| either completely unavailable (OpenAI) or freely licensed,
| so the companies training them aren't really dependent on
| what the average person has for commercial success.
| criddell wrote:
| In the early '80s, people were spending more than $3k for
| an IBM 5150. For that price you got 64 kB of RAM, a
| floppy drive, and a monochrome monitor.
|
| Today, lots of people spend far more than that for gaming
| PCs. An Alienware R16 (unquestionably a consumer PC) with
| 64 GB of RAM starts at $4700.
|
| It _is_ an expensive computer, but the best mainstream
| computers at any particular time have always cost between
| $2500 and $5000.
| russellbeattie wrote:
| I have to disagree. I understand it's very expensive, but
| it's still a consumer product available to anyone with a
| credit card.
|
| The comparison is between something you can buy off the
| shelf like a powerful Mac, vs something powered by a Grace
| Hopper CPU from Nvidia, which would require both lots of
| money and a business relationship.
|
| Honestly, people pay $4k for nice TVs, refrigerators and
| even couches, and those are not professional tools by any
| stretch. If LLMs needed a $50k Mac Pro with maxed out
| everything, that might be different. But anything that's a
| laptop is definitely regular consumer hardware.
| PhunkyPhil wrote:
| There have definitely been plenty of sources of hardware
| capable of running LLMs out there for a while, Mac or
| not. A couple of 4090s or P40s will run 3.1 70B. Or, since
| price isn't a limit, there are other easier & more
| powerful options like a [tinybox](https://tinygrad.org/#t
| inybox:~:text=won%27t%20be%20consider...).
| refulgentis wrote:
| We're there, Llama 3.1 8B beats Gemini Advanced for $20/month.
| Telosnex with llama 3.1 8b GGUF from bartowski.
| https://telosnex.com/compare/ (How!? tl;dr: I assume Google is
| sandbagging and hasn't updated the underlying Gemini)
| bloomingkales wrote:
| M4 Mac mini 16gb for $500. It's literally an inferencing block
| (small too, fits in my palm). I feel like the whole world needs
| one.
| alganet wrote:
| > inferencing block
|
| Did you mean _external gpu_?
|
| Choose any 12GB or more video card with GDDR6 or superior and
| you'll have at least double the performance of a base m4
| mini.
|
| The base model is almost an older generation. Thunderbolt 4
| instead of 5, slower bandwidths, slower SSDs.
| kgwgk wrote:
| > you'll have at least double the performance of a base m4
| mini
|
| For $500 all included?
| alganet wrote:
| The base mini is $599.
|
| Here's a config for around the same price. All brand new
| parts for 573. You can spend the difference improving any
| part you wish, or maybe get a used 3060 and go AM5
| instead (Ryzen 8400F). Both paths are upgradeable.
|
| https://pcpartpicker.com/list/ftK8rM
|
| Double the LLM performance. Half the desktop performance.
| But you can use both at the same time. Your computer will
| not slow down when running inference.
| bloomingkales wrote:
| That's a really nice build.
| alganet wrote:
| Another possible build is to use a mini-pc and M.2
| connections
|
| You'll need a mini-pc with two M.2 slots, like this:
|
| https://www.amazon.com/Beelink-SER7-7840HS-Computer-
| Display/...
|
| And a riser like this:
|
| https://www.amazon.com/CERRXIAN-Graphics-Left-PCI-
| Express-Ex...
|
| And some courage to open it and rig the stuff in.
|
| Then you can plug a GPU into it. It should have decent load
| times. Better than an eGPU, worse than the AM4 desktop
| build, fast enough to beat the M4 (once the data is in
| the GPU, it doesn't matter).
|
| It makes for a very portable setup. I haven't built it,
| but I think it's a reasonable LLM choice comparable to
| the M4 in speed and portability while still being
| upgradable.
|
| Edit: and you'll need an external power supply of at
| least 400W:)
| ActorNightly wrote:
| Why would you want to though? You already can get free access
| to large LLMs and nobody is doing anything groundbreaking with
| them.
| jckahn wrote:
| I only use local, open source LLMs because I don't trust
| cloud-based LLM hosts with my data. I also don't want to
| build a dependence on proprietary technology.
| simonw wrote:
| The most interesting thing about this is the way it was trained
| using synthetic data, which is described in quite a bit of detail
| in the technical report: https://arxiv.org/abs/2412.08905
|
| Microsoft haven't officially released the weights yet but there
| are unofficial GGUFs up on Hugging Face already. I tried this
| one: https://huggingface.co/matteogeniaccio/phi-4/tree/main
|
| I got it working with my LLM tool like this:
|
|     llm install llm-gguf
|     llm gguf download-model https://huggingface.co/matteogeniaccio/phi-4/resolve/main/phi-4-Q4_K_M.gguf
|     llm chat -m gguf/phi-4-Q4_K_M
|
| Here are some initial transcripts:
| https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...
|
| More of my notes on Phi-4 here:
| https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
| syntaxing wrote:
| Wow, those responses are better than I expected. Part of me was
| expecting terrible responses since Phi-3 was amazing on paper
| too but terrible in practice.
| refulgentis wrote:
| One of the funniest tech subplots in recent memory.
|
| TL;DR it was nigh-impossible to get it to emit the proper
| "end of message" token. (IMHO the chat training was too
| rushed). So all the local LLM apps tried silently hacking
| around it. The funny thing to me was _no one_ would say it
| out loud. The field isn't very consumer-friendly, yet.
| TeMPOraL wrote:
| Speaking of, I wonder if and how many of the existing
| frontends, interfaces and support packages that generalize
| over multiple LLMs, and include Anthropic, actually know
| how to prompt it correctly. Seems like most developers
| missed the memo on
| https://docs.anthropic.com/en/docs/build-with-
| claude/prompt-..., and I regularly end up in situations in
| which I wish they gave more minute control on how the
| request is assembled (proprietary), and/or am considering
| gutting the app/library myself (OSS; looking at you,
| Aider), just to have file uploads, or tools, or whatever
| other smarts the app/library does, encoded in a way that
| uses Claude to its full potential.
|
| I sometimes wonder how many other model or vendor-specific
| improvements there are, that are missed by third-party
| tools despite being well-documented by the vendors.
| refulgentis wrote:
| Hah, good call out: there was such a backlash and quick
| turnaround on Claude requiring XML tool calls, I think
| people just sort of forgot about it altogether.
|
| You might be interested in Telosnex, been working on it
| for ~a year and it's in good shape and is more or less
| designed for this sort of flexibility / allowing user
| input into requests. Pick any* provider, write up your
| own canned scripts, with incremental complexity: ex. your
| average user would just perceive it as "that AI app with
| the little picker for search vs. chat vs. art"
|
| * OpenAI, Claude, Mistral, Groq Llama 3.x, and one I'm
| forgetting....Google! And .gguf
| regularfry wrote:
| In a field like this the self-doubt of "surely it wouldn't
| be this broken, I must just be holding it wrong" is strong.
| lifeisgood99 wrote:
| The SVG created for the first prompt is valid but is a garbage
| image.
| simonw wrote:
| Yeah, it didn't do very well on that one. The best I've had
| from a local model there was from QwQ:
| https://simonwillison.net/2024/Nov/27/qwq/
| refulgentis wrote:
| For context, pelican riding a bicycle:
| https://imgur.com/a/2nhm0XM
|
| Copied SVG from gist into figma, added dark gray #444444
| background, exported as PNG 1x.
| bentcorner wrote:
| In general I've had poor results with LLMs generating
| pictures using text instructions (in my case I've tried to
| get them to generate pictures using plots in KQL). They work
| but the pictures are very very basic.
|
| I'd be interested in any LLM emitting any kind of text-to-
| picture instructions that gets results beyond
| kindergartner-cardboard-cutout levels of art.
| simonw wrote:
| That's why I use the SVG pelican riding a bicycle thing as
| a benchmark: it's a deliberately absurd and extremely
| difficult task.
| Teever wrote:
| I'm really glad that I see someone else doing something
| similar. I had the epiphany a while ago that if LLMs can
| interpret textual instructions to draw a picture and
| output the design in another textual format that this a
| strong indicator that they're more than just stochastic
| parrots.
|
| My personal test has been "A horse eating apples next to
| a tree" but the deliberate absurdity of your example is a
| much more useful test.
|
| Do you know if this is a recognized technique that people
| use to study LLMs?
| simonw wrote:
| I've seen people using "draw a unicorn using tikz"
| https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-
| exploratio...
| girvo wrote:
| That came, IIRC, from one of the OpenAI or Microsoft
| people (Sebastien Bubeck); it was recounted in the This
| American Life episode "Greetings, People of Earth"
|
| https://www.thisamericanlife.org/803/transcript
| krackers wrote:
| It's in this presentation
| https://www.youtube.com/watch?v=qbIk7-JPB2c
|
| The most significant part I took away is that when safety
| "alignment" was done the ability plummeted. So that
| really makes me wonder how much better these models would
| be if they weren't lobotomized to prevent them from
| saying bad words.
| int_19h wrote:
| I did some experiments of my own after this paper, but
| letting GPT-4 run wild, picking its own scene. It wanted
| to draw a boat on a lake, and I also asked it to throw in
| some JS animations, so it made the sun set:
|
| https://int19h.org/chatgpt/lakeside/index.html
|
| One interesting thing that I found out while doing this
| is that if you ask GPT-4 to produce SVG suitable for use
| in HTML, it will often just generate base64-encoded data:
| URIs directly. Which do contain valid SVG inside as
| requested.
| memhole wrote:
| Not sure if this counts. I recently went from a description
| of a screenshot of a graph to generating pandas code and a
| plot from that description. Conceptually it was accurate.
|
| I don't think it reflects any understanding. But to go
| from screenshot to conceptually accurate and working code
| was impressive.
| MyFirstSass wrote:
| But how will that prove that it's more than a stochastic
| parrot, honestly curious?
|
| Isn't it just like any kind of conversion or translation?
| I.e. a relationship mapping between different domains and
| just as much parroting "known" paths between parts of
| different domains?
|
| If "sun" is associated with "round", "up high",
| "yellow","heat" in english that will map to those things
| in SVG or in whatever bizarre format you throw at with
| relatively isomorphic paths existing there just knitted
| together as a different metamorphosis or cluster of
| nodes.
|
| On a tangent it's interesting what constitutes the
| heaviest nodes in the data, how shared is "yellow" or "up
| high" between different domains, and what is above and
| below them hierarchically weight-wise. Is there a
| heaviest "thing in the entire dataset"?
|
| If you dump a heatmap of a description of the sun and an
| SVG of a sun - of the neuron / axon like cloud of data in
| some model - would it look similar in some way?
| sabbaticaldev wrote:
| that's a huge stretch for parroting
| accrual wrote:
| Appreciate your rapid analysis of new models, Simon. Have
| any models you've tested performed well on the pelican
| SVG task?
| simonw wrote:
| gemini-exp-1206 is my new favorite:
| https://simonwillison.net/2024/Dec/6/gemini-exp-1206/
|
| Claude 3.5 Sonnet is in second place:
| https://github.com/simonw/pelican-bicycle?tab=readme-ov-
| file...
| accrual wrote:
| The Gemini result is quite impressive, thanks for sharing
| these!
| codedokode wrote:
| They probably trained it for this specific task
| (generating SVG images), right?
| simonw wrote:
| I'm hoping that nobody has deliberately trained on SVG
| images of pelicans riding bicycles yet.
| pizza wrote:
| I do with Claude:
| https://news.ycombinator.com/item?id=42351796#42355665
| chen_dev wrote:
| Amazon Nova models:
|
| https://gist.github.com/uschen/38fc65fa7e43f5765a584c6cd24e1.
| ..
| Havoc wrote:
| >Microsoft haven't officially released the weights
|
| Thought it was official, just not on Hugging Face but rather
| on whatever Azure competitor thing they're pushing?
| simonw wrote:
| I found their AI Foundry thing so hard to figure out I
| couldn't tell if they had released weights (as opposed to a
| way of running it via an API).
|
| There are GGUFs now, so someone must have released some
| weights somewhere.
| Havoc wrote:
| Yeah the weights were on there apparently.
|
| There's a planned week-long delay between release on their
| own platform and HF.
|
| But much like you I decided I can be patient / use the
| ggufs
| lhl wrote:
| The safetensors are in the phi-4 folder of the very repo
| you linked in your OP.
| fisherjeff wrote:
| Looks like someone's _finally_ caught up with The Hallmark
| Channel's LLM performance
| algo_trader wrote:
| > More of my notes on Phi-4 here:
| https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
|
| Nice. Thanks.
|
| Do you think sampling the stack traces of millions of machines
| is a good dataset for improving code performance? Maybe sample
| android/jvm bytecode.
|
| Maybe a sort of novelty sampling to avoid re-sampling hot-path?
| mirekrusin wrote:
| This "draw pelican riding on bicycle" is quite deep if you
| think about it.
|
| Phi is all about synthetic training, and a prompt -> svg ->
| render -> evaluate image -> feedback loop feels like an ideal
| fit for synthetic learning.
|
| You can push it quite far with stuff like basic 2D physics,
| e.g. plotting the scene after N seconds, or optics/rays,
| magnetic force, etc.
|
| SVG as LLM window to physical world.
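|
| A minimal sketch of one iteration of that loop, assuming the
| llm CLI and librsvg's rsvg-convert are installed and reusing
| the phi-4 GGUF alias from elsewhere in this thread; the
| "evaluate" step is left abstract:
|
|     # generate, then render to a raster image for scoring
|     llm -m gguf/phi-4-Q4_K_M \
|       "Draw a pelican riding a bicycle as a single SVG document" \
|       > pelican.svg
|     # in practice you'd strip any prose around the <svg> element
|     rsvg-convert pelican.svg -o pelican.png
|     # then score pelican.png (vision model, human rating, ...)
|     # and feed the score back as a training/selection signal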
| dartos wrote:
| > SVG as LLM window to physical world.
|
| What? let's try not to go full forehead into hype.
|
| SVGs would be an awfully poor analogy for the physical
| world...
| ben_w wrote:
| SVGs themselves are just an image format; but because of
| their vector nature, they could easily be mapped onto
| values from a simulation in a physics engine -- at least,
| in the game physics sense of the word, rods and springs
| etc., as a fluid simulation is clearly a better map to
| raster formats.
|
| If that physics engine were itself a good model for the
| real world, then you could do simulated evolution to get an
| end result that is at least as functional as a bike (though
| perhaps it wouldn't look like a traditional bike) even if
| the only values available to the LLM were the gross
| characteristics like overall dimensions and mass.
|
| But I'd say the chance of getting a pelican SVG out of a
| model like this is mostly related to lots of text
| describing the anatomy of pelicans, and it would not gain
| anything from synthetic data.
| dartos wrote:
| > but because of their vector nature, they could easily
| be mapped onto values from a simulation in a physics
| engine.
|
| I don't think the fact that the images are described with
| vectors magically makes it better for representing
| physics than any other image representation. Maybe less
| so, since there will be so much textual information not
| related to the physical properties of the object.
|
| What about them makes it easier to map to physics than an
| AABB?
|
| For soft body physics, I'm pretty sure a simpler sort of
| distance field representation would even be better. (I'm
| not as familiar with soft body as rigid body)
| ben_w wrote:
| For rendering them, more than for anything else. There's
| a convenient 1-to-1 mapping in both directions.
|
| You can of course just rasterise the vector for output,
| it's not like people view these things on oscilloscopes.
| tkellogg wrote:
| I added Phi-4 to my reasoning model collection because it seems
| to exhibit reasoning behavior: it stopped to consider
| alternatives before concluding. I assume this is related to
| their choice in training data:
|
| > Chain-of-Thought: Data should encourage systematic reasoning,
| teaching the model various approaches to the problems in a
| step-by-step manner.
|
| https://github.com/tkellogg/lrm-reasoning/blob/main/phi4.md
| belter wrote:
| > it was trained using synthetic data
|
| Is this not supposed to cause model collapse?
| fulafel wrote:
| No.
| belter wrote:
| Is this paper wrong? - https://arxiv.org/abs/2311.09807
| simonw wrote:
| It shows that if you deliberately train LLMs against
| their own output in a loop you get problems. That's not
| what synthetic data training does.
| belter wrote:
| I understand and appreciate your clarification. However,
| would it not be the case that some synthetic data strategies,
| if misapplied, can resemble the feedback loop scenario
| and thus risk model collapse?
| rhdunn wrote:
| It depends on how you construct the synthetic data and how
| the model is trained on that data.
|
| For diffusion-based image generators, training only on
| synthetic data over repeated rounds of model training can
| cause model collapse, as errors in the output amplify in the
| trained model. It usually takes until the 2nd or 3rd model
| created this way (with the output of the previous model used
| as input for the next) for it to collapse.
|
| It was found that using primary data alongside synthetic
| data avoided the model collapse. Likewise, if you also have
| some sort of human scoring/evaluation you can help avoid
| artefacts.
| simonw wrote:
| This is why I don't think model collapse actually matters:
| people have been deliberately training LLMs on synthetic data
| for over a year at this point.
|
| As far as I can tell model collapse happens when you
| deliberately train LLMs on low quality LLM-generated data so
| that you can write a paper about it.
| nxobject wrote:
| As someone who's a completely layman: I wonder if the results
| of model collapse are no worse than, say, sufficiently
| complex symbolic AI (modulo consistency and fidelity?)
| ziofill wrote:
| I may have misunderstood, but I think that it depends a lot
| on the existence of a validation mechanism. Programming
| languages have interpreters and compilers that can provide a
| useful signal, while for images and natural language there
| isn't such an automated mechanism, or at least it's not that
| straightforward.
| patrick0d wrote:
| This vibe check is more insightful to me than the popular
| evals. Nice job!
| vergessenmir wrote:
| When working with GGUF what chat templates do you use? Pretty
| much every GGUF I've imported into Ollama has given me garbage
| responses. Converting the tokenizer JSON has yielded mixed
| results.
|
| For example, how do you handle the phi-4 model's GGUF chat
| template?
| simonw wrote:
| I use whatever chat template is baked into the GGUF file.
|
| You can click on the little info icon on Hugging Face to see
| that directly.
|
| For https://huggingface.co/matteogeniaccio/phi-4/tree/main?show_...
| that's this:
|
|     {% for message in messages %}
|     {% if (message['role'] == 'system') %}
|     {{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}
|     {% elif (message['role'] == 'user') %}
|     {{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|><|im_start|>assistant<|im_sep|>'}}
|     {% elif (message['role'] == 'assistant') %}
|     {{message['content'] + '<|im_end|>'}}
|     {% endif %}{% endfor %}
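|
| If you're importing the GGUF into Ollama yourself, the rough
| equivalent is a Modelfile with that template rewritten in
| Ollama's Go-template syntax -- a sketch I haven't verified
| against this exact model, and the file name is assumed:
|
|     FROM ./phi-4-Q4_K_M.gguf
|     TEMPLATE """{{ if .System }}<|im_start|>system<|im_sep|>{{ .System }}<|im_end|>{{ end }}<|im_start|>user<|im_sep|>{{ .Prompt }}<|im_end|><|im_start|>assistant<|im_sep|>{{ .Response }}"""
|     PARAMETER stop "<|im_end|>"
|
| then:
|
|     ollama create phi-4 -f Modelfile
|     ollama run phi-4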
| mhh__ wrote:
| Along those lines (synthetic data) I would keep an eye on the
| Chinese labs, given that they are probably quite data- and
| compute-constrained, in English at least.
| thot_experiment wrote:
| For prompt adherence it still fails on tasks that Gemma2 27b
| nails every time. I haven't been impressed with any of the Phi
| family of models. The large context is very nice, though Gemma2
| plays very well with self-extend.
| jacoblambda wrote:
| Yeah they mention this in the weaknesses section.
|
| > While phi-4 demonstrates relatively strong performance in
| answering questions and performing reasoning tasks, it is less
| proficient at rigorously following detailed instructions,
| particularly those involving specific formatting requirements.
| thot_experiment wrote:
| Ah good catch, I am forever cursed in my preference for snake
| over camel.
| impossiblefork wrote:
| It's a much smaller model though.
|
| I think the point is more the demonstration that such a small
| model can have such good performance than any actual
| usefulness.
| magicalhippo wrote:
| Gemma2 9B has significantly better prompt adherence than
| Llama 3.1 8B in my experience.
|
| I've just assumed it's down to how it was trained, but I'm
| no expert.
| travisgriggs wrote:
| Where have I been? What is a "small" language model? Wikipedia
| just talks about LLMs. Is this a sort of spectrum? Are there
| medium language models? Or is it a more nuanced classifier?
| narag wrote:
| 7B vs 70B parameters... I think. The small ones fit in the
| memory of consumer-grade cards. That's what I more or less know
| (waiting for my new computer to arrive this week)
| agnishom wrote:
| How many parameters did ChatGPT have in Dec 2022 when it
| first broke into mainstream news?
| simonw wrote:
| I don't think that's ever been shared, but its predecessor
| GPT-3 Da Vinci was 175B.
|
| One of the most exciting trends of the past year has been
| models getting dramatically smaller while maintaining
| similar levels of capability.
| reissbaker wrote:
| GPT-3 had 175B, and the original ChatGPT was probably just
| a GPT-3 finetune (although they called it gpt-3.5, so it
| _could_ have been different). However, it was severely
| undertrained. Llama-3.1-8B is better in most ways than the
| original ChatGPT; a well-trained ~70B usually feels
| GPT-4-level. The latest Llama release, llama-3.3-70b, goes
| toe-to-toe even with much larger models (albeit it's bad at
| coding, like all Llama models so far; it's not inherent to
| the size, since Qwen is good, so I'm hoping the Llama 4
| series is trained on more coding tokens).
| swyx wrote:
| > However, it was severely undertrained
|
| By modern standards. At the time, it was trained according
| to the neural scaling laws OpenAI believed to hold.
| dboreham wrote:
| There are all sizes of models from a few GB to hundreds of GB.
| Small presumably means small enough to run on end-user
| hardware.
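|
| Rough rule of thumb for the "few GB": weight memory is about
| parameter count x bits per weight / 8, ignoring the KV cache
| and runtime overhead. A back-of-envelope check:
|
|     awk 'BEGIN { printf "7B@4bit: %.1f GB   70B@4bit: %.1f GB\n", 7e9*0.5/1e9, 70e9*0.5/1e9 }'
|     # 7B@4bit: 3.5 GB   70B@4bit: 35.0 GB
|
| which is why a quantized 7B fits on a mainstream GPU while a
| 70B needs a high-memory Mac or multiple cards.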
| hagen_dogs wrote:
| I _think_ it came from this paper, TinyStories
| (https://arxiv.org/abs/2305.07759). IIRC this was also the
| inspiration for the Phi family of models. The essential point
| of the TinyStories paper: "if we train a model on text meant
| for 3-4 year olds, since that's much simpler, shouldn't we need
| fewer parameters?" Which turned out to be correct. In the
| original they have a model that's 32 million parameters, they
| compare it to GPT-2 (1.5 billion parameters), and the 32M model
| does much better. Microsoft has been interested in this because
| "smaller models == less resource usage", which means they can
| run on consumer devices. You can easily run TinyStories from
| your phone, which is presumably what Microsoft wants to do too.
| tbrownaw wrote:
| It's a marketing term for the idea that quality over quantity
| in training data will lead to smaller models that work as well
| as larger models.
| jsight wrote:
| I really like the ~3B param version of phi-3. It wasn't very
| powerful and overused memory, but was surprisingly strong for
| such a small model.
|
| I'm not sure how I can be impressed by a 14B Phi-4. That isn't
| really small any more, and I doubt it will be significantly
| better than Llama 3 or Mistral at this point. Maybe I'll be
| wrong about that, but I don't have high hopes.
| excerionsforte wrote:
| Looks like someone converted it for Ollama use already:
| https://ollama.com/vanilj/Phi-4
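|
| (Assuming that upload follows the usual Ollama naming, it
| should be as simple as:
|
|     ollama run vanilj/Phi-4
|
| which pulls the model on first run.)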
| accrual wrote:
| I've had great success with quantized Phi-4 14B and Ollama so
| far. It's as fast as Llama 3.1 8B but the results have been
| (subjectively) higher quality. I copy/pasted some past requests
| into Phi-4 and found the answers were generally better.
| ai_biden wrote:
| I'm not too excited by the Phi-4 benchmark results - it is
| #BenchmarkInflation.
|
| Microsoft Research just dropped Phi-4 14B, an open-source model
| that's turning heads. It claims to rival Llama 3.3 70B with a
| fraction of the parameters -- 5x fewer, to be exact.
|
| What's the secret? Synthetic data -> higher quality, less
| misinformation, more diversity.
|
| The Phi models always have great benchmark scores, but they
| always disappoint me in real-world use cases.
|
| The Phi series is famous for being trained on benchmarks.
|
| I tried again with #phi4 through Ollama - but it's not
| satisfactory.
|
| To me, at the moment, IFEval is the most important LLM
| benchmark.
|
| But look at the smart business strategy of Microsoft:
|
| have unlimited access to GPT-4,
| prompt it to generate 30B tokens,
| train a 1B parameter model,
| call it phi-1,
| show benchmarks beating models 10x the size,
| never release the data,
| never detail how to generate the data (this time they gave a
| very high-level description),
| claim victory over small models.
| mupuff1234 wrote:
| So we moved from "reasoning" to "complex reasoning".
|
| I wonder what will be next month's buzzphrase.
| TeMPOraL wrote:
| > _So we moved from "reasoning" to "complex reasoning"._
|
| Only from the perspective of those still complaining about the
| use of the term "reasoning", who now find themselves left
| behind as the world has moved on.
|
| For everyone else, the phrasing change perfectly fits the
| technological change.
| HarHarVeryFunny wrote:
| Reasoning basically means multi-step prediction, but to be
| general the reasoner also needs to be able to:
|
| 1) Realize when it's reached an impasse, then backtrack and
| explore alternatives
|
| 2) Recognize when no further progress towards the goal
| appears possible, and switch from exploiting existing
| knowledge to exploring/acquiring new knowledge to attempt to
| proceed. An LLM has limited agency, but could for example ask
| a question or do a web search.
|
| In either case, prediction failure needs to be treated as a
| learning signal so the same mistake isn't repeated, and when
| new knowledge is acquired that needs to be remembered. In
| both cases this learning would need to persist beyond the
| current context in order to be something that the LLM can
| build on in the future - e.g. to acquire a job skill that may
| take a lot of experience/experimentation to master.
|
| It doesn't matter what you call it (basic or advanced), but
| it seems that current attempts at adding reasoning to LLMs
| (e.g. GPT-o1) are based around 1), a search-like strategy,
| and learning is in-context and ephemeral. General animal-like
| reasoning needs to also support 2) - resolving impasses by
| targeted new knowledge acquisition (and/or just curiosity-
| driven experimentation), as well as continual learning.
| criddell wrote:
| If you graded humanity on their reasoning ability, I wonder
| where these models would score?
|
| I think once they get to about the 85th percentile, we could
| upgrade the phrase to advanced reasoning. I'm roughly equating
| it with the percentage of the US population with at least a
| master's degree.
| chairhairair wrote:
| All current LLMs openly make simple mistakes that are
| completely incompatible with true "reasoning" (in the sense
| any human would have used that term years ago).
|
| I feel like I'm taking crazy pills sometimes.
| criddell wrote:
| How do you assess how true one's reasoning is?
| simonw wrote:
| Genuine question: what does "reasoning" mean to you?
| int_19h wrote:
| If you showed the raw output of, say, QwQ-32 to any
| engineer from 10 years ago, I suspect they would be
| astonished to hear that this doesn't count as "true
| reasoning".
| zurfer wrote:
| Model releases without comprehensive coverage of benchmarks make
| me deeply skeptical.
|
| The worst was the GPT-4o update in November: basically a
| two-liner on what it is better at, while in reality it
| regressed on multiple benchmarks.
|
| Here we just get MMLU, which is widely known to be saturated,
| and knowing they trained on synthetic data, we have no idea how
| much "weight" was given to having MMLU-like training data.
|
| Benchmarks are not perfect, but they give me context to build
| upon. ---
|
| edit: the benchmarks are covered in the paper:
| https://arxiv.org/pdf/2412.08905
| PoignardAzur wrote:
| Saying that a 14B model is "small" feels a little silly at this
| point. I _guess_ it doesn't require a high-end graphics card?
| liminal wrote:
| Is 14B parameters still considered small?
___________________________________________________________________
(page generated 2024-12-16 23:02 UTC)