[HN Gopher] Apple M3 Ultra
___________________________________________________________________
Apple M3 Ultra
Author : ksec
Score : 726 points
Date : 2025-03-05 13:59 UTC (9 hours ago)
(HTM) web link (www.apple.com)
(TXT) w3m dump (www.apple.com)
| nottorp wrote:
| > support for more than half a terabyte of unified memory
|
| Soldered?
| simlevesque wrote:
| As are all Apple M devices.
| universenz wrote:
| Is there a single Apple SoC where they've provided removable
| ram? Not that I can recall.
| danpalmer wrote:
| Is there even an existing replaceable memory standard that
| would meet the current needs of Apple's "Unified Memory"
| architecture? I'm not an expert but I'd suspect probably not.
| The bus probably looks a lot more like VRAM on GPUs, and I've
| never seen a GPU with replaceable RAM.
| jsheard wrote:
| CAMM2 could kinda work, but each module is only 128-bit so
| I think the furthest you could possibly push it is a
| 512-bit M Max equivalent with CAMM2 modules north, east,
| west and south of the SOC. There just isn't room to put
| eight modules right next to the SOC for a 1024-bit bus like
| the M Ultra.
| eigenspace wrote:
| Framework said that when they built a Strix Halo machine,
| AMD assigned an engineer to work with them on seeing if
| there's a way to get CAMM2 memory working with it, and
| after a bunch of back and forth it was decided that CAMM2
| still made the traces too long to maintain proper signal
| integrity due to the 256 bit interface.
|
| These machines have a 512 bit interface, so presumably
| even worse.
| jsheard wrote:
| Yeah, but AMD's memory controllers are really finicky.
| That might have been more of a Strix Halo issue than a
| CAMM2 issue.
| eigenspace wrote:
| Entirely possible. Obviously Apple wouldn't have been
| interested in letting you upgrade the RAM even if it was
| doable.
|
| I'd love to have more points of comparison available, but
| Strix Halo is the most analogous chip to an M-series chip
| on the market right now from a memory point of view, so
| it's hard to really know anything.
|
| I very much hope CAMM2 or something else can be made to
| work with a Strix-like setup in the future, but I have my
| doubts.
| zamadatix wrote:
| Current (individual, not counting dual socketed) AMD Epyc
| CPUs have 576 GB/s over a 768 bit bus using socketed
| DIMMs.
| eigenspace wrote:
| My understanding is that works out due to the lower clock
| speeds of those RAM modules though right?
|
| It's getting that bandwidth by going very wide on very
| very very many channels, rather than trying to push a
| gigantic amount of bandwidth through only a few channels.
| zamadatix wrote:
| Yeah, "channels" are just a roundabout way to say "wider
| bus" and you can't get too much past 128 GB/s of memory
| bandwidth without leaning heavily into a very wide bus
| (i.e. more than the "standard" 128 bit we're used to on
| consumer x86) regardless who's making the chip. Looking
| at it from the bus width perspective:
|
| - The AI Max+ 395 is a 256 bit bus ("4 channels") of 8000
| MHz instead of 128 bits ("2 channels") of 16000 MHz
| because you can't practically get past 9000 MHz in a
| consumer device, even if you solder the RAM, at the
| moment. Max capacity 128 GB.
|
| - 5th Gen Epyc is a 768 bit bus ("12 channels") of 6000
| MHz because that lets you use a standard socketed setup.
| Max capacity 6 TB.
|
| - M3 Ultra is a 1024 bit bus ("16 channels") of "~6266
| MHz" as it's 2x the M3 Max (which is 512 bits wide) and
| we know the final bandwidth is ~800 GB/s. Max capacity
| 512 GB.
|
| Note: "Channels" is in quotes because the number of bits
| per channel isn't actually the same per platform (and
| DDR5 is actually 2x32 bit channels per DIMM instead of
| 1x64 per DIMM like older DDR... this kind of shit is why
| just looking at the actual bit width is easier :p).
|
| So really the frequencies aren't that different even
| though these are completely different products across
| completely different segments. The overwhelming factor is
| bus width (channels) and the rest is more or less design
| choice noise from the perspective of raw performance.
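|
| Back-of-envelope for those three, in Python (the MT/s figures
| are the rough ones quoted above, so treat the outputs as
| approximations):
|
|   # peak bandwidth ~= bus width (bits) / 8 * transfer rate (MT/s)
|   platforms = {
|       "AI Max+ 395": (256, 8000),
|       "5th Gen Epyc (1 socket)": (768, 6000),
|       "M3 Ultra": (1024, 6266),
|   }
|   for name, (bits, mts) in platforms.items():
|       gb_per_s = bits / 8 * mts / 1000
|       print(f"{name}: ~{gb_per_s:.0f} GB/s")
|   # -> ~256 GB/s, ~576 GB/s, ~802 GB/s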
| hoseja wrote:
| It's really unfortunate that GPUs aren't fully customizable
| daughterboards, isn't it?
| nottorp wrote:
| I thought so too when they launched the M1, but I soon got
| corrected.
|
| The memory bus is the same as for modules, it's just very
| short. The higher end SoCs have more memory bandwidth
| because the bus is wider (i.e. more modules in parallel).
|
| You could blame DDR5 (who thought having a speed
| negotiation that can go over a minute at boot is a good
| idea?), but I blame the obsession with thin and the ability
| to overcharge your customers.
|
| > I've never seen a GPU with replaceable RAM
|
| I still have one :) It's an ISA Trident TVGA 8900 that I
| personally upgraded from 512k VRAM to one full megabyte!!!
| reaperducer wrote:
| _Soldered?_
|
| Figure out a way to make it unified without also soldering it,
| and you'll be a billionaire.
|
| Or are you just grinding a tired, 20-year-old axe?
| jonjojojon wrote:
| Like all Intel/AMD integrated graphics that use the system's
| RAM as VRAM?
| rsynnott wrote:
| _That_, in itself, wouldn't be that difficult, and there are
| shared-memory setups that do use modular memory. Where you'd
| really run into trouble is making it _fast_; this is very,
| very high bandwidth memory.
| ZekeSulastin wrote:
| Not even Framework has escaped from soldered RAM for this kind
| of thing.
| klausa wrote:
| It's not soldered, it's _on die_ with the SoC.
| georgeburdell wrote:
| Probably on package at best
| klausa wrote:
| Right, yes, sorry for imprecise language!
| riidom wrote:
| Thanks for clarifying
| eigenspace wrote:
| It is _not_ on die. It's soldered onto the package.
|
| There's a good reason it's soldered, i.e. the wide memory
| interface and huge bandwidth mean that the extra trace
| lengths needed for an upgradable RAM slot would screw up the
| memory timings too much, but there's no need to make false
| claims like saying it's on-die.
| sschueller wrote:
| > RAM slot would screw up the memory timings
|
| Existing ones possibly but why not build something that
| lets you snap-in a BGA package just like we snap in CPUs on
| full sized PC mainboards?
| eigenspace wrote:
| The longer traces are the problem. They want these
| modules as physically close as possible to the CPU to
| make the timings work out and maintain signal integrity.
|
| It's the same reason nobody sells GPUs that have user
| upgradable non-soldered GDDR VRAM modules.
| varispeed wrote:
| You know that memory can be "easily" de-soldered and soldered
| at home?
|
| The issue is availability of chips and most likely you have to
| know which components to change so the new memory is
| recognised. For instance that could be changing a resistor to
| a different value or bridging certain pads.
| A4ET8a8uTh0_v2 wrote:
| This viewpoint is interesting. It is not exactly inaccurate,
| but it does appear to be missing a point. Soldering in itself
| is a valuable and useful skill, but I can't say you can just
| get in and start de-soldering willy-nilly as opposed to
| opening a box and upgrading ram by plopping stuff in a
| designated spot.
|
| What if both are an issue?
| varispeed wrote:
| Do you know that "plopping stuff in a designated spot" can
| also be out of reach to some people? I know plenty who
| would give their computer to a tech to do the upgrade for
| them even if they are shown in person how to do all the
| steps. Soldering is just one step (albeit fairly big) above
| that. But the fact this can be done at home with fairly
| inexpensive tools, means tech person with reasonable skill
| could do it, so such upgrade could be accessible in
| computer/phone repair shop if parts were available to do
| so. What I am trying to say is that soldering is not a barrier.
| universenz wrote:
| 96gb on baseline model m3 ultra with a max of 512gb! Looks like
| they're leaning in hard with the AI crowd.
| datadrivenangel wrote:
| Unclear what devices this will be in outside of the mac studio.
| Also most of the comparisons were with M1 and M2 chips, not M4.
| reaperducer wrote:
| _most of the comparisons were with M1 and M2 chips, not M4._
|
| Is anyone other than a vanishingly small number of hardcore
| hobbyists going to upgrade from an M4 to an M4 Ultra?
| nordsieck wrote:
| > Is anyone other than a vanishingly small number of hardcore
| hobbyists going to upgrade from an M4 to an M4 Ultra?
|
| I expect that the 2 biggest buyers of M4 Ultra will be people
| who want to run LLMs locally, and people who want the highest
| performance machine they can get (professionals), but are
| wedded to mac-only software.
| bredren wrote:
| Anecdotal, and reasonable criticisms of the release aside,
| OpenAI's gpt-4.5 introduction video was done from a hard-
| to-miss Apple laptop.
|
| It is reasonable to say many folks in the field prefer to
| work on mac hardware.
| dlachausse wrote:
| It is a bit misleading to do that, but in fairness to Apple,
| almost nobody is upgrading to this from an M4 Mac, so those are
| probably more useful comparisons.
| mythz wrote:
| Ultra disappointing, they waited 2 years just to push out a
| single gen bump, even my last year's iPad Pro runs M4.
| heeton wrote:
| For AI workflows that's quite a lot cheaper than the
| alternative in GPUs.
| mythz wrote:
| Yeah VRAM option is good (if it performs well), just sad we'd
| have to drop 10K to access it tied to a prev gen M3 when
| they'll likely have M5 by the end of the year.
|
| Hard to drop that much cash on an outdated chip.
| TheTxT wrote:
| 512GB unified memory is absolutely wild for AI stuff! Compared to
| how many NVIDIA GPUs you would need, the pricing looks almost
| reasonable.
| InTheArena wrote:
| A server with 512GB of high-bandwidth, GPU-addressable RAM is
| probably a six-figure expenditure. If memory is your
| constraint, this is absolutely the machine for you.
|
| (sorry, should have specified that the NPU and GPU cores need
| to access that ram and have reasonable performance). I
| specified it above, but people didn't read that :-)
| Numerlor wrote:
| A basic brand new server can easily do 512GB. Not as fast as
| soldered memory, but it should be maybe mid to high 5 figures.
| la_oveja wrote:
| 5 figures? can be done in 6k
| https://x.com/carrigmat/status/1884244369907278106
| InTheArena wrote:
| That's CPU only memory, not high bandwidth, and not
| addressable by the GPU.
| jeffbee wrote:
| There isn't anything particularly high-bandwidth about
| Apple's DDR5 implementation, either. They just have a lot
| of channels, which is why I compared it to a 24-channel
| EPYC system. I agree that their integrated GPU
| architecture hits a unique design point that you don't
| get from nvidia, who prefer to ship smaller amounts of
| very different kinds of memory. Apple's architecture may
| be more suited to some workloads but it hasn't exactly
| grabbed the machine learning market.
| buildbot wrote:
| M3 Ultra has 819GB/s, and a single epyc cpu with 12
| channels has 460GB/s. As far as I know, llama.cpp and
| friends don't scale across multiple sockets so you can't
| use a dual socket Turin system to match the M3 Ultra.
|
| Also, 32GB DDR5 RDIMMs are ~$200, so that's ~$5K for 24
| right there. Then you need two CPUs at ~$1K each for the
| cheapest, plus a motherboard that's another $1K. So for ~$8K
| (more, given you need a case, power supply, and cooling!),
| you get a system with about half the memory bandwidth, much
| higher power consumption, and a much larger footprint.
| adrian_b wrote:
| Partial correction, an Epyc CPU with 12 channels has 576
| GB/s, i.e. DDR5-6000 x 768 bits. That is 70% of the Apple
| memory bandwidth, but with possibly much more memory (768
| GB in your example).
|
| You do not need 2 CPUs. If however you use 2 CPUs, then
| the memory bandwidth doubles, to 1152 GB/s, exceeding
| Apple by 40% in memory bandwidth. The cost of the memory
| would be about the same, by using 16 GB modules, but the
| MB would be more expensive and the second CPU would add
| to the price.
| buildbot wrote:
| Ah, I didn't realize they'd upped the memory bandwidth to
| DDR5-6000 (vs 4800), thanks for the correction!
|
| The memory bandwidth does not double, I believe. See this
| random issue for a graph that has single/dual socket
| measurements; there is essentially no difference:
| https://github.com/abetlen/llama-cpp-python/issues/1098
|
| Perhaps this is incorrect now, but I also know with 2x
| 4090s you don't get higher tokens per second than 1x 4090
| with llama.cpp, just more memory capacity.
|
| (All of this only applies to llama.cpp, I have no
| experience with other software and how memory bandwidth
| may scale across sockets)
| adrian_b wrote:
| The memory bandwidth does double, but in order to exploit
| it the program must be written and executed with care in
| the memory placement, taking into account NUMA, so that
| the cores should access mostly memory attached to the
| closest memory controller and not memory attached to the
| other socket.
|
| With a badly organized program, the performance can be
| limited not by the memory bandwidth, which is always
| exactly double for a dual-socket system, but by the
| transfers on the inter-socket links.
|
| Moreover, your link is about older Intel Xeon Sapphire
| Rapids CPUs, with inferior memory interfaces and with
| more quirks in memory optimization.
| buildbot wrote:
| Yes, I believe in theory a correctly written program
| could scale across sockets, depending on the problem at
| hand.
|
| But where is your data? For llama.cpp? For whatever dual
| socket CPU system you want. That's all I am claiming.
| adrian_b wrote:
| Googling for what you ask has found immediately this
| discussion:
|
| https://github.com/ggml-org/llama.cpp/discussions/11733
|
| about the scaling of llama.cpp and DeepSeek on some dual-
| socket AMD systems.
|
| While it was rather tricky, after many experiments they
| have obtained almost double the speed on two sockets,
| especially on AMD Turin.
|
| However, if you look at the actual benchmark data, that
| must be much lower than what is really possible, because
| their test AMD Turin system (named there P1) had only two
| thirds of the memory channels populated, i.e. performance
| limited by memory bandwidth could be increased by 50%,
| and they had 16-core CPUs, so performance limited by
| computation could be increased around 10 times.
| aurareturn wrote:
| CPUs do not have enough compute typically. You'll be
| compute bottlenecked before bandwidth if the model is
| large enough.
|
| Time to first token, context length, and tokens/s are
| significantly inferior on CPUs when dealing with larger
| models even if the bandwidth is the same.
| adrian_b wrote:
| One big server CPU can have a computational capability
| similar to a mid-range desktop NVIDIA GPU.
|
| When used for ML/AI applications, a consumer GPU has much
| better performance per dollar.
|
| Nevertheless, when it is desired to use much more memory
| than in a desktop GPU, a dual-socket server can have
| higher memory bandwidth than most desktop GPUs, i.e. more
| than an RTX 4090, and a computational capability that for
| FP32 could exceed an RTX 4080, but it would be slower for
| low-precision data where the NVIDIA tensor cores can be
| used.
| KeplerBoy wrote:
| addressable is a weird choice of words here.
|
| CUDA has had managed memory for a long time now. You
| absolutely can address the entire host memory from your
| GPU. It will fetch it, if it's needed. Not fast, but
| addressable.
| p_ing wrote:
| Windows has been doing this since what... the AGP era?
| Though this is a function of the ISA rather than the OS.
| Numerlor wrote:
| Ah, seems like I remembered the CPU price for a higher
| tier CPU which can cost 6k on its own.
|
| Thinking about it, you can get a decent 256GB on consumer
| platforms now too, but the speed will be a bit crap and
| you'd need to make sure the platform fully supports ECC
| UDIMMs.
| behnamoh wrote:
| except that you cannot run multiple language models on Apple
| Silicon in parallel
| kevin42 wrote:
| I'm curious why not. I am running a few different models on
| my mac studio. I'm using llama.cpp, and it performs
| amazingly fast for the $7k I spent.
| behnamoh wrote:
| I said in parallel.
| jeffbee wrote:
| That doesn't sound right. The marginal cost of +768GB of DDR5
| ECC memory in an EPYC system is < $5k.
| InTheArena wrote:
| GPU accessible RAM.
| numpad0 wrote:
| moot point if tok/s benchmark results are the same or
| worse.
| DrBenCarson wrote:
| Not moot if you care about producing those tokens with
| the largest available models
| kjreact wrote:
| Are the benchmarks worse? Running LLMs in system memory
| is rather painful. I am having a hard time finding
| benchmarks for running large models using system memory.
| Can you point me to some benchmarks you're referring to?
| adrian_b wrote:
| In a dual-socket EPYC system, the memory bandwidth is
| higher than in this Apple system by 40% (i.e. 1152 GB/s),
| and the memory capacity can be many times higher.
|
| Like another poster said, 768 GB of ECC RDIMM DDR5-6000
| costs around $5000.
|
| Any program whose performance is limited by memory
| bandwidth, as is frequently the case for
| inference, will run significantly faster in such an EPYC
| server than in the Apple system, even when running on the
| CPU.
|
| Even for computationally-limited programs, the difference
| between server CPUs and consumer GPUs is not great. One
| Epyc CPU may have about the same number of FP32 execution
| units as an RTX 4070, while running at a higher clock
| frequency (but it lacks the tensor units of an NVIDIA
| GPU, which can greatly accelerate the execution, where
| applicable).
| aurareturn wrote:
| > Any program whose performance is limited by memory
| bandwidth, as is frequently the case for
| inference, will run significantly faster in such an EPYC
| server than in the Apple system, even when running on the
| CPU.
|
| Source on this? CPUs would be very compute constrained.
| adrian_b wrote:
| According to Apple, the GPU of M3 Ultra has 80 graphics
| cores, which should mean 10240 FP32 execution units, the
| same as an NVIDIA RTX 4080 Super.
|
| However Apple does not say anything about the GPU clock
| frequency, which I assume is significantly lower than
| that of NVIDIA.
|
| In comparison, a dual-socket AMD Turin can have up to
| 12288 FP32 execution units, i.e. 20% more than an Apple
| GPU.
|
| Moreover, the clock frequency of the AMD CPU must be much
| higher than that of the Apple GPU, so the AMD system is
| likely to be at least twice as fast as the Apple M3 Ultra
| GPU for some graphics applications.
|
| I do not know what facilities exist in the Apple GPU for
| accelerating the computations with low-precision data
| types, like the tensor cores of NVIDIA GPUs.
|
| While for graphic applications big server CPUs are
| actually less compute constrained than almost all
| consumer GPUs (except RTX 4090/5090), the GPUs can be
| faster for ML/AI applications that use low-precision data
| types, but this is not at all certain for the Apple GPU.
|
| Even if the Apple GPU happens to be faster for some low-
| precision data type, the difference cannot be great.
|
| However a server that would beat the Apple M3 Ultra GPU
| computationally would cost much more than $10k, because
| it would need CPUs with many cores.
|
| If the goal is only to have a system with 50% more memory
| and 40% more memory bandwidth than the Apple system, that
| can be done at a $10k price.
|
| While such a system would become compute constrained more
| often than an Apple GPU, it would still beat it every
| time memory is the bottleneck.
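|
| For a rough sense of scale, peak FP32 is roughly ALU count x 2
| (FMA) x clock. The clocks below are assumptions on my part
| (Apple does not publish the GPU clock, and server all-core
| clocks vary under heavy vector load):
|
|   def tflops(alus, ghz):
|       # 2 FLOPs per ALU per cycle (fused multiply-add)
|       return alus * 2 * ghz / 1000
|
|   print(tflops(10240, 1.4))  # M3 Ultra GPU @ ~1.4 GHz guess: ~28.7
|   print(tflops(12288, 2.5))  # dual-socket Turin @ ~2.5 GHz: ~61.4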
| jeroenhd wrote:
| If you're going to overhaul your entire AI workflow to use a
| different API anyway, surely the AMD Instinct accelerator cards
| make more sense. They're expensive, but also a lot faster, and
| you don't need to deal with making your code work on macOS.
| codedokode wrote:
| I don't think API has any value because writing software is
| free and hardware for ML is super expensive.
| internetter wrote:
| > writing software is free
|
| says who? NVIDIA has essentially entrenched themselves
| thanks to CUDA
| knowitnone wrote:
| I'd like to hire you to write free software
| wmf wrote:
| Doesn't AMD Instinct cost >$50K for 512GB?
| chakintosh wrote:
| 14k for a maxed out Mac Studio
| varjag wrote:
| Call me a unit fundamentalist but calling 512Gb "over half a
| terabyte memory" irks me to no end.
| klausa wrote:
| It's over half a _tera_byte; exactly half of _tebi_byte if you
| wanna be a fundamentalist.
| varjag wrote:
| It is exactly the opposite. Every computer architecture in
| production addresses memory in the powers of two.
|
| SI has no business in memory size nomenclature as it is not
| derived from fundamental physical units. The whole klownbyte
| change was pushed through by hard drive marketers in the 1990s.
| esafak wrote:
| Do SSD companies do the same thing? We ought to go back to
| referring to storage capacity in powers of two.
| jl6 wrote:
| SSDs have added weirdness like 3-bit TLC cells and
| overprovisioning. Usable storage size of an SSD is
| typically not an exact power of 10 _or_ 2.
| umanwizard wrote:
| > Every computer architecture in production addresses
| memory in the powers of two.
|
| What does it mean to "address memory in powers of two" ?
| There are certainly machines with non-power-of-two memory
| quantities; 96 GiB is common for example.
|
| > The whole klownbyte change was pushed through by hard
| drive marketers in 1990s.
|
| The metric prefixes based on powers of 10 have been around
| since the 1790s.
| varjag wrote:
| > What does it mean to "address memory in powers of two"
| ? There are certainly machines with non-power-of-two
| memory quantities; 96 GiB is common for example.
|
| I challenge you to show me any SKU from any memory
| manufacturer that has a power of 10 capacity. Or a CPU
| whose address space is a power of 10. This is an
| unavoidable artefact of using a binary address bus.
|
| > The metric prefixes based on powers of 10 have been
| around since the 1790s.
|
| And Babylonians used power of 60, what gives?
| kstrauser wrote:
| *bibytes are a practical joke played on computer scientists
| by the salespeople to make it sound like we're drunk. "Tell
| us more about your mebibytes, Fred _elbows colleague,
| listen to this_ ".
|
| If Donald Knuth and Gordon Bell say we use base-2 for RAM,
| that's good enough for me.
| transcriptase wrote:
| Perhaps they're including the CPU cache and rounding down for
| brevity.
| kissiel wrote:
| You're nitpicking, but then you use lowercase b for a byte ;)
| okamiueru wrote:
| Don't know what the prior extreme apple is alluding to here. But,
| apple marketing is what it is.
| dlachausse wrote:
| Interesting that they're releasing M3 Ultra after the M4 Macs
| have already shipped.
|
| I wonder if the plan is to only release Ultras for odd number
| generations.
| _alex_ wrote:
| m2 ultra tho
| pier25 wrote:
| They released the M2 Ultra
| dlachausse wrote:
| Good point, I forgot about that. Maybe it just got really
| delayed in production.
| ryao wrote:
| Reportedly Apple is using its own silicon in data centers
| to run "Apple Intelligence" and other things like machine
| translation in safari. I suspect that the initial supply
| was sent to Apple's datacenters.
| jmull wrote:
| I'm guessing it's more because "Ultra" versions, which "fuse"
| multiple chips, take significant additional engineering work. So
| we might expect an ultra M4 next year, possibly after non-ultra
| M5s are released.
| iambateman wrote:
| People who know more than me: they're talking a lot about RAM and
| not much about GPU.
|
| Do you expect this will be able to handle AI workloads well?
|
| All I've heard for the past two years is how important a beefy
| GPU is. Curious if that holds true here too.
| lynndotpy wrote:
| VRAM is what takes a model from "can not run at all" to "can
| run" (even if slowly), hence the emphasis.
| dartos wrote:
| You can say the same about GPU clock speed as well...
| vlovich123 wrote:
| No, with limited VRAM you could offload the model partially
| or split it across CPU and GPU. And since the CPU has swap, you
| could run the absolute largest model. It's just really really
| slow.
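|
| e.g. with llama-cpp-python you just pick how many layers go to
| the GPU and the rest stays in system RAM (and spills to swap if
| even that doesn't fit). The model path and layer count here are
| placeholders:
|
|   from llama_cpp import Llama  # pip install llama-cpp-python
|
|   # hypothetical 70B Q4 gguf; offload 20 of ~80 layers to VRAM,
|   # run the remainder on the CPU from system RAM
|   llm = Llama(model_path="models/llama-70b-q4_k_m.gguf",
|               n_gpu_layers=20, n_ctx=4096)
|   out = llm("Why is unified memory useful?", max_tokens=64)
|   print(out["choices"][0]["text"])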
| jeffhuys wrote:
| Really, really, really, really, really, REALLY REALLY slow.
| Espressosaurus wrote:
| The difference between Deepseek-r1:70b (edit: actually 32b)
| running on an M4 Pro (48 GB unified RAM, 14 CPU cores, 20
| GPU cores) and on an AMD box (64 GB DDR4, 16 core 5950X,
| RTX 3080 with 10 GB of RAM) is more than a factor of 2.
|
| The M4 pro was able to answer the test prompt twice--once
| on battery and once on mains power--before the AMD box was
| able to finish processing.
|
| The M4's prompt parsing took significantly longer, but
| token generation was significantly faster.
|
| Having the memory to the cores that matter makes a big
| difference.
| vlovich123 wrote:
| You're adding detail that's not relevant to anything I
| said. I was saying this statement:
|
| > VRAM is what takes a model from "can not run at all" to
| "can run" (even if slowly), hence the emphasis.
|
| Is false. Regardless of how much VRAM you have, if the
| criteria is "can run even if slowly", all machines can
| run all models because you have swap. It's unusably slow
| but that's not what OP was claiming the difference is.
| Espressosaurus wrote:
| The criteria for purchase for anybody trying to use it is
| "run slowly but acceptably" vs. "run so slow as to be
| unusable".
|
| My memory is wrong, it was the 32b. I'm running the 70b
| against a similar prompt and the 5950X is probably going
| to take over an hour for what the M4 managed in about 7
| minutes.
|
| edit: an hour later and the 5950 isn't even done thinking
| yet. Token generation is generously around 1 token/s.
|
| edit edit: final statistics. M4 Pro managing 4 tokens/s
| prompt eval, 4.8 tokens/s token generation. 5950X
| managing 150 tokens/s prompt eval, and 1 token/s
| generation.
|
| Perceptually I can live with the M4's performance. It's a
| set prompt, do something else, come back sort of thing.
| The 5950/RTX3080's is too slow to be even remotely usable
| with the 70b parameter model.
| vlovich123 wrote:
| I don't disagree. I'm just taking OP at the literal
| statement they made.
| Retr0id wrote:
| When it comes to LLMs in particular, it comes down to memory
| size+bandwidth more than anything else.
| simlevesque wrote:
| What's more important isn't how beefy it is, it's how much
| memory it has.
|
| These are unified memory. The M3 Ultra with 512GB has as much
| VRAM as sixteen 5090s.
| qwertox wrote:
| A beefy GPU which can't hold models in VRAM is of very limited
| use. You'll see 16 GB of VRAM on gamer Nvidia cards, the RTX
| 5090 being an exception with 32 GB VRAM. The professional cards
| have around 96 GB of VRAM.
|
| The thing with these Apple chips is that they have unified
| memory, where CPU and GPU use the same memory chips, which
| means that you can load huge models into RAM (no longer VRAM,
| because that doesn't exist on those devices). And while Apple's
| integrated GPU isn't as powerful as an Nvidia GPU, it is
| powerful enough for non-professional workloads and has the huge
| benefit of access to lots of memory.
| _zoltan_ wrote:
| which professional card has 96GB of VRAM?
| qwertox wrote:
| Like the NVIDIA H100 NVL 94GB HBM3 PCIe 5.0 Data Center GPU
| for 27.651,20 EUR
|
| https://www.primeline-solutions.com/de/nvidia-h100-nvl-94gb-...
| _zoltan_ wrote:
| that's not a professional line card, but a data center
| card.
| gatienboquet wrote:
| LLMs are primarily "memory-bound" rather than "compute-bound"
| during normal use.
|
| The model weights (billions of parameters) must be loaded into
| memory before you can use them.
|
| Think of it like this: Even with a very fast chef (powerful
| CPU/GPU), if your kitchen counter (VRAM) is too small to lay
| out all the ingredients, cooking becomes inefficient or
| impossible.
|
| Processing power still matters for speed once everything fits
| in memory, but it's secondary to having enough VRAM in the
| first place.
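|
| The "does it fit" part is just parameter count times bytes per
| weight. Ignoring KV cache and runtime overhead, roughly:
|
|   def weights_gb(params_billion, bits_per_weight):
|       # 1B params at 8 bits/weight ~= 1 GB
|       return params_billion * bits_per_weight / 8
|
|   for p in (8, 70, 405, 671):
|       for bits in (16, 8, 4):
|           print(f"{p}B @ {bits}-bit: ~{weights_gb(p, bits):.0f} GB")
|   # e.g. 70B @ 4-bit ~35 GB, 405B @ 4-bit ~200 GB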
| whimsicalism wrote:
| Transformers are typically memory-_bandwidth_ bound during
| decoding. This chip is going to have a much worse memory b/w
| than the nvidia chips.
|
| My guess is that these chips could be compute-bound though
| given how little compute capacity they have.
| Gracana wrote:
| It's pretty close. A 3090 or 4090 has about 1TB/s of memory
| bandwidth, while the top Apple chips have a bit over
| 800GB/s. Where you'll see a big difference is in prompt
| processing. Without the compute power of a pile of GPUs,
| chewing through long prompts, code, documents etc is going
| to be slower.
| whimsicalism wrote:
| nobody in industry is using a 4090, they are using H100s
| which have 3TB/s. Apple also doesn't have any equivalent
| to nvlink.
|
| I agree that compute is likely to become the bottleneck
| for these new Apple chips, given they only have like
| ~0.1% the number of flops
| Gracana wrote:
| I chose the 3090/4090 because it seems to me that this
| machine could be a replacement for a workstation or a
| homelab rig at a similar price point, but not a $100-250k
| server in a datacenter. It's not really surprising or
| interesting that the datacenter GPUs are superior.
|
| FWIW I went the route of "bunch of GPUs in a desktop
| case" because I felt having the compute oomph was worth
| it.
| _zoltan_ wrote:
| 4.8TB/s on H200, 8TB/s on B200, pretty insane.
| Gracana wrote:
| That's wild, somehow I hadn't seen the B200 specs before
| now. I wish I could have even a fraction of that!
| gatienboquet wrote:
| VRAM capacity is the initial gatekeeper, then bandwidth
| becomes the limiting factor.
| whimsicalism wrote:
| i suspect that compute actually might be the limiter for
| these chips before b/w, but not certain
| cubefox wrote:
| > Transformers are typically memory-bandwidth bound during
| decoding.
|
| Not in case of language models, which are typically bound
| by memory size rather than bandwidth.
| whimsicalism wrote:
| nope
| cubefox wrote:
| I assume even this one won't run on an RTX 5090 due to
| constrained memory size:
| https://news.ycombinator.com/item?id=43270843
| whimsicalism wrote:
| sure on consumer GPUs but that is not what is
| constraining the model inference in most actual industry
| setups. technically even then, you are CPU-GPU memory
| bandwidth bound more than just GPU memory, although that
| is maybe splitting hairs
| matwood wrote:
| I was able to run and use the 24GB distilled DeepSeek on an M1
| Max with 64GB of RAM. It wasn't speedy, but it was usable. I
| imagine the M3/4s are much faster, especially on smaller, more
| specific models.
| chvid wrote:
| Now make a data center version.
| ksec wrote:
| The previous M2 Ultra had a max memory of 192GB, or 128GB for
| the Pro and some other M3 models, which I think is plenty for
| even 99.9% of professional tasks.
|
| They now bump it to _512GB_, along with an _insane_ price tag of
| $9499 for the 512GB Mac Studio. I am pretty sure this is some AI
| gold rush.
| dwighttk wrote:
| Maybe .1% of tasks need this much RAM, so why are they
| charging so much?
| regularfry wrote:
| The narrower the niche, the more you can charge.
| rewtraw wrote:
| because that's how much it's worth
| internetter wrote:
| It's not though. For consumer computers somewhere in the
| 1k-4k range there's nothing better. But for the price of
| 512GB of RAM you could buy that + a crazy CPU + 2x 5090s by
| building your own. The market fit is "needs power; needs/wants
| macOS; has no budget" which is so incredibly niche. But in
| terms of raw compute output there's absolutely no chance
| this is providing bang for buck.
| jeffhuys wrote:
| Do you understand that it's UNIFIED RAM, so it doubles as
| vRAM? I would love to know what computer you can build
| for <10k with 0.5TB of VRAM.
| kjreact wrote:
| 2x 5090s would only give you 64GB of memory to work with
| re:LLM workloads, which is what people are talking about
| in this thread. The 512GB of system RAM you're referring
| to would not be useful in this context. Apple's unified
| memory architecture is the part you're missing.
| DrBenCarson wrote:
| How much VRAM do you get on those 2x 5090s?
|
| How much would it cost to get up to 512gb?
| pier25 wrote:
| Because the minority that needs that much RAM can't work
| without it.
|
| In the media composing world they use huge orchestral
| templates with hundreds and hundreds of tracks with millions
| of samples loaded into memory.
| A4ET8a8uTh0_v2 wrote:
| I think the answer is because they can (there is a market
| for it). The benefit to a crazy person like me is that with
| this addition, I might be able to grab the 128GB version at a
| lower price.
| znpy wrote:
| because they know there will be a large number of people who
| don't need this much RAM but will buy it anyway.
| cjbgkagh wrote:
| I don't need 512GB of RAM but the moment I do I'm certain
| I'll have bigger things to worry about than a $10K price tag.
| almostgotcaught wrote:
| This is Pascal's wager written in terms of ... RAM. The
| original didn't make sense and neither does this iteration.
| cjbgkagh wrote:
| I would still wait until I need it before buying it...
| agloe_dreams wrote:
| Because the .1% is who will buy it? I mean, yeah, supply and
| demand. High demand in a niche with no supply currently means
| large margins.
|
| I don't think anyone commercially offers nearly this much
| unified memory or NPU/GPUs with anything near 512GB of
| memory.
| madeofpalk wrote:
| Maybe because .1% of tasks need this RAM, it attracts a .1%
| price tag
| Spooky23 wrote:
| With all things semiconductor, low volume = higher cost (and
| margin).
|
| The people who need the crazy resource can tie it to some
| need that costs more. You'd spend like $10k running a machine
| with similar capabilities in AWS in a month.
| Sharlin wrote:
| It enables the use of _giant_ AI models on a personal
| computer. Might not run too fast though. But at least it's
| possible _at all_.
| InTheArena wrote:
| Every single AI shop on the planet is trying to figure out if
| there is enough compute or not to make this a reasonable AI
| path. If the answer is yes, that 10k is an absolute bargain.
| 827a wrote:
| Is this actually true? Were people doing this with the 192gb
| of the M2 Ultra?
|
| I'm curious to learn how AI shops are actually doing model
| development if anyone has experience there. What I imagined
| was: It's all in the "cloud" (or, their own infra), and the
| local machine doesn't matter. If it did matter, the nvidia
| software stack is too important, especially given that a
| 512gb M3 Ultra config costs $10,000+.
| DrBenCarson wrote:
| You're largely correct for training models
|
| Where this hardware shines is inference (aka developing
| products on top of the models themselves)
| 827a wrote:
| True. But with Project Digits supposedly around the
| corner, which reportedly costs $3,000, supports
| ConnectX, and runs Blackwell, what's the over-under on
| just buying two of those at about half the price of one
| maxed M3 Ultra Mac Studio?
| DrBenCarson wrote:
| And how much VRAM will Project Digits have?
| internetter wrote:
| No AI shop is buying macs to use as a server. Apple should
| really release some server macOS distribution, maybe even
| rackable M-series chips. I believe they have one internally.
| jerjerjer wrote:
| Why would any business pay Apple Tax for a backend, server
| product?
| Spooky23 wrote:
| > that 10k is a absolute bargain
|
| The higher end NVidia workstation boxes won't run well on
| normal 20amp plugs. So you need to move them to a computer
| room (whoops, ripped those out already) or spend months
| getting dedicated circuits run to office spaces.
| magnetometer wrote:
| Didn't really think about this before, but that seems to be
| mainly an issue in Northern / Central America and Japan. In
| Germany, for example, typical household plugs are 16A at
| 230V.
| someothherguyy wrote:
| In the US, normal circuits aren't always 20A, especially in
| residential buildings, where they are more commonly 15A in
| bedrooms and offices.
|
| https://en.wikipedia.org/wiki/NEMA_connector
| hervature wrote:
| To clarify, the circuit is almost always 20A with 15A
| being used for lighting. However, the outlet itself is
| almost always 15A because you put multiple outlets on a
| single circuit. You are going to see very few 20A outlets
| (which have a T-shaped prong) in residential.
| theturtle32 wrote:
| While technically true, the NEMA 5-15R receptacles are
| rated for use on 20A circuits, and circuits for
| receptacles are almost always 20A circuits, in modern
| construction at least. Older builds may not be, of
| course.
|
| That said, if your load is going to be a continuous load
| drawing 80% of the rated amperage, it really should be a
| NEMA 5-20 plug and receptacle, the one where one of the
| prongs is horizontal instead of vertical. Swapping out
| the receptacle for one that accepts a NEMA 5-20P plug is
| like $5.
|
| If you are going to actually run such a load on a 20A
| circuit with multiple receptacles, you will want to make
| sure you're not plugging anything substantial into any of
| the other receptacles on that circuit. A couple LED
| lights are fine. A microwave or kettle, not so much.
| andrewmcwatters wrote:
| > and circuits for receptacles are almost always 20A
| circuits, in modern construction at least.
|
| This is not true. Standard builds (a majority) still use
| 15-amp circuits where 20-amp is not required by NEC.
| NorwegianDude wrote:
| Not much to figure out. It's 2x M4 Max, so you need 100 of
| these to match the TOPS of even a single consumer card like
| the RTX 5090.
| jeffhuys wrote:
| Sure, but if you have models like DeepSeek - 400GB - that
| won't fit on a consumer card.
| NorwegianDude wrote:
| True. But an AI shop doesn't care about that. They get
| more performance for the money by going for multiple
| Nvidia GPUs. I have 512 GB ram on my PC too with 8 memory
| channels, but it's not like it's usable for AI workloads.
| It's nice to have large amounts of RAM, but increasing
| the batch size during training isn't going to help when
| compute is the bottleneck.
| DrBenCarson wrote:
| Now do VRAM
| wpm wrote:
| It's 2x M3 Max
| alberth wrote:
| > It's 2x M4 Max
|
| Not exactly though.
|
| This can have 512GB unified memory, 2x M4 Max can only have
| 128GB total (64GB each).
| ZeroTalent wrote:
| No, because there is no CUDA. We have fast and cheap
| alternatives to NVIDIA, but they do not have CUDA. This is
| why NVIDIA has 90% margins on its hardware.
| HPsquared wrote:
| LLMs easily use a lot of RAM, and these systems are MUCH, MUCH
| cheaper (though slower) than a GPU setup with the equivalent
| RAM.
|
| A 4-bit quantization of Llama-3.1 405b, for example, should fit
| nicely.
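|
| Quick sanity check on that, using the ~819 GB/s figure quoted
| elsewhere in the thread and ignoring KV cache / overhead (for a
| dense model, every weight is read once per generated token, so
| this is an optimistic upper bound):
|
|   model_gb = 405 * 4 / 8        # 4-bit 405B weights, ~203 GB
|   bandwidth = 819               # GB/s, M3 Ultra peak
|   print(model_gb < 512)         # True: fits in unified memory
|   print(bandwidth / model_gb)   # ~4 tokens/s, best-case decode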
| segmondy wrote:
| The question will be how it will perform. I suspect Deepseek,
| Llama405B demonstrated the need for larger memory. Right now
| folks could build an epyc system with that much ram or more to
| run Deepseek at about 6 tokens/sec for a fraction of that cost.
| However not everyone is a tinkerer, so there's a market for this
| for those that don't want to be bothered. You say "AI Gold
| rush" like it's a bad thing, it's not.
| bloppe wrote:
| Big question is: Does the $10k price already reflect Trump's
| tariffs on China? Or will the price rise further still?
| desertmonad wrote:
| Time to upgrade m1 ultra I guess! M1 ultra has been pretty good
| with deepseek locally.
| _alex_ wrote:
| what flavor of deepseek are you running? what kind of
| performance are you seeing?
| InTheArena wrote:
| Whoa. M3 instead of M4. I wonder if this was basically binning,
| but I thought that I had read somewhere that the interposer that
| enabled this for the M1 chips was not available.
|
| That said, 512GB of unified RAM with access to the NPU is
| absolutely a game changer. My guess is that Apple developed this
| chip for their internal AI efforts, and are now at the point
| where they are releasing it publicly for others to use. They
| really need a 2U rack form for this though.
|
| This hardware is really being held back by the operating system
| at this point.
| klausa wrote:
| >I had read somewhere that the interposer that enabled this for
| the M1 chips was not available.
|
| With all my love and respect for "Apple rumors" writers; this
| was always "I read five blogposts about CPU design and now I'm
| an expert!" territory.
|
| The speculation was based on the M3 Max's die shots not having
| the interposer visible, which... implies basically nothing
| about whether that _could have_ been supported in an M3 Ultra
| configuration; as evidenced by the announcement today.
| sroussey wrote:
| I'm guessing it's not really an M3.
|
| No M3 has Thunderbolt 5.
|
| This is a new chip with M3 marketing. I'd expect this from
| Intel, not Apple.
| klausa wrote:
| Baseline M4 doesn't have Thunderbolt 5 either; only the
| Pro/Max variants do.
|
| The press-release even calls TB5 out: >Each Thunderbolt 5
| port is supported by its own custom-designed controller
| directly on the chip.
|
| Given that they're doing the same on A-series chips (A18
| Pro with 10Gbps USB-C; A18 with USB 2.0), I imagine it's
| just relatively simple to swap the I/O blocks around and
| they're doing this for cost and/or product segmentation
| reasons.
| sroussey wrote:
| Which means this is a whole new chip. It may be M3 based,
| but with added interposer support and new thunderbolt
| stuff.
|
| Which, at this point, why not just use M4 as a base?
| kridsdale1 wrote:
| Could be that M4 requires a different TSMC fab that is at
| full production doing iPhones.
| operatingthetan wrote:
| Or they are saving the M4 Ultra name for later on ...
| klausa wrote:
| >Which, at this point, why not just use M4 as a base?
|
| I imagine that making those chips is quite a bit more
| involved than just taking the files for M3 Max, and copy-
| pasting them twice into a new project.
|
| I imagine it just takes more time to
| design/verify/produce them; especially given they're not
| selling very many of them, so they're probably not super-
| high-priority projects.
| hinkley wrote:
| TB 5 seems like the sort of thing you could 'slap on' to a
| beefy enough chip.
|
| Or the sort of thing you put onto a successor when you had
| your fingers crossed that the spec and hardware would
| finalize in time for your product launch but the fucking
| committee went into paralysis again at the last moment and
| now your product has to ship 4 months before you can put TB
| 5 hardware on shelves. So you put your TB4 circuitry on a
| chip that has the bandwidth to handle TB5 and you wait for
| the sequel.
| kridsdale1 wrote:
| Sounds like you've seen some things.
| jagged-chisel wrote:
| > This hardware is really being held back by the operating
| system at this point.
|
| Please elucidate.
| diggan wrote:
| https://news.ycombinator.com/item?id=43243075 ("Apple's
| Software Quality Crisis" - 1134 comments)
|
| ^ has a lot of elaborations on this subject
| internetter wrote:
| This is more about "average" end user software, not the
| type of software that would be running on a machine like
| this. Yes their applications fell off, but if you're paying
| for 512GB of RAM, Apple Notes being slow isn't the
| bottleneck.
| diggan wrote:
| Lack of focus on quality of software affects all types of
| workloads, not just consumer-oriented or professional-
| oriented in isolation.
| gjsman-1000 wrote:
| Nah, if I ever wrote an article about the software crisis
| on the Linux desktop, there'd be flames here making
| Apple's issues look small.
| diggan wrote:
| It'd be an interesting flame war in the comments, if
| nothing else, go for it! I'm happy to give plenty of
| concrete evidence why Linux is more suitable for
| professionals than macOS is in 2025 :)
| hedora wrote:
| Try copy pasting bash snippets between any linux text
| editor and terminal.
|
| Now try the same with notes on a mac. Notes mangles the
| punctuation and zsh is not bash.
| internetter wrote:
| Omg I despise the fact that there's n competing GUI
| standards on linux, zero visual consistency.
|
| I love diversity in websites, and apps for that matter,
| but this isn't diversity, it is the uncanny valley
| between bespoke graphic design and homogeneity.
|
| Say what you want about SwiftUI, but it makes consistent,
| good looking apps. Unless something has changed, GTK is a
| usability disaster.
|
| And that's before I get into how much both X11 _and_
| wayland suck equally.
|
| There's so much I miss about Linux, but there's so much I
| don't
| WD-42 wrote:
| People are paying the richest company in the world for
| their software crisis on Linux.
| knowitnone wrote:
| if you do write something, please separate enterprise,
| developer, end user, embedded/RT because they all have
| different requirements.
| internetter wrote:
| > Lack of focus on quality of software affects all types
| of workloads, not just consumer-oriented or professional-
| oriented in isolation.
|
| The apps are developed by different teams. MacOS apps are
| containerized. Saying macOS's performance is hindered by
| Notes.app is like saying that Windows is hindered by
| Paint.exe. Notes.app is just a default[0]
|
| [0]: though, I dislike saying this because I always feel
| like I need to mention that even Notes links against a
| hilarious amount of private APIs that could easily be
| exposed to other developers but... aren't.
| InTheArena wrote:
| No native docker support, no headless management options
| (enterprise strength), limited QoS management, lack of robust
| python support (out of the box), interactive user focused
| security model.
| bredren wrote:
| >lack of robust python support (out of the box)
|
| What would robust python support oob look like?
| FergusArgyll wrote:
| uv pre-installed! /s
| flats wrote:
| I feel you on a lot of this! But out of the box Python
| support? Does anybody actually want that? It's pretty darn
| quick & straightforward to get a Python environment up &
| running on MacOS. Maybe I'm misunderstanding what you mean
| here.
| p_ing wrote:
| No one would want OOTB Python support. You'd be stuck on
| a version you didn't want to use.
| hedora wrote:
| I want it. That way, like code I write in any other
| language, it'll run reliably on other people's machines a
| few years from now.
|
| I avoid writing python, so I'm usually the "other people"
| in that sentence.
| fauigerzigerk wrote:
| _> it'll run reliably on other people's machines a few
| years from now_
|
| That's optimistic. What if the system Python gets
| upgraded? For some reason, Python libraries tend to be
| super picky about the Python versions they support (not
| just Python 2 vs 3).
| kstrauser wrote:
| 1. I run Docker and Podman on my Macs.
|
| 2. If you mean MDM, there are several good options. Screen
| sharing and SSH are built in.
|
| 3. In what sense?
|
| 4. `uv python install whatever` is infinitely better than
| upgrading on the OS vendor's schedule.
|
| 5. What does that affect?
| mschuster91 wrote:
| > 1. I run Docker and Podman on my Macs.
|
| That's using a Linux VM. The idea people are asking about
| is native process isolation. Yes you'd have to rebuild
| Docker containers based on some sort of (small) macOS
| base layer and Homebrew/Macports, but hey. Being able to
| even run nodejs or php with its thousands of files
| _natively_ would be a gamechanger in performance.
| hedora wrote:
| Also, if it were possible to containerize macOS, or even do
| an unattended VM installation, then it'd be possible for
| Apple to automatically regression test their stuff.
| devmor wrote:
| >I run Docker and Podman on my Macs.
|
| The same way Windows users run them. In a linux VM.
|
| You don't get real on-hardware containerization.
| naikrovek wrote:
| surprisingly, _Windows_ containers on Windows are not run
| in a VM. Well, not necessarily; they can be.
|
| It is definitely odd that Macs have no native container
| support, though, especially when you learn that _Windows_
| does.
| devmor wrote:
| That is an important point, I didn't really think of it
| since I've never had a reason to use Windows containers.
| naikrovek wrote:
| that's ok, no one thinks of windows, and fewer people
| than that would ever use a windows container.
| p_ing wrote:
| Well, Windows (in a form) is the hypervisor for the Azure
| infrastructure. Azure Web Sites when run as Windows/IIS
| are Windows containers. Makes sense.
|
| Honestly I don't know what XNU/Darwin is good for. It
| doesn't do anything especially well compared to *BSD,
| Linux, and NT.
| hedora wrote:
| Its async i/o APIs are best in class (i.e., compatible
| with BSD, and not Linux's epoll tire fire).
|
| Not disagreeing though.
| kstrauser wrote:
| Ah, I see what you're saying. Basically, Darwin doesn't
| support cgroups, so Docker runs Linux in a VM to get
| that.
| devmor wrote:
| I don't think it supports userland namespaces either,
| which is another important part of container isolation.
| pmarreck wrote:
| > lack of robust python support
|
| There is no such thing. Tell me, which combination of the
| 15+ virtual environments, dependency management and Python
| version managers would you use? And how would you prevent
| "project collision" (where one Python project bumps into
| another one and one just stops working)? Example: SSL
| library differences across projects is a notorious culprit.
|
| Python is garbage and I don't understand why people put up
| with this crap unless you seriously only run ONE SINGLE
| Python project at a time and do not care what else silently
| breaks. Having to run every Python app in its own Docker
| image (which is the only real solution to this, if you
| don't want to learn Nix, which you really should, because
| it is better thanks to determinism... but entails its own
| set of issues) is not a reasonable compromise.
|
| Was so glad when the Elixir guys came out with this
| recently, to at least be able to use Python, but in a very
| controlled, not-insane way:
| https://dashbit.co/blog/running-python-in-elixir-its-fine
| simonw wrote:
| uv
|
| (Not saying Apple should bundle that, but it's the best
| current answer to running many different Python projects
| without using something like Docker)
| mapcars wrote:
| > at least be able to use Python, but in a very
| controlled, not-insane way
|
| Thats funny, about 10 years ago I started my career in a
| startup that had Python business logic running under
| Erlang (via custom connector) which handled supervision
| and task distribution, and it looked insane for me at the
| time.
|
| Even today I think it can be useful but is very hard to
| maintain, and containers are a good enough way to handle
| python.
| kstrauser wrote:
| Virtualenv's been a thing for many years, it's built into
| Python, and it solves all that without adding additional
| tooling.
|
| And if you're genuinely asking, everything's converging
| toward uv. If you pick only one, use that and be done
| with it.
| hedora wrote:
| I've been using virtualenv for a decade, and we use uv at
| work.
|
| Neither fixed anything. They just make it slightly less
| painful to deal with python scripts' constant bitrot.
|
| They also make python uniquely difficult to dockerize.
| kstrauser wrote:
| That's so completely, diametrically opposite of my
| experience with both that I can't help but wonder how it
| ended up there.
|
| > They also make python uniquely difficult to dockerize.
| RUN pip install uv && uv sync
|
| Tada, done. No, seriously. That's the whole invocation.
| whimsicalism wrote:
| these are solved problems now, check back in. uv is now
| the standard
| tomn wrote:
| This is incoherent to me. Your complaints are about
| packaging, but the elixir wrapper doesn't deal with that
| in any way -- it just wraps UV, which you could use
| without elixir.
|
| What am I missing?
|
| Also, typically when people say things like
|
| > Tell me, which combination of the 15+ virtual
| environments, dependency management and Python version
| managers
|
| It means they have been trapped in a cycle of thinking
| "just one more tool will surely solve my problem",
| instead of realising that the tools _are_ the problem,
| and if you just use the official methods (virtualenv and
| pip from a stock python install), things mostly just
| work.
| kstrauser wrote:
| I agree. Python certainly had its speedbumps, but it's
| utterly manageable today and has been for years and
| years. It seems like people get hung up on there not
| being 1 official way to do things, but I think that's
| been great, too: the competition gave us nice things like
| Poetry and UV. The odds are slim that a Rust tool
| would've been accepted as the official Python.org-
| supplied system, but now we have it.
|
| There are reasons to want something more featureful than
| plain pip. Even without them, pip+virtualenv has been
| completely usable for, what, 15 years now?
| duped wrote:
| > No native docker support
|
| Honest question: why do you want this in MacOS? Do you
| understand what docker does? (it's fundamentally a linux
| technology, unless you are asking for user namespaces and
| chroot w/o SIP on MacOS, but that doesn't make sense since
| the app sandbox exists).
|
| MacOS doesn't have the fundamental ecosystem problems that
| beget the need for docker.
|
| If the answer is "I want to run docker containers because I
| have them" then use orbstack or run linux through the
| virtualization framework (not Docker desktop). It's
| remarkably fast.
| jeffhuys wrote:
| Docker Desktop now offers an option to use the
| virtualization framework, and works pretty well. But
| you're still constantly running a VM because "docker is
| how devs work now right?". I agree with your comment.
| raydev wrote:
| > MacOS doesn't have the fundamental ecosystem problems
| that beget the need for docker.
|
| Anyone wanting to run and manage their own suite of Macs
| to build multiple massive iOS and Mac apps at scale, for
| dozens or hundreds or thousands of developers deploying
| their changes.
|
| xcodebuild is by far the most obvious "needs native for
| max perf" but there are a few other tools that require
| macOS. But obviously if you have multiple repos and apps,
| you might require many different versions of the same
| tools to build everything.
|
| Sounds like a perfect use case for native containers.
| egorfine wrote:
| > why do you want this in MacOS?
|
| I have a small rackmounted rendering farm using mac
| minis, which outperform everything in the Intel world,
| even machines an order of magnitude more expensive.
|
| I've run macOS on my personal and development computers for
| over a decade and I use Linux since inception on server
| side.
|
| My experience: running server-side macOS is such a PITA
| it's not even funny. It may even pretend it has ssh while
| in fact the ssh server is only available on good days and
| only after Remote Desktop logged in at least once.
| Launchd makes you wanna crave systemd. etc, etc.
|
| So, about docker. I would absolutely love to run my app
| in a containerized environment on a Mac in order to not
| touch the main OS.
| mannyv wrote:
| Funny, I ran a bunch of Mac minis in colo for over a
| decade with no problems. Maybe you have a config problem?
|
| Of course, I had a LOM/KVM and redundant networking etc.
| They were substantially more reliable than the Dell
| equipment that I used in my day job for sure.
| duped wrote:
| What would a containerization environment on MacOS give
| you that you don't already have? Like concretely - what
| does containerization _mean_ in the context of a MacOS
| user space?
|
| In Linux, it means something very specific: a
| user/mount/pid/network namespace, overlayfs to provide a
| rootfs, chroot to pivot to the new root to do your work,
| and port forwarding between the host/guest systems.
|
| On MacOS I don't know what containerization means short
| of virtualization. But you have virtualization on MacOS
| already, so why not use that?
| e40 wrote:
| I torrent things from two different hosts on my gigabit
| network. The macos stack literally cannot handle the full
| bandwidth I have. It fails and the machine needs to be
| rebooted to fix it. It's not pretty on the way into this
| state, either. Other remote connections to the computer are
| unreliable. On Linux, running the same app in a docker
| container works perfectly. Transmission is the app.
| petecooper wrote:
| >Transmission is the app.
|
| Former Transmission user here.
|
| I realise you didn't ask, but you might find some
| improvements in qBittorrent.
| jihadjihad wrote:
| I haven't had any issue running BiglyBT on my M1 MacBook,
| granted I don't run it all day every day but everything
| runs plenty fast for my needs (25-30 MB/s for well-seeded
| torrents).
| jeffhuys wrote:
| I went to Transmission years and years ago because it's
| just simple. It has all the options if you need them, but
| no HUUUGE interface with RSS feeds, 10001 stats about
| your download, categories, tags, etc etc etc.
|
| Transmission is just a small, floating window with your
| downloads. Click for more. It fits in the macOS vibe. But
| I'm a person that fully adopted the original macOS "way
| of working" - kicked the full-screen habit I had in
| windows and never felt better.
|
| Can I ask, why would you go FROM Transmission to
| qBittorrent?
| petecooper wrote:
| >why would you go FROM Transmission to qBittorrent?
|
| In my case: some torrents wouldn't find known-good seeds
| in Transmission but worked fine in qBittorrent; there's
| reasonable (but not perfect) support for libtorrent 2.0
| in qBittorrent; my download speeds and overall
| responsiveness are anecdotally better in qBittorrent; and
| I make use of some of the nitty-gritty settings in
| qBittorrent.
| jeffhuys wrote:
| Well there's a list of good reasons! Thanks for
| answering. I haven't had any problems with finding seeds,
| and no need for libtorrent but now I know how to fix that
| when I do encounter those situations.
| e40 wrote:
| The Linux version, in a container no less, handles the
| entire gigabit bandwidth.
|
| And let's be clear, it wasn't the app that had problems,
| the Apple Remote Desktop connection to the machine failed
| when the speeds got above 40MB/s and the network
| interface stopped working around 80MB/s.
|
| I think Transmission works perfectly fine. I've been
| using it for 10+ years with no issues at all on Linux.
|
| I forgot to mention this is a Mac mini/Intel (2018).
| kstrauser wrote:
| I get nearly 10Gbps from my NAS to my Mac Studio. It
| absolutely can handle that bandwidth. It may not handle
| that specific client well for unrelated reasons.
| egorfine wrote:
| Bandwidth, yes. Connection count, no.
| drumttocs8 wrote:
| To expatiate with perspicuity:
|
| The Apple ecosystem is a walled garden.
| behnamoh wrote:
| > My guess is that Apple developed this chip for their internal
| AI efforts
|
| what internal AI efforts?
|
| Apple Intelligence is bonkers, and the Apple MLX framework
| remains a hobby project for Apple
| layer8 wrote:
| https://security.apple.com/blog/private-cloud-compute/
|
| https://security.apple.com/documentation/private-cloud-
| compu...
|
| https://techcrunch.com/2024/12/11/apple-reportedly-
| developin...
|
| https://techcrunch.com/2025/02/24/apple-commits-500b-to-
| us-m...
| InTheArena wrote:
| Apple stated that they were deploying their own hardware for
| next generation Siri. My thesis is that this is the hardware
| they developed.
|
| If so, this is hardly a hobby project.
|
| It may not be effective, but there is serious cash behind
| this.
| gatienboquet wrote:
| They use their own M chips for AI. They are far more advanced
| on AI than the majority of companies.
|
| They are using OpenAI for now, but in a couple of months they
| will own the full value chain.
| behnamoh wrote:
| we've heard that claim for the past three years, but every
| effort by them points to the opposite. don't get me wrong,
| I would love for Apple Intelligence to be smart enough on
| my iPhone and on my Mac, but honestly, the current version
| is a complete disappointment.
| DrBenCarson wrote:
| Apple are working on the hard problems of making AI
| useful (call them "agents"), not AGI
|
| 1. Small models running locally with well-established
| tool interfaces ("app intents")
|
| 2. Large models running in a bespoke cloud that can
| securely and quickly load all relevant tokens from a
| device before running inference
|
| No AI lab is even close to what Apple is trying to
| deliver in the next ~12 months
| Spooky23 wrote:
| They're taking a different and more difficult path of
| integrating AI with existing apps and workflows.
|
| It's their spin on the Google strategy of providing services
| to their enterprise GCP customers. I think we'll see more out
| of them long term.
| DrBenCarson wrote:
| Apple have been putting ML models running on their own
| silicon into production for far longer than any of their
| competitors. They publish some of the most innovative ML
| research
|
| They also own distribution to the wealthiest and most
| influential people in the world
|
| Don't get lost in recency bias
| Teever wrote:
| FTFA
|
| > Apple's custom-built UltraFusion packaging technology uses an
| embedded silicon interposer that connects two M3 Max dies
| across more than 10,000 signals, providing over 2.5TB/s of low-
| latency interprocessor bandwidth, and making M3 Ultra appear as
| a single chip to software.
| InTheArena wrote:
| I RTFA, RMFP
|
| The comment was that the press had reported that the
| interposer wasn't available. This obviously uses some form of
| interposer, so the question is if the press missed it, or
| Apple has something new.
| nsteel wrote:
| > uses an _embedded_ silicon interposer
|
| It sounds like they're using TSMC's new LSI (Local Si
| Interconnect) technology, which is their version of Intel's
| EMIB. It's essentially small islands of silicon, just
| around the inter-chip connections, embedded within the
| organic substrate. This gives the advantages of silicon
| interconnect, without the cost and size restrictions of a
| silicon interposer. It would not be visible from just
| looking at the package.
|
| https://www.anandtech.com/show/16031/tsmcs-version-of-
| emib-l...
|
| https://semianalysis.com/2022/01/06/advanced-packaging-
| part-...
| darthrupert wrote:
| Yeah, if only Apple at least semi-supported Linux, their
| computers would have no competition.
| dwedge wrote:
| I've been buying and using MBP for 6 or 7 years now, and just
| _assumed_ I could run Linux on one if I wanted to. I just
| spent a couple of days trying to get a 2018 MBP working with
| Linux and found out [edit to clarify] that my other ARM MBP
| basically won't work.
|
| I just want a break from MacOS, I'll be buying a Thinkpad and
| will probably never come back. This isn't my moaning, I
| understand it's their market, but if their hardware supported
| Linux (especially dual booting) or Docker native, I'd
| probably be buying Apple for the next decade and now I just
| won't be.
| creddit wrote:
| > trying to get a 2018 MBP working with Linux and found out
| ARM basically doesn't work.
|
| Since the M series of ARM processors didn't come out until
| 2020, that would make a lot of sense.
| dwedge wrote:
| Two separate laptops, I could have been clearer. I have
| an old 2018 I wanted to try it on, and my daily is M2
| that would have been next.
| dghlsakjg wrote:
| A 2018 MacBook would be an intel x86 chip. It's incredibly
| easy to get Linux running on that machine.
| dwedge wrote:
| Getting Linux running wasn't difficult. But Mint lost
| audio (everything else worked), the specialised Mint
| kernel lost both audio and wifi, and Arch lost both wifi
| and the onboard keyboard.
|
| I'm sure with tinkering I could eventually get it
| working, but I'm well past the point of wanting to tinker
| with hardware and drivers to get Linux working.
| goosedragons wrote:
| Because of the T2 chip it's actually pretty annoying.
| Mainline kernels I think are still missing keyboard and
| trackpad support for those models. Plus a host of other
| issues.
| sunshowers wrote:
| No, there's a bunch of MBP generations in the middle that
| just never got any Linux attention.
| LordIllidan wrote:
| 2018 MBP is Intel unless you're referring to the T2 chip?
| dwedge wrote:
| I could have written it clearer. I have both, Intel was
| the first attempt and when I was struggling to get it up
| without losing one of wifi, audio and onboard keyboard
| and read that ARM was worse I gave up. Even the best
| combination I had (no audio but everything else working)
| would kill bluetooth after a while if wifi was connected
| to 2.6. I don't like their hardware enough to fight with
| it.
| officeplant wrote:
| Loved my M1 mini, loved my M2 air. I've moved on to a 2024 HP
| Elitebook with an AMD R7 8840U, 1TB replaceable NVME, 32gb
| of socketed DDR5. 14in laptop with a serviceable enough
| 1920x1200 matte screen. $800 and a 3 hour drive to the
| nearest Microcenter. I gave Apple another try (refused
| apple from 2009-2020 because of the nvidia era issues) and
| I just can't stomach living off of piles of external drives
| anymore to make up for their lackluster storage space on
| the affordable units.
|
| The HP Elitebook was on Ubuntu's list of compatible tested
| laptops and came in hundreds of dollars less than a
| Thinkpad. Most of the comparably priced on sale T14's I
| could find were all crap Intel spec'd ones.
|
| Months in, I don't regret it at all, and Linux support has
| been fantastic even for a fairly new Ryzen chip and not
| the latest kernel. (I stick to LTS releases of most
| Distros) Shoving in 4TB of NVME storage and 96GB of DDR5
| should I feel the need to upgrade would still put me only
| around $1300 invested in this machine.
| dwedge wrote:
| I'm not really moaning about the cost or lack of
| upgradability. I mean, I don't like it but at least you
| know what you're getting into. I just always assumed
| Linux as a backup was an option, and more and more OSX is
| annoying me (last 2 or 3 days it keeps dropping bluetooth
| for 30 seconds) and more and more I just find the
| interface distracting. Plus whether it works with
| external displays over USB C is a crapshoot.
|
| I'll miss the battery life of the M1 chips, and I'm going
| to have to re-learn how to type (CTRL instead of ALT, fn
| rarely being on the left, I use fn+left instead of CTRL A
| in terminals) but otherwise, I think I'm done.
| brailsafe wrote:
| Surely you're using that thing as a laptop in a minority
| of cases though; it looks like you basically just bought
| specs. That's fine, but if that's all you want, then rather
| than giving a Mac a reasonable go against the alternatives,
| you were really exploring a fundamental difference in how
| you value technology products, which is quite a different
| battle.
| least wrote:
| I think the only laptops you won't find weird Linux issues
| with are from smaller manufacturers dedicated to shipping
| them, like the KDE laptop or System76. Every other hardware
| manufacturer, including those that ship laptops with Linux
| preinstalled, probably has weird hardware incompatibilities
| because they don't fully customize their SKUs with Linux
| support in mind.
|
| Not that I'm discouraging you from switching or anything.
| If Linux is what you want/need, there's definitely better
| laptops to be had than a Macbook for that purpose. It's
| just that weird incompatibilities and having to fight with
| the operating system on random issues is, at least in my
| experience, normal when using a linux laptop. Even my T480
| which has overall excellent compatibility isn't trouble-
| free.
| dwedge wrote:
| Something like the brightness buttons not working, or
| sleep being a little erratic is ok. No released wifi
| drivers, bluetooth issues, and audio and the keyboard not
| working are not ok. Apple going backwards in terms of
| supporting Linux is not something I'm ok with.
| least wrote:
| There are wifi drivers; you just have to install them
| separately because they use broadcom chips. It's a
| proprietary blob. The other things do work, but it
| requires special packages and you'll need an external
| keyboard while installing. It's a pain to install, for
| sure, but it's not insurmountably difficult to get it
| installed.
|
| Apple Silicon chips are arguably more compatible with
| Asahi Linux [1], but that's largely thanks to the hard
| work of Marcan, who has stepped down as the project's
| lead [2].
|
| Overall I still think the right choice is to find a
| laptop better suited for the purpose of running linux on
| it, just something that requires more careful
| consideration than people think. Framework laptops, which
| seem well suited since ideologically it meshes well with
| linux users, can be a pain to set up as well.
|
| [1] https://asahilinux.org/
|
| [2] https://marcan.st/2025/02/resigning-as-asahi-linux-
| project-l...
| dwedge wrote:
| I know there are wifi and keyboard drivers, because the
| live boots and installers work with them, but then when
| it comes to installing they're gone. I know it's not
| insurmountable, and 10 years ago I'd have done it, but I
| spent a few hours and got sick of it. I agree with you
| that it's probably better to get another laptop.
| carlosjobim wrote:
| No competition among the Linux userbase - which is a client
| segment that you want to avoid at all costs.
| kokada wrote:
| > This hardware is really being held back by the operating
| system at this point.
|
| Apple could either create 2U rack hardware and support Linux
| (and I mean Apple supporting it, not hobbyists), or ship a
| headless build of Darwin that could run on that hardware. But
| in the latter case, we probably wouldn't have much software
| available (though I am sure people would eventually start
| porting software to it; there are already MacPorts and
| Homebrew, and I am sure they could be adapted to eventually
| run on that platform).
|
| But Apple is also not interested in that market, so this will
| probably never happen.
| naikrovek wrote:
| > But Apple is also not interested in that market, so this
| will probably never happen.
|
| they're just a tiny company with shareholders who are really
| tired of never earning back their investments. give 'em a
| break. I mean they're still so small that they must protect
| themselves by requiring that macs be used for publishing
| iPhone and iPad applications.
| hnaccount_rng wrote:
| Not to get in the way of good snark or anything. But..
| Apple isn't _requiring_ that everyone uses MacOS on their
| systems. But you have to bring your own engineering effort
| to actually make another OS run. And so far Asahi is the
| only effort that I'm aware of (there were alternatives in
| the very beginning, but they didn't even get to M2 right?)
| thesuitonym wrote:
| > But you have to bring your own engineering effort to
| actually make another OS run.
|
| I mean, that's usually how it works though. When IBM
| launched the PS/2, they didn't support anything other
| than PC-DOS and OS/2, Microsoft had to make MS-DOS work
| for it (I mean... they _did_ get support from IBM, but
| not really), the 386BSD and Linux communities brought the
| engineering effort without IBM's involvement.
|
| When Apple was making Motorola Macs, they may have given
| Be a little help, but didn't support any other OSes that
| appeared. Same with PowerPC.
|
| All of the support for alternative OSes has always come
| from the community, whether that's volunteers or a
| commercial interest with cash to burn. Why should that
| change for Apple silicon?
| jorams wrote:
| Note that they said (emphasis mine):
|
| > they're still so small that they must protect
| themselves by requiring that _macs be used for publishing
| iPhone and iPad applications._
|
| They're not talking about Apple's silicon as a target,
| but as a development platform.
| ewzimm wrote:
| There has to be someone at Apple with a contact at IBM that
| could make Fedora Apple Remix happen. It may not be on-brand,
| but this is a prime opportunity to make the competition look
| worse. File it under Community projects at
| https://opensource.apple.com/projects
| asadm wrote:
| https://www.globalnerdy.com/wordpress/wp-
| content/uploads/200...
| alwillis wrote:
| I wouldn't be so sure about that.
|
| https://news.ycombinator.com/item?id=43271486
| AlchemistCamp wrote:
| Keep in mind the minimum configuration that has 512GB of
| unified RAM is $9,499.
| 42lux wrote:
| Still cheap if the only thing you look for is vram.
| nsteel wrote:
| And how is it only £9,699.00!! Does that dollar price
| include sales tax or are Brits finally getting a bargain?
| vr46 wrote:
| The US prices never include state sales tax IIRC. Maybe
| we're finally getting some parity.
| seanmcdirmid wrote:
| You could always buy one at an apple store without sales
| tax (e.g. Portland Oregon). But they might not have that
| one in stock...
| mastax wrote:
| Tariffs perhaps?
| kgwgk wrote:
| What's the bargain?
|
| There is also "parity" in other products like a MacBook Pro
| from £1,599 / $1,599 or an iPhone 16 from £799 / $799.
| £9,699 / $9,499 is worse than that!
| DrBenCarson wrote:
| Cheap relative to the alternatives
| stego-tech wrote:
| I cannot express how dirt cheap that pricepoint is for what's
| on offer, especially when you're comparing it to rackmount
| servers. By the time you've shoehorned in an nVidia GPU and
| all that RAM, you're easily looking at 5x that MSRP; sure,
| you get proper redundancy and extendable storage for that
| added cost, but now you also need redundant UPSes and have
| local storage to manage instead of centralized SANs or NASes.
|
| For SMBs or Edge deployments where redundancy isn't as
| critical or budgets aren't as large, this is an _incredibly_
| compelling offering... _if_ Apple actually had a competent
| server OS to layer on top of that hardware, which it does
| not.
|
| If they did, though...whew, I'd be quaking in my boots if I
| were the usual Enterprise hardware vendors. That's a _damn
| frightening_ piece of competition.
| cubefox wrote:
| I assume there is a very good reason why AMD and Intel
| aren't releasing a similar product.
| stego-tech wrote:
| From my outsider perspective, it's pretty straightforward
| why they don't.
|
| In Intel's case, there's ample coverage of the company's
| lack of direction and complacency on existing hardware,
| even as their competitors ate away at their moat, year
| after year. AMD with their EPYC chips taking datacenter
| share, Apple moving to in-house silicon for their entire
| product line, Qualcomm and Microsoft partnering with
| ongoing exploration of ARM solutions. A lack of
| competency in leadership over that time period has
| annihilated their lead in an industry they used to
| single-handedly _dictate_ , and it's unlikely they'll
| recover that anytime soon. So in a sense, Intel _cannot_
| make a similar product, in a timely manner, that competes
| in this segment.
|
| As for AMD, it's a bit more complicated. They're seeing
| pleasant success in their CPU lineup, and have all but
| thrown in the towel on higher-end GPUs. The industry has
| broadly rallied around CUDA instead of OpenCL or other
| alternatives, especially in the datacenter, and AMD
| realizes it's a fool's errand to try and compete directly
| there when it's a monopoly in practice. Instead of
| squandering capital to compete, they can just continue
| succeeding and working on their own moat in the areas
| they specialize in - mid-range GPUs for work and gaming,
| CPUs targeting consumers and datacenters, and APUs
| finding their way into game consoles, handhelds, and
| other consumer devices or Edge compute systems.
|
| And that's just getting into the specifics of those two
| companies. The reality is that any vendor who hasn't
| already unveiled their own chips or accelerators is
| coming in at what's perceived to be the "top" of the
| bubble or market. They'd lack the capital or moat to
| really build themselves up as a proper competitor, and
| are more likely to just be acquired in the current
| regulatory environment (or lack thereof) for a quick
| payout to shareholders. There's a reason why the
| persistent rumor of Qualcomm purchasing part or whole of
| Intel just won't die: the x86 market is rather stagnant,
| churning out mediocre improvements YoY at growing
| pricepoints, while ARM and RISC chips continue to
| innovate on modern manufacturing processes and chip
| designs. The growth is _not_ in x86, but a juggernaut
| like Qualcomm would be an ideal buyer for a "dying" or
| "completed" business like Intel's, where the only thing
| left to do is constantly iterate for diminishing returns.
| kridsdale1 wrote:
| Well said.
| AlchemistCamp wrote:
| It's not quite an apples to apples comparison, no pun
| intended. I guess we'll see how it sells.
| kllrnohj wrote:
| > By the time you've shoehorned in an nVidia GPU and all
| that RAM, you're easily looking at 5x that MSRP
|
| That nvidia GPU setup will actually have the compute grunt
| to make use of the RAM, though, which this M3 Ultra
| probably realistically doesn't. After all, if the only
| thing that mattered was RAM then the 2TB you can shove into
| an Epyc or Xeon would already be dominating the AI
| industry. But they aren't, because it isn't. It certainly
| hits at a unique combination of things, but whether or not
| that's maximally useful for the money is a completely
| different story.
| rbanffy wrote:
| Had the M3 GPU been much wider, it would be constrained
| by the memory bandwidth. It might still have an advantage
| over Nvidia competitors in that it has 512GB accessible
| to it and will need to push less memory across socket
| boundaries.
|
| It all depends on the workload you want to run.
| stego-tech wrote:
| You're forgetting what Apple's been baking into their
| silicon for (nearly? over?) a decade: the Neural
| Processing Unit (NPU), now called the "Neural Engine".
| That's their secret sauce that makes their kit more
| competitive for endpoint and edge inference than standard
| x86 CPUs. It's why I can get similarly satisfying
| performance on my old M1 Pro Macbook Pro with a scant
| 16GB of memory as I can on my 10900k w/ 64GB RAM and an
| RTX 3090 under the hood. Just to put these two into
| context, I ran the latest version of LM Studio with the
| deepseek-r1-distill-llama-8b model @ Q8_0, both with the
| exact same prompt and maximally offloaded onto hardware
| acceleration and memory, with a context window that was
| entirely empty:
|
|     Write me an AWS CloudFormation file that does the following:
|     * Deploys an Amazon Kubernetes Cluster
|     * Deploys Busybox in the namespace "Test1", including
|       creating that Namespace
|     * Deploys a second Busybox in the namespace "Test3",
|       including creating that Namespace
|     * Creates a PVC for 60GB of storage
|
| The M1Pro laptop with 16GB of Unified Memory:
|     * 21.28 seconds for "Thinking"
|     * 0.22s to the first token
|     * 18.65 tokens/second over 1484 tokens in its responses
|     * 1m:23s from sending the input to completion of the output
|
| The 10900k CPU, with 64GB of RAM and a full-fat RTX 3090 GPU
| in it:
|     * 10.88 seconds for "thinking"
|     * 0.04s to first token
|     * 58.02 tokens/second over 1905 tokens in its responses
|     * 0m:34s from sending the input to completion of the output
|
| Same model, same loader, different architectures and
| resources. This is why a lot of the AI crowd are on Macs:
| their chip designs, especially the Neural Engine and
| GPUs, allow quite competent edge inference while sipping
| comparative thimbles of energy. It's why if I were all-in
| on LLMs or leveraged them for work more often (which I
| intend to, given how I'm currently selling my generalist
| expertise to potential employers), I'd be seriously
| eyeballing these little Mac Studios for their local
| inference capabilities.
| kllrnohj wrote:
| Uh.... I must be missing something here, because you're
| hyping up Apple's NPU only to show it getting absolutely
| obliterated by the equally old 3090? Your 10900K having
| 64gb of RAM is also irrelevant here...
| stego-tech wrote:
| You're missing the bigger picture by getting bogged
| down in technical details. To an end user, the difference
| between thirty seconds and ninety seconds is often
| irrelevant for things like AI, where they _expect_ a
| delay while it "thinks". When taken in that context,
| you're now comparing a 14" laptop running off its
| battery, to a desktop rig gulping down ~500W according to
| my UPS, for a mere 66% reduction in runtime for a single
| query at the expense of 5x the power draw.
|
| Sure, the desktop machine performs better, as would a
| datacenter server jam-packed full of Blackwell GPUs, but
| _that's not what's exciting_ about Apple's
| implementation. It's the _efficiency_ of it all, being
| able to handle modern models on comparatively "weaker"
| hardware most folks would dismiss outright. _That's_ the
| point I was trying to make.
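|
| Back-of-the-envelope, taking the numbers above at face value
| (~34s at ~500W for the desktop, ~83s at roughly a fifth of
| that draw per the "5x" figure), the laptop still wins on
| energy per query:
|
|     desktop_w, desktop_s = 500, 34     # ~500W at the UPS, 0m:34s
|     laptop_w, laptop_s = 500 / 5, 83   # "5x the power draw", 1m:23s
|
|     for name, w, s in [("desktop", desktop_w, desktop_s),
|                        ("laptop", laptop_w, laptop_s)]:
|         print(f"{name}: ~{w * s / 3600:.1f} Wh per query")
|     # desktop: ~4.7 Wh per query; laptop: ~2.3 Wh per query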
| kllrnohj wrote:
| We're talking about the m3 ultra here, which is also wall
| powered and also expensive. Nobody is interested in
| dropping upwards of $10,000 on a Mac Studio to have
| "okay" performance just because an unrelated product is
| battery powered. Similarly saving a few bucks on
| electricity to triple the time the much, much more
| expensive engineer time spent waiting on results is
| foolish
|
| Also Apple isn't unique in having an NPU in a laptop.
| Fucking everyone does at this point.
| baq wrote:
| This is a 'shut up and take my money' price, it'll fly off
| the shelves.
| jread wrote:
| $8549 with 1TB storage
| rbanffy wrote:
| It can connect to external storage easily.
| exabrial wrote:
| If Apple supported Linux (headless) natively, and we could rack
| m4 pros, I absolutely would use them in our Colo.
|
| The CPUs have zero competition in terms of speed, memory
| bandwidth. Still blown away no other company has been able to
| produce Arm server chips that can compete.
| notpushkin wrote:
| Asahi is a thing. For headless usage it's pretty much ready
| to go already.
| criddell wrote:
| The Asahi maintainer resigned recently. What that means for
| the future only time will tell. I probably wouldn't want to
| make a big investment in it right now.
| seabrookmx wrote:
| Your wording makes it sound like it was a one-man show.
| Asahi has a really strong contributor base, new
| leadership[1], and the backing of Fedora via the Asahi
| Fedora Remix. While Hector resigning is a loss, I don't
| think it's a death knell for the project.
|
| [1]: https://asahilinux.org/2025/02/passing-the-torch/
| hoppp wrote:
| He was the lead developer and very prominent figure. I
| think it probably boils down to funding the new
| developments.
| whimsicalism wrote:
| it was pretty close to a one man show
| skyyler wrote:
| By what grounds do you make this statement?
|
| My understanding is there are dozens of people working on
| it.
| whimsicalism wrote:
| e: I'm removing this comment because on reflection I
| think it is probably some form of doxxing and being right
| on the internet isn't that important.
| ArchOversight wrote:
| You believe that Hector Martin is also Asahi Lina?
|
| https://bsky.app/profile/lina.yt
|
| https://github.com/AsahiLina
| whimsicalism wrote:
| e: snip
| raydev wrote:
| I thought this was confirmed a couple years ago.
| surajrmal wrote:
| You make it sound like there was only one.
| lynndotpy wrote:
| Not at all for M3 or M4. Support is for M2 and M1
| currently.
| EgoIncarnate wrote:
| M3 support in Asahi is still heavily WIP. I think it
| doesn't even have display support, Ethernet, or Wifi yet;
| it's only serial over USB. Without any GPU or ANE
| support, it's not very useful for AI stuff.
| https://asahilinux.org/docs/M3-Series-Feature-Support/
| WD-42 wrote:
| It's only a thing for the M1. Asahi is a Sisyphean effort
| to keep up with new hardware and the outlook is pretty grim
| at the moment.
|
| Apple's whole m.o. is to take FOSS software, repackage it
| and sell it. They don't want people using it directly.
| hedora wrote:
| The last I checked, AMD was outperforming Apple perf/dollar
| on the high end, though they were close on perf/watt for the
| TDPs where their parts overlapped.
|
| I'd be curious to know if this changes that. It'd take a lot
| more than doubling cores to take out the very high power AMD
| parts, but this might squeeze them a bit.
|
| Interestingly, AMD has also been investing heavily in unified
| RAM. I wonder if they have / plan an SoC that competes 1:1
| with this. (Most of the parts I'm referring to are set up for
| discrete graphics.)
| aurareturn wrote:
| The M4 Pro is 56% faster in ST performance against AMD's
| new Strix Halo while being 3.6x more efficient.
|
| Source: https://www.notebookcheck.net/AMD-Ryzen-AI-
| Max-395-Analysis-...
|
| Cinebench 2024 results.
| hedora wrote:
| That's a laptop part, so it makes different tradeoffs.
|
| Somewhere on the internet there is a tdp wattage vs
| performance x-y plot. There's a pareto optimal region
| where all the apple and amd parts live. Apple owns low
| tdp, AMD owns high tdp. They duke it out in the middle.
| Intel is nowhere close to the line.
|
| I'd guess someone has made one that includes datacenter
| ARM, but I've never seen it.
| aurareturn wrote:
| High TDP? You mean server-grade CPUs? Apple doesn't make
| those.
| derefr wrote:
| True, but these "Ultra" chips do target the same niche as
| (some) high-TDP chips.
|
| Workstations (like the Mac Studio) have traditionally
| been a space where "enthusiast"-grade consumer parts
| (think Threadripper) and actual server parts competed.
| The owner of a workstation didn't _usually_ care about
| their machine's TDP; they just cared that it could chew
| through their workloads as quickly as possible. But,
| unlike an actual server, workstations didn't need the
| super-high core count required for _multitenant_
| parallelism; and would go idle for long stretches -- thus
| benefitting (though not requiring) more-efficient power
| management that could drive down _baseline_ TDP.
| aurareturn wrote:
| Oh, you mean Threadripper. I thought you were talking
| about Epyc.
|
| Anyway, I don't think it's comparable really. This thing
| comes with a fat GPU, NPU, and unified memory.
| Threadripper is just a CPU.
| mort96 wrote:
| The GPU and NPU shouldn't be consuming power when not in
| use. Why shouldn't we compare M3 Ultra to Threadripper?
| diggan wrote:
| Isn't the rack-mounted Mac Pro supposedly "server-grade"
| (https://www.apple.com/shop/buy-mac/mac-pro/rack)?
|
| At least judging by the mounts, they want them to be used
| that way, even though the CPU might not fit with the de
| facto industry label for "server-grade".
| aurareturn wrote:
| Server grade CPUs. I thought he was referring to Epyc
| CPUs.
| hedora wrote:
| Indeed. The M3 Ultra is in the midrange where they duke
| it out. Similarly, for its niche, the iPhone CPU was
| better than AMD's low end processors.
|
| Anyway the Apple config in the article costs about 5x
| more than a comparable low end AMD server with 512GB of
| ram, but adds an NPU. AMD has NPUs in lower end stuff;
| not sure about this TDP range.
| refulgentis wrote:
| > You mean server-grade CPUs? Apple doesn't make those.
|
| Right.
|
| It is coming up because we're in a thread about using
| them as server CPUs. (c.f. "colo", "2U" in OP and OP's
| child), and the person you're replying to is making the
| same point you are
|
| For years now, people will comment "these are the best
| chips, I'd replace all chips with them."
|
| Then someone points out perf/watt is not perf.
|
| Then someone else points out some M-series is much faster
| than a random CPU.
|
| And someone else points out that the random CPU is not a
| top performing CPU.
|
| And someone else points out M-series are optimized for
| perf/watt and it'd suck if it wasn't.
|
| I love my MacBook, the M-series has no competitors in the
| case it's designed for.
|
| I'd just prefer, at this point, that we can skip long
| threads rehashing it.
|
| It's a great chip. It's not the fastest, and it's better
| for that. We want perf/watt in our mobile devices.
| There's fundamental, well-understood, engineering
| tradeoffs that imply being great at that necessitates the
| existence of faster processors.
| aurareturn wrote:
| > It's a great chip. It's not the fastest,
|
| It has the world's fastest single thread.
| refulgentis wrote:
| I can't quite tell what's going on here, earlier, you
| seem to be clear -- c.f. "Apple doesn't make server-grade
| CPUs"
| aurareturn wrote:
| Correct. But their M4 line has the fastest single thread
| performance in the world.
| nameequalsmain wrote:
| According to what source? Passmark says otherwise[1]. The
| fastest Intel CPUs have both a higher single thread and
| multi thread score in that test.
|
| [1] https://www.cpubenchmark.net/singleThread.html
| refulgentis wrote:
| Well, no, right?
|
| The _M4 Max_ had great, I would argue the best at time of
| release, single _core_ results on _Geekbench_.
|
| That is a different claim from M4 line has the top single
| thread performance in the world.
|
| I'm curious:
|
| You're signalling both that you understand the
| fundamental tradeoff ("Apple doesn't make server-grade
| CPUs") and that you are talking about something else
| (follow-up with M4 family has top single-thread
| performance)
|
| What drives that? What's the other thing you're hoping to
| communicate?
|
| If you are worried that if you leave it at "Apple doesn't
| make server-grade CPUs", that people will think M4s
| aren't as great as they are, this is a technical-enough
| audience, I think we'll understand :) It doesn't come
| across as denigrating the M-series, but as understanding
| a fundamental, physically-based, tradeoff.
| yxhuvud wrote:
| It also includes gaming machines. Of course, Apple also
| doesn't make those.
| tomrod wrote:
| > tdp wattage vs performance x-y plot
|
| This?
|
| https://www.videocardbenchmark.net/power_performance.html
| #sc...
| echoangle wrote:
| That's GPUs, not CPUs
| nick_ wrote:
| Same. I'm not sure what to make of the various claims. I
| personally defer to this table in general:
| https://www.cpubenchmark.net/power_performance.html.
|
| I'm not sure how those benchmarks translate to common real
| world use cases.
| hoppp wrote:
| What about serviceability? These come with a soldered-in SSD?
| That would be an issue for server use; it's too expensive to
| throw it all away for a broken SSD.
| gjsman-1000 wrote:
| Nah, in many businesses, everything is on a schedule. For
| desktop computers, a common cycle is 4 years. For servers,
| maybe a little longer, but not by much. After that date
| arrives, it's liquidate everything and rebuild.
|
| Having things consistently work is much cheaper than down
| days caused by your ancient equipment. Apple's SSDs will
| make it to 5 years no problem - and more likely, 10-15
| years.
| hedora wrote:
| At my last N jobs, companies built high end server farms
| and carefully specced all the hardware. Then they looked
| at SSD specs and said "these are all fine".
|
| Fast forward 2 years: The $50-$250K machines have a 100%
| drive failure rate, and some poor bastard has to fly from
| data center to data center to swap the $60 drive for a
| $120 one, then re-rack and re-image each machine.
|
| Anyway, soldering a decent SSD to the motherboard
| would actually improve reliability at all those places.
| olyjohn wrote:
| What does soldering it to the board have to do with
| reliability?
|
| If they were soldered onto those systems you talk about,
| all those would have had to be replaced instead of just
| having the drive swapped out and re-imaged.
| wtallis wrote:
| I think the implication was that a soldered SSD doesn't
| give the customer as much chance to pick the wrong SSD.
| But it's still possible for the customer to have a
| different use case in mind than the OEM did when the OEM
| is picking what SSD to include.
| choilive wrote:
| What company was specc'ing out a 6 figure machine just to
| put in a consumer class SSD?
| galad87 wrote:
| No, the SSD isn't soldered, it has got one or two removable
| modules: https://everymac.com/systems/apple/mac-studio/mac-
| studio-faq...
| PaulHoule wrote:
| If I read this right, the r8g.48xlarge at AMZN [1] has 192
| cores and 1536GB which exceeds the M3 Ultra in some metrics.
|
| It reminds me of the 1990s when my old school was using Sun
| machines based on the 68k series and later SPARC and we were
| blown away with the toaster-sized HP PA RISC machine that was
| used for student work for all the CS classes.
|
| Then Linux came out and it was clear the 386 trashed them all
| in terms of value and as we got the 486 and 586 and further
| generations, the Intel architecture trashed them in every
| respect.
|
| The story then was that Intel was making more parts than
| anybody else so nobody else could afford to keep up the
| investment.
|
| The same is happening with parts for phones and TSMC's
| manufacturing dominance -- and today with chiplets you can
| build up things like the M3 Ultra out of smaller parts.
|
| [1] https://aws.amazon.com/ec2/instance-types/r8g/
| hedora wrote:
| In fairness, the sun and dec boxes I used back then (up to
| about 1999) could hold their own against intel machines.
|
| Then, one day, we built a 5 machine amd athlon xp linux
| cluster for $2000 ($400/machine) that beat all the unix and
| windows server hardware by at least 10x on $/perf.
|
| It's nice that we have more than one viable cpu vendor
| these days, though it seems like there's only one viable
| fab company.
| PaulHoule wrote:
| In 1998-1999 I had a DEC Alpha on my desktop that was
| really impressive, it was a 64-bit machine a few years
| before you could get a 64-bit Athlon.
| hedora wrote:
| Yeah.
|
| For what we needed, five 32 bit address spaces was enough
| DRAM. The individual CPU parts were way more than 20% as
| fast, and the 100Mbit switch was good enough.
|
| (The data basically fit in ram, so network transport time
| to load a machine was bounded by 4GiB / 8MiB / sec = 500
| seconds. Also, the hard disks weren't much faster than
| network back then.)
| winocm wrote:
| The Alpha architecture was 64-bit from the very beginning
| (though the amount of addressable virtual memory and
| physical memory depends on the processor implementation).
|
| I think it goes something like:
|     - 2106x/EV4: 34-bit physical, 43-bit virtual
|     - 21164/EV5: 40-bit physical, 43-bit virtual
|     - 21264/EV6: 44-bit physical, 48-bit virtual
|
| The EV6 is a bit quirky as it is 43-bit by default, but
| can use 48-bits when I_CTL<VA_48> or VA_CTL<VA_48> is
| set. (the distinction of the registers is for each access
| type, i.e: instruction fetch versus data load/store)
|
| The 21364/EV7 likely has the same characteristics as EV6,
| but the hardware reference manual seems to have been lost
| to time...
| PaulHoule wrote:
| My understanding is that the VAX from Digital was the
| mother of all "32-bit" architectures to replace the dead
| end PDP-11 (had a 64kbyte user space so wasn't really
| that much better than an Apple ][) and PDP-10/20 (36-bit
| words were awkward after the 8-bit byte took over the
| industry). The 68k and 386 protected mode were imitations
| of the VAX.
|
| Digital struggled with the microprocessor transition
| because they didn't want to kill their cash cow
| minicomputers with microcomputer-based replacements. They
| went with the 64-bit Alpha because they wanted to rule
| the high end in the CMOS age. And they did, for a little
| while. But the mass market caught up.
| nsteel wrote:
| It seems Graviton 4 CPUs have 12 channels of DDR5-5600, i.e.
| 540GB/s main memory bandwidth for the CPU to use. M3 Ultra
| has 64-channels of LPDDR5-6400 i.e. ~800GB/s of memory
| bandwidth for the CPU or the GPU to use. So the M3 Ultra
| has way fewer (CPU) cores, but way more memory bandwidth.
| Depends what you're doing.
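|
| The arithmetic behind those figures, assuming 64-bit DDR5
| channels and 16-bit LPDDR5 channels (which is how the quoted
| numbers appear to be derived):
|
|     def bw_gb_s(channels, bits_per_channel, mega_transfers):
|         # channels * bytes per transfer * million transfers/sec
|         return channels * (bits_per_channel / 8) * mega_transfers / 1000
|
|     print(bw_gb_s(12, 64, 5600))   # Graviton 4: ~537.6 GB/s
|     print(bw_gb_s(64, 16, 6400))   # M3 Ultra:   ~819.2 GB/s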
| icecube123 wrote:
| Yeah, I've been thinking about this for a few years. The Mx
| series chips would sell into data centers like crazy if
| Apple went after that market, especially if they created a
| server-tuned chip. It could probably be their 2nd biggest
| product line behind the iPhone. The performance and
| efficiency are awesome. I guess it would be neat to see some
| web serving and database benchmarks to really know.
| kridsdale1 wrote:
| TSMC couldn't make enough at the leading node in addition
| to all the iPhone chips Apple has to sell. There's a
| physical throughput limit. That's why this isn't M4.
| Apofis wrote:
| Doesn't MacOS support these things? I'm sure Apple runs these
| in their datacenters somehow?
| rbanffy wrote:
| > The CPUs have zero competition in terms of speed, memory
| bandwidth.
|
| Maybe not at the same power consumption, but I'm sure mid-
| range Xeons and EPYCs mop the floor with the M3 Ultra in CPU
| performance. What the M3 Ultra has that nobody else comes
| close to is a decent GPU near a pool of half a terabyte of RAM.
| Thaxll wrote:
| Apple does not make server CPUs, they make consumer low W
| CPUs, it's very different.
|
| FYI Apple runs Linux in their DC, so no Apple hardware in
| their own servers.
| alwillis wrote:
| > Apple does not make server CPUs, they make consumer low W
| CPUs, it's very different.
|
| This is silly. Given the performance per watt, the M series
| would be great in a data center. As you all know,
| electricity for running the servers and cooling for the
| servers are the two biggest ongoing costs for a data
| center; the M series requires less power and runs more
| efficiently than the average Intel or AMD-based server.
|
| > FYI Apple runs Linux in their DC, so no Apple hardware in
| their own servers.
|
| That's certainly no longer the case. Apple announced their
| Private Cloud Compute [1] initiative--Apple designed
| servers running Apple Silicon to support Apple Intelligence
| functions that can't run on-device.
|
| BTW, Apple just announced a $500 billion investment [2] in
| US-based manufacturing, including a 250,000 square foot
| facility to make _servers_. Yes, these will obviously be
| for their Private Cloud Compute servers... but it doesn't
| have to be only for that purpose.
|
| From the press release:
|
| _As part of its new U.S. investments, Apple will work with
| manufacturing partners to begin production of servers in
| Houston later this year. A 250,000-square-foot server
| manufacturing facility, slated to open in 2026, will create
| thousands of jobs._
|
| _Previously manufactured outside the U.S., the servers
| that will soon be assembled in Houston play a key role in
| powering Apple Intelligence, and are the foundation of
| Private Cloud Compute, which combines powerful AI
| processing with the most advanced security architecture
| ever deployed at scale for AI cloud computing. The servers
| bring together years of R&D by Apple engineers, and
| deliver the industry-leading security and performance of
| Apple silicon to the data center._
|
| _Teams at Apple designed the servers to be incredibly
| energy efficient, reducing the energy demands of Apple data
| centers -- which already run on 100 percent renewable
| energy. As Apple brings Apple Intelligence to customers
| across the U.S., it also plans to continue expanding data
| center capacity in North Carolina, Iowa, Oregon, Arizona,
| and Nevada._
|
| [1]: https://security.apple.com/blog/private-cloud-compute/
|
| [2]: https://www.apple.com/newsroom/2025/02/apple-will-
| spend-more...
| stego-tech wrote:
| > This hardware is really being held back by the operating
| system at this point.
|
| It really is. Even if they themselves won't bring back their
| old XServe OS variant, I'd really appreciate it if they at
| least partnered with a Linux or BSD (good callout, ryao) dev to
| bring a server OS to the hardware stack. The consumer OS, while
| still better (to my subjective tastes) than Windows, is
| increasingly hampered by bloat and cruft that make it untenable
| for production server workloads, at least to my subjective
| standards.
|
| A server OS that just treats the underlying hardware like a
| hypervisor would, making the various components attachable or
| shareable to VMs and Containers on top, would make these things
| incredibly valuable in smaller datacenters or Edge use cases.
| Having an on-prem NPU with that much RAM would be a godsend for
| local AI acceleration among a shared userbase on the LAN.
| ryao wrote:
| Given shared heritage, I would expect to see Apple work with
| FreeBSD before I would expect Apple to work with Linux.
| stego-tech wrote:
| You are _technically_ correct (the best kind of correct).
| I'm just a filthy heathen who lumps the BSDs and Linux
| distros under "Linux" as an _incredibly incorrect_ catchall
| for casual discourse.
| hedora wrote:
| I heard OpenBSD has been working for a while.
|
| I'm continually surprised Apple doesn't just donate
| something like 0.1% of their software development budget to
| proton and the asahi projects. It'd give them a big chunk
| of the gaming and server markets pretty much overnight.
|
| I guess they're too busy adding dark patterns that re-
| enable siri and apple intelligence instead.
| hinkley wrote:
| I miss the XServe almost as much as I miss the Airport
| Extreme.
| stego-tech wrote:
| I feel like Apple and Ubiquiti have a missed collaboration
| opportunity on the latter point, especially with
| Ubiquiti's recent UniFi Express unit. It feels like pairing
| Ubiquiti's kit with Apple's Homekit could benefit both, by
| making it easier for Homekit users to create new VLANs
| specifically for Homekit devices, thereby improving
| security - with Apple dubbing the term, say, "Secure Device
| Network" or some marketingspeak to make it easier for
| average consumers to understand. An AppleTV unit could even
| act as a limited CloudKey for UniFi devices like Access
| Points, or UniFi Cameras to connect/integrate as Homekit
| Cameras.
|
| Don't get me wrong, _I_ wouldn't use that feature (I
| prefer self-hosting it all myself), but for folks like my
| family members, it'd be a killer addition to the lineup
| that makes my life supporting them much easier.
| jmyeet wrote:
| I've been looking at the potential for Apple to make really
| interesting LLM hardware. Their unified memory model could be a
| real game-changer because NVidia really forces market
| segmentation by limiting memory.
|
| It's worth adding the M3 Ultra has 819GB/s memory bandwidth
| [1]. For comparison the RTX 5090 is 1800GB/s [2]. That's still
| less but the M4 Mac Minis have 120-300GB/s and this will limit
| token throughput so 819GB/s is a vast improvement.
|
| For $9500 you can buy a M3 Ultra Mac Studio with 512GB of
| unified memory. I think that has massive potential.
|
| [1]: https://www.apple.com/mac-studio/specs/
|
| [2]: https://www.nvidia.com/en-us/geforce/graphics-
| cards/50-serie...
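|
| A rough sense of why bandwidth caps token throughput: each
| generated token has to stream the model's (active) weights
| from memory, so tokens/s is bounded by bandwidth divided by
| bytes read per token. A sketch using a hypothetical dense 70B
| model at 8-bit quantization, ignoring compute and whether the
| weights actually fit on each device:
|
|     weights_gb = 70e9 * 1 / 1e9        # dense 70B params, ~1 byte each
|     for name, bw in [("M4 mini (low)", 120), ("M4 mini (high)", 300),
|                      ("M3 Ultra", 819), ("RTX 5090", 1800)]:
|         print(f"{name}: <= {bw / weights_gb:.1f} tokens/s")
|     # e.g. M3 Ultra: <= ~11.7 tokens/s for such a model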
| hedora wrote:
| Other than the NPU, it's not really a game changer; here's a
| 512GB AMD deepseek build for $2000:
|
| https://digitalspaceport.com/how-to-run-deepseek-r1-671b-ful...
| flakiness wrote:
| The low energy use can be a game changer if you live in a
| crappy apartment with limited power capacity. I gave up my
| big GPU box dream because of that.
| aurareturn wrote:
| > between 4.25 to 3.5 TPS (tokens per second) on the Q4 671b
| > full model.
|
| 3.5 - 4.25 tokens/s. You're torturing yourself. Especially
| with a reasoning model.
|
| This will run it at 40 tokens/s based on rough calculation.
| Q4 quant. 37b active parameters.
|
| 5x higher price for 10x higher performance.
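|
| Presumably the "rough calculation" is the usual bandwidth-
| bound estimate: with an MoE model only the ~37B active
| parameters are read per token, so at Q4 (about half a byte
| per parameter):
|
|     bandwidth_gb_s = 819            # M3 Ultra memory bandwidth
|     active_params = 37e9            # active parameters per token
|     gb_per_token = active_params * 0.5 / 1e9
|     print(bandwidth_gb_s / gb_per_token)   # ~44 tokens/s upper bound
|     # KV-cache reads and compute overhead pull the real number
|     # down toward the ~40 tokens/s quoted above.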
| hinkley wrote:
| Also you don't have to deal with Windows. Which people who
| do not understand Apple are very skilled at not noticing.
|
| If you've ever used git, svn, or an IDE side by side on
| corporate Windows versus Apple I don't know why you would
| ever go back.
| hatthew wrote:
| Is there a reason one couldn't use linux?
| bigyabai wrote:
| The PC doesn't have to run Windows either. Strictly
| speaking, professional applications see MacOS support as
| an Apple-sanctioned detriment.
|
| > If you've ever used git, svn, or an IDE side by side
|
| I still reach for Windows, even though it's a dogshit OS.
| I would rather use WSL to write and deploy a single app,
| as opposed to doing my work in a Linux VM or (god forbid)
| writing and debugging multiple versions just to support
| my development runtime. If I'm going to use an ad-
| encumbered commercial service-slop OS, I might as well
| pick the one that doesn't actively block my work.
| brailsafe wrote:
| It's also just clearly a powerful and interesting
| tinkering project, which there are valid arguments for,
| but this can just chill out on your desk as an elegant
| general productivity machine. What it wouldn't do that
| the tinkering project could do is be upgraded, act as a
| powerful gaming pc, or cause migraines from constant fan
| noise.
|
| The custom build would work great though, even more so in a
| server room, and it also shows by comparison how
| excessively Apple prices its components.
| intrasight wrote:
| It certainly is held back and that is unfortunate. But if you
| can run your workloads on this amazing machine, then that's a
| lot of compute for the buck.
|
| I assume that there's a community of developers focusing on
| leveraging this hardware instead of complaining about the
| operating system.
| hinkley wrote:
| Given that the M1 Ultra and M2 Ultra also exist, I'd expect
| either straight binning, or two designs that reuse mostly
| the same cores but with more of them and a few extra
| features.
|
| I love Apple but they love to speak in half truths in product
| launches. Are they saying the M3 Ultra is their first
| Thunderbolt 5 computer? I don't recall seeing any previous
| announcements.
| kridsdale1 wrote:
| M4 Pro MacBook and Mini have TB5.
| hajile wrote:
| One of the leakers who got this Mac Studio right claims Apple
| is reserving the M4 ultra for the Mac Pro to differentiate the
| products a bit more.
| GeekyBear wrote:
| I also wondered about binning, so I pulled together how heavily
| Apple's Max chips were binned in shipping configurations.
|
| M1 Max - 24 to 32 GPU cores
|
| M2 Max - 30 to 38 GPU cores
|
| M3 Max - 30 to 40 GPU cores
|
| M4 Max - 32 to 40 GPU cores
|
| I also looked up the announcement dates for the Max and the
| Ultra variant in each generation.
|
| M1 Max - October 18, 2021
|
| M1 Ultra - March 8, 2022
|
| M2 Max - January 17, 2023
|
| M2 Ultra - June 5, 2023
|
| M3 Max - October 30, 2023
|
| M3 Ultra - March 12, 2025
|
| M4 Max - October 30, 2024
|
| > My guess is that Apple developed this chip for their internal
| AI efforts
|
| As good a guess as any, given the additional delay between the
| M3 Max and Ultra being made available to the public.
| jonplackett wrote:
| I'm missing the point. What is it you're concluding from
| these dates?
| GeekyBear wrote:
| I was referring to the additional year of delay between the
| M3 Max and M3 Ultra announcements when compared to the M1
| and M2 generations.
|
| The theory that the M3 Ultra was being produced, but
| diverted for internal use makes as much sense as any theory
| I've seen.
|
| It makes at least as much sense as the "TSMC had difficulty
| producing enough defect free M3 Max chips" theory.
| behnamoh wrote:
| 819GB/s bandwidth...
|
| what's the point of 512GB RAM for LLMs on this Mac Studio if the
| speed is painfully slow?
|
| it's as if Apple doesn't want to compete with Nvidia... this is
| really disappointing in a Mac Studio. FYI: M2 Ultra already has
| 800GB/s bandwidth
| gatienboquet wrote:
| NVIDIA RTX 4090: ~1,008 GB/s
|
| NVIDIA RTX 4080: ~717 GB/s
|
| AMD Radeon RX 7900 XTX: ~960 GB/s
|
| AMD Radeon RX 7900 XT: ~800 GB/s
|
| How's that slow exactly ?
|
| You can have 10000000Gb/s and without enough VRAM it's useless.
| ttul wrote:
| I have a 4090 and, out of curiosity, I looked up the FLOPS in
| comparison with Apple chips.
|
| Nvidia RTX 4090 (Ada Lovelace)
|
| FP32: Approximately 82.6 TFLOPS
|
| FP16: When using its 4th-generation Tensor Cores in FP16 mode
| with FP32 accumulation, it can deliver roughly 165.2 TFLOPS
| (in non-tensor mode, the FP16 rate is similar to FP32).
|
| FP8: The Ada architecture introduces support for an FP8
| format; using this mode (again with FP32 accumulation), the
| RTX 4090 can achieve roughly 330.3 TFLOPS (or about 660.6
| TOPS, depending on how you count operations).
|
| Apple M1 Ultra (The previous-generation top-end Apple chip)
|
| FP32: Around 15.9 TFLOPS (as reported in various benchmarks)
|
| FP16: By similar scaling, FP16 performance would be roughly
| double that value--approximately 31.8 TFLOPS (again, an
| estimate based on common patterns in Apple's GPU designs)
|
| FP8: Like the M3 family, the M1 Ultra does not support a
| dedicated FP8 precision mode.
|
| So a $2000 Nvidia 4090 gives you about 5x the FLOPS, but with
| far less high speed RAM (24GB vs. 512GB from Apple in the new
| M3 Ultra). The RAM bandwidth on the Nvidia card is over
| 1TBps, compared with 800GBps for Apple Silicon.
|
| Apple is catching up here and I am very keen for them to
| continue doing so! Anything that knocks Nvidia down a notch
| is good for humanity.
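|
| For a quick sanity check on the "about 5x" claim, the ratios
| from the figures in this thread (treating "over 1TBps" as
| roughly 1000 GB/s):
|
|     print(82.6 / 15.9)    # ~5.2x FP32 compute (4090 vs M1 Ultra)
|     print(1000 / 819)     # ~1.2x memory bandwidth (4090 vs M3 Ultra)
|     print(512 / 24)       # ~21x memory capacity (M3 Ultra vs 4090)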
| bigyabai wrote:
| > Anything that knocks Nvidia down a notch is good for
| humanity.
|
| I don't love Nvidia a whole lot but I can't understand
| where this sentiment comes from. Apple abandoned their
| partnership with Nvidia, tried to support their own CUDA
| alternative with blackjack and hookers (OpenCL), abandoned
| _that_ , and began rolling out a proprietary replacement.
|
| CUDA sucks for the average Joe, but Apple abandoned any
| chance of taking the high road when they cut ties with
| Khronos. Apple doesn't want better AI infrastructure for
| humanity; they envy the control Nvidia wields and want it
| for themselves. Metal versus CUDA is the type of
| competition where no matter who wins, humanity loses. Bring
| back OpenCL, then we'll talk about net positives again.
| whimsicalism wrote:
| h100 sxm - 3TB/s
|
| vram is not really the limiting factor for serious actors in
| this space
| gatienboquet wrote:
| If my grandmother had wheels, she'd be a bicycle
| aurareturn wrote:
| > what's the point of 512GB RAM for LLMs on this Mac Studio
| > if the speed is painfully slow?
|
| You can fit the entire Deepseek 671B q4 into this computer and
| get 41 tokens/s because it's an MoE model.
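|
| The "fits" part checks out on paper: at Q4 the full 671B
| parameters are roughly half a byte each, a sketch:
|
|     weights_gb = 671e9 * 0.5 / 1e9
|     print(weights_gb)          # ~335 GB of weights
|     print(512 - weights_gb)    # ~177 GB left for KV cache and the OS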
| KingOfCoders wrote:
| Your comments went from
|
| "40 tokens/s by my calculations"
|
| to
|
| "40 tokens/s"
|
| to
|
| "41 tokens/s"
|
| Is there a dice involved in "your calculations?"
| pier25 wrote:
| So weird they released the Mac Studio with an M4 Max and M3
| Ultra.
|
| Why? Do they have too many M3 chips in stock?
| bigfishrunning wrote:
| The M4 Max is faster, the M3 Ultra supports more unified memory
| -- So pick whichever meets your requirements
| pier25 wrote:
| Yes but why not release an M4 Ultra?
| wpm wrote:
| Because the M4 architecture doesn't have the interconnects
| needed to fuse two Max SoCs together.
| johntitorjr wrote:
| Lots of AI HW is focused on RAM (512GB!). I have a cost-sensitive
| application that needs speed (300+ TOPS), but only 1GB of RAM.
| Are there any HW companies focused on that space?
| Havoc wrote:
| Tenstorrent Grayskull cards might be a fit. Think they're not entirely plug
| and play though
| xyzsparetimexyz wrote:
| Isn't that just any discrete (Nvidia,AMD) GPU?
| NightlyDev wrote:
| Most recent GPUs will do. An older RTX 4070 is over 400 TOPS,
| the new RTX 5070 is around 1000 TOPS, and the RTX 5090 is
| around 3600 TOPS.
| johntitorjr wrote:
| Yeah, that's basically where I'm at with options. Not ideal
| for a cost sensitive application.
| stefan_ wrote:
| Just buy any gaming card? Even something like the Jetson AGX
| Orin boasts 275 TOPS (but they add in all kind of different
| subsystems to reach that number).
| johntitorjr wrote:
| The Jetson is interesting!
|
| Can you elaborate on how the TOPS value is inflated? What GPU
| would be the equivalent of the Jetson AGX Orin?
| stefan_ wrote:
| The problem with the TOPS is that they add in ~100 TOPS
| from the "Deep Learning Accelerator" coprocessors, but they
| have a lot of awkward limitations on what they can do (and
| software support is terrible). The GPU is an Ampere
| generation, but there is no strict consumer GPU equivalent.
| crest wrote:
| Too bad it lacks even the streaming mode SVE2 found in M4 cores.
| If only Apple would provide a full SVE2 implementation to put
| pressure on ARM to make it non-optional so AArch64 isn't
| effectively restricted to NEON for SIMD.
| vlovich123 wrote:
| This is for AI which is going to benefit more from use of metal
| / NPU than SIMD.
| bigyabai wrote:
| Sure, but larger models that fit in that 512gb memory are
| going to take a long time to tokenize/detokenize without
| hardware-accelerated BLAS.
| danieldk wrote:
| Why would you need BLAS for tokenization/detokenization?
| Pretty much everyone still uses BBPE which amounts to
| iteratively applying merges.
|
| (Maybe I'm missing something here.)
| ryao wrote:
| Tokenization/detokenization does not use BLAS.
| stouset wrote:
| Hell I'm just sitting here hoping the future M5 adopts SVE. Not
| even SVE2.
| lauritz wrote:
| They updated the Studio to M3 Ultra now, so M4 Ultra can
| presumably go directly into the Mac Pro at WWDC? Interesting
| timing. Maybe they'll change the form factor of the Mac Pro, too?
|
| Additionally, I would assume this is a very low-volume product,
| so it being on N3B isn't a dealbreaker. At the same time, these
| chips must be very expensive to make, so tying them with luxury-
| priced RAM makes some kind of sense.
| jsheard wrote:
| > Maybe they'll change the form factor of the Mac Pro, too?
|
| Either that or kill the Mac Pro altogether, the current
| iteration is such a half-assed design and blatantly terrible
| value compared to the Studio that it feels like an end-of-the-
| road product just meant to tide PCIe users over until they can
| migrate everything to Thunderbolt.
|
| They recycled a design meant to accommodate multiple beefy GPUs
| even though GPUs are no longer supported, so most of the
| cooling and power delivery is vestigial. Plus the PCIe
| expansion was quietly downgraded, Apple Silicon doesn't have a
| ton of PCIe lanes so the slots are _heavily_ oversubscribed
| with PCIe switches.
| pier25 wrote:
| I've always maintained that the M2 Mac Pro was really a dev
| kit for manufacturers of PCI parts. It's such a meaningless
| product otherwise.
| lauritz wrote:
| I agree. Nonetheless, I agree with Siracusa that the Mac Pro
| makes sense as a "halo car" in the Mac lineup.
|
| I just find it interesting that you can currently buy a M2
| Ultra Mac Pro that is weaker than the Mac Studio (for a
| comparable config) at a higher price. I guess it "remains a
| product in their lineup" and we'll hear more about it later.
|
| Additionally: If they wanted to scrap it down the road, why
| would they do this now?
| madeofpalk wrote:
| The current Mac Pro is not a "halo car". It's a large USB-A
| dongle for a Mac Studio.
| crowcroft wrote:
| Agree with this, and it doesn't seem like it's a priority for
| Apple to bring the kind of expandability back any time soon.
|
| Maybe they can bring back the trash can.
| jsheard wrote:
| Isn't the Mac Studio the new trash can? I can't think of
| how a non-expandable Mac Pro could be meaningfully
| different to the Studio unless they introduce an even
| bigger chip above the Ultra.
| xp84 wrote:
| > Mac Studio the new trash can?
|
| Indeed, and tbh it really commits even more to the non-
| expandability that the Trashcan's designers seemed to be
| going for. After all, the Trashcan at least had
| replaceable RAM and storage. The Mac Studio has
| proprietary storage modules for no reason aside from
| Apple's convenience/profits (and of course the
| 'integrated' RAM which I'll charitably assume was done
| for altruistic reasons because of how it's "shared.")
|
| The difference is that today users are accepting modern
| Macs where they rejected the Trashcan. I think it's
| because Apple's practices have become more widespread
| anyway*, and certain parts of the strategy like the RAM
| thing at least have upsides. That, and the thermals are
| better because the Trashcan's thermal design was not fit
| for purpose.
|
| * I was trying to fix a friend's nice Lenovo laptop
| recently -- it turned out to just have some bad RAM, but
| when we opened it up we found it was soldered :(
| crowcroft wrote:
| Oh yea I wasn't clear I just meant bring back the design
| - agree the studio basically is the trash can.
| newsclues wrote:
| The Mac Pro could exist as a PCIe expansion slot storage case
| that accepts a logic board from the frequently updated
| consumer models. Or multiple Mac Studio logic boards all in
| one case with your expansion cards all working together.
| agloe_dreams wrote:
| My understanding was that Apple wanted to figure out how to
| build systems with multi-SOCs to replace the Ultra chips. The
| way it is currently done means that the Max chips need to be
| designed around the interconnect. Theoretically speaking, a
| multi-SOC setup could also scale beyond two chips and into a
| wider set of products.
| aurareturn wrote:
| I'm not sure if multi-SoC is possible, because having 2 GPUs
| work together such that the OS sees them as one big GPU is not
| really possible if the SoCs are separate.
| rbanffy wrote:
| Ultra is already two big M3 chips coupled through an
| interposer. Apple is curiously not going the way of chiplets
| like the big CPU crowd is.
| lauritz wrote:
| Interestingly, Apple apparently confirmed to a French website
| that M4 lacks the interconnect required to make an "Ultra"
| [0][1], so contrary to what I originally thought, they maybe
| won't make this after all? I'll take this report with a grain
| of salt, but apparently it's coming directly from Apple.
|
| Makes it even more puzzling what they are doing with the M2 Mac
| Pro.
|
| [0] https://www.numerama.com/tech/1919213-m4-max-et-m3-ultra-
| let...
|
| [1] More context on Macrumors:
| https://www.macrumors.com/2025/03/05/apple-confirms-m4-max-l...
| raydev wrote:
| Honestly I don't think we'll see the M4 Ultra at all this year.
| That they introduced the Studio with an M3 Ultra tells me M4
| Ultras are too costly or they don't have capacity to build
| them.
|
| And anyway, I think the M2 Mac Pro was Apple asking customers
| "hey, can you do anything interesting with these PCIe slots?
| because we can't think of anything outside of connectivity
| expansion really"
|
| RIP Mac Pro unless they redesign Apple Silicon to allow for
| upgradeable GPUs.
| layer8 wrote:
| Apple says that not every generation will get an "Ultra"
| variant: https://arstechnica.com/apple/2025/03/apple-
| announces-m3-ult...
| mrtksn wrote:
| Let's say you want to have the absolute max memory(512GB) to run
| AI models and let's say that you are O.K. with plugging a drive
| to archive your model weights then you can get this for a little
| bit shy of $10K. What a dream machine.
|
| Compared to Nvidia's Project DIGITS which is supposed to cost $3K
| and be available "soon", you can get a specs matching 128GB & 4TB
| version of this Mac for about $4700 and the difference would be
| that you can actually get it in a week and will run macOS(no idea
| how much performance difference to expect).
|
| I can't wait to see someone testing the full DeepSeek model on
| this, maybe this would be the first little companion AI device
| that you can fully own and can do whatever you like with it,
| hassle-free.
| bloomingkales wrote:
| There's an argument that replaceable PC parts are what you want
| at that price point, but Apple usually provides multi-year
| durability on their PCs. An Apple AI brick should last a while.
| behnamoh wrote:
| > I can't wait to see someone testing the full DeepSeek model
| on this
|
| at 819 GB per second bandwidth, the experience would be
| terrible
| mrtksn wrote:
| How many t/s would you expect? I think I feel perfectly fine
| when its over 50.
|
| Also, people figured a way to run these things in parallel
| easily. The device is pretty small, I think for someone who
| wouldn't mind the price tag stacking 2-3 of those wouldn't be
| that bad.
| behnamoh wrote:
| I know you're referring to the exolabs app, but the t/s is
| really not that good. It uses Thunderbolt instead of
| NVLink.
| yk wrote:
| I think I've seen 800 GB/s memory bandwidth, so a q4 quant
| of a 400 B model should be 4 t/s if memory bound.
| coder543 wrote:
| DeepSeek-R1 only has 37B active parameters.
|
| A back of the napkin calculation: 819GB/s / 37GB/tok = 22
| tokens/sec.
|
| Realistically, you'll have to run quantized to fit inside of
| the 512GB limit, so it could be more like 22GB of data
| transfer per token, which would yield 37 tokens per second as
| the theoretical limit.
|
| It is likely going to be very usable. As other people have
| pointed out, the Mac Studio is also not the only option at
| this price point... but it is neat that it _is_ an option.
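| A minimal sketch of that napkin math in Python (assuming decode
| is purely memory-bandwidth-bound and using the figures above;
| real-world throughput will be lower once compute, KV-cache reads
| and software overhead are counted):
|
|     # Hypothetical estimate, not a benchmark.
|     def decode_tok_per_s(bandwidth_gb_s, active_params_b,
|                          bytes_per_param):
|         # Each generated token reads roughly the active weights
|         # once from memory.
|         gb_per_token = active_params_b * bytes_per_param
|         return bandwidth_gb_s / gb_per_token
|
|     print(decode_tok_per_s(819, 37, 1.0))  # ~22 tok/s at 8-bit
|     print(decode_tok_per_s(819, 37, 0.6))  # ~37 tok/s at ~4.5-bit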
| bearjaws wrote:
| Not sure why you are being downvoted, we already know the
| performance numbers due to memory bandwidth constraints on
| the M4 Max chips, it would apply here as well.
|
| 525GB/s to 1000GB/s will double the TPS at best, which is
| still quite low for large LLMs.
| lanceflt wrote:
| Deepseek R1 (full, Q1) is 14t/s on an M2 Ultra, so this
| should be around 20t/s
| NightlyDev wrote:
| The full deepseek R1 model needs more memory than 512GB. The
| model is 720GB alone. You can run a quantized version on it,
| but not the full model.
| giancarlostoro wrote:
| At 9 grand I would certainly hope that they support the device
| software-wise longer than they supported my 2017 MacBook Air. I
| see no reason to be forced to cough up 10 grand to Apple
| essentially every 7 years; that's ridiculous.
| moondev wrote:
| > support for more than half a terabyte of unified memory -- the
| most ever in a personal computer
|
| AMD Ryzen Threadripper PRO 3995WX released over four years ago
| and supports 2TB (64c/128t)
|
| > Take your workstation's performance to the next level with the
| AMD Ryzen Threadripper PRO 3995WX 2.7 GHz 64-Core sWRX8
| Processor. Built using the 7nm Zen Core architecture with the
| sWRX8 socket, this processor is designed to deliver exceptional
| performance for professionals such as artists, architects,
| engineers, and data scientists. Featuring 64 cores and 128
| threads with a 2.7 GHz base clock frequency, a 4.2 GHz boost
| frequency, and 256MB of L3 cache, this processor significantly
| reduces rendering times for 8K videos, high-resolution photos,
| and 3D models. The Ryzen Threadripper PRO supports up to 128 PCI
| Express 4.0 lanes for high-speed throughput to compatible
| devices. It also supports up to 2TB of eight-channel ECC DDR4
| memory at 3200 MHz to help efficiently run and multitask
| demanding applications.
| ryao wrote:
| I suspect that they do not consider workstations to be personal
| computers.
| agloe_dreams wrote:
| No, the comment misunderstood the difference between CPU
| memory and unified memory. This can dedicate ~500GB of high
| bandwidth memory to the GPU -- roughly 3.5x that of an H200.
| Shank wrote:
| > unified memory
|
| So unified memory means that the memory is accessible to the
| GPU and the CPU in a shared pool. AMD does not have that.
| mythz wrote:
| AMD Ryzen AI Max SoC chips have that [1], but it maxes out at
| 128GB RAM.
|
| [1] https://www.amd.com/en/products/processors/laptop/ryzen/a
| i-3...
| curt15 wrote:
| What about AMD Instinct accelerators like the MI300A[1]?
| Doesn't that use a single memory pool for both CPU and GPU
| cores?
|
| [1] https://www.amd.com/en/products/accelerators/instinct/mi3
| 00/...
| lowercased wrote:
| I don't think that's "unified memory" though.
| JamesSwift wrote:
| > unified memory
|
| It's a very specific claim that isn't comparing itself to DIMMs
| aaronmdjones wrote:
| > It also supports up to 2TB of eight-channel ECC DDR4 memory
| at 3200 MHz (sic) to help efficiently run and multitask
| demanding applications.
|
| 8 channels at 3200 MT/s (1600 MHz) is only 204.8 GB/sec; less
| than a quarter of what the M3 Ultra can do. It's also not GPU-
| addressable, meaning it's not actually unified memory at all.
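| For reference, a minimal sketch of where those peak numbers come
| from (standard 64-bit DDR channels for the Threadripper; as an
| assumption, a 1024-bit LPDDR5-6400 bus for the M3 Ultra):
|
|     # Peak theoretical DRAM bandwidth from bus width and data rate.
|     def peak_bw_gb_s(channels, transfers_per_s, width_bits=64):
|         bytes_per_transfer = width_bits / 8
|         return channels * transfers_per_s * bytes_per_transfer / 1e9
|
|     print(peak_bw_gb_s(8, 3200e6))    # 204.8 (8-ch DDR4-3200)
|     print(peak_bw_gb_s(16, 6400e6))   # 819.2 (1024-bit LPDDR5 bus)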
| gatienboquet wrote:
| No benchmarks yet for the LLMs :(
| xyst wrote:
| I might like Apple again if the SoC could be sold separately and
| opened up. It would be interesting to see a PC with Asahi or
| Windows running on Apple's chips.
| c0deR3D wrote:
| When will Apple silicon natively support OSes such as
| Linux? Apple seems reluctant to release detailed technical
| reference manuals for its M-series SoCs, which makes running
| Linux natively on Apple silicon challenging.
| bigyabai wrote:
| Probably never. We don't have official Linux support for the
| iPhone or iPad, so I wouldn't hold out hope for Apple to change
| their tune.
| dylan604 wrote:
| That makes sense to me though. If you don't run iOS, you
| don't have App Store and that means a loss of revenue.
| bigyabai wrote:
| Right. Same goes for macOS and all of its convenient
| software services. Apple might stand to sell more units
| with a friendlier stance towards Linux, but unless it
| sells more Apple One subscriptions or increases hardware
| margins on the Mac, I doubt Cook would consider it.
|
| If you sit around expecting selflessness from Apple you
| will waste an enormous amount of time, trust me.
| AndroTux wrote:
| If you don't run macOS, you don't have Apple iCloud Drive,
| Music, Fitness, Arcade, TV+ and News and that means a loss
| of revenue.
| dylan604 wrote:
| As I replied in else where here, I do not run any Apple
| Services on my Mac hardware. I do on my iDevices though,
| but that's a different topic. Again, I could be the edge
| case
| bigyabai wrote:
| > I do not run any Apple Services on my Mac hardware
|
| Not even OCSP?
| dylan604 wrote:
| I have no idea what that is, so ???
|
| But if you're being pedantic, I meant Apple SaaS
| requiring monthly payments or any other form of using
| something from Apple where I give them money outside the
| purchase of their hardware.
|
| If you're talking background services as part of macOS,
| then you're being intentionally obtuse to the point and
| you know it
| jobs_throwaway wrote:
| You lose out on revenue from people who require OS freedom
| though
| orangecat wrote:
| All seven of them. I kid, I have a lot of sympathy for
| that position, but as a practical matter running Linux
| VMs on an M4 works great, you even get GPU acceleration.
| dylan604 wrote:
| That's what's weird to me too. It's not like they would lose
| sales of macOS as it is given away with the hardware. So if
| someone wants to buy Apple hardware to run Linux, it does not
| have a negative effect on AAPL
| bigfishrunning wrote:
| Except the linux users won't be buying Apple software, from
| the app store or elsewhere. They won't subscribe to iCloud.
| dylan604 wrote:
| I have Mac hardware and have spent $0 through the Mac
| App Store. I do not use iCloud on it either. I do on
| iDevices though. I must be an edge case though.
| c0deR3D wrote:
| Same here.
| xp84 wrote:
| All of us on HN are basically edge cases. The main target
| market of Macs is super dependent on Apple service
| subscriptions.
|
| Maybe that's why they ship with insultingly-small SSDs by
| default, so that as people's photo libraries, Desktop and
| Documents folders fill up, Apple can "fix your problem"
| for you by selling you the iCloud/Apple One plan to
| offload most of the stuff to only live in iCloud.
|
| Either they spend the $400 up front to get 2 notches up
| on the SSD upgrade, to match what a reasonable device
| would come with, or they spend that $400 at $10 a month
| over the ~40-month likely lifetime of the computer. Apple
| wins either way.
| jeroenhd wrote:
| While I don't think Apple wants to change course from its
| services-oriented profit model, surely someone within Apple
| has run the calculations for a server-oriented M3/M4
| device. They're not far behind server CPUs in terms of
| performance while running a lot cooler AND having
| accelerated amd64 support, which Ampere lacks.
|
| Whatever the profit margin on a Mac Studio is these days,
| surely improving non-consumer options becomes profitable at
| some point if you start selling them by the thousands to
| data centers.
| cosmic_cheese wrote:
| Those buying the hardware to run Linux also aren't writing
| software for macOS to help make the platform more
| attractive.
| dylan604 wrote:
| There are a large number of macOS users that are not app
| software devs. There's a large base of creative users
| that couldn't code their way out of a wet paper bag, yet
| spend lots of money on Mac hardware.
|
| This forum loses track of the world outside this echo
| chamber
| cosmic_cheese wrote:
| I'm among them, even if creative works aren't my bread
| and butter (I'm a dev with a bit of an artistic bent).
|
| That said, attracting creative users also adds value to
| the platform by creating demand for creative software for
| macOS, which keeps existing packages for macOS maintained
| and brings new ones on board every so often.
| dylan604 wrote:
| I'm a mix of both, however, my dev time does not create
| macOS or iDevice apps. My dev is still focused on
| creative/media workflows, while I still get work for
| photo/video. I don't even use Xcode any further than
| running the CLI command to install the necessary tools to
| have CLI be useful.
| re-thc wrote:
| > So if someone wants to buy Apple hardware to run Linux, it
| does not have a negative affect to AAPL
|
| It does. Support costs. How do you prove it's a hardware
| failure or software? What should they do? Say it
| "unofficially" supports Linux? People would still try to get
| support. Eventually they'd have to test it themselves etc.
| dylan604 wrote:
| Apple has already been in this spot. With the TrashCan
| MacPro, there was an issue with DaVinci Resolve under OS X
| at the time where the GPU was causing render issues. If you
| then rebooted into Windows with BootCamp using the exact
| same hardware and open up the exact same Resolve project
| with the exact same footage, the render errors disappeared.
| Apple blamed Resolve. DaVinci blamed GPU drivers. GPU
| blamed Apple.
| re-thc wrote:
| > Apple has already been in this spot.
|
| Has been. This is important. Past tense. Maybe that's
| the point - they gave up on it, acknowledging the extra
| costs / issues.
| k8sToGo wrote:
| We used to have bootcamp though.
| dylan604 wrote:
| There you go using logical arguments in an emotional
| illogical debate.
| amelius wrote:
| But then they'd have to open up their internal documentation
| of their silicon, which could possibly be a legal disaster
| (patents).
| WillAdams wrote:
| Is it not an option to run Darwin? What would Linux offer
| that Darwin would not?
| internetter wrote:
| Darwin is a terrible server operating system. Even getting a
| process to run at server boot reliably is a nightmare.
| kbolino wrote:
| I don't think Darwin has been directly distributed in
| bootable binary format for _many_ years now. And, as far as I
| know, it has never been made available in that format for
| Apple silicon.
| cpfleming wrote:
| https://asahilinux.org/
| NorwegianDude wrote:
| The memory amount is fantastic, memory bandwidth is half
| decent(~800 GB/s), and the compute capabilities are terrible(36
| TOPS).
|
| For comparison, a single consumer card like the RTX 5090 is only
| 32 GB of memory, has 1792 GB/s memory and 3593 TOPS of compute.
|
| The use cases will be limited. While you can't run a 600B model
| directly like Apple says (because you need more memory for
| that), you can run a quantized version, but it will be very
| slow unless it's a MoE architecture.
| Havoc wrote:
| >36 Tops
|
| That's going to be the NPU specifically. Pretty much nothing on
| the LLM front seems to use NPUs at this stage (Copilot Snapdragon
| laptops aside), so I'm not sure the low number is a problem.
| BonoboIO wrote:
| A factor of 100 faster in compute ... wow.
|
| It will be interesting when somebody upgrades the RAM
| of the 5090 like they did with 4090s.
| bilbo0s wrote:
| They're a bit confused and not comparing the same compute.
|
| Pretty sure they're comparing Nvidia's gpu to Apple's npu.
| NorwegianDude wrote:
| I'm not confused at all. It's the real numbers. Feel free
| to provide anything that suggests that the TOPS of the GPU
| in M chips are faster than the dedicated hardware for it.
| But you can't, cause it's not true. If you think Apple
| added the neural engine just for fun then I don't know what
| to tell you.
|
| You have a fundamental flaw in your understanding of how
| both chips work. Not using the tensor cores would be
| slower, and the same goes for apples neural engine. The
| numbers are both for the hardware both have implemented for
| maximum performance for this task.
| llm_nerd wrote:
| I do think people are going a little overboard with all the
| commentary about AI in this discussion, and you rightly cite
| some of the empirical reasons. People are trying to rationalize
| convincing themselves to buy one of these, but they're deluding
| themselves.
|
| It's nice that these devices have loads of memory, but they
| don't have remotely the necessary level of compute to be
| competitive in the AI space. As a fun thing to run a local LLM
| as a hobbyist, sure, but this presents zero threat to nvidia.
|
| Apple hardware is irrelevant in the AI space, outside of making
| YouTube "I ran a quantized LLM on my 128GB Mac Mini" type
| content for clicks, and this release doesn't change that.
|
| Looks like a great desktop chip though.
|
| It would be nice if nvidia could start giving their less
| expensive offerings more memory, though they're currently in
| the realm Intel was in 15 years ago, thinking that their
| biggest competition is themselves.
| dagmx wrote:
| You're comparing two different things.
|
| The compute level you're talking about on the M3 Ultra is the
| neural engine. Not including the GPU.
|
| I expect the GPU here will be behind a 5090 for compute but not
| by the unrelated numbers you're quoting. After all, the 5090
| alone is multiple times the wattage of this SoC.
| bigyabai wrote:
| > After all, the 5090 alone is multiple times the wattage of
| this SoC.
|
| FWIW, normalizing the wattages (or even underclocking the
| GPU) will still give you an Nvidia advantage most days.
| Apple's GPU designs are closer to AMD's designs than
| Nvidia's, which means they omit a lot of AI accelerators to
| focus on a less-LLM-relevant raster performance figure.
|
| Yes, the GPU is faster than the NPU. But Apple's GPU designs
| haven't traditionally put their competitors out of a job.
| dagmx wrote:
| M2 Ultra is ~250W (averaging various reports since Apple
| don't publish) for the entire SoC.
|
| 5090 is 575W without the CPU.
|
| You'd have to cut the Nvidia to a quarter and then find a
| comparable CPU to normalize the wattage for an actual
| comparison.
|
| I agree that Apple GPUs aren't putting the dedicated GPU
| companies in danger on the benchmarks, but they're also not
| really targeting it? They're in completely different zones
| on too many fronts to really compare.
| bigyabai wrote:
| Well, select your hardware of choice and see for yourself
| then: https://browser.geekbench.com/opencl-benchmarks
|
| > but they're also not really targeting it?
|
| That's fine, but it's not an excuse to ignore the
| power/performance ratio.
| dagmx wrote:
| But I'm not ignoring the power/performance ratio? If
| anything, you are doing that by handwaving away the
| difference.
|
| Give me a comparable system build where the NVIDIA GPU +
| any CPU of your choice is running at the same wattage as
| an M2 Ultra, and outperforms it on average. You'd get
| 150W for the GPU and 150W for the CPU.
|
| Again, you can't really compare the two. They're
| inherently different systems unless you only care about
| singular metrics.
| llm_nerd wrote:
| Using the NPU numbers grossly _overstates_ the AI performance
| of the Apple Silicon hardware, so they're actually giving
| Apple the benefit of the doubt.
|
| Most AI training and inference (including generative AI) is
| bound by large scale matrix MACs. That's why nvidia fills
| their devices with enormous numbers of tensor cores and Apple
| / Qualcomm et al are adding NPUs, filling largely the same
| gap. Only nvidia's not only are a magnitude+ more performant,
| they've massively more flexible (in types and applications),
| usable for training and inference, while Apple's is only even
| useful for a limited set of inference tasks (due to
| architecture and type limits).
|
| Apple can put the effort in and making something actually
| competitive with nvidia, but this isn't it.
| dagmx wrote:
| Care to share the TOPs numbers for the Apple GPUs and show
| how this would "grossly overstate" the numbers?
|
| Apple won't compete with NVIDIA, I'm not arguing that. But
| your opening line will only make sense if you can back up
| the numbers and the GPU performance is lower than the ANE
| TOPS.
| llm_nerd wrote:
| Tensor / neural cores are very easy to benchmark and give
| a precise number because they do a single well-defined
| thing at a large scale. So GPU numbers are less common
| and much more use-specific.
|
| However the M2 Ultra GPU is estimated, with every bit of
| compute power working together, at about 26 TOPS.
| dagmx wrote:
| Could you provide a link for that TOPS count? (And
| specifically TOPs with comparable unit sizes since NVIDIA
| and Apple did not use the same units till recently)
|
| The only similar number I can find is for TFLOPS vs TOPS
|
| Again I'm not saying the GPU will be comparable to an
| NVIDIA one, but that the comparison point isn't sensible
| in the comments I originally replied to.
| NorwegianDude wrote:
| No, I'm not. I'm comparing the TOPS of the M3 Ultra and the
| tensor cores of the RTX 5090.
|
| If not, what is the TOPS of the GPU, and why isn't apple
| talking about it if there is more performance hidden
| somewhere? Apple states 18 TOPS for the M3 Max. And why do
| you think Apple added the neural engine, if not to accelerate
| compute?
|
| The power draw is quite a bit higher, but it's still much
| more efficient as the performance is much higher.
| dagmx wrote:
| The ANE and tensor cores are not comparable though. One is
| literally meant for low cost inference while the others are
| meant for acceleration of training.
|
| If you squint, yeah they look the same, but so does the
| microcontroller on the GPU and a full blown CPU. They're
| fundamentally different purposes, architectures and scale
| of use.
|
| The ANE can't even really be used directly. Apple heavily
| restricts the use via CoreML APIs for inference. It's only
| usable for smaller, lightweight models.
|
| If you're comparing to the tensor cores, you really need to
| compare against the GPU which is what gets used by apples
| ml frameworks such as MLX for training etc.
|
| It will still be behind the NVIDIA GpU, but not by anywhere
| near the same numbers.
| llm_nerd wrote:
| >The ANE and tensor cores are not comparable though
|
| They're both built to do the most common computation in
| AI (both training and inference), which is multiply and
| accumulate of matrices - A * B + C. The ANE is far more
| limited because they decided to spend a lot less silicon
| space on it, focusing on low-power inference of quantized
| models. It is fantastically useful for a lot of on-device
| things like a lot of the photo features (e.g. subject
| detection, text extraction, etc).
|
| And yes, you need to use CoreML to access it _because_ it's
| so limited. In the future Apple will absolutely, with
| 100% certainty, make an ANE that is as flexible and
| powerful as tensor cores, and they force you through
| CoreML because it will automatically switch to using it
| (where now you submit a job to CoreML and for many it
| will opt to use the CPU/GPU instead, or a combination
| thereof. It's an elegant, forward thinking
| implementation). Their AI performance and credibility
| will greatly improve when they do.
|
| >you really need to compare against the GPU
|
| From a raw performance perspective, the ANE is capable of
| more matrix multiply/accumulates than the GPU is on Apple
| Silicon, it's just limited to types and contexts that
| make it unsuitable for training, or even for many
| inference tasks.
| NorwegianDude wrote:
| So now the TOPS are not comparable because M3 is much
| slower than an Nvidia GPU? That's not how comparisons
| work.
|
| My numbers are correct, the M3 Ultra has around 1 % of
| the TOPS performance of a RTX 5090.
|
| Comparing against the GPU would look even worse for
| apple. Do you think Apple added the neural engine just
| for fun? This is exactly what the neural engine is there
| for.
| dagmx wrote:
| You're completely missing the point. The ANE is not
| equivalent as a component to the tensor cores. It has
| nothing to do with comparison of TOPs but as what they're
| intended for.
|
| Try and use the ANE in the same way you would use the
| tensor cores. Hint: you can't, because the hardware and
| software will actively block you.
|
| They're meant for fundamentally different use cases and
| power loads. Even apples own ML frameworks do not use the
| ANE for anything except inference.
| tempodox wrote:
| I could salivate over the hardware no end, if only Apple software
| (including the OS) weren't that shoddy.
| bredren wrote:
| Apart from enabling a 120Hz update to the Pro Display XDR, does
| TB5 offer a viable pathway for eGPUs on Apple Silicon MacBooks?
|
| This is a cool computer, but not something I'd want to lug
| around.
| mohsen1 wrote:
| For AI stuff, 120Gb/s isn't really that useful...
| submeta wrote:
| I am confused. I got an M4 with 64 GB RAM. Did I buy something
| from the future? :) Now why an M3 Ultra, and not an M4 Ultra?
| seanmcdirmid wrote:
| It took them a while to develop their Ultra chip and this is
| what they had ready? I'm sure they are working on the M4 Ultra,
| but they are just slow at it.
|
| I bought a refurbished M3 Max to run LLMs (can only go up to 70b
| with 4 bit quant), and it is only slightly slower than the more
| expensive M4 max.
| opan wrote:
| Haven't the Max/Ultra type chips always come much later, close
| to when the next number of standard chips came out? M2 Max was
| not available when M2 launched, for example.
| SirMaster wrote:
| An Ultra has never come out after the next gen base model,
| let alone the next gen Pro/Max model before.
|
| M1: November 10, 2020
|
| M1 Pro: October 18, 2021
|
| M1 Max: October 18, 2021
|
| M1 Ultra: March 8, 2022
|
| -------------------------
|
| M2: June 6, 2022
|
| M2 Pro: January 17, 2023
|
| M2 Max: January 17, 2023
|
| M2 Ultra: June 5, 2023
|
| -------------------------
|
| M3: October 30, 2023
|
| M3 Pro: October 30, 2023
|
| M3 Max: October 30, 2023
|
| -------------------------
|
| M4: May 7, 2024
|
| M4 Pro: October 30, 2024
|
| M4 Max: October 30, 2024
|
| -------------------------
|
| M3 Ultra: March 5, 2025
| kridsdale1 wrote:
| So about a year and a half delay for Ultra, but the M2 was
| an anomaly.
| ellisv wrote:
| I'd also point out that there was a rather awkward
| situation with M1/M2 chips where lower end devices were
| getting newer chips before the higher end devices. For
| example, the 14 and 16-inch MacBooks Pro didn't get a M2
| series chip until about 6 months after the 13 and 15-inch
| MacBooks Air. This left some professionals and power users
| frustrated.
|
| The M3 Ultra might perform as well as the M4 Max - I
| haven't seen benchmarks yet - but the newer series is in
| the higher end devices which is what most people expect.
| ferguess_k wrote:
| Ah, if we can have the hardware and the freedom of installing a
| good Linux repo on top of it. How is Asahi? Is it good enough? I
| assume, that since Asahi is focused on Apple hardware, it should
| have an easier time figuring out drivers and etc?
| bigyabai wrote:
| > How is Asahi?
|
| For M3 and M4 machines, hardware support is pretty derelict:
| https://asahilinux.org/docs/M3-Series-Feature-Support/
| ferguess_k wrote:
| Thanks, looks like even M1 support has some gaps:
|
| https://asahilinux.org/docs/M1-Series-Feature-
| Support/#table...
|
| I assume anything that doesn't have "linux-asahi" is not
| supported -- or any WIP is not supported.
|
| Wish I had the skills to help them. Targeting just one set of
| architecture, I think Asahi has more chances of success.
| bigyabai wrote:
| It's just not an easy task. I can't help but compare it to
| the Nouveau project spending years of effort to reverse-
| engineer just a few GPU designs. Then Nvidia changed their
| software and hardware architecture, and things went from
| "relatively hopeful" to "there is no chance" overnight.
| ferguess_k wrote:
| I agree, it's a lot of work, plus Apple is definitely not
| going to help with the project. Maybe an alternative
| is something like Framework -- find some good enough
| hardware and support it.
| _alex_ wrote:
| apple keeps talking about the Neural Engine. Does anything
| actually use it? Seems like all the current LLM and Stable
| Diffusion packages (including MLX) use the GPU.
| gield wrote:
| Face ID, taking pictures, Siri, ARKit, voice-to-text
| transcription, face recognition and OCR in photos, noise
| filtering, ...
| cubefox wrote:
| These have been possible in much smaller smartphone chips for
| years.
| stouset wrote:
| Possible != energy efficient, which is important for mobile
| devices.
| cubefox wrote:
| If the energy efficiency of things like Face ID was
| indeed so far so bad that you need a more efficient M3
| Ultra, how come Face ID was integrated into smartphones
| years ago, apparently without significant negative impact
| on battery life?
| anentropic wrote:
| Yeah I agree.
|
| The Neural Engine is useful for a bunch of Apple features, but
| seems weirdly useless for any LLM stuff... been wondering if
| they'd address it on any of these upcoming products. AI is so
| hype right now it seems odd that they have a specialised
| processor that doesn't get used for the kind of AI people are
| doing. I can see in the latest release:
|
| > Mac Studio is a powerhouse for AI, capable of running large
| language models (LLMs) with over 600 billion parameters
| entirely in memory, thanks to its advanced GPU
|
| https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...
|
| i.e. LLMs still run on the GPU not the NPU
| aurareturn wrote:
| On the iPhone, it runs on the NPU.
| 827a wrote:
| Very curiously: They upgraded the Mac Studio but not the Mac Pro
| today.
| FloatArtifact wrote:
| So, the question is whether the M1/M2 Ultra was limited by
| GPU/NPU compute or by memory bandwidth at this point.
|
| I'm curious what instruction sets may have been included with the
| M3 chip that the other two lack for AI.
|
| So far the candidates seem to be NVIDIA DIGITS, the Framework
| Desktop, and the M1 64GB / M2/M3 128GB Studio/Ultra machines.
|
| The GPU market isn't competitive enough for the amount of VRAM
| needed. I was hoping for a Battlemage GPU model with 24GB that
| would be reasonably priced and available.
|
| For the Framework Desktop and similar devices, I think a second
| generation will be significantly better than what's currently on
| offer today. Rationale below...
|
| For a max spec processor with ram at $2,000, this seems like a
| decent deal given today's market. However, this might age very
| fast for three reasons.
|
| Reason 1: LPDDR6 may debut in the next year or two; this could
| bring massive improvements to memory bandwidth and capacity for
| soldered-on memory.
|
| LPDDR6 vs LPDDR5:
| - Data bus width: 24 bits vs 16 bits
| - Burst length: 24 vs 16
| - Memory bandwidth: up to 38.4 GB/s vs up to 6.7 GB/s
|
| - CAMM RAM may or may not maintain signal integrity as memory
| bandwidth increases. Until I see it implemented for an AI
| use case in a cost-effective manner, I am skeptical.
|
| Reason 2: It's a laptop chip with limited PCIe lanes and a
| reduced power envelope. Theoretically, a desktop chip could have
| better performance, more lanes, and be socketable (although I
| don't think I've seen a socketed CPU with soldered RAM).
|
| Reason 3: In addition, what does this hardware look like when
| repurposed in the future compared to alternatives?
|
| - Unlike desktop or server counterparts, which can have a higher
| CPU core count and PCIe/IO expansion, this processor with its
| motherboard is limited in how it can be repurposed later down
| the line as a server to self-host other software besides AI. I
| suppose it could be turned into an overkill NAS with ZFS and a
| single HBA controller card in a new case.
|
| - Buying into the Framework Desktop is pretty limited based on
| the form factor. The next generation might be able to include a
| fully populated 16x slot and a 10G NIC. That seems about it if
| they're going to maintain the backward-compatibility philosophy
| given the case form factor.
| gpapilion wrote:
| I think this will eventually morph into Apple's server fleet.
| This, in conjunction with the AI server factory they are opening,
| makes a lot of sense.
| api wrote:
| Half a terabyte could run 8 bit quantized versions of some of
| those full size llama and deepseek models. Looking forward to
| seeing some benchmarks on that.
| zamadatix wrote:
| Deepseek would need Q5ish level quantization to fit.
| ntqvm wrote:
| Disappointing announcement. M4 brings a significant uplift over
| M3, and the ST performance of the M3 Ultra will be significantly
| worse than the M4 Max.
|
| Even for its intended AI audience, the ISA additions in M4
| brought significant uplift.
|
| Are they waiting to put M4 Ultra into the Mac Pro?
| tuananh wrote:
| But is it actually usable for anything if it's too slow?
|
| Does anyone have a ballpark number for how many tokens per second
| we can get with this?
| cxie wrote:
| 512GB of unified memory is truly breaking new ground. I was
| wondering when Apple would overcome memory constraints, and now
| we're seeing a half-terabyte level of unified memory. This is
| incredibly practical for running large AI models locally ("600
| billion parameters"), and Apple's approach of integrating this
| much efficient memory on a single chip is fascinating compared to
| NVIDIA's solutions. I'm curious about how this design of "fusing"
| two M3 Max chips performs in terms of heat dissipation and power
| consumption though
| bigyabai wrote:
| For enterprise markets, this is table stakes. A lot of
| datacenter customers will probably ignore this release
| altogether since there isn't a high-bandwidth option for
| systems interconnect.
| pavlov wrote:
| The Mac Studio isn't meant for data centers anyway? It's a
| small and silent desktop form factor -- in every respect the
| opposite of a design you'd want to put in a rack.
|
| A long time ago Apple had a rackmount server called Xserve,
| but there's no sign that they're interested in updating that
| for the AI age.
| bigyabai wrote:
| It's the Ultra chip, the same one that goes into the
| rackmount Mac Pro. I don't think there's much confusion as
| to who this is for.
|
| > there's no sign that they're interested in updating that
| for the AI age.
|
| https://security.apple.com/blog/private-cloud-compute/
| pavlov wrote:
| I genuinely forgot the Mac Pro still exists. It's been so
| long since I even saw one.
|
| And I've had every previous Mac tower design since 1999:
| G4, G5, the excellent dual Xeon, the horrible black trash
| can... But Apple Silicon delivers so much punch in the
| Studio form factor, the old school Pro has become very
| niche.
|
| Edit - looks like the new M3 Ultra is only available in
| Mac Studio anyway? So the existence of the Pro is moot
| here.
| choilive wrote:
| never understood the hate on the trash can. Isn't the mac
| studio basically the same idea as the trash can but even
| less upgradeable?
| pavlov wrote:
| The Mac Studio hit a sweet spot in 2023 that the trash
| can Mac Pro couldn't ten years earlier. It's mostly
| thanks to the high integration of Apple Silicon and
| improved device availability and speed of Thunderbolt.
|
| The 2013 Mac Pro was stuck forever with its original
| choice of Intel CPU and AMD GPU. And it was unfortunately
| prone to overheating due to these same components.
| wtallis wrote:
| The trash can also suffered from hitting the market right
| around when the industry gave up on making dual-GPU work.
| Alupis wrote:
| Outside of extremely niche use cases, who is racking
| apple products in 2025?
| nordsieck wrote:
| There's MacMiniVault (nee MacMiniColo)
| https://www.macminivault.com/
|
| Not sure if they count as niche or not.
| waveringana wrote:
| GitHub for their macOS runners (pretty sure they're M1
| Minis)
| wpm wrote:
| AWS
| kube-system wrote:
| Every provider who offers MacOS in the cloud.
| Alupis wrote:
| So MacOS is still not allowed to be virtualized per the
| EULA? Wow if that's true...
| kube-system wrote:
| MacOS is permitted to be virtualized... as long as the
| host is a Mac. :)
| wtallis wrote:
| The rackmount Mac Pro is for A/V studios, not
| datacenters.
| phillco wrote:
| Don't forget CI/CD farms for iOS builds, although I think
| it's much more cost effective to just make Minis or
| Studios work, despite their nonstandard formfactor
| kridsdale1 wrote:
| Google and Facebook have vast fleets of Minis in custom
| chassis for this purpose.
| alwillis wrote:
| Apple recently announced they're building a new plant in
| Texas to produce servers. Yes, they need servers for their
| Private Cloud Compute used by Apple Intelligence, but it
| doesn't _only_ need to be for that.
|
| From https://www.apple.com/newsroom/2025/02/apple-will-
| spend-more...
|
| _As part of its new U.S. investments, Apple will work with
| manufacturing partners to begin production of servers in
| Houston later this year. A 250,000-square-foot server
| manufacturing facility, slated to open in 2026, will create
| thousands of jobs._
| PaulHoule wrote:
| That article says you can connect them through the
| Thunderbolt 5 somehow to form clusters.
| burnerthrow008 wrote:
| I wonder if that's something new, or just the same virtual
| network interface that's been around since the TB1 days (a
| new network interface appears when you connect two Macs
| with a TB cable)
| PaulHoule wrote:
| Well already it is faster than GigE...
|
| https://arstechnica.com/gadgets/2013/10/os-x-10-9-brings-
| fas...
|
| Thunderbolt is PCIe-based and I could imagine it being
| extended to do what
| https://en.wikipedia.org/wiki/Compute_Express_Link and
| https://en.wikipedia.org/wiki/InfiniBand do.
| spiderfarmer wrote:
| You can use Thunderbolt 5 interconnect (80Gbps) to run LLMs
| distributed across 4 or 5 Mac Studios.
| whimsicalism wrote:
| Why you would ever want to do that remains an open question.
| aurareturn wrote:
| Probably some kind of local LLM server. 1TB of 1.6 TB/s
| memory if you link 2 together. $20k total. Half the price
| of a single Blackwell chip.
| whimsicalism wrote:
| with a vanishingly small fraction of flops and a small
| fraction of memory bandwidth
| aurareturn wrote:
| It's good enough to run whatever local model you want. 2x
| 80core GPU is no joke. Linking them together gives it
| effectively 1.6 TB/s of bandwidth. 1TB of total memory.
|
| You can run the full Deepseek 671b q8 model at 40
| tokens/s. Q4 model at 80 tokens/s. 37B active params at a
| time because R1 is MoE.
|
| Linking 2 of these together lets you run a model more
| capable (R1) than GPT4o at a comfortable speed at home.
| That was simply fantasy a year ago.
| atwrk wrote:
| But 80Gbit/s is way slower than even regular dual channel
| RAM, or am I missing something here? That would mean the
| LLM would be excruciatingly slow. You could get an old EPYC
| for a fraction of that price _and_ have more performance.
| wmf wrote:
| The weights don't go over the network so performance is
| OK.
| atwrk wrote:
| If I'm not mistaken, each token produced roughly equals
| the whole model in memory transfers (the exception being
| MoE models). That's why memory bandwidth is so important
| in the first place, or not?
| wmf wrote:
| My understanding is that if you can store 1/Nth of the
| weights in RAM on each of the N nodes then there's no
| need to send the weights over the network.
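| A rough sketch of why the link matters less than it looks for
| that kind of layer-split (pipeline) setup. Assumptions: a
| DeepSeek-class hidden size of 7168, fp16 activations, one token
| decoded at a time; only the hidden-state vector crosses the
| Thunderbolt link, not the weights:
|
|     # Illustrative only: per-token traffic across a layer-split
|     # boundary versus Thunderbolt 5 bandwidth.
|     hidden_size = 7168           # assumed model hidden dimension
|     bytes_per_value = 2          # fp16 activations
|     link_bytes_per_s = 80e9 / 8  # 80 Gbit/s Thunderbolt 5
|
|     bytes_per_token = hidden_size * bytes_per_value   # ~14 KB
|     print(link_bytes_per_s / bytes_per_token)  # ~700k tok/s ceiling
|
| So for single-user decode, the per-token traffic is tiny compared
| to the link; per-hop latency and prompt processing are the more
| likely pain points.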
| phonon wrote:
| Thunderbolt 5 can do bi-directional 80 Gbps....and Mac Studio
| Ultra has 6 ports...
| cibyr wrote:
| That's still not even competitive with 100G Ethernet on a
| per-port basis. An overall bandwidth of 480 Gbps pales in
| comparison with, for example, the 3200 Gbps you get with a
| P5 instance on EC2.
| nyrikki wrote:
| To add to this GPU servers like supermicro have a 400GBe
| port per GPU plus more for the CPU.
| kridsdale1 wrote:
| Cost competitive though?
| phonon wrote:
| A 3 year reservation of a P5 is over a million dollars
| though? Not sure how that's comparable....
| FloatArtifact wrote:
| They didn't increase the memory bandwidth. You can get the same
| memory bandwidth, which is available on the M2 Studio. Yes,
| yes, of course you can get 512 gigabytes of uRAM for 10 grand.
|
| The question is whether an LLM will run with usable performance
| at that scale. The point is that there are diminishing returns:
| despite having enough uRAM, the memory bandwidth is the same,
| even with the increased processing speed of the new chip for AI.
|
| So there must be a min-max performance ratio between memory
| bandwidth and the size of the memory pool in relation to the
| processing power.
| cxie wrote:
| Guess what? I'm on a mission to completely max out all 512GB
| of mem...maybe by running DeepSeek on it. Pure greed!
| swivelmaster wrote:
| You could always just open a few Chrome tabs...
| valine wrote:
| Probably helps that models like deepseek are mixture of
| expert. Having all weights in VRAM means you don't have to
| unload/reload. Memory bandwidth usage should be limited to the
| 37B active parameters.
| FloatArtifact wrote:
| > Probably helps that models like deepseek are mixture of
| expert. Having all weights in VRAM means you don't have to
| unlod/reload. Memory bandwidth usage should be limited to
| the 37B active parameters.
|
| "Memory bandwidth usage should be limited to the 37B active
| parameters."
|
| Can someone do a deep dive on the above quote? I understand
| that having the entire model loaded into RAM helps with
| response times. However, I don't quite understand the
| relationship between memory bandwidth and active parameters.
|
| Context window?
|
| How much of the model can actively be processed, despite being
| fully loaded into memory, given the memory bandwidth?
| valine wrote:
| With a mixture of experts model you only need to read a
| subset of the weights from memory to compute the output
| of each layer. The hidden dimensions are usually smaller
| as well so that reduces the size of the tensors you write
| to memory.
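| A toy sketch of that idea (not DeepSeek's actual architecture or
| code; the expert count, top-k and dimensions are made up for
| illustration), showing that only the routed experts' weights are
| touched for a given token:
|
|     import numpy as np
|
|     n_experts, top_k, d = 16, 2, 8       # toy sizes, not DeepSeek's
|     experts = np.random.randn(n_experts, d, d)
|     router = np.random.randn(d, n_experts)
|
|     def moe_layer(x):
|         scores = x @ router
|         chosen = np.argsort(scores)[-top_k:]  # per-token expert pick
|         # Only these top_k experts' weight matrices are read from
|         # memory for this token; the rest stay untouched.
|         return sum(x @ experts[i] for i in chosen) / top_k
|
|     print(moe_layer(np.random.randn(d)).shape)   # (8,)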
| bick_nyers wrote:
| Just to add onto this point, you expect different experts
| to be activated for every token, so not having all of the
| weights in fast memory can still be quite slow as you
| need to load/unload memory every token.
| valine wrote:
| Probably better to be moving things from fast memory to
| faster memory than from slow disk to fast memory.
| ein0p wrote:
| What people who did not actually work with this stuff in
| practice don't realize is the above statement only holds
| for batch size 1, sequence size 1. For processing the
| prompt you will need to read all the weights (which isn't
| a problem, because prefill is compute-bound, which, in
| turn is a problem on a weak machine like this Mac or an
| "EPYC build" someone else mentioned). Even for inference,
| batch size greater than 1 (more than one inference at a
| time) or sequence size of greater than 1 (speculative
| decoding), could require you to read the entire model,
| repeatedly. MoE is beneficial, but there's a lot of
| nuance here, which people usually miss.
| doctorpangloss wrote:
| Sure, nuance.
|
| This is why Apple makes so much fucking money: people
| will craft the wildest narratives about how they're going
| to use this thing. It's part of the aesthetics of
| spending $10,000. For every person who wants a solution
| to the problem of running a 400b+ parameter neural
| network, there are 19 who actually want an exciting
| experience of buying something, which is what Apple
| really makes. It has more in common with a Birkin bag
| than a server.
| rfoo wrote:
| For decode, MoE is nice for either bs=1 (decoding for a
| single user), or bs=<very large> (do EP to efficiently
| serve a large amount of users).
|
| Anything in between suffers.
| valine wrote:
| No one should be buying this for batch inference
| obviously.
|
| I remember right after OpenAI announced GPT3 I had a
| conversation with someone where we tried to predict how
| long it would be before GPT3 could run on a home desktop.
| This Mac Studio has enough VRAM to run the full 175B
| parameter GPT-3 at 16-bit precision, and I think that's
| pretty cool.
| Der_Einzige wrote:
| No one who is using this for home use cares about
| anything except batch size 1 sequence size 1.
| ein0p wrote:
| What if you're doing bulk inference? The efficiency and
| throughput of bs=1 s=1 is truly abysmal.
| diggan wrote:
| > The the question is if a llm will run with usable
| performance at that scale?
|
| This is the big question to have answered. Many people claim
| Apple can now reliably be used as a ML workstation, but from
| the numbers I've seen from benchmarks, the models may fit in
| memory, but the tok/sec performance is so slow that it doesn't
| feel worth it, compared to running it on NVIDIA hardware.
|
| Although it would be expensive as hell to get 512GB of VRAM with
| NVIDIA today, maybe moves like this from Apple could push
| down the prices at least a little bit.
| johnmaguire wrote:
| It is much slower than nVidia, but for a lot of personal-
| use LLM scenarios, it's very workable. And it doesn't need
| to be anywhere near as fast considering it's really the
| only viable (affordable) option for private, local
| inference, besides building a server like this, which is no
| faster: https://news.ycombinator.com/item?id=42897205
| bastardoperator wrote:
| It's fast enough for me to cancel monthly AI services on
| a mac mini m4 max.
| diggan wrote:
| Could you maybe share a lightweight benchmark where you
| share the exact model (+ quantization if you're using
| that) + runtime + used settings and how much
| tokens/second you're getting? Or just like a log of the
| entire run with the stats, if you're using something like
| llama.cpp, LMDesktop or ollama?
|
| Also, would be neat if you could say what AI services you
| were subscribed to, there is a huge difference between
| paid Claude subscription and the OpenAI Pro subscription
| for example, both in terms of cost and the quality of
| responses.
| fetus8 wrote:
| How much RAM are you running on?
| staticman2 wrote:
| Smaller, dumber models are faster than bigger, slower
| ones.
|
| What model do you find fast enough and smart enough?
| Matl wrote:
| Not OP, but I am finding the DeepSeek R1 distill of Qwen 2.5
| 32B to be a good speed/smartness ratio on the M4 Pro Mac
| Mini.
| lostmsu wrote:
| Hm, the AI services over 5 years cost half of a minimal M4 Max
| configuration, which can barely run a severely lobotomized
| LLaMA 70B. And they provide significantly better models.
| nomel wrote:
| It's probably much worse than that, with the falling
| prices of compute.
| Matl wrote:
| Sure, with something like Kagi you even get many models
| to choose from for a relatively low price, but not
| everybody likes to send over their codebase and documents
| to OpenAI.
| hangonhn wrote:
| Do we know if it is slower because the hardware is not as
| well suited for the task, or is it mostly a software issue
| -- the code hasn't been optimized to run on Apple Silicon?
| titzer wrote:
| AFAICT the neural engine has accelerators for CNNs and
| integer math, but not the exact tensor operations in
| popular LLM transformer architectures that are well-
| supported in GPUs.
| kridsdale1 wrote:
| I have to assume they're doing something like that in the
| lab for 4 years from now.
| woadwarrior01 wrote:
| The neural engine is perfectly capable of accelerating
| matmults. It's just that autoregressive decoding in
| single batch LLM inference is memory bandwidth
| constrained, so there are no performance benefits to
| using the ANE for LLM inference (although, there's a huge
| power efficiency benefit). And the only way to use the
| neural engine is via CoreML. Using the GPU with MLX or
| MPS is often easier.
| azinman2 wrote:
| Memory bandwidth is the issue
| TheRealPomax wrote:
| Yeah they did? The M4 Max has a max memory bandwidth of 546GB/s;
| the M3 Ultra bumps that up to a max of 819GB/s.
|
| (and the 512GB version is $4,000 more rather than $10,000 -
| that's still worth mocking, but it's nowhere _near_ as much)
| okanesen wrote:
| Not that dramatic of an increase actually - the M2 Max
| already had 400GB/s and M2 Ultra 800GB/s memory bandwidth,
| so the M3 Ultra's 819GB/s is just a modest bump. Though the
| M4's additional 146GB/s is indeed a more noticeable
| improvement.
| choilive wrote:
| Also should note that 800/819GB/s of memory bandwidth is
| actually VERY usable for LLMs. Consider that a 4090 is
| just a hair above 1000GB/s
| hereonout2 wrote:
| Does it work like that though at this larger scale? 512GB
| of VRAM would be across multiple NVIDIA cards, so the
| bandwidth and access is parallelized.
|
| But here it looks more of a bottleneck from my
| (admittedly naive) understanding.
| choilive wrote:
| For inference the bandwidth is generally not parallelized
| because the weights need to go through the model layer by
| layer. The most common model splitting method is done by
| assigning each GPU a subset of the LLM layers and it
| doesn't take much bandwidth to send the activations via
| PCIe to the next GPU.
| manmal wrote:
| My understanding is that the GPU must still load its
| assigned layer from VRAM into registers and L2 cache for
| every token, because those aren't large enough to hold a
| significant portion. So naively, for a 24GB layer, you'd
| need to move up to 24GB for every token.
| lhl wrote:
| Since no one specifically answered your question yet, yes,
| you should be able to get usable performance. A Q4_K_M GGUF
| of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has
| 37B activations per pass. You'd probably expect in the
| ballpark of 20-30 tok/s (depends on how much MBW can actually
| be utilized) for text generation.
|
| From my napkin math, the M3 Ultra TFLOPs is still relatively
| low (around 43 FP16 TFLOPs?), but it should be more than
| enough to handle bs=1 token generation (should be way <10
| FLOPs/byte for inference). Now as far as its prefill/prompt
| processing speed... well, that's another matter.
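| A minimal roofline-style sanity check of that claim, using the
| estimated numbers above (both figures are rough; this is an
| illustration, not a measurement):
|
|     flops_fp16 = 43e12    # estimated M3 Ultra GPU FP16 FLOP/s
|     bandwidth = 819e9     # bytes/s of memory bandwidth
|     machine_balance = flops_fp16 / bandwidth  # ~52 FLOPs per byte
|
|     # Single-token decode: ~2 FLOPs per active parameter, and at
|     # ~4-bit quantization roughly 0.6 bytes read per parameter.
|     decode_intensity = 2 / 0.6                # ~3.3 FLOPs per byte
|
|     # decode_intensity << machine_balance, so token generation is
|     # bandwidth-bound; prefill batches many tokens per weight
|     # read, raising intensity until compute becomes the limit.
|     print(machine_balance, decode_intensity)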
| drited wrote:
| I would be curious about the context window size that could be
| expected when generating a ballpark 20 to 30 tokens per
| second using DeepSeek-R1 Q4 on this hardware.
| bob1029 wrote:
| > The question is if a llm will run with usable performance
| at that scale?
|
| For the self-attention mechanism, memory bandwidth
| requirements scale ~quadratically with the sequence length.
| kridsdale1 wrote:
| Someone has got to be working on a better method than that.
| Hundreds of billions are at stake.
| deepGem wrote:
| Any idea what the sRAM to uRAM ratio is on these new GPUs ?
| If they have meaningfully higher sRAM than the Hopper GPUs,
| it could lead to meaningful speedups in large model training.
|
| If they didn't increase the memory bandwidth, then 512GB will
| enable longer context lengths and that's about it right? No
| speedups
|
| For any speedups You may need some new variant of
| FlashAttention3 or something along similar lines to be
| purpose built for Apple GPUs.
| dheera wrote:
| It will cost 4X what it costs to get 512GB on an x86 server
| motherboard.
| smith7018 wrote:
| You can build an x86 machine that can fully run DeepSeek R1
| with 512GB of RAM for ~$2,500?
| ta988 wrote:
| You will have to explain to me how.
| johnmaguire wrote:
| https://news.ycombinator.com/item?id=42897205
| bmelton wrote:
| https://digitalspaceport.com/how-to-run-
| deepseek-r1-671b-ful...
| muricula wrote:
| Is that a CPU based inference build? Shouldn't you be
| able to get more performance out of the M3's GPU?
| hbbio wrote:
| How would you compare the tok/sec between this setup and
| the M3 Max?
| aurareturn wrote:
| 3.5 - 4.5 tokens/s on the $2,000 AMD Epyc setup. Deepseek
| 671b q4.
|
| The AMD Epyc build is severely bandwidth and compute
| constrained.
|
| ~40 tokens/s on M3 Ultra 512GB by my calculation.
| sgt wrote:
| What kind of Nvidia-based rig would one need to achieve
| 40 tokens/sec on Deepseek 671b? And how much would it
| cost?
| aurareturn wrote:
| Around 5x Nvidia A100 80GB can fit 671b Q4. $50k just for
| the GPUs and likely much more when including cooling,
| power, motherboard, CPU, system RAM, etc.
| sgt wrote:
| So the M3 Ultra is amazing value then. And from what I
| could tell, an equivalent AMD Epyc would still be so
| constrained that we're talking 4-5 tokens/s. Is this a
| fair assumption?
| Aeolun wrote:
| The Epyc would only set you back $2000 though, so it's
| only a slightly worse price/return.
| SkiFire13 wrote:
| How many tokens/s would that be though?
| hbbio wrote:
| Thanks!
|
| If the M3 can run 24/7 without overheating it's a great
| deal to run agents. Especially considering that it should
| run only using 350W... so roughly $50/mo in electricity
| costs.
| valine wrote:
| What would it cost to get 512GB of VRAM on an Nvidia card?
| That's the real comparison.
| dheera wrote:
| Apples to oranges. NVIDIA cards have an order of magnitude
| more horsepower for compute than this thing. A B100 has 8
| TB/s of memory bandwidth, 10 times more than this. If
| NVIDIA made a card with 512GB of HBM I'd expect it to cost
| $150K.
|
| The compute and memory bandwidth of the M3 Ultra is more
| in-line with what you'd get from a Xeon or
| Epyc/Threadripper CPU on a server motherboard; it's just
| that the x86 "way" of doing things is usually to attach a
| GPU for way more horsepower rather than squeezing it out of
| the CPU.
|
| This will be good for local LLM inference, but not so much
| for training.
| LeifCarrotson wrote:
| Yep, it's apples to oranges. But sometimes you want
| apples, and sometimes you want oranges, so it's all good!
|
| There's a wide spectrum of potential requirements between
| memory capacity, memory bandwidth, compute speed, compute
| complexity, and compute parallelism. In the past, a few
| GB was adequate for the tasks we assigned to the GPU -
| you had enough storage bandwidth to load the relevant
| scene into memory and generate framebuffers - but now
| we're running different workloads. Conversely, a big
| database server might want its entire contents to be
| resident in many sticks of ECC DIMMs for the CPU, but
| only needs a couple dozen x86-64 threads. And if your
| workload has many terabytes or petabytes of content to
| work with, there are network file systems with entirely
| different bandwidth targets for entire racks of
| individual machines to access that data at far slower
| rates.
|
| There's a lot of latency between the needs of programmers
| and the development and shipping of hardware to satisfy
| those needs, I'm just happy we have a new option on that
| spectrum somewhere in the middle of traditional CPUs and
| traditional GPUs.
|
| As you say, if Nvidia made a 512 GB card it would cost
| $150k, but this costs an order of magnitude less than
| that. Even high-end consumer cards like a 5090 have 16x
| less memory than this does (average enthusiasts on
| desktops have maybe 8 GB) and just over double the
| bandwidth (1.7 TB/s).
|
| Also, a nitpick FTA:
|
| > _Starting at 96GB, it can be configured up to 512GB, or
| over half a terabyte._
|
| 512 GB is exactly half of a terabyte, which is 1024 GB.
| It's too late for hard drives - the marketing departments
| have redefined storage to use multipliers of 1000 and
| invented "tebibytes" - but in memory we still work with
| powers of two. Please.
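|
| A quick check of both conventions, assuming the 512GB figure is
| binary gigabytes (which is how the DRAM is actually organized):
|
|     GiB, TiB, TB = 2**30, 2**40, 10**12
|     ram = 512 * GiB
|     print(ram / TiB)   # 0.5   -> exactly half a tebibyte
|     print(ram / TB)    # 0.55  -> "over half a terabyte" (decimal)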
| dheera wrote:
| Sure, if you want to do training get an NVIDIA card. My
| point is that it's not worth comparing either Mac or CPU
| x86 setup to anything with NVIDIA in it.
|
| For inference setups, my point is that instead of paying
| $10000-$15000 for this Mac you could build an x86 system
| for <$5K (Epyc processor, 512GB-768GB RAM in 8-12
| channels, server mobo) that does the same thing.
|
| The "+$4000" for 512GB on the Apple configurator would be
| "+$1000" outside the Apple world.
| KingOfCoders wrote:
| But this is how it wonderfully works. +$4000 does two
| things: 1. make Apple very, very rich; 2. make people think
| this is better than a $10k EPYC. Win-win for Apple. Once
| you have convinced people that you are the best, a higher
| price just means they think you are even better.
| egorfine wrote:
| > we still work with powers of two. Please.
|
| We do. Common people don't. It's easier to write "over
| half a terabyte" than explain (again) to millions of
| people what the power of two is.
| johnklos wrote:
| Anyone who calls 512 gigs "over half a terabyte" is
| bullshitting. No, thank you.
| egorfine wrote:
| Wasn't me.
| egorfine wrote:
| ...aaand I'm being downvoted for pointing out apple's
| language and possible reason for its obvious factual
| incorrectness...
| pklausler wrote:
| This prompts an "old guy anecdote"; forgive me.
|
| When I was much younger, I got to work on compilers at
| Cray Computer Corp., which was trying to bring the Cray-3
| to market. (This was basically a 16-CPU Cray-2
| implemented with GaAs parts; it never worked reliably.)
|
| Back then, HPC performance was measured in mere
| megaflops. And although the Cray-2 had peak performance
| of nearly 500MF/s/CPU, it was really hard to attain,
| since its memory bandwidth was just 250M words/s/CPU
| (2GB/s/CPU); so you had to have lots of operand re-use to
| not be memory-bound. The Cray-3 would have had more
| bandwidth, but it was split between loads and stores, so
| it was still quite a ways away from the competing Cray
| X-MP/Y-MP/C-90 architecture, which could load two words
| per clock, store one, and complete an add and a multiply.
|
| So I asked why the Cray-3 didn't have more read bandwidth
| to/from memory, and got a lesson from the answer that has
| stuck. You could actually _see_ how much physical
| hardware in that machine was devoted to the CPU/memory
| interconnect, since the case was transparent -- there was
| a thick nest of tiny blue & white twisted wire pairs
| between the modules, and the stacks of chips on each CPU
| devoted to the memory system were a large proportion of
| the total. So the memory and the interconnect constituted
| a surprising (to me) majority of the machine. Having more
| floating-point performance in the CPUs than the memory
| could sustain meant that the memory system was
| oversubscribed, and that meant that more of the machine
| was kept fully utilized. (Or would have been, had it
| worked...)
|
| In short, don't measure HPC systems with just flops.
| Measure the effective bandwidth over large data, and make
| sure that the flops are high enough to keep it utilized.
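|
| The same lesson as a small roofline-style check; the 43 TFLOPs
| and 800GB/s figures are the rough ones from upthread, and the
| FLOPs/byte values are only illustrative:
|
|     # Memory-bound whenever arithmetic intensity (FLOPs per byte
|     # moved) is below the ridge point: peak FLOPs / peak bandwidth.
|     def bound_by(flops_per_byte, peak_tflops, peak_bw_gb_s):
|         ridge = peak_tflops * 1e12 / (peak_bw_gb_s * 1e9)
|         return "compute" if flops_per_byte > ridge else "memory"
|
|     # bs=1 LLM decode (~2 FLOPs/byte) vs big-batch prefill (~300):
|     print(bound_by(2, 43, 800))     # -> memory
|     print(bound_by(300, 43, 800))   # -> compute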
| zitterbewegung wrote:
| Since the GH200 has over a terabyte of VRAM at $343,000, and
| the H100 has 80GB, a bit over 512GB of H100 VRAM works out to
| about $195,993. You could beat the price of the Apple M3
| Ultra with an AMD EPYC build.
| bick_nyers wrote:
| About $12k when Project Digits comes out.
| valine wrote:
| That will only have 128GB of unified memory
| dragonwriter wrote:
| 128GB for 3K; per the announcement their ConnectX
| networking allows two Project Digits devices to be
| plugged into each other and work together as one device,
| giving you 256GB for $6k, and, AFAIK, existing frameworks
| can split models across devices, as well, hence,
| presumably, the upthread suggestion that Project Digits
| would provide 512GB for $12k, though arguably the last
| step is cheating.
| justincormack wrote:
| The reason Nvidia only talks about two machines over the
| network is, I think, that they only have one network port, so
| you'd need to add the cost of a switch.
| bick_nyers wrote:
| If you want to split tensorwise yes. Layerwise splits
| could go over Ethernet.
|
| I would be interested to see how feasible hybrid
| approaches would be, e.g. connect each pair up directly
| via ConnectX and then connect the sets together via
| Ethernet.
| matt-p wrote:
| Not really like for like.
|
| The pricing isn't as insane as you'd think: 96GB to 256GB is
| $1,500, which isn't 'cheap', but it could be worse.
|
| All in, $5,500 gets you an Ultra with 256GB memory, 28 CPU
| cores, 60 GPU cores, and 10Gb networking - I think you'd be
| hard pushed to build a server for less.
| kllrnohj wrote:
| 5,500 easily gets me either vastly more CPU cores if I care
| more about that or a vastly faster GPU if I care more about
| that. Or for both a 9950x + 5090 (assuming you can actually
| find one in stock) is ~$3000 for the pair + motherboard,
| leaving a solid $2500 for whatever amount of RAM, storage,
| and networking you desire.
|
| The M3 strikes a very particular middle ground for AI of
| lots of RAM but a significantly slower GPU which nothing
| else matches, but that also isn't inherently the _right_
| balance either. And for any other workloads, it's quite
| expensive.
| seanmcdirmid wrote:
| You'll need a couple of 32GB 5090s to run a quantized 70B
| model, and maybe 4 to run a 70B model without quantization;
| forget about anything larger than that. A huge model
| might run slowly on an M3 Ultra, but at least you can run it
| at all.
|
| I have an M3 Max (the non-binned one), and I feel like
| 64GB or 96GB is within the realm of enabling LLMs that
| run reasonably fast on it (it is also a laptop, so I can
| do things on planes or trips). I thought about the Ultra:
| with 128GB on a top-line M3 Ultra, the models
| that you could fit into memory would run fairly fast. With
| 512GB, you could run the bigger models, but not very
| quickly, so maybe there's not much point (at least for my
| use cases).
| matt-p wrote:
| That config would also use about 10x the power, and you
| still wouldn't be able to run a model over 32GB whereas
| the Studio can easily cope with a 70B Llama with plenty of
| space to grow.
|
| I think it actually is perfect for local inference in a
| way that that build, or any other PC build in this price
| range, wouldn't be.
| kllrnohj wrote:
| The M3 Ultra studio also wouldn't be able to run path
| traced Cyberpunk at all no matter how much RAM it has.
| Workloads other than local inference LLMs exist, you know
| :) After all, if the only thing this was built to do was
| run LLMs then they wouldn't have bothered adding so many
| CPU cores or video engines. CPU cores (along with
| networking) being 2 of the specs highlighted by the
| person I was responding to, so they were obviously
| valuing more than _just_ LLM use cases.
| kridsdale1 wrote:
| The core customer market for this thing remains Video
| Editors. That's why they talk about simultaneous 8K
| encoding streams.
|
| Apple's Pro segment has been video editors since the 90s.
| AnthonBerg wrote:
| That's not going to yield the same bandwidth or memory
| latency though, right?
| rbanffy wrote:
| You'd need a chip with 8 memory channels. 16 DIMM slots,
| IIRC.
| amelius wrote:
| Why does it matter if you can run the LLM locally, if you're
| still running it on someone else's locked down computing
| platform?
| PeterStuer wrote:
| Running locally, your data is not sent outside of your
| security perimeter off to a remote data center.
|
| If you are going to argue that the OS or even below that the
| hardware could be compromised to still enable exfiltration,
| that is true, but it is a whole different ballgame from using
| an external SaaS no matter what the service guarantees.
| tempest_ wrote:
| Nvidia has had the Grace Hoppers for a while now. Is this not
| like that?
| ykl wrote:
| This is cheap compared to GB200, which has a street price of
| >$70k for just the chip alone if you can even get one. Also
| GB200 technically has only 192GB per GPU and access to more
| than that happens over NVLink/RDMA, whereas here it's just
| one big flat pool of unified memory without any tiered access
| topology.
| rbanffy wrote:
| We finally encountered the situation where an Apple
| computer is cheaper than its competition ;-)
|
| All joking aside, I don't think Apples are that expensive
| compared to similar high-end gear. I don't think there is
| any other compact desktop computer with half a terabyte of
| RAM accessible to the GPU.
| kridsdale1 wrote:
| And yet all that cash still just goes to TSMC
| TheRealPomax wrote:
| I think the other big thing is that the base model finally
| starts at a normal amount of memory for a production machine.
| You can't get less than 96GB. Although an extra $4000 for the
| 512GB model seems Tim Apple levels of ridiculous. There is
| absolutely no way that the difference costs anywhere near that
| much at the fab.
|
| And the storage solution still makes no sense of course, a
| machine like this should start at 4TB for $0 extra, 8TB for
| $500 more, and 16TB for $1000 more. Not start at a useless 1TB,
| with the 8TB version costing an extra $2400 and 16TB a truly
| idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives
| for $1000, SoC storage should set you back half that, not over
| double that.
| jjtheblunt wrote:
| > There is absolutely no way that the difference costs
| anywhere near that much at the fab.
|
| Price premium, probably, but chip lithography errors (and thus
| yields) at the huge memory density might be partially driving
| up the cost for huge memory.
| TheRealPomax wrote:
| It's Apple, price premium is a given.
| PeterStuer wrote:
| Is this on-chip memory? From the 800GB/s I would guess more
| likely a 512-bit bus (8-channel) to DDR5 modules. Doing it on a
| quad channel would _just_ about be possible, but really be
| pushing the envelope. Still a nice thing.
|
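| One way to sanity-check the bus-width guess against the headline
| figure; the transfer rates below are assumptions, not a published
| spec:
|
|     # peak GB/s = (bus width in bits / 8) * transfer rate (MT/s)
|     def peak_gb_s(bus_bits, mt_per_s):
|         return bus_bits / 8 * mt_per_s / 1000
|
|     print(peak_gb_s(512, 6400))    # 512-bit  -> ~410 GB/s, short
|     print(peak_gb_s(1024, 6400))   # 1024-bit -> ~819 GB/s, right range
|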
| As for practicality, which mainstream applications would
| benefit from this much memory paired with nice but relatively
| mid-tier compute? At this price point ($14K for a fully
| specced system), would you prefer it over e.g. a couple of
| NVIDIA Project DIGITS units (assuming those arrive on time and
| for around the announced $3K price point)?
| zitterbewegung wrote:
| NVIDIA project DIGITS has 128 GB LPDDR5x coherent unified
| system memory at a 273 Gb/s memory bus speed.
| bangaladore wrote:
| It would be 273 GB/s (gigabytes, not gigabits). But in
| reality we don't know the bandwidth. Some ex-employee said
| 500 GB/s.
|
| Your source is a Reddit post in which they try to match
| the size to existing chips, without realizing that it's very
| likely that NVIDIA is using custom memory here produced by
| Micron, just like Apple uses custom memory chips.
| samstave wrote:
| "unified memory"
|
| funny that people think this is so new, when CRAY had Global
| Heap eons ago...
| ddtaylor wrote:
| Why did it take so long for us to get here?
| baby_souffle wrote:
| Just a guess, but fabricating this can't be easy. Yield is
| probably higher if you have less memory per chip.
| RachelF wrote:
| Some possible groups of reasons:
|
| 1. Until recently, the RAM amount was something the end user
| liked to configure, so there was little market demand.
|
| 2. Technically, building such a large system on a chip or
| collection of chiplets was not possible.
|
| 3. RAM speed wasn't a bottleneck for most tasks; it was IO or
| CPU. LLMs changed this.
| hot_gril wrote:
| M1 came out before the LLM rush, though
| webworker wrote:
| The real hardware needed for artificial intelligence wasn't
| NVIDIA, it was a CRAY XMP from 1982 all along
| hot_gril wrote:
| It's new for mainstream PCs to have it.
| daft_pink wrote:
| Really? M4 Max or M3 Ultra instead of M4 Ultra?
| aurareturn wrote:
| With an M3 Ultra going into the Mac Studio, Apple could
| differentiate from the Mac Pro, which could then get the M4
| Ultra. Right now, the Mac Studio and Mac Pro oddly both have
| the M2 Ultra and same overall performance.
|
| https://x.com/markgurman/status/1896972586069942738
| cynicalpeace wrote:
| Can someone explain what it would take for Apple to overtake
| NVIDIA as the preferred solution for AI shops?
|
| This is my understanding (probably incorrect in some places)
|
| 1. NVIDIA's big advantage is that they design the hardware
| (chips) _and_ software (CUDA). But Apple also designs the
| hardware (chips) _and_ software (Metal and MacOS).
|
| 2. CUDA has native support by AI libraries like PyTorch and
| Tensorflow, so works extra well during training and inference. It
| seems Metal is well supported by PyTorch, but not well supported
| by Tensorflow.
|
| 3. NVIDIA uses Linux rather than MacOS, making it easier in
| general to rack servers.
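|
| On point 2, a minimal sketch of what that support looks like in
| practice with PyTorch: device-agnostic code just picks whichever
| backend is available at runtime (the library calls are real
| PyTorch APIs; the shapes are arbitrary). Note that the MPS
| backend still has gaps in op coverage compared to CUDA.
|
|     import torch
|
|     if torch.cuda.is_available():
|         device = torch.device("cuda")
|     elif torch.backends.mps.is_available():
|         device = torch.device("mps")    # Metal on Apple Silicon
|     else:
|         device = torch.device("cpu")
|
|     x = torch.randn(4096, 4096, device=device)
|     y = x @ x    # cuBLAS on CUDA, Metal kernels on MPS
|     print(device, y.shape)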
| bigyabai wrote:
| It's still boiling down to hardware and software differences.
|
| In terms of hardware - Apple designs their GPUs for graphics
| workloads, whereas Nvidia has a decades-old lead on optimizing
| for general-purpose compute. They've gotten really good at
| pipelining and keeping their raster performance competitive
| while also accelerating AI and ML. Meanwhile, Apple is
| directing most of their performance to just the raster stuff.
| They _could_ pivot to an Nvidia-style design, but that would be
| pretty unprecedented (even if a seemingly correct decision).
|
| And then there's CUDA. It's not really appropriate to compare
| it to Metal, both in feature scope and ease of use. CUDA has
| expansive support for AI/ML primitives and deeply integrated
| tensor/SM compute. Metal _does_ boast some compute features,
| but you're expected to write most of the support yourself in
| the form of compute shaders. This is a pretty radical departure
| from the pre-rolled, almost "cargo cult" CUDA mentality.
|
| The Linux shtick matters a tiny bit, but it's mostly a matter
| of convenience. If Apple hardware started getting competitive,
| there would be people considering the hardware regardless of
| the OS it runs.
| cynicalpeace wrote:
| > keeping their raster performance competitive while also
| accelerating AI and ML. Meanwhile, Apple is directing most of
| their performance to just the raster stuff. They could pivot
| to an Nvidia-style design, but that would be pretty
| unprecedented (even if a seemingly correct decision).
|
| Isn't Apple also focusing on the AI stuff? How has it not
| already made that decision? What would prevent Apple from
| making that decision?
|
| > Metal does boast some compute features, but you're expected
| to write most of the support yourself in the form of compute
| shaders. This is a pretty radical departure from the pre-
| rolled, almost "cargo cult" CUDA mentality.
|
| Can you give an example of where Metal wants you to write
| something yourself whereas CUDA is pre-rolled?
| bigyabai wrote:
| > Isn't Apple also focusing on the AI stuff?
|
| Yes, but not with their GPU architecture. Apple's big bet
| was on low-power NPU hardware, assuming the compute cost of
| inference would go down as the field progressed. This was
| the wrong bet - LLMs and other AIs have scaled _up_ better
| than they scaled down.
|
| > How has it not already made that decision? What would
| prevent Apple from making that decision?
|
| I mean, for one, Apple is famously stubborn. They're the
| last ones to admit they're wrong whenever they make a
| mistake, presumably admitting that the NPU is wasted
| silicon would be a mea-culpa for their AI stance. It's also
| easier to wait for a new generation of Apple Silicon to
| overhaul the architecture, rather than driving a
| generational split as soon as the problem is identified.
|
| As for what's _preventing_ them, I don't think there's
| anything insurmountable. But logically it might not make
| sense to adopt Nvidia's strategy even if it's better. Apple
| can't necessarily block Nvidia from buying the same nodes
| they get from TSMC, so they'd have to out-design Nvidia if
| they wanted to compete on their merits. Even then, since
| Apple doesn't support OpenCL it's not guaranteed that they
| would replace CUDA. It would just be another proprietary
| runtime for vendors to choose from.
|
| > Can you give an example of where Metal wants you to write
| something yourself whereas CUDA is pre-rolled?
|
| Not exhaustively, no. Some of them are performance-
| optimized kernels like cuSPARSE, some others are primitive
| sets like cuDNN, and yet others are graph and signal processing
| libraries with built-out support for industrial
| applications.
|
| To Apple's credit, they've definitely started hardware-
| accelerating the important stuff like FFT and ray tracing.
| But Nvidia still has a decade of lead time that Apple spent
| shopping around with AMD for other solutions. The head-
| start CUDA has is so great that I don't think Apple can
| seriously respond unless the executives light a fire under
| their ass to make some changes. It will be an "immovable
| rock versus an unstoppable force" decision for Apple's
| board of directors.
| fintechie wrote:
| IMO this is a bigger blow to the AI big boys than Deepseek's
| release. This is massive for local inference. Exciting times
| ahead for open source AI.
| whimsicalism wrote:
| it is absolutely not
| kcb wrote:
| The market for local inference and $10k+ Macs is not nearly
| significant enough to affect the big boys.
| bigyabai wrote:
| I don't think you understand what the "AI big boys" are in the
| market for.
| rjeli wrote:
| Wow, incredible. I told myself I'd stop waffling and just buy the
| next 800GB/s Mini or Studio to come out, so I guess I'm getting
| this.
|
| Not sure how much storage to get. I was floating the idea of
| getting less storage, and hooking it up to a TB5 NAS array of
| 2.5" SSDs; 10-20TB for models + datasets + my media library would
| be nice. Any recommendations for the best enclosure for that?
| kridsdale1 wrote:
| It depends on your bandwidth needs.
|
| I also want to build the thing you want. There are no multi SSD
| M2 TB5 bays. I made one that holds 4 drives (16TB) at TB3 and
| even there the underlying drives are far faster than the cable.
|
| My stuff is in OWC Express 4M2.
| perfmode wrote:
| Are you running RAID?
| Sharlin wrote:
| > it can be configured up to 512GB, or over half a terabyte.
|
| Hah, I see what they did there.
| kridsdale1 wrote:
| If they added 1 byte, it counts.
| aurareturn wrote:
| You can run the full Deepseek 671b q4 model at 40 tokens/s. 37B
| active params at a time because R1 is MoE.
| KingOfCoders wrote:
| In another of your comments it was "by my calculation". Now
| it's just fact?
| screye wrote:
| How does the 500GB of VRAM compare with 8x A100s? ($15/hr rentals)
|
| If it is equivalent, then the machine pays for itself in 300
| hours. That's incredible value.
| awestroke wrote:
| A100 has 10x or so higher mem bandwidth
| egorfine wrote:
| Per nvidia [1], the A100 has memory bandwidth up to 2,039 GB/s.
| So not
| 10x, more like 2x.
|
| [1] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
| Cent...
| ummonk wrote:
| Is the Mac Pro dead, or are they waiting for an M4 Ultra to
| refresh it?
| ozten wrote:
| We've come a long way since beowulf clusters of smart toasters.
| perfmode wrote:
| 32 core, 512GB RAM, 8TB SSD
|
| please take my money now
| raydev wrote:
| I know it's basically nitpicking competing luxury sports cars at
| this point, but I am very bothered that existing benchmarks for
| the M3 show single core perf that is approximately 70% of M4
| single core perf.
|
| I feel like I should be able to spend all my money to both get
| the fastest single core performance AND all the cores and
| available memory, but Apple has decided that we need to downgrade
| to "go wide". Annoying.
| xp84 wrote:
| > both get the fastest single core performance AND all the
| cores
|
| I'm a major Apple skeptic myself, but hasn't there always been
| a tradeoff between "fastest single core" vs "lots of cores"
| (and thus best multicore)?
|
| For instance, I remember when you could buy an iMac with an i9
| or whatever, with a higher clock speed and faster single core,
| or you could buy an iMac Pro with a Xeon with more cores, but
| the iMac (non-Pro) would beat it in a single core benchmark.
| Note: Though I used Macs as the example due to the simple
| product lines, I thought this was pretty much universal among
| all modern computers.
| raydev wrote:
| > hasn't there always been a tradeoff between "fastest single
| core" vs "lots of cores" (and thus best multicore)?
|
| Not in the Apple Silicon line. The M2 Ultra has the same
| single core performance as the M2 Max and Pro. No benchmarks
| for the M3 Ultra yet but I'm guessing the same vs M3 Max and
| Pro.
| xp84 wrote:
| Okay, good to know. Interesting change then.
| LPisGood wrote:
| I think the traditional reason for this is that other
| chips like to use complex scheduling logic to have more
| logical cores than physical cores. This costs single
| threaded speed but allows you to run more threads faster.
| 1attice wrote:
| Now with Ultra-class backdoors?
| https://news.ycombinator.com/item?id=43003230
| ein0p wrote:
| That's all nice, but if they are to be considered a serious AI
| hardware player, they will need to invest in better support for
| their hardware in deep learning frameworks such as PyTorch and
| Jax. Currently the support is rather poor, and is not suitable
| for any serious work.
| divan wrote:
| Model with 512GB VRAM costs $9500, if anyone wonders.
| m3kw9 wrote:
| Instantly HIPAA-compliant high-end models running locally.
| mlboss wrote:
| $14K with 512GB memory and 16TB storage
| maverwa wrote:
| I cannot believe I'm saying this, but: for Apple that's rather
| cheap. Threadripper boxes with that amount of memory do not
| come a lot cheaper. Considering Apple's pricing when it
| comes to memory in other devices, $4K for the 96GB to 512GB
| upgrade is a bargain.
| jltsiren wrote:
| It's not that much cheaper than with earlier comparable
| models. Apple memory prices have been $25/GB for the base and
| Pro chips and $12.5/GB for the Max and Ultra chips. With the
| new Studios, we get $12.5/GB until 128 GB and $9.375/GB
| beyond that.
|
| If you configure a Threadripper workstation at Puget Systems,
| memory price seems to be ~$6/GB. Except if you use 128 GB
| modules, which are almost $10/GB. You can get 768 GB for a
| Threadripper Pro cheaper than 512 GB for a Threadripper, but
| the base cost of a Pro system is much higher.
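|
| The arithmetic behind those per-GB figures, using the upgrade
| prices quoted elsewhere in this thread (list-price deltas, so
| approximate):
|
|     upgrades = {
|         "Studio 96GB -> 256GB (+$1,500)": (1500, 256 - 96),
|         "Studio 96GB -> 512GB (+$4,000)": (4000, 512 - 96),
|     }
|     for name, (usd, gb) in upgrades.items():
|         print(f"{name}: ${usd / gb:.2f} per added GB")
|     # vs roughly $6/GB for socketed DIMMs in a Threadripper build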
| alok-g wrote:
| Two questions for the fellow HNers:
|
| 1. What are the various average-joe (as opposed to researcher,
| etc.) use cases for running powerful AI models locally vs.
| just using cloud AI? Privacy of course is a benefit, but by
| itself it may
| not justify upgrades for an average user. Or are we expecting
| that new innovation will lead to much more proliferation of AI
| and use cases that will make running locally more feasible?
|
| 2. With the amount of memory used jumping up, would there be
| significant growth for companies making memory? If so, which
| ones would be the best positioned?
|
| Thanks.
| christiangenco wrote:
| IMO it's all about privacy. Perhaps also availability if the
| main LLM providers start pulling shenanigans but it seems like
| that's not going to be a huge problem with how many big players
| are in the space.
|
| I think a great use case for this would be in a company that
| doesn't want all of their employees sending LLM queries about
| what they're working on outside the company. Buy one or two of
| these and give everybody a client to connect to it and hey
| presto you've got a secure private LLM everybody in the company
| can use while keeping data private.
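|
| The client side can stay boring, too. A sketch, assuming the
| shared box runs one of the local servers that expose an
| OpenAI-compatible API (llama.cpp's server, Ollama, LM Studio,
| etc.); the hostname and model name are placeholders:
|
|     from openai import OpenAI
|
|     # Point the standard client at the office machine instead of
|     # a SaaS endpoint.
|     client = OpenAI(base_url="http://studio.internal:8080/v1",
|                     api_key="not-needed")
|
|     resp = client.chat.completions.create(
|         model="deepseek-r1-q4",
|         messages=[{"role": "user",
|                    "content": "Summarize this internal doc..."}],
|     )
|     print(resp.choices[0].message.content)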
| chamomeal wrote:
| I'll add to this that while I couldn't care less about
| OpenAI seeing my general coding questions, I wouldn't run actual
| important data through ChatGPT.
|
| With a local model, I could toss anything in there. Database
| query outputs, private keys, stuff like that. This'll
| probably become more relevant as we give LLMs broader use
| over certain systems.
|
| Like right now I still mostly just type or paste stuff into
| ChatGPT. But what about when I have a little database copilot
| that needs to read query results, and maybe even run its own
| subset of queries like schema checks? Or some open source
| computer-use type thingy needs to click around in all sorts
| of places I don't want openAI going, like my .env or my bash
| profile? That's the kinda thing I'd only use a local model
| for
| user3939382 wrote:
| Hopefully homomorphic encryption can solve this rather than
| building a new hardware layer everywhere.
| piotrpdev wrote:
| 1. Lower latency for real time tasks e.g. transcription +
| translation?
| archagon wrote:
| I don't currently use AIs, but if I did, they would be local.
| Simply put: I can't build my professional career around tools
| that I do not own.
| alok-g wrote:
| >> ... around tools that I do not own.
|
| That just may be dependent on how much trust you have in the
| providers you use. Or do you do your own electricity
| generation?
| archagon wrote:
| That's quite a reductio ad absurdum. No, I don't generate
| my own electricity (though I could). But I don't use tools
| for work that can change out from under me at any moment,
| or that can increase 10x in price on a corporate whim.
| hobofan wrote:
| And why would that require running AI models locally? You
| can be in essentially full control by using open source
| (/open weight) models (DeepSeek etc.) running on
| exchangeable cloud providers that are as replaceable as
| your electricity provider.
| archagon wrote:
| Sure, I guess you can do that as long as you use an open
| weight model. (Offline support is a nice perk, however.)
| alok-g wrote:
| We align.
|
| I tend to do the same thing. I do not consider myself as
| a good representative of an average user though.
| zamalek wrote:
| I don't think there's a huge use-case locally, if you're happy
| with the subscription cost and privacy. That is, yet. Give it
| maybe 2 years and someone will probably invent something which
| local inference would seriously benefit from. I'm anticipating
| inference for home appliances (something Mac mini form
| factor that plugs into your router) _but_ that's based on what
| would make logical sense for consumers, not what consumers
| would fall for.
|
| Apple seems to be using LPDDR, but HBM will also likely be a
| key tech. SK Hynix and Samsung are the most reputable for both.
| alok-g wrote:
| Thanks.
|
| >> Apple seems to be using LPDDR, but HBM will also likely be
| a key tech. SK Hynix and Samsung are the most reputable for
| both.
|
| So not much Micron? Any US based stocks to invest in? :-)
| zamalek wrote:
| I forgot about Micron, absolutely. TSMC is the supplier for
| all of these, so you're covering both memory and compute if
| that's your strategy (the risk is that US TSMC is over
| provisioning manufacturing based on the pandemic hardware
| boom).
| alok-g wrote:
| Thanks!
| theshrike79 wrote:
| For 1: censorship
|
| A local model will do anything you ask it to, as far as it
| "knows" about it. It doesn't need to please investors or be
| afraid of bad press.
|
| LM Studio + a group of select models from huggingface and you
| can do whatever you want.
|
| For generic coding assistance and knowledge, online services
| are still better quality.
| JadedBlueEyes wrote:
| One important one that I haven't seen mentioned is simply
| working without an internet connection. It was quite important
| for me when I was using AI whilst travelling through the
| countryside, where there is very limited network access.
| epolanski wrote:
| Can anybody ELI5 why aren't there multi gpu builds to run LLMs
| locally?
|
| It feels like one should be able to build a good machine for
| $3-4k, if not less, with six mid-level 16GB gaming GPUs.
| snitty wrote:
| Reddit's LocalLLaMA has a lot of these. 3090s are pretty
| popular for these purposes. But they're not trivial to build
| and run at home. Among other issues are that you're drawing
| >1kW for just the GPUs if you have four of them at 100% usage.
| risho wrote:
| 6 * 16GB is still nowhere near 512GB of VRAM. On top of that,
| the monster you'd create requires hyper-specific server-grade
| hardware, will be huge and loud, and will pull down enough
| power to trip a circuit breaker. I'm sure most people would
| rather pay a 30 percent premium to get twice the RAM in a
| power-sipping device that you can hold in the palm of your
| hand.
| crowcroft wrote:
| Kinda curious to see how many tok/sec it can crush. Could be a fun
| way to host AI apps.
| joshhart wrote:
| This is pretty exciting. Now an organization could produce an
| open weights mixture of experts model that has 8-15b active
| parameters but could still be 500b+ parameters and it could be
| run locally with INT4 quantization with very fast performance.
| DeepSeek R1 is a similar model, but with over 30B active
| parameters, which makes it a little slow.
|
| I do not have a good sense of how well quality scales with narrow
| MoEs but even if we get something like Llama 3.3 70b in quality
| at only 8b active parameters people could do a ton locally.
| wewewedxfgdf wrote:
| Computers these days - the more appealing, exciting, cool, and
| desirable, the higher the price goes, into the stratosphere.
|
| $9499
|
| Whatever happened to competition in computing?
|
| Computing hardware competition used to be cut-throat, drop-dead,
| knife-fight, last-man-standing brutally competitive. Now it's
| just a massive gold rush cash grab.
| hu3 wrote:
| It doesn't even run Linux properly.
|
| Could cost half of that and it would still be uninteresting for
| my use cases.
|
| For AI, on-demand cloud processing is magnitudes better in
| speed and software compatibility anyway.
| niek_pas wrote:
| The Macintosh Plus, released in 1986, cost $2,600 at the time,
| or $7460 adjusted for inflation.
| bigyabai wrote:
| It even came with an official display! Nowadays that's a
| $1,600-$6,000 accessory, depending on whether you own a VESA
| mount.
| martin_a wrote:
| > Apple today announced M3 Ultra, the highest-performing chip it
| has ever created
|
| Well, duh, it would be a shame if you made a step backwards,
| wouldn't it? I hate that stupid phrase...
| wewewedxfgdf wrote:
| The good news is that AMD and Intel are both in good positions to
| develop similar products.
| dangoodmanUT wrote:
| 800GB/s and 512GB of unified RAM is going to go stupid for LLMs
| minton wrote:
| + $4,000 to bump to 512GB from 96GB.
| ldng wrote:
| Well, a shame for Apple; a lot of the rest of the world is going
| to boycott American products after such a level of treachery.
| tap-snap-or-nap wrote:
| All this hardware, but I don't know how to best utilize it
| because 1) I am not a pro, and 2) the apps that would make
| complex jobs easier aren't as helpful as they could be - which
| is what old Apple used to do really well.
| narrator wrote:
| Not to rain on the Apple parade, but cloud video editing with the
| models running on H100s that can edit videos based on prompts is
| going to be vastly more productive than anything running locally.
| This will be useful for local development with the big Deepseek
| models though. Not sure if it's worth the investment unless
| Deepseek is close to the capability of cloud models, or privacy
| concerns overwhelm everything.
| gigatexal wrote:
| 8TB, 512GB RAM, M3 Ultra: $15K+ USD. Wow.
___________________________________________________________________
(page generated 2025-03-05 23:00 UTC)