[HN Gopher] Memory access on the Apple M1 processor
___________________________________________________________________
Memory access on the Apple M1 processor
Author : luigi23
Score : 219 points
Date : 2021-01-06 17:06 UTC (5 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| lrossi wrote:
| Shouldn't you choose the random numbers such that array[idx1] ^
| array[idx1 + 1] are guaranteed to fall in the same cache line?
| Assuming that it has that. Right now some accesses cross the end
| of the cache line.
| CyberRabbi wrote:
| Technically you are correct, but an access pair is expected
| to cross a cache line only 1/16 of the time (or one in
| however many ints there are in a cache line). There is an
| implicit assumption that this is infrequent enough that it
| shouldn't increase the average time too much, but that
| assumption should be tested.
| jayd16 wrote:
| >our naive random-memory model
|
| Doesn't everyone use the (I believe) still valid concepts of
| latency and bandwidth?
| sroussey wrote:
| Depends on context.
|
| For example, what is the bandwidth and latency when you ask for
| the value at the same memory address in an infinite loop? And
| how does that compare to the latency and bandwidth of a memory
| module you buy on NewEgg?
| fluffy87 wrote:
| L1 BW.
|
| When people use BW in their performance models, they don't
| use only 1 bandwidth, but whatever combination of bandwidth
| makes sense for the _memory access pattern_.
|
| So if you are always accessing the same word, the first access
| runs at DRAM BW, and subsequent ones at L1 BW, and any
| meaningful performance model will take that into account.
| whoisburbansky wrote:
| The concepts are still broadly valid; the naivety being
| referred to is the assumption that two non-adjacent memory
| reads will be twice as slow as one memory read or two
| adjacent reads.
| wyldfire wrote:
| How do latency and bandwidth relate to the cost model for the
| code in the benchmark?
|
| When creating the model discussed in the post, we're using it
| to try to make a static prediction about how the code will
| execute.
|
| Note that the goal of the post is not to merely measure the
| memory access performance, it's to understand the specific
| microarchitecture and how it might deliver the benefits that we
| see in benchmarks.
| foota wrote:
| Is this per core or shared between cores?
| hundchenkatze wrote:
| Per core I think, emphasis is mine.
|
| > It looks like a _single_ core has about 28 levels of memory
| parallelism, and possibly more.
| foota wrote:
| I was wondering if this might be a shared resource though,
| since it doesn't seem they tested with multiple threads.
| wrsh07 wrote:
| Ok, summary:
|
| This article lays out three scenarios: 1) accessing two random
| elements
|
| 2) accessing 3 random elements
|
| 3) accessing two pairs of adjacent elements (same as (1) but also
| the elements after each random element)
|
| It then does some trivial math to use the loaded data.
|
| A naive model might only consider memory accesses and might
| assume accessing an adjacent element is free.
|
| On the Mac m1 core, this is not the case. While the naive model
| might expect cases 1 & 3 to cost the same and case 2 to cost 50%
| more, instead cases 2 & 3 are nearly the same (3 slightly faster)
| and case 2 is about 50% more expensive than 1.
| jayd16 wrote:
| I don't really understand the comparison because it seems like
| scenario 3 (2+) is doing more XORs and twice the accesses to
| array over the same amount of iterations.
|
| We have to assume these are byte arrays, yes? Or at least some
| size that's smaller than the cache line. You would still pay
| for the extra unaligned fetches. I don't think this is a valid
| scenario at all, M1 or not.
|
| Anyone want to run these tests on an Intel machine and let us
| know if the author's "naive model" tests hold there?
| wrsh07 wrote:
| The point of the naive model is that you assume memory
| accesses dominate
|
| That is, the math part is so trivial compared to the memory
| access that you could do a bunch of math and you would still
| only notice a change in the number of memory accesses.
|
| Also it looks like the response to yours links their test and
| the naive model predicts correctly
| jayd16 wrote:
| I think 5% is a non-trivial difference but alright, it's a
| much bigger difference on the M1.
|
| I guess I still don't understand what's going on here.
|
| Scenario 1 has two spatially close reads followed by two
| dependent random access reads.
|
| Scenario 3 (2+) has two spatially close reads, and two
| pairs of dependent random access reads of two spatially
| close locations.
|
| Why does it follow that this is caused by a change in
| memory access concurrency? The two required round trips
| should dominate both on the M1 and an Intel but for some
| reason the M1 performs worse than that. Why?
|
| I can't help but feel the first snippet triggers some SIMD
| path while the 3rd snippet fails to.
| africanboy wrote:
| I did it on an old i7 laptop
|
| https://news.ycombinator.com/item?id=25661055
| temac wrote:
| > A naive model might only consider memory accesses and might
| assume accessing an adjacent element is free.
|
| Really depends on the level of naivety and the definition of
| "free". It would be less insane to write that accessing an
| adjacent element has negligible overhead if the data must be
| loaded from RAM and there are some OOO bubbles in which to
| execute the adjacent loads. If some of the data are in cache,
| the free-adjacent-load claim immediately becomes less
| probable. If the latency of a single load is already hidden
| by OOO, adding another one will obviously have an impact. If
| the workload is highly regular you _can_ get quite chaotic
| results when making even trivial changes (sometimes even when
| _aligning the .text differently_!)
|
| And the proposed microbenchmark is way too simplistic: it is
| possible that it saturates some units in some processors and
| completely different units in others...
|
| Is the impact of an extra adjacent load from RAM _likely_ to
| be negligible in real-world workloads? Absolutely. With precise
| characteristics depending on your exact model / current freq /
| other memory pressure at this time, etc.
| willvarfar wrote:
| A lot of commenters here are saying that Apple's advantage is
| that it can profile real workloads and optimise for them.
|
| Well that's true and could very well be an advantage. An
| advantage in that they did it, not in that only they have access
| to it.
|
| Intel and AMD can trivially profile real world workloads too.
|
| Did they? I don't know what Apple did, but the impression I get
| is that intel certainly hasn't.
| nabla9 wrote:
| What is the cache line size and page size on the M1?
|     sysconf(_SC_PAGESIZE);  /* posix */
|
| Can you get direct processor information like LEVEL1_ICACHE_ASSOC
| and LEVEL1_ICACHE_LINESIZE from the M1??
| momothereal wrote:
| `getconf PAGESIZE` returns 16384 on the base M1 MacBook Air.
|
| The L1 cache values aren't there. The macOS `getconf` doesn't
| support -a (listing all variables), so they may just be under a
| different name.
|
| edit: see replies for `sysctl -a` output
| lilyball wrote:
| Is it possibly exposed via sysctl, which does support a flag
| to list all variables?
| messe wrote:
| From sysctl -a on my M1:
|     hw.cachelinesize: 128
|     hw.l1icachesize: 131072
|     hw.l1dcachesize: 65536
|     hw.l2cachesize: 4194304
|
| EDIT: also, when run under Rosetta hw.cachelinesize is
| halved:
|     hw.cachelinesize: 64
|     hw.l1icachesize: 131072
|     hw.l1dcachesize: 65536
|     hw.l2cachesize: 4194304
| Nokinside wrote:
| M1 cache lines are double what Intel, AMD, and other ARM
| microarchitectures commonly use. That's a significant
| difference.
| [deleted]
| JonathonW wrote:
| Compared to the i9-9880H in my 16" MacBook Pro:
|     hw.cachelinesize: 64
|     hw.l1icachesize: 32768
|     hw.l1dcachesize: 32768
|     hw.l2cachesize: 262144
|     hw.l3cachesize: 16777216
|
| The M1 doubles the line size, doubles the L1 data cache
| (i.e. same number of lines), quadruples the L1
| instruction cache (i.e. double the lines), and has a 16x
| larger L2 cache, but no L3 cache.
| waterside81 wrote:
| For people who know more about this stuff than me: are these
| sorts optimizations only possible because Apple controls the
| whole stack and can make the hardware & OS/software perfectly
| match up with one another or is this something that Intel can do
| but doesn't for some reasons (tradeoffs)?
| viktorcode wrote:
| There's at least two M1 optimisations targeting Apple's
| software stack:
|
| 1. Fast uncontended atomics. Speeds up reference counting,
| which is used heavily by the Objective-C code base (and
| Swift). The increase is massive compared to Intel.
|
| 2. A guaranteed memory-ordering (TSO) mode. Allows faster
| Arm code to be produced by Rosetta when emulating x86.
| Without it, emulation overhead would be much bigger
| (similar to what Microsoft is experiencing).
| [deleted]
| [deleted]
| AnthonyMouse wrote:
| > are these sorts optimizations only possible because Apple
| controls the whole stack and can make the hardware &
| OS/software perfectly match up with one another or is this
| something that Intel can do but doesn't for some reasons
| (tradeoffs)?
|
| Interestingly it's the other way around. Apple is using TSMC's
| 5nm process (they don't have their own fabs), which is better
than Intel's in-house fabs, so it's _Intel's_ vertical
| integration which is _hurting_ them compared to the non-
| vertically integrated Apple.
|
| Also, the answer to "is this only possible because of vertical
| integration" is always _no_. Intel and Microsoft regularly
| coordinate to make hardware and software work together. Intel
| is one of the largest contributors to the Linux kernel, even
though they don't "own" it. Two companies coordinating with
| one another can do anything they could do as an individual
| company.
|
| Sometimes the efficiency of this is lower because there are
| communication barriers and no single chain of command. But
| sometimes it's higher because you don't have internal politics
| screwing everything up when the designers would be happy with
| outsourcing to TSMC because they have a competitive advantage,
| but the common CEO knows that would enrich a competitor and
| trash their internal investment in their own fabs, and forces
| the decision that leads to less competitive products.
| cma wrote:
| Not quite vertical integration, but TSMC's 5nm fabs are
| Apple's fabs (exclusively, for a period of time).
|
| During the iPod era, Toshiba's 1.8in HD production was
| exclusively Apple's only for music players; likewise, Apple
| gets all of TSMC's 5nm output for a period of time.
| hinkley wrote:
| Integration is a petri dish. It can speed up both growth and
| decay, and it is indifferent to which one wins.
| wmf wrote:
| No, there's no cross-stack optimization here. The M1 gives very
| high performance for all code.
| qeternity wrote:
| I think this gets lost in the fray between the "omg this is
| magic" and then the Apple haters. The M1 is a very good chip.
| Apple has hired an amazing team and resourced them well. But
| from a pure hardware perspective, the M1 is quite
| evolutionary. However the whole Apple Silicon experience is
| revolutionary and magical due to the tight software pairing.
|
| Both teams deserve huge praise for the tight coordination and
| unreal execution.
| acdha wrote:
| I think this is part of the reason there are so many
| people trying to find reasons to downplay it: humans love
| the idea of "one weird trick" which makes a huge difference
| and we sometimes find those in tech but rarely for mature
| fields like CPU design. For many people, this is
| unsatisfying like asking an athlete their secret, and
| getting a response like "eat well, train a lot, don't give
| up" with nary a shortcut in sight.
| djacobs7 wrote:
| Is the article saying that the M1 is slower than we would have
| expected in this case?
|
| My understanding, based on the article, is that on a normal
| processor we would expect arr[idx] + arr[idx+1] and
| arr[idx] to take the same amount of time.
|
| But the M1 is so parallelized that it goes to grab both
| arr[idx] and arr[idx+1] separately, so we have to wait for
| both of those to return. Meanwhile, on a less parallelized
| processor, we would have fetched arr[idx] first, waited for
| it to return, and the processor would realize that it
| already had arr[idx+1] without having to do a second fetch.
|
| Am I understanding this right?
| phkahler wrote:
>> My understanding, based on the article, is that on a normal
| processor we would expect arr[idx] + arr[idx+1] and
| arr[idx] to take the same amount of time.
|
| That depends. If the two accesses are on the same cache line,
| then yes. But since idx is random that will not happen
| sometimes. He never says how big array[] is in elements or what
| size each element is.
|
| I thought DRAM also had the ability to stream out consecutive
| addresses. If so then it looks like Apple could be missing out
| here.
|
| Then again, if his array fits in cache he's just measuring
| instruction counts. His random indexes need to cover that whole
| range too. There's not enough info to figure out what's going
| on.
| SekstiNi wrote:
| > There's not enough info to figure out what's going on.
|
| If you only look at the article this is true. However, the
| source code is freely available:
| https://github.com/lemire/Code-used-on-Daniel-Lemire-s-
| blog/...
| mrob wrote:
| I tried it on my old (2009) 2.5GHz Phenom II X4 905e (GCC
| 10.2.1 -O3, 64 bit) and got results almost perfectly
| matching the conventional wisdom:
|     two  : 97.4 ns
|     two+ : 97.9 ns
|     three: 145.8 ns
| egnehots wrote:
| TLDR: he is using a random index with a big enough array
| [deleted]
| africanboy wrote:
| I ran the benchmark on my system
|
| It's a 6 years old system, fastest times are in the 25ns
| range
|
| - 2-wise+ is 5% slower than 2-wise
|
| - 3-wise is 46% slower than 2-wise
|
| - 3-wise is 39% slower than 2-wise+
|
| on the M1
|
| - 2-wise+ is 40% slower than 2-wise
|
| - 3-wise is 46% slower than 2-wise
|
| - 3-wise is 4% slower than 2-wise+
| SekstiNi wrote:
| Interesting, I ran it on my laptop (i7-7700HQ) with the
| following results:
|
| - 2-wise+ is 19% slower than 2-wise
|
| - 3-wise is 48% slower than 2-wise
|
| - 3-wise is 25% slower than 2-wise+
|
| However, as mentioned in the post the numbers can vary a
| lot, and I noticed a maximum run-to-run difference of
| 23ms on two-wise.
| phkahler wrote:
| He's only got 3 million random[] numbers. Whether that's
| enough depends on the cache size. It also bothers me to
| read code like this where functions take parameters (like
| N) and never use them.
| eloff wrote:
| He mentioned it's a 1GB array, and the source code is
| available.
| jayd16 wrote:
| It's a little confusing because they're conflating the idea that
| you almost certainly read at least the entire word (and not a
| single byte) at a time with the other idea that you could fetch
| multiple words concurrently.
| duskwuff wrote:
| Any cached memory access is going to read in the entire cache
| line -- 64 bytes on x86, apparently 128 on M1. This is true
| across most architectures which use caches; it isn't specific
| to M1 or ARM.
| kzrdude wrote:
| (As I learned from recent Rust concurrency changes) on
| newer Intel it usually fetches two cache lines, so
| effectively 128 bytes, while AMD usually fetches 64 bytes.
| Those are the sizes they use for "cache line padded"
| values (i.e. making sure to separate two atomics by the
| fetch size to avoid threads invalidating the cache line
| back and forth too much).
| jayd16 wrote:
| Yes almost certainly more than the word will be read but it
| varies by architecture. I would think almost by definition
| no less than a word can be read so I went with that in my
| explanation.
| syntaxing wrote:
| I'm super curious if it's true that my 8GB M1 will die quickly
| because of the aggressive swaps. I guess time will tell.
| acdha wrote:
| FWIW, I have a 2010 MBA which was _heavily_ used for years as a
| primary development system. The SSD only started to show signs
| of degraded performance last year and that wasn't massive. I
| would be quite surprised if the technology has become worse.
| [deleted]
| jeffbee wrote:
| Great practical information. Nice to see people who know what
| they are talking about putting data out there. I hope eventually
| these persistent HN memes about M1 memory will die: that it's
| "on-die" (it's not), that it's the only CPU using LPDDR4X-4267
| (it's not), or that it's faster because the memory is 2mm closer
| to the CPU (not that either).
|
| It's faster because it has more microarchitectural resources. It
| can load and store more, and it can do with a single core what an
| Intel part needs all cores to accomplish.
| titzer wrote:
| > it can do with a single core what an Intel part needs all
| cores to accomplish.
|
| Care to explain what you mean specifically by this?
| saagarjha wrote:
| The M1 has extremely high single-core performance.
| temac wrote:
| It is not 4 times faster than an Intel core, though...
| saagarjha wrote:
| It is in memory performance, which is what I assumed was
| being measured here.
| kllrnohj wrote:
| How are you defining memory performance and where are
| your supporting comparisons? This article only discusses
| the M1's behavior, and makes no comparisons to any other
| CPU.
| FabHK wrote:
| FWIW, I ran it on a MacBook Pro (13-inch, 2019, Four
| Thunderbolt 3 ports), 2.4 GHz Quad-Core Intel Core i5, 8
| GB 2133 MHz LPDDR3:
|     two  : 49.6 ns (x 5.5)
|     two+ : 64.8 ns (x 5.2)
|     three: 72.8 ns (x 5.6)
|
| EDIT to add: above was just `cc`. Below is with `cc -O3
| -Wall`, as in Lemire's article:
|     two  : 62.8 ns (x 7.1)
|     two+ : 69.2 ns (x 5.5)
|     three: 95.3 ns (x 7.3)
| namibj wrote:
You _need_ to use -march=native, because otherwise it
| retains backwards compatibility with older x86.
| [deleted]
| africanboy wrote:
| there must be something wrong there, on my late 2014
| laptop that mounts
|     Type: DDR4
|     Speed: 2133 MT/s
|
| I get
|     two  : 27.1 ns (3x)
|     two+ : 28.6 ns (2.2x)
|     three: 39.7 ns (3x)
|
| which is not much, considering this is an almost 6 years
| old system with 2x slower memory
| titzer wrote:
| Sure, and it has a very large out-of-order execution
| engine, but it is not fundamentally different from what
| other superscalar processors do. So I am curious what the
| OP meant by that offhand comment.
| jeffbee wrote:
| One core of the M1 can drive the memory subsystem to the
| rails. A single core can copy (load+store) at 60GB/s.
| This is close to the theoretical design limit for
| LPDDR4X. A single core on Tiger Lake can only hit about
| 34GB/s, and Skylake-SP only gets about 15GB/s. So yes, it
| is close to 4x faster.
| titzer wrote:
| Thanks for clarifying. But this isn't any fundamental
| difference IMO. There isn't any functional limitation in
| an Intel core that means it cannot saturate the memory
| bandwidth from a single core, unless I am missing
| something.
| jeffbee wrote:
| I agree, it's not fundamental. It is, in particular, not
| that other popular myth, that it's "because ARM". It's
| only that 1 core on an Intel chip can have N-many
| outstanding loads and 1 core of an M1 can have M>N
| outstanding loads.
| titzer wrote:
| Frankly, I find Lemire does oversimplified, poorly
| controlled, back-of-the-envelope microbenchmarking all the
| time that provides little insight beyond establishing a
| general trend. It's sophomoric and a poor demonstration of
| how to do well-controlled benchmarking that might yield
| useful, repeatable, and transferable results.
| foldr wrote:
| >or that it's faster because the memory is 2mm closer to the
| CPU (not that either)
|
| Not to disagree with your overall point, but 2mm is a long way
| when dealing with high frequency signals. You can't just
| eyeball this and infer that it makes no difference to
| performance or power consumption.
| jeffbee wrote:
| If it works, it works. There will be no observable
| performance difference for DDR4 SDRAM implementations with
| the same timing parameters, regardless of the trace length.
| There are systems out there with 15cm of traces between the
| memory controller pins and the DRAM chips. The only thing you
| can say against them is they might consume more power driving
| that trace. But you wouldn't say they are meaningfully
| slower.
| foldr wrote:
| You can't just eyeball the PCB layout for a GHz frequency
| circuit and say "yeah that would definitely work just the
| same if you moved that component 2mm in this direction".
| It's certainly possible to use longer trace lengths, but
| that may come with tradeoffs.
|
| >The only thing you can say against them is they might
| consume more power driving that trace
|
| Power consumption is really important in a laptop, and
| Apple clearly care deeply about minimising it.
|
| For all we know for sure, moving the memory closer to the
| CPU may have been part of what's enabled Apple to run
| higher frequency memory with acceptable (to them) power
| draw.
| sliken wrote:
| The most impressive thing I've seen is that, when accessed
| in a TLB-friendly fashion, the latency is around 30ns.
|
| Anandtech has a graph showing this, specifically the R per RV
| prange graph. I've verified this personally with a small
| microbenchmark I wrote. I've not seen anything else close to
| this memory latency.
| reasonabl_human wrote:
| Mind sharing the micro benchmark you wrote? I'm curious to
| know how that would work
| tandr wrote:
| Sorry, what would AMD's or Intel's "latest and greatest"
| numbers for the same be?
| sliken wrote:
| Here's the M1: https://www.anandtech.com/show/16252/mac-
| mini-apple-m1-teste...
|
| Scroll down to the latency vs size map and look at the R
| per RV prange. That gets you 30ns or so.
|
| Similar for AMD's latest/greatest the Ryzen 9 5950X:
| https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-
| di...
|
| The same R per RV prange is in the 60ns range.
| ed25519FUUU wrote:
| In other words, it's better architecture. If anything this
| makes it seem more impressive to me.
| amelius wrote:
| No, it's the same architecture but with different parameters.
|
| It's like the difference between the situation where every
| car uses 4 cylinders, and then Apple comes along and makes a
| car with 5 cylinders.
| kllrnohj wrote:
| Your analogy was so close! It's Apple comes along and makes
| an 8 cylinder engine. Since, you know, the other CPUs are
| 4-wide decode and Apple's M1 is 8-wide decode :)
| PragmaticPulp wrote:
| I don't understand this competition to attribute the M1's speed
| to _one_ specific change, while downplaying all of the others.
|
| M1 is fast because they optimized everything across the board.
| The speed is the cumulative result of many optimizations, from
| the on-die memory to the memory clock speed to the
| architecture.
| adam_arthur wrote:
| It's fast because they optimized everything across the board,
| and also paid for exclusive access to TSMC 5nm process.
| pdpi wrote:
| This seems to be a recurring theme with the M1, and one that,
| in a sense, actually baffles me even more than the alternative.
| There is no "magic" at play here, it's just lots and lots of
| raw muscle. They just seem to have a freakishly successful
| strategy for choosing what aspects of the processor to throw
| that muscle at.
|
| Why is that strategy simultaneously remarkably efficient and
| remarkably high-performance? What enabled/led them to make
| those choices where others haven't?
| mhh__ wrote:
| I think it's worth saying that, because AMD have only just
| hit their stride, Intel were under almost zero pressure to
| improve, which has really hurt them, especially on the
| process side.
|
| X86 definitely carries a constant-factor overhead, but if
| Intel put their designs on 5nm they'd look pretty good too.
| Jim Keller (when he was still there) hinted that their
| offerings a year or so out are significantly bigger, to the
| point that he judged it worth mentioning, so I wouldn't
| write them off.
| adam_arthur wrote:
| Certainly Apple's processors are far ahead, but they're a
| full process generation (5nm) ahead of their competitors.
| They paid their way to that exclusive right through TSMC.
|
| I'm sure they'll still come out ahead in benchmarks, but the
| numbers will be much closer once AMD moves to 5nm. You
| absolutely cannot fairly compare chips from different fab
| generations.
|
| I don't see many comments hammering this point home enough...
| it's not like the performance gap is through engineering
| efforts that are leagues ahead. Certainly some can be
| attributed to that, and Apple has the resources to poach any
| talent necessary.
| GeekyBear wrote:
| A node shrink gives you a choice of cutting power,
| improving performance, or some mix of the two.
|
| Apple appears to have taken the power reduction when they
| moved to TSMC 5nm.
|
| >The one explanation and theory I have is that Apple might
| have finally pulled back on their excessive peak power draw
| at the maximum performance states of the CPUs and GPUs, and
| thus peak performance wouldn't have seen such a large jump
| this generation, but favour more sustainable thermal
| figures.
|
| Apple's A12 and A13 chips were large performance upgrades
| both on the side of the CPU and GPU, however one criticism
| I had made of the company's designs is that they both
| increased the power draw beyond what was usually
| sustainable in a mobile thermal envelope. This meant that
| while the designs had amazing peak performance figures, the
| chips were unable to sustain them for prolonged periods
| beyond 2-3 minutes. Keeping that in mind, the devices
| throttled to performance levels that were still ahead of
| the competition, leaving Apple in a leadership position in
| terms of efficiency.
|
| https://www.anandtech.com/show/16088/apple-
| announces-5nm-a14...
| wmf wrote:
| From a customer's perspective it's not my problem. Everyone
| had the opportunity to bid on that fab capacity and they
| decided not to.
| adam_arthur wrote:
| Yeah, totally agreed. But if you read these comments,
| they seem to be in total amazement about the performance
| gap and not acknowledging how much of an advantage being
| a fab generation ahead is.
|
| Customers don't care, but discussion of the merits of the
| chip should be more nuanced about this.
|
| It also implies that the gap won't exist for very long,
| as AMD will move onto 5nm soon
| tandr wrote:
| > It also implies that the gap won't exist for very long,
| as AMD will move onto 5nm soon
|
| ... yes, if there is any capacity left. Capacity for the
| new process is a limited resource after all.
| coldtea wrote:
| > _Why is that strategy simultaneously remarkably efficient
| and remarkably high-performance? What enabled /led them to
| make those choices where others haven't?_
|
The things people complain about:
|
| (a) keeping a walled garden,
|
| (b) moving fast and taking the platform to new directions all
| at once
|
| (c) controlling the whole stack
|
| Which means they're not beholden to compatibility with third
| party frameworks and big players, or with their own past, and
| thus can rely on their APIs, third party devs etc, to cater
| to their changes to the architecture.
|
| And they're not chained to the whims of the CPU vendor (as
| the OS vendor) or the OS vendor (as the CPU vendor) either,
| as they serve the role of both.
|
| And of course they benchmarked and profiled the hell out of
| actual systems.
| jeffbee wrote:
| Neither A nor C makes sense; they are not supported by
| evidence. There is no aspect of the mac or macOS that can
| be realistically described as a "walled garden". It comes
| with a compiler toolchain and ... well, some docs. It
| natively runs software compiled for a foreign architecture.
| You can do whatever you want with it. It's pretty open.
|
| A "walled garden" is when there is a single source of
| software.
| anfilt wrote:
| I'll be honest: as long as Apple keeps these walled-garden
| shenanigans going, I am not buying any of their hardware.
| fartcannon wrote:
| They could still do all this shit without the walled
| garden. To me, it suggests they aren't willing to compete.
| They're anti-competitive.
| marrvelous wrote:
| With the walled garden, Apple can set enforceable
timelines for the software ecosystem to adapt to
| architectural changes.
|
| Remember the transition to arm64? Apple forced everything
| on the App Store to ship universal binaries.
|
| Without the App Store walled garden, software isn't
| required to keep up to date with architectural changes.
| Instead, keeping current is only a requirement to being
| featured on the App Store (which would just be a single
| way to install software, not the only method).
| danaris wrote:
| Well, and on the Mac, it's _not_ the only method. The
| walled garden here has big open gates.
|
| That said, _all_ software on the Mac, post-Catalina, has
| to be 64-bit, whether it 's distributed through the Mac
| App Store or not, because the 32-bit system libraries are
| no longer included at all.
| baybal2 wrote:
| > There is no "magic" at play here, it's just lots and lots
| of raw muscle. They just seem to have a freakishly successful
| strategy for choosing what aspects of the processor to throw
| that muscle at.
|
| There is no freakishly successful strategy at play there
| either. It's just that all previous attempts at a "fast ARM"
| chip were rather half-hearted: "add a pipeline step here,
| add an extra register there, increase datapath width there,"
| without squeezing it to the limit.
| barkingcat wrote:
| The answer is that they have raw hard numbers from the
| hundreds of millions of iPads/iPhones sold each year, and
| can use the metrics from those devices to optimize the next
| generation of devices.
|
| These improvements didn't come from nowhere. It came from
| iterations of iOS hardware.
| TYPE_FASTER wrote:
| Apple has been iterating on their proprietary mobile ARM-
| based processors since 2010, and has gotten really good at
| it. I would imagine that producing billions of consumer
devices with these chips has helped give them a lot of
| experience in a shortened time frame.
|
| I also wonder if having the hardware and software both worked
| on in-house is an advantage. I mean, if you're developing
| power management software for a mobile OS, and you're using a
| 3rd-party vendor, then you read the documentation, and work
| with the vendor if you have questions. If it's all internal,
| you call them, and could make suggestions on future processor
| design too based on OS usage statistics and metrics.
| jandrese wrote:
| It seems like Apple listened when people talked about how all
| modern processors bottleneck on memory access and decided to
| focus heavily on getting those numbers better.
|
| Of course this leads to the question that if everyone in the
| industry knew this was the issue why weren't Intel and AMD
| pushing harder on it? They already both moved the memory
| controller onboard so they had the opportunity to
| aggressively optimize it like Apple has done, but instead we
| have year after year where the memory lags behind the
| processor in speed improvements, to the point where it is
| ridiculous how many clock cycles a main memory access takes
| on a modern x86 chip.
| acdha wrote:
| > What enabled/led them to make those choices where others
| haven't?
|
| Others have to some extent -- AMD is certainly not out of the
| game -- so I'd treat this more as the question of how they've
| been able to go more aggressively down that path. One of the
| really obvious answers is that they control the whole stack
| -- not just the hardware and OS but also the compilers and
| high-level frameworks used in many demanding contexts.
|
| If you're Intel or Qualcomm, you have a wider range of things
| to support _and_ less revenue per device to support it, and
| you are likely to have to coordinate improvements with other
| companies who may have different priorities. Apple can
| profile things which their users do and direct attention to
| the right team. A company like Intel might profile something
| and see that they can make some changes to the CPU but the
| biggest gains would require work by a system vendor, a
| compiler improvement, Windows/Linux kernel change, etc. --
| they contribute a large amount of code to many open source
| projects but even that takes time to ship and be used.
| SurfingInVR wrote:
| Something I've seen no one else mentioning: Apple's low-spec
| tier is $1000, not $70.
| [deleted]
| dv_dt wrote:
| No fighting the sales department on where to put the market
| segmentation bottlenecks?
| gameswithgo wrote:
| I see two main things behind it:
|
| 1. They are the only ones who have 5nm chips, because they
| paid a lot to TSMC for that right.
|
| 2. They gave up on expandable memory, which lets them solder
| it right next to the CPU, which likely makes it easier to
| ship with really high clocks, and/or they just spent the
| money it takes to get binned LPDDR4 at that speed.
|
| So a good CPU design, just like AMD and Intel have, but one
| generation ahead on node size, and fast RAM. It's not special
| low-latency RAM or anything, just clocked higher than maybe
| any other production machine, though enthusiasts sometimes
| clock theirs higher on desktops!
| epistasis wrote:
| > So a good cpu design, just like AMD and Intel have
|
| The design seems to be very different, in that it's far far
| wider, and supposedly has a much better branch predictor.
|
| > fast ram
|
| Is that a property of the RAM clock, or a function of a
| better memory controller? The RAM certainly doesn't appear
| to have any better latency.
| gameswithgo wrote:
| Right, latency isn't (much) affected by a higher clock
| rate. Getting ram to run fast requires both good ram
| chips and good controller/motherboard.
|
| and yes, obviously Apple's bespoke ARM CPU is quite a bit
| different from Zen 3 Ryzen's x86 CPU, but I'm not sure it
| is net-better. When Zen 4 hits at 5nm I expect it will
| perform on par with or better than the M1, but we won't
| know till it happens!
| mtgx wrote:
| In other words: money. Throwing money at the (right)
| problems made them better than others.
|
| "But doesn't Intel have a lot of money, too?"
|
| Sure, but Intel has also been running around like a
| headless chicken this past decade (pretty much literally,
| since Otellini left) combined with them getting very
| complacent because they had "no real competition."
| gavin_gee wrote:
| Didn't they also make some interesting hires a few years ago,
| like Anand from AnandTech and some other silicon vets who
| likely helped them design the M1 approach?
| jeffbee wrote:
| I don't have any inside-Apple perspective, but my guess is
| having a tight feedback cycle between the profiles of their
| own software and the abilities of their own hardware has
| helped them greatly.
|
| The reason I think so is that when I was at Google, it was
| seven years between when we told Intel what could be helpful
| and when they shipped hardware with the feature. Also, when
| AMD first
| shipped the EPYC "Naples" it was crippled by some key uarch
| weaknesses that anyone could have pointed out if they had
| been able to simulate realistic large programs, instead of
| tiny and irrelevant SPEC benchmarks. If Apple is able to
| simulate or measure their own key workloads and get the
| improvements in silicon in a year or two they have a gigantic
| advantage over anyone else.
| martamorena2 wrote:
| That's bizarre. As if CPU vendors were unable to run
| "realistic" workloads. If they truly aren't, that's because
| they are unwilling and then they are designing for failure
| and Apple can just eat their lunch.
| dcolkitt wrote:
| Interesting point. This would suggest pretty sizable
| synergies from the oft-rumored Microsoft acquisition of
| Intel.
| f6v wrote:
| > Microsoft acquisition of Intel
|
| Could that possibly be approved by governments?
| gigatexal wrote:
| Nope. Not a lawyer but I doubt it at all.
| ChuckNorris89 wrote:
| Microsoft doesn't need to acquire Intel, they need to do
| what Apple did and acquire a stellar ARM design house
| that will build a chip with x86 translation, tailored to
| accelerate the typical workloads on Windows machines and
| sell those chips to the likes of Dell and Lenovo and tell
| developers _" ARM Windows is the future, x86 Windows will
| be sunset in 5 years and no longer supported by us, start
| porting your apps ASAP and in the meantime, try our x86
| emulator on our ARM silicon, it works great."_
|
| Microsoft proved with the Xbox and Surface series that they
| can make good hardware if they want; now they need to
| move to chip design.
| ralfd wrote:
| Apple has at most 10% of the computer market and is just
| one player among many. I am skeptical Microsoft with
| their 90% dominance would or should be allowed this much
| power over the industry.
| megablast wrote:
| You'll get people guessing, since Apple itself puts out so
| little information.
| gameswithgo wrote:
| What other laptop ships with LPDDR4X clocked at 4267? I agree
| though that being closer to the cpu isn't having any
| appreciable effect on latency, but being soldered close to the
| cpu probably does make it easier for them to hit that high
| clock rate.
| wmf wrote:
| Tiger Lake laptops such as the XPS 13.
| jeffbee wrote:
| As WMF mentions, Tiger Lake laptops like my Razer Book have
| the same memory. It is not appreciably closer to the CPU in
| the Apple design. In Intel's Tiger Lake reference designs the
| memory is also in two chips that are mounted right next to
| the CPU.
| danaris wrote:
| And (genuine question) how do the Tiger Lake laptops
| compare with the M1 MacBooks thus far?
| skavi wrote:
| AnandTech has decent benchmarks for both Tiger Lake [0]
| and M1 [1].
|
| [0]: https://www.anandtech.com/show/16084/intel-tiger-
| lake-review...
|
| [1]: https://www.anandtech.com/show/16252/mac-mini-
| apple-m1-teste...
| jeffbee wrote:
| The outcome seems to depend greatly on the physical
| design of the laptops. The elsewhere-mentioned Dell XPS
| 13 has a particularly poor cooling design, which is why I
| chose the Razer Book instead. Despite being marketed in a
| very silly way to gamers only, it seems to have competent
| mechanical design.
| SAI_Peregrinus wrote:
| Gamers are likely to run their systems with demanding
| workloads, for hours, with a color-coded performance
| counter (FPS stat). They'll notice if it throttles.
| They're particularly demanding customers, and there's
| quite a bit of competition for their money.
| ksec wrote:
| >HN memes about M1 memory will die
|
| It is not only HN, it is practically the whole Internet. Go
| around the top 20 hardware and Apple website forums and you
| see the same thing, also vastly amplified by a few KOLs on
| Twitter.
|
| I don't remember ever seeing anything quite like it in tech
| circles. People were happily running around spreading
| misinformation.
| Bootvis wrote:
| What is a KOL?
| tyingq wrote:
| "Key Opinion Leader". I think it's the new word for
| "Influencer".
| ksec wrote:
| I am pretty sure KOL predates Influencer in modern
| internet usage. Before that they were simply known as
| Internet Celebrities. May be it is rarely used now. So
| apology for not explaining the acronyms.
| secondcoming wrote:
| First I've heard of it!
| walterbell wrote:
| Who introduced the term KOL and bestows the K title?
| jeffbee wrote:
| Yeah, I know. There was some kid on Twitter who was trying to
| tell me that it was the solder in an x86 machine (he actually
| said "a Microsoft computer") that made them slower. Apple,
| without the solder, was much faster.
|
| According to this person's bio they had an undergraduate
| education in computer science ¯\_(ツ)_/¯
| s800 wrote:
| What's the precision of these ns level measurements?
| mhh__ wrote:
| The answer to that is usually very context-dependent, and
| depends on what you're measuring. As long as you look at a
| histogram first and don't blindly calculate (say) the mean,
| it should be obvious.
|
| Two examples (that are slightly bigger than this, but the
| same principles apply):
|
| If you benchmark std::vector insertion, you'll see a flat
| graph with n tall spikes, spaced at ratios of its
| reallocation factor apart, and it scales very very well. The
| measurements are clean.
|
| If, however, you do the same for a linked list, you get a
| linearly increasing graph _but_ it's absolutely all over the
| place, because it doesn't play nice with the memory
| hierarchy. The std dev for a given value of n might be a
| hundred times worse than the vector's.
| CyberRabbi wrote:
| clock_gettime(CLOCK_REALTIME) on macOS provides nanosecond-
| level precision.
| geocar wrote:
| I seem to recall OS X didn't use to have clock_gettime, so
| it's news to me that it even exists -- I might have been away
| from OS X too long.
|
| Is there any performance difference between that and
| mach_absolute_time() ?
| lilyball wrote:
| It was added some years ago, and I believe
| mach_absolute_time is actually now implemented in terms of
| (the implementation of) clock_gettime. The documentation on
| mach_absolute_time now even says you should use
| clock_gettime_nsec_np(CLOCK_UPTIME_RAW) instead.
|
| macOS also has clock constants for a monotonic clock that
| increases while sleeping (unlike CLOCK_UPTIME_RAW and
| mach_absolute_time).
| saagarjha wrote:
| Not yet, at least :)
|
|     _mach_absolute_time:
|     00000000000012ec pushq   %rbp
|     00000000000012ed movq    %rsp, %rbp
|     00000000000012f0 movabsq $0x7fffffe00050, %rsi ## imm = 0x7FFFFFE00050
|     00000000000012fa movl    0x18(%rsi), %r8d
|     00000000000012fe testl   %r8d, %r8d
|     0000000000001301 je      0x12fa
|     0000000000001303 lfence
|     0000000000001306 rdtsc
|     0000000000001308 lfence
|     000000000000130b shlq    $0x20, %rdx
|     000000000000130f orq     %rdx, %rax
|     0000000000001312 movl    0xc(%rsi), %ecx
|     0000000000001315 andl    $0x1f, %ecx
|     0000000000001318 subq    (%rsi), %rax
|     000000000000131b shlq    %cl, %rax
|     000000000000131e movl    0x8(%rsi), %ecx
|     0000000000001321 mulq    %rcx
|     0000000000001324 shrdq   $0x20, %rdx, %rax
|     0000000000001329 addq    0x10(%rsi), %rax
|     000000000000132d cmpl    0x18(%rsi), %r8d
|     0000000000001331 jne     0x12fa
|     0000000000001333 popq    %rbp
|     0000000000001334 retq
| Skunkleton wrote:
| That may be the result of inlining clock_gettime, though
| that would imply a pretty different implementation from
| the one I am familiar with.
|
| AFAIR on x86 a locked rdtsc is ~20 cycles. So to answer
| the GP's question, its precision is in the few-nanoseconds
| range. Accuracy is a different question, i.e. compare
| numbers from the same die, but be a little more
| suspicious across dies.
|
| No clue how this is implemented on the M1, or if the M1
| has the same modern tsc guarantees that x86 has grown
| over the last few generations of chips.
| saagarjha wrote:
| Yeah, clock_gettime is somewhat more complicated than
| this. If anything, it might have an inlined
| mach_absolute_time in it...
| vlovich123 wrote:
| I was part of the team that really pushed the kernel team
| to add support for a monotonic clock that counts while
| sleeping (this had been a persistent ask before just not
| prioritized). We got it in for iOS 8 or 9. The dance you
| otherwise have to do is not only complicated in userspace
| on macOS, it's expensive & full of footguns due to race
| conditions (& requires changing the clock basis for your
| entire app if I recall correctly).
| saagarjha wrote:
| It's new in macOS Sierra. I believe mach_absolute_time is
| slightly faster, but not by much; both just read the commpage
| these days to save on a syscall.
| kergonath wrote:
| It's a good introduction, but it's a bit disappointing that it
| ends that way. I'd love to read more about what's behind the
| figure and more technical info about how it might work.
___________________________________________________________________
(page generated 2021-01-06 23:00 UTC)