[HN Gopher] Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL...
       ___________________________________________________________________
        
       Graviton 3, Apple M2 and Qualcomm 8cx 3rd gen: a URL parsing
       benchmark
        
       Author : ibobev
       Score  : 78 points
       Date   : 2023-05-03 19:31 UTC (3 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | joseph_grobbles wrote:
       | [dead]
        
       | jeffbee wrote:
       | To me it would be somewhat more interesting to compare head-to-
       | head mobile CPUs instead of comparing laptops and servers. In
       | this particular microbenchmark, mobile 12th and 13th-generation
       | Core performance cores, and even the efficiency cores on the 13th
       | generation, are faster than the M2.
        
         | scns wrote:
          | Even when they are faster, the M2 is on the same die as the
          | RAM, and the bandwidth and latency are way better. That
          | matters for compilation, or am I mistaken?
        
           | jeffbee wrote:
            | You are mistaken, just like everyone else who has ever
            | repeated the myth that Apple puts large-scale, high-
            | performance logic and high-density DRAM on the same die,
            | which is impossible.
           | 
           | Apple uses LPDDR4 modules soldered to a PCB, sourced from the
           | same Korean company that everyone else uses. Intel has used
           | the exact same architecture since Cannon Lake, in 2018.
        
             | wtallis wrote:
             | Cannon Lake is a weird red herring to bring in to the
             | discussion, because it was only a real product in the
             | narrowest sense possible. It may have technically been the
             | first CPU Intel shipped with LPDDR4 support (or was it one
             | of their Atom-based chips?), but the exact generation of
             | LPDDR isn't really relevant because both Apple and Intel
             | have supported multiple generations of LPDDR over the years
             | and both have moved past 4 and 4x to 5 now.
             | 
             | What is somewhat relevant as the source of confusion here
             | is that Apple puts the DRAM on the same _package_ as the
              | processor rather than on the motherboard nearby, as is
              | almost always done for x86 systems that use LPDDR. (But
              | there's at least one upcoming Intel system that's been
             | announced as putting the processor and LPDDR on a shared
             | module that is itself then soldered to the motherboard.)
             | That packaging detail probably doesn't matter much for the
             | entry-level Apple chips that use the same memory bus width
             | as x86 processors, but may be more important for the high-
             | end parts with GPU-like wide memory busses.
        
             | ricw wrote:
             | I don't think that is true.
             | 
              | Last I checked, Apple M1 Max chips have up to 800GB/s of
              | memory bandwidth, whilst AMD's high-end chips top out at
              | around ~250GB/s, closer to what a standard M2 chip does
              | (not the Max or Pro version). At the top end Apple has at
              | least 2x the memory bandwidth of other CPU vendors, and
              | that's likely the case further down too.
        
               | wmf wrote:
               | The Mx Max should be compared to a discrete CPU+GPU
               | combination that does have comparable total memory
               | bandwidth. It isn't automatically better to put
               | everything on one chip.
        
             | scns wrote:
              | Thank you for the correction then. Ah yes, the soldered
              | LPDDR4 dies allow higher bandwidths since more pins allow
              | parallel access.
        
         | Dalewyn wrote:
         | Intel 12th and 13th gen both use the same efficiency cores.
        
           | jeffbee wrote:
           | Well, on the ones I happen to have on hand the 12th gen hits
           | 3300MHz and the 13th gen goes all the way to 4200MHz.
        
             | Dalewyn wrote:
             | Yeah well, the efficiency cores on a 12900K will be faster
             | than those on an N100, so what is your point?
             | 
              | We're discussing overall compute power differences between
              | CPU architectures; minute differences in performance
              | between identical-architecture CPU cores stemming from
              | higher clock speeds are outside the scope of this
              | discussion.
        
       | smoldesu wrote:
       | > Note that you cannot blindly correct for frequency in this
       | manner because it is not physically possible to just change the
       | frequency as I did
       | 
        | They're also not the same core architecture? ARM chips that
        | conform to the same spec won't necessarily scale the same way
        | across frequencies. Even if all of these CPUs could be run at
        | the same clock speed, their core logic is not the same. Hell,
        | even the Firestorm and Icestorm cores on the M1 SoC shouldn't
        | be considered directly comparable if you scale the clock
        | speeds.
        
         | [deleted]
        
         | wmf wrote:
         | That's the point. He knows they're different architectures
         | (although X1 and V1 are related) so normalizing frequency
         | exposes the architectural differences.
        
       | psanford wrote:
       | It looks like there's been some good progress on getting Linux
       | running natively on the Windows Dev Kit 2023 hardware[0]. There
       | was a previous discussion here about this hardware back in
       | 2022-11[1].
       | 
       | [0]: https://github.com/linux-surface/surface-pro-x/issues/43
       | 
       | [1]: https://news.ycombinator.com/item?id=33418044
        
       | bushbaba wrote:
        | Seems weird to compare the c7g.large vs the M2 and not the
        | largest VM sizes.
        
       | Thaxll wrote:
        | I think the performance of my oracle free instance (ARM CPU) is
        | 10x worse than those results.
        
         | dylan604 wrote:
          | My oracle free instance uses MariaDB instead of MySQL, but I'm
          | guessing you meant the free instance provided by Oracle rather
          | than an instance that uses nothing from Oracle. =)
        
           | Thaxll wrote:
           | Yes I'm talking about: https://docs.oracle.com/en-
           | us/iaas/Content/FreeTier/freetier...
           | 
           | Ampere A1 Compute instances
        
       | [deleted]
        
       | seiferteric wrote:
       | Not that it's for sure, but M3 is probably coming out late this
       | year/early next year and will be on 3nm, so once again having a
       | huge node advantage. Just seems like Apple will have the latest
       | node before everyone else for the foreseeable future.
        
         | webaholic wrote:
         | Apple pays a premium to TSMC to reserve the early runs on the
         | next gen nodes. They can do this because they can charge their
         | users a premium for Apple devices. I am not sure the rest of
         | the players have that much pricing power or margins.
        
       | monocasa wrote:
       | Part of what they don't mention is that Graviton 3 and the
       | Snapdragon 8cx Gen 3 have pretty much the same processor core.
       | The Neoverse V1 is only a slightly modified Cortex X1. Hence the
       | same results when you account for clock frequency.
        
       | mhh__ wrote:
        | Without more context about what the code actually does, this
        | doesn't tell me all that much, other than what I could guess
        | from the intended use cases of the chips.
        | 
        | The strength of Apple silicon is that it can crush benchmarks
        | and transfer that power very well to real-world concurrent
        | workloads too. E.g. is this basically just measuring the L1
        | latency? If not, are the compilers generating the right
        | instructions, etc.? (One would assume they are, but I have had
        | issues with getting good ARM codegen previously, only to find
        | that the compiler couldn't work out what ISA to target other
        | than a conservative guess.)
        
         | renewiltord wrote:
         | Git repo available from post.
        
           | mhh__ wrote:
           | I ain't readin' all that (OK maybe I will but lemire does do
           | this quite a lot, his blog is 40% gems 60% slightly sloppy
           | borderline factoids that only make sense if you think in
           | exactly the same way he does)
        
         | KerrAvon wrote:
         | Yes. Since it's Ada, I'm suspicious of codegen tuning being a
         | major factor here.
        
           | zimpenfish wrote:
           | "Ada is a fast and spec-compliant URL parser written in C++."
           | 
           | Wouldn't modern C++ compilers have decent codegen tuning for
           | all these platforms?
        
           | wtallis wrote:
           | Is there something specific about this library that makes you
           | suspicious, or are you assuming from the name that this is
           | using the Ada programming language?
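         
            (Aside: the "Ada" here is the ada URL parsing library
            benchmarked in the post, which is written in C++; the name is
            unrelated to the Ada programming language. Below is a minimal
            sketch of the kind of call the benchmark exercises; the exact
            template parameter and getter names are from memory and may
            differ slightly from the library's current API.)
         
                #include <iostream>
                #include "ada.h"
         
                int main() {
                  // Parse one URL; the benchmark presumably does this in a
                  // loop over a large list of real-world URLs.
                  auto url = ada::parse<ada::url_aggregator>(
                      "https://user@www.example.com:8080/path?query=1#frag");
                  if (!url) {
                    std::cerr << "invalid URL\n";
                    return 1;
                  }
                  // Getters return normalized, WHATWG-spec-compliant parts.
                  std::cout << url->get_protocol() << "\n";  // "https:"
                  std::cout << url->get_host() << "\n";      // "www.example.com:8080"
                  std::cout << url->get_pathname() << "\n";  // "/path"
                  return 0;
                }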
        
         | dan-robertson wrote:
         | It does seem the benchmark has its data in cache, based on the
         | timings.
         | 
          | If the benchmark were only measuring L1 latency, what would
          | that imply about the 'scaling by inverse clock speed' bit? My
          | guess is as follows. Chips with higher clock rates will be
          | penalised: (a) it is harder to decrease latencies (memory,
          | pipeline length, etc.) in absolute terms than it is to run at
          | a higher clock speed and do the non-memory work faster; and
          | (b) if you're waiting 5ns to read some data, that hurts you
          | more after the scaling if your clock speed is higher. The fact
          | that the M1 wins after the scaling despite the higher clock
          | rate suggests to me that either it has a big advantage on
          | memory latency or there's some non-memory-latency advantage
          | in scheduling or branch prediction that leads to more useful
          | instructions being retired per cycle.
         | 
         | But maybe I'm interpreting it the wrong way.
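         
          (The 'scaling' being discussed is the post's normalization of
          throughput to a common 3 GHz clock. A toy version of that
          arithmetic, with made-up numbers rather than the article's
          measurements:)
         
              #include <cstdio>
         
              int main() {
                // Hypothetical: a core parses 8.0 million URLs/s at 3.5 GHz.
                double urls_per_sec = 8.0e6;
                double clock_ghz = 3.5;
                // "What would it do at 3 GHz with the same work per cycle?"
                double normalized = urls_per_sec * (3.0 / clock_ghz);
                std::printf("%.2e URLs/s at 3 GHz\n", normalized);  // ~6.86e6
                // Memory latency in nanoseconds does not shrink when you
                // rescale the clock, which is why high-clocked chips can
                // look worse after this correction if they are latency-bound.
                return 0;
              }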
        
       | Kwpolska wrote:
       | A comparison with x86_64 CPUs (e.g. those seen in comparable
       | MacBooks and AWS machines) would be useful.
       | 
       | Also, I'm not sure if "correcting" the numbers for 3 GHz is
       | reasonable and reflects real-life performance. Perhaps some
       | throttling could be applied to test the CPUs using a common
       | frequency?
        
         | deltaci wrote:
          | It's a benchmark of GitHub Actions (Azure) vs a really old
          | MacBook Pro 15. Not exactly what you are looking for, but it
          | conveys the general vibe already.
         | 
         | https://buildjet.com/for-github-actions/blog/a-performance-r...
        
           | willcipriano wrote:
           | Sometimes when I run a lot of builds in a short period of
           | time I feel like I get demoted to the slower boxes.
        
           | 015a wrote:
            | This is a big, general problem with CI providers that I
            | don't hear talked about enough: because they charge per
            | minute, they are actively incentivized to run on old
            | hardware, slowing builds and milking more from customers in
            | the process. Doubly so when your CI is hosted by a major
            | cloud provider who would otherwise have to scrap these old
            | machines.
            | 
            | I wish this were only a theoretical concern, a theoretical
            | incentive, but it's not. GitHub Actions is slow, and GitLab
            | suffers from a similar problem: their hosted SaaS runners
            | run on GCP n1-standard-1 machines. The oldest machine type
            | in GCP's fleet, the n1-standard-1 is powered by a variety
            | of dusty old CPUs Google Cloud has no other use for, from
            | Sandy Bridge to Skylake. Sandy Bridge is a 12-year-old CPU.
        
             | AnthonyMouse wrote:
             | There are workloads where newer CPUs are dramatically
             | faster (e.g. AVX-512), but in general the difference isn't
             | huge. Most of what the newer CPUs get you is more cores and
             | higher power efficiency, which you don't care about when
             | you're paying per-vCPU. Which vCPU is faster, a ten year
             | old Xeon E5-2643 v2 at 3.5GHz or a two year old Xeon
             | Platinum 8352V at 2.1GHz? It depends on the workload. Which
             | has more memory bandwidth _per core_?
             | 
             | But the cloud provider prefers the latter because it has
             | 500% more cores for 50% more power. Which is why the latter
             | still goes for >$2000 and the former is <$15.
        
         | stingraycharles wrote:
          | In my totally unscientific (but consistent) benchmarks for our
          | CI build servers, an m6g.8xlarge compiles our C++ codebase in
          | about 9.5 minutes, whereas an m6a.8xlarge takes about 11
          | minutes. The price difference is about 20% as well, IIRC, so
          | it's generally a good deal.
          | 
          | Of course the types of optimisations that a compiler may (or
          | may not) do on aarch64 vs x86_64 are completely different and
          | may explain the difference (we actually compile with
          | -march=haswell for x86_64), but generally Graviton seems like a
          | really good deal.
        
           | foota wrote:
           | You're probably leaving a lot of performance on the floor if
           | you're building for haskell and running on skylakeish or
           | newer.
           | 
           | Edit: yes, haswell:-)
        
             | nickpeterson wrote:
             | *haswell, in case the Haskell people come after you.
        
               | speed_spread wrote:
               | "don't poke the endofunctor"
        
               | paulddraper wrote:
               | Ah, that makes more sense.
        
         | zamalek wrote:
         | > I'm not sure if "correcting" the numbers for 3 GHz is
         | reasonable and reflects real-life performance
         | 
         | It's not useful at all. It effectively measures IPC
         | (instructions per clock), which is just chip vendor bragging
         | rights.
         | 
         | Assuming that all the chips meet some baseline performance
         | criteria: for datacenter and portable devices, the real
         | benchmark would be "instructions per joule."
         | 
         | For desktop devices "instructions per dollar" would be most
         | relevant.
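         
          (A toy comparison of the two metrics, with invented numbers: a
          chip can lose on raw throughput yet still win on work done per
          joule, which is the distinction being drawn here.)
         
              #include <cstdio>
         
              int main() {
                // Chip A: slower but frugal; chip B: faster but power-hungry.
                double urls_per_sec_a = 4.0e6, watts_a = 5.0;
                double urls_per_sec_b = 6.0e6, watts_b = 15.0;
                // URLs parsed per joule = (URLs per second) / watts.
                std::printf("A: %.0f URLs/joule\n", urls_per_sec_a / watts_a);  // 800000
                std::printf("B: %.0f URLs/joule\n", urls_per_sec_b / watts_b);  // 400000
                return 0;
              }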
        
           | aylmao wrote:
           | > It's not useful at all. It effectively measures IPC
           | (instructions per clock), which is just chip vendor bragging
           | rights.
           | 
           | +1. Moreover the author then seems to conclude from this
           | benchmark:
           | 
           | > Overall, these numbers suggest that the Qualcomm processor
           | is competitive.
           | 
           | This is an odd conclusion to draw from this test and these
           | numbers, given how little this benchmark tests (just string
           | operations). Does this benchmark want to test raw CPU power?
           | Then why "normalize" to 3GHz? Does it want to test CPU
           | capabilities? If so why use such a "narrow" test?
           | 
            | IMO this benchmark makes for a good data point, but it's far
            | from enough to draw much of a conclusion from.
        
           | Octoth0rpe wrote:
           | > For desktop devices "instructions per dollar" would be most
           | relevant.
           | 
           | For cloud customers as well
        
             | cubefox wrote:
             | Which would probably include the cost of the chips in some
             | way, not just electricity.
        
             | zamalek wrote:
             | > For cloud customers as well
             | 
             | Cloud costs are dominated by power delivery and cooling.
              | Both of those are directly influenced by how much power the
              | chip uses to achieve its performance target.
             | 
             | I guess it does indirectly influence dollar cost, but I was
             | referring to MSRP of the chip. As a simple example: the
             | per-chip cost of Graviton is probably enormous (if you
             | factor R&D into the cost of a chip), but it's still cheaper
             | for Amazon customers. Why? Power and cooling.
        
       ___________________________________________________________________
       (page generated 2023-05-03 23:00 UTC)