[HN Gopher] Graviton 3: First Impressions
___________________________________________________________________
Graviton 3: First Impressions
Author : ingve
Score : 136 points
Date : 2022-05-29 13:30 UTC (9 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| jadbox wrote:
| I'd love to see benchmarks for webservers like Node or Py.
| wmf wrote:
| Related Graviton 3 benchmarks:
| https://www.phoronix.com/scan.php?page=article&item=graviton...
| bullen wrote:
| What about memory contention when many cores try to write/read
| the same memory?
|
| There is no point to add more cores if they can't cooperate.
|
    | How come I'm the only one pointing this out?
|
    | I think 4 cores will max out the memory contention, so keep on
    | piling on these 128-core heaters. But they will not outlive a
    | simple Raspberry Pi 4!?
| electricshampo1 wrote:
| The whole chip in general will be used in aggregate by
| independent vms/containers etc that do NOT read and write to
| the same memory. Some kernel datastructures within a given vm
| are still shared, ditto for within a single process, but good
| design minimizes that (per cpu/thread data structures, sharded
| locks, etc etc).
| MaxBarraclough wrote:
| I don't think they were referring to contention across VM
| boundaries.
| gpderetta wrote:
    | Reading the same memory is usually ok.
    |
    | Writing is not, but respecting the single-writer principle is
    | usually rule zero of parallel programming optimisation.
    |
    | If you mean reading/writing to the same memory bus in general,
    | then yes, the bus needs to be sized according to the needs of the
    | expected loads (i.e. the machine needs to be balanced).
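    |
    | As a minimal sketch of the single-writer idea (plain C11 atomics
    | and pthreads; names and sizes are illustrative, not from any
    | particular codebase): each thread increments only its own
    | cache-line-aligned counter, so no line ever has more than one
    | writer, and a single reader aggregates at the end.
    |
    |     /* gcc -O2 -pthread single_writer.c */
    |     #include <pthread.h>
    |     #include <stdatomic.h>
    |     #include <stdio.h>
    |
    |     #define NTHREADS 4
    |     #define ITERS    1000000
    |
    |     struct counter {
    |         /* one counter per cache line, so writers never share a line */
    |         _Alignas(64) _Atomic unsigned long value;
    |     };
    |
    |     static struct counter counters[NTHREADS];
    |
    |     static void *worker(void *arg)
    |     {
    |         struct counter *c = arg;
    |         for (int i = 0; i < ITERS; i++)
    |             /* relaxed ordering is enough: this slot has exactly one writer */
    |             atomic_fetch_add_explicit(&c->value, 1, memory_order_relaxed);
    |         return NULL;
    |     }
    |
    |     int main(void)
    |     {
    |         pthread_t t[NTHREADS];
    |         for (int i = 0; i < NTHREADS; i++)
    |             pthread_create(&t[i], NULL, worker, &counters[i]);
    |         for (int i = 0; i < NTHREADS; i++)
    |             pthread_join(t[i], NULL);
    |
    |         unsigned long total = 0;
    |         for (int i = 0; i < NTHREADS; i++)   /* single reader sums the slots */
    |             total += atomic_load_explicit(&counters[i].value,
    |                                           memory_order_relaxed);
    |         printf("total = %lu\n", total);
    |         return 0;
    |     }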
| Sirened wrote:
| It's likely that it's going to need a post on its own since
| it's an extremely complicated topic. Someone else wrote an
    | awesome post about this for the Neoverse N2 chips [1] and they
    | found that with LSE atomics, the N2 performs as well or better
    | than Icelake. Given Graviton 3 has a much wider fabric, I would
    | assume this lead only improves.
|
| [1] https://travisdowns.github.io/blog/2020/07/06/concurrency-
| co...
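    |
    | For anyone wanting to see the difference locally: LSE (Large
    | System Extensions) atomics arrived with ARMv8.1-A, and GCC only
    | emits them when the target allows it. A rough sketch (the file
    | name is made up; the flags are the standard GCC ones):
    |
    |     /* contended.c
    |      * gcc -O2 -march=armv8.1-a -S contended.c   -> single LSE RMW (ldaddal)
    |      * gcc -O2 -moutline-atomics -S contended.c  -> runtime dispatch between
    |      *                                              LL/SC and LSE (the default
    |      *                                              on AArch64 since GCC 10.1)
    |      */
    |     #include <stdatomic.h>
    |
    |     _Atomic long counter;
    |
    |     void bump(void)
    |     {
    |         /* without LSE this becomes an ldxr/stxr retry loop */
    |         atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
    |     }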
| bullen wrote:
| Ah, yes I remember this post, but it reads pretty cryptic to
| me. I would like to know what the slowdowns actually become
| in practice, does it add latency to the execution of other
| threads and how will the machine as a whole behave?
|
| I know M4 had much better multicore shared memory perf. than
| M3, but now both of those are old and I don't have users to
| test anything now.
| WhitneyLand wrote:
| How much can SVE instructions help with machine learning?
|
| I've wondered why Apple Silicon made the trade off decision to
| not include SVE support yet, given that support for lower
| precision FP vectorization seems like it could have made their
| NVidia perf gap smaller.
| tomrod wrote:
| Very interesting! I'm not terribly well versed in ARM vs x86 so
    | it's helpful to see these kinds of benchmarks and reports.
|
| One bit of feedback for the author: the sliding scale is helpful,
| but the y axes are different between the visualizations so you
| cannot see the apples to apples comparison needed. Suggest
| regenerating those.
| rwmj wrote:
| _> GCC will flat out refuse to emit SVE instructions (at least in
| our limited experience), even if you use assembly,_
|
| This seems ... wrong? I haven't tried it but according to the
| link below SVE2 intrinsics are supported in GCC 10 (and Clang 9):
|
| https://community.arm.com/arm-community-blogs/b/tools-softwa...
| adrian_b wrote:
| Yes, gcc 10.1 has introduced support for the SVE2 intrinsics
| (ACLE).
|
    | Moreover, starting with version 8.1, gcc began to use SVE in
    | certain cases when it succeeded in auto-vectorizing loops (if the
    | correct -march option had been used).
    |
    | Nevertheless, many Linux distributions still ship with older gcc
    | versions, so SVE/SVE2 does not work with the available compiler
    | or cross-compiler.
|
| You must upgrade gcc to 10.1 or a newer version.
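    |
    | With a new enough compiler, the ACLE intrinsics are straightforward
    | to use. A small vector-length-agnostic sketch (the function name and
    | exact flags are just an example; -mcpu=neoverse-v1 should also work
    | on recent gcc):
    |
    |     /* gcc -O2 -march=armv8.2-a+sve -c saxpy_sve.c */
    |     #include <arm_sve.h>
    |
    |     void saxpy(float a, const float *x, float *y, long n)
    |     {
    |         /* svcntw() = number of 32-bit lanes per vector, unknown until runtime */
    |         for (long i = 0; i < n; i += svcntw()) {
    |             svbool_t    pg = svwhilelt_b32(i, n);   /* predicate covers the tail */
    |             svfloat32_t vx = svld1_f32(pg, &x[i]);
    |             svfloat32_t vy = svld1_f32(pg, &y[i]);
    |             vy = svmla_n_f32_x(pg, vy, vx, a);      /* vy += a * vx */
    |             svst1_f32(pg, &y[i], vy);
    |         }
    |     }
    |
    | Plain C loops can also pick up SVE through auto-vectorization with
    | the same -march flag, as mentioned above.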
| ykevinator2 wrote:
| No burstable graviton3's yet :-(
| DeathArrow wrote:
| Such a shame we can't play with a socketed CPU like this and a
| motherboard with EFI support as a workstation.
| jazzythom wrote:
| I hate reading about all the new chips I can't afford. If only
    | there was a standardized universal open source motherboard and
| some type of subscription model where I would always have the
| best chip at the latest fab mailed straight to me on release. I
| mean I only just got my hands on 32 core Epyc. Linus Torvalds has
| had a Threadripper 3970x for years and I still can't afford it
| and I'm still jealous, although to be fair my C skills hit their
| limit when I tried to write pong. I don't like the idea of
| building a new computer around a chip. It's messy and stupid.
| These systems can be made modular if the motherboards packed
| unnecessary bandwidth into the interconnect/planar.
| Erlangen wrote:
| I don't understand these graphs titled "Branch Predictor Pattern
| Recognition". What do they mean? Could someone here explain it a
| bit in detail? Thanks ahead.
| Hizonner wrote:
| It feels like we've gone badly wrong somewhere when processors
| have to spend so many of their resources guessing about the
| program. I am not saying I have a solution, just that feeling.
| staticassertion wrote:
| IDK, that seems like how brains work, and brains are pretty
| cool. They guess all the time in order to save time.
| Cthulhu_ wrote:
    | It always did feel like a weird hack to me to keep parts of the
    | CPU from sitting idle. I mean the performance benefits are there,
    | but it's at the cost of power usage in the end.
    |
    | Can branch prediction be turned off at a compiler or application
    | level? If you're optimizing for energy use, that is.
| Disclaimer: I don't actually know if disabling branch
| prediction is more energy efficient.
| imtringued wrote:
| Turning off branch prediction sounds like a weird hack that
| serves no purpose, just underclock and undervolt your CPU if
| you care about power consumption that much.
| Veedrac wrote:
| Disabling branch prediction would have such a catastrophic
| effect on performance, there is no way it would pay for
| itself. Actually this is true for most parts of a CPU;
| Apple's little cores are extremely power efficient and yet
| they are fairly impressive out-of-order designs. It would
| take a very meaningful redesign of how a CPU works to beat a
| processor like that, at least at useful speeds.
| [deleted]
| tyingq wrote:
| There's the Mill CPU, which sounds terrific on paper. Hard to
| gauge when it might turn into something commercially usable
| though.
| 0xCMP wrote:
| Mill would definitely make things more interesting. They were
| supposed to have their simulator online a while ago, but
| sounds like they needed to redo the work on the compiler
| (from what I understood). Once that comes out it sounds like
| the next step is getting the simulator online for people to
| play with.
| cesaref wrote:
| I thought this was the reasoning behind Itanium, the idea that
| scheduling could be worked out in advance by the compiler
| (probably profile guided from tests or something like that)
| which would reduce the latency and silicon cost of
| implementations.
|
    | However, it wasn't exactly a raging success; I think the
    | predicted amazing compiler tech never materialised. Maybe it is
    | the right answer but the implementation was wrong? I'm no CPU
    | expert...
| Hizonner wrote:
| I'm not sure what happened with Itanium.
|
| I do think a big part of the problem is that people want to
| distribute binaries that will run on a lot of CPUs that are
| physically really different inside. But nowadays there's JIT
| compilation even for JavaScript, so you could distribute
| something like LLVM, or even (ecch) JavaScript itself, and
| have the "compiler scheduling" happen at installation time or
| even at program start.
| imtringued wrote:
| You can't distribute LLVM for that purpose without defining
| a stable format like WebAssembly or SPIR-V.
| Veedrac wrote:
| Itanium was a really badly designed architecture, which a lot
| of people skip over when they try to draw analogies to it. It
| was a worst of three worlds, in that it was big and hot like
| an out-of-order, it had the serial dependency issues of an
| in-order, and it had all the complexity of fancy static
| scheduling without that fancy scheduling actually working.
|
| There have been a small number of attempts since Itanium,
| like NVIDIA's Denver, which make for much better baselines. I
| don't think those are anywhere close to optimal designs, or
| really that they tried hard enough to solve in-order issues
| at all, but they at least seem sane.
| speed_spread wrote:
| Would Itanium have been better served with bytecode and a
| modern JIT? Also, doesn't RISC-V kinda get back on that
    | VLIW track with macro-op fusion, using a very basic
    | instruction set and letting the compiler figure out the
    | best way to order stuff to help the target CPU make sense
    | of it?
| nine_k wrote:
| I heard that the desire to make x86 emulation performant on
| Itanium made things really bad, compared to a "clean" VLIW
| architecture.
| canarypilot wrote:
| Why would you consider prediction based on dynamic conditions
| to be the sign of a dystopian optimization cycle? Isn't it
    | mostly intuitive that interesting program executions are not
    | going to be things you can determine statically (otherwise your
    | compiler would have cleaned them up for you with inlining etc.),
    | or could be determined statically but at too great a cost to meet
    | execution deadlines (JITs and so on) or resource constraints (you
    | don't really want N code clones specialising each branch
    | backtrace to create strictly predictable chains)?
|
    | Or is the worry on the other side: that processors have gotten
    | so out-of-order that only huge dedication to guesswork can keep
    | the beast sated? I don't see this as a million miles from
    | software techniques in JIT compilers that optimistically optimize
    | and later de-optimize when an assumption proves wrong.
|
| I think you might be right to be nervous if you wrote programs
| that took fairly regular data and did fairly regular things to
| it. But, as Itanium learned the hard way, programs have much
| more dynamic, emergent and interesting behaviour than that!
| [deleted]
| amelius wrote:
| I guess the fear is that the CPU might start guessing wrong,
| causing your program to miss deadlines. Also, the heuristics
| are practically useless for realtime computing, where timings
| must be guaranteed.
| nine_k wrote:
| I suppose that if you assume in-order execution and count
| the clock cycles, you should get a guaranteed lower bound
| of performance. It may be, say, 30-40% of the performance
| you really observe, but having some headroom should feel
| good.
| rwmj wrote:
| Uli Drepper has this tool which you can use to annotate source
| code with explanations of which optimisations are applied. In
| this case it would rely on GCC recognizing branches which are
| hard to predict (eg. a branch in an inner loop which is data-
| dependent), and I'm not sure GCC is able to do that.
|
| https://github.com/drepper/optmark
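    |
    | For a concrete picture of the kind of branch meant here (this is
    | just an illustrative snippet, not output from optmark): a
    | data-dependent test in an inner loop, plus the branchless form a
    | compiler may or may not pick on its own.
    |
    |     /* Hard to predict when v[] is effectively random: the branch
    |      * direction depends on the data itself. */
    |     long sum_above(const int *v, long n, int threshold)
    |     {
    |         long sum = 0;
    |         for (long i = 0; i < n; i++) {
    |             if (v[i] > threshold)
    |                 sum += v[i];
    |         }
    |         return sum;
    |     }
    |
    |     /* Branchless form: trades the mispredict-prone branch for a
    |      * conditional select (csel/cmov); compilers sometimes do this
    |      * if-conversion themselves, sometimes not. */
    |     long sum_above_branchless(const int *v, long n, int threshold)
    |     {
    |         long sum = 0;
    |         for (long i = 0; i < n; i++)
    |             sum += (v[i] > threshold) ? v[i] : 0;
    |         return sum;
    |     }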
| bastawhiz wrote:
| Isn't that the whole promise of general purpose computing? That
| you don't need to find specialized hardware for most workloads?
| Nobody wants to be out shopping for CPUs that have features
| that align particularly well with their use case, then
| switching to different CPUs when they need to release an update
| or some customer comes along with data that runs less
| efficiently with the algorithms as written.
|
| Since processors are expensive and hard to change, they do
| tricks to allow themselves to be used more efficiently in
| common cases. That seems like a reasonable behavior to me.
| adrian_b wrote:
| A majority of the non-deterministic and speculative hardware
| mechanisms that exist in a modern CPU are required due to the
| consequences of one single hardware design decision: to use a
| data cache memory.
|
| The data cache memory is one of the solutions to avoid the
| extremely long latency of loading data from a DRAM memory.
|
| The alternative to a data cache memory is to have a hierarchy
| of memories with different speeds, which are addressed
| explicitly.
|
| The latter variant is sometimes chosen for embedded computers
| where determinism is more important than programmer
| convenience. However, for general-purpose computers this
| variant could be acceptable only if the hierarchy of memories
| would be managed automatically by a high-level language
| compiler.
|
| It appears that writing a compiler that could handle the
| allocation of data into a heterogeneous set of memories and the
| transfers between them is a more difficult task than designing
| a CPU that becomes an order of magnitude more complex due to
    | having a hierarchy of data cache memories and a long list of
| other hardware mechanisms that must be added due to the
| existence of the data cache memory.
|
| Once it is decided that the CPU must have a data cache memory,
| a lot of other hardware design decisions follow from it.
|
    | Because load latency grows with the size of a cache, the cache
    | memory must be split into a multi-level hierarchy of cache
    | memories.
|
| To reduce the number of cache misses, data cache prefetchers
| must be added, to speculatively fill the cache lines in advance
| of load requests.
|
    | Now, when a data cache exists, most loads have a small latency,
    | but from time to time there is still a cache miss, where the
    | latency is huge: long enough to execute hundreds of
    | instructions.
|
| There are 2 solutions to the problem of finding instructions to
| be executed during cache misses, instead of stalling the CPU:
| simultaneous multi-threading and out-of-order execution.
|
| For explicitly addressed heterogeneous memories, neither of
| these 2 hardware mechanisms is needed, because independent
| instructions can be scheduled statically to overlap the memory
    | transfers. With a data cache, this is not possible, because it
    | cannot be predicted statically when cache misses will occur,
    | mainly due to the activity of other execution threads. Even an
    | if-then-else can prevent static prediction of the cache state,
    | unless the compiler inserts additional load instructions to
    | ensure that the cache state does not depend on the selected
    | branch of the conditional statement; and that does not work for
    | external library functions or other execution threads.
|
| With a data cache memory, one or both of SMT and OoOE must be
| implemented. If out-of-order execution is implemented, then the
| number of registers needed to avoid false dependencies between
| instructions becomes larger than it is convenient to encode in
    | the instructions, so register renaming must also be
| implemented.
|
| And so on.
|
| In conclusion, to avoid the huge amount of resources needed by
| a CPU for guessing about the programs, the solution would be a
| high-level language compiler able to transparently allocate the
| data into a hierarchy of heterogeneous memories and schedule
| transfers between them when needed, like the compilers do now
| for register allocation, loading and storing.
|
    | Unfortunately, nobody has succeeded in demonstrating a good
| compiler of this kind.
|
    | Moreover, existing compilers frequently have difficulty
    | discovering the optimal allocation and transfer schedule for
    | registers, which is a simpler problem.
    |
    | Doing the same efficiently for a hierarchy of heterogeneous
    | memories seems out of reach for current compilers.
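    |
    | To make the "explicitly addressed memories" alternative concrete,
    | here is a rough sketch of the double-buffered style used on DSPs
    | and Cell-like designs, where a statically scheduled transfer
    | overlaps the compute. dma_start()/dma_wait() are hypothetical
    | placeholders for whatever scratchpad/DMA API a platform provides.
    |
    |     #define TILE 1024
    |
    |     /* hypothetical platform API: start an async copy into local
    |      * memory, and wait for the last started copy to complete */
    |     extern void dma_start(void *dst, const void *src, long bytes);
    |     extern void dma_wait(void);
    |
    |     static float scratch[2][TILE];   /* fast, explicitly managed memory */
    |
    |     void process_all(const float *big, long ntiles, float *out)
    |     {
    |         int cur = 0;
    |         dma_start(scratch[cur], &big[0], sizeof(scratch[cur]));
    |         for (long t = 0; t < ntiles; t++) {
    |             dma_wait();                    /* tile t is now resident */
    |             if (t + 1 < ntiles)            /* statically scheduled prefetch */
    |                 dma_start(scratch[cur ^ 1], &big[(t + 1) * TILE],
    |                           sizeof(scratch[0]));
    |             float acc = 0.0f;
    |             for (int i = 0; i < TILE; i++) /* compute overlaps the transfer */
    |                 acc += scratch[cur][i];
    |             out[t] = acc;
    |             cur ^= 1;
    |         }
    |     }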
| solarexplorer wrote:
| We do have these architectures already in the embedded space
    | and as DSPs. I suppose they would be interesting for
| supercomputers as well. But for general purpose CPUs they
| would be a difficult sell. Since the memory size and latency
| would be part of the ISA, binaries could not run unchanged on
| different memory configurations, you would need another
| software layer to take care of that. Context switching and
| memory mapping would also need some rethinking. Of course,
| all of this can be solved, but it would make adoption more
| difficult.
|
    | And last but not least, unknown memory latency is not the only
| source of problems, branch (mis)predictions are another. And
| they have the same remedies as cache misses: multithreading
| and speculative execution.
|
| So if you wanted to get rid of branch prediction as well, you
| could come up with something like the CRAY-1.
| adrian_b wrote:
| You are right that a kind of multi-threading can be useful
| to mitigate the effects of branch mispredictions.
|
| However, for this, fine-grained multi-threading is enough.
| Simultaneous multi-threading does not bring any advantage,
| because the thread with the mispredicted branch cannot
| progress.
|
| Out-of-order execution cannot be used during branch
    | mispredictions, so as I have said, both SMT and OoOE are
| techniques useful only when a data cache memory exists.
|
    | Any CPU with pipelined instruction execution needs a branch
    | predictor, and it needs to speculatively execute the
    | instructions on the predicted path, in order to avoid the
| pipeline stalls caused by control dependencies between
| instructions. An instruction cache memory is also always
| needed for a CPU with pipelined instruction execution, to
| ensure that the instruction fetch rate is high enough.
|
| Unlike simultaneous multi-threading, fine-grained multi-
| threading is useful in a CPU without a data cache memory,
| not only because it can hide the latencies of branch
| mispredictions, but also because it can hide the latencies
    | of any long operations, as is done in all GPUs.
|
| Fine-grained multi-threading is significantly simpler to
| implement than simultaneous multi-threading.
| mhh__ wrote:
| People have tried over and over again to "fix" this and it
| hasn't worked.
|
| The interesting probabilities are all decided at runtime.
|
    | Now that we have AI workloads, there is a place for a big lump
    | of dumb compute again, but not in general-purpose code.
| Dunedan wrote:
| > The final result is a chip that lets AWS sell each Graviton 3
| core at a lower price, while still delivering a significant
| performance boost over their previous Graviton 2 chip.
|
| That's not correct. AWS sells Graviton 3 based EC2 instances at a
| higher price than Graviton 2 based instances!
|
| For example a c6g.large instance (powered by Graviton 2) costs
| $0.068/hour in us-east-1, while a c7g.large instance (powered by
| Graviton 3) costs $0.0725/hour [1]. Both instances have the same
| core count and memory, although c7g instances have slightly
| better network throughput.
|
| I believe that is pretty unusual as, if my memory serves me
    | right, newer instance family generations are usually cheaper
| than the previous generation.
|
| [1]: https://aws.amazon.com/ec2/pricing/on-demand/
| adrian_b wrote:
| Based on the first published benchmarks, even the programs
| which have not been optimized for Neoverse V1, and which do not
| benefit from its much faster floating-point and large-integer
| computation abilities, still show a performance increase of at
    | least 40%, which is greater than the price increase.
|
| So I believe that using Graviton 3 at these prices is still a
| much better deal than using Graviton 2.
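    |
    | Rough arithmetic with the on-demand prices quoted above (and
    | treating the ~40% figure as a lower bound):
    |
    |     price increase:  (0.0725 - 0.068) / 0.068  ~= 6.6%
    |     perf per dollar: 1.40 / 1.066              ~= 1.31  (~31% better)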
| myroon5 wrote:
| Definitely unusual, as the graph here shows:
| https://github.com/patmyron/cloud/
| WASDx wrote:
| Could it be due to increasing global energy prices?
| usefulcat wrote:
| I don't follow. You seem to be implying that Amazon would
| like to reduce their electricity usage. If so, shouldn't
| they be charging _less_ for the more efficient instance
| type?
| nine_k wrote:
| No, they charge for compute, which the new CPU provides
| more of, even though it consumes the same amount of
| electricity as a unit.
| jeffbee wrote:
| It would be irrational to expect a durable lower price on
| graviton. Amazon will price it lower initially to entice
| customers to port their apps, but after they get a critical
| mass of demand the price will rise to a steady state where it
| costs the same as Intel. The only difference will be which guy
| is taking your money.
| zrail wrote:
| Do you have a cite on Amazon raising prices like that at any
| other point in their history?
| greatpostman wrote:
| I don't think Amazon has ever raised their prices. This
| comment is based on nothing
| losteric wrote:
| Prime has gone up quite a bit
|
| Nearly every business seeks to maximize profit. Right now
| AWS is in growth phase - why wouldn't they raise rates in
| the future?
| orf wrote:
| I mean, they just raised their graviton prices between
| generations.
|
| I don't think the point was that they would increase the
| cost of existing instance types, only that over time and
| generations the price will trend upwards as more workloads
| shift over.
| staticassertion wrote:
| I wouldn't call that "raising prices"... you can still
| use Graviton 2 if it's a better price for you.
| jhugo wrote:
| I dunno, this take is a bit weird to me. The work we did to
| support Graviton wasn't "moving" from Intel to ARM, it was
| making our build pipeline arch-agnostic. If Intel one day
| works out cheaper again we'll use it again with basically
| zero friction.
| ykevinator2 wrote:
| Same
| dilyevsky wrote:
    | Considering the blank stares I get when mentioning ARM as a
    | potential cost-saving measure, it will take years and maybe
    | decades before that happens, by which point you're def getting
    | your money's worth as an early adopter.
| spookthesunset wrote:
| When is the last time Amazon has raised cloud prices?
| jeffbee wrote:
| Literally 6 days ago when they introduced this thing.
| dragonwriter wrote:
| > Literally 6 days ago when they introduced this thing.
|
| Offering a new option is not a price increase. You can
| still do all the same things at the same prices, plus if
| the new thing is more efficient for your particular task
| you have an additional option.
| jeffbee wrote:
| When they introduced c6i they did it at the same price as
| c5, even though the c6i is a lot more efficient. They're
| raising the price on c7g vs. c6g to bring it closer to
| the pricing of c6i, which is pretty much exactly what I
| suggested?
| deanCommie wrote:
| You're being highly obtuse.
|
| Universally everyone understands "raising prices" to be -
| "raising prices without any customer action".
|
| As in you consider your options, take into consideration
| pricing, design your architecture, you deploy it, and you
| get a bill. Then suddenly, later, without any action of
| your own, your bill goes up.
|
| THAT is raising prices, and it is something AWS has
| essentially never done.
|
| What you're describing is a situation where a customer
| CHOOSES to upgrade to a new generation of instances, and
| in doing so gets a larger bill. That is nowhere near the
| same thing.
| arpinum wrote:
| Graviton 2 (c6g) also cost more than the Graviton 1 (a1)
| instances
| mastax wrote:
| Given the surrounding context I read that sentence to mean that
| focusing on compute density allowed them to sell each core at a
| lower price vs focusing on performance, not that Graviton 3 is
| cheaper than Graviton 2.
| invalidname wrote:
| While the article is interesting I would be more interested in
| details about carbon footprint and cost reduction. Also how would
| this impact more typical node, Java loads?
| Hizonner wrote:
| You know, if you wanted to improve carbon footprint, a better
| place to look might be at software bloat. The sheer number of
| times things get encoded and decoded to text is mind boggling.
| Especially in "typical node, Java loads".
| tyingq wrote:
| Logging and Cybersecurity are bloaty areas as well. I've seen
| plenty of AWS cost breakdowns where the cybersec functions
| were the highest percentage of spend. Or desktops where
    | Carbon Black or Windows Defender were using most of the CPU
| or IO cycles. And networks where syslog traffic was the
| biggest percentage of traffic.
| Dunedan wrote:
| As AWS doesn't price services based on carbon footprint,
| you can't infer the carbon footprint from the cost.
|
| I agree however that certain AWS services are
    | disproportionately expensive.
| maxerickson wrote:
| Presumably the price provides some sort of bounds.
|
| (Unless they are doing something like putting profits
| towards some sort of carbon maximization scheme)
| tyingq wrote:
| Well and a fair amount of cybersec oriented services are
| a pattern of _" sniff and copy every bit of data and do
| something with it"_ or _" trawl all state"_. Which is
| inherently heavy.
| orangepurple wrote:
| Norton, Symantec, and McAfee contribute greatly to global
| warming in the financial services sector. At least half of
| CPU cycles on employee laptops are devoted to them.
| Cthulhu_ wrote:
| But do they actually work? For years I've been of the
| opinion that most anti-virus solutions don't actually
| stop virusses, instead they give you a false sense of
| security and their messaging is intentionally alarmist to
| make individuals and organizations pay their subscription
| fees.
|
| In my limited and sheltered experience, the only viruses
    | I've gotten in the past decade or so were from dodgy
| pirated stuff or big "download" button ads on download
| sites.
| MrBuddyCasino wrote:
| At best they don't work, in reality they are an attack
| vector themselves and a performance nightmare. They
| should (mostly) not exist.
| MaxBarraclough wrote:
| Presumably then they're knocking hours off the laptops'
| battery lives?
| jeffbee wrote:
| Virtually 100% of cloud operating expenses are electricity, so
| you can pretty much assume that if it costs less it has a lower
| carbon footprint.
| _joel wrote:
| + Rent, support staff, development costs, regulation and
| compliance, network, maintenance (cooling, fire suppression +
| lots more), marketing.
|
| Speaking as someone who did sys admin for a small independent
| cloud provider, it definitely isn't virtually 100% of
| operating costs
| jeffbee wrote:
| No offense intended to your personal experience, but I
| don't think "small independent cloud" is terribly important
| in the global analysis. This paper concludes that TDP and
| TCO have become the same thing, i.e. power is heat, power
| is opex.
|
| https://www.gwern.net/docs/ai/scaling/hardware/2021-jouppi.
| p...
| shepherdjerred wrote:
    | AWS is pushing to move its internal services (most of which are
    | in Java) to Graviton, so I would expect it to be excellent for
| "normal" workloads/languages
___________________________________________________________________
(page generated 2022-05-29 23:00 UTC)