http://muratbuffalo.blogspot.com/2021/12/graviton2-and-graviton3.html

Graviton2 and Graviton3

- December 04, 2021

What do modern cloud workloads look like? And what does that have to do with new chip designs? I found these gems in Peter DeSantis's ReInvent20 and ReInvent21 talks. These talks are very informative and educational. Me likey! The speakers at ReInvent are not just introducing new products/services, they are also explaining the thought processes behind them. To come up with this summary, I edited the YouTube video transcripts slightly (mostly shortening them). The presentation narratives have been planned really well, so I think this makes a good read.

Graviton2

This part is from the ReInvent2020 talk by Peter DeSantis.

Graviton2 is the best performing general purpose processor in our cloud by a wide margin. It also offers significantly lower cost, and it is the most power efficient processor we've ever deployed. Our plan was to build a processor that was optimized for AWS and modern cloud workloads. But what do modern cloud workloads look like?

Let's start by looking at what a modern processor looks like. For a long time, the main difference between one processor generation and the next was the speed of the processor. And this was great while it lasted. But about 15 years ago, this all changed. New processors continued to improve their performance, but not nearly as quickly as they had in the past. Instead, new processors started adding cores. You can think of a core like a mini processor on the chip. Each core on the chip can work independently and at the same time as all the other cores. This means that if you can divide your work up, you can get that work done in parallel. Processors went from one core to two and then four.

So, how did workloads adapt to this new reality? Well, the easiest way to take advantage of cores is to run more independent applications on the server, and modern operating systems have gotten very good at scheduling and managing multiple processes on high core-count systems. Another approach is multi-threaded applications. Multi-threaded applications allow builders to have the appearance of scaling up while taking advantage of parallel execution.
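As a concrete illustration (my addition, not from the talk), here is a minimal C sketch of that multi-threaded scale-out pattern: one job split into independent chunks, one thread per core, with results combined at the end. Thread count and array size are illustrative.

/* Minimal sketch of scale-out on a multi-core chip: divide one job
 * into independent chunks, one thread per core, combine at the end.
 * Build: cc -O2 -pthread sum.c */
#include <pthread.h>
#include <stdio.h>
#include <stddef.h>

#define NTHREADS 4          /* illustrative; use one thread per core */
#define N (1 << 22)         /* 4M doubles, ~32 MB */

static double data[N];

struct chunk { size_t lo, hi; double sum; };

static void *partial_sum(void *arg) {
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)
        s += data[i];
    c->sum = s;             /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];

    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    /* Carve the array into NTHREADS independent ranges; the threads
     * run in parallel, one per core, with no shared mutable state. */
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].lo = (size_t)t * N / NTHREADS;
        chunks[t].hi = (size_t)(t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, partial_sum, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].sum;
    }
    printf("total = %.0f\n", total);   /* prints 4194304 */
    return 0;
}

Because the chunks share nothing, adding cores adds throughput almost linearly, which is exactly the property scale-out cloud workloads exploit.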
While scale-out computing has evolved to take advantage of higher core-count processors, processor designers never really abandoned the old world. Modern processors have tried to have it both ways, catering to both legacy applications and modern scale-out applications. And this makes sense if you think about it. As I mentioned, producing a new processor can cost hundreds of millions of dollars, and the way you justify that sort of large upfront investment is by targeting the broadest market possible. So modern many-core processors have, unsurprisingly, tried to appeal to both legacy applications and modern scale-out applications. (In the ReInvent21 talk, Peter referred back to this with the El Camino analogy: the El Camino tries to be both a passenger car and a pickup truck, and as a result is not very good at either.)

Cores got so big and complex that it was hard to keep everything utilized. And the last thing you want is transistors on your processor doing nothing. To work around this limitation, processor designers invented a new concept called simultaneous multi-threading, or SMT. SMT allows a single core to work on multiple tasks. Each task is called a thread. Threads share the core, so SMT doesn't double your performance, but it does allow you to make use of that big core, improving your performance by maybe 20-30%.

But SMT also has drawbacks. The biggest drawback of SMT is that it introduces overhead and performance variability. Because each core has to work on multiple tasks, each task's performance depends on what the other tasks around it are doing. Workloads can contend for the same resources, like cache space, slowing down the other threads on the same core. There are also security concerns with SMT: side channel attacks try to use SMT to inappropriately access information belonging to one thread from another. To ensure customers are never exposed to these potential SMT side channel attacks, EC2 doesn't share threads from the same processor core across multiple customers.

And SMT isn't the only way processor designers have tried to compensate for overly large and complex cores. The only thing worse than idle transistors is idle transistors that use power. So modern cores have complex power management functions that attempt to turn off or turn down parts of the processor to manage power usage. The problem is, these power management features introduce even more performance variability. Basically, all sorts of things can happen to your application, and you have no control over them.

In this context, you can now understand how Graviton2 is different. The first thing we did with Graviton2 was focus on making sure that each core delivered the most real-world performance for modern cloud workloads. When I say real-world performance, I mean better performance on actual workloads, not things that lead to better spec-sheet stats, like processor frequency or performance microbenchmarks, which don't capture real-world performance. We used our experience running real scale-out applications to identify where we needed to add capabilities to assure optimal performance without making our cores too bloated. Second, we designed Graviton2 with as many independent cores as possible. When I say independent, I mean Graviton2 cores are designed to perform consistently: no overlapping SMT threads, no complex power state transitions. Therefore you get no unexpected throttling, just consistent performance.

And some of our design choices actually help us with both of these goals. Let me give you an example. Caches help your cores run fast by hiding the fact that system memory runs hundreds of times slower than the processor. Processors often use several layers of caches. Some are slower and shared by all the cores, and some are local to a core and run much faster. With Graviton2, one of the things we prioritized was large core-local caches. In fact, the core-local L1 caches on Graviton2 are twice as large as those of current generation x86 processors. And because we don't have SMT, this whole cache is dedicated to a single execution thread, not shared by competing execution threads. This means that each Graviton2 core has four times the local L1 cache of SMT-enabled x86 processors, so each core can execute faster and with less variability.
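To make the cache point concrete, here is a minimal C sketch (my addition, not from the talk; sizes and iteration counts are illustrative and untuned). On most machines, the time per touched cache line jumps visibly once the working set spills out of the core-local L1.

/* Minimal sketch of why core-local cache size matters: time per
 * touched cache line jumps once the working set no longer fits in L1.
 * Build: cc -O2 cache.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_line(volatile long *buf, size_t n, int iters) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i += 8)   /* one 64-byte line per step */
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)iters * (n / 8));
}

int main(void) {
    /* 16-64 KB should sit in L1; 32 MB spills to L2/L3/memory. */
    size_t sizes[] = {16u << 10, 64u << 10, 1u << 20, 32u << 20};
    for (int s = 0; s < 4; s++) {
        size_t n = sizes[s] / sizeof(long);
        long *buf = calloc(n, sizeof(long));
        if (!buf) return 1;
        int iters = sizes[s] > (1u << 20) ? 8 : 512;
        printf("%8zu KB: %6.2f ns per line\n", sizes[s] >> 10,
               ns_per_line(buf, n, iters));
        free(buf);
    }
    return 0;
}

With SMT, two threads would split that L1 between them; without SMT, the whole cache stays hot for one thread, which is the consistency argument Peter is making.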
Because Graviton2 is an Arm processor, a lot of people will assume that Graviton2 performs well for front-end applications, but they doubt it can perform well enough for serious I/O-intensive back-end applications. This is not the case. So, let's look at a Postgres database workload running HammerDB, a standard database benchmark. We are going to compare Graviton2 with the m5 instances. As we consider the larger m5 instance sizes (blue bars), you can see right away that the scaling isn't very good. There are a few reasons for this flattening, but it mostly comes down to sharing memory across two different processors, and that sharing adds latency and variability to memory access. And like all variability, this makes it hard for scale-out applications to scale efficiently.

[Screen] Let's now look at Graviton2 (green bars). Here we can see the M6g instance on the same benchmark. You can see that M6g delivers better absolute performance at every size. But that's not all. First, M6g scales almost linearly all the way up to the largest, 64-core instance size. By the time you get to 48 cores, you have better absolute performance than even the largest m5 instance, which has twice as many threads. And for your most demanding workloads, the 64-core M6g instance provides over 20% better absolute performance than any m5 instance. But this isn't the whole story. Things get even better when we factor in the lower cost of the M6g. The larger instance sizes are nearly 60% lower cost. And because the M6g scales down better than a threaded processor, you can save even more with the smaller instances: over 80% on this workload.

Graviton3

This part is from Peter's ReInvent21 talk.

We knew that if we built a processor that was optimized specifically for modern workloads, we could dramatically improve the performance, reduce the cost, and increase the efficiency of the vast majority of workloads in the cloud. And that's what we did with Graviton. A lot happened over the last year, so let me quickly get you caught up. We released Graviton-optimized versions of our most popular AWS managed services. We also released Graviton support for Fargate and Lambda, extending the benefits of Graviton to serverless computing. ... So where do we go from here? How do we build on the success of Graviton2? We are previewing Graviton3, which will provide at least 25 percent improved performance for most workloads. Remember, Graviton2, which was released less than 18 months ago, already provides the best performance for many workloads, so this is another big jump. So how did we accomplish that?

[Screen] Here are the sticker stats. They may look impressive (they are), but as I mentioned last year, and this bears repeating: the most important thing we're doing with Graviton is staying laser focused on the performance of real workloads, your workloads! When you're designing a new chip, it can be tempting to optimize the chip for these sticker stats, like processor frequency or core count, and while these things are important, they're not the end goal. The end goal is the best performance and the lowest cost for real workloads. I'm going to show you how we do that with Graviton3, and I'm also going to show you how, if you focus on these sticker stats, you can actually be led astray.

When you look to make a processor faster, the first thing that probably comes to mind is to increase the processor frequency. For many years we were spoiled, because each new generation of processor ran at a higher frequency than the previous generation, and higher frequency means the processor runs faster. That's delightful, because magically everything just runs faster. The problem is that when you increase the frequency of a processor, you need to increase the amount of power you're sending to the chip. Up until about 15 years ago, every new generation of silicon technology allowed transistors to be operated at lower and lower voltages, a property called Dennard scaling. Dennard scaling made processors more power efficient and enabled processor frequencies to be increased without raising the power of the overall processor. But Dennard scaling has slowed down as we've approached the minimum voltage threshold of a functional transistor in silicon.
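As an aside (my addition, the standard first-order model, not from the talk), dynamic power is commonly approximated as

$$P_{dyn} \approx \alpha \, C \, V^2 f$$

where $\alpha$ is the switching activity, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Dennard scaling let designers lower $V$ every generation, which offset increases in $f$. With $V$ now near its floor, pushing $f$ higher (which in practice also requires pushing $V$ higher) makes power grow superlinearly.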
So now if we want to keep increasing processor frequency, we need to increase the power on the chip. Maybe you've heard about or even tried overclocking a CPU. To overclock a CPU you need to feed a lot more power into the server, and that means you get a lot more waste heat, so you need to find a way to cool the processor. This is not a great idea in a data center. Higher power means higher cost. It means more heat. And it means lower efficiency.

So how do we increase the performance of Graviton without reducing power efficiency? The answer is we make the core wider! A wider core is able to do more work per cycle. So instead of increasing the number of cycles per second, we increase the amount of work you can do in each cycle. With Graviton3 we've increased the width of the core in a number of ways. One example: we've increased the number of instructions that each core can work on concurrently from five to eight instructions per cycle. This is called instruction execution parallelism. How well each application does with this additional core width will vary, and it depends on really clever compilation, but our testing tells us that most workloads will see at least 25% faster performance, and some workloads, like Nginx, are seeing 60% performance improvement.

Higher instruction execution parallelism is not the only way to increase performance. By making things wider, you can also increase the width of the data you're processing. A great example of this is vector operations. Graviton3 doubles the size of the vectors that can be operated on in a single cycle, and this will have a significant impact on workloads like video encoding and encryption.
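To illustrate (my addition, not from the talk), the loop below is the kind of data-parallel kernel that vector units accelerate. With an autovectorizing compiler, the source doesn't change; only the number of elements processed per vector instruction does.

#include <stddef.h>

/* SAXPY: y = a*x + y. An autovectorizing compiler (cc -O3) maps this
 * loop onto whatever vector width the core provides; wider vectors
 * mean fewer instructions, and fewer cycles, for the same n. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

A core with 128-bit vectors handles four floats per instruction here; doubling to 256-bit vectors handles eight, which is why vector-heavy workloads like encoding and encryption see outsized gains.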
Adding more cores is an effective way of improving processor performance, and generally you want as many cores as you can fit. But you need to be careful here as well, because there are trade-offs that impact the application. When we looked closely at real workloads running on Graviton2, what we saw is that most workloads could actually run more efficiently if they had more memory bandwidth and lower-latency access to memory. That isn't surprising: modern cloud workloads are using more memory and becoming more and more sensitive to memory latency. So rather than using our extra transistors to pack more cores onto Graviton3, we decided to use our transistors to improve memory performance! Graviton2 instances already had a lot of memory bandwidth per vCPU, but we decided to add even more to Graviton3. Each Graviton3 core has 50% more memory bandwidth than Graviton2. C7g, powered by Graviton3, is the first cloud instance to support the new DDR5 memory standard, which also improves memory performance.
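To see the memory-bandwidth point in code (my addition; a hypothetical sketch in the style of the STREAM benchmark, not the STREAM suite itself), the triad loop below does almost no compute per byte moved, so its throughput is set by memory bandwidth, the resource Graviton3 adds per core. The array size is illustrative and must exceed the last-level cache.

/* STREAM-style triad sketch: bandwidth-bound, not compute-bound.
 * Build: cc -O2 triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 25)     /* 32M doubles: 256 MB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];      /* triad: 2 reads + 1 write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad: %.1f GB/s (a[0]=%.0f)\n",
           3.0 * N * sizeof(double) / secs / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}

Run one copy per core and the aggregate number is bounded by the chip's memory bandwidth, which is why spending transistors on bandwidth rather than extra cores can make most workloads run more efficiently.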