http://muratbuffalo.blogspot.com/2021/12/graviton2-and-graviton3.html

Graviton2 and Graviton3

- December 04, 2021

What do modern cloud workloads look like? And what does that have to do with new chip designs? I found these gems in Peter DeSantis's ReInvent20 and ReInvent21 talks. These talks are very informative and educational. Me likey! The speakers at ReInvent are not just introducing new products/services, they are also explaining the thought processes behind them. To come up with this summary, I edited the YouTube video transcripts slightly (mostly shortening them). The presentation narratives have been planned really well, so I think this makes a good read.

Graviton2

This part is from the ReInvent2020 talk by Peter DeSantis.

Graviton2 is the best performing general purpose processor in our cloud by a wide margin. It also offers significantly lower cost, and it is the most power efficient processor we've ever deployed. Our plan was to build a processor that was optimized for AWS and modern cloud workloads. But what do modern cloud workloads look like?

Let's start by looking at what a modern processor looks like. For a long time, the main difference between one processor generation and the next was the speed of the processor. And this was great while it lasted. But about 15 years ago, this all changed. New processors continued to improve their performance, but not nearly as quickly as they had in the past. Instead, new processors started adding cores. You can think of a core like a mini processor on the chip. Each core on the chip can work independently and at the same time as all the other cores. This means that if you can divide your work up, you can get that work done in parallel. Processors went from one core to two and then four.

So, how did workloads adapt to this new reality? Well, the easiest way to take advantage of cores is to run more independent applications on the server, and modern operating systems have gotten very good at scheduling and managing multiple processes on high core-count systems. Another approach is multi-threaded applications. Multi-threaded applications allow builders to have the appearance of scaling up while taking advantage of parallel execution.
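As a concrete illustration (my addition, not from the talk), here is a minimal C sketch of that multi-threaded scale-out pattern: one job split into independent chunks, one thread per core, with results combined at the end. Thread count and array size are illustrative.

/* Minimal sketch of scale-out on a multi-core chip: divide one job
 * into independent chunks, one thread per core, combine at the end.
 * Build: cc -O2 -pthread sum.c */
#include <pthread.h>
#include <stdio.h>
#include <stddef.h>

#define NTHREADS 4          /* illustrative; use one thread per core */
#define N (1 << 22)         /* 4M doubles, ~32 MB */

static double data[N];

struct chunk { size_t lo, hi; double sum; };

static void *partial_sum(void *arg) {
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)
        s += data[i];
    c->sum = s;             /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];

    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;

    /* Carve the array into NTHREADS independent ranges; the threads
     * run in parallel, one per core, with no shared mutable state. */
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].lo = (size_t)t * N / NTHREADS;
        chunks[t].hi = (size_t)(t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, partial_sum, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].sum;
    }
    printf("total = %.0f\n", total);   /* prints 4194304 */
    return 0;
}

Because the chunks share nothing, adding cores adds throughput almost linearly, which is exactly the property scale-out cloud workloads exploit.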
While scale-out computing has evolved to take advantage of higher core-count processors, processor designers never really abandoned the old world. Modern processors have tried to have it both ways, catering to both legacy applications and modern scale-out applications. And this makes sense if you think about it. As I mentioned, producing a new processor can cost hundreds of millions of dollars, and the way you justify that sort of large upfront investment is by targeting the broadest market possible. So modern many-core processors have, unsurprisingly, tried to appeal to both legacy applications and modern scale-out applications. (In the ReInvent21 talk, Peter referred back to this with the El Camino analogy: the El Camino tries to be both a passenger car and a pickup truck, and as a result is not very good at either.)

Cores got so big and complex that it was hard to keep everything utilized. And the last thing you want is transistors on your processor doing nothing. To work around this limitation, processor designers invented a new concept called simultaneous multi-threading, or SMT. SMT allows a single core to work on multiple tasks. Each task is called a thread. Threads share the core, so SMT doesn't double your performance, but it does allow you to make use of that big core, improving your performance by maybe 20-30%.

But SMT also has drawbacks. The biggest drawback of SMT is that it introduces overhead and performance variability. Because each core has to work on multiple tasks, each task's performance depends on what the other tasks around it are doing. Workloads can contend for the same resources, like cache space, slowing down the other threads on the same core. There are also security concerns with SMT: side channel attacks try to use SMT to inappropriately access information belonging to one thread from another. To ensure customers are never exposed to these potential SMT side channel attacks, EC2 doesn't share threads from the same processor core across multiple customers.

And SMT isn't the only way processor designers have tried to compensate for overly large and complex cores. The only thing worse than idle transistors is idle transistors that use power. So modern cores have complex power management functions that attempt to turn off or turn down parts of the processor to manage power usage. The problem is, these power management features introduce even more performance variability. Basically, all sorts of things can happen to your application, and you have no control over them.

In this context, you can now understand how Graviton2 is different. The first thing we did with Graviton2 was focus on making sure that each core delivered the most real-world performance for modern cloud workloads. When I say real-world performance, I mean better performance on actual workloads, not things that lead to better spec-sheet stats, like processor frequency or performance microbenchmarks, which don't capture real-world performance. We used our experience running real scale-out applications to identify where we needed to add capabilities to assure optimal performance without making our cores too bloated. Second, we designed Graviton2 with as many independent cores as possible. When I say independent, I mean Graviton2 cores are designed to perform consistently: no overlapping SMT threads, no complex power state transitions. Therefore you get no unexpected throttling, just consistent performance.

And some of our design choices actually help us with both of these goals. Let me give you an example. Caches help your cores run fast by hiding the fact that system memory runs hundreds of times slower than the processor. Processors often use several layers of caches. Some are slower and shared by all the cores, and some are local to a core and run much faster. With Graviton2, one of the things we prioritized was large core-local caches. In fact, the core-local L1 caches on Graviton2 are twice as large as those of current generation x86 processors. And because we don't have SMT, this whole cache is dedicated to a single execution thread, not shared by competing execution threads. This means that each Graviton2 core has four times the local L1 cache of SMT-enabled x86 processors, so each core can execute faster and with less variability.
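To make the cache point concrete, here is a minimal C sketch (my addition, not from the talk; sizes and iteration counts are illustrative and untuned). On most machines, the time per touched cache line jumps visibly once the working set spills out of the core-local L1.

/* Minimal sketch of why core-local cache size matters: time per
 * touched cache line jumps once the working set no longer fits in L1.
 * Build: cc -O2 cache.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_line(volatile long *buf, size_t n, int iters) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i += 8)   /* one 64-byte line per step */
            sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ((double)iters * (n / 8));
}

int main(void) {
    /* 16-64 KB should sit in L1; 32 MB spills to L2/L3/memory. */
    size_t sizes[] = {16u << 10, 64u << 10, 1u << 20, 32u << 20};
    for (int s = 0; s < 4; s++) {
        size_t n = sizes[s] / sizeof(long);
        long *buf = calloc(n, sizeof(long));
        if (!buf) return 1;
        int iters = sizes[s] > (1u << 20) ? 8 : 512;
        printf("%8zu KB: %6.2f ns per line\n", sizes[s] >> 10,
               ns_per_line(buf, n, iters));
        free(buf);
    }
    return 0;
}

With SMT, two threads would split that L1 between them; without SMT, the whole cache stays hot for one thread, which is the consistency argument Peter is making.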
Because Graviton2 is an Arm processor, a lot of people will assume that Graviton2 performs well for front-end applications, but they doubt it can perform well enough for serious I/O-intensive back-end applications. This is not the case. So, let's look at a Postgres database workload running HammerDB, a standard database benchmark. We are going to compare Graviton2 with the m5 instances. As we consider the larger m5 instance sizes (blue bars), you can see right away that the scaling isn't very good. There are a few reasons for this flattening, but it mostly comes down to sharing memory across two different processors, and that sharing adds latency and variability to memory access. And like all variability, this makes it hard for scale-out applications to scale efficiently.

[Screen] Let's now look at Graviton2 (green bars). Here we can see the M6g instance on the same benchmark. You can see that M6g delivers better absolute performance at every size. But that's not all. First, M6g scales almost linearly all the way up to the largest, 64-core instance size. By the time you get to 48 cores, you have better absolute performance than even the largest m5 instance, which has twice as many threads. And for your most demanding workloads, the 64-core M6g instance provides over 20% better absolute performance than any m5 instance. But this isn't the whole story. Things get even better when we factor in the lower cost of the M6g. The larger instance sizes are nearly 60% lower cost. And because the M6g scales down better than a threaded processor, you can save even more with the smaller instances: over 80% on this workload.

Graviton3

This part is from Peter's ReInvent21 talk.

We knew that if we built a processor that was optimized specifically for modern workloads, we could dramatically improve the performance, reduce the cost, and increase the efficiency of the vast majority of workloads in the cloud. And that's what we did with Graviton. A lot happened over the last year, so let me quickly get you caught up. We released Graviton-optimized versions of our most popular AWS managed services. We also released Graviton support for Fargate and Lambda, extending the benefits of Graviton to serverless computing. ... So where do we go from here? How do we build on the success of Graviton2? We are previewing Graviton3, which will provide at least 25 percent improved performance for most workloads. Remember, Graviton2, which was released less than 18 months ago, already provides the best performance for many workloads, so this is another big jump. So how did we accomplish that?

[Screen] Here are the sticker stats. They may look impressive (they are), but as I mentioned last year, and this bears repeating: the most important thing we're doing with Graviton is staying laser focused on the performance of real workloads, your workloads! When you're designing a new chip, it can be tempting to optimize the chip for these sticker stats, like processor frequency or core count, and while these things are important, they're not the end goal. The end goal is the best performance and the lowest cost for real workloads. I'm going to show you how we do that with Graviton3, and I'm also going to show you how, if you focus on these sticker stats, you can actually be led astray.

When you look to make a processor faster, the first thing that probably comes to mind is to increase the processor frequency. For many years we were spoiled, because each new generation of processor ran at a higher frequency than the previous generation, and higher frequency means the processor runs faster. That's delightful, because magically everything just runs faster. The problem is that when you increase the frequency of a processor, you need to increase the amount of power you're sending to the chip. Up until about 15 years ago, every new generation of silicon technology allowed transistors to be operated at lower and lower voltages, a property called Dennard scaling. Dennard scaling made processors more power efficient and enabled processor frequencies to be increased without raising the power of the overall processor. But Dennard scaling has slowed down as we've approached the minimum voltage threshold of a functional transistor in silicon.
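As an aside (my addition, the standard first-order model, not from the talk), dynamic power is commonly approximated as

$$P_{dyn} \approx \alpha \, C \, V^2 f$$

where $\alpha$ is the switching activity, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Dennard scaling let designers lower $V$ every generation, which offset increases in $f$. With $V$ now near its floor, pushing $f$ higher (which in practice also requires pushing $V$ higher) makes power grow superlinearly.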
So now if we want to keep increasing processor frequency, we need to increase the power on the chip. Maybe you've heard about or even tried overclocking a CPU. To overclock a CPU you need to feed a lot more power into the server, and that means you get a lot more waste heat, so you need to find a way to cool the processor. This is not a great idea in a data center. Higher power means higher cost. It means more heat. And it means lower efficiency.

So how do we increase the performance of Graviton without reducing power efficiency? The answer is we make the core wider! A wider core is able to do more work per cycle. So instead of increasing the number of cycles per second, we increase the amount of work you can do in each cycle. With Graviton3 we've increased the width of the core in a number of ways. One example: we've increased the number of instructions that each core can work on concurrently from five to eight instructions per cycle. This is called instruction execution parallelism. How well each application does with this additional core width will vary, and it depends on really clever compilation, but our testing tells us that most workloads will see at least 25% faster performance, and some workloads, like Nginx, are seeing 60% performance improvement.

Higher instruction execution parallelism is not the only way to increase performance. By making things wider, you can also increase the width of the data you're processing. A great example of this is vector operations. Graviton3 doubles the size of the vectors that can be operated on in a single cycle, and this will have a significant impact on workloads like video encoding and encryption.
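To illustrate (my addition, not from the talk), the loop below is the kind of data-parallel kernel that vector units accelerate. With an autovectorizing compiler, the source doesn't change; only the number of elements processed per vector instruction does.

#include <stddef.h>

/* SAXPY: y = a*x + y. An autovectorizing compiler (cc -O3) maps this
 * loop onto whatever vector width the core provides; wider vectors
 * mean fewer instructions, and fewer cycles, for the same n. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

A core with 128-bit vectors handles four floats per instruction here; doubling to 256-bit vectors handles eight, which is why vector-heavy workloads like encoding and encryption see outsized gains.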
Adding more cores is an effective way of improving processor performance, and generally you want as many cores as you can fit. But you need to be careful here as well, because there are trade-offs that impact the application. When we looked closely at real workloads running on Graviton2, what we saw is that most workloads could actually run more efficiently if they had more memory bandwidth and lower-latency access to memory. That isn't surprising: modern cloud workloads are using more memory and becoming more and more sensitive to memory latency. So rather than using our extra transistors to pack more cores onto Graviton3, we decided to use our transistors to improve memory performance! Graviton2 instances already had a lot of memory bandwidth per vCPU, but we decided to add even more to Graviton3. Each Graviton3 core has 50% more memory bandwidth than Graviton2. C7g, powered by Graviton3, is the first cloud instance to support the new DDR5 memory standard, which also improves memory performance.
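To see the memory-bandwidth point in code (my addition; a hypothetical sketch in the style of the STREAM benchmark, not the STREAM suite itself), the triad loop below does almost no compute per byte moved, so its throughput is set by memory bandwidth, the resource Graviton3 adds per core. The array size is illustrative and must exceed the last-level cache.

/* STREAM-style triad sketch: bandwidth-bound, not compute-bound.
 * Build: cc -O2 triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 25)     /* 32M doubles: 256 MB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];      /* triad: 2 reads + 1 write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad: %.1f GB/s (a[0]=%.0f)\n",
           3.0 * N * sizeof(double) / secs / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}

Run one copy per core and the aggregate number is bounded by the chip's memory bandwidth, which is why spending transistors on bandwidth rather than extra cores can make most workloads run more efficiently.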