Squeeze the hell out of the system you have

Dan Slimmon, 2023/08/11

About a year ago, I raised a red flag with colleagues and managers about Postgres performance. Our database was struggling to keep up with the load generated by our monolithic SaaS application. CPU utilization was riding between 60 and 80%, and at least once it spiked to 100%, causing a brief outage.

Now, we had been kicking the can down the road with respect to Postgres capacity for a long time. When the database looked too busy, we'd replace it with a bigger instance and move on. This saved us a lot of time and allowed us to focus on other things, like building features, which was great. But this time, it wasn't possible to scale the DB server vertically: we were already on the biggest instance. And we were about to overload that instance.

Lots of schemes were floated. Foremost among them:

* Shard writes. Spin up a cluster of independent databases, and write data to one or the other according to some partitioning strategy (a sketch of what this might look like follows this list).
* Do micro-services. Split up the monolith into multiple interconnected services, each with its own data store that could be scaled on its own terms.
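To make the first option concrete, here's a minimal sketch of hash-based write sharding at the application layer. Everything in it is hypothetical (the shard names, the connection strings, the choice of account ID as the partition key); it shows the shape of the idea, not something we built:

```ruby
require "zlib"

# Hypothetical shard map: two independent Postgres databases.
SHARDS = {
  shard_a: "postgres://db-a.internal:5432/app",
  shard_b: "postgres://db-b.internal:5432/app",
}.freeze

# Deterministically map an account to a shard, so every write (and
# every subsequent read) for that account lands on the same database.
def shard_for(account_id)
  keys = SHARDS.keys.sort
  keys[Zlib.crc32(account_id.to_s) % keys.size]
end

shard_for(42) # => :shard_a or :shard_b (always the same one for account 42)
```

Simple as it looks, that little mapping would have to be threaded through every backup, migration, and monitoring decision from then on.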
Both of these options are cool! A strong case can be made for either one on its merits. With write sharding, we could potentially increase our capacity by 2 or even 3 orders of magnitude. With micro-services, we'd be free to use "the right tool for the job," picking data stores optimized to the requirements of each service workload. Either branch of the skill tree would offer exciting options for fault tolerance and operational resilience. Either way, everyone had to agree: we'd outgrown our old, naive implementation. Onward and upward! We can do hard things!

In situations like this, presented with a dazzling array of next-generation architecture options that can be built to last us through the decade, it's easy to forget what our goal was: to get database performance under control.

Complexity costs attention

Sometimes, leaps in complexity must be made. It's generally a good problem to have. If enough demand is being placed on your system to render obsolete your existing technology, then even more growth is probably on the horizon! If you can just put in the investment and build the more advanced architecture now, then you'll be looking at a bright future of unconstrained year-over-year success.

But don't just consider the implementation cost. The real cost of increased complexity - often the much larger cost - is attention. If you decide to shard across databases, then not only must you pay the money-, time-, and opportunity cost of building out the new architecture: you must also take the new complexity into account in every subsequent technical decision. Want to shard writes? Fine, but this complicates every future decision about backups, monitoring, migrations, the ORM, and network topology (just to name a few). And don't get me started on micro-services.

Just think about how massive these costs are. How much feature delivery will have to be delayed or foregone to support the additional architectural complexity?

Always squeeze first

We should always put off significant complexity increases as long as possible. When complexity leaps are on the table, there's usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When viable, these options are always preferable to building out a next-gen system.

Let's return to the example of the overloaded Postgres instance. In that case, what we ended up doing was twofold:

1. Two engineers (me and my colleague Ted - but mostly Ted) spent about 3 months working primarily on database performance issues. There was no silver bullet. We used our telemetry to identify heavy queries, dug into the (Rails) codebase to understand where they were coming from, and optimized or eliminated them. We also tuned a lot of Postgres settings. (A sketch of this kind of query triage appears after this list.)
2. Two more engineers cut a path through the codebase to run certain expensive read-only queries on a replica DB. This effort bore fruit around the same time as (1), when we offloaded our single most frequent query (a SELECT triggered by polling web clients). (A sketch of replica routing in Rails follows as well.)
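For illustration, here's one way to surface heavy queries directly from Postgres using the pg_stat_statements extension. This is a sketch, not necessarily the telemetry we used; it assumes the extension is installed and uses Postgres 13+ column names (total_exec_time was total_time in earlier versions):

```ruby
# A sketch of query triage, run from a Rails console: rank statements
# by total execution time, which is usually where the database CPU goes.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT left(query, 80)          AS query,
         calls,
         total_exec_time / 1000.0 AS total_seconds
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10
SQL

rows.each do |r|
  puts format("%10d calls  %10.1fs  %s",
              r["calls"].to_i, r["total_seconds"].to_f, r["query"])
end
```

As for the replica path: Rails 6+ ships with multiple-database support that can express this kind of read offloading. Here's a minimal sketch; the database names, model, and helper below are illustrative, not necessarily how our code does it:

```ruby
# config/database.yml would define a writer ("primary") and a read
# replica ("primary_replica"); the names here are illustrative.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

class Notification < ApplicationRecord; end # hypothetical model

# Explicitly route an expensive, staleness-tolerant read to the replica,
# e.g. the kind of SELECT a polling web client triggers over and over.
def unread_count(user_id)
  ActiveRecord::Base.connected_to(role: :reading) do
    Notification.where(user_id: user_id, read_at: nil).count
  end
end
```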
These two efforts together reduced the maximum weekly CPU usage on the database from 90% to 30%. Now we can sleep at night. We have a huge amount of room to grow, both in terms of CPU headroom and our ability to shed load from the primary. And furthermore, since our work touched many parts of the codebase and demanded collaboration with lots of different devs, we now have a strong distributed knowledge base about the existing system. We're well positioned to squeeze it even more if need be.

This doesn't mean complexity is bad

Of course, I'm not saying complexity is bad. It's necessary. Some day we'll reach a fundamental limit of our database architecture, and before that day arrives, we'll need to make a jump in complexity. But until then, because we squeezed first, we get to keep working with the most boring system possible. This is by far the cheaper and more practical option.