Squeeze the hell out of the system you have

Dan Slimmon, 2023/08/11

About a year ago, I raised a red flag with colleagues and managers about Postgres performance. Our database was struggling to keep up with the load generated by our monolithic SaaS application. CPU utilization was riding between 60 and 80%, and at least once it spiked to 100%, causing a brief outage.

Now, we had been kicking the can down the road with respect to Postgres capacity for a long time. When the database looked too busy, we'd replace it with a bigger instance and move on. This saved us a lot of time and allowed us to focus on other things, like building features, which was great. But this time, it wasn't possible to scale the DB server vertically: we were already on the biggest instance. And we were about to overload that instance.

Lots of schemes were floated. Foremost among them:

* Shard writes. Spin up a cluster of independent databases, and write data to one or the other according to some partitioning strategy (a sketch of what this might look like follows this list).
* Do micro-services. Split up the monolith into multiple interconnected services, each with its own data store that could be scaled on its own terms.
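To make the first option concrete, here's a minimal sketch of hash-based write sharding at the application layer. Everything in it is hypothetical (the shard names, the connection strings, the choice of account ID as the partition key); it shows the shape of the idea, not something we built:

```ruby
require "zlib"

# Hypothetical shard map: two independent Postgres databases.
SHARDS = {
  shard_a: "postgres://db-a.internal:5432/app",
  shard_b: "postgres://db-b.internal:5432/app",
}.freeze

# Deterministically map an account to a shard, so every write (and
# every subsequent read) for that account lands on the same database.
def shard_for(account_id)
  keys = SHARDS.keys.sort
  keys[Zlib.crc32(account_id.to_s) % keys.size]
end

shard_for(42) # => :shard_a or :shard_b (always the same one for account 42)
```

Simple as it looks, that little mapping would have to be threaded through every backup, migration, and monitoring decision from then on.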
Both of these options are cool! A strong case can be made for either one on its merits. With write sharding, we could potentially increase our capacity by 2 or even 3 orders of magnitude. With micro-services, we'd be free to use "the right tool for the job," picking data stores optimized to the requirements of each service workload. Either branch of the skill tree would offer exciting options for fault tolerance and operational resilience. Either way, everyone had to agree: we'd outgrown our old, naive implementation. Onward and upward! We can do hard things!

In situations like this, presented with a dazzling array of next-generation architecture options that can be built to last us through the decade, it's easy to forget what our goal was: to get database performance under control.

Complexity costs attention

Sometimes, leaps in complexity must be made. It's generally a good problem to have. If enough demand is being placed on your system to render obsolete your existing technology, then even more growth is probably on the horizon! If you can just put in the investment and build the more advanced architecture now, then you'll be looking at a bright future of unconstrained year-over-year success.

But don't just consider the implementation cost. The real cost of increased complexity - often the much larger cost - is attention. If you decide to shard across databases, then not only must you pay the money-, time-, and opportunity cost of building out the new architecture: you must also take the new complexity into account in every subsequent technical decision. Want to shard writes? Fine, but this complicates every future decision about backups, monitoring, migrations, the ORM, and network topology (just to name a few). And don't get me started on micro-services.

Just think about how massive these costs are. How much feature delivery will have to be delayed or foregone to support the additional architectural complexity?

Always squeeze first

We should always put off significant complexity increases as long as possible. When complexity leaps are on the table, there's usually also an opportunity to squeeze some extra juice out of the system you have. By tweaking the workload, tuning performance, or supplementing the system in some way, you may be able to add months or even years of runway. When viable, these options are always preferable to building out a next-gen system.

Let's return to the example of the overloaded Postgres instance. In that case, what we ended up doing was twofold:

1. Two engineers (me and my colleague Ted - but mostly Ted) spent about 3 months working primarily on database performance issues. There was no silver bullet. We used our telemetry to identify heavy queries, dug into the (Rails) codebase to understand where they were coming from, and optimized or eliminated them. We also tuned a lot of Postgres settings. (A sketch of this kind of query triage appears after this list.)
2. Two more engineers cut a path through the codebase to run certain expensive read-only queries on a replica DB. This effort bore fruit around the same time as (1), when we offloaded our single most frequent query (a SELECT triggered by polling web clients). (A sketch of replica routing in Rails follows as well.)
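For illustration, here's one way to surface heavy queries directly from Postgres using the pg_stat_statements extension. This is a sketch, not necessarily the telemetry we used; it assumes the extension is installed and uses Postgres 13+ column names (total_exec_time was total_time in earlier versions):

```ruby
# A sketch of query triage, run from a Rails console: rank statements
# by total execution time, which is usually where the database CPU goes.
rows = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT left(query, 80)          AS query,
         calls,
         total_exec_time / 1000.0 AS total_seconds
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10
SQL

rows.each do |r|
  puts format("%10d calls  %10.1fs  %s",
              r["calls"].to_i, r["total_seconds"].to_f, r["query"])
end
```

As for the replica path: Rails 6+ ships with multiple-database support that can express this kind of read offloading. Here's a minimal sketch; the database names, model, and helper below are illustrative, not necessarily how our code does it:

```ruby
# config/database.yml would define a writer ("primary") and a read
# replica ("primary_replica"); the names here are illustrative.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

class Notification < ApplicationRecord; end # hypothetical model

# Explicitly route an expensive, staleness-tolerant read to the replica,
# e.g. the kind of SELECT a polling web client triggers over and over.
def unread_count(user_id)
  ActiveRecord::Base.connected_to(role: :reading) do
    Notification.where(user_id: user_id, read_at: nil).count
  end
end
```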
These two efforts together reduced the maximum weekly CPU usage on the database from 90% to 30%. Now we can sleep at night. We have a huge amount of room to grow, both in terms of CPU headroom and our ability to shed load from the primary. And furthermore, since our work touched many parts of the codebase and demanded collaboration with lots of different devs, we now have a strong distributed knowledge base about the existing system. We're well positioned to squeeze it even more if need be.

This doesn't mean complexity is bad

Of course, I'm not saying complexity is bad. It's necessary. Some day we'll reach a fundamental limit of our database architecture, and before that day arrives, we'll need to make a jump in complexity. But until then, because we squeezed first, we get to keep working with the most boring system possible. This is by far the cheaper and more practical option.