[HN Gopher] The database servers powering Let's Encrypt
       ___________________________________________________________________
        
       The database servers powering Let's Encrypt
        
       Author : jaas
       Score  : 200 points
       Date   : 2021-01-21 17:25 UTC (5 hours ago)
        
 (HTM) web link (letsencrypt.org)
 (TXT) w3m dump (letsencrypt.org)
        
       | rantwasp wrote:
       | Gets a huge server, does not properly resize the JPEG on the
       | page (it's 5 MB in size and I can see it loading). We don't all
       | have 5 TB of RAM, you know.
        
         | jaas wrote:
         | Fixed!
        
       | avipars wrote:
       | those EPYC CPUs are really epic
        
       | uncledave wrote:
       | I'd like to understand more about the workload here. Queries per
       | second, average result size, database size, query complexity etc.
        
       | polskibus wrote:
       | I wonder how much they save by not going with the public cloud?
        
       | StreamBright wrote:
       | >> We currently use MariaDB, with the InnoDB database engine.
       | 
       | It is kind of funny how long InnoDB has been the most reliable
       | storage engine. I am not sure if MyISAM is still trying to catch
       | up; it used to be much worse than InnoDB. With the emergence of
       | RocksDB there are multiple options today.
        
         | VWWHFSfQ wrote:
         | The only thing I ever used MyISAM tables for was storing blobs
         | and doing full-text search on documents. If your data is mostly
         | read-only then it's a decent option out of the box. But if you
         | do even mildly frequent updates then you'll quickly run into
         | problems with its table-level locking, instead of the row-level
         | locking offered by InnoDB.
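         | 
         | A minimal sketch of the difference, assuming a local
         | MariaDB/MySQL server and the pymysql driver (table names and
         | credentials are hypothetical):
         | 
         |     import pymysql
         | 
         |     conn = pymysql.connect(host="localhost", user="app",
         |                            password="secret", database="demo")
         |     with conn.cursor() as cur:
         |         # MyISAM: any write locks the whole table, so
         |         # concurrent writers (and readers, during the
         |         # write) queue up behind it.
         |         cur.execute("CREATE TABLE docs_myisam "
         |                     "(id INT PRIMARY KEY, body TEXT) "
         |                     "ENGINE=MyISAM")
         |         # InnoDB: an UPDATE locks only the affected row's
         |         # index records, so writes to different rows can
         |         # proceed in parallel.
         |         cur.execute("CREATE TABLE docs_innodb "
         |                     "(id INT PRIMARY KEY, body TEXT) "
         |                     "ENGINE=InnoDB")
         |     conn.commit()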
        
         | kirugan wrote:
         | MyISAM's days are gone; no one will seriously consider it a
         | suitable engine in MySQL.
        
           | jeffbee wrote:
           | It is quite likely, unless you've meticulously avoided it,
           | that MySQL is using MyISAM on-disk temp tables in the service
           | of your queries.
        
             | kirugan wrote:
             | I didn't know that, thanks!
        
             | fipar wrote:
             | Actually, that shouldn't be the case since 5.7:
             | https://dev.mysql.com/doc/refman/5.7/en/server-system-
             | variab... (and related:
             | https://dev.mysql.com/doc/refman/5.7/en/server-system-
             | variab...)
             | 
             | And on 8.0, MyISAM is mostly gone; not even the 'mysql'
             | schema uses it.
             | 
             | Edit: Originally linked only to
             | default_tmp_storage_engine.
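             | 
             | If you want to check a server yourself, a minimal sketch
             | using the pymysql driver (connection details are
             | placeholders):
             | 
             |     import pymysql
             | 
             |     conn = pymysql.connect(host="localhost", user="root",
             |                            password="secret")
             |     with conn.cursor() as cur:
             |         # Engine for explicit CREATE TEMPORARY TABLE.
             |         cur.execute("SHOW GLOBAL VARIABLES LIKE "
             |                     "'default_tmp_storage_engine'")
             |         print(cur.fetchone())
             |         # Engine for internal on-disk temp tables (5.7).
             |         cur.execute("SHOW GLOBAL VARIABLES LIKE "
             |                     "'internal_tmp_disk_storage_engine'")
             |         print(cur.fetchone())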
        
       | sradman wrote:
       | > We can clearly see how our old CPUs were reaching their limit.
       | In the week before we upgraded our primary database server, its
       | CPU usage (from /proc/stat) averaged over 90%
       | 
       | This strikes me as odd. In my experience, traditional OLTP row
       | stores are I/O bound due to contention (locking and latching).
       | Does anyone have an explanation for this?
       | 
       | > Once you have a server full of NVMe drives, you have to decide
       | how to manage them. Our previous generation of database servers
       | used hardware RAID in a RAID-10 configuration, but there is no
       | effective hardware RAID for NVMe, so we needed another
       | solution... we got several recommendations for OpenZFS and
       | decided to give it a shot.
       | 
       | Again, traditional OLTP row stores have included a mechanism for
       | recovering from media failure: place the WAL on a separate device
       | from the DB. Early MySQL used a proprietary backup add-on as a
       | revenue model, so maybe this technique is now obfuscated and/or
       | missing. You may still need/want a mechanism to federate the DB
       | devices, and incremental volume snapshots are far superior to a
       | full DB backup, but placing the WAL on a separate device is a
       | fantastic technique for both performance and availability.
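       | 
       | For example, InnoDB's innodb_log_group_home_dir can point the
       | redo log (its WAL) at a different device from datadir. A minimal
       | sketch for checking the current layout, using the pymysql driver
       | (credentials are placeholders):
       | 
       |     import pymysql
       | 
       |     conn = pymysql.connect(host="localhost", user="root",
       |                            password="secret")
       |     with conn.cursor() as cur:
       |         # If these resolve to mounts on different physical
       |         # devices, the redo log is already separated from
       |         # the data files.
       |         for var in ("datadir", "innodb_log_group_home_dir"):
       |             cur.execute("SHOW GLOBAL VARIABLES LIKE %s", (var,))
       |             print(cur.fetchone())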
       | 
       | The Let's Encrypt post does not describe how they implement off-
       | machine and off-site backup-and-recovery. I'd like to know if and
       | how they do this.
        
         | snuxoll wrote:
         | MySQL backup is more or less a "solved" issue with xtrabackup,
         | and I assume that's exactly what they are using with a database
         | of this size.
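         | 
         | For reference, a minimal sketch of what a full xtrabackup run
         | looks like when driven from Python; paths and credentials are
         | placeholders, and this is only a guess at what their tooling
         | does:
         | 
         |     import subprocess
         | 
         |     # Take a full physical backup of the running server.
         |     subprocess.run(
         |         ["xtrabackup", "--backup", "--user=backup",
         |          "--password=secret",
         |          "--target-dir=/backups/full-2021-01-21"],
         |         check=True,
         |     )
         |     # Apply the redo log so the copy is consistent and
         |     # restorable ("prepare" phase).
         |     subprocess.run(
         |         ["xtrabackup", "--prepare",
         |          "--target-dir=/backups/full-2021-01-21"],
         |         check=True,
         |     )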
        
         | mike_d wrote:
         | > traditional OLTP row stores are I/O bound due to contention
         | (locking and latching). Does anyone have an explanation for
         | this?
         | 
         | I have seen CPU-bound database servers when developers push
         | application logic into the database: everything from using
         | server-side functions like MD5() to needless triggers and
         | stored procedures that could have been done application-side.
        
           | cdcarter wrote:
           | Indeed, at $PLACE_OF_WORK over the last 15 years a lot of
           | logic was built in PL/SQL, and DB CPU has become one of our
           | most precious resources. For some applications it's perfectly
           | reasonable, until you need to scale horizontally.
        
           | jeffbee wrote:
           | Any MySQL with more than about 100 concurrent queries of the
           | same InnoDB table is going to be CPU bound on locks. Their
           | whole locking scheme doesn't scale; it's designed to look
           | great in benchmarks with few clients.
        
         | PeterCorless wrote:
         | > This strikes me as odd. In my experience, traditional OLTP
         | row stores are I/O bound due to contention (locking and
         | latching). Does anyone have an explanation for this?
         | 
         | Yes. My CTO, Avi Kivity did a great talk about this at Core C++
         | 2019: https://www.scylladb.com/2020/03/26/avi-kivity-at-
         | core-c-201...
         | 
         | Let me boil it down to a few points; some beyond Avi's talk:
         | 
         | * Traditional RDBMS with strong consistency and ACID guarantees
         | are always going to exhibit delays. That's what you want them
         | for. Slow, but solid.
         | 
         | * Even many NoSQL databases written (supposedly) for High
         | Availability still use highly synchronous mechanisms
         | internally.
         | 
         | * You need to think about a multi-processor, multi-core server
         | _as its own network_ internally. You need to consider rewriting
         | everything with the fundamental consideration of async
         | processing, even within the same node. Scylla uses C++
         | futures/promises and a shared-nothing shard-per-core
         | architecture, as well as new async methods like io_uring.
         | 
         | * Between nodes, you also have to consider highly async
         | mechanisms. For example, the tunable eventual consistency model
         | you'd find in Cassandra or Scylla. While we also support Paxos
         | for LWT, if you need strong linearizability for
         | read-before-write conditional updates, that comes at a cost.
         | Many classes of transactions will treat that as overkill.
         | 
         | * And yes, backups are also a huge issue for those sorts of
         | data volumes. Scylla, for example, has implemented different
         | priority classes for certain types of activities. It handles
         | all the scheduling between OLTP transactions as highest
         | priority, while allowing the system to plug away at, say,
         | backups or repairs.
         | 
         | More on where we're going with all this is written in a blog
         | about our new Project Circe:
         | 
         | https://www.scylladb.com/2021/01/12/making-scylla-a-monstrou...
         | 
         | But the main point is that you have to really think about how
         | to re-architect your software to take advantage of huge multi-
         | processor machines. If you invest in all this hardware, but
         | your software is limiting your utility of it, you're not
         | getting the full bang you spent your buck on.
        
         | e12e wrote:
         | > The Let's Encrypt post does not describe how they implement
         | off-machine and off-site backup-and-recovery. I'd like to know
         | if and how they do this.
         | 
         | The section:
         | 
         | > There wasn't a lot of information out there about how best to
         | set up and optimize OpenZFS for a pool of NVMe drives and a
         | database workload, so we want to share what we learned. You can
         | find detailed information about our setup in this GitHub
         | repository.
         | 
         | points to: https://github.com/letsencrypt/openzfs-nvme-
         | databases
         | 
         | Which states:
         | 
         | > Our primary database server rapidly replicates to two others,
         | including two locations, and is backed up daily. The most
         | business- and compliance-critical data is also logged
         | separately, outside of our database stack. As long as we can
         | maintain durability for long enough to evacuate the primary
         | (write) role to a healthier database server, that is enough.
         | 
         | Which sounds like a traditional master/slave setup, with
         | failover?
        
           | sradman wrote:
           | > Which sounds like a traditional master/slave setup, with
           | failover?
           | 
           | Yes, thank you. I assumed that the emphasis on the speed of
           | NVMe drives meant that master/slave synchronous replication
           | was avoided and asynchronous replication could not keep up.
           | In my mind, this leaves room for interesting future
           | efficiency/performance gains, especially surrounding the
           | "...and is backed up daily" approach mentioned in your quote.
           | 
           | The bottom line is that the old RPO (Recovery Point
           | Objective) and RTO (Recovery Time Objective) are as important
           | as ever.
        
       | baby wrote:
       | So scaling up instead of scaling out. I'm not sure if it's a
       | viable strategy long term; at the same time, we probably don't
       | want a single CA to handle too many certificates?
        
         | thamer wrote:
         | Scaling up means each query is faster (3x in this particular
         | case). Scaling out means they can support more clients/domains
         | (more DB shards, more web servers, more concurrency, etc).
         | 
         | These are two distinct axes that are not incompatible with each
         | other.
        
         | theandrewbailey wrote:
         | I'm not aware of any other CA giving out free certificates to
         | anyone. I know that some other providers/hosts will do free
         | certificates, but only to their users (last time I checked).
        
       | gbrown_ wrote:
       | This is a very high-level overview, and ideally I would have
       | liked to see more application-level profiling, e.g. where time is
       | being spent (be it on CPU or I/O) within the DB rather than
       | high-level system stats. For example, the following:
       | 
       | > CPU usage (from /proc/stat) averaged over 90%
       | 
       | Leaves me wondering exactly _which_ metric from /proc/stat they
       | are referring to. I mean, it's presumably user time, but I just
       | dislike attempts to distill systems performance into a few graph
       | comparisons. In reality, the realized performance of a system is
       | often better described by a narrative explaining what bottlenecks
       | the system.
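       | 
       | For what it's worth, a minimal sketch of how an "averaged over
       | 90%" figure can be derived from /proc/stat; this is a guess at
       | the method, not something the post states:
       | 
       |     import time
       | 
       |     def cpu_times():
       |         # Fields: user nice system idle iowait irq softirq ...
       |         with open("/proc/stat") as f:
       |             return [int(x) for x in f.readline().split()[1:]]
       | 
       |     a = cpu_times()
       |     time.sleep(5)
       |     b = cpu_times()
       |     delta = [y - x for x, y in zip(a, b)]
       |     idle = delta[3] + delta[4]  # idle + iowait
       |     busy = 100.0 * (sum(delta) - idle) / sum(delta)
       |     print(f"CPU busy: {busy:.1f}%")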
        
       | fabian2k wrote:
       | I'm curious why they didn't go with the larger 64-core EPYC. I
       | mean, it's double the cost, but I suspect that the huge number of
       | NVMe SSDs is by far the largest part of the cost anyway. And it
       | seems like the CPU was the previous bottleneck, as it was at 90%.
        
         | jaas wrote:
         | We didn't go with the 64-core chips because they have
         | significantly lower clock speeds.
         | 
         | Dual 32-core chips give us plenty of cores while keeping clocks
         | higher for single-threaded performance.
         | 
         | You are correct that the price of the CPUs is almost irrelevant
         | to the overall cost of a system with this much memory and
         | storage. We were picking the ideal CPU, not selecting on CPU
         | price.
        
           | _joel wrote:
           | Out of interest, how is the NUMA on that setup? Does it still
           | use QPI, or is there a newer technology now? (I've been out
           | of this space for a few years.)
        
           | fabian2k wrote:
           | Thanks for the answer. I would have guessed that the higher
           | core count outweighs the lower frequency for database usage,
           | but obviously I don't know the details. I think the 90% CPU
           | usage graph just made me nervous enough to want the biggest
           | possible CPU in there.
        
       | sgt wrote:
       | I'm guessing someone out there's thinking: Why aren't they
       | hosting in the cloud? The cloud being either Amazon or Azure.
       | Surely nothing else exists. Is it really possible to host your
       | own PHYSICAL machine? Does that count as the cloud?!
        
         | castillar76 wrote:
         | First, this made me giggle because I run into that attitude all
         | the time. "You're hosting things on a SERVER? Why would anyone
         | do THAT? Heck, you should be putting everything in serverless
         | and avoiding even the vague possibility that you would have to
         | touch anything so degrading and low-class as an _operating
         | system_. Systems administration? Who does that?"
         | 
         | In all seriousness, however, the decision (likely) has very
         | little to do with that. They're most likely not hosting in the
         | cloud because the current CA/Browser Forum rules around the
         | operation of public CAs effectively don't permit cloud hosting.
         | That's a work in progress, but for the time being, the actual
         | CA infrastructure can't be hosted in the cloud due to security
         | and auditability requirements.
        
         | speedgoose wrote:
         | Yes it's still possible to put your own physical machines in a
         | datacentre.
         | 
         | For example : https://www.scaleway.com/en/dedibox/dedirack/
         | 
         | I'm not sure you can say it's the cloud though.
         | 
         | They are not hosting their database in a cloud like Amazon or
         | Azure because no cloud provider offers such high performance at
         | a comparable price. Actually, I'm not even sure you can get a
         | cloud VM with that much I/O, even if you don't mind the
         | pricing.
        
           | Symbiote wrote:
           | We have a server with the hostname "cloud.example.com".
           | 
           | It can help if someone wants their data "in the cloud".
        
         | zimmerfrei wrote:
         | They are a public CA and must undergo a third-party compliance
         | audit to operate. The conditions are such that you cannot
         | really pass if your infra is in any of those public clouds.
        
         | Neil44 wrote:
       | Not sure if you're being sarcastic, but I reckon a couple of mil
       | a year on AWS for anything similar, vs. a one-off 200k. Not a bad
       | saving.
        
           | iruoy wrote:
           | And a sizable internet bill I'd assume. This puppy ain't
           | running on Gigabit.
        
         | barkingcat wrote:
         | For a service like letsencrypt, the independence factor is also
         | a major reason for self hosting.
         | 
         | I can foresee letsencrypt in the future going to build their
         | own cloud (on their own physical infrastructure), but speaking
         | as a letsencrypt user of their free certificate program, I
         | would lose respect and interest in their service if they went
         | with an AWS or GCP or Azure approach.
         | 
         | The independence from other major players (and the ability of
         | their team to change and move everything about their service,
         | as needed) is one of the reasons I use letsencrypt.
        
           | dingaling wrote:
           | Funny you mention AWS as they're one of the corporate
           | sponsors of LE.
           | 
           | So long as they don't have a viable independent revenue
           | stream they're arguably less independent than commercial CAs.
        
             | [deleted]
        
       | multifascia wrote:
       | As someone unfamiliar with DB management: is it really less
       | operational overhead to have to physically scale your hardware
       | than to use a distributed option with more elastic scalability?
        
         | anewaccount2021 wrote:
         | As stated in the post, there are read replicas. Assuming their
         | workload is primarily reads, this buys them a decent amount of
         | redundancy.
        
         | tanelpoder wrote:
         | Also worth noting that scalability != efficiency. With enough
         | NVMe drives, a single server can do _millions of IOPS_ and scan
         | data at over 100 GB/s. A single PCIe 4.0 x4 SSD on my machine
         | can do large I/Os at a 6.8 GB/s rate, so 16 of them (with 4 x
         | quad-SSD adapter cards) in a 2-socket EPYC machine can do over
         | 100 GB/s.
         | 
         | You may need clusters, duplicated systems, replication, etc for
         | resiliency reasons of course, but a single modern machine with
         | lots of memory channels per CPU and PCIe 4.0 can achieve
         | ridiculous throughput...
         | 
         | edit: Here's an example of doing 11M IOPS with 10x Samsung Pro
         | 980 PCIe 4.0 SSDs (it's from an upcoming blog entry):
         | 
         | https://twitter.com/TanelPoder/status/1352329243070504964
        
           | riku_iki wrote:
           | > 16 of them (with 4 x quad SSD adapter cards) in a 2-socket
           | EPYC machine can do over 100 GB/s.
           | 
           | It is more interesting whether the actual CPU can handle such
           | traffic in the context of a DB load: encode/decode records,
           | sort, search, merge, etc.
        
             | tanelpoder wrote:
             | Yes, with modern storage, throughput is a CPU problem.
             | 
             | And the CPU problem for OLTP databases is largely a memory
             | access latency problem. For columnar analytics & complex
             | calculations it's more about the CPU itself.
             | 
             | When doing 1 MB sized I/Os for scanning, my 16c/32t (AMD
             | Ryzen Threadripper Pro WX) CPUs were just about 10% busy.
             | So, with a 64-core single-socket Threadripper workstation
             | (or a 128-core dual-socket EPYC server), there should be
             | plenty of horsepower left.
        
               | tanelpoder wrote:
               | As I mentioned memory access latency - I just posted my
               | old article series about measuring RAM access
               | performance (using different database workloads) to HN,
               | and it looks like it even made it to the front page
               | (nice):
        
               | gpsar wrote:
               | If the problem involves independent traversals,
               | interleaving with coroutines is a practical way to hide
               | latency https://dl.acm.org/doi/10.1145/3329785.3329917
               | https://www.linkedin.com/pulse/dont-stall-multitask-
               | georgios...
        
           | jdsully wrote:
           | It can burst to millions of IOPS, but you get killed on a
           | sustained write workload. Even a high-end enterprise NVMe
           | drive will be limited to around 60k IOPS once you exceed its
           | write cache.
        
             | tanelpoder wrote:
             | Yup indeed it's an issue with NAND SSDs - and it heavily
             | depends on the type (SLC, MLC, TLC, QLC), vendor
             | (controller, memory, "write buffer") sizes etc. I'm doing
             | mostly read tests right now and will move to writes after.
             | 
             | The Samsung 980 PROs I have are TLC for main storage, but
             | apparently use some of that TLC storage as a faster write
             | buffer (TurboWrite buffer) - I'm not an expert, but
             | apparently the controller can decide to program the TLC
             | NAND with only 1-bit "depth"; they call it "simulated SLC"
             | or something like that. On the 1 TB SSD, the TurboWrite
             | buffer can dynamically extend to ~100 GB if there's unused
             | NAND space on the disk.
             | 
             | Btw, the 3DXpoint storage (Intel Optane SSDs & Micron X1)
             | should be able to sustain crazy write rates too.
        
         | speedgoose wrote:
         | They may rely on the ACID properties of their database. Which
         | makes everything simpler, easier, and safer.
         | 
         | https://dev.mysql.com/doc/refman/8.0/en/mysql-acid.html
        
         | baby wrote:
         | I'm guessing that since it is for registration and all, the
         | usage might be write-driven, or at least equally balanced
         | between writes and reads.
         | 
         | In addition, you really care about integrity of your data so
         | you probably want serializability, avoid concurrency and
         | potential write/update conflicts, and to only do the writes on
         | a single server.
         | 
         | For this reason it sounds to me that partitioning/sharding is
         | the only way to really scale this: have different write servers
         | that care about different primary keys.
        
         | sumtechguy wrote:
         | That really depends on your software.
         | 
         | With something NoSQL-style, it is kind of built in that it will
         | be distributed. But that pushes the compute cost back onto the
         | clients. Each node is 'crap', but you have hundreds so it does
         | not matter.
         | 
         | With something like SQL Server, it comes down to how fast you
         | can get the data out of the machine to clone it somewhere else
         | (sharding/hashing, live/live backups, etc). That is disk,
         | network, CPU. Usually in that order.
         | 
         | In most of the ones I ever did, it was almost always the
         | network that was the bottleneck. With something like a 10 Gb
         | network card (state-of-the-art neato at the time; I am sure you
         | can buy better now) you were looking at saturation around 1 GB
         | per second (if you were lucky). That is a big number, but
         | depending on your input transaction rate and how the data is
         | stored it can drop off dramatically. Put it local to the server
         | and you can 10x that easily. Going out of node costs a huge
         | amount of latency. Add in a requirement of, say, 'offsite hot
         | backup' and it slows down quickly.
         | 
         | In the 'streaming' world, like Kafka, you end up with a
         | different style: lots of small processes/threads which live on
         | 'meh' machines, but you hash the data and dump it out to other
         | layers for storage of the results. But this comes at a cost of
         | more hardware and network. Things like 'does the rack have
         | enough power', 'do we have open ports', 'do we have enough
         | licenses to run at the 10Gb rate on this router', 'how do we
         | configure 100 machines in the same way', 'how do we upgrade 100
         | machines in our allotted time'. You can fling that out to
         | something like AWS, but that comes at a monetary cost. Even
         | virtual, there is a management cost. Fewer boxes means less
         | cost.
        
         | stu2010 wrote:
         | Relational databases enable some very flexible data access
         | patterns. Once you shard, you lose a lot of that flexibility.
         | If you move away from a relational model, you lose even more
         | flexibility and start having to do much more work in your
         | application layer, and usually start having to use more
         | resources and developer time every step of the way.
         | 
         | The productivity enabled by having one master RDBMS is a big
         | deal, and if they can buy commodity servers that satisfy their
         | requirement, this seems like a fine way to operate.
        
           | hinkley wrote:
           | If I had a billion dollars, I'd put a research group together
           | to study the prospects of index sharding.
           | 
           | That is, full table replication, but individual servers
           | maintaining differing sets of indexes. OLAP and single
           | request transactions could be routed to specialized replicas
           | based on query planning, sending requests to machines that
           | have appropriate indexes, and preferably ones where those
           | indexes are hot.
        
             | jandrewrogers wrote:
             | This has been done in popular commercial databases for
             | decades, and is thoroughly researched. As far as I know,
             | these types of architectures are no longer used at this
             | point due to their relatively poor scalability and write
             | performance. I don't think anyone is designing new
             | databases this way anymore, since it only ever made sense
             | in the context of spinning disk.
             | 
             | The trend has been away from complex specialization of data
             | structures, secondary indexing, etc and toward more general
             | and expressive internal structures (but with more difficult
             | theory and implementation) that can efficiently handle a
             | wider range of data models and workloads. Designers started
             | moving on from btrees and hash tables quite a while ago,
             | mostly for the write performance.
             | 
             | Write performance is critical even for read-only analytical
             | systems due to the size of modern data models. The initial
             | data loading can literally take several months with many
             | popular systems, even for data models that are not
             | particularly large. Loading and indexing 100k records per
             | second is a problem if you have 10T records.
        
             | ddorian43 wrote:
             | The problem is the network. You need the billion dollars to
             | fix the network so it's as fast as local RAM/NVMe.
        
           | jasonwatkinspdx wrote:
         | I agree this is an underappreciated strategy. Someone in my
         | family worked for a hedge fund where one of their simple
         | advantages was that they just ran MS SQL on the biggest
         | physical machine available at any given moment. Lots of
         | complexity dodged by just having a lot of brute capacity.
        
         | [deleted]
        
       | tyingq wrote:
       | I was, long ago, an old-school Unix sysadmin. While I was
       | technically aware of how powerful smallish servers have become,
       | this article really crystallized that for me.
       | 
       | 64 cores and 24 NVME drives in a 2U spot on a rack is just insane
       | compared to what we used to have to do to get a beefy database
       | server. And it's not some exotic thing, just a popular mainstream
       | Dell SKU.
       | 
       | If you price it out on Dell's site, you get a retail price north
       | of $200k. That is really what made it clear for me: that you
       | could fit $200k+ worth of DIMMs, drives, and CPUs into a 2U spot
       | :)
        
         | theandrewbailey wrote:
         | Dell-controlled VMware is listed under "major sponsors and
         | funders". I wonder if Let's Encrypt got a discount on that
         | server. Good for them!
        
         | wpietri wrote:
         | Totally. 2 TB of RAM! In one box!
         | 
         | I think the first servers I had in production had 8 MB RAM. No
         | more, certainly. Soon we'll be at 1000x that. My dad's first
         | "server" was 3 orders of magnitude smaller, with 8 KB of RAM
         | (hand-wound wire core memory). In that time, the US population
         | hasn't even doubled.
        
           | jagger27 wrote:
           | More like 1,000,000x
        
         | vmception wrote:
         | I have a motherboard from 2012 and I just put 2x 8TB NVMe SSDs
         | on it, on a PCIe 2.0 x16 slot
         | 
         | Works great. The PCIe card itself has 2 more slots for SSDs
         | 
         | The GPU is on the 2.0 x8 slot because they don't really
         | transfer that much data over the lanes.
         | 
         | I honestly didn't realize PCIe was up to 4.0 now, and I am
         | pushing up against the limits of PCIe 2.0, but it still works!
         | And I'm "only" at the limits, and it's only a limit when I want
         | faster than 3,000 megabytes per second, which is amazing.
         | 
         | Granted, this would have been considered a good enthusiast
         | motherboard in 2012. Buying new but cheap is the mistake.
        
           | foota wrote:
           | What drives did you get? I think you need PCIe 4.0 to stress
           | most SSDs these days?
        
             | vmception wrote:
             | I have a hunch that the pcie card itself is most important
             | as it is doing bifurcation.
             | 
             | So each drive acts like it has its own slower (but fast
             | enough) PCIe slot, and then the RAID 0 combines the bits
             | back to double the performance.
             | 
             | Could be wrong but I get 2,900 megabytes per second
             | transfers from RAM to disk and back.
             | 
             | And this is PCIe 2.0 x16
             | 
             | So maybe if you want 3,000, 4,000, or 6,500 megabytes per
             | second then I have nothing to brag about. I'm pretty amazed
             | though, and will be content for all my use cases.
        
       | ryanworl wrote:
       | What are they storing on this server that requires 150 TB of
       | storage and millions of IOPS?
        
         | abrookewood wrote:
         | My thoughts exactly ... They are creating and storing text.
         | Only thing I can think is that they don't actually need the
         | storage, but just want the lowest possible latency by having a
         | large number of drives.
        
         | dewey wrote:
         | > What exactly are we doing with these servers? Our CA
         | software, Boulder, uses MySQL-style schemas and queries to
         | manage subscriber accounts and the entire certificate issuance
         | process.
        
           | jeffbee wrote:
           | There's nothing in that sentence that implies they'd need
           | even 100 IOPS, much less 20 million.
        
           | stefan_ wrote:
           | What exactly needs to be stored once the certificate is
           | created and published in the hash tree? It seems like the
           | kind of data that possibly needn't be stored at all, or could
           | be pushed onto something like Glacier for archival.
        
             | Conclusionist wrote:
             | Going to guess it's for OCSP responses.
        
               | stefan_ wrote:
               | I'm not sure; e.g. Chrome doesn't do OCSP by default, and
               | lots of embedded clients like curl won't either. Unless
               | the protocol is terribly broken, that also seems like the
               | kind of use case where 99% of queries just come out of a
               | cache and should never hit a database.
        
               | siwyd wrote:
               | FYI, the intermediate CAs signed by their new Root X2
               | certificate won't have OCSP URLs anymore.
               | 
               | Source: https://letsencrypt.org/2020/09/17/new-root-and-
               | intermediate...
        
               | cipherboy wrote:
               | AFAIK, nobody has suggested removal of OCSP from end-
               | entity certificates. This article you linked (and the
               | comment you wrote) is purely about removal from
               | intermediate CA certificates.
               | 
               | The majority of OCSP traffic will probably be for end-
               | entity certificates; most OCSP validation (in browsers
               | and cryptographic libraries) is end-entity validation,
               | not leaf-and-chain.
               | 
               | Removal of intermediate CA's OCSP is probably not really
               | relevant to their overall OCSP performance numbers (and
               | if it was, it was likely cached already).
        
           | gbrown_ wrote:
           | The post doesn't specify requirements or application level
           | targets for performance. They show a couple of good latency
           | improvements but don't describe the business or technical
           | impact. The closest we get is this.
           | 
           | > If this database isn't performing well enough, it can cause
           | API errors and timeouts for our subscribers.
           | 
           | What are the SLOs? How were these being met (or not) before
           | vs. after the hardware upgrade? There's a lot of additional
           | context that could have been added to this post. It's not a
           | bad post, but it simply reduces down to "this new hardware is
           | faster than our old hardware."
        
       | tclancy wrote:
       | +10% for the proper use of decimated
        
         | MayeulC wrote:
         | You mean, that one?
         | https://en.wikipedia.org/wiki/Decimation_(Roman_army)
         | 
         | Then it would be 90 ms -> 81 ms, not 90 ms -> 9 ms. The way I
         | see it, at least. With proper decimation, 90% of what was there
         | remains. ("removal of a tenth", as wikipedia puts it).
        
       | jarym wrote:
       | Just goes to show how much a single SQL server can scale before
       | having to worry about sharding and horizontal scaling.
        
         | nine_k wrote:
         | Read performance is much easier to scale (in one box or
         | several) than write performance. It's usually the writes that
         | make you look at Cassandra and similar, instead of adding more
         | disks and RAM, or spinning another read-only replica.
         | 
         | 24 NVMe drives should have _a lot_ of write throughput, though.
        
         | cperciva wrote:
         | Based on their stated 225M sites and a renewal period of 90
         | days, they're probably averaging around 40 certificates per
         | second. That's only an order of magnitude higher than bitcoin;
         | I wouldn't call it an indication of an ability to scale to a
         | particularly large amount of traffic.
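         | 
         | Rough math behind that estimate, assuming each certificate is
         | actually renewed when ~30 days remain, i.e. roughly every 60
         | days:
         | 
         |     sites = 225e6                  # active certificates
         |     renewal_period_s = 60 * 86400  # ~60 days, in seconds
         |     rate = sites / renewal_period_s
         |     print(f"~{rate:.0f} certs/second")  # ~43/second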
        
           | tyingq wrote:
           | Does "certbot renew" talk to the mothership at all if no
           | certs are ready for renewal? If it does, most setups I've
           | seen run the renewal once or twice a day since it only does
           | the renew when you're down to 30 days left. There may also be
           | some OCSP related traffic.
        
             | mike_d wrote:
             | Certbot will look at the expiration timestamp on your local
             | certs without talking to Lets Encrypt.
        
           | PeterCorless wrote:
           | Yes. They are not doing a very heavy computational workload.
           | Typical heavy-duty servers these days can do 100k's or
           | millions of TPS. 40 TPS is a really, really, really light
           | load.
           | 
           | Further, I was looking at those new server specs. There's an
           | error, I think? The server config on the Dell site shows 2x 8
           | GB DIMMs, for 16 GB of RAM per server, whereas the article
           | says 2 TB!
           | With only 16GB of RAM, but 153.6 TB of NVMe storage, the real
           | issue here is memory limitation for a general-purpose SQL
           | database or a typical high-availability NoSQL database.
           | 
           | Check my math: 153600 GB storage / 16 GB memory = 9600:1
           | ratio
           | 
           | Consider, by comparison that a high data volume AWS
           | i3en.24xlarge has 60TB of NVMe storage but 768 GB of RAM. A
           | 78:1 ratio.
           | 
           | If the article is correct, and the error is in the config on
           | the Dell page (not the blog), and this server is actually 2
           | TB RAM, then that's another story. That'd make it a ratio of
           | 153600 / 2000 = ~77:1.
           | 
           | Quite in line with the AWS i3en.
           | 
           | But then it would baffle me why you would only get 40 TPS out
           | of such a beast.
           | 
           | Check my logic. Did I miss something?
        
           | RL_Quine wrote:
           | Bitcoin is ECDSA verification, letsencrypt is generating RSA
           | signatures, the two aren't even remotely comparable.
        
             | schoen wrote:
             | But the cryptographic bottleneck is in HSMs, not in
             | database servers (database servers don't _generate_ the
             | digital signatures, they just have to store them after
             | they've been generated).
        
         | ed25519FUUU wrote:
         | This is especially true with your own hardware. Trying this
         | kind of thing in the cloud is usually prohibitively expensive.
        
           | jbverschoor wrote:
           | It doesn't have to be. Unless you're conditioned to believe
           | that AWS is cheap.
        
             | [deleted]
        
             | ed25519FUUU wrote:
             | I'm curious what the cost would be to run this type of
             | hardware at _any_ cloud vendor? Does it even exist?
        
               | whitepoplar wrote:
               | Not completely comparable, but Hetzner offers the
               | following dedicated server that costs 637 euro/month
               | (maxed out):
               | 
               | - 32-core AMD EPYC
               | 
               | - 512GB ECC memory
               | 
               | - 8x 3.84TB NVMe datacenter drives
               | 
               | - Unmetered 1gbps bandwidth
        
               | dastbe wrote:
               | (I work at AWS, but this is just for fun)
               | 
               | Checking out the AWS side, the closest I think you'd get
               | is the x1.32xlarge, which would translate to 128 vCPUs
               | (which on Intel generally means 64 physical cores) and
               | close to 2 TB of RAM. NVMe storage is only a paltry 4 TB,
               | so you'd have to make up the rest with EBS volumes. You'd
               | also get a lower clock speed than they are getting out of
               | the EPYCs.
        
               | PeterCorless wrote:
               | _spit-takes reading the suggestion of replacing NVMe with
               | EBS_
               | 
               | I mean, yeah, I guess you _can_. But a lot depends on
               | your use case and SLA. If you need to keep ultra-low p99s
               | -- single digits -- then EBS is not a real option.
               | 
               | But if you don't mind latencies, then yeah, fine.
               | 
               | Don't get me wrong: EBS is great. But it's not a panacea
               | and strikes me as a mismatch for a high performance
               | monster system. If you need NVMe, you need NVMe.
        
               | WatchDog wrote:
               | If you eschew RDS, the largest you can go up to seems to
               | be a u-24tb1.metal.
               | 
               | 448 vcpu, 24TiB of RAM, $70 an hour. ~$52k per month.
        
               | jeffbee wrote:
               | That's the wrong way to think about the cloud. A better
               | way to think about it would be "how much database traffic
               | (and storage) can I serve from Cloud Whatever for $xxx".
               | Then you need to think about what your realistic
               | effective utilization would be. This server has 153600 GB
               | of raw storage. That kind of storage would cost you
               | $46000 (retail) in Cloud Spanner every month, but I doubt
               | that's the right comparison. The right math would
               | probably be that they have 250 million customers and
               | perhaps 1KB of real information per customer. Now the
               | question becomes why you would ever buy 24x6400 GB of
               | flash memory to store this scale of data.
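               | 
               | The arithmetic behind that figure, assuming a Cloud
               | Spanner storage list price of roughly $0.30 per
               | GB-month:
               | 
               |     raw_gb = 24 * 6400       # 153,600 GB of NVMe
               |     usd_per_gb_month = 0.30  # assumed list price
               |     monthly = raw_gb * usd_per_gb_month
               |     print(f"${monthly:,.0f}/month")  # $46,080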
        
       | ed25519FUUU wrote:
       | What a great read. I think the authors here made great hardware
       | and software decisions. OpenZFS is the way to go, and is so much
       | easier to manage than the legacy RAID controllers imho.
       | 
       | Ah, I miss actual hardware.
        
         | ganoushoreilly wrote:
         | I enjoyed it as well; I'm also appreciative that they shared
         | their configuration notes here. I've been running multiple data
         | stores on ZFS for years now, and it's taken a while to get out
         | of the hardware mindset (albeit you still need a nice beefy
         | controller anyway).
         | 
         | https://github.com/letsencrypt/openzfs-nvme-databases
        
         | gautamcgoel wrote:
         | Can you explain the advantages of OpenZFS over other
         | filesystems? I know FreeBSD uses ZFS, but I never really
         | understood how it stacks up relative to other technologies...
        
       | cbg0 wrote:
       | Unless I misunderstood something, it seems they have a single
       | primary that handles read+write and multiple read replicas for
       | it.
       | 
       | It shouldn't be too difficult given the current use of MariaDB to
       | start using something like Galera to create a multi-master
       | cluster and improve redundancy of the service, unless there are
       | some non-obvious reasons why they wouldn't be doing this.
       | 
       | I think I also see redundant PSUs, would be neat to know if
       | they're connected to different PDUs and if the networking is also
       | redundant.
        
         | jabberwcky wrote:
         | Multi-master hardly comes for free in terms of complexity or
         | performance; you're at the mercy of latency. Either host the
         | second master in the same building, in which case the
         | redundancy is an illusion, or host it somewhere else, in which
         | case watch your write rate tank.
         | 
         | Asynchronous streaming to a truly redundant second site often
         | makes more sense.
        
           | birdman3131 wrote:
           | How well would same city, with fiber between them, work?
        
         | lykr0n wrote:
         | That's still a very common pattern if you need maximum
         | performance, and can tolerate small periods of downtime. When
         | designing systems, you have to accept some drawbacks. You can
         | forgo a clustered database if you have a strong on call
         | schedule, and redundancy built in to other parts of your
         | infrastructure.
         | 
         | Galera is great, but you lose some functionality with
         | transactions and locking that could be a deal breaker. And up
         | until MySQL 8, there were some fairly significant barriers to
         | automation and clustering that could be a turn off for some
         | people.
         | 
         | Everything has its pros and cons.
        
       ___________________________________________________________________
       (page generated 2021-01-21 23:00 UTC)