[HN Gopher] The database servers powering Let's Encrypt
___________________________________________________________________
The database servers powering Let's Encrypt
Author : jaas
Score : 200 points
Date : 2021-01-21 17:25 UTC (5 hours ago)
(HTM) web link (letsencrypt.org)
(TXT) w3m dump (letsencrypt.org)
| rantwasp wrote:
| gets huge server. does not properly resize the jpeg on the page
| (it's 5megs in size and i see it loading). we don't all have 5TB
| of ram you know
| jaas wrote:
| Fixed!
| avipars wrote:
| those EPYC CPUs are really epic
| uncledave wrote:
| I'd like to understand more about the workload here. Queries per
| second, average result size, database size, query complexity etc.
| polskibus wrote:
| I wonder how much they save by not going public cloud?
| StreamBright wrote:
| >> We currently use MariaDB, with the InnoDB database engine.
|
| It is kind of funny how long InnoDB has been the most reliable
| storage engine. I am not sure if MyISAM is still trying to catch
| up; it used to be much worse than InnoDB. With the emergence of
| RocksDB there are multiple options today.
| VWWHFSfQ wrote:
| The only thing I ever used MyISAM tables for was storing blobs
| and full-text search on documents. If your data is mostly read-
| only then it's a decent option out of the box. But if you do
| even mildly frequent updates then you'll quickly run into
| problems with its table-level locking instead of the row-level
| locking offered by InnoDB.
| kirugan wrote:
| MyISAM's days are gone; no one will seriously consider it a
| suitable engine in MySQL.
| jeffbee wrote:
| Unless you've meticulously avoided it, it is quite likely that
| MySQL is using MyISAM on-disk temp tables in the service of
| your queries.
| kirugan wrote:
| I didn't know that, thanks!
| fipar wrote:
| Actually, that shouldn't be the case since 5.7:
| https://dev.mysql.com/doc/refman/5.7/en/server-system-
| variab... (and related:
| https://dev.mysql.com/doc/refman/5.7/en/server-system-
| variab...)
|
| And on MySQL 8, MyISAM is mostly gone; not even the 'mysql'
| schema uses it.
|
| Edit: Originally linked only to
| default_tmp_storage_engine.
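|
| (A quick way to check what your own server uses for temp
| tables: a minimal sketch with the mysql-connector-python
| driver. The connection details are placeholders, and I'm
| assuming the second variable queried is the related 5.7-era
| one the links above point at.)
|
|     import mysql.connector  # pip install mysql-connector-python
|
|     # Placeholder credentials -- point these at your own server.
|     conn = mysql.connector.connect(host="127.0.0.1", user="root",
|                                    password="secret")
|     cur = conn.cursor()
|     for var in ("default_tmp_storage_engine",
|                 "internal_tmp_disk_storage_engine"):
|         cur.execute("SHOW VARIABLES LIKE %s", (var,))
|         print(cur.fetchone())  # ('name', 'ENGINE') tuple
|     conn.close()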
| sradman wrote:
| > We can clearly see how our old CPUs were reaching their limit.
| In the week before we upgraded our primary database server, its
| CPU usage (from /proc/stat) averaged over 90%
|
| This strikes me as odd. In my experience, traditional OLTP row
| stores are I/O bound due to contention (locking and latching).
| Does anyone have an explanation for this?
|
| > Once you have a server full of NVMe drives, you have to decide
| how to manage them. Our previous generation of database servers
| used hardware RAID in a RAID-10 configuration, but there is no
| effective hardware RAID for NVMe, so we needed another
| solution... we got several recommendations for OpenZFS and
| decided to give it a shot.
|
| Again, traditional OLTP row stores have long included a
| mechanism for recovering from media failure: place the WAL log
| on a separate device from the DB. Early MySQL used a
| proprietary backup add-on as a revenue model, so maybe this
| technique is now obfuscated and/or missing. You may still
| need/want a mechanism to federate the DB devices, and
| incremental volume snapshots are far superior to full DB
| backups, but placing the WAL log on a separate device is a
| fantastic technique for both performance and availability.
|
| The Let's Encrypt post does not describe how they implement off-
| machine and off-site backup-and-recovery. I'd like to know if and
| how they do this.
| snuxoll wrote:
| MySQL backup is more or less a "solved" issue with xtrabackup,
| and I assume that's exactly what they are using with a database
| of this size.
| mike_d wrote:
| > traditional OLTP row stores are I/O bound due to contention
| (locking and latching). Does anyone have an explanation for
| this?
|
| I have seen CPU bound database servers when developers push
| application logic into the database. Everything from using
| server-side functions like MD5() to needless triggers and
| stored procedures that could have been done application side.
| cdcarter wrote:
| Indeed, at $PLACE_OF_WORK over the last 15 years a lot of
| logic was built in PL/SQL, and DB CPU has become one of our
| most precious resources. For some applications it's perfectly
| reasonable, until you need to scale horizontally.
| jeffbee wrote:
| Any MySQL with more than about 100 concurrent queries of the
| same InnoDB table is going to be CPU bound on locks. Their
| whole locking scheme doesn't scale; it's designed to look
| great in benchmarks with few clients.
| PeterCorless wrote:
| > This strikes me as odd. In my experience, traditional OLTP
| row stores are I/O bound due to contention (locking and
| latching). Does anyone have an explanation for this?
|
| Yes. My CTO, Avi Kivity did a great talk about this at Core C++
| 2019: https://www.scylladb.com/2020/03/26/avi-kivity-at-
| core-c-201...
|
| Let me boil it down to a few points; some beyond Avi's talk:
|
| * Traditional RDBMS with strong consistency and ACID guarantees
| are always going to exhibit delays. That's what you want them
| for. Slow, but solid.
|
| * Even many NoSQL databases written (supposedly) for High
| Availability still use highly synchronous mechanisms
| internally.
|
| * You need to think about a multi-processor, multi-core server
| _as its own network_ internally. You need to consider rewriting
| everything with the fundamental consideration of async
| processing, even within the same node. Scylla uses C++
| futures/promises, shared-nothing shard-per-core architecture,
| as well as new async methods like io_uring.
|
| * Between nodes, you also have to consider highly async
| mechanisms. For example, the tunable eventual consistency model
| you'd find in Cassandra or Scylla. While we also support Paxos
| for LWT, if you need strong linearizability, read-before-write
| conditional updates, that comes at a cost. Many classes of
| transactions will treat that as overkill.
|
| * And yes, backups are also a huge issue for those sorts of
| data volumes. Scylla, for example, has implemented different
| priority classes for certain types of activities. It handles
| all the scheduling between OLTP transactions as highest
| priority, while allowing the system to plug away at, say,
| backups or repairs.
|
| More on where we're going with all this is written in a blog
| about our new Project Circe:
|
| https://www.scylladb.com/2021/01/12/making-scylla-a-monstrou...
|
| But the main point is that you have to really think about how
| to re-architect your software to take advantage of huge multi-
| processor machines. If you invest in all this hardware, but
| your software limits how well you can use it, you're not
| getting the full bang for the buck you spent.
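|
| To make the shard-per-core idea concrete, here is a toy sketch
| (Python for brevity, purely illustrative; Scylla itself is
| C++): each key hashes to exactly one shard, so shards never
| share state or locks.
|
|     import asyncio, hashlib
|
|     N_SHARDS = 8  # e.g. one shard per core
|     stores = [{} for _ in range(N_SHARDS)]  # one dict per shard
|
|     def shard_of(key: str) -> int:
|         h = hashlib.blake2b(key.encode(), digest_size=8).digest()
|         return int.from_bytes(h, "little") % N_SHARDS
|
|     async def put(key, value):
|         # A real shard-per-core engine would post this to the
|         # owning core's queue; here we just index the owning shard.
|         stores[shard_of(key)][key] = value
|
|     async def main():
|         await asyncio.gather(*(put(f"k{i}", i) for i in range(100)))
|         print(sum(len(s) for s in stores))  # 100, spread over shards
|
|     asyncio.run(main())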
| e12e wrote:
| > The Let's Encrypt post does not describe how they implement
| off-machine and off-site backup-and-recovery. I'd like to know
| if and how they do this.
|
| The section:
|
| > There wasn't a lot of information out there about how best to
| set up and optimize OpenZFS for a pool of NVMe drives and a
| database workload, so we want to share what we learned. You can
| find detailed information about our setup in this GitHub
| repository.
|
| points to: https://github.com/letsencrypt/openzfs-nvme-
| databases
|
| Which states:
|
| > Our primary database server rapidly replicates to two others,
| including two locations, and is backed up daily. The most
| business- and compliance-critical data is also logged
| separately, outside of our database stack. As long as we can
| maintain durability for long enough to evacuate the primary
| (write) role to a healthier database server, that is enough.
|
| Which sounds like traditional master/slave setup, with fail
| over?
| sradman wrote:
| > Which sounds like traditional master/slave setup, with fail
| over?
|
| Yes, thank you. I assumed that the emphasis on the speed of
| NVMe drives meant that master/slave synchronous replication
| was avoided and asynchronous replication could not keep up.
| In my mind, this leaves room for interesting future
| efficiency/performance gains, especially surrounding the
| "...and is backed up daily" approach mentioned in your quote.
|
| The bottom line is that the old RPO (Recovery Point
| Objective) and RTO (Recovery Time Objective) are as important
| as ever.
| baby wrote:
| So scaling up instead of scaling out. I'm not sure it's a
| viable strategy long term; at the same time, we probably don't
| want a single CA to handle too many certificates?
| thamer wrote:
| Scaling up means each query is faster (3x in this particular
| case). Scaling out means they can support more clients/domains
| (more DB shards, more web servers, more concurrency, etc).
|
| These are two distinct axes that are not incompatible with each
| other.
| theandrewbailey wrote:
| I'm not aware of any other CA giving out free certificates to
| anyone. I know that some other providers/hosts will do free
| certificates, but only to their users (last time I checked).
| gbrown_ wrote:
| This is a very high-level overview; ideally I would have liked
| to see more application-level profiling, e.g. where time is
| being spent (be it on CPU or IO) within the DB rather than high
| level system stats. For example the following.
|
| > CPU usage (from /proc/stat) averaged over 90%
|
| Leaves me wondering exactly _which_ metric from /proc/stat
| they are referring to. It's presumably user time, but I just
| dislike attempts to distill systems performance into a few
| graph comparisons. In reality the realized performance of a
| system is better described by a narrative of what actually
| bottlenecks it.
| fabian2k wrote:
| I'm curious why they didn't go with the larger 64 core Epyc. I
| mean it's double the cost, but I suspect that the huge amount of
| NVMe SSDs is by far the largest part of the cost anyway. And it
| seems like CPU was the previous bottleneck as it was at 90%.
| jaas wrote:
| We didn't go with the 64-core chips because they have
| significantly lower clock speeds.
|
| Dual 32-core chips give us plenty of cores while keeping clocks
| higher for single-threaded performance.
|
| You are correct that the price of the CPUs is almost irrelevant
| to the overall cost of a system with this much memory and
| storage. We were picking the ideal CPU, not selecting on CPU
| price.
| _joel wrote:
| Out of interest, how is the NUMA on that setup? Does it still
| use QPI or is there a newer technology now? (I've been out of
| this space for a few years.)
| fabian2k wrote:
| Thanks for the answer. I would have guessed that the higher
| core count outweighs the lower frequency for database usage,
| but obviously I don't know the details. I think the 90% CPU
| usage graph just made me nervous enough to want the biggest
| possible CPU in there.
| sgt wrote:
| I'm guessing someone out there's thinking: Why aren't they
| hosting in the cloud? The cloud being either Amazon or Azure.
| Surely nothing else exists. Is it really possible to host your
| own PHYSICAL machine? Does that count as the cloud?!
| castillar76 wrote:
| First, this made me giggle because I run into that attitude all
| the time. "You're hosting things on a SERVER? Why would anyone
| do THAT? Heck, you should be putting everything in serverless
| and avoiding even the vague possibility that you would have to
| touch anything so degrading and low-class as an _operating
| system_. Systems administration? Who does that? "
|
| In all seriousness, however, the decision (likely) has very
| little to do with that. They're most likely not hosting in the
| cloud because the current CA/Browser Forum rules around the
| operation of public CAs effectively don't permit cloud hosting.
| That's a work in progress, but for the time being, the actual
| CA infrastructure can't be hosted in the cloud due to security
| and auditability requirements.
| speedgoose wrote:
| Yes it's still possible to put your own physical machines in a
| datacentre.
|
| For example : https://www.scaleway.com/en/dedibox/dedirack/
|
| I'm not sure you can say it's the cloud though.
|
| They are not hosting their database in a cloud like Amazon or
| Azure because no cloud provider offers that level of
| performance at a comparable price. Actually, I'm not even sure
| you can get a cloud VM with that much I/O, even if you don't
| mind the pricing.
| Symbiote wrote:
| We have a server with the hostname "cloud.example.com".
|
| It can help if someone wants their data "in the cloud".
| zimmerfrei wrote:
| They are a Public CA and they must undergo a third party
| compliance audit to operate. The conditions are such that you
| cannot really pass if your infra is in any of those public
| clouds.
| Neil44 wrote:
| Not sure if you're being sarcastic, but I reckon a couple of
| mil a year on AWS for anything similar, vs a one-off 200k. Not
| a bad saving.
| iruoy wrote:
| And a sizable internet bill I'd assume. This puppy ain't
| running on Gigabit.
| barkingcat wrote:
| For a service like letsencrypt, the independence factor is also
| a major reason for self hosting.
|
| I can foresee letsencrypt in the future building their own
| cloud (on their own physical infrastructure), but speaking
| as a letsencrypt user of their free certificate program, I
| would lose respect and interest in their service if they went
| with an AWS or GCP or Azure approach.
|
| The independence from other major players (and the ability of
| their team to change and move everything about their service,
| as needed) is one of the reasons I use letsencrypt.
| dingaling wrote:
| Funny you mention AWS as they're one of the corporate
| sponsors of LE.
|
| So long as they don't have a viable independent revenue
| stream they're arguably less independent than commercial CAs.
| [deleted]
| multifascia wrote:
| As someone unfamiliar with DB management, is it really less
| operational overhead to have to physically scale your hardware
| than to use a distributed option with more elastic scalability
| capabilities?
| anewaccount2021 wrote:
| As stated in the post, there are read replicas. Assuming their
| workload is primarily reads, this buys them a decent amount of
| redundancy.
| tanelpoder wrote:
| Also worth noting that scalability != efficiency. With enough
| NVMe drives, a single server can do _millions of IOPS_ and scan
| data at over 100 GB/s. A single PCIe 4.0 x4 SSD on my machine
| can do large I/Os at 6.8 GB/s rate, so 16 of them (with 4 x
| quad SSD adapter cards) in a 2-socket EPYC machine can do over
| 100 GB/s.
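|
| The back-of-the-envelope behind that claim, using the numbers
| above:
|
|     per_drive = 6.8      # GB/s, one PCIe 4.0 x4 SSD
|     drives = 16          # 4 x quad-SSD adapter cards
|     print(per_drive * drives)  # ~108.8 GB/s aggregate, in theory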
|
| You may need clusters, duplicated systems, replication, etc for
| resiliency reasons of course, but a single modern machine with
| lots of memory channels per CPU and PCIe 4.0 can achieve
| ridiculous throughput...
|
| edit: Here's an example of doing 11M IOPS with 10x Samsung Pro
| 980 PCIe 4.0 SSDs (it's from an upcoming blog entry):
|
| https://twitter.com/TanelPoder/status/1352329243070504964
| riku_iki wrote:
| > 16 of them (with 4 x quad SSD adapter cards) in a 2-socket
| EPYC machine can do over 100 GB/s.
|
| It is more interesting whether the actual CPU can handle such
| traffic in the context of a DB load: encoding/decoding records,
| sorting, searching, merging, etc.
| tanelpoder wrote:
| Yes, with modern storage, throughput is a CPU problem.
|
| And the CPU problem for OLTP databases is largely a memory
| access latency problem. For columnar analytics & complex
| calculations it's more about the CPU itself.
|
| When doing 1 MB sized I/Os for scanning, my 16c/32t (AMD
| Ryzen Threadripper Pro WX) CPUs were just about 10% busy.
| So, with a 64 core single socket ThreadRipper workstation
| (or 128-core dual socket EPYC server), there should be
| plenty of horsepower left.
| tanelpoder wrote:
| As I mentioned memory access latency - I just posted my
| old article series about measuring RAM access performance
| (using different database workloads) to HN and looks like
| it even made it to the front page (nice):
|
| https://news.ycombinator.com/item?id=25863093
| gpsar wrote:
| If the problem involves independent traversals,
| interleaving with coroutines is a practical way to hide
| latency https://dl.acm.org/doi/10.1145/3329785.3329917
| https://www.linkedin.com/pulse/dont-stall-multitask-
| georgios...
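|
| A rough illustration of the interleaving idea (Python
| generators standing in for coroutines; in C++ you would issue
| a prefetch at the marked point instead of merely yielding):
|
|     def lookup(sorted_keys, target):
|         """One independent binary-search traversal."""
|         lo, hi = 0, len(sorted_keys)
|         while lo < hi:
|             mid = (lo + hi) // 2
|             yield  # prefetch sorted_keys[mid] here, then suspend
|             if sorted_keys[mid] < target:
|                 lo = mid + 1
|             else:
|                 hi = mid
|         return lo
|
|     def interleave(traversals):
|         """Round-robin the traversals so memory stalls overlap."""
|         results, pending = {}, dict(enumerate(traversals))
|         while pending:
|             for i, t in list(pending.items()):
|                 try:
|                     next(t)
|                 except StopIteration as done:
|                     results[i] = done.value
|                     del pending[i]
|         return results
|
|     keys = list(range(0, 1_000_000, 7))
|     probes = [lookup(keys, t) for t in (3, 500_000, 999_999)]
|     print(interleave(probes))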
| jdsully wrote:
| It can burst to millions of IOPS, but you get killed on the
| sustained write workload. Even a high end enterprise NVMe
| drive will be limited to around 60k IOPS once you exceed its
| write cache.
| tanelpoder wrote:
| Yup indeed it's an issue with NAND SSDs - and it heavily
| depends on the type (SLC, MLC, TLC, QLC), vendor
| (controller, memory, "write buffer") sizes etc. I'm doing
| mostly read tests right now and will move to writes after.
|
| The Samsung PRO 980s I have are TLC for main storage, but
| apparently use some of that TLC storage as a faster
| write buffer (TurboWrite buffer) - I'm not an expert, but
| apparently the controller can decide to program the TLC
| NAND with only 1-bit "depth", they call it "simulated SLC"
| or something like that. On the 1 TB SSD, the turbowrite
| buffer can dynamically extend to ~100 GB, if there's unused
| NAND space on the disk.
|
| Btw, the 3DXpoint storage (Intel Optane SSDs & Micron X1)
| should be able to sustain crazy write rates too.
| speedgoose wrote:
| They may rely on the ACID properties of their database. Which
| makes everything simpler, easier, and safer.
|
| https://dev.mysql.com/doc/refman/8.0/en/mysql-acid.html
| baby wrote:
| I'm guessing that since it is for registration and all, the
| usage might be write-driven, or at least equally balanced
| between writes and reads.
|
| In addition, you really care about the integrity of your data,
| so you probably want serializability, to avoid concurrency and
| potential write/update conflicts, and to only do the writes on
| a single server.
|
| For this reason it sounds to me that partitioning/sharding is
| the only way to really scale this: have different write servers
| that care about different primary keys.
| sumtechguy wrote:
| That really depends on your software.
|
| With something NoSQL-style, being distributed is kind of built
| in, but that pushes the compute cost back onto the clients.
| Each node is 'crap' but you have hundreds so it does not
| matter.
|
| With something like SQL Server it comes down to how fast you
| can get the data out of the machine to clone it somewhere else
| (sharding/hashing, live/live backups, etc). This is disk,
| network, CPU. Usually in that order.
|
| In most of the ones I ever did it was almost always the network
| that was the bottleneck. With something like a 10Gb network
| card (state of the art at the time; I am sure you can buy
| better now) you were looking at saturation around 1 GB per
| second (if you were lucky). That is a big number. But depending
| on your input transaction rate and how the data is stored it
| can drop off dramatically. Put it local to the server and you
| can 10x that easily. Going out of node costs a huge amount of
| latency. Add in a requirement like 'offsite hot backup' and it
| slows down quickly.
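|
| For reference, the raw math behind that ~1 GB/s figure:
|
|     link_gbit = 10        # 10Gb NIC
|     print(link_gbit / 8)  # 1.25 GB/s before any overhead;
|                           # framing/protocol eat the rest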
|
| In the 'streaming' world like kafka you end up with a different
| style and lots of small processes/threads which live on 'meh'
| machines but you hash it and dump it out to other layers for
| storage of the results. But this comes at a cost of more
| hardware and network. Things like 'does the rack have enough
| power', 'do we have open ports', 'do we have enough licenses to
| run at the 10GB rate on this router'. 'how do we configure 100
| machines in the same way', 'how do we upgrade 100 machines in
| our allotted time'. You can fling that out to something like
| AWS but that comes at a monetary cost. But even virtual there
| is a management cost. Less boxes is less cost.
| stu2010 wrote:
| Relational databases enable some very flexible data access
| patterns. Once you shard, you lose a lot of that flexibility.
| If you move away from a relational model, you lose even more
| flexibility and start having to do much more work in your
| application layer, and usually start having to use more
| resources and developer time every step of the way.
|
| The productivity enabled by having one master RDBMS is a big
| deal, and if they can buy commodity servers that satisfy their
| requirements, this seems like a fine way to operate.
| hinkley wrote:
| If I had a billion dollars, I'd put a research group together
| to study the prospects of index sharding.
|
| That is, full table replication, but individual servers
| maintaining differing sets of indexes. OLAP and single
| request transactions could be routed to specialized replicas
| based on query planning, sending requests to machines that
| have appropriate indexes, and preferably ones where those
| indexes are hot.
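|
| A toy version of that routing decision (illustrative only; the
| replica layout and column names here are made up):
|
|     # Each replica holds the full table but maintains a
|     # different set of secondary indexes (hypothetical layout).
|     REPLICAS = {
|         "replica-a": {("issued_at",), ("account_id", "issued_at")},
|         "replica-b": {("serial",), ("domain_name",)},
|     }
|
|     def route(query_columns: tuple) -> str:
|         """Pick a replica with an index covering the query."""
|         for name, indexes in REPLICAS.items():
|             if query_columns in indexes:
|                 return name
|         return "primary"  # fall back to the full-service primary
|
|     print(route(("domain_name",)))         # -> replica-b
|     print(route(("not_indexed_column",)))  # -> primary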
| jandrewrogers wrote:
| This has been done in popular commercial databases for
| decades, and is thoroughly researched. As far as I know,
| these types of architectures are no longer used at this
| point due to their relatively poor scalability and write
| performance. I don't think anyone is designing new
| databases this way anymore, since it only ever made sense
| in the context of spinning disk.
|
| The trend has been away from complex specialization of data
| structures, secondary indexing, etc and toward more general
| and expressive internal structures (but with more difficult
| theory and implementation) that can efficiently handle a
| wider range of data models and workloads. Designers started
| moving on from btrees and hash tables quite a while ago,
| mostly for the write performance.
|
| Write performance is critical even for read-only analytical
| systems due to the size of modern data models. The initial
| data loading can literally take several months with many
| popular systems, even for data models that are not
| particularly large. Loading and indexing 100k records per
| second is a problem if you have 10T records.
| ddorian43 wrote:
| The problem is the network. You need a billion dollars to fix
| the network so it's as fast as local RAM/NVMe.
| jasonwatkinspdx wrote:
| I agree this is an underappreciated strategy. Someone in my
| family worked for a hedge fund where one of their simple
| advantages was they just ran MS SQL on the biggest physical
| machine available at any given moment. Lots of complexity
| dodged by just having a lot of brute capacity.
| [deleted]
| tyingq wrote:
| I was, long ago, an old-school Unix sysadmin. While I was
| technically aware of how powerful smallish servers have become,
| this article really crystallized that for me.
|
| 64 cores and 24 NVMe drives in a 2U spot on a rack is just insane
| compared to what we used to have to do to get a beefy database
| server. And it's not some exotic thing, just a popular mainstream
| Dell SKU.
|
| If you price it out on Dell's site, you get a retail price north
| of $200k. That is really what made it clear for me. That you
| could fit $200k+ worth of DIMMs, drives, and CPUs into a 2U spot :)
| theandrewbailey wrote:
| Dell-controlled VMWare is listed under "major sponsors and
| funders". I wonder if Let's Encrypt got a discount on that
| server. Good for them!
| wpietri wrote:
| Totally. 2 TB of RAM! In one box!
|
| I think the first servers I had in production had 8 MB RAM. No
| more, certainly. Soon we'll be at 1000x that. My dad's first
| "server" was 3 orders of magnitude smaller, with 8 KB of RAM
| (hand-wound wire core memory). In that time, the US population
| hasn't even doubled.
| jagger27 wrote:
| More like 1,000,000x
| vmception wrote:
| I have a motherboard from 2012 and I just put 2x 8TB NVMe SSDs
| on it, on a PCIe 2.0 x16 slot
|
| Works great. The PCIe card itself has 2 more slots for SSDs
|
| The GPU is on the 2.0 x8 slot because they don't really
| transfer that much data over the lanes.
|
| I honestly didn't realize PCIe was up to 4.0 now, and I am
| pushing up against the limits of PCIe 2.0 but it still works!
| And I'm "only" at the limits, and its only a limit when I want
| faster than 3,000 megabytes per second, which is amazing.
|
| Granted, this would have been considered a good enthusiast
| motherboard in 2012. Buying new but cheap is the mistake.
| foota wrote:
| What drives did you get? I think you need PCI 4 to stress
| most SSDs these days?
| vmception wrote:
| I have a hunch that the pcie card itself is most important
| as it is doing bifurcation.
|
| So each drive acts like it has its own slower (but fast
| enough) pcie slot, and then the raid0 combines the bits
| back to double the performance.
|
| Could be wrong but I get 2,900 megabytes per second
| transfers from RAM to disk and back.
|
| And this is PCIe 2.0 x16
|
| so maybe if you want 3,000, 4,000, or 6,500 megabytes per
| second then I have nothing to brag about. I'm pretty amazed
| though and will be content for all my use cases.
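|
| For context, PCIe 2.0 works out to roughly 500 MB/s per lane
| per direction (5 GT/s with 8b/10b encoding), so:
|
|     per_lane = 0.5        # GB/s, PCIe 2.0, per direction
|     print(per_lane * 4)   # ~2 GB/s ceiling per bifurcated x4 drive
|     print(per_lane * 16)  # ~8 GB/s ceiling for the whole x16 slot
|
| So ~2.9 GB/s from a two-drive RAID0 is well under the slot's
| limit, and not far from what two x4 slices can deliver.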
| ryanworl wrote:
| What are they storing on this server that requires 150TB of
| storage and millions of IOPS?
| abrookewood wrote:
| My thoughts exactly ... They are creating and storing text.
| Only thing I can think is that they don't actually need the
| storage, but just want the lowest possible latency by having a
| large number of drives.
| dewey wrote:
| > What exactly are we doing with these servers? Our CA
| software, Boulder, uses MySQL-style schemas and queries to
| manage subscriber accounts and the entire certificate issuance
| process.
| jeffbee wrote:
| There's nothing in that sentence that implies they'd need
| even 100 IOPS, much less 20 million.
| stefan_ wrote:
| What exactly needs to be stored once the certificate is
| created and published in the hash tree? It seems like the
| kind of data that possibly needn't be stored at all or onto
| something like Glacier for archival.
| Conclusionist wrote:
| Going to guess it's for OCSP responses.
| stefan_ wrote:
| I'm not sure, e.g. Chrome doesn't do OCSP by default,
| lots of embedded clients like curl won't either. Unless
| the protocol is terribly broken, that also seems like the
| kind of use case where 99% of queries just come out of
| cache and should never hit a database.
| siwyd wrote:
| FYI, the intermediate CAs signed by their new Root X2
| certificate won't have OCSP URLs anymore.
|
| Source: https://letsencrypt.org/2020/09/17/new-root-and-
| intermediate...
| cipherboy wrote:
| AFAIK, nobody has suggested removal of OCSP from end-
| entity certificates. This article you linked (and the
| comment you wrote) is purely about removal from
| intermediate CA certificates.
|
| The majority of OCSP traffic will probably be for end-
| entity certificates; most OCSP validation (in browsers
| and cryptographic libraries) is end-entity validation,
| not leaf-and-chain.
|
| Removal of intermediate CA's OCSP is probably not really
| relevant to their overall OCSP performance numbers (and
| if it was, it was likely cached already).
| gbrown_ wrote:
| The post doesn't specify requirements or application level
| targets for performance. They show a couple of good latency
| improvements but don't describe the business or technical
| impact. The closest we get is this.
|
| > If this database isn't performing well enough, it can cause
| API errors and timeouts for our subscribers.
|
| What are the SLOs? How were they being met (or not) before vs
| after the hardware upgrade? There's a lot of additional
| context that could have been added to this post. It's not a
| bad post, but it simply reduces down to 'this new hardware is
| faster than our old hardware.'
| tclancy wrote:
| +10% for the proper use of decimated
| MayeulC wrote:
| You mean, that one?
| https://en.wikipedia.org/wiki/Decimation_(Roman_army)
|
| Then it would be 90 ms -> 81 ms, not 90 ms -> 9 ms. The way I
| see it, at least. With proper decimation, 90% of what was there
| remains. ("removal of a tenth", as wikipedia puts it).
| jarym wrote:
| Just goes to show how much a single SQL server can scale
| before having to worry about sharding and horizontal scaling.
| nine_k wrote:
| Read performance is much easier to scale (in one box or
| several) than write performance. It's usually the writes that
| make you look at Cassandra and similar, instead of adding more
| disks and RAM, or spinning another read-only replica.
|
| 24 NVMEs should have _a lot_ of write throughput, though.
| cperciva wrote:
| Based on their stated 225M sites and a renewal period of 90
| days, they're probably averaging around 40 certificates per
| second. That's only an order of magnitude higher than bitcoin;
| I wouldn't call it an indication of an ability to scale to a
| particularly large amount of traffic.
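|
| Rough math (a straight 90-day cycle gives a lower figure; if
| most clients renew with 30 days left, i.e. roughly every 60
| days, you land near 40/s):
|
|     sites = 225_000_000
|     print(sites / (90 * 86_400))  # ~29 issuances/second
|     print(sites / (60 * 86_400))  # ~43 issuances/second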
| tyingq wrote:
| Does "certbot renew" talk to the mothership at all if no
| certs are ready for renewal? If it does, most setups I've
| seen run the renewal once or twice a day since it only does
| the renew when you're down to 30 days left. There may also be
| some OCSP related traffic.
| mike_d wrote:
| Certbot will look at the expiration timestamp on your local
| certs without talking to Lets Encrypt.
| PeterCorless wrote:
| Yes. They are not doing a very heavy computational workload.
| Typical heavy-duty servers these days can do 100k's or
| millions of TPS. 40 TPS is a really, really, really light
| load.
|
| Further, I was looking at those new server specs. There's an
| error I think? The server config on the Dell site shows 2x 8
| GB DIMMs, for 16 GB RAM per server, whereas the article says
| 2 TB!
|
| With only 16GB of RAM, but 153.6 TB of NVMe storage, the real
| issue here is memory limitation for a general-purpose SQL
| database or a typical high-availability NoSQL database.
|
| Check my math: 153600 GB storage / 16 GB memory = 9600:1
| ratio
|
| Consider, by comparison that a high data volume AWS
| i3en.24xlarge has 60TB of NVMe storage but 768 GB of RAM. A
| 78:1 ratio.
|
| If the article is correct, and the error is in the config on
| the Dell page (not the blog), and this server is actually 2
| TB RAM, then that's another story. That'd make it a ratio of
| 153600 / 2000 = ~77:1.
|
| Quite in line with the AWS i3en.
|
| But then it would baffle me why you would only get 40 TPS out
| of such a beast.
|
| Check my logic. Did I miss something?
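|
| The ratios in question:
|
|     nvme_gb = 24 * 6400    # 153,600 GB of raw NVMe
|     print(nvme_gb / 16)    # 9600:1 if the box really had 16 GB
|     print(nvme_gb / 2000)  # ~77:1 with the 2 TB from the article
|     print(60_000 / 768)    # ~78:1 for an i3en.24xlarge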
| RL_Quine wrote:
| Bitcoin is ECDSA verification, letsencrypt is generating RSA
| signatures, the two aren't even remotely comparable.
| schoen wrote:
| But the cryptographic bottleneck is in HSMs, not in
| database servers (database servers don't _generate_ the
| digital signatures, they just have to store them after
| they've been generated).
| ed25519FUUU wrote:
| This is especially true with your own hardware. Trying this
| kind of thing in the cloud is usually prohibitively expensive.
| jbverschoor wrote:
| It doesn't have to. Unless you're conditioned to believe
| that aws is cheap
| [deleted]
| ed25519FUUU wrote:
| I'm curious what the cost would be to run this type of
| hardware at _any_ cloud vendor? Does it even exist?
| whitepoplar wrote:
| Not completely comparable, but Hetzner offers the
| following dedicated server that costs 637 euro/month
| (maxed out):
|
| - 32-core AMD EPYC
|
| - 512GB ECC memory
|
| - 8x 3.84TB NVMe datacenter drives
|
| - Unmetered 1gbps bandwidth
| dastbe wrote:
| (I work at AWS, but this is just for fun)
|
| Checking out AWS side, the closest I think you'd get is
| the x1.32xlarge, which would translate to 128 vCPU (which
| on intel generally means 64 physical cores) and close to
| 2TB of RAM. nvme storage is only a paltry 4TB, so you'd
| have to make up the rest with EBS volumes. You'd also get
| a lower clock speed than they are getting out of the
| EPICs
| PeterCorless wrote:
| _spittakes reading the suggestion of replacing NVMe with
| EBS_
|
| I mean, yeah, I guess you _can_. But a lot depends on
| your use case and SLA. If you need to keep ultra-low p99s
| -- single digits -- then EBS is not a real option.
|
| But if you don't mind latencies, then yeah, fine.
|
| Don't get me wrong: EBS is great. But it's not a panacea
| and strikes me as a mismatch for a high performance
| monster system. If you need NVMe, you need NVMe.
| WatchDog wrote:
| If you eschew RDS, the largest you can go up to seems to
| be a u-24tb1.metal.
|
| 448 vcpu, 24TiB of RAM, $70 an hour. ~$52k per month.
| jeffbee wrote:
| That's the wrong way to think about the cloud. A better
| way to think about it would be "how much database traffic
| (and storage) can I serve from Cloud Whatever for $xxx".
| Then you need to think about what your realistic
| effective utilization would be. This server has 153600 GB
| of raw storage. That kind of storage would cost you
| $46000 (retail) in Cloud Spanner every month, but I doubt
| that's the right comparison. The right math would
| probably be that they have 250 million customers and
| perhaps 1KB of real information per customer. Now the
| question becomes why you would ever buy 24x6400 GB of
| flash memory to store this scale of data.
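|
| Working backwards from those numbers (the per-GB rate is just
| what the quoted figure implies, not checked against current
| pricing):
|
|     raw_gb = 153_600
|     monthly = 46_000                  # quoted retail figure
|     print(monthly / raw_gb)           # ~$0.30 per GB-month implied
|     print(250_000_000 * 1_000 / 1e9)  # ~250 GB at ~1 KB/customer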
| ed25519FUUU wrote:
| What a great read. I think the authors here made great hardware
| and software decisions. OpenZFS is the way to go, and is so much
| easier to manage than the legacy RAID controllers imho.
|
| Ah, I miss actual hardware.
| ganoushoreilly wrote:
| I enjoyed it as well. I'm also appreciative that they shared
| their configuration notes here. I've been running multiple data
| stores on ZFS for years now and it's taken a while to get out
| of the hardware mindset (albeit you still need a nice beefy
| controller anyway).
|
| https://github.com/letsencrypt/openzfs-nvme-databases
| gautamcgoel wrote:
| Can you explain the advantages of OpenZFS over other
| filesystems? I know FreeBSD uses ZFS, but I never really
| understood how it stacks up relative to other technologies...
| cbg0 wrote:
| Unless I misunderstood something, it seems they have a single
| primary that handles read+write and multiple read replicas for
| it.
|
| It shouldn't be too difficult given the current use of MariaDB to
| start using something like Galera to create a multi-master
| cluster and improve redundancy of the service, unless there are
| some non-obvious reasons why they wouldn't be doing this.
|
| I think I also see redundant PSUs; it would be neat to know if
| they're connected to different PDUs and if the networking is
| also redundant.
| jabberwcky wrote:
| Multi-master hardly comes for free in terms of complexity or
| performance; you're at the mercy of latency. Either host the
| second master in the same building, in which case the
| redundancy is an illusion, or host it somewhere else, in which
| case watch your write rate tank.
|
| Asynchronous streaming to a truly redundant second site often
| makes more sense.
| birdman3131 wrote:
| How well would a same-city setup with fiber between them work?
| lykr0n wrote:
| That's still a very common pattern if you need maximum
| performance, and can tolerate small periods of downtime. When
| designing systems, you have to accept some drawbacks. You can
| forgo a clustered database if you have a strong on call
| schedule, and redundancy built in to other parts of your
| infrastructure.
|
| Galera is great, but you lose some functionality with
| transactions and locking that could be a deal breaker. And up
| until MySQL 8, there were some fairly significant barriers to
| automation and clustering that could be a turn off for some
| people.
|
| Everything has its pros and cons.
___________________________________________________________________
(page generated 2021-01-21 23:00 UTC)