[HN Gopher] How Discord supercharges network disks for extreme l...
___________________________________________________________________
How Discord supercharges network disks for extreme low latency
Author : techinvalley
Score : 177 points
Date : 2022-08-15 19:24 UTC (3 hours ago)
(HTM) web link (discord.com)
(TXT) w3m dump (discord.com)
| throwdbaaway wrote:
| That's a very smart solution indeed. I wonder if it is also
| possible to throw more memory at the problem? Managing instances
| with locally attached disks on the cloud is a bit of a pain, as
| you don't have the capability to stop/start the instances
| anymore.
| skyde wrote:
| The database should be able to be configured with a cache device.
|
| Using a RAID manager or a filesystem for this does not seem
| optimal.
| skyde wrote:
 | Actually, something like ReadySet[1], but using a swap drive
 | instead of memory to store the cached rows, would work very
 | well.
|
| [1] https://readyset.io/blog/introducing-readyset
| JanMa wrote:
 | How does this setup handle maintenance of the underlying
| hypervisor host? As far as I know the VM will be migrated to a
| new hypervisor and all data on the local SSDs is lost. Can the
| custom RAID0 array of local SSDs handle this or does it have to
| be manually rebuilt on every maintenance?
| ahepp wrote:
| From the article:
|
| > GCP provides an interesting "guarantee" around the failure of
| Local SSDs: If any Local SSD fails, the entire server is
| migrated to a different set of hardware, essentially erasing
| all Local SSD data for that server.
|
 | I wonder how md handles reads during the rebuild, and how long
 | it takes to replicate the persistent store back onto the raid0
 | stripe.
| jhgg wrote:
| On GCP, live migration moves the data on the local-ssd to the
| new host as well.
| merb wrote:
 | I'd really like to know how much CPU this solution costs. Most
 | of the time mdraid adds additional CPU time (not that it
 | matters to them that much).
| idorosen wrote:
| If it's just raid0 and raid1, then there's likely not any
| significant CPU time or CPU overhead involved. Most
| southbridges or I/O controllers support mirroring directly, and
| md knows how to use them. Most virtualized disk controllers do
| this in hardware as well.
|
| CPU overhead comes into play when you're doing parity on a
| software raid setup (like md or zfs) such as in md raid5 or
| raid6.
|
| If they needed data scrubbing at a single host level like zfs
| offers, then probably CPU would be a factor, but I'm assuming
| they achieve data integrity at a higher level/across hosts,
| such as in their distributed DB.
| nvarsj wrote:
| As an aside, I'm always impressed by Discord's engineering
| articles. They are incredibly pragmatic - typically using
 | commonly available OSS to solve big problems. If this were
 | another unicorn company, they would have instead written a
 | custom disk controller in Rust, called it a Greek name, and
 | done several major conference talks on their unique innovation.
| jfim wrote:
| There are some times when writing a custom solution does make
| sense though.
|
| In their case, I'm wondering why the host failure isn't handled
| at a higher level already. A node failure causing all data to
| be lost on that host should be handled gracefully through
| replication and another replica brought up transparently.
|
 | In any case, their use of local storage as a write-through
 | cache through md is pretty interesting. I wonder if it would
 | work the other way around for reading.
| mikesun wrote:
| Scylla (and Cassandra) provides cluster-level replication.
| Even with only local NVMes, a single node failure with loss
| of data would be tolerated. But relying on "ephemeral local
| SSDs" that nodes can lose if any VM is power-cycled adds
| additional risk that some incident could cause multiple
| replicas to lose their data.
| stevenpetryk wrote:
| That's a common theme here. We try to avoid making tools into
| projects.
| Pepe1vo wrote:
| Discord has done its fair share of RIIR though[0][1] ;)
|
| [0] https://discord.com/blog/why-discord-is-switching-from-go-
| to...
|
| [1] https://discord.com/blog/using-rust-to-scale-elixir-
| for-11-m...
| whizzter wrote:
| They've probably subscribed to Taco Bell Programming.
|
| http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
| MonkeyMalarky wrote:
| The article is from 2010 and uses the term DevOps! Just how
| long has that meme been around?
| Sohcahtoa82 wrote:
| Oooh, I like this. I gotta remember this term.
| tablespoon wrote:
| >> http://widgetsandshit.com/teddziuba/2010/10/taco-bell-
| progra...
|
| > Here's a concrete example: suppose you have millions of web
| pages that you want to download and save to disk for later
| processing. How do you do it? The cool-kids answer is to
| write a distributed crawler in Clojure and run it on EC2,
| handing out jobs with a message queue like SQS or ZeroMQ.
|
| > The Taco Bell answer? xargs and wget. In the rare case that
| you saturate the network connection, add some split and
| rsync. A "distributed crawler" is really only like 10 lines
| of shell script.
|
| I generally agree, but it's probably only 10 lines if you
| assume you never have to deal with any errors.
| derefr wrote:
| That can be solved at the design level: write your get step
| as an idempotent "only do it if it isn't already done"
| creation operation for a given output file -- like a make
| target, but no need to actually use Make (just a `test -f
| || ...`.)
|
| Then run your little pipeline _in a loop_ until it stops
| making progress (`find | wc` doesn't increase.) Either it
| finished, or everything that's left as input represents one
| or more classes of errors. Debug them, and then start it
| looping again :)
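 | A minimal sketch of that design in shell (the directory
 | layout, URL-list name, and hashing step are made up for
 | illustration):
 |
 |     #!/bin/sh
 |     # fetch.sh -- idempotent: only fetches pages not already on disk
 |     mkdir -p pages
 |     while read -r url; do
 |       out="pages/$(printf '%s' "$url" | md5sum | cut -d' ' -f1).html"
 |       [ -f "$out" ] || wget -q -O "$out" "$url"
 |     done < urls.txt
 |
 | Drive it in a loop until `find pages -type f | wc -l` stops
 | growing.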
| dekhn wrote:
 | Not redoing steps that appear to be already done has its own
 | challenges -- for example, a transfer that broke halfway
 | through might leave a destination file, but not represent a
 | completion (typically dealt with by writing to a temp file and
 | renaming).
|
| The issue here is that your code has no real-time
| adaptability. Many backends will scale with load up to a
| point then start returning "make fewer requests".
| Normally, you implement some internal logic such as
| randomized exponential backoff retries (amazingly, this
| is a remarkably effective way to automatically find the
| saturation point of the cluster), although I have also
| seen some large clients that coordinate their fetches
| centrally using tokens.
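 | A minimal bash sketch of both of those mitigations (the retry
 | count and backoff cap are arbitrary):
 |
 |     fetch() {
 |       url=$1; out=$2
 |       for attempt in 1 2 3 4 5; do
 |         # download to a temp file and rename, so a partial
 |         # transfer never looks like a completed one
 |         if wget -q -O "$out.tmp" "$url"; then
 |           mv "$out.tmp" "$out"
 |           return 0
 |         fi
 |         # randomized exponential backoff before the next attempt
 |         sleep $(( (2 ** attempt) + (RANDOM % 5) ))
 |       done
 |       return 1
 |     }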
| derefr wrote:
| Having that logic in the same place as the work of
| actually driving the fetch/crawl, though, is a violation
| of Unix "small components, each doing one thing"
| thinking.
|
| You know how you can rate-limit your requests? A forward
| proxy daemon that rate-limits upstream connections by
| holding them open but not serving them until the timeout
| has elapsed. (I.e. Nginx with five lines of config.) As
| long as your fetcher has a concurrency limit, stalling
| some of those connections will lead to decreased
| attempted throughput.
|
| (This isn't just for scripting, either; it's also a near-
| optimal way to implement global per-domain upstream-API
| rate-limiting in a production system that has multiple
| shared-nothing backends. It's Istio/Envoy "in the
| small.")
| dekhn wrote:
 | Setting up nginx means one more server to manage (and it isn't
 | particularly a small component doing one thing).
|
| Having built several large distributed computing systems,
| I've found that the inner client always needs to have a
| fair amount of intelligence when talking to the server.
| That means responding to errors in a way that doesn't
| lead to thundering herds. The nice thing about this is
| that, like modern TCP, it auto-tunes to the capacity of
| the system, while also handling outages well.
| hn_version_0023 wrote:
 | I'd add GNU _parallel_ to the tools used; I've written this
 | exact crawler that way, saving screenshots using
 | _ghostscript_, IIRC.
| Sohcahtoa82 wrote:
 | Fair, but a simple Python script could probably handle it.
| Still don't need a message queue.
| skrtskrt wrote:
| Errors, retries, consistency, observability. For actual
| software running in production, shelling out is way too
| flaky and unobservable
| bbarnett wrote:
| It's not flaky at all, it is merely that most people
| don't code bash/etc to catch errors, retry on failure,
| etc, etc.
|
 | I will 100% agree that it has disadvantages, but it's
 | unfair to level the above at shell scripts, for most of
 | your complaint is about poorly coded shell scripts.
|
| An example? sysvinit is a few C programs, and all of it
| wrapped in bash or sh. It's far more reliable than
| systemd ever has been, with far better error checking.
|
 | Part of this is simplicity. 100 lines of code is better
 | than 10k lines. "Whole scope" on one page shouldn't be
 | underestimated for debugging and comprehension, which also
 | makes error checking easier.
| skrtskrt wrote:
| Can I, with off the shelf OSS tooling, _easily_ trace
| that code that's "just wget and xargs", emit metrics and
| traces to collectors, differentiate between all the
| possible network and http failure errors, retry
| individual requests with backoff and jitter, allow
| individual threads to fail and retry those without
 | borking the root program, and write the results to a
 | datastore in an idempotent way, and allow a junior developer
 | to contribute to it with little ramp-up?
|
 | It's not about "can bash do it"; it's about "is there a
| huge ecosystem of tools, which we are probably already
| using in our organization, that thoroughly cover all
| these issues".
| lordpankake wrote:
| Awesome article!
| [deleted]
| shrubble wrote:
| Key sentence and a half:
|
| 'Discord runs most of its hardware in Google Cloud and they
| provide ready access to "Local SSDs" -- NVMe based instance
| storage, which do have incredibly fast latency profiles.
| Unfortunately, in our testing, we ran into enough reliability
| issues'
| why_only_15 wrote:
| Don't all physical SSDs have reliability issues? There's a good
| reason we replicate data across devices.
| legulere wrote:
 | Basically, they need to solve local SSDs lacking needed
 | features and persistent disks having too-high latency, which
 | they do with:
|
| > essentially a write-through cache, with GCP's Local SSDs as the
| cache and Persistent Disks as the storage layer.
| [deleted]
| ahepp wrote:
 | I found it worth noting that the cache is implemented primarily
 | through Linux's built-in software RAID system, md: the SSDs in
 | raid0 (stripe), mirrored with the persistent disk in raid1.
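 | For the curious, the general shape of that setup with mdadm
 | looks roughly like this (device names are hypothetical, and
 | this is a sketch of the technique rather than Discord's exact
 | commands):
 |
 |     # stripe the ephemeral local NVMe SSDs into one fast device
 |     mdadm --create /dev/md0 --level=0 --raid-devices=4 \
 |           /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
 |
 |     # mirror the stripe with the network persistent disk;
 |     # marking the persistent disk "write-mostly" steers reads
 |     # to the local SSDs while writes still hit both members
 |     mdadm --create /dev/md1 --level=1 --raid-devices=2 \
 |           /dev/md0 --write-mostly /dev/sdb
 |
 |     mkfs.xfs /dev/md1    # Scylla requires XFS (per a comment below)
 |
 | The filesystem (and the database) then sits on md1; md serves
 | reads from the member not flagged write-mostly, i.e. the SSD
 | stripe.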
| hardware2win wrote:
| It sounds like intel optane persistent memory could work here
| wmf wrote:
| Sadly Optane cost a fortune, was never available in most
| clouds, and has now been canceled.
| davidw wrote:
| Regarding ScyllaDB:
|
| > while achieving significantly higher throughputs and lower
| latencies [compared to Cassandra]
|
| Do they really get all that just because it's in C++? Anyone
| familiar with both of them?
| doubledad222 wrote:
| This was a very interesting read. The detail on the exploratory
| process was perfect. Can't wait for part two.
| javier_e06 wrote:
| Yes, RAID is all that. It's interesting to see such established
| technology shine and shine again.
| madars wrote:
| Q about a related use case: Can I use tiered storage (e.g., SSD
| cache in front of a HDD with, say, dm-cache), or an md-based
| approach like Discord's, to successfully sync an Ethereum node?
| Everyone says "you should get a 2TB SSD" but I'm wondering if I
| can be more future-proof with say 512 GB SSD cache + much larger
| HDD.
| pas wrote:
| Sure, you might want to look into bcache or bcachefs.
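 | A rough sketch of what that looks like with bcache (device
 | names are hypothetical):
 |
 |     # big, slow backing device (the HDD); exposed as /dev/bcache0
 |     make-bcache -B /dev/sdb
 |     # fast cache device (the 512 GB SSD)
 |     make-bcache -C /dev/nvme0n1
 |     # attach the cache set to the backing device, using the
 |     # cache-set UUID printed by bcache-super-show /dev/nvme0n1
 |     echo <cset-uuid> > /sys/block/bcache0/bcache/attach
 |     # writeback mode helps most with random-write-heavy loads
 |     echo writeback > /sys/block/bcache0/bcache/cache_mode
 |
 | Whether it actually helps depends on the access pattern, as the
 | sibling comment below explains.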
| Nextgrid wrote:
| I'm not familiar with what Ethereum requires for its syncing
| operation.
|
| If it's a sequential write (by downloading the entire
| blockchain), you will still be bottlenecked by the throughput
| of the underlying disk.
|
| If it's sequential reads (in between writes), the reads can be
| handled by the cache if the location is local enough to the
| previous write operation that it hasn't been evicted yet.
|
| If it's random unpredictable reads, it's unlikely a cache will
| help unless the cache is big enough to fit the entire working
| dataset (otherwise you'll get a terrible cache hit rate as most
| of what you need would've been evicted by then) but then you're
| back at your original problem of needing a huge SSD.
| alberth wrote:
| TL;DR - NAS is slow. RAID 0 is fast.
| wmf wrote:
| This is not an accurate summary of the article.
| [deleted]
| PeterWhittaker wrote:
| I feel like I am missing a step. Do they write to md1 and read
| from md0? Or do they read from md1, and under the covers the read
| is likely fulfilled by md0?
| ahepp wrote:
| Is it necessary to fully replicate the persistent store onto the
| striped SSD array? I admire such a simple solution, but I wonder
| if something like an LRU cache would achieve similar speedups
| while using fewer resources. On the other hand, it could be a
| small cost to pay for a more consistent and predictable workload.
|
 | How does md handle a synchronous write in a heterogeneous
| Does it wait for both devices to be written?
|
| I'm also curious how this solution compares to allocating more
| ram to the servers, and either letting the database software use
| this for caching, or even creating a ramdisk and putting that in
| raid1 with the persistent storage. Since the SSDs are being
| treated as volatile anyways. I assume it would be prohibitively
| expensive to replicate the entire persistent store into main
| memory.
|
| I'd also be interested to know how this compares with replacing
| the entire persistent disk / SSD system with zfs over a few SSDs
 | (which would also allow snapshotting). Of course it is probably a
| huge feature to be able to have snapshots be integrated into your
| cloud...
| mikesun wrote:
| > Is it necessary to fully replicate the persistent store onto
| the striped SSD array? I admire such a simple solution, but I
| wonder if something like an LRU cache would achieve similar
| speedups while using fewer resources. On the other hand, it
| could be a small cost to pay for a more consistent and
| predictable workload.
|
 | One of the reasons an LRU cache like dm-cache wasn't feasible
 | was that we had a higher-than-acceptable bad-sector read rate,
 | which would cause a cache like dm-cache to bubble a block
 | device error up to the database. The database would then shut
 | itself down when it encountered a disk-level error.
|
 | > How does md handle a synchronous write in a heterogeneous
 | mirror? Does it wait for both devices to be written?
 |
 | Yes, md waits for both mirrors to be written.
|
 | > I'm also curious how this solution compares to allocating
 | more ram to the servers, and either letting the database
 | software use this for caching, or even creating a ramdisk and
 | putting that in raid1 with the persistent storage. Since the
 | SSDs are being treated as volatile anyways. I assume it would
 | be prohibitively expensive to replicate the entire persistent
 | store into main memory.
 |
 | Yeah, we're talking many terabytes.
|
 | > I'd also be interested to know how this compares with
 | replacing the entire persistent disk / SSD system with zfs over
 | a few SSDs (which would also allow snapshotting). Of course it
 | is probably a huge feature to be able to have snapshots be
 | integrated into your cloud...
 |
 | We would have loved to use ZFS, but Scylla requires XFS.
| cperciva wrote:
| _4 billion messages sent through the platform by millions of
| people per day_
|
| I wish companies would stop inflating their numbers by citing
| "per day" statistics. 4 billion messages per day is less than
| 50k/second; that sort of transaction volume is well within the
| capabilities of pgsql running on midrange hardware.
| [deleted]
| combyn8tor wrote:
| Is there a template for replying to hackernews posts linking to
| engineering articles?
|
 | 1. Cherry-pick a piece of info
 | 2. Claim it's not that hard/impressive/large
 | 3. Claim it can be done much simpler with <insert
 | database/language/hardware>
| zorkian wrote:
| (I work at Discord.)
|
| It's not even the most interesting metric about our systems
| anyway. If we're really going to look at the tech, the
| inflation of those metrics to deliver the service is where the
| work generally is in the system --
|
 | * 50k+ QPS (average) for new message inserts
 | * 500k+ QPS when you factor in deletes, updates, etc.
 | * 3M+ QPS looking at db reads
 | * 30M+ QPS looking at the gateway websockets (fanout of things
 | happening to online users)
|
 | But I hear you; we're conflating some marketing metrics with
 | technical metrics. We'll take that feedback for next time.
| cperciva wrote:
| Ideally I'd like to hear about messages per second at the
| 99.99th percentile or something similar. That number says far
| more about how hard it is to service the load than a per-day
| value ever will.
| pixl97 wrote:
| 50k/s doesn't tell us about the burstiness of the messages.
| Most places in the US don't have much traffic at 2AM
| bearjaws wrote:
 | Absolutely not. This is an average; I bet their peak could be
 | in the mid hundreds of thousands -- imagine a hype moment in an
 | LCS game.
|
| It can in _theory_ work, but the real world would make this the
| most unstable platform of all the messaging platforms.
|
| Just one vacuum would bring this system down, even if it wasn't
| an exclusive lock... Also I would be curious how you would
| implement similar functionality to the URL deep linking and
| image posting / hosting.
|
| Mind you the answers to these will probably increase average
| message size. Which means more write bandwidth.
|
| Some bar napkin math shows this would be around 180GiB per
| hour, 24/7, 4.3TiB per day.
|
 | Unless all messages disappeared within 20 days, you would
 | exceed pretty much any reasonable single-server NVMe setup.
 | Also have fun with TRIM and optimizing NVMe write performance,
 | which is also going to diminish as all the drives fail due to
 | write wear...
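 | For reference, the arithmetic behind that assumes an average
 | message size of roughly 1 KiB (an assumption, not a figure
 | from the article):
 |
 |     $ echo $(( 50000 * 1024 * 3600 / 2**30 )) GiB/hour
 |     171 GiB/hour
 |     $ echo $(( 50000 * 1024 * 86400 / 2**40 )) TiB/day
 |     4 TiB/day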
| cperciva wrote:
| _Absolutely not. This is an average, I bet their peak could
| be in the mid hundreds of thousands_
|
| That's my point. Per-second numbers are far more useful than
| per-day numbers.
| marcinzm wrote:
 | I mean, they literally list the per-second numbers a couple of
 | paragraphs down:
|
| >Our databases were serving around 2 million requests per
| second (in this screenshot.)
|
| I doubt pgsql will have fun on mid-range hardware with 50k
| writes/sec, ~2 million reads/sec and 4 billion additional rows
| per day with few deletions.
| merb wrote:
 | Actually, 'few deletions' and 'few updates' are basically the
 | happy case for postgres. MVCC in an append-only system is
 | where it shines (because you generate far fewer dead tuples,
 | so vacuum is not that big of a problem).
| dist1ll wrote:
| That's only messages _sent_. The article cites 2 million
| queries per second hitting their cluster, which are not served
| by a CDN. Considering latency requirements and burst traffic,
 | you're looking at a pretty massive load.
| mbesto wrote:
| What a myopic comment.
|
| > 50k/second
|
 | Yes, 50k/second sustained every second of the day, 24/7/365.
 | Very few companies can quote that.
|
| Not to mention:
|
| - Has complex threaded messages
|
| - Geo-redundancies
|
 | - Those messages are real-time
|
| - Global user base
|
 | - An untold number of features related to messaging (bot
 | reactions, ACLs, permissions, privacy, formatting, reactions,
 | etc.)
|
| - No/limited downtime, live updates
|
 | Discord is technically impressive; not sure why you felt you
 | had to diminish that.
| sammy2255 wrote:
 | Geo redundancies? Discord is all in us-east1 GCP. Unless you
 | meant AZ redundancy?
| mbesto wrote:
| > We are running more than 850 voice servers in 13 regions
| (hosted in more than 30 data centers) all over the world.
| This provisioning includes lots of redundancy to handle
| data center failures and DDoS attacks. We use a handful of
| providers and use physical servers in their data centers.
| We just recently added a South Africa region. Thanks to all
| our engineering efforts on both the client and server
| architecture, we are able to serve more than 2.6 million
| concurrent voice users with egress traffic of more than 220
| Gbps (bits-per-second) and 120 Mpps (packets-per-second).
|
| I don't see anything on their messaging specifically, just
| assuming they would have something similar.
|
| https://discord.com/blog/how-discord-handles-two-and-half-
| mi...
| zorkian wrote:
| Our messaging stack is not currently multi-regional,
 | unfortunately. This is in the works, but it's a
| fairly significant architectural evolution from where we
| are today.
|
| Data storage is going to be multi-regional soon, but
| that's just from a redundancy/"data is safe in case of
| us-east1 failure" scenario -- we're not yet going to be
| actively serving live user traffic from outside of us-
| east1.
| t0mas88 wrote:
| Clever trick. Having dealt with very similar things using
| Cassandra, I'm curious how this setup will react to a failure of
 | a local NVMe disk.
|
| They say that GCP will kill the whole node, which is probably a
| good thing if you can be sure it does that quickly and
| consistently.
|
| If it doesn't (or not fast enough) you'll have a slow node
| amongst faster ones, creating a big hotspot in your database.
| Cassandra doesn't work very well if that happens and in early
| versions I remember some cascading effects when a few nodes had
| slowdowns.
| TheGuyWhoCodes wrote:
 | That's always a risk when using local drives and needing to
 | rebuild when a node dies, but I guess they can over-provision
 | in case of one node failure in the cluster until the cache is
 | warmed up.
|
 | Edit: Just wanted to add that because they are using Persistent
 | Disks as the source of truth, and depending on the network
 | bandwidth, it might not be that big of a problem to restore a
 | node to a working state if it's using a quorum for reads and RF
 | >= 3.
|
 | Restoring a node from zero in case of disk failure will always
 | be bad.
|
 | They could also have another caching layer on top of the
 | cluster to further mitigate the latency issue until the node
 | gets back to health and finishes all the hinted handoffs.
| jhgg wrote:
| We have two ways of re-building a node under this setup.
|
| We can either re-build the node by simply wiping its disks,
| and letting it stream in data from other replicas, or we can
| re-build by simply re-syncing the pd-ssd to the nvme.
|
 | Node failure is a regular occurrence; it isn't a "bad" thing,
 | and it's something we intend to fully automate. A node should
 | be able to fail and recover without anyone noticing.
| mikesun wrote:
 | That's a good observation. We've spent a lot of time on our
 | control plane, which handles the various RAID1 failure modes;
 | e.g., when a RAID1 degrades due to a failed local SSD, we
 | force-stop the node so that it doesn't continue to operate as a
 | slow node. Wait for part 2! :)
___________________________________________________________________
(page generated 2022-08-15 23:00 UTC)