[HN Gopher] IO Devices and Latency
       ___________________________________________________________________
        
       IO Devices and Latency
        
       Author : milar
       Score  : 213 points
       Date   : 2025-03-13 16:46 UTC (6 hours ago)
        
 (HTM) web link (planetscale.com)
 (TXT) w3m dump (planetscale.com)
        
       | cmurf wrote:
       | Plenty of text but also many cool animations. I'm a sucker for
       | visual aids. It's a good balance.
        
       | bddicken wrote:
       | Author of the blog here. I had a great time writing this. By far
       | the most complex article I've ever put together, with literally
       | thousands of lines of js to build out these interactive visuals.
       | I hope everyone enjoys.
        
         | inetknght wrote:
         | I don't see a single visual. I don't use the web with
         | javascript. Why not embed static images instead or in addition?
        
           | bddicken wrote:
            | The visuals add a lot to this article. A big theme throughout
            | is latency, and the visuals help the reader see why tape is
            | slower than an HDD, which is slower than an SSD, etc. Also,
            | it's just plain fun!
           | 
           | I'm curious, what do you do on the internet without js these
           | days?
        
             | inetknght wrote:
              | > _I'm curious, what do you do on the internet without js
              | these days?_
             | 
             | Browse the web, send/receive email, read stories, play
             | games, the usual. I primarily use native apps and
             | selectively choose what sites are permitted to use
             | javascript, instead of letting websites visited on a whim
             | run javascript willy nilly.
        
               | bddicken wrote:
               | I respect it. In my very biased opinion, it's worth
               | enabling for this article.
        
         | zalebz wrote:
          | The level of your effort really shows through. If you had to
          | take a ballpark guess, how much time do you think you put in?
          | And I realize keyboard time vs. kicking-it-around-in-your-head
          | time are quite different things.
        
           | bddicken wrote:
           | Thank you! I started this back in October, but of course have
           | worked on plenty of other things in the meantime. But this
           | was easily 200+ hours of work spread out over that time.
           | 
           | If this helps as context, the git diff for merging this into
           | our website was: +5,820 -1
        
         | dormando wrote:
         | Half on topic: what libs/etc did you use for the animations?
         | Not immediately obvious from the source page.
         | 
         | (it's a topic I'm deeply familiar with so I don't have a
         | comment on the content, it looks great on a skim!) - but I've
         | been sketching animations for my own blog and not liked the
         | last few libs I tried.
         | 
         | Thanks!
        
           | bddicken wrote:
           | I heavily, heavily abused d3.js to build these.
        
             | petedoyle wrote:
             | Small FYI that I couldn't see them in Chrome 133.0.6943.142
             | on MacOS. Firefox works.
        
               | homebrewer wrote:
               | It's the complete opposite for me -- there are no
               | animations in Firefox even with uBlock Origin disabled,
               | but Brave shows them fine.
               | 
               | The browser console spams this link:
               | https://react.dev/errors/418?invariant=418
               | 
               | edit: looks like it's caused by a userstyles extension
               | injecting a dark theme into the page; React doesn't like
               | it and the page silently breaks.
        
               | bddicken wrote:
               | Ohhh interesting! Obviously not ideal, but I guess just
               | an extension issue?
        
               | bddicken wrote:
               | Interesting. Running any chrome extensions that might be
               | messing with things? Alternatively, if you can share any
               | errors you're getting in the console lmk.
        
               | petedoyle wrote:
               | Oh, looks like it. I disabled extensions one by one til I
               | found it was reflect.app's extension. Edit: reported on
               | their discord.
               | 
               | False alarm :) Amazing work!!
        
         | jasonthorsness wrote:
         | The visuals are awesome; the bouncing-box is probably the best
         | illustration of relative latency I've seen.
         | 
         | Your "1 in a million" comment on durability is certainly too
         | pessimistic once you consider the briefness of the downtime
         | before a new server comes in and re-replicates everything,
          | right? I would think if your recovery is 10 minutes, for
          | example, even if each of three servers is _guaranteed_ to fail
          | once in the month, it's already something like 1 in two
          | million? And if it's a 1% chance of failure in the month, the
          | failure of all three overlapping becomes extremely unlikely.
         | 
         | Thought I would note this because one-in-a-million is not great
         | if you have a million customers ;)
        
           | bddicken wrote:
           | > Your "1 in a million" comment on durability is certainly
           | too pessimistic once you consider the briefness of the
           | downtime before a new server comes in and re-replicates
           | everything, right?
           | 
            | Absolutely. Our actual durability is far, far, far higher
            | than this. We believe that nobody should ever worry about
            | losing their data, and that's the peace of mind we provide.
        
             | alfons_foobar wrote:
             | > Instead of relying on a single server to store all data,
             | we can replicate it onto several computers. One common way
             | of doing this is to have one server act as the primary,
             | which will receive all write requests. Then 2 or more
             | additional servers get all the data replicated to them.
             | With the data in three places, the likelihood of losing
             | data becomes very small.
             | 
              | Is my understanding correct that this means you propagate
              | writes asynchronously from the primary to the secondary
              | servers (without waiting for an "ACK" from them for
              | writes)?
        
           | the_arun wrote:
            | Kudos to whoever patiently & passionately built these. On an
            | off-topic note: this is a great perspective for building
            | realistic coursework for middle & high school students. I'm
            | sure they learn faster & better with visuals like these.
        
             | bddicken wrote:
             | It would be incredibly cool if this were used in high
             | school curricula.
        
           | mixermachine wrote:
            | 1 in a million is the probability that all three servers die
            | in one month, without swapping out the broken ones. So at
            | some point in the month all the data is gone.
            | 
            | If you replace the failed (or failing) node right away, the
            | failure probability goes down greatly. You would instead need
            | the probability of a node going down within a 30-minute
            | window, assuming the migration can be done in 30 minutes.
            | 
            | (I hope this calculation is correct)
            | 
            | If the probability is 1% per month, then per 30-minute window
            | it's 1% / (43800/30) = 1% / 1460, or about 1 in 146,000.
            | 
            | For three instances failing in the same window: (1/146,000)^3,
            | or about 1 in 3.1 quadrillion per 30-minute window.
            | 
            | Calculated over the 1460 windows in one month: 1460 *
            | (1/146,000)^3, or about 1 in 2.1 trillion.
            | 
            | So roughly one in two trillion that all three servers go down
            | within the same 30-minute window somewhere in one month.
            | After the 30 minutes another replica will already be
            | available, making the data safe.
            | 
            | I'm happy to be corrected. The probability course was some
            | years back :)
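            | 
            | A quick back-of-the-envelope check of that arithmetic in
            | Python (a sketch assuming independent failures and a fixed
            | 30-minute replacement window):
            | 
            |   # Probability that all three replicas fail within the same 30-minute
            |   # window at some point during a month, assuming independent failures.
            |   p_month = 0.01             # per-server failure probability per month
            |   minutes_per_month = 43800  # average month
            |   window = 30                # minutes to detect a failure and re-replicate
            |   windows = minutes_per_month // window          # 1460
            |   p_window = p_month / windows                   # ~1 in 146,000 per server
            |   p_all_three = p_window ** 3                    # all three in the same window
            |   p_loss_month = 1 - (1 - p_all_three) ** windows
            |   print(f"per-window loss: 1 in {1 / p_all_three:,.0f}")
            |   print(f"monthly loss:    1 in {1 / p_loss_month:,.0f}")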
        
         | alexellisuk wrote:
          | Hi, what actually are the _metal_ instances being used when
          | you're on EC2 with local NVMe attached? Last time I looked,
          | apart from the smallest/slowest Graviton, you have to spend
          | circa 2.3k USD/mo to get a bare-metal instance from AWS -
          | https://blog.alexellis.io/how-to-run-firecracker-without-kvm...
        
           | lizztheblizz wrote:
           | Hi there, PS employee here. In AWS, the instance types
           | backing our Metal class are currently in the following
           | families: r6id, i4i, i3en and i7ie. We're deploying across
           | multiple clouds, and our "Metal" product designation has no
           | direct link to Amazon's bare-metal offerings.
        
         | logsr wrote:
         | Amazing presentation. It really helps to understand the
         | concepts.
         | 
          | My only addition is that it understates the impact of SSD
          | parallelism. 8-channel controllers are typical for high-end
          | devices, and 4K random IOPS continue to scale with queue depth,
          | but for an introduction the example is probably complex enough.
         | 
         | It is great to see PlanetScale moving in this direction and
         | sharing the knowledge.
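          | 
          | (A rough way to see why IOPS scale with queue depth, via
          | Little's law; illustrative numbers only, since real devices
          | flatten out once the channels/dies saturate:)
          | 
          |   # sustained IOPS ~= outstanding IOs / per-IO latency
          |   latency_s = 50e-6                 # assume ~50 us per 4K random read
          |   for qd in (1, 4, 8, 32):
          |       print(f"QD={qd:>2}: ~{qd / latency_s:,.0f} IOPS")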
        
           | bddicken wrote:
           | Thank you for the info! Do you have any good references on
           | this for those who want to learn more?
        
         | AlphaWeaver wrote:
         | Were you at all inspired by the work of Bartosz Ciechanowski?
         | My first thought was that you all might have hired him to do
         | the visuals for this post :)
        
           | bddicken wrote:
           | Bartosz Ciechanowski is incredible at this type of stuff. Sam
           | Rose has some great interactive blogs too. Both have had big
           | hits here on HN.
        
         | tombert wrote:
         | The visualizations are excellent, very fun to look at and play
         | with, and they go along with the article extremely well. You
         | should be proud of this, I really enjoyed it.
        
         | b0rbb wrote:
          | The animations are fantastic, and awesome job with the
          | interactivity. I often find myself having to explain latency to
          | folks in my work, and being able to see the extreme difference
          | in latencies for something like an HDD vs. an SSD makes it much
          | easier for some people to understand.
         | 
         | Edit: And for real, fantastic work, this is awesome.
        
         | hakaneskici wrote:
         | Great work! Thank you for making this.
         | 
          | This is beautiful and brilliant, and it's also a great visual
          | tool for explaining how some of the fundamental algorithms and
          | data structures originate from the physical characteristics of
          | storage media.
         | 
          | I wonder if anyone remembers the old days when you programmed
          | your own custom defrag util to place your boot libs and
          | frequently used apps on the outer tracks of the hard drive, so
          | they loaded faster due to the higher linear velocity of the
          | outermost track :)
        
       | tonyhb wrote:
        | This is really cool, and PlanetScale Metal looks really solid,
        | too. Always a huge sucker for seeing huge latency drops on
        | releases: https://planetscale.com/blog/upgrading-query-
        | insights-to-met....
        
       | vessenes wrote:
       | Great nerdbaiting ad. I read all the way to the bottom of it, and
       | bookmarked it to send to my kids if I feel they are not
       | understanding storage architectures properly. :)
        
         | bddicken wrote:
         | The nerdbaiting will now provide generational benefit!
        
       | bob1029 wrote:
       | I've been advocating for SQLite+NVMe for a while now. For me it
       | is a new kind of pattern you can apply to get much further into
       | trouble than usual. In some cases, you might actually make it out
       | to the other side without needing to scale horizontally.
       | 
       | Latency is king in all performance matters. _Especially_ in those
       | where items must be processed serially. Running SQLite on NVMe
       | provides a latency advantage that no other provider can offer. I
        | don't think running in memory is even a substantial uplift over
       | NVMe persistence for most real world use cases.
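        | 
        | For a rough sense of the numbers, here's a minimal sqlite3
        | sketch that times fully synchronous single-row commits on local
        | disk (results vary wildly by drive and filesystem, so treat it
        | as a probe, not a benchmark):
        | 
        |   import sqlite3, time
        | 
        |   con = sqlite3.connect("probe.db")
        |   con.execute("PRAGMA journal_mode=WAL")
        |   con.execute("PRAGMA synchronous=FULL")   # sync on every commit
        |   con.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v BLOB)")
        |   n = 1000
        |   t0 = time.perf_counter()
        |   for i in range(n):
        |       with con:  # one transaction (and one sync) per insert
        |           con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (i, b"x" * 256))
        |   elapsed = time.perf_counter() - t0
        |   print(f"{n} single-row commits in {elapsed:.2f}s -> {elapsed / n * 1e6:.0f} us/commit")
        |   con.close()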
        
         | jstimpfle wrote:
         | I still measure 1-2ms of latency with an NVMe disk on my
          | desktop computer, doing fsync() on a file on an ext4 filesystem.
         | 
         | Update: about 800us on a more modern system.
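          | 
          | For anyone who wants to reproduce the measurement, a tiny
          | Python sketch that times fsync() after small appends (same
          | caveats: numbers depend heavily on the drive, filesystem, and
          | whether the device has power-loss protection):
          | 
          |   import os, time
          | 
          |   fd = os.open("fsync_probe.dat", os.O_WRONLY | os.O_CREAT, 0o644)
          |   payload = b"x" * 4096
          |   samples = []
          |   for _ in range(200):
          |       os.write(fd, payload)
          |       t0 = time.perf_counter()
          |       os.fsync(fd)
          |       samples.append(time.perf_counter() - t0)
          |   os.close(fd)
          |   samples.sort()
          |   print(f"median fsync: {samples[len(samples) // 2] * 1e6:.0f} us, "
          |         f"p99: {samples[int(len(samples) * 0.99)] * 1e6:.0f} us")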
        
           | rbranson wrote:
            | Not so sure that's true. This is single-threaded direct I/O
            | doing a fio randwrite workload on a WD 850X Gen4 SSD:
            | 
            |   write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4412MiB/60001msec); 0 zone resets
            |     slat (usec): min=2, max=335, avg= 3.42, stdev= 1.65
            |     clat (nsec): min=932, max=24868k, avg=49188.32, stdev=65291.21
            |      lat (usec): min=29, max=24880, avg=52.67, stdev=65.73
            |     clat percentiles (usec):
            |      |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   35],
            |      | 30.00th=[   37], 40.00th=[   38], 50.00th=[   40], 60.00th=[   43],
            |      | 70.00th=[   53], 80.00th=[   60], 90.00th=[   70], 95.00th=[   84],
            |      | 99.00th=[  137], 99.50th=[  174], 99.90th=[  404], 99.95th=[  652],
            |      | 99.99th=[ 2311]
        
             | jstimpfle wrote:
             | I checked again with O_DIRECT and now I stand corrected. I
             | didn't know that O_DIRECT could make such a huge
             | difference. Thanks!
        
               | jstimpfle wrote:
               | Oops, O_DIRECT does not actually make that big of a
               | difference. I had updated my ad-hoc test to use O_DIRECT,
               | but didn't check that write() now returned errors because
               | of wrong alignment ;-)
               | 
                | As mentioned in the sibling comment, syncs are still
                | slow. My initial 1-2ms number came from a desktop I
                | bought in 2018, to which I added an NVMe drive connected
                | to an M.2 slot in 2022. On my current test system I'm
                | seeing avg latencies of around 250us, sometimes a lot
                | more (there are fluctuations).
                | 
                |   # put the following in a file "fio.job" and run "fio fio.job"
                |   # enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
                |   [Job1]
                |   #direct=1
                |   fsync=1
                |   readwrite=randwrite
                |   bs=64k      # size of each write()
                |   size=256m   # total size written
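                | 
                | (For anyone repeating this in code rather than fio: on
                | Linux, O_DIRECT wants block-aligned buffers, offsets,
                | and lengths. A rough sketch of a write that satisfies
                | the alignment rules by using a page-aligned mmap buffer;
                | a plain bytes object may fail with EINVAL:)
                | 
                |   import mmap, os
                | 
                |   BLOCK = 4096
                |   buf = mmap.mmap(-1, BLOCK)     # anonymous, page-aligned buffer
                |   buf.write(b"x" * BLOCK)
                |   fd = os.open("direct_probe.dat",
                |                os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
                |   os.write(fd, buf)              # bypasses the page cache
                |   os.close(fd)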
        
             | wmf wrote:
             | Random writes and fsync aren't the same thing. A single
             | unflushed random write on a consumer SSD is extremely fast
             | because it's not durable.
        
               | rbranson wrote:
                | You're right. Sync writes are ten times as slow. 331us.
                | 
                |   write: IOPS=3007, BW=11.7MiB/s (12.3MB/s)(118MiB/10001msec); 0 zone resets
                |     clat (usec): min=196, max=23274, avg=331.13, stdev=220.25
                |      lat (usec): min=196, max=23275, avg=331.25, stdev=220.27
                |     clat percentiles (usec):
                |      |  1.00th=[  210],  5.00th=[  223], 10.00th=[  235], 20.00th=[  262],
                |      | 30.00th=[  297], 40.00th=[  318], 50.00th=[  330], 60.00th=[  343],
                |      | 70.00th=[  355], 80.00th=[  371], 90.00th=[  400], 95.00th=[  429],
                |      | 99.00th=[  523], 99.50th=[  603], 99.90th=[ 1631], 99.95th=[ 2966],
                |      | 99.99th=[ 8225]
        
           | the8472 wrote:
            | I assume fsyncing a whole file does more work than just
            | ensuring that specific blocks made it to the WAL, which it
            | can achieve with direct IO or maybe sync_file_range.
        
           | dzr0001 wrote:
           | What drive is this and does it need a trim? Not all NVMe
           | devices are created equal, especially in consumer drives. In
           | a previous role I was responsible for qualifying drives. Any
           | datacenter or enterprise class drive that had that sort of
           | latency in direct IO write benchmarks after proper pre-
           | conditioning would have failed our validation.
        
             | jstimpfle wrote:
             | My current one reads SAMSUNG MZVL21T0HCLR-00BH1 and is
             | built into a quite new work laptop. I can't get below
             | around 250us avg.
             | 
              | On my older system I had a WD_BLACK SN850X but had it
              | connected to an M.2 slot, which may be limiting. This is
              | where I measured the 1-2ms latency.
             | 
             | Is there any good place to get numbers of what is possible
             | with enterprise hardware today? I've struggled for some
             | time to find a good source.
        
           | madisp wrote:
            | I'm not an expert, but I think an enterprise NVMe drive will
            | have some sort of power-loss protection, so it can afford to
            | acknowledge fsync from RAM/caches, as those will still be
            | written out on a power loss. Consumer NVMe drives AFAIK lack
            | this, so fsync has to force the data onto the flash.
        
           | dogben wrote:
            | I believe that's power saving in action. A single operation
            | at idle is slow because the drive needs time to wake from
            | idle.
        
         | sergiotapia wrote:
          | I had a lot of fun with Coolify running my app and my database
          | on the same machine. It was pretty cool to see essentially zero
          | network latency on my SQL queries, just the cost of the engine.
        
         | crazygringo wrote:
          | > _I've been advocating for SQLite+NVMe for a while now._
         | 
         | Why SQLite instead of a traditional client-server database like
         | Postgres? Maybe it's a smidge faster on a single host, but
         | you're just making it harder for yourself the moment you have 2
         | webservers instead of 1, and both need to write to the
         | database.
         | 
         | > _Latency is king in all performance matters._
         | 
         | This seems misleading. First of all, your performance doesn't
         | matter if you don't have consistency, which is what you now
         | have to figure out the moment you have multiple webservers. And
          | secondly, database latency is generally minuscule compared to
          | internet round-trip latency, which itself is minuscule compared
          | to the "latency" of waiting for all page assets to load, like
          | images and code libraries.
         | 
         | > _Especially in those where items must be processed serially._
         | 
          | You should be avoiding serial database queries as much as
          | possible in the first place. You should be using joins whenever
          | possible instead of separate queries, and when that's not
          | possible you should issue the queries asynchronously, all at
          | once, so they execute in parallel.
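          | 
          | (A minimal asyncio sketch of that last point; run_query is a
          | stand-in for whatever async driver you actually use, e.g.
          | asyncpg or aiomysql:)
          | 
          |   import asyncio
          | 
          |   async def run_query(sql: str) -> str:
          |       await asyncio.sleep(0.05)   # pretend this is a 50 ms round trip
          |       return f"result of {sql!r}"
          | 
          |   async def main() -> None:
          |       user, orders = await asyncio.gather(
          |           run_query("SELECT * FROM users WHERE id = 1"),
          |           run_query("SELECT * FROM orders WHERE user_id = 1"),
          |       )   # both finish in ~50 ms instead of ~100 ms
          |       print(user, orders)
          | 
          |   asyncio.run(main())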
        
           | conradev wrote:
           | Until you hit the single-writer limitation in SQLite, you do
           | _not_ need to spend more CPU cycles on Postgres
        
           | bob1029 wrote:
           | The _entire point_ is to avoid the network hop.
           | 
           | Application <-> SQLite <-> NVMe
           | 
           | has orders of magnitude less latency than
           | 
           | Application <-> Postgres Client <-> Network <-> Postgres
           | Server <-> NVMe
           | 
           | > You should be avoiding serial database queries as much as
           | possible in the first place.
           | 
           | I don't get to decide this. The business does.
        
             | sedatk wrote:
             | "...has orders of magnitude less latency than..."
             | 
             | [citation needed]. Local network access shouldn't be much
             | different than local IPC.
        
             | crazygringo wrote:
             | Perhaps I wasn't clear enough in my comment. When I said
             | "database latency is generally miniscule compared to
             | internet round-trip latency", I meant between the _user_
             | and the website. Because they 're often thousands of miles
             | away, there are network buffers, etc.
             | 
              | But no, a _local_ network hop doesn't introduce "orders of
             | magnitude" more latency. The article itself describes how
             | it is only 5x slower within a datacenter _for the roundtrip
             | part_ -- not 100x or 1,000x as you are claiming. But even
             | that is generally significantly less than the time it takes
             | the database to actually execute the query -- so maybe you
              | see a 1% or 5% speedup of your query. It's just not a
             | major factor, since queries are generally so fast anyways.
             | 
             | The kind of database latency that you seem to be trying to
             | optimize for is a classic example of premature
             | optimization. In the context of a web application, you're
              | shaving _microseconds_ for a page load time that is
             | probably measured in _hundreds of milliseconds_ for the
             | user.
             | 
              | > _I don't get to decide this. The business does._
             | 
             | You have enough power to design the entire database
             | architecture, but you can't write and execute queries more
             | efficiently, following best practices?
        
         | cynicalsecurity wrote:
          | SQLite doesn't work super well with parallel writes. It
          | supports them, yes, but in a bit of a clunky way, and it can
          | still fail. To avoid problems with parallel writing, besides
          | setting a specific (clunky) mode of operation, you can use the
          | trick of funneling all writes in the app through a single
          | thread, which usually makes the already complicated parallel
          | code slightly more complicated.
         | 
         | If only one thread of writing is required, then SQLite works
         | absolutely great.
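          | 
          | (The single-writer trick in a minimal Python sketch: one
          | thread owns the connection and drains a queue, every other
          | thread just enqueues statements. Table and SQL are made up for
          | illustration.)
          | 
          |   import queue, sqlite3, threading
          | 
          |   writes = queue.Queue()
          | 
          |   def writer(path="app.db"):
          |       con = sqlite3.connect(path)
          |       con.execute("PRAGMA journal_mode=WAL")
          |       con.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, msg TEXT)")
          |       while True:
          |           item = writes.get()
          |           if item is None:          # shutdown sentinel
          |               break
          |           sql, params = item
          |           with con:                 # commit per item
          |               con.execute(sql, params)
          |       con.close()
          | 
          |   t = threading.Thread(target=writer)
          |   t.start()
          |   # any thread can now enqueue writes without hitting SQLITE_BUSY
          |   writes.put(("INSERT INTO events (msg) VALUES (?)", ("hello",)))
          |   writes.put(None)
          |   t.join()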
        
           | bob1029 wrote:
           | > If only one thread of writing is required, then SQLite
           | works absolutely great.
           | 
           | The whole point of getting your commands down to microsecond
           | execution time is so that you can get away with just one
           | thread of writing.
           | 
           | Entire financial exchanges operate on this premise.
        
         | dangoodmanUT wrote:
          | The SQLite file format is laid out to hedge against HDD
          | fragmentation. It won't benefit from NVMe as much as it would
          | if it were changed to a more modern, SSD-native layout first.
        
       | jhgg wrote:
        | Metal looks super cool; however, at my last job when we tried
        | using instance-local SSDs on GCP, there were serious reliability
        | issues (e.g. blocks on the device losing data). Has this
        | situation changed? What machine types are you using?
       | 
       | Our workaround was this: https://discord.com/blog/how-discord-
       | supercharges-network-di...
        
         | rcrowley wrote:
         | Neat workaround! We only started working with GCP Local SSDs in
         | 2024 and can report we haven't experienced read or write
         | failures due to bad sectors in any of our testing.
         | 
         | That said, we're running a redundant system in which MySQL
         | semi-sync replication ensures every write is durable to two
         | machines, each in a different availability zone, before that
         | write's acknowledged to the client. And our Kubernetes operator
         | plus Vitess' vtorc process are working together to aggressively
         | detect and replace failed or even suspicious replicas.
         | 
         | In GCP we find the best results on n2d-highmem machines. In
         | AWS, though, we run on pretty much all the latest-generation
         | types with instance storage.
        
       | bloopernova wrote:
       | Fantastic article, well explained and beautiful diagrams. Thank
       | you bddicken for writing this!
        
         | bddicken wrote:
         | You are welcome!
        
           | TechDebtDevin wrote:
           | Probably the best diagrams I've ever seen in a blog post.
        
       | magicmicah85 wrote:
        | Can I just say that this was so informative that I completely
        | forgot it was promoting a product? Excellent visuals and
        | interactivity.
        
       | gozzoo wrote:
        | Can someone share their experience in creating such diagrams?
        | What libraries and tools can be useful for such interactive
        | diagrams?
        
         | Joel_Mckay wrote:
         | Do you mean something for data visualization, or tricks
         | condensing large data sets with cursors?
         | 
         | https://d3js.org/
         | 
         | Best of luck =3
        
         | bddicken wrote:
         | For this particular one I used d3.js, but honestly this isn't
         | really the type of thing it's designed for. I've also used GSAP
         | for this type of thing on this article I wrote about database
         | sharding.
         | 
         | https://planetscale.com/blog/database-sharding
        
       | aftbit wrote:
       | Hrm "unlimited IOPS"? I suppose contrasted against the abysmal
       | IOPS available to Cloud block devs. A good modern NVMe enterprise
       | drive is specced for (order of magnitude) 10^6 to 10^7 IOPS. If
       | you can saturate that from database code, then you've got some
       | interesting problems, but it's definitely not unlimited.
        
         | bddicken wrote:
         | Technically any drive has a finite IOPS capacity. We have found
         | that no matter how hard we tried, we could not get MySQL to
         | exhaust the max IOPS of the underlying hardware. You hit CPU
         | limits long before hitting IOPS limits. Thus "infinite IOPS."
        
       | ucarion wrote:
       | Really, really great article. The visualization of random writes
       | is very nicely done.
       | 
       | On:
       | 
       | > Another issue with network-attached storage in the cloud comes
       | in the form of limiting IOPS. Many cloud providers that use this
       | model, including AWS and Google Cloud, limit the amount of IO
       | operations you can send over the wire. [...]
       | 
       | > If instead you have your storage attached directly to your
       | compute instance, there are no artificial limits placed on IO
       | operations. You can read and write as fast as the hardware will
       | allow for.
       | 
       | I feel like this might be a dumb series of questions, but:
       | 
       | 1. The ratelimit on "IOPS" is precisely a ratelimit on a
       | particular kind of network traffic, right? Namely traffic to/from
       | an EBS volume? "IOPS" really means "EBS volume network traffic"?
       | 
       | 2. Does this save me money? And if yes, is from some weird AWS
       | arbitrage? Or is it more because of an efficiency win from doing
       | less EBS networking?
       | 
        | I pretty clearly see putting storage and compute on the same
        | machine as strictly a latency win, because you structurally have
        | one less hop every time. But is it also a throughput-per-dollar
        | win?
        
         | the8472 wrote:
          | For network-attached storage, an IOPS limit caps IO operations
          | per second rather than bandwidth, since IO operations can
          | happen at different sizes (e.g. 4K vs. 16K blocks).
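          | 
          | (A tiny illustration of why the IO size matters so much under
          | a fixed IOPS cap; the cap number here is made up:)
          | 
          |   iops_limit = 16_000
          |   for block_bytes in (4096, 16384, 65536):
          |       mb_per_s = iops_limit * block_bytes / 1e6
          |       print(f"{block_bytes // 1024:>2} KiB IOs: ~{mb_per_s:,.0f} MB/s")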
        
           | rbranson wrote:
           | More specific details for EC2 instances can be seen in the
           | docs here: https://docs.aws.amazon.com/ec2/latest/instancetyp
           | es/gp.html...
        
         | rbranson wrote:
         | > 1. The ratelimit on "IOPS" is precisely a ratelimit on a
         | particular kind of network traffic, right? Namely traffic
         | to/from an EBS volume? "IOPS" really means "EBS volume network
         | traffic"?
         | 
          | The EBS volume itself has a provisioned capacity of IOPS and
          | throughput, and the EC2 instance it's attached to will have its
          | own limits as well across all the EBS volumes attached to it. I
          | would characterize it more as a different model. An EBS volume
          | isn't just a slice of a physical PCB attached to a PCIe bus;
          | it's a share in a large distributed system spanning a large
          | number of physical drives, with its own dedicated network
          | capacity to/from compute, like a SAN.
         | 
         | > 2. Does this save me money? And if yes, is from some weird
         | AWS arbitrage? Or is it more because of an efficiency win from
         | doing less EBS networking?
         | 
         | It might. It's a set of trade-offs.
        
           | ucarion wrote:
           | That makes sense. The weirdness of
           | https://docs.aws.amazon.com/ebs/latest/userguide/ebs-io-
           | char... makes more sense now. Reminds me of DynamoDB capacity
           | units.
        
       | gz09 wrote:
        | Nice blog. There is also the problem that cloud storage is
        | generally "just unusually slow" (this has been noted by others
        | before, but here is a nice summary of the problem
       | http://databasearchitects.blogspot.com/2024/02/ssds-have-bec...)
       | 
       | Having recently added support for storing our incremental indexes
       | in https://github.com/feldera/feldera on S3/object storage (we
       | had NVMe for longer due to obvious performance advantages
       | mentioned in the previous article), we'd be happy for someone to
       | disrupt this space with a better offering ;).
        
         | bddicken wrote:
         | That database architects blog is a great read.
        
       | __turbobrew__ wrote:
        | I think there are a couple of things about distributed storage
        | which are not appreciated in this article:
       | 
        | 1. Some systems do not support replication out of the box. Sure,
        | your Cassandra cluster and MySQL can do master-slave replication,
        | but lots of systems cannot.
       | 
        | 2. Your life becomes much harder with NVMe storage in the cloud,
        | as you need to respect maintenance intervals and cloud-initiated
        | drains. If you do not hook into those systems and drain your data
        | to a different node, the data goes poof. Separating storage from
        | compute allows the cloud operator to drain and move compute
        | around as needed, and since the data is independent from the
        | compute -- and the cloud operator manages that data system and
        | its draining as well -- the operator can manage workload
        | placement without the customer needing to be involved.
        
         | maayank wrote:
         | what do you mean by drains?
        
           | rcrowley wrote:
            | AWS, for one example, provides a feed of upcoming "events" in
            | EC2 in which certain instances will need to be rebooted or
            | terminated entirely due to whatever maintenance they're doing
            | on the physical infrastructure.
           | 
           | If you miss a termination event you miss your chance to copy
           | that data elsewhere. Of course, if you're _always_ copying
           | the data elsewhere, you can rest easy.
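            | 
            | (Rough sketch of checking that feed from inside the instance
            | via the IMDSv2 metadata service; the path is the one I recall
            | from the EC2 docs, so double-check it before relying on it:)
            | 
            |   import urllib.request
            | 
            |   IMDS = "http://169.254.169.254"
            | 
            |   def imds_token(ttl: int = 21600) -> str:
            |       req = urllib.request.Request(
            |           f"{IMDS}/latest/api/token", method="PUT",
            |           headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl)},
            |       )
            |       return urllib.request.urlopen(req, timeout=2).read().decode()
            | 
            |   req = urllib.request.Request(
            |       f"{IMDS}/latest/meta-data/events/maintenance/scheduled",
            |       headers={"X-aws-ec2-metadata-token": imds_token()},
            |   )
            |   # JSON array of upcoming reboot/retirement events, if any
            |   print(urllib.request.urlopen(req, timeout=2).read().decode())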
        
         | rcrowley wrote:
         | Good points. PlanetScale's durability and reliability are built
         | on replication - MySQL replication - and all the operational
         | software we've written to maintain replication in the face of
         | servers coming and going, network partitions, and all the rest
         | of the weather one faces in the cloud.
         | 
          | Replicated network-attached storage that presents a "local"
          | filesystem API is a powerful way to create durability in a
          | system that doesn't build it in the way we have.
        
         | wmf wrote:
         | I assume DRBD still exists although it's certainly easier to
         | use EBS.
        
       | r3tr0 wrote:
       | We are working on a platform that lets you measure this stuff
       | with pretty high precision in real time.
       | 
       | You can check out our sandbox here:
       | 
       | https://yeet.cx/play
        
       | robotguy wrote:
        | Seeing the disk IO animation reminded me of Melvin Kaye[0]:
        | 
        |   Mel never wrote time-delay loops, either, even when the balky
        |   Flexowriter required a delay between output characters to work
        |   right. He just located instructions on the drum so each
        |   successive one was just past the read head when it was needed;
        |   the drum had to execute another complete revolution to find
        |   the next instruction.
       | 
       | [0]
       | https://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/The...
        
         | Thoreandan wrote:
         | I was reminded of Mel as well! If you haven't seen it, Usagi
         | Electric on YouTube has gotten a drum-memory system from the
         | 1950s nearly fully-functional again.
        
       | samwho wrote:
       | Gosh, this is beautiful. Fantastic work, Ben. <3
        
       | rsanheim wrote:
       | That great infographic at the top illustrates one big reason why
       | 'dev instances in the cloud' is a bad idea.
        
       | cynicalsecurity wrote:
       | That was a cool advertisement, I must give them that.
        
       | pjdesno wrote:
        | I love the visuals, and if it's OK with you I will probably link
        | to them from my class material on block devices in a week or so.
       | 
        | One small nit:
        | 
        | > A typical random read can be performed in 1-3 milliseconds.
       | 
       | Um, no. A 7200 RPM platter completes a rotation in 8.33
       | milliseconds, so rotational delay for a random read is uniformly
       | distributed between 0 and 8.33ms, i.e. mean 4.16ms.
       | 
        | > a single disk will often have well over 100,000 tracks
       | 
       | By my calculations a Seagate IronWolf 18TB has about 615K tracks
       | per surface given that it has 9 platters and 18 surfaces, and an
       | outer diameter read speed of about 260MB/s. (or 557K tracks/inch
       | given typical inner and outer track diameters)
       | 
       | For more than you ever wanted to know about hard drive
       | performance and the mechanical/geometrical considerations that go
       | into it, see https://www.msstconference.org/MSST-
       | history/2024/Papers/msst...
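        | 
        | Putting rough numbers on both nits (a back-of-the-envelope
        | sketch; the inner/outer diameter ratio is an assumption):
        | 
        |   rpm = 7200
        |   rotation_ms = 60_000 / rpm                  # ~8.33 ms per revolution
        |   print(f"mean rotational delay: {rotation_ms / 2:.2f} ms")  # ~4.17 ms
        | 
        |   # tracks per surface for an 18 TB, 18-surface drive, assuming
        |   # ~260 MB/s at the outer edge and an inner diameter ~half the outer
        |   capacity_per_surface = 18e12 / 18                     # bytes
        |   outer_track_bytes = 260e6 * (rotation_ms / 1000)      # ~2.2 MB
        |   avg_track_bytes = outer_track_bytes * (1 + 0.5) / 2
        |   print(f"tracks per surface: ~{capacity_per_surface / avg_track_bytes:,.0f}")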
        
         | bddicken wrote:
         | Whoah, thanks for sharing the paper.
        
       | jgalt212 wrote:
        | Disk latency, and one's aversion to it, is IMHO the only way
        | Hetzner costs can run up on you. You want to keep the database on
        | local disk, and not on their very slow attached Volumes (Hetzner
        | EBS). In short, you can have relatively light workloads that end
        | up on somewhat expensive VMs because you need 500GB, or more, of
        | local disk. 1TB of local disk is the biggest VM they offer in the
        | US, at 300 EUR a month.
        
       | dangoodmanUT wrote:
        | What local NVMe is getting 20us? Nitro?
        
       ___________________________________________________________________
       (page generated 2025-03-13 23:00 UTC)