[HN Gopher] NVMe is not a hard disk
       ___________________________________________________________________
        
       NVMe is not a hard disk
        
       Author : omnibrain
       Score  : 195 points
       Date   : 2021-06-11 14:45 UTC (8 hours ago)
        
 (HTM) web link (blog.koehntopp.info)
 (TXT) w3m dump (blog.koehntopp.info)
        
       | KaiserPro wrote:
       | NVME over X seems like fun. After a decade of shying away from
       | fibre channel, we've finally got to the point where we've
       | forgotten how awful iscsi was to look after. (and how expensive
       | FC used to be, and how disappointing SAS switches were.)
        
         | jabl wrote:
         | Don't worry. By the time the worst bugs are ironed out, and we
         | have management tools worth crap, there will be a new shiny.
         | 
         | That being said, nvme seems nice with no major footguns, so I
         | guess NVME over X is in principle at least a good idea.
        
         | isotopp wrote:
         | "I don't know what features the network of the future will
         | have, but it will be called Ethernet."
         | 
          | Ethernet is absorbing all the features of other networking
          | technology, and adding them to its own perfection.
        
       | villgax wrote:
        | Wish it were as swappable as RAM sticks; it's quite a pain to
        | get in there with a screwdriver after removing the GPU during
        | OS installs on 2 drives without bootloader crap.
        
         | Godel_unicode wrote:
          | Icy Dock makes an entire range of hot-swap docks for M.2
          | NVMe drives.
        
         | anarazel wrote:
         | E.g. U.2 drives can utilize NVMe but in a form factor easier to
         | swap (including hot plugging support).
         | 
         | Edit: typo
        
         | snuxoll wrote:
          | NVMe is just a protocol for accessing flash memory over
          | PCIe; do not confuse it with the M.2 form factor (which also
         | supports SATA and USB!). My Optane P900 I use as a ZFS log
         | device is NVMe and plugs into a standard PCIe slot on my
         | PowerEdge R520, and servers frequently use U.2 form factor
         | drives.
        
           | anarazel wrote:
           | Not necessarily flash - the p900 you list isn't flash for
           | one. Nor is it necessarily over PCIe...
        
           | walrus01 wrote:
          | I think where people commonly confuse it is because _most
          | often_ when they see an M.2 device, it is the variant of the
          | slot that exposes PCIe pins to the NVMe SSD, such as for a
          | $100 M.2 2280 SSD from Samsung, Intel, WD, etc. As you
          | mention, there's lots of other things which can be
          | electrically connected to a motherboard's I/O buses in an
          | M.2 slot.
        
       | Syzygies wrote:
       | _" Flash shredders exist, too, but in order to be compliant the
       | actual chips in their cases need to be broken. So what they
       | produce is usually much finer grained, a "sand" of plastics and
       | silicon."_
       | 
       | For metal, chip credit cards I can't shred, I put them on a brick
       | in the back yard, and torch them with a weed burner. There's a
       | psychological bonus here, if for example Amex made me change
       | cards through no choice of my own because they were hacked.
       | 
       | This wouldn't be "compliant" for a flash drive, but it would be
       | effective.
        
       | DeliriumTrigger wrote:
       | This blog post is the perfect example for "I think I have a deep
       | understanding but I really don't."
        
         | walrus01 wrote:
         | > "Customers of the data track have stateless applications,
         | because they have outsourced all their state management to the
         | various products and services of the data track."
         | 
         | I have no idea what this is even supposed to mean. It's like
         | somebody combined some buzzwords thought up by a fresh business
         | school marketing graduate working in the 'cloud' industry with
         | an attempt at actual x86-64 hardware systems engineering.
         | 
         | The whole premise of the first half of the article seems to be
         | 'you don't need to design a lot of redundancy and fault
          | tolerance'; the second part then goes into a weird explanation
         | of NVME targets on CentOS. I hope this person isn't actually
         | responsible for building storage systems at the bare metal
         | level supporting some production business application.
        
           | cowmoo728 wrote:
           | I think the article is saying that a web server (customers)
           | should be stateless, because everything important should be
           | in a database (data track) on another host. And that database
           | probably has application level handling for duplicating
           | writes to another disk or another host.
           | 
            | The conclusion seems to be that hardware-level data
            | redundancy isn't important because existing database
            | software already handles duplication in application code. I
            | don't understand how that conclusion was reached. Hardware
            | level redundancy like raid1 seems useful because it
           | simplifies handling a common failure case when a single HDD
           | or NVME fails on a database server. Hardware redundancy is
           | just the first stage in a series of steps to handle drive
           | failure. I do agree that a typical stateless server doesn't
           | need raid1, but afaik it's not standard practice for a
           | stateless web application to bother with raid1 anyway.
        
             | isotopp wrote:
             | > I think the article is saying that a web server
             | (customers) should be stateless, because everything
             | important should be in a database (data track) on another
             | host. And that database probably has application level
             | handling for duplicating writes to another disk or another
             | host.
             | 
             | Correct.
             | 
             | > Hardware level redundancy like raid1 seems useful because
             | it simplifies handling a common failure case when a single
             | HDD or NVME fails on a database server
             | 
              | Nobody needs that if your database replicates. Cassandra
              | replicates data. MySQL in a replication setup replicates
              | data. And so on. Individual nodes in such a setup are as
              | expendable as individual disks in a RAID. More so,
              | because you get not only protection against a disk
              | failure but, depending on deployment strategy, also
              | against loss of a node, a rack, or a rack row. Or even
              | loss of a DC or an AZ.
        
           | wmf wrote:
           | It's basic 12-factor aka cloud native thinking: "Any data
           | that needs to persist must be stored in a stateful backing
           | service, typically a database."
        
       | notaplumber wrote:
       | > Because flash does not overwrite anything, ever.
       | 
       | This is repeated multiple times in the article, and I refuse to
       | believe it is true. If NVME/SSDs never overwrote anything, they
       | would quickly run out of available blocks, especially on OSs that
       | don't support TRIM.
        
         | isotopp wrote:
          | Flash has a flash translation layer (FTL). It translates
          | logical block addresses (LBAs) into physical addresses
          | ("PHY").
          | 
          | Flash can write at a granularity similar to a memory page
          | (around 4-16 KB). It can erase only at a much larger
          | granularity, an erase block of a few hundred pages.
          | 
          | The FTL will try to find free pages to write your data to.
          | In the background, it will also try to move data around to
          | generate unused erase blocks and then erase them.
          | 
          | In flash, seeks are essentially free. That means it no
          | longer matters whether blocks are adjacent. Also, because of
          | the FTL, adjacent LBAs are not necessarily adjacent at the
          | physical layer. And even if you do not rewrite a block, the
          | garbage collection may move data around at the PHY layer in
          | order to generate completely empty erase blocks.
          | 
          | The net effect is that positioning as seen from the OS no
          | longer matters, and the OS has zero control over adjacency
          | and erases at the PHY layer. Rewriting, defragging, or other
          | OS-level operations cannot control what happens physically
          | at the flash layer.
          | 
          | TRIM is a "blatant layering violation" in the Linus sense:
          | it tells the disk "hardware" what the OS thinks it no longer
          | needs. TRIM'ed blocks can be given up and do not need to be
          | copied when the garbage collector frees up an erase block.
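          | 
          | For the curious, a toy model of that indirection (nothing
          | like real controller firmware; garbage collection, wear
          | leveling and bounds handling are left out entirely):
          | 
          |     /* Toy FTL: a logical write always lands on a fresh
          |      * physical page; the old page is only marked stale
          |      * for a later erase. */
          |     #include <stdint.h>
          |     #include <string.h>
          | 
          |     #define PAGES     1024   /* physical pages */
          |     #define PAGE_SIZE 4096
          | 
          |     static int32_t l2p[PAGES];   /* LBA -> phys page */
          |     static uint8_t stale[PAGES]; /* awaiting erase   */
          |     static int32_t next_free;
          |     static uint8_t flash[PAGES][PAGE_SIZE];
          | 
          |     void ftl_init(void)
          |     {
          |         memset(l2p, 0xff, sizeof l2p);  /* all -1 */
          |     }
          | 
          |     void ftl_write(int32_t lba, const void *buf)
          |     {
          |         if (l2p[lba] >= 0)
          |             stale[l2p[lba]] = 1;  /* now garbage */
          |         memcpy(flash[next_free], buf, PAGE_SIZE);
          |         l2p[lba] = next_free++;
          |     }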
        
           | anarazel wrote:
            | > In flash, seeks are essentially free. That means it no
            | longer matters whether blocks are adjacent.
            | 
            | > The net effect is that positioning as seen from the OS
            | no longer matters, and the OS has zero control over
            | adjacency and erases at the PHY layer. Rewriting,
            | defragging, or other OS-level operations cannot control
            | what happens physically at the flash layer.
           | 
           | I don't agree with this. The "OS visible position" is
           | relevant, because it influences what can realistically be
           | written together (multiple larger IOs targeting consecutive
           | LBAs in close time proximity). And writing data in larger
           | chunks is very important for good performance, particularly
           | in sustained write workloads. And sequential IO (in contrast
           | to small random IOs) _does_ influence how the FTL will lay
           | out the data to some degree.
        
           | notaplumber wrote:
            | Thanks for this part; I feel like this was a crucial piece
            | of information I was missing. It also explains my
            | observations about TRIM not being as important as people
            | claim: the firmware on modern flash storage seems more than
            | capable of handling this without OS intervention.
        
             | isotopp wrote:
             | The GC in the device cleans up.
             | 
              | TRIM is useful; it gives the GC important information.
             | 
              | TRIM is not that important as long as the device is not
              | full (less than about 80%, generally speaking, though it
              | is easy to produce pathological cases that are way off
              | in either direction). Once the device fills up beyond
              | that, it becomes crucial.
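              | 
              | For what it's worth, the discard hint itself is just an
              | ioctl on the block device from userspace, roughly what
              | the blkdiscard utility does. A minimal sketch (the
              | device path is only an example, and this destroys data
              | in the range):
              | 
              |     /* Discard a byte range on a block device.
              |      * DESTROYS data - scratch devices only. */
              |     #include <fcntl.h>
              |     #include <stdint.h>
              |     #include <stdio.h>
              |     #include <sys/ioctl.h>
              |     #include <linux/fs.h>   /* BLKDISCARD */
              | 
              |     int main(void)
              |     {
              |         /* example device node, adjust to taste */
              |         int fd = open("/dev/nvme0n1", O_WRONLY);
              |         if (fd < 0) { perror("open"); return 1; }
              | 
              |         /* { offset, length } in bytes: 1 GiB */
              |         uint64_t range[2] = { 0, 1ULL << 30 };
              |         if (ioctl(fd, BLKDISCARD, &range) < 0)
              |             perror("BLKDISCARD");
              |         return 0;
              |     }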
        
         | cduzz wrote:
         | There's nuance to this; the deletes / overwrites are
         | accomplished by bulk wiping entire blocks.
         | 
         | Rather than change the paint color in a hallway you have to
         | tear down the house and build a new house in the vacant lot
         | next door that's a duplicate of the original, but with the new
         | hallway paint.
         | 
         | To optimize, you keep a bucket of houses to destroy, and a
         | bucket of vacant lots, and whenever a neighborhood has lots of
         | "to be flattened houses" the remaining active houses are copied
         | to a vacant lot and the whole neighborhood is flattened.
         | 
         | So, things get deleted, but not in the way people are used to
         | if they imagine a piece of paper and a pencil and eraser.
        
           | zdw wrote:
            | inspired by that last sentence, the analogy could be
            | rewritten as:
            | 
            |     - lines on page
            |     - pages of paper
            |     - whole notebooks
            | 
            | and might be easier for people to grok than the earlier
            | houses/paint analogy.
        
             | jazzyjackson wrote:
             | I don't know, I like the drama of copying a neighborhood
             | and tearing down the old one xD
        
               | jnwatson wrote:
               | Reminds me of https://xkcd.com/1737/.
               | 
               | "When a datacenter catches fire, we just rope it off and
               | rebuild one town over."
        
               | jdironman wrote:
               | Speaking of xkcd, 2021 is return of "All your bases" See
               | alt-text on image.
               | 
               | https://xkcd.com/286/
        
           | slaymaker1907 wrote:
           | Just to add to the explanation, SSDs are able to do this
           | because they have a layer of indirection akin to virtual
            | memory. This means that what your OS thinks is byte 800000
            | of the SSD may change its actual physical location on the
            | SSD over time, even in the absence of writes or reads to
            | said location.
           | 
           | This is a very important property of SSDs and is a large
           | reason why log structured storage is so popular in recent
           | times. The SSD is very fast at appends, but changing data is
           | much slower.
        
             | alisonkisk wrote:
              | Append-only garbage-collected storage was used in data
              | centers even when hard disks were (and are) popular,
              | because it's more reliable and scalable.
        
             | hawski wrote:
             | Are there write-once SSDs? They would have a tremendous
             | capacity. Probably good for long term backups or archiving.
             | Also possibly with a log structured filesystem only.
        
               | weinzierl wrote:
               | I find this a super interesting question. I always
               | assumed that long term stability of electronic non-
               | volatile memory is worse than that of magnetic memory.
               | When I think about it, I can't think of any compelling
               | reason why that should be the case. Trapped electrons vs
               | magnetic regions; I have no intuition which one of them
               | is likely to be more stable.
               | 
               | There is a question on stackoverflow about this topic
               | with many answers but no definitive conclusion. There
               | seem to be some papers touching the subject but at a
               | glance I couldn't find anything useful in them.
               | 
               | [1] https://superuser.com/questions/4307/what-lasts-
               | longer-data-...
        
               | olejorgenb wrote:
               | According to https://www.ni.com/en-
               | no/support/documentation/supplemental/... (Seems kinda
               | reputable at least)
               | 
               | "The level of charge in each cell must be kept within
               | certain thresholds to maintain data integrity.
               | Unfortunately, charge leaks from flash cells over time,
               | and if too much charge is lost then the data stored will
               | also be lost.
               | 
               | During normal operation, the flash drive firmware
               | routinely refreshes the cells to restore lost charge.
               | However, when the flash is not powered the state of
               | charge will naturally degrade with time. The rate of
               | charge loss, and sensitivity of the flash to that loss,
               | is impacted by the flash structure, amount of flash wear
               | (number of P/E cycles performed on the cell), and the
               | storage temperature. Flash Cell Endurance specifications
               | usually assume a minimum data retention duration of 12
               | months at the end of drive life."
        
               | dataflow wrote:
               | > When I think about it, I can't think of any compelling
               | reason why that should be the case. Trapped electrons vs
               | magnetic regions; I have no intuition which one of them
               | is likely to be more stable.
               | 
               | My layman intuition (which could be totally wrong) is
               | that trapped electrons have a natural tendency to escape
               | due to pure thermal jitter. Whereas magnetic materials
                | tend to stick together, so there's at least that.
                | Don't know how much of this matches the actual
                | electron physics/technology though...
        
               | sharikone wrote:
                | Hmm, I don't think this is conclusive. Thermal jitter
                | makes magnetic boundaries change too, and of course
                | magnetic media are also more susceptible to magnetic
                | interference.
                | 
                | I don't have intuition either, but I don't think this
                | explanation is sufficient.
        
               | madacol wrote:
               | > Trapped electrons vs magnetic regions;
               | 
                | From the physics point of view, aren't both cases the
                | same thing?
                | 
                | Aren't magnetic regions a state of the electric field?
                | So if I move electrons in and out, the electric field
                | should be changing as well.
        
               | mananaysiempre wrote:
                | No. A region of a piece of material is magnetized in a
                | certain direction when its (ionized) atoms are mostly
                | oriented in that direction; the presence of a constant
                | magnetic field is (roughly speaking) only a consequence
                | of that.
               | 
               | So flash memory is about the electrons, while magnetic
               | memory is about the ions.
        
               | pas wrote:
                | Isn't permanent magnetism a direct result of oriented
                | spins? (So due to quantum effects?)
        
               | wtallis wrote:
               | I don't think anyone would make literally write-once
               | drives with flash memory; that's more optical disk
               | territory. But zoned SSDs and host-managed SMR hard
               | drives make explicit the distinction between writes and
               | larger-scale erase operations, while still allowing
               | random-access reads.
        
               | dmitrygr wrote:
               | Modern multi-bit-per-cell flash has quite terrible data
               | retention. It is especially low if it is stored in a warm
               | place. You'd be lucky to see ten years without an
               | occasional re-read + error-correct + re-write operation
               | going on
        
               | eqvinox wrote:
               | Making them write-once doesn't increase the capacity;
               | that's mostly limited by how many analog levels you can
               | distinguish on the stored charge, and how many cells you
                | can fit. The management overhead and spare capacity to
                | make SSDs rewritable is, to my knowledge, in the single
                | digit percentages.
               | 
               | (Also you need the translation layer even for write-once
               | since flash generally doesn't come 100% defect free. Not
               | sure if manufacturers could try to get it there, but
               | that'd probably drive the cost up massively. And the
               | translation layer is there for rewritable flash anyway...
               | the cost/benefit tradeoff is in favor of just living with
               | a few bugged cells.)
        
               | jonny_eh wrote:
               | > Making them write-once doesn't increase the capacity
               | 
               | It could theoretically make them cheaper. But I guess
               | that there wouldn't be enough demand, so you'd be better
               | off having some kind of OS enforced limitation on it.
        
               | entangledqubit wrote:
                | I suspect that hawski was assuming that a WORM SSD
                | would be based on a different, non-flash storage
                | medium. I don't know of any write-once media with
                | similar read/write access times to an SSD.
               | 
               | FWIW, there are WORM microsd cards available but it looks
               | like they still use flash under the hood.
        
               | hawski wrote:
               | I don't know enough specifics, so I didn't assume
               | anything :) In fact I was not aware of non-flash SSDs.
               | 
                | Because of the Internet age there probably is not much
                | of a place for write-once media anyway, even if it were
                | somewhat cheaper. But maybe for specialized applications
                | or if it were much, much cheaper per GB.
        
               | wongarsu wrote:
               | The only write once media I'm aware of that is in
               | significant use are WORM tapes. They don't offer
               | significant advantages over regular tapes, but for
               | compliance reasons it can be useful to just make it
               | impossible to modify the backups.
        
               | juloo wrote:
               | That would be magnetic tapes.
        
               | phonon wrote:
               | Like https://en.wikipedia.org/wiki/ROM_cartridge ?
               | 
               | I think Nintendo uses https://www.mxic.com.tw/en-
               | us/products/ROM/Pages/default.asp...
               | 
               | https://www.mxic.com.tw/CachePages/en-us-Product-ROM-
               | default...
        
             | hinkley wrote:
             | > The SSD is very fast at appends, but changing data is
             | much slower.
             | 
             | No, it's worse than that. The fact that it's an overly
             | subtle distinction is the problem.
             | 
             | SSDs are fast while write traffic is light. From an
             | operational standpoint, the drive is _lying to you_ about
             | its performance. Unless you are routinely stress testing
             | your system to failure, you may have a very inaccurate
             | picture of how your system performs under load, meaning you
             | have done your capacity planning incorrectly, and you will
             | be caught out with a production issue.
             | 
             | Ultimately it's the same sentiment as people who don't like
             | the worst-case VACUUM behavior of Postgres - best-effort
             | algorithms in your system of record make some people very
             | cranky. They'd rather have higher latency with a smaller
             | error range, because at least they can see the problem.
        
           | wand3r wrote:
            | I think the explanation is probably sound (I am not that
            | familiar with it), but the analogy gets a bit lost when you
            | talk about buckets of houses and buckets of vacant lots.
           | 
           | Maybe there is a better analogy or paradigm to view this
           | through.
        
           | daniellarusso wrote:
           | Spoiler alert - This is the plot to 'The Prestige'.
        
         | aidenn0 wrote:
         | Perhaps they mean it must erase an entire block before writing
         | any data, unlike a disk that can write a single sector at a
         | time?
        
           | dragontamer wrote:
           | The issue is that DDR4 is like that too. Not only the 64 byte
           | cache line, but DDR4 requires a transfer to the sense
           | amplifiers (aka a RAS, row access strobe) before you can read
           | or write.
           | 
            | The RAS command destructively reads out the entire row,
            | around 1024 bytes or so. This is because the DDR4 cells
            | only have enough charge for one reliable read; after that
            | the capacitors don't hold enough electrons to tell whether
            | a 0 or a 1 was stored.
           | 
           | A row close command returns the data from the sense amps back
           | to the capacitors. Refresh commands renew the 0 or 1 as the
           | capacitor can only hold the data for a few milliseconds.
           | 
           | ------
           | 
            | The CAS latency statistic assumes that the row was already
            | open. It's a measure of reading from the sense amplifiers,
            | not from the actual DRAM cells.
        
             | zymhan wrote:
             | What does DDR have to do with NVMe?
        
               | nine_k wrote:
               | You can't write a byte, or a word, either.
               | 
               | The "fact" that you can do it in your program without
               | disturbing bytes around it is a convenient fiction that
               | the hardware fabricates for you.
        
               | dragontamer wrote:
               | DDR4 is effectively a block device and not 'random
               | access'.
               | 
               | Pretty much only cache is RAM proper these days (aka: all
               | locations have equal access time... that is, you can
               | access it randomly with little performance loss).
        
               | mirker wrote:
               | I'm confused. What's the difference between a cache line
               | and a row in RAM? They're both multiples of bytes. You
               | have data sharing per chunk in either case.
               | 
               | The distinction seems to be how big the chunk is not
               | uniformity of access time (is a symmetrical read disk not
               | a block device?)
        
               | dragontamer wrote:
                | Hard disk chunks are classically 512 bytes, smaller
                | than the DDR4 row of 1024 bytes!
                | 
                | So yes, DDR4 has surprising similarities to a 512-byte
                | sector hard drive (modern hard drives have 4k blocks).
               | 
               | >> What's the difference between a cache line and a row
               | in RAM?
               | 
               | Well DDR4 doesn't have a cache line. It has a burst
               | length of 8, so the smallest data transfer is 64 bytes.
               | This happens to coincide with L1 cache lines.
               | 
                | The row is 1024 bytes long. It's pretty much the L1
                | cache on the other side, so to speak. When your CPU
                | talks to DDR4, it needs to load a row (RAS, all 1024
                | bytes) before it can CAS-read a 64-byte burst-length-8
                | chunk.
               | 
               | -----------
               | 
               | DDR4, hard drives, and Flash are all block devices.
               | 
                | The main issue for Flash technologies is that the erase
                | size is even larger than the read/write block size.
                | That's why we TRIM NVMe devices.
        
             | Dylan16807 wrote:
             | The difference is that on DDR you have infinite write
             | endurance and you can do the whole thing in parallel.
             | 
             | If flash was the same way, and it could rewrite an entire
             | erase block with no consequences, then you could ignore
             | erase blocks. But it's nowhere near that level, so the
             | performance impact is very large.
        
               | dragontamer wrote:
               | That's a good point.
               | 
               | There are only 10,000 erase cycles per Flash cell. So a
               | lot of algorithms are about minimizing those erases.
        
             | eqvinox wrote:
             | It's vaguely similar, but there's a huge difference in that
             | flash needs to be erased before you can write it again, and
             | that operation is much slower and only possible on much
              | larger sizes. DDR4 doesn't care: you can always write;
              | it's just that the read is destructive and needs to be
              | followed by a write-back.
             | 
             | I think this makes the comparison unhelpful since the
             | characteristics are still very different.
        
         | eqvinox wrote:
         | It's true and untrue depending on how you look at it. Flash
         | memory only supports changing/"writing" bits in one direction,
         | generally from 1 to 0. Erase, as a separate operation, clears
         | entire sectors back to 1, but is more costly than a write.
         | (Erase block size depends on the technology but we're talking
         | MB on modern flash AFAIK, stuff from 2010 already had 128kB.)
         | 
         | So, the drives do indeed never "overwrite" data - they mark the
         | block as unused (either when the OS uses TRIM, or when it
         | writes new data [for which it picks an empty block elsewhere]),
         | and put it in a queue to be erased whenever there's time (and
         | energy and heat budget) to do so.
         | 
         | Understanding this is also quite important because it can have
         | performance implications, particularly on consumer/low-end
         | devices. Those don't have a whole lot of spare space to work
         | with, so if the entire device is "in use", write performance
         | can take a serious hit when it becomes limited by erase speed.
         | 
         | [Add.: reference for block sizes:
         | https://www.micron.com/support/~/media/74C3F8B1250D4935898DB...
         | - note the PDF creation date on that is 2002(!) and it compares
         | 16kB against 128kB size.]
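          | 
          | To make the 1 -> 0 asymmetry concrete, a toy model (just the
          | bit-level rule, not any real chip's command set):
          | 
          |     /* Programming can only clear bits (1 -> 0); only an
          |      * erase, at erase-block granularity, sets them back
          |      * to 1. Toy model of a single 128 KiB erase block. */
          |     #include <stddef.h>
          |     #include <stdint.h>
          |     #include <string.h>
          | 
          |     #define ERASE_BLOCK (128 * 1024)
          |     static uint8_t blk[ERASE_BLOCK];
          | 
          |     void flash_erase(void)
          |     {
          |         memset(blk, 0xFF, sizeof blk);   /* all 1s */
          |     }
          | 
          |     void flash_program(size_t off, const uint8_t *d,
          |                        size_t len)
          |     {
          |         for (size_t i = 0; i < len; i++)
          |             blk[off + i] &= d[i];   /* AND: 1->0 only */
          |     }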
        
           | IshKebab wrote:
           | By any reasonable definition they do overwrite data. It's
           | just that they can't overwrite less than a block of data.
        
           | matheusmoreira wrote:
           | > Understanding this is also quite important because it can
           | have performance implications
           | 
           | Security implications too. The storage device cannot be
           | trusted to securely delete data.
        
             | effie wrote:
              | If you write the whole drive's capacity of random data,
              | you should be fine.
        
               | robocat wrote:
                | No. Say a particular model of SSD has 10% over-
                | provisioning; then even after writing the "whole"
                | capacity of the drive, you can still be left with up to
                | 10% of the data recoverable from the flash chips.
        
           | tzs wrote:
            | If a logical overwrite only involved bits going from 1 to
            | 0, are drives smart enough to recognize this and do it as
            | an actual overwrite instead of a copy and erase?
        
             | eqvinox wrote:
             | On embedded devices, yes, this is actually used in file
             | systems like JFFS2. But in these cases the flash chip is
             | just dumb storage and the translation layer is implemented
             | on the main CPU in software. So there's no "drive" really.
             | 
             | On NVMe/PC type applications with a controller driving the
             | flash chips... I have absolutely no idea. I'm curious too,
             | if anyone knows :)
        
             | jasonwatkinspdx wrote:
             | Generally no, because the unit of write is a page.
        
         | Dzugaru wrote:
          | Of course it does [0]. It's just that it assigns writes as
          | evenly as possible (to even out wear), so a log-like internal
          | "file system" is the way to go.
         | 
         | https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
        
         | throwaway09223 wrote:
         | The author clearly explains how this works in the sentence
         | immediately following. "Instead it has internally a thing
         | called flash translation layer (FTL)" ...
        
           | notaplumber wrote:
            | I unfortunately skimmed over this; isotopp's explanation
            | helped clear things up in my head.
        
             | throwaway09223 wrote:
             | I just saw his post, it's a great explanation.
             | 
             | It might also help to keep in mind that both regular disk
             | drives and solid state drives remap bad sectors. Both types
             | of disks maintain an unaddressable storage area which is
             | used to transparently cover for faulty sectors.
             | 
             | In a hard drive, faulty sectors are mapped during
             | production and stored in the p-list, and are remapped to
             | sectors in this extra hidden area. Sectors that fail at
             | runtime are recorded in the g-list and are likewise
             | remapped.
             | 
              | Writes may _usually_ go to the same place in a hard
              | drive, but it's not guaranteed there either.
        
       | rbanffy wrote:
       | One thing I don't get in the Mach.2 design is why not use the
       | approach IBM used in mainframe hard drives of having multiple
       | heads over the same platter. The Mach.2 is two drives sharing
        | parts that will behave like a 2-drive RAID array. Having
       | multiple heads on the same platter would allow it to read like a
       | RAID-1 while writing like a RAID-0.
        
       | myrandomcomment wrote:
       | Arista was pushing leaf-spine in 2008....
        
       | bob1029 wrote:
       | My take on NVMe has evolved a lot after getting a chance to
       | really play around with the capabilities of consumer-level
       | devices.
       | 
       | The biggest thing I have realized (as have others), is that
       | traditional IO wait strategies don't make sense for NVMe. Even
       | "newer" strategies like async/await do not give you want you
       | truly want for one of these devices (still too slow). The best
       | performance I have been able to extract from these is when I am
       | doing really stupid busy-wait strategies.
       | 
       | Also, single-writer principle and serializing+batching writes
       | _before_ you send them to disk is critical. With any storage
       | medium where a block write costs you device lifetime, you want to
       | put as much effective data per block as possible, and you also
       | want to avoid editing existing blocks. Append-only log structures
        | are what NVMe/flash devices live for.
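        | 
        | A bare-bones sketch of that single-writer batching idea (names
        | are made up; error handling, flush timers and fsync policy are
        | all left out):
        | 
        |     /* Callers enqueue records; everything funnels through
        |      * one buffer so each device write carries a batch. */
        |     #include <fcntl.h>
        |     #include <pthread.h>
        |     #include <string.h>
        |     #include <unistd.h>
        | 
        |     #define BATCH_MAX (1 << 20)  /* flush ~1 MiB chunks */
        | 
        |     static char   batch[BATCH_MAX];
        |     static size_t batch_len;
        |     static int    log_fd;
        |     static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
        | 
        |     int log_open(const char *path)
        |     {
        |         log_fd = open(path,
        |                       O_WRONLY | O_CREAT | O_APPEND, 0644);
        |         return log_fd;
        |     }
        | 
        |     /* assumes len < BATCH_MAX */
        |     void log_append(const void *rec, size_t len)
        |     {
        |         pthread_mutex_lock(&mu);
        |         if (batch_len + len > BATCH_MAX) {
        |             /* one big sequential append, then reset */
        |             write(log_fd, batch, batch_len);
        |             batch_len = 0;
        |         }
        |         memcpy(batch + batch_len, rec, len);
        |         batch_len += len;
        |         pthread_mutex_unlock(&mu);
        |     }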
       | 
       | With all of this in mind, the motivations for building gigantic
       | database clusters should start going away. One NVMe device per
       | application instance is starting to sound a lot more compelling
       | to me. Build clusters with app logic (not database logic).
       | 
       | In testing of these ideas I have been able to push 2 million
       | small writes (~64k) per second to a single Samsung 960 pro for a
       | single business entity. I don't know of any SQL clusters that can
       | achieve these same figures.
        
         | isotopp wrote:
          | With NVMe you can get around 800,000 IOPS from a single
          | device, but the latency gives you around 20,000 IOPS when
          | issuing requests sequentially. You
         | need to talk with deep queues or with multiple concurrent
         | threads to the device in order to eat the entire IOPS buffet.
         | 
         | Traditional OLTP workloads do not tend to have the concurrency
         | to actually saturate the NVME. You would need to be 40-way
         | parallel, but most OLTP workloads give you 4-way.
         | 
         | Multiple instances per device are almost a must.
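          | 
          | A throwaway way to see that effect (not a real benchmark,
          | fio is the proper tool; device path, span and thread count
          | are placeholders): run this with THREADS 32 vs. THREADS 1
          | and compare.
          | 
          |     /* Compile with -pthread. Random 4 KiB O_DIRECT reads
          |      * over the first 1 GiB of the device, many threads. */
          |     #define _GNU_SOURCE
          |     #include <fcntl.h>
          |     #include <pthread.h>
          |     #include <stdlib.h>
          |     #include <unistd.h>
          | 
          |     #define THREADS 32
          |     #define IO_SZ   4096
          |     #define SPAN    (1ULL << 30)
          | 
          |     static int fd;
          | 
          |     static void *reader(void *arg)
          |     {
          |         (void)arg;
          |         void *buf;
          |         posix_memalign(&buf, IO_SZ, IO_SZ); /* O_DIRECT */
          |         for (int i = 0; i < 100000; i++) {
          |             off_t o = (random() % (SPAN / IO_SZ)) * IO_SZ;
          |             pread(fd, buf, IO_SZ, o);
          |         }
          |         return NULL;
          |     }
          | 
          |     int main(void)
          |     {
          |         fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
          |         pthread_t t[THREADS];
          |         for (int i = 0; i < THREADS; i++)
          |             pthread_create(&t[i], NULL, reader, NULL);
          |         for (int i = 0; i < THREADS; i++)
          |             pthread_join(t[i], NULL);
          |         return 0;
          |     }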
        
           | anarazel wrote:
           | With a lot of NVMe devices, up to medium priced server gear,
           | the bottleneck in OLTP workloads isn't normal write latency,
           | but slow write cache flushes. On devices with write caches
           | one either needs to fdatasync() the journal on commit (which
           | typically issues a whole device cache flush) or use O_DIRECT
           | | O_DSYNC (ending up as a FUA write which just tags the
           | individual write as needing to be durable) for journal
           | writes. Often that drastically increases latency _and_ slows
           | down concurrent non-durable IO, reducing the benefit of
           | deeply queued IO substantially.
           | 
            | On top-line gear this isn't an issue: such devices don't
            | signal a write cache (by virtue of either having a
            | non-volatile cache or enough of a power reserve to flush
            | the cache), which then prevents the OS from doing anything
            | more expensive for fdatasync()/O_DSYNC. One can also
            | manually ignore the need for cache flushes by changing
            | /sys/block/nvme*/queue/write_cache to "write through",
            | but that obviously loses guarantees - it can still be
            | useful to test on lower end devices.
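            | 
            | Spelled out, the two durability paths look roughly like
            | this (file names are placeholders; a real engine would
            | pair (b) with O_DIRECT and aligned buffers):
            | 
            |     #include <fcntl.h>
            |     #include <string.h>
            |     #include <unistd.h>
            | 
            |     int main(void)
            |     {
            |         const char *rec = "commit-record\n";
            |         int flags = O_WRONLY | O_CREAT | O_APPEND;
            | 
            |         /* (a) buffered write + fdatasync: often a
            |          *     full device cache flush */
            |         int a = open("wal.a", flags, 0644);
            |         write(a, rec, strlen(rec));
            |         fdatasync(a);
            |         close(a);
            | 
            |         /* (b) O_DSYNC: the write only returns once
            |          *     durable, typically issued as a FUA
            |          *     write on NVMe */
            |         int b = open("wal.b", flags | O_DSYNC, 0644);
            |         write(b, rec, strlen(rec));
            |         close(b);
            |         return 0;
            |     }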
        
             | anarazel wrote:
             | One consequence of that is that:
             | 
             | > Multiple instances per device are almost a must.
             | 
             | Isn't actually unproblematic in OLTP, because it increases
             | the number of journal writes that need to be flushed. With
             | a single instance group commit can amortize the write cache
             | flush costs much more efficiently than with many concurrent
             | instances all separately doing much smaller group commits.
        
           | bob1029 wrote:
           | > You need to talk with deep queues or with multiple
           | concurrent threads to the device in order to eat the entire
           | IOPS buffet.
           | 
           | Completely agree. There is another angle you can play if you
           | are willing to get your hands dirty at the lowest levels.
           | 
           | If you build a custom database engine that fundamentally
           | stores everything as key-value, and then builds relational
           | abstractions on top, you can leverage a lot more benefit on a
           | per-transaction basis. For instance, if you are storing a KVP
           | per column in a table and the table has 10 columns, you may
           | wind up generating 10-20 KVP items per logical row
           | insert/update/delete. And if you are careful, you can make
           | sure this extra data structure expressiveness does not cause
           | write amplification (single writer serializes _and batches_
           | all transactions).
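            | 
            | Purely for illustration, a made-up key layout (not any
            | particular engine's format) where a row becomes one KVP
            | per column, all funneled through the single writer:
            | 
            |     #include <stdio.h>
            | 
            |     int main(void)
            |     {
            |         const char *table = "users";
            |         long row = 42;
            |         const char *col[] = { "name", "mail", "plan" };
            |         const char *val[] = { "Ada", "a@b.example",
            |                               "pro" };
            | 
            |         char key[128];
            |         for (int i = 0; i < 3; i++) {
            |             snprintf(key, sizeof key, "%s/%ld/%s",
            |                      table, row, col[i]);
            |             /* stand-in for kv_put(key, val[i]) */
            |             printf("put %-16s -> %s\n", key, val[i]);
            |         }
            |         return 0;
            |     }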
        
             | isotopp wrote:
              | You may want to play with a TiDB setup from PingCAP.
        
             | jaytaylor wrote:
            | > If you build a custom database engine that fundamentally
            | > stores everything as key-value, and then builds
            | > relational abstractions on top
             | 
             | Sounds like this could be FoundationDB, among other
             | contenders like TiDB.
             | 
             | https://foundationdb.org
        
         | 90minuteAPI wrote:
         | I guess it still makes sense for higher abstraction levels
         | though, right? Like a filesystem or other shared access to a
         | storage resource. So these asynchronous APIs aren't writing as
         | directly to storage, they're placing something in the queue and
         | notifying when that batch is committed.
         | 
         | > Append-only log structures are what NVMe/flash devices live
         | for.
         | 
         | I would think this is also good for filesystems like ZFS, APFS,
         | and BTRFS, yes? I had an inkling but never really looked into
         | it. Aren't these filesystems somewhat similar to append-only
         | logs of changes, which serialize operations as a single writer?
        
       | jeffinhat wrote:
       | I've been messing with NVMe over TCP at home lately and it's
        | pretty awesome. You can scoop up last-gen 10GbE/40GbE
        | networking gear on eBay for cheap and build your own fast
       | disaggregated storage on upstream Linux. The kernel based
       | implementation saves you some context switching over other
       | network file systems, and you can (probably) pay to play for on-
       | NIC implementations (especially as they're getting smarter).
       | 
       | It seems like these solutions don't have a strong
       | authentication/encryption-in-transit story. Are vendors building
       | this into proprietary products or is this only being used on
       | trusted networks? I think it'd be solid technology to leverage
       | for container storage.
        
         | stingraycharles wrote:
          | I just use iSCSI at home, but over Mellanox's RoCE, which
          | performs pretty well.
         | 
         | One thing I'm noticing is that most of these storage protocols
         | do, in fact, assume converged Ethernet; that is, zero packet
         | loss and proper flow control.
         | 
         | Is this also the case with NVMe over TCP?
        
         | effie wrote:
         | > build your own fast disaggregated storage on upstream Linux
         | 
         | Why is this better than just connecting the drive via PCIe bus
         | directly to the CPU?
        
       | innot wrote:
       | > RAID or storage replication in distributed storage <..> is not
       | only useless, but actively undesirable
       | 
       | I guess I'm different from most people, good news! When building
       | my new "home server" half a year ago I made a raid-1 (based on
       | ZFS) with 4 NVMEs. I rarely appear at that city, so I brought the
       | fifth one and put it into an empty slot. Well, one of the 4 nvmes
       | lasted for 3 months and stopped responding. One "zpool replace"
       | and I'm back to normal, without any downtime, disassembly, even
       | reboots. I think that's quite useful. When I'm there the next
       | time I'll replace the dead one, of course.
        
         | twotwotwo wrote:
         | > > RAID or storage replication in distributed storage <..> is
         | not only useless, but actively undesirable
         | 
         | > I guess I'm different from most people, good news!
         | 
         | The earlier part of the sentence helps explain the difference:
         | "That is, because most database-like applications do their
         | redundancy themselves, at the application level..."
         | 
         | Running one box I'd want RAID on it for sure. Work already runs
         | a DB cluster because the app needs to stay up when an entire
         | box goes away. Once you have 3+ hot copies of the data and a
         | failover setup, RAID within each box on top of that can be
         | extravagant. (If you _do_ want greater reliability, it might be
         | through more replicas, etc. instead of RAID.)
         | 
         | There is a bit of universalization in how the blog post phrases
         | it. As applied to databases, though, I get where they're coming
         | from.
        
         | goodpoint wrote:
         | Compare your solution with having 4 SBCs, with 1 NVME each, at
         | different locations. The network client would handle
         | replication and checksumming.
         | 
         | The total cost might be similar but you have increased
         | reliability over SBC/controller/uplink failure.
         | 
         | Of course there are tradeoffs on performance and ease of
         | management...
        
           | Godel_unicode wrote:
           | You think that building 4 systems in 4 locations is likely to
           | have a similar cost to one system at one location? For small
           | systems, the fixed costs are a significant portion of the
           | overall system cost.
           | 
           | This is doubly true for physical or self-hosted systems.
        
         | zxexz wrote:
        | What setup do you use to put 4 NVMe drives in one box? I know
        | it's possible, I've just heard of so many different setups. I
        | know there are some PCIe cards that allow for 4 NVMe drives,
        | but you have to match that with a motherboard/CPU combo with
        | enough PCIe lanes to not lose bandwidth.
        
           | birdyrooster wrote:
           | That's exactly what they are doing. Anyone else is using
           | proprietary controllers and ports for a server chassis
        
           | pdpi wrote:
           | I've been looking into building some small/cheap storage, and
           | this is one of the enclosures I've been looking at.
           | 
           | https://www.owcdigital.com/products/express-4m2
        
           | isotopp wrote:
           | For distributed storage, we use this:
           | https://www.slideshare.net/Storage-Forum/operation-
           | unthinkab...
           | 
           | We then install SDS software, Cloudian for S3, Quobyte for
           | File, and we used to use Datera for iSCSI. Lightbits maybe in
           | the future, I don't know.
           | 
           | These boxen get purchased with 4 NVME devices, but can grow
           | to 24 NVME devices. Currently 11 TB Microns, going for 16 or
           | more in the future.
           | 
           | For local storage, multiple NVME hardly ever make sense.
        
         | birdyrooster wrote:
         | Does zpool not automatically promote the hot spare like mdadm?
        
           | seized wrote:
           | It can, if you set a disk as the hot spare for that pool.
           | 
           | But a disk can only be a hot spare in one pool, so to have a
           | "global" hot spare it has to be done manually. That may be
           | what that poster was doing.
        
             | sjagoe wrote:
             | Also, if I understand it correctly, there are a few other
             | caveats with hot spares: It will only activate when another
             | drive completely fails, so you can't decide to replace a
             | drive when it's close to failure (probably not an issue in
             | this case, though, with the unresponsive drive). Second,
             | with the hot spare activated, the pool is still degraded,
             | and the original drive still needs to be replaced; then the
             | hot spare is removed from the vdev, and goes back to being
             | a hot spare.
             | 
              | It's for these reasons that I've decided to just keep a
              | couple of cold spares ready that I can swap into my
              | system as
             | needed, although I do have access to the NAS at any time.
             | If I was remote like GP, I might decide to use a hot spare.
        
         | SkyMarshal wrote:
         | I've recently converted all my home workstation and NAS hard
         | drives over to OpenZFS, and it's amazing. Anyone who says RAID
         | is useless or undesirable just hasn't used ZFS yet.
        
           | amarshall wrote:
           | The article's author only said RAID was useless in a specific
           | scenario, not generally, and the post you're replying to
           | omitted this crucial context.
        
         | isotopp wrote:
         | My environment is not a home environment.
         | 
         | It looks like this:
         | https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html
        
         | toast0 wrote:
         | This article is speaking of large scale multinode distributed
         | systems. Hundreds of rack sized systems. In those systems, you
         | often don't need explicit disk redundancy, because you have
         | data redundancy across nodes with independent disks.
         | 
         | This is a good insight, but you need to be sure the disks are
         | independent.
        
           | merb wrote:
            | Well, most often HBAs and RAID controllers are another
            | thing that increases latency and makes maintenance costs go
            | up quite a bit (more stuff to update), and they're also
            | another part that can break.
            | 
            | That's why they're not recommended when running Ceph.
        
             | Aea wrote:
             | I'm pretty sure discrete HBAs / Hardware RAID Controllers
             | have effectively gone the way of the dodo. Software RAID
             | (or ZFS) is the common, faster, cheaper, more reliable way
             | of doing things.
        
               | toast0 wrote:
               | Hardware RAID doesn't seem to be going away quickly.
               | Since they're almost all made by the same company, and
               | they can usually be flashed to be dumb HBAs, it's not too
               | bad, but it was pretty painful when using managed hosting
               | and the menu options with lots of disks all have the raid
               | controllers that are a pain to setup; and I'm not going
               | to reflash their hardware (although I did end up doing
               | some SSD firmware updates myself because firmware bugs
               | were causing issues and their firmware upgrade scripts
               | weren't working well and were tremendously slow)
        
               | amarshall wrote:
                | Don't lump HBAs and RAID controllers together. The former
               | is just PCIe to SATA or SCSI or whatever (otherwise it is
               | not just an HBA, but indeed a RAID controller). Such a
               | thing is still useful and perhaps necessary for software
               | RAID if there are insufficient ports on the motherboard.
        
               | karmakaze wrote:
                | Hardware caching RAID controllers do have the advantage
                | that if power is lost, the cache can still be written
                | out without the CPU/software having to do it. This lets
                | you safely run without write-through cache fsyncs. This
                | was a common spec for the provisioned bare-metal MySQL
                | servers I'd worked with.
        
               | amarshall wrote:
               | Sometimes. Other times they may make things worse by
               | lying to the filesystem (and thereby also the
               | application) about writes being completed, which may
               | confound higher-level consistency models.
        
               | wtallis wrote:
               | It does seem to me that it's much easier to reason about
               | the overall system's resiliency when the capacitor-
               | protected caches are in the drives themselves (standard
               | for server SSDs) and nothing between that and the OS lies
               | about data consistency. And for solid state storage, you
               | probably don't need those extra layers of caching to get
               | good performance.
        
               | Godel_unicode wrote:
               | The entire comment thread of this article is on-prem, low
               | scale admins and high-scale cloud admins talking past
               | each other.
               | 
               | You can build in redundancy at the component level, at
               | the physical computer level, at the rack level, at the
               | datacenter level, at the region level. Having all of them
               | is almost certainly redundant and unnecessary at best.
        
               | seized wrote:
               | ZFS needs HBAs. Those get your disks connected but
               | otherwise get out of the way of ZFS.
               | 
               | But yes, hardware RAID controllers and ZFS don't go
               | together.
        
         | amarshall wrote:
         | You omitted the context from the rest of the sentence:
         | 
         | > most database-like applications do their redundancy
         | themselves, at the application level ...
         | 
         | If that's not the case for your storage (doesn't sound like
         | it), then the author's point doesn't apply to your case anyway.
         | In which case, yes, RAID may be useful.
        
         | linsomniac wrote:
         | We are currently converting our SSD-based Ganeti clusters from
         | LVM on RAID to ZFS, to prepare for our NVMe future, without
         | RAID cards (1). Was hoping to get the second box in our dev
         | ganeti cluster reinstalled this morning to do further testing,
         | but the first box has been working great!
         | 
          | 1: LSI has an NVMe RAID controller for U.2 chassis,
          | preparing for a non-RAID future, just in case.
        
       | dcminter wrote:
       | "Most people need their data a lot less than they think they do"
       | - great way to put it and a thought provoking article.
        
         | rattray wrote:
         | Yep, reminds me of a tech blog I read in the last year or so
         | (can't remember where) that talked about using 6 replicas for
         | each DB shard (across 3 AZ's IIRC)... they just used EC2's on-
         | disk NVMe storage (which is "ephemeral") because it's faster
         | and hey, if the machine dies, you have replicas!
         | 
         | This post's point that even with that setup, it's still nice to
         | have volume-based storage for quicker image replacements is
          | interesting. I'm not experienced enough with cloud setups to
         | know if that makes sense (eg how long does it take to upgrade
         | an EC2 instance that has data on disk? upgrade your OS? upgrade
         | your pg version? does ephemeral storage vs volume storage
         | affect these? I imagine not...)
        
         | behringer wrote:
         | I can't imagine wasting NVME storage on backup data. That's
         | what I have my 8tb spinning hdds for.
        
           | Koshkin wrote:
           | Isn't tape more cost-effective and reliable? (Also, your
           | backups do not need to be spinning all the time, if that's
           | what they are doing.)
        
             | walrus01 wrote:
             | The problem with really high capacity tape these days is
             | there's almost literally 1 vendor for drives and tapes,
             | which is even worse than the 3 companies worldwide that
              | manufacture hard drives. If you want a lot of terabytes
              | of not-so-fast storage, I bet I could clone a
             | backblaze storage pod with some modifications and achieve a
             | better $/TB ratio than an enterprise priced tape solution.
        
               | adrian_b wrote:
               | There are at least 2 manufacturers for tapes (Fujifilm
               | and Sony), but their tapes are also sold under many other
               | brands (e.g. IBM, Quantum, HP).
               | 
               | The price per TB is currently about $7 or less (per real
               | TB of LTO-8, not per marketing compressed TB), so there
               | is no chance to approach the cost and reliability of
               | tapes using HDDs.
               | 
               | The problem with tapes is that currently the tape drives
               | are very expensive, because their market is small, so you
               | need to store a lot of data before the difference in
               | price between HDDs and tapes exceeds the initial expense
               | for the tape drive.
               | 
               | Nevertheless, if you want to store data for many years,
               | investing in a tape drive may be worthwhile just for the
               | peace of mind, because when storing HDDs for years you
                | can never be sure how they will work when operated again.
        
             | whatshisface wrote:
             | Right now disks are cheaper than tapes, even if you don't
             | count the very expensive tape readers.
        
               | eqvinox wrote:
               | This is incorrect. I recently acquired an LTO-6 tape
               | library; the tapes are <20EUR for 2.5TB true capacity
               | (marketing = "6TB compressed".) That's <8EUR/TB. Disk
               | drives start at 20EUR/TB for the cheapest garbage.
               | 
               | Sources:
               | 
               | https://geizhals.eu/?cat=zip&xf=1204_LTO-6 (tapes)
               | 
               | https://geizhals.eu/?cat=hde7s&sort=r (disks)
               | 
               | (For completeness, the tape library ran me 500EUR on eBay
               | used, but they normally run about twice that. It's a
               | 16-slot library, which coincidentally matches the
               | breakeven - filled for 40TB it's 820EUR, the same in
               | disks would've been 800EUR, though I would never buy the
               | cheapest crappy disks.)
               | 
               | FWIW, a major reason for going for tapes for me was that
               | at some point my backup HDDs always ended up used for
               | "real". The tape library frees me from trying to
               | discipline myself to properly separate backup HDDs ;)
        
               | adrian_b wrote:
               | For newer tape formats the prices are similar or even a
               | little lower.
               | 
               | For example, at Amazon in Europe an LTO-7 cartridge is
               | EUR 48 for 6 real TB (EUR 8 per real TB), while an LTO-8
               | cartridge is EUR 84 for 12 real TB (EUR 7 per real TB).
               | 
               | At Amazon USA I see now $48 for LTO-7 and $96 for LTO-8,
               | i.e. $8 per TB if you buy 1 tape. For larger quantities,
               | there are discounts.
               | 
               | Older formats like LTO-6 have the advantage that you may
               | find much cheaper tape drives, but you must handle more
               | cartridges for a given data size.
               | 
               | Currently the cheapest HDDs are external 5400 rpm USB
               | drives, which are about 3 times slower than the tapes,
               | yet even those cost several times more per TB than the
               | tapes (e.g. $27 ... $30 per TB).
        
               | Dylan16807 wrote:
               | I don't disagree much overall but in my experience cheap
               | or on-sale drives can beat $15 per TB.
        
             | jabroni_salad wrote:
             | Tape is great for long term storage of archival records,
             | but nobody wants to restore a live virtual machine from 7
             | years ago.
             | 
             | Depending on the business system, you might want your
             | server backed up anywhere from multiple times a day to
             | just weekly, and if you are doing multiple times a day
             | you want it to be decently fast, or to have some extra
             | capacity, so as not to slow down normal operations while
             | the backup runs.
        
           | [deleted]
        
         | santoshalper wrote:
         | Especially in a world where so much data now "lives" in the
         | cloud. Between my Dropbox, GitHub, Google Photos, etc., very
         | little of my data lives only on a local drive. The stuff that
         | does lives on a Synology NAS and is mirrored to S3 Glacier
         | weekly.
        
           | wazoox wrote:
           | But Dropbox, GitHub, Google Photos, etc. rely on massive
           | piles of hard drives.
        
             | santoshalper wrote:
             | Sure, but the comment was about personal storage. Fault
             | tolerance at the edge is less important in a cloud world.
        
           | sreeramb93 wrote:
           | NAS mirrored to S3 Glacier. How much does it cost?
        
             | xyzzy_plugh wrote:
             | In my experience this is very cheap. I take it the parent
             | is not retrieving from Glacier often/ever, which is where
             | the significant costs go. It's a decent balance for
             | disaster recovery.
             | 
             | I sync my photos to S3 (a mix of raw and JPEG, plus
             | RawTherapee sidecar files) across a few devices, so
             | Glacier is prohibitively expensive in this regard, but I
             | still pay <$100 a year for more stuff than I could ever
             | store locally.
        
               | kstrauser wrote:
               | I did the math, and for me Glacier is great for backups
               | where homeowners insurance is likely to be involved in
               | the restoral. It was ferociously expensive for anything
               | less drastic.
        
               | laurentb wrote:
               | What would that process look like in practice, should
               | you need to call the insurance guys? As in, would you
               | claim the cost of retrieving the data, or...? (This
               | question is general, regardless of the actual country.)
        
               | kstrauser wrote:
               | I honestly don't know. I've never had to use it.
        
               | DougWebb wrote:
               | I'm trying to figure out the costs. To back up my NAS at
               | full capacity, I need 10TB of storage. Using S3 Glacier
               | Deep Archive, that seems to cost $10/month per full
               | backup image I keep. That's not bad.
               | 
               | What's confusing is that the calculator has "Restore
               | Requests" as the # of requests, "Data Retrievals" as
               | TB/month, but there's also a "Data Transfer" section for
               | the S3 calculator. If I add 1 restore request for 10TB of
               | data (eg: restoring my full backup to the NAS), that adds
               | about $26 for that month. Totally reasonable.
               | 
               | However, if "Data Transfer" is relevant, and I can't tell
               | if it is or isn't, uploading my backup data is free but
               | retrieving 10TB would cost $922! Is that right?
               | 
               | This is what has always deterred me from using AWS. It's
               | so unclear what services and fees will apply to any given
               | use case, and it seems like there's no way to know until
               | Amazon decides that you've incurred them. At $10/month
               | for storage and $26 if I need to restore, I can just set
               | this up and I don't need to plan for disaster recovery
               | expenses. But if it's going to cost me $922 to get my
               | data back, I've got to figure out how to make sure my
               | insurance is going to cover that. This isn't a no-brainer
               | anymore. Also, what assurance do I have that the cost
               | isn't going to be higher when I need the data, or that
               | there won't be other fees tacked on that I've missed?
               | 
               | [1] https://calculator.aws/#/createCalculator/S3
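               | 
               | For what it's worth, a rough sketch of the math in
               | Python (the unit prices are assumptions in the
               | ballpark of public list pricing: Deep Archive
               | ~$0.00099/GB-month, bulk retrieval ~$0.0025/GB,
               | internet egress ~$0.09/GB - check the current pricing
               | pages):
               | 
               |   gb = 10 * 1000          # 10 TB backup
               |   storage = gb * 0.00099  # ~$9.9 per month at rest
               |   restore = gb * 0.0025   # ~$25 bulk restore to S3
               |   egress = gb * 0.09      # ~$900 to pull it back
               |   print(storage, restore, egress)
               | 
               | The egress term is what produces the $900-plus figure.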
        
               | Dylan16807 wrote:
               | > However, if "Data Transfer" is relevant, and I can't
               | tell if it is or isn't, uploading my backup data is free
               | but retrieving 10TB would cost $922! Is that right?
               | 
               | That's right. AWS charges offensive prices for bandwidth.
               | 
               | There are alternate methods to get data out for about
               | half the price, or you can try your luck with
               | Lightsail; if they don't decide it's a ToS violation,
               | you could get the transfer costs down to around $50.
        
               | seized wrote:
               | Glacier pricing can be hard to grok...
               | 
               | With Glacier as it is usually used, you don't read
               | data directly from Glacier storage; it has to be
               | restored to S3, where you then access it. That is where
               | the restore charges and the delays come from: you can
               | pay a low rate for the bulk option that takes up to 24
               | hours to restore your data to S3. But the real cost is
               | the bandwidth from S3 back to your NAS/datacenter/etc.,
               | which brings it up to about $90 USD/TB.
               | 
               | Other fees would include request pricing, some low amount
               | per 1000 requests. So costs can go up a bit if you store
               | 1 million small files to Glacier vs 1000 large files.
               | There is also a tipping point (IIRC about 170KB) where it
               | is cheaper to store small files on S3 than Glacier.
               | 
               | Depending on your data and access patterns it can be
               | better to use Glacier as a second backup, which is what
               | I do. All my data is backed up to Google Workspace, as
               | that is "unlimited" for now. The most important subset
               | (a few TB) also goes to Glacier. Glacier is pay as you
               | go; there isn't some "unlimited" or "5TB for life" type
               | deal that can change. If Google Workspace ever stops
               | being "unlimited" or something happens to it, I still
               | have the most important data in Glacier, and it's data
               | that I have no qualms paying >$1k to get back.
               | 
               | But for me restoring from Glacier means that my NAS is
               | dead (ZFS RAIDZ2 on good hardware) and Google Workspace
               | has failed me at the same time.
        
               | DougWebb wrote:
               | Cool, thank you for the details. None of their marketing
               | or FAQs for Glacier mention that getting the data back
               | means going to S3 first and then paying S3's outgoing
               | bandwidth costs. As deceptive as I expected.
               | 
               | I'll check out Google Workspace; that sounds like the
               | right level of kludge for me, since this is the first
               | time I've ever bothered to try to set up off-site backups.
               | I only started using RAID a couple of years ago.
        
               | kstrauser wrote:
               | Are you sure about the "restored to s3" bit? Their SDK
               | seems to fetch directly from Glacier.
               | 
               | Note that the official name is "S3 Glacier", so from
               | AWS's public perspective, it _is_ S3.
        
               | kstrauser wrote:
               | The $922 sounds about right. That jibes with my
               | estimates.
               | 
               | There's another (unofficial!) calculator at
               | http://liangzan.net/aws-glacier-calculator/ you can toy
               | with.
        
               | DougWebb wrote:
               | Thanks, I'll check it out.
        
             | seized wrote:
             | Glacier is about $1 USD/TB/month just for storing data.
             | If you need to retrieve it, that ends up being about
             | $90 USD/TB; most of that is bandwidth charges.
        
               | adrian_b wrote:
               | That means that if you store the data for much more
               | than half a year, Glacier becomes more expensive than
               | storing on tapes.
               | 
               | Of course, tapes require a tape drive, and you need a
               | lot of data to offset its cost, but with retrieval
               | priced that high it would not take much data to equal
               | the cost of a tape drive.
               | 
               | Glacier is OK for a couple of TB, but for tens or
               | hundreds it would not be suitable.
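               | 
               | A minimal sketch of that break-even in Python (drive
               | cost deliberately ignored, prices as above):
               | 
               |   glacier = 1.0  # $/TB/month at rest
               |   tape = 7.0     # $/TB one-off, LTO-8 media
               |   print(tape / glacier)  # ~7 months of storage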
        
               | Dylan16807 wrote:
               | > Of course, tapes require a tape drive, and you need
               | a lot of data to offset its cost, but with retrieval
               | priced that high it would not take much data to equal
               | the cost of a tape drive.
               | 
               | But the less you expect to use it, the less this matters.
               | 
               | So I'd put the break-even point a bit higher. Tape is
               | good for 100TB or more but for tens it's hard to justify
               | a tape drive.
               | 
               | Also it's important to remember to get those tapes
               | offsite every week!
        
             | santoshalper wrote:
             | Few bucks a month. More if I needed to retrieve something.
        
             | unethical_ban wrote:
             | I don't have automated mirroring set up, but I have
             | insight.
             | 
             | I use a Windows free tool called FastGlacier. I set up an
             | IAM user on my AWS account for my backups, and use those
             | creds to log in. Then it's drag and drop! You can even use
             | FastGlacier to encrypt/decrypt on the fly as you upload and
             | download.
             | 
             | Glacier is cheap because the retrieval times are very slow
             | - something like 1-12 hours depending on the tier.
             | 
             | I have about 100GB of critical data. Personal documents,
             | photos and some music I don't want to have to search for if
             | the house burns down. It's something like a dollar a month.
             | Less than a cup of coffee.
        
               | thrtythreeforty wrote:
               | Deep Archive is super cost effective, $1/TB/mo. For the
               | house-burns-down scenario, I don't mind the 24hr
               | retrieval time.
        
         | zwieback wrote:
         | Yes, interesting thought. On my ride in to work I was actually
         | thinking how our situation is exactly opposite: in our
         | environment (R&D test and measurement and production
         | automation) data is everything and never in the cloud so we
         | don't get to benefit from all the cool stuff the kids are doing
         | these days. Historical data can go in the cloud (as long as
         | we're reasonably sure it's secure) but operational data from
         | our test and production tooling (e.g. assembly lines and end-
         | of-line audit tools) has to be right there with super short
         | latency.
         | 
         | So we're still very interested in things like hard disks, NVMe,
         | etc.
        
         | birdyrooster wrote:
         | I know for a fact I don't need 40 TB of movies and TV, but
         | that changes nothing.
        
         | NikolaNovak wrote:
         | I think it's often true at the business level (most
         | departments in a large company don't need the redundancy and
         | uptime they think they do),
         | 
         | and rarely if ever true at the personal/family level (most
         | people don't think their phone/tablet/laptop could lose their
         | data; most people don't think of their data past the photos
         | they took this week - "Oh, it's OK to lose... wait, my tax
         | documents are gone? Photos of my baby from two years ago are
         | gone? My poems are gone? My [.. etc] is gone???").
        
       | tyingq wrote:
       | _" most database-like applications do their redundancy
       | themselves, at the application level, so that...storage
       | replication in distributed storage...is not only useless, but
       | actively undesirable"_
       | 
       | I do believe that's been the author's experience. However, I
       | think he may be unaware that's not everyone's...or even most
       | people's, experience.
        
         | [deleted]
        
         | isotopp wrote:
         | If you are not running a database at home, you have the
         | database in a replication setup.
         | 
         | That provides capacity, but also redundancy. Better redundancy
         | than at the disk level - fewer resources are shared.
         | 
         | https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html Here
         | is how we run our databases.
        
           | tyingq wrote:
           | _" If you are not running a database at home, you have the
           | database in a replication setup."_
           | 
           | That's one of the perceptions I'm saying isn't always true.
           | Especially in big, non-tech companies that have a mish-mash
           | of crazy legacy stuff. Traditional H/A disks and backups
           | still dominate some spaces.
        
         | Spooky23 wrote:
         | Agreed. It depends on what you do.
         | 
         | When I ran large-scale Exchange and database systems, we would
         | always get into big fights with the storage people, who
         | believed that SAN disk with performance tiering, replication,
         | etc was the way to solve all problems.
         | 
         | The problem at the time was that the fancy storage cost
         | $40+/GB/mo and performed like hot garbage because the data
         | didn't tier well. For email, the "correct" answer in 2010 was
         | local disk or dumb SAS arrays with an Exchange DAG.
         | 
         | For databases, the answer was "it depends". Often it made sense
         | to use the SAN and replication at that layer.
        
         | [deleted]
        
         | walrus01 wrote:
         | My biggest takeaway from this is "I certainly hope this
         | person isn't responsible for actually designing the bare
         | metal storage infrastructure meant to underlie something
         | important". They seem to be operating from the premise that
         | data replication at the 'cloud' level can solve all their
         | problems.
        
           | 411111111111111 wrote:
           | I think you're giving him too little credit there.
           | 
           | The parent's point is really spot on though: most websites
           | aren't at a scale where every stateful API has redundant
           | implementations. But the author's point does have merit:
           | inevitably, something goes wrong in every system - and when
           | it does, your system goes down if all you did was trust
           | your HA config to work. If you actually did go for
           | redundant implementations, your users likely aren't even
           | gonna notice anything went wrong.
           | 
           | It's however kinda unsustainable to maintain unless you
           | have a massive development budget.
        
             | alisonkisk wrote:
             | But why is his layer the only correct layer for redundancy?
             | 
             | He's also doing wasted work, and his redundancy has its
             | own bugs that a consumer has to worry about working
             | around.
        
               | [deleted]
        
               | smueller1234 wrote:
               | I'm not the original author but I used to work with him
               | and now do in storage infrastructure at Google. As others
               | pointed out, what the author, Kris, writes kind of
               | implies/requires a certain scale of infrastructure to
               | make sense. Let me try to provide at least a little bit
               | of context:
               | 
               | The larger your infrastructure, the smaller the relative
               | efficiency win that's worth pursuing (duh, I know,
               | engineering time costs the same, but the absolute savings
               | numbers from relative wins go up). That's why an approach
               | along the lines of "redundancy at all levels" (raid +
               | x-machine replication + x-geo replication etc) starts
               | becoming increasingly worth streamlining.
               | 
               | Another, separate consideration is types of failures you
               | have to consider: an availability incident (temporary
               | unavailability) vs. durability (permanent data loss). And
               | then it's worth considering that in the limit to long
               | durations, an availability incident will become the same
               | as a durability incident. This is contextual: To pick an
               | obvious/illustrative example, if your Snapchat messages
               | are offline for 24h, you might as well have lost the data
               | instead.
               | 
               | Now, machines fail, of course. Doing physical maintenance
               | (swapping disks) is going to take significant, human time
               | scales. It's not generally tolerable for your data to be
               | offline for that long. So local RAID barely helps at all.
               | Instead, you're going to want to make sure your data
               | stays available despite a certain rate of machine
               | failure.
               | 
               | You can now make similar considerations for different,
               | larger domains. Network, power, building, city/location,
               | etc. They have vastly different failure probabilities and
               | also different failure modes (network devices failing is
               | likely an availability concern, a mudslide into your DC
               | is a bit less likely to recover). Depending on your
               | needs, you might accept some of these but not others.
               | 
               | The most trivial way to deal with this is to simply make
               | sure you have a replica of each chunk of data in multiple
               | of each of these kinds of failure zones. A replica each
               | on multiple machines (pick the amount of redundancy you
               | need based on a statistical model from component failure
               | rates), a replica each on machines under different
               | network devices, on different power, in different
               | geographies, etc.
               | 
               | That's expensive. The next most efficient thing would be
               | to use the same concept as RAID (erasure codes) and apply
               | that across a wider scope. So you basically get RAID, but
               | you use your clever model of failure zones for placement.
               | 
               | This gets a bit complicated in practice. Most folks
               | stick to replicas. (E.g., last I looked, HDFS/Hadoop
               | only supported replication, but it did use knowledge of
               | network topology for placing data.)
               | 
               | The reason why you don't want to do this in your
               | application is because it's really kinda complicated.
               | You're far more likely to have many applications than
               | many storage technologies (or databases).
               | 
               | Now, at some point of infrastructure size or complexity
               | or team size it may make sense to separate your storage
               | (the stuff I'm talking about) from your databases as
               | well. But as Kris argues, many common databases can be
               | made to handle some of these failure zones.
               | 
               | In any case, that's the extremely long version of an
               | answer to your question why you'd handle redundancy in
               | this particular layer. The short answer is: Below this
               | layer is too small a scope or with too little meta
               | information. But doing it higher in the stack fails to
               | exploit a prime opportunity to abstract away some really
               | significant complexity. I think we all know how useful
               | good encapsulation can be! You avoid doing this in
               | multiple places simply because it's expensive.
               | 
               | (Everything above is common knowledge among storage
               | folks, nothing is Google specific or otherwise it has
               | been covered in published articles. Alas, the way we
               | would defend against bugs in storage is not public.
               | Sorry.)
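               | 
               | For illustration only, a toy placement sketch in
               | Python (not how any real system does it) that picks
               | replica hosts so that no two copies share a failure
               | domain:
               | 
               |   from itertools import combinations
               | 
               |   # (rack, power, site) per machine; made-up data
               |   hosts = {
               |       "m1": ("r1", "p1", "A"),
               |       "m2": ("r1", "p1", "A"),
               |       "m3": ("r2", "p2", "A"),
               |       "m4": ("r3", "p3", "B"),
               |   }
               | 
               |   def ok(a, b):
               |       # hosts must differ in rack, power and site
               |       return all(x != y for x, y in
               |                  zip(hosts[a], hosts[b]))
               | 
               |   def place(n):
               |       # first host set where every pair is ok()
               |       for g in combinations(hosts, n):
               |           if all(ok(a, b) for a, b
               |                  in combinations(g, 2)):
               |               return g
               | 
               |   print(place(2))  # -> ('m1', 'm4')
               | 
               | Real systems also weigh capacity, load and failure
               | probabilities, but the placement idea is the same.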
        
               | 411111111111111 wrote:
               | We're currently quite off-topic, if I'm honest.
               | 
               | I think the author was specifically talking about
               | RAID-1 redundancy and is advocating that you can leave
               | your systems with RAID-0 (so no redundant drives in
               | each server), as you're gonna need multiple nodes in
               | your cluster anyway... so if any of your system's disks
               | break, you can just let it go down and replace the disk
               | while the node is offline.
               | 
               | But despite being off-topic: redundant implementations
               | are - from my experience - not used in a failover way.
               | They're active at all times and load is spread if you
               | can do that, so you'd likely find the inconsistencies
               | in the integration test layer.
        
               | rbanffy wrote:
               | > not used in a failover way.
               | 
               | Aurora works like that. Read replicas are on standby and
               | become writable when the writable node dies or is
               | replaced. They can be used, of course, but so can other
               | standby streaming replicas.
        
           | ahupp wrote:
           | Not sure what to say, but this is how it works on all the
           | large systems I'm familiar with.
           | 
           | Imagine you have two servers, each with two 1TB disks (4TB
           | of physical storage), and you have two distinct services
           | with 1TB datasets and want some storage redundancy.
           | 
           | One option is to put each server's pair of disks in a
           | RAID-1 configuration, so each RAID instance holds one
           | dataset.
           | This protects against a disk failure, but not against server
           | or network failures.
           | 
           | Your other option is to put one copy of the dataset on each
           | server. Now you are protected from the failure of any single
           | disk, server, or the network to that server.
           | 
           | In both cases you have 2TB of logical storage available.
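           | 
           | A toy sketch of that difference in Python (single-
           | component failures only; each copy is a (server, disk)
           | pair):
           | 
           |   raid1 = {"A": [("s1", "d1"), ("s1", "d2")],
           |            "B": [("s2", "d3"), ("s2", "d4")]}
           |   spread = {"A": [("s1", "d1"), ("s2", "d3")],
           |             "B": [("s1", "d2"), ("s2", "d4")]}
           | 
           |   def survives(layout, failed):
           |       # each dataset keeps a copy avoiding the failure
           |       return all(any(failed not in c for c in copies)
           |                  for copies in layout.values())
           | 
           |   for failed in ["s1", "d1"]:
           |       print(failed, survives(raid1, failed),
           |             survives(spread, failed))
           |   # s1 False True  (server loss: only spread survives)
           |   # d1 True True   (disk loss: both survive)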
        
             | wafflespotato wrote:
             | You're putting yourself at risk of split-brain, though,
             | or of downtime due to failover (or the lack of it).
             | 
             | In either case, what you're describing isn't really the
             | 'cloud' alternative.
        
               | dan_quixote wrote:
               | Yeah, that's a complication/cost of HA that a significant
               | portion of industry has long accepted. Everywhere I've
               | been in the last 5 years has had this assumption baked in
               | across all distributed systems.
        
               | nightfly wrote:
               | Where possible, my organization tries to deploy
               | services in sets of three (and tries to require a
               | quorum) to reduce or eliminate split-brain situations.
               | And we're very small scale.
        
             | [deleted]
        
         | lurkerasdfh8 wrote:
         | Even if the author's assumptions are true and valid, this is
         | like saying mirrored RAID is a safe backup option :)
         | 
         | It is not. If you have 10 DB hosts with the same brand of
         | NVMe drive and they fail under a workload, what good is it
         | that you have 10 hot-hot failover hosts? You just bought
         | yourself a few days, or hours if you are especially unlucky
         | and using consumer-grade drives.
        
           | tyingq wrote:
           | Yep. Mirroring happily mirrors mistakes very well, and very
           | quickly.
        
             | alisonkisk wrote:
             | Replication also replicates mistakes.
        
               | isotopp wrote:
               | Correct.
               | 
               | For that you have time delayed replicas and of course
               | backups. But time delayed replicas are usually much
               | faster to restore than a backup.
        
       | kardos wrote:
       | Not following the leaf-and-spine figure. 40x10G is 400G, but
       | 4x40G is 160G, so how is this completely oversubscription free?
       | 
       | Edit: followed the link, it is 2.5:1 oversubscribed
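       | 
       | As a quick sanity check on the arithmetic, in Python:
       | 
       |   downlink = 40 * 10  # Gbit/s toward the servers
       |   uplink = 4 * 40     # Gbit/s toward the spine
       |   print(downlink / uplink)  # 2.5 -> 2.5:1 oversubscribed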
        
         | walrus01 wrote:
         | it's worth noting that 40GbE is an absolute dead end in terms
         | of technology, in the ISP world. Look at how many 40GbE members
         | there are on many major IXes and other things. It's super cheap
         | to buy an older Arista switch on ebay with N x 10GbE ports and
         | a few 40GbE uplinks, because they're all getting pulled from
         | service and sold off cheap. The upgrade path for 10GbE is
         | 100GbE.
         | 
         | major router and switch manufacturers aren't even making new
         | line cards with multiple 40GbE ports on them anymore, they're
         | tech from 6-7 years ago. You can buy either something with
         | dense 10GbE SFP+ ports, or something with QSFP/QSFP28 or
         | similar 100GbE ports.
         | 
         | If you want something for a test/development lab, or are on a
         | _very_ tight budget, sure.
        
           | bluedino wrote:
           | Meanwhile some budget homelabbers are jumping on the 2.5Gb
           | bandwagon
        
         | isotopp wrote:
         | The image is from 2012 and describes a topology, but not a
         | technology.
         | 
         | Today you'd rather use an Arista 7060 or 7050CX, or a
         | Juniper 5200, as ToR. You'd not build 1:1, but plan for it in
         | terms of ports and cable space, then add capacity as needed.
         | 
         | Almost nobody in an enterprise environment actually needs
         | 1:1, unlike hosters or hyperscalers renting nodes to others.
         | Even then you'd probably be able to get away with a certain
         | amount of oversubscription.
        
       ___________________________________________________________________
       (page generated 2021-06-11 23:01 UTC)