[HN Gopher] NVMe is not a hard disk
___________________________________________________________________
NVMe is not a hard disk
Author : omnibrain
Score : 195 points
Date : 2021-06-11 14:45 UTC (8 hours ago)
(HTM) web link (blog.koehntopp.info)
(TXT) w3m dump (blog.koehntopp.info)
| KaiserPro wrote:
| NVME over X seems like fun. After a decade of shying away from
| fibre channel, we've finally got to the point where we've
| forgotten how awful iscsi was to look after. (and how expensive
| FC used to be, and how disappointing SAS switches were.)
| jabl wrote:
| Don't worry. By the time the worst bugs are ironed out, and we
| have management tools worth crap, there will be a new shiny.
|
| That being said, nvme seems nice with no major footguns, so I
| guess NVME over X is in principle at least a good idea.
| isotopp wrote:
| "I don't know what features the network of the future will
| have, but it will be called Ethernet."
|
| Ethernet is absorbing all the features of other networking
| technologies and adding them to its own perfection.
| villgax wrote:
| Wish it were as swappable as RAM sticks. It's quite a pain to get
| at with a screwdriver after removing the GPU during OS installs on
| two drives without bootloader crap.
| Godel_unicode wrote:
| Icy Dock makes an entire range of hot-swap docks for M.2 NVMe
| drives.
| anarazel wrote:
| E.g. U.2 drives can utilize NVMe but in a form factor easier to
| swap (including hot plugging support).
|
| Edit: typo
| snuxoll wrote:
| NVMe is just a protocol for accessing flash memory over PCI
| Express; do not confuse it with the M.2 form factor (which also
| supports SATA and USB!). The Optane P900 I use as a ZFS log
| device is NVMe and plugs into a standard PCIe slot on my
| PowerEdge R520, and servers frequently use U.2 form factor
| drives.
| anarazel wrote:
| Not necessarily flash - the p900 you list isn't flash for
| one. Nor is it necessarily over PCIe...
| walrus01 wrote:
| I think where people commonly confuse it is because _most
| often_ when they see an M.2 device, it is the variant of the
| slot that exposes PCIe pins to the NVMe SSD, such as a
| $100 M.2 2280 SSD from Samsung, Intel, WD, etc. As you
| mention, there's lots of other things which can be
| electrically connected to a motherboard's I/O buses in an M.2
| slot.
| Syzygies wrote:
| _" Flash shredders exist, too, but in order to be compliant the
| actual chips in their cases need to be broken. So what they
| produce is usually much finer grained, a "sand" of plastics and
| silicon."_
|
| For metal chip credit cards I can't shred, I put them on a brick
| in the back yard and torch them with a weed burner. There's a
| psychological bonus here, if for example Amex made me change
| cards through no choice of my own because they were hacked.
|
| This wouldn't be "compliant" for a flash drive, but it would be
| effective.
| DeliriumTrigger wrote:
| This blog post is the perfect example of "I think I have a deep
| understanding but I really don't."
| walrus01 wrote:
| > "Customers of the data track have stateless applications,
| because they have outsourced all their state management to the
| various products and services of the data track."
|
| I have no idea what this is even supposed to mean. It's like
| somebody combined some buzzwords thought up by a fresh business
| school marketing graduate working in the 'cloud' industry with
| an attempt at actual x86-64 hardware systems engineering.
|
| The whole premise of the first half of the article seems to be
| 'you don't need to design a lot of redundancy and fault
| tolerance'; the second part then goes into a weird explanation
| of NVMe targets on CentOS. I hope this person isn't actually
| responsible for building storage systems at the bare metal
| level supporting some production business application.
| cowmoo728 wrote:
| I think the article is saying that a web server (customers)
| should be stateless, because everything important should be
| in a database (data track) on another host. And that database
| probably has application level handling for duplicating
| writes to another disk or another host.
|
| The conclusion seems to be that hardware level data redundancy
| isn't important because existing database software already
| handles duplication in application code. I
| don't understand how that conclusion was reached. Hardware
| level redundancy like raid1 seems useful because it
| simplifies handling a common failure case when a single HDD
| or NVME fails on a database server. Hardware redundancy is
| just the first stage in a series of steps to handle drive
| failure. I do agree that a typical stateless server doesn't
| need raid1, but afaik it's not standard practice for a
| stateless web application to bother with raid1 anyway.
| isotopp wrote:
| > I think the article is saying that a web server
| (customers) should be stateless, because everything
| important should be in a database (data track) on another
| host. And that database probably has application level
| handling for duplicating writes to another disk or another
| host.
|
| Correct.
|
| > Hardware level redundancy like raid1 seems useful because
| it simplifies handling a common failure case when a single
| HDD or NVME fails on a database server
|
| Nobody needs that if your database does replicate.
| Cassandra replicates data. MySQL in a replication setup
| replicates data. And so on. Individual nodes in such a
| setup are as expendable as individual disks in a RAID. More
| so, because you get not only protection against a disk
| failure, but depending on deployment strategy also against
| loss of a node, a rack, or a rack row. Or even loss of a DC
| or AZ.
| wmf wrote:
| It's basic 12-factor aka cloud native thinking: "Any data
| that needs to persist must be stored in a stateful backing
| service, typically a database."
| notaplumber wrote:
| > Because flash does not overwrite anything, ever.
|
| This is repeated multiple times in the article, and I refuse to
| believe it is true. If NVME/SSDs never overwrote anything, they
| would quickly run out of available blocks, especially on OSs that
| don't support TRIM.
| isotopp wrote:
| Flash has a flash translation layer (FTL). It translates
| logical block addresses (LBAs) into physical addresses ("PHY").
|
| Flash can write at a granularity similar to a memory page
| (flash pages, around 4-16 KB). It can erase only at a much
| larger granularity (erase blocks of around 512-ish pages).
|
| The FTL will try to find free pages to write your data to. In
| the background, it will also try to move data around to
| generate unused erase blocks and then erase them.
|
| In flash, seeks are essentially free. That means it no longer
| matters whether blocks are adjacent. Also, because of the
| FTL, adjacent LBAs are not necessarily adjacent on the physical
| layer. And even if you do not rewrite a block, the garbage
| collection may move data around at the PHY layer in
| order to generate completely empty erase blocks.
|
| The net effect is that positioning as seen from the OS no
| longer matters, and that the OS layer has zero control over
| adjacency and erasure at the PHY layer. Rewriting, defragging,
| or other OS level operations cannot control what happens
| physically at the flash layer.
|
| TRIM is a "blatant layering violation" in the Linus sense: It
| tells the disk "hardware" what the OS thinks it no longer
| needs. TRIM'ed blocks can be given up and will not be kept when
| the garbage collector tries to free up an erase page.
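|
| A drastically simplified, hypothetical sketch of the FTL idea
| in Python (toy sizes and invented names; not how any real
| controller is implemented):
|
|   PAGES_PER_ERASE_BLOCK = 4
|   NUM_ERASE_BLOCKS = 8
|
|   # never-yet-written physical pages, as (erase_block, page)
|   free_pages = [(b, p) for b in range(NUM_ERASE_BLOCKS)
|                        for p in range(PAGES_PER_ERASE_BLOCK)]
|   lba_to_phy = {}   # logical block address -> physical page
|   garbage = set()   # pages whose data is stale or trimmed
|
|   def write(lba, data):
|       # never overwrite in place: take a fresh page, remap LBA
|       phy = free_pages.pop(0)
|       if lba in lba_to_phy:
|           garbage.add(lba_to_phy[lba])  # old copy becomes garbage
|       lba_to_phy[lba] = phy             # real FTL programs `data` here
|
|   def trim(lba):
|       # the OS tells the FTL it no longer needs this LBA
|       if lba in lba_to_phy:
|           garbage.add(lba_to_phy.pop(lba))
|
|   def erasable_blocks():
|       # blocks the GC could erase without relocating live data
|       live_blocks = {b for (b, _) in lba_to_phy.values()}
|       dirty_blocks = {b for (b, _) in garbage}
|       return dirty_blocks - live_blocks
|
|   for lba in range(4):
|       write(lba, b"v1")      # fills erase block 0
|   for lba in range(4):
|       write(lba, b"v2")      # rewrites land in erase block 1
|   print(erasable_blocks())   # {0}: block 0 is pure garbage now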
| anarazel wrote:
| > In flash, seeks are essentially free. That means it no
| longer matters whether blocks are adjacent.
|
| > The net effect is that positioning as seen from the OS no
| longer matters, and that the OS layer has zero control over
| adjacency and erasure at the PHY layer. Rewriting, defragging,
| or other OS level operations cannot control what happens
| physically at the flash layer.
|
| I don't agree with this. The "OS visible position" is
| relevant, because it influences what can realistically be
| written together (multiple larger IOs targeting consecutive
| LBAs in close time proximity). And writing data in larger
| chunks is very important for good performance, particularly
| in sustained write workloads. And sequential IO (in contrast
| to small random IOs) _does_ influence how the FTL will lay
| out the data to some degree.
| notaplumber wrote:
| Thanks for this part, I feel like this was a crucial piece of
| information I was missing. It also explains my observations
| about TRIM not being as important as people claim it is; the
| firmware on modern flash storage seems more than capable of
| handling this without OS intervention.
| isotopp wrote:
| The GC in the device cleans up.
|
| TRIM is useful, it gives the GC important information.
|
| TRIM is not that important as long as the device is not
| full (less than 80% full, generally speaking, but it is very
| easy to produce pathological cases that are way off in
| either direction). Once the device fills up beyond that, it
| becomes crucial.
| cduzz wrote:
| There's nuance to this; the deletes / overwrites are
| accomplished by bulk wiping entire blocks.
|
| Rather than change the paint color in a hallway you have to
| tear down the house and build a new house in the vacant lot
| next door that's a duplicate of the original, but with the new
| hallway paint.
|
| To optimize, you keep a bucket of houses to destroy, and a
| bucket of vacant lots, and whenever a neighborhood has lots of
| "to be flattened houses" the remaining active houses are copied
| to a vacant lot and the whole neighborhood is flattened.
|
| So, things get deleted, but not in the way people are used to
| if they imagine a piece of paper and a pencil and eraser.
| zdw wrote:
| inspired by that last sentence, the analogy could be rewritten
| as:
|
| - lines on page
| - pages of paper
| - whole notebooks
|
| and might be easier for people to grok than the earlier
| houses/paint analogy.
| jazzyjackson wrote:
| I don't know, I like the drama of copying a neighborhood
| and tearing down the old one xD
| jnwatson wrote:
| Reminds me of https://xkcd.com/1737/.
|
| "When a datacenter catches fire, we just rope it off and
| rebuild one town over."
| jdironman wrote:
| Speaking of xkcd, 2021 is the return of "All your bases". See
| the alt-text on the image.
|
| https://xkcd.com/286/
| slaymaker1907 wrote:
| Just to add to the explanation, SSDs are able to do this
| because they have a layer of indirection akin to virtual
| memory. This means that what your OS thinks is byte 800000 of
| the SSD may change its actual physical location on the SSD
| over time even in the absence of writes or reads to said
| location.
|
| This is a very important property of SSDs and is a large
| reason why log structured storage is so popular in recent
| times. The SSD is very fast at appends, but changing data is
| much slower.
| alisonkisk wrote:
| Append-only garbage-collected storage was used in data
| centers even when hard disks were (and are) popular, because
| it's more reliable and scalable.
| hawski wrote:
| Are there write-once SSDs? They would have a tremendous
| capacity. Probably good for long term backups or archiving.
| Also possibly with a log structured filesystem only.
| weinzierl wrote:
| I find this a super interesting question. I always
| assumed that long term stability of electronic non-
| volatile memory is worse than that of magnetic memory.
| When I think about it, I can't think of any compelling
| reason why that should be the case. Trapped electrons vs
| magnetic regions; I have no intuition which one of them
| is likely to be more stable.
|
| There is a question on Super User about this topic
| with many answers but no definitive conclusion. There
| seem to be some papers touching on the subject, but at a
| glance I couldn't find anything useful in them.
|
| [1] https://superuser.com/questions/4307/what-lasts-
| longer-data-...
| olejorgenb wrote:
| According to https://www.ni.com/en-
| no/support/documentation/supplemental/... (Seems kinda
| reputable at least)
|
| "The level of charge in each cell must be kept within
| certain thresholds to maintain data integrity.
| Unfortunately, charge leaks from flash cells over time,
| and if too much charge is lost then the data stored will
| also be lost.
|
| During normal operation, the flash drive firmware
| routinely refreshes the cells to restore lost charge.
| However, when the flash is not powered the state of
| charge will naturally degrade with time. The rate of
| charge loss, and sensitivity of the flash to that loss,
| is impacted by the flash structure, amount of flash wear
| (number of P/E cycles performed on the cell), and the
| storage temperature. Flash Cell Endurance specifications
| usually assume a minimum data retention duration of 12
| months at the end of drive life."
| dataflow wrote:
| > When I think about it, I can't think of any compelling
| reason why that should be the case. Trapped electrons vs
| magnetic regions; I have no intuition which one of them
| is likely to be more stable.
|
| My layman intuition (which could be totally wrong) is
| that trapped electrons have a natural tendency to escape
| due to pure thermal jitter. Whereas magnetic materials
| tend to stick together, so there's at least that. I don't
| know how much of this matches the actual electron
| physics/technology though...
| sharikone wrote:
| Hmm I don't think this is conclusive. Thermal jitter
| makes magnetic boundaries change too, and of course you
| have to add to it that it is more susceptible to magnetic
| interference.
|
| I don't have intuition either, but I don't think this
| explanation is sufficient
| madacol wrote:
| > Trapped electrons vs magnetic regions;
|
| From the physics point of view, aren't both cases the
| same thing?
|
| Aren't magnetic regions a state of the electric field? So
| if I move electrons in and out, the electric field should
| be changing as well.
| mananaysiempre wrote:
| No. A region of a piece of material is magnetized in a
| certain direction when its (ionized) atoms are mostly
| oriented in that direction; the presence of a constant
| magnetic field is (roughly speaking) only a consequence
| of that.
|
| So flash memory is about the electrons, while magnetic
| memory is about the ions.
| pas wrote:
| Aren't permanent magnetics a direct result of oriented
| spins? (So due to quantum effects?)
| wtallis wrote:
| I don't think anyone would make literally write-once
| drives with flash memory; that's more optical disk
| territory. But zoned SSDs and host-managed SMR hard
| drives make explicit the distinction between writes and
| larger-scale erase operations, while still allowing
| random-access reads.
| dmitrygr wrote:
| Modern multi-bit-per-cell flash has quite terrible data
| retention. It is especially low if it is stored in a warm
| place. You'd be lucky to see ten years without an
| occasional re-read + error-correct + re-write operation
| going on
| eqvinox wrote:
| Making them write-once doesn't increase the capacity;
| that's mostly limited by how many analog levels you can
| distinguish on the stored charge, and how many cells you
| can fit. The management overhead and spare capacity to
| make SSDs rewritable is -to my knowledge- in the single
| digit percentages.
|
| (Also you need the translation layer even for write-once
| since flash generally doesn't come 100% defect free. Not
| sure if manufacturers could try to get it there, but
| that'd probably drive the cost up massively. And the
| translation layer is there for rewritable flash anyway...
| the cost/benefit tradeoff is in favor of just living with
| a few bugged cells.)
| jonny_eh wrote:
| > Making them write-once doesn't increase the capacity
|
| It could theoretically make them cheaper. But I guess
| that there wouldn't be enough demand, so you'd be better
| off having some kind of OS enforced limitation on it.
| entangledqubit wrote:
| I suspect that hawski was assuming that a WORM SSD would
| be based on a different non-flash storage medium. I don't
| know any write once media that has similar read/write
| access times to an SSD.
|
| FWIW, there are WORM microsd cards available but it looks
| like they still use flash under the hood.
| hawski wrote:
| I don't know enough specifics, so I didn't assume
| anything :) In fact I was not aware of non-flash SSDs.
|
| Because of the Internet age there probably is not much
| place for write-once media anyway, even if it would be
| somewhat cheaper. But maybe for specialized applications,
| or if it were much, much cheaper per GB.
| wongarsu wrote:
| The only write once media I'm aware of that is in
| significant use are WORM tapes. They don't offer
| significant advantages over regular tapes, but for
| compliance reasons it can be useful to just make it
| impossible to modify the backups.
| juloo wrote:
| That would be magnetic tapes.
| phonon wrote:
| Like https://en.wikipedia.org/wiki/ROM_cartridge ?
|
| I think Nintendo uses https://www.mxic.com.tw/en-
| us/products/ROM/Pages/default.asp...
|
| https://www.mxic.com.tw/CachePages/en-us-Product-ROM-
| default...
| hinkley wrote:
| > The SSD is very fast at appends, but changing data is
| much slower.
|
| No, it's worse than that. The fact that it's an overly
| subtle distinction is the problem.
|
| SSDs are fast while write traffic is light. From an
| operational standpoint, the drive is _lying to you_ about
| its performance. Unless you are routinely stress testing
| your system to failure, you may have a very inaccurate
| picture of how your system performs under load, meaning you
| have done your capacity planning incorrectly, and you will
| be caught out with a production issue.
|
| Ultimately it's the same sentiment as people who don't like
| the worst-case VACUUM behavior of Postgres - best-effort
| algorithms in your system of record make some people very
| cranky. They'd rather have higher latency with a smaller
| error range, because at least they can see the problem.
| wand3r wrote:
| I think the explanation is sound maybe (I am not that
| familiar) but the analogy gets a bit lost when you talk about
| buckets of houses and buckets of vacant lots.
|
| Maybe there is a better analogy or paradigm to view this
| through.
| daniellarusso wrote:
| Spoiler alert - This is the plot to 'The Prestige'.
| aidenn0 wrote:
| Perhaps they mean it must erase an entire block before writing
| any data, unlike a disk that can write a single sector at a
| time?
| dragontamer wrote:
| The issue is that DDR4 is like that too. Not only the 64 byte
| cache line, but DDR4 requires a transfer to the sense
| amplifiers (aka a RAS, row access strobe) before you can read
| or write.
|
| The RAS command reads out, and destroys, the entire row, around
| 1024 bytes or so. This is because the DDR4 cells only have
| enough charge for one reliable read; after that the capacitors
| don't have enough electrons to know whether a 0 or 1 was stored.
|
| A row close command returns the data from the sense amps back
| to the capacitors. Refresh commands renew the 0 or 1 as the
| capacitor can only hold the data for a few milliseconds.
|
| ------
|
| The CAS latency statistic assumes that the row was already
| open. It's a measure of the sense amplifiers and not of the
| actual data.
| zymhan wrote:
| What does DDR have to do with NVMe?
| nine_k wrote:
| You can't write a byte, or a word, either.
|
| The "fact" that you can do it in your program without
| disturbing bytes around it is a convenient fiction that
| the hardware fabricates for you.
| dragontamer wrote:
| DDR4 is effectively a block device and not 'random
| access'.
|
| Pretty much only cache is RAM proper these days (aka: all
| locations have equal access time... that is, you can
| access it randomly with little performance loss).
| mirker wrote:
| I'm confused. What's the difference between a cache line
| and a row in RAM? They're both multiples of bytes. You
| have data sharing per chunk in either case.
|
| The distinction seems to be how big the chunk is not
| uniformity of access time (is a symmetrical read disk not
| a block device?)
| dragontamer wrote:
| Hard disk chunks are 512 bytes classically, and smaller
| than the DDR4 row of 1024 bytes!!
|
| So yes. DDR4 has surprising similarities to a 512byte
| sector hard drive (modern hard drives have 4k blocks)
|
| >> What's the difference between a cache line and a row
| in RAM?
|
| Well DDR4 doesn't have a cache line. It has a burst
| length of 8, so the smallest data transfer is 64 bytes.
| This happens to coincide with L1 cache lines.
|
| The row is 1024 bytes long. It's pretty much the L1 cache
| on the other side, so to speak. When your CPU talks to
| DDR4, it needs to load a row (RAS, all 1024 bytes) before
| it can CAS-read a 64-byte, burst-length-8 chunk.
|
| -----------
|
| DDR4, hard drives, and Flash are all block devices.
|
| The main issue for Flash technologies, is that the erase
| size is even larger than the read/write block size.
| That's why we TRIM for NVMe devices.
| Dylan16807 wrote:
| The difference is that on DDR you have infinite write
| endurance and you can do the whole thing in parallel.
|
| If flash was the same way, and it could rewrite an entire
| erase block with no consequences, then you could ignore
| erase blocks. But it's nowhere near that level, so the
| performance impact is very large.
| dragontamer wrote:
| That's a good point.
|
| There are only 10,000 erase cycles per Flash cell. So a
| lot of algorithms are about minimizing those erases.
| eqvinox wrote:
| It's vaguely similar, but there's a huge difference in that
| flash needs to be erased before you can write it again, and
| that operation is much slower and only possible on much
| larger sizes. DDR4 doesn't care, you can always write, just
| the read is destructive and needs to be followed by a
| write.
|
| I think this makes the comparison unhelpful since the
| characteristics are still very different.
| eqvinox wrote:
| It's true and untrue depending on how you look at it. Flash
| memory only supports changing/"writing" bits in one direction,
| generally from 1 to 0. Erase, as a separate operation, clears
| entire sectors back to 1, but is more costly than a write.
| (Erase block size depends on the technology but we're talking
| MB on modern flash AFAIK, stuff from 2010 already had 128kB.)
|
| So, the drives do indeed never "overwrite" data - they mark the
| block as unused (either when the OS uses TRIM, or when it
| writes new data [for which it picks an empty block elsewhere]),
| and put it in a queue to be erased whenever there's time (and
| energy and heat budget) to do so.
|
| Understanding this is also quite important because it can have
| performance implications, particularly on consumer/low-end
| devices. Those don't have a whole lot of spare space to work
| with, so if the entire device is "in use", write performance
| can take a serious hit when it becomes limited by erase speed.
|
| [Add.: reference for block sizes:
| https://www.micron.com/support/~/media/74C3F8B1250D4935898DB...
| - note the PDF creation date on that is 2002(!) and it compares
| 16kB against 128kB size.]
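|
| A toy model of that program-vs-erase asymmetry in Python (tiny,
| invented sizes just to show the rule that writes can only clear
| bits and erase works on whole blocks):
|
|   PAGE_BITS = 8                  # tiny pages for readability
|   PAGES_PER_ERASE_BLOCK = 4
|   ERASED = (1 << PAGE_BITS) - 1  # erased state: all bits 1
|
|   block = [ERASED] * PAGES_PER_ERASE_BLOCK
|
|   def program(page, data):
|       # a write ("program") can only move bits from 1 to 0
|       block[page] &= data
|
|   def erase_block():
|       # erase resets every page of the whole block back to 1s
|       for i in range(PAGES_PER_ERASE_BLOCK):
|           block[i] = ERASED
|
|   program(0, 0b10101010)   # fine: clears some bits
|   program(0, 0b10100000)   # fine: clears a few more bits
|   # programming 0b11111111 would NOT bring the 0s back;
|   # the only way back to all-ones is erasing the whole block:
|   erase_block()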
| IshKebab wrote:
| By any reasonable definition they do overwrite data. It's
| just that they can't overwrite less than a block of data.
| matheusmoreira wrote:
| > Understanding this is also quite important because it can
| have performance implications
|
| Security implications too. The storage device cannot be
| trusted to securely delete data.
| effie wrote:
| If you write a whole drive's capacity of random data, you
| should be fine.
| robocat wrote:
| No. Say a particular model of SSD has over-provisioning
| of 10%, then even after writing the "whole" capacity of
| the drive, you can still be left with up to 10% of data
| recoverable from the Flash chips.
| tzs wrote:
| If a logical overwrite only involved bits going from 1 to 0,
| are drives smart enough to recognize this and do it as an
| actual overwrite instead of a copy and erase?
| eqvinox wrote:
| On embedded devices, yes, this is actually used in file
| systems like JFFS2. But in these cases the flash chip is
| just dumb storage and the translation layer is implemented
| on the main CPU in software. So there's no "drive" really.
|
| On NVMe/PC type applications with a controller driving the
| flash chips... I have absolutely no idea. I'm curious too,
| if anyone knows :)
| jasonwatkinspdx wrote:
| Generally no, because the unit of write is a page.
| Dzugaru wrote:
| Of course it does [0]. It's just that it assigns writes as
| evenly as possible (to keep wear as even as possible), so a
| log-like internal "file system" is the way to go.
|
| https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
| throwaway09223 wrote:
| The author clearly explains how this works in the sentence
| immediately following. "Instead it has internally a thing
| called flash translation layer (FTL)" ...
| notaplumber wrote:
| I unfortunately skimmed over this, isotopp's explanation
| helped clear things up in my head.
| throwaway09223 wrote:
| I just saw his post, it's a great explanation.
|
| It might also help to keep in mind that both regular disk
| drives and solid state drives remap bad sectors. Both types
| of disks maintain an unaddressable storage area which is
| used to transparently cover for faulty sectors.
|
| In a hard drive, faulty sectors are mapped during
| production and stored in the p-list, and are remapped to
| sectors in this extra hidden area. Sectors that fail at
| runtime are recorded in the g-list and are likewise
| remapped.
|
| Writes may _usually_ go to the same place in a hard drive,
| but it's not guaranteed there either.
| rbanffy wrote:
| One thing I don't get in the Mach.2 design is why not use the
| approach IBM used in mainframe hard drives of having multiple
| heads over the same platter. The Mach.2 is two drives sharing
| parts that will behave like a 2-drive RAID array. Having
| multiple heads on the same platter would allow it to read like a
| RAID-1 while writing like a RAID-0.
| myrandomcomment wrote:
| Arista was pushing leaf-spine in 2008....
| bob1029 wrote:
| My take on NVMe has evolved a lot after getting a chance to
| really play around with the capabilities of consumer-level
| devices.
|
| The biggest thing I have realized (as have others), is that
| traditional IO wait strategies don't make sense for NVMe. Even
| "newer" strategies like async/await do not give you want you
| truly want for one of these devices (still too slow). The best
| performance I have been able to extract from these is when I am
| doing really stupid busy-wait strategies.
|
| Also, the single-writer principle and serializing+batching
| writes _before_ you send them to disk are critical. With any
| storage medium where a block write costs you device lifetime,
| you want to put as much effective data per block as possible,
| and you also want to avoid editing existing blocks. Append-only
| log structures are what NVMe/flash devices live for.
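|
| A minimal sketch of that single-writer, batched, append-only
| pattern in Python (file path, batch size, and record format are
| arbitrary assumptions for illustration):
|
|   import os, queue, threading, time
|
|   BATCH_BYTES = 64 * 1024   # aim for large sequential appends
|   log_fd = os.open("/tmp/app.log",
|                    os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
|   pending = queue.Queue()   # producers enqueue, never touch disk
|
|   def writer_loop():
|       # the only thread that writes: drain, batch, append, sync
|       while True:
|           batch = [pending.get()]        # block for first record
|           while sum(map(len, batch)) < BATCH_BYTES:
|               try:
|                   batch.append(pending.get_nowait())
|               except queue.Empty:
|                   break
|           os.write(log_fd, b"".join(batch))  # one big append
|           os.fdatasync(log_fd)               # one flush per batch
|
|   threading.Thread(target=writer_loop, daemon=True).start()
|   pending.put(b"record-1\n")   # producers just enqueue records
|   pending.put(b"record-2\n")
|   time.sleep(0.1)              # let the demo flush before exit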
|
| With all of this in mind, the motivations for building gigantic
| database clusters should start going away. One NVMe device per
| application instance is starting to sound a lot more compelling
| to me. Build clusters with app logic (not database logic).
|
| In testing of these ideas I have been able to push 2 million
| small writes (~64k) per second to a single Samsung 960 pro for a
| single business entity. I don't know of any SQL clusters that can
| achieve these same figures.
| isotopp wrote:
| In NVMe you can get around 800,000 IOPS from a single device,
| but the latency gives you only around 20,000 IOPS sequentially.
| You need to talk to the device with deep queues or with
| multiple concurrent threads in order to eat the entire IOPS
| buffet.
|
| Traditional OLTP workloads do not tend to have the concurrency
| to actually saturate the NVME. You would need to be 40-way
| parallel, but most OLTP workloads give you 4-way.
|
| Multiple instances per device are almost a must.
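|
| A back-of-the-envelope check of those figures (Little's law:
| requests in flight = throughput x latency; the numbers are the
| ones quoted above, not measurements):
|
|   device_iops = 800_000        # with deep queues
|   serial_iops = 20_000         # one request at a time
|   latency_s = 1 / serial_iops  # ~50 microseconds per IO
|
|   needed_parallelism = device_iops * latency_s
|   print(needed_parallelism)    # 40.0 -> "40-way parallel"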
| anarazel wrote:
| With a lot of NVMe devices, up to medium priced server gear,
| the bottleneck in OLTP workloads isn't normal write latency,
| but slow write cache flushes. On devices with write caches
| one either needs to fdatasync() the journal on commit (which
| typically issues a whole device cache flush) or use O_DIRECT
| | O_DSYNC (ending up as a FUA write which just tags the
| individual write as needing to be durable) for journal
| writes. Often that drastically increases latency _and_ slows
| down concurrent non-durable IO, reducing the benefit of
| deeply queued IO substantially.
|
| On top-line gear this isn't an issue: such devices don't signal
| a write cache (by virtue of either having a non-volatile cache
| or enough of a power reserve to flush the cache), which then
| prevents the OS from doing the more expensive work for
| fdatasync()/O_DSYNC. One can also manually ignore the need for
| cache flushing by changing /sys/block/nvme*/queue/write_cache
| to say "write through", but that obviously loses guarantees -
| it can still be useful for testing on lower end devices.
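|
| A rough sketch of the two journal-commit variants described
| above, in Python on Linux (path and record contents are
| placeholders; real engines would use O_DIRECT with aligned
| buffers on top of O_DSYNC):
|
|   import os
|
|   # Variant 1: buffered write + fdatasync on commit. On a device
|   # with a volatile write cache this typically triggers a full
|   # device cache flush.
|   fd = os.open("/tmp/journal.bin",
|                os.O_WRONLY | os.O_CREAT, 0o600)
|   os.write(fd, b"\0" * 4096)   # journal record (placeholder)
|   os.fdatasync(fd)             # commit point: flush write cache
|   os.close(fd)
|
|   # Variant 2: O_DSYNC writes. Each write returns only once it
|   # is durable; on NVMe this can be sent as a FUA write that
|   # tags just this write instead of flushing the whole cache.
|   fd = os.open("/tmp/journal.bin",
|                os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
|   os.write(fd, b"\0" * 4096)
|   os.close(fd)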
| anarazel wrote:
| One consequence of that is that:
|
| > Multiple instances per device are almost a must.
|
| Isn't actually unproblematic in OLTP, because it increases
| the number of journal writes that need to be flushed. With
| a single instance group commit can amortize the write cache
| flush costs much more efficiently than with many concurrent
| instances all separately doing much smaller group commits.
| bob1029 wrote:
| > You need to talk with deep queues or with multiple
| concurrent threads to the device in order to eat the entire
| IOPS buffet.
|
| Completely agree. There is another angle you can play if you
| are willing to get your hands dirty at the lowest levels.
|
| If you build a custom database engine that fundamentally
| stores everything as key-value, and then builds relational
| abstractions on top, you can leverage a lot more benefit on a
| per-transaction basis. For instance, if you are storing a KVP
| per column in a table and the table has 10 columns, you may
| wind up generating 10-20 KVP items per logical row
| insert/update/delete. And if you are careful, you can make
| sure this extra data structure expressiveness does not cause
| write amplification (single writer serializes _and batches_
| all transactions).
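|
| A toy sketch of that per-column key-value layout in Python
| (table and column names are invented; a real engine would also
| encode keys, batch through the single writer, and persist):
|
|   store = {}   # (table, primary_key, column) -> value
|
|   def write_row(table, pk, row):
|       # a 10-column row becomes ~10 KV items; the single writer
|       # would serialize and batch many of these before the disk
|       for column, value in row.items():
|           store[(table, pk, column)] = value
|
|   write_row("users", 42,
|             {"name": "alice", "email": "a@example.com",
|              "plan": "pro"})
|   print(store[("users", 42, "plan")])   # point lookup: pro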
| isotopp wrote:
| You may want to play with a TiDB setup from Pingcap.
| jaytaylor wrote:
| > If you build a custom database engine that fundamentally
| > stores everything as key-value, and then builds relational
| > abstractions on top
|
| Sounds like this could be FoundationDB, among other
| contenders like TiDB.
|
| https://foundationdb.org
| 90minuteAPI wrote:
| I guess it still makes sense for higher abstraction levels
| though, right? Like a filesystem or other shared access to a
| storage resource. So these asynchronous APIs aren't writing as
| directly to storage, they're placing something in the queue and
| notifying when that batch is committed.
|
| > Append-only log structures are what NVMe/flash devices live
| for.
|
| I would think this is also good for filesystems like ZFS, APFS,
| and BTRFS, yes? I had an inkling but never really looked into
| it. Aren't these filesystems somewhat similar to append-only
| logs of changes, which serialize operations as a single writer?
| jeffinhat wrote:
| I've been messing with NVMe over TCP at home lately and it's
| pretty awesome. You can scoop up last-gen 10GbE/40GbE
| networking on eBay for cheap and build your own fast
| disaggregated storage on upstream Linux. The kernel based
| implementation saves you some context switching over other
| network file systems, and you can (probably) pay to play for on-
| NIC implementations (especially as they're getting smarter).
|
| It seems like these solutions don't have a strong
| authentication/encryption-in-transit story. Are vendors building
| this into proprietary products or is this only being used on
| trusted networks? I think it'd be solid technology to leverage
| for container storage.
| stingraycharles wrote:
| I just use iSCSI at home, but over Mellanox's RoCE, which
| performs pretty well.
|
| One thing I'm noticing is that most of these storage protocols
| do, in fact, assume converged Ethernet; that is, zero packet
| loss and proper flow control.
|
| Is this also the case with NVMe over TCP?
| effie wrote:
| > build your own fast disaggregated storage on upstream Linux
|
| Why is this better than just connecting the drive via PCIe bus
| directly to the CPU?
| innot wrote:
| > RAID or storage replication in distributed storage <..> is not
| only useless, but actively undesirable
|
| I guess I'm different from most people, good news! When building
| my new "home server" half a year ago I made a raid-1 (based on
| ZFS) with 4 NVMe drives. I'm rarely in that city, so I brought
| a fifth one and put it into an empty slot. Well, one of the 4
| drives lasted 3 months and then stopped responding. One "zpool
| replace" and I'm back to normal, without any downtime,
| disassembly, or even reboots. I think that's quite useful. When
| I'm there the next time I'll replace the dead one, of course.
| twotwotwo wrote:
| > > RAID or storage replication in distributed storage <..> is
| not only useless, but actively undesirable
|
| > I guess I'm different from most people, good news!
|
| The earlier part of the sentence helps explain the difference:
| "That is, because most database-like applications do their
| redundancy themselves, at the application level..."
|
| Running one box I'd want RAID on it for sure. Work already runs
| a DB cluster because the app needs to stay up when an entire
| box goes away. Once you have 3+ hot copies of the data and a
| failover setup, RAID within each box on top of that can be
| extravagant. (If you _do_ want greater reliability, it might be
| through more replicas, etc. instead of RAID.)
|
| There is a bit of universalization in how the blog post phrases
| it. As applied to databases, though, I get where they're coming
| from.
| goodpoint wrote:
| Compare your solution with having 4 SBCs, with 1 NVME each, at
| different locations. The network client would handle
| replication and checksumming.
|
| The total cost might be similar but you have increased
| reliability over SBC/controller/uplink failure.
|
| Of course there are tradeoffs on performance and ease of
| management...
| Godel_unicode wrote:
| You think that building 4 systems in 4 locations is likely to
| have a similar cost to one system at one location? For small
| systems, the fixed costs are a significant portion of the
| overall system cost.
|
| This is doubly true for physical or self-hosted systems.
| zxexz wrote:
| What setup do you use to put 4 NVMe drives in one box? I know
| it's possible, I've just heard of so many different setups. I
| know there are some PCIe cards that allow for 4 NVMe drives,
| but you have to match that with a motherboard/CPU combo with
| enough lanes to not lose bandwidth.
| birdyrooster wrote:
| That's exactly what they are doing. Anyone else is using
| proprietary controllers and ports for a server chassis
| pdpi wrote:
| I've been looking into building some small/cheap storage, and
| this is one of the enclosures I've been looking at.
|
| https://www.owcdigital.com/products/express-4m2
| isotopp wrote:
| For distributed storage, we use this:
| https://www.slideshare.net/Storage-Forum/operation-
| unthinkab...
|
| We then install SDS software, Cloudian for S3, Quobyte for
| File, and we used to use Datera for iSCSI. Lightbits maybe in
| the future, I don't know.
|
| These boxen get purchased with 4 NVME devices, but can grow
| to 24 NVME devices. Currently 11 TB Microns, going for 16 or
| more in the future.
|
| For local storage, multiple NVME hardly ever make sense.
| birdyrooster wrote:
| Does zpool not automatically promote the hot spare like mdadm?
| seized wrote:
| It can, if you set a disk as the hot spare for that pool.
|
| But a disk can only be a hot spare in one pool, so to have a
| "global" hot spare it has to be done manually. That may be
| what that poster was doing.
| sjagoe wrote:
| Also, if I understand it correctly, there are a few other
| caveats with hot spares: It will only activate when another
| drive completely fails, so you can't decide to replace a
| drive when it's close to failure (probably not an issue in
| this case, though, with the unresponsive drive). Second,
| with the hot spare activated, the pool is still degraded,
| and the original drive still needs to be replaced; then the
| hot spare is removed from the vdev, and goes back to being
| a hot spare.
|
| It's these reasons that I've decided to just keep a couple
| of cold spares ready that I can swap in to my system as
| needed, although I do have access to the NAS at any time.
| If I was remote like GP, I might decide to use a hot spare.
| SkyMarshal wrote:
| I've recently converted all my home workstation and NAS hard
| drives over to OpenZFS, and it's amazing. Anyone who says RAID
| is useless or undesirable just hasn't used ZFS yet.
| amarshall wrote:
| The article's author only said RAID was useless in a specific
| scenario, not generally, and the post you're replying to
| omitted this crucial context.
| isotopp wrote:
| My environment is not a home environment.
|
| It looks like this:
| https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html
| toast0 wrote:
| This article is speaking of large scale multinode distributed
| systems. Hundreds of rack sized systems. In those systems, you
| often don't need explicit disk redundancy, because you have
| data redundancy across nodes with independent disks.
|
| This is a good insight, but you need to be sure the disks are
| independent.
| merb wrote:
| Well, most often HBAs and RAID controllers are another thing
| that increases latency and makes maintenance costs go up
| quite a bit (more stuff to update), and they're another part
| that can break.
|
| That's why they're not recommended when running Ceph.
| Aea wrote:
| I'm pretty sure discrete HBAs / Hardware RAID Controllers
| have effectively gone the way of the dodo. Software RAID
| (or ZFS) is the common, faster, cheaper, more reliable way
| of doing things.
| toast0 wrote:
| Hardware RAID doesn't seem to be going away quickly.
| Since they're almost all made by the same company, and
| they can usually be flashed to be dumb HBAs, it's not too
| bad. But it was pretty painful when using managed hosting:
| the menu options with lots of disks all have the RAID
| controllers that are a pain to set up, and I'm not going
| to reflash their hardware (although I did end up doing
| some SSD firmware updates myself, because firmware bugs
| were causing issues and their firmware upgrade scripts
| weren't working well and were tremendously slow).
| amarshall wrote:
| Don't lop HBAs and RAID controllers together. The former
| is just PCIe to SATA or SCSI or whatever (otherwise it is
| not just an HBA, but indeed a RAID controller). Such a
| thing is still useful and perhaps necessary for software
| RAID if there are insufficient ports on the motherboard.
| karmakaze wrote:
| Hardware caching RAID controllers do have the advantage
| that if power is lost, the cache can still be written out
| without the CPU/software to do it. This lets you safely
| run without write-through cache fsync. This was a common
| spec for provisioned bare-metal MySQL servers I'd worked
| with.
| amarshall wrote:
| Sometimes. Other times they may make things worse by
| lying to the filesystem (and thereby also the
| application) about writes being completed, which may
| confound higher-level consistency models.
| wtallis wrote:
| It does seem to me that it's much easier to reason about
| the overall system's resiliency when the capacitor-
| protected caches are in the drives themselves (standard
| for server SSDs) and nothing between that and the OS lies
| about data consistency. And for solid state storage, you
| probably don't need those extra layers of caching to get
| good performance.
| Godel_unicode wrote:
| The entire comment thread of this article is on-prem, low
| scale admins and high-scale cloud admins talking past
| each other.
|
| You can build in redundancy at the component level, at
| the physical computer level, at the rack level, at the
| datacenter level, at the region level. Having all of them
| is almost certainly redundant and unnecessary at best.
| seized wrote:
| ZFS needs HBAs. Those get your disks connected but
| otherwise get out of the way of ZFS.
|
| But yes, hardware RAID controllers and ZFS don't go
| together.
| amarshall wrote:
| You omitted the context from the rest of the sentence:
|
| > most database-like applications do their redundancy
| themselves, at the application level ...
|
| If that's not the case for your storage (doesn't sound like
| it), then the author's point doesn't apply to your case anyway.
| In which case, yes, RAID may be useful.
| linsomniac wrote:
| We are currently converting our SSD-based Ganeti clusters from
| LVM on RAID to ZFS, to prepare for our NVMe future, without
| RAID cards (1). Was hoping to get the second box in our dev
| ganeti cluster reinstalled this morning to do further testing,
| but the first box has been working great!
|
| 1: LSI has a NVMe RAID controller for U.2 chassis, preparing
| for a non-RAID future, just in case.
| dcminter wrote:
| "Most people need their data a lot less than they think they do"
| - great way to put it and a thought provoking article.
| rattray wrote:
| Yep, reminds me of a tech blog I read in the last year or so
| (can't remember where) that talked about using 6 replicas for
| each DB shard (across 3 AZ's IIRC)... they just used EC2's on-
| disk NVMe storage (which is "ephemeral") because it's faster
| and hey, if the machine dies, you have replicas!
|
| This post's point that even with that setup, it's still nice to
| have volume-based storage for quicker image replacements is
| interesting, I'm not experienced enough with cloud setups to
| know if that makes sense (eg how long does it take to upgrade
| an EC2 instance that has data on disk? upgrade your OS? upgrade
| your pg version? does ephemeral storage vs volume storage
| affect these? I imagine not...)
| behringer wrote:
| I can't imagine wasting NVME storage on backup data. That's
| what I have my 8tb spinning hdds for.
| Koshkin wrote:
| Isn't tape more cost-effective and reliable? (Also, your
| backups do not need to be spinning all the time, if that's
| what they are doing.)
| walrus01 wrote:
| The problem with really high capacity tape these days is
| there's almost literally 1 vendor for drives and tapes,
| which is even worse than the 3 companies worldwide that
| manufacture hard drives. If you want relatively not so fast
| storage for a lot of terabytes, I bet I could clone a
| backblaze storage pod with some modifications and achieve a
| better $/TB ratio than an enterprise priced tape solution.
| adrian_b wrote:
| There are at least 2 manufacturers for tapes (Fujifilm
| and Sony), but their tapes are also sold under many other
| brands (e.g. IBM, Quantum, HP).
|
| The price per TB is currently about $7 or less (per real
| TB of LTO-8, not per marketing compressed TB), so there
| is no chance to approach the cost and reliability of
| tapes using HDDs.
|
| The problem with tapes is that currently the tape drives
| are very expensive, because their market is small, so you
| need to store a lot of data before the difference in
| price between HDDs and tapes exceeds the initial expense
| for the tape drive.
|
| Nevertheless, if you want to store data for many years,
| investing in a tape drive may be worthwhile just for the
| peace of mind, because when storing HDDs for years you
| can never be sure how they will work when operated again.
| whatshisface wrote:
| Right now disks are cheaper than tapes, even if you don't
| count the very expensive tape readers.
| eqvinox wrote:
| This is incorrect. I recently acquired an LTO-6 tape
| library; the tapes are <20EUR for 2.5TB true capacity
| (marketing = "6TB compressed".) That's <8EUR/TB. Disk
| drives start at 20EUR/TB for the cheapest garbage.
|
| Sources:
|
| https://geizhals.eu/?cat=zip&xf=1204_LTO-6 (tapes)
|
| https://geizhals.eu/?cat=hde7s&sort=r (disks)
|
| (For completeness, the tape library ran me 500EUR on eBay
| used, but they normally run about twice that. It's a
| 16-slot library, which coincidentally matches the
| breakeven - filled for 40TB it's 820EUR, the same in
| disks would've been 800EUR, though I would never buy the
| cheapest crappy disks.)
|
| FWIW, a major reason for going for tapes for me was that
| at some point my backup HDDs always ended up used for
| "real". The tape library frees me from trying to
| discipline myself to properly separate backup HDDs ;)
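|
| Reproducing the break-even arithmetic above in Python (prices
| as quoted in this comment, not current market data):
|
|   library_cost = 500      # used 16-slot LTO-6 library, EUR
|   tape_cost = 20          # per 2.5 TB (native) cartridge
|   slots = 16
|   disk_eur_per_tb = 20    # cheapest disks
|
|   tape_total = library_cost + slots * tape_cost  # 820 EUR
|   disk_total = slots * 2.5 * disk_eur_per_tb     # 800 EUR, ~40 TB
|   print(tape_total, disk_total)                  # 820 800.0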
| adrian_b wrote:
| For newer tape formats the prices are similar or even a
| little lower.
|
| For example, at Amazon in Europe an LTO-7 cartridge is
| EUR 48 for 6 real TB (EUR 8 per real TB), while an LTO-8
| cartridge is EUR 84 for 12 real TB (EUR 7 per real TB).
|
| At Amazon USA I see now $48 for LTO-7 and $96 for LTO-8,
| i.e. $8 per TB if you buy 1 tape. For larger quantities,
| there are discounts.
|
| Older formats like LTO-6 have the advantage that you may
| find much cheaper tape drives, but you must handle more
| cartridges for a given data size.
|
| Currently the cheapest HDDs are the external USB drives
| with 5400 rpm, which are much slower, about 3 times
| slower than the tapes, but even those are many times more
| expensive than the tapes (e.g. $27 ... $30 per TB).
| Dylan16807 wrote:
| I don't disagree much overall, but in my experience cheap
| or on-sale drives can beat $15 per TB.
| jabroni_salad wrote:
| Tape is great for long term storage of archival records,
| but nobody wants to restore a live virtual machine from 7
| years ago.
|
| Depending on the business system you might want your server
| backed up anywhere from multiple times a day to just weekly,
| and if you are doing multiple times a day you do want it to be
| decently fast, or to have some extra capacity so as not to
| slow down normal operations while the backup runs.
| [deleted]
| santoshalper wrote:
| Especially in a world where so much data now "lives" in the
| cloud. Between my Dropbox, GitHub, Google Photos, etc., very
| little of my data lives only on a hard drive. The stuff that
| does lives on a Synology NAS and is mirrored to S3 Glacier
| weekly.
| wazoox wrote:
| But Dropbox, Github, Google photos etc rely on massive piles
| of hard drives.
| santoshalper wrote:
| Sure, but the comment was about personal storage. Fault
| tolerance at the edge is less important in a cloud world.
| sreeramb93 wrote:
| NAS mirrored to S3 Glacier. How much does it cost?
| xyzzy_plugh wrote:
| In my experience this is very cheap. I take it the parent
| is not retrieving from Glacier often/ever, which is where
| the significant costs go. It's a decent balance for
| disaster recovery.
|
| I sync my photos to S3 (a mix of raw and jpeg, sidecar
| rawtherapee files) across a few devices so Glacier is
| prohibitively expensive in this regard, but I still pay
| <$100 a year for more stuff than I could ever store
| locally.
| kstrauser wrote:
| I did the math, and for me Glacier is great for backups
| where homeowners insurance is likely to be involved in
| the restoral. It was ferociously expensive for anything
| less drastic.
| laurentb wrote:
| What would that process look like in practice, should you
| need to call the insurance guys? As in, would you claim
| the cost to retrieve the data, or...? (This question is
| general, regardless of the actual country.)
| kstrauser wrote:
| I honestly don't know. I've never had to use it.
| DougWebb wrote:
| I'm trying to figure out the costs. To back up my NAS at
| full capacity, I need 10TB of storage. Using S3 Glacier
| Deep Archive, that seems to cost $10/month per full
| backup image I keep. That's not bad.
|
| What's confusing is that the calculator has "Restore
| Requests" as the # of requests, "Data Retrievals" as
| TB/month, but there's also a "Data Transfer" section for
| the S3 calculator. If I add 1 restore request for 10TB of
| data (eg: restoring my full backup to the NAS), that adds
| about $26 for that month. Totally reasonable.
|
| However, if "Data Transfer" is relevant, and I can't tell
| if it is or isn't, uploading my backup data is free but
| retrieving 10TB would cost $922! Is that right?
|
| This is what has always deterred me from using AWS. It's
| so unclear what services and fees will apply to any given
| use case, and it seems like there's no way to know until
| Amazon decides that you've incurred them. At $10/month
| for storage and $26 if I need to restore, I can just set
| this up and I don't need to plan for disaster recovery
| expenses. But if it's going to cost me $922 to get my
| data back, I've got to figure out how to make sure my
| insurance is going to cover that. This isn't a no-brainer
| anymore. Also, what assurance do I have that the cost
| isn't going to be higher when I need the data, or that
| there won't be other fees tacked on that I've missed?
|
| [1] https://calculator.aws/#/createCalculator/S3
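|
| For reference, the ~$922 figure is consistent with S3's
| internet data-transfer-out pricing at the time (roughly
| $0.09/GB for the first 10 TB in a month; treat the rate as an
| assumption, not an AWS quote):
|
|   tb = 10
|   gb = tb * 1024
|   transfer_out_per_gb = 0.09
|   print(round(gb * transfer_out_per_gb, 2))   # ~921.6 USD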
| Dylan16807 wrote:
| > However, if "Data Transfer" is relevant, and I can't
| tell if it is or isn't, uploading my backup data is free
| but retrieving 10TB would cost $922! Is that right?
|
| That's right. AWS charges offensive prices for bandwidth.
|
| There are alternate methods to get data out for about
| half the price, or you can try your luck on using
| lightsail and if they don't decide it's a ToS violation
| you could get the transfer costs to around $50.
| seized wrote:
| Glacier pricing can be hard to grok...
|
| With Glacier as it is usually used, you don't read data
| directly from the Glacier storage; it has to be restored
| to S3, where you then access it. That is where the restore
| charges and delays come from, so you can pay a low rate for
| the bulk option that takes up to 24 hours to restore your
| data to S3. But the real cost is the bandwidth from S3
| back to your NAS/datacenter/etc., which brings it up to
| about $90 USD/TB.
|
| Other fees would include request pricing, some low amount
| per 1000 requests. So costs can go up a bit if you store
| 1 million small files to Glacier vs 1000 large files.
| There is also a tipping point (IIRC about 170KB) where it
| is cheaper to store small files on S3 than Glacier.
|
| Depending on your data and patterns it can be better to
| use Glacier as a second backup which is what I do. All my
| data is backed up to a Google Workspace as that is
| "unlimited" for now. The most important subset (a few TB)
| also goes to Glacier. Glacier is pay as you go; there
| isn't some "unlimited" or "5TB for life" type deal that
| can change. If Google Workspace ever becomes not
| "unlimited" or something happens to it, I have the most
| important data in Glacier, and it's data that I have no
| qualms paying >$1k to get back.
|
| But for me restoring from Glacier means that my NAS is
| dead (ZFS RAIDZ2 on good hardware) and Google Workspace
| has failed me at the same time.
| DougWebb wrote:
| Cool, thank you for the details. None of their marketing
| or FAQs for Glacier mention that getting the data back
| means going to S3 first and then paying S3's outgoing
| bandwidth costs. As deceptive as I expected.
|
| I'll check out Google Workspace; that sounds like the
| right level of kludge for me, since this is the first
| time I've ever bothered to try to setup off-site backups.
| I only started using RAID a couple of years ago.
| kstrauser wrote:
| Are you sure about the "restored to s3" bit? Their SDK
| seems to fetch directly from Glacier.
|
| Note that the official name is "S3 Glacier", so from
| AWS's public perspective, it _is_ S3.
| kstrauser wrote:
| The $922 sounds about right. That jibes with my
| estimates.
|
| There's another (unofficial!) calculator at
| http://liangzan.net/aws-glacier-calculator/ you can toy
| with.
| DougWebb wrote:
| Thanks, I'll check it out.
| seized wrote:
| Glacier is about $1USD/TB/month just for storing data. If
| you need to retrieve it ends up being about $90USD/TB, most
| of that is bandwidth charges.
| adrian_b wrote:
| That means that if you store the data for much more than
| half a year, Glacier becomes more expensive than
| storing on tapes.
|
| Of course, tapes require a tape drive, and its cost would
| require a lot of data to amortize, but at such a high cost
| of retrieval it would not take much data to equal the cost
| of a tape drive.
|
| Glacier is OK for a couple of TB, but for tens or
| hundreds it would not be suitable.
| Dylan16807 wrote:
| > Of course, tapes require a tape drive and its cost
| would require a lot of data to compensate the cost, but
| at a such high cost of retrieval it would not take much
| data to equal the cost of a tape drive.
|
| But the less you expect to use it, the less this matters.
|
| So I'd put the break-even point a bit higher. Tape is
| good for 100TB or more but for tens it's hard to justify
| a tape drive.
|
| Also it's important to remember to get those tapes
| offsite every week!
| santoshalper wrote:
| Few bucks a month. More if I needed to retrieve something.
| unethical_ban wrote:
| I don't have automated mirroring set up, but I have
| insight.
|
| I use a Windows free tool called FastGlacier. I set up an
| IAM user on my AWS account for my backups, and use those
| creds to login. Then it's drag and drop! You can even use
| FastGlacier to encrypt/decrypt on the fly as you upload and
| download.
|
| Glacier is cheap because the retrieval times are very slow
| - something like 1-12 hours depending on the tier.
|
| I have about 100GB of critical data. Personal documents,
| photos and some music I don't want to have to search for if
| the house burns down. It's something like a dollar a month.
| Less than a cup of coffee.
| thrtythreeforty wrote:
| Deep Archive is super cost effective, $1/TB/mo. For the
| house-burns-down scenario, I don't mind the 24hr
| retrieval time
| zwieback wrote:
| Yes, interesting thought. On my ride in to work I was actually
| thinking how our situation is exactly opposite: in our
| environment (R&D test and measurement and production
| automation) data is everything and never in the cloud so we
| don't get to benefit from all the cool stuff the kids are doing
| these days. Historical data can go in the cloud (as long as
| we're reasonably sure it's secure) but operational data from
| our test and production tooling (e.g. assembly lines and end-
| of-line audit tools) has to be right there with super short
| latency.
|
| So we're still very interested in things like hard disks, NVMe,
| etc.
| birdyrooster wrote:
| I know for a fact I don't need 40T of movies and tv, but that
| changes nothing
| NikolaNovak wrote:
| I think it's often true at business level (most departments in
| a large company don't need redundancy and uptime they think
| they do),
|
| and rarely if ever true at personal/family level (most people
| don't think their phone/tablet/laptop could lose their data;
| most people don't think of their data past photos they took
| this week - "Oh it's OK to lose... wait, my tax documents are
| gone? Photos of my baby from two years ago are gone? My poems
| are gone? My [.. etc] is gone???"?).
| tyingq wrote:
| _" most database-like applications do their redundancy
| themselves, at the application level, so that...storage
| replication in distributed storage...is not only useless, but
| actively undesirable"_
|
| I do believe that's been the author's experience. However, I
| think he may be unaware that's not everyone's...or even most
| people's, experience.
| [deleted]
| isotopp wrote:
| If you are not running a database at home, you have the
| database in a replication setup.
|
| That provides capacity, but also redundancy. Better redundancy
| than at the disk level - fewer resources are shared.
|
| https://blog.koehntopp.info/2021/03/24/a-lot-of-mysql.html Here
| is how we run our databases.
| tyingq wrote:
| _" If you are not running a database at home, you have the
| database in a replication setup."_
|
| That's one of the perceptions I'm saying isn't always true.
| Especially in big, non-tech companies that have a mish-mash
| of crazy legacy stuff. Traditional H/A disks and backups
| still dominate some spaces.
| Spooky23 wrote:
| Agreed. It depends on what you do.
|
| When I ran large-scale Exchange and database systems, we would
| always get into big fights with the storage people, who
| believed that SAN disk with performance tiering, replication,
| etc was the way to solve all problems.
|
| The problem at the time was that the fancy storage cost $40+
| per GB per month and performed like hot garbage because the
| data didn't tier well. For email, the "correct" answer in 2010
| was local disk or dumb SAS arrays with an Exchange DAG.
|
| For databases, the answer was "it depends". Often it made sense
| to use the SAN and replication at that layer.
| [deleted]
| walrus01 wrote:
| My biggest take away from this is "I certainly hope this person
| isn't responsible for actually designing the bare metal storage
| infrastructure meant to underlie something important". They seem
| to be operating from the premise that data replication at the
| 'cloud' level can solve all their problems.
| 411111111111111 wrote:
| I think you're giving him too little credit there.
|
| The parent's point is really spot on though: most websites
| aren't at a scale where every stateful API has redundant
| implementations. But the author's point does have merit:
| inevitably, something goes wrong in every system - and when it
| does, your system goes down if all you did was trust your HA
| config to work. If you actually did go for redundant
| implementations, your users likely aren't even gonna notice
| anything went wrong.
|
| It's however kinda unsustainable to maintain unless you have
| a massive development budget
| alisonkisk wrote:
| But why is his layer the only correct layer for redundancy?
|
| He's also doing wasted work if his redundancy has bugs that a
| consumer has to worry about working around.
| [deleted]
| smueller1234 wrote:
| I'm not the original author but I used to work with him
| and now do in storage infrastructure at Google. As others
| pointed out, what the author, Kris, writes kind of
| implies/requires a certain scale of infrastructure to
| make sense. Let me try to provide at least a little bit
| of context:
|
| The larger your infrastructure, the smaller the relative
| efficiency win that's worth pursuing (duh, I know,
| engineering time costs the same, but the absolute savings
| numbers from relative wins go up). That's why an approach
| along the lines of "redundancy at all levels" (raid +
| x-machine replication + x-geo replication etc) starts
| becoming increasingly worth streamlining.
|
| Another, separate consideration is types of failures you
| have to consider: an availability incident (temporary
| unavailability) vs. durability (permanent data loss). And
| then it's worth considering that in the limit of long
| durations, an availability incident will become the same
| as a durability incident. This is contextual: To pick an
| obvious/illustrative example, if your Snapchat messages
| are offline for 24h, you might as well have lost the data
| instead.
|
| Now, machines fail, of course. Doing physical maintenance
| (swapping disks) is going to take significant, human time
| scales. It's not generally tolerable for your data to be
| offline for that long. So local RAID barely helps at all.
| Instead, you're going to want to make sure your data
| stays available despite a certain rate of machine
| failure.
|
| You can now make similar considerations for different,
| larger domains. Network, power, building, city/location,
| etc. They have vastly different failure probabilities and
| also different failure modes (network devices failing is
| likely just an availability concern, while a mudslide into
| your DC is rather harder to recover from). Depending on your
| needs, you might accept some of these but not others.
|
| The most trivial way to deal with this is to simply make
| sure you have a replica of each chunk of data in multiple
| of each of these kinds of failure zones. A replica each
| on multiple machines (pick the amount of redundancy you
| need based on a statistical model from component failure
| rates), a replica each on machines under different
| network devices, on different power, in different
| geographies, etc.
|
| That's expensive. The next most efficient thing would be
| to use the same concept as RAID (erasure codes) and apply
| that across a wider scope. So you basically get RAID, but
| you use your clever model of failure zones for placement.
|
| This gets a bit complicated in practice. Most folks stick
| to replicas. (E.g., last I looked, HDFS/Hadoop only supported
| replication, but it did use knowledge of network topology for
| placing data.)
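|
| (To make the placement idea concrete, here's a toy sketch of
| failure-zone-aware replica placement - my own illustration,
| not HDFS's or anyone's actual placer; the machines and
| domains are invented:)
|
|     import itertools
|
|     # Each machine is tagged with the failure domains it sits in.
|     machines = [
|         {"name": "m1", "rack": "r1", "power": "p1", "site": "dc-a"},
|         {"name": "m2", "rack": "r1", "power": "p1", "site": "dc-a"},
|         {"name": "m3", "rack": "r2", "power": "p2", "site": "dc-a"},
|         {"name": "m4", "rack": "r3", "power": "p3", "site": "dc-b"},
|     ]
|
|     def place(machines, copies, domains=("rack", "power", "site")):
|         # Return machines such that no two replicas share any of
|         # the listed failure domains, or None if that's impossible.
|         for combo in itertools.combinations(machines, copies):
|             if all(len({m[d] for m in combo}) == copies
|                    for d in domains):
|                 return [m["name"] for m in combo]
|         return None
|
|     print(place(machines, 2))  # ['m1', 'm4']: different rack,
|                                # power and site
|     print(place(machines, 3))  # None: only two sites available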
|
| The reason why you don't want to do this in your
| application is because it's really kinda complicated.
| You're far more likely to have many applications than
| many storage technologies (or databases).
|
| Now, at some point of infrastructure size or complexity
| or team size it may make sense to separate your storage
| (the stuff I'm talking about) from your databases as
| well. But as Kris argues, many common databases can be
| made to handle some of these failure zones.
|
| In any case, that's the extremely long version of an
| answer to your question why you'd handle redundancy in
| this particular layer. The short answer is: Below this
| layer is too small a scope or with too little meta
| information. But doing it higher in the stack fails to
| exploit a prime opportunity to abstract away some really
| significant complexity. I think we all know how useful
| good encapsulation can be! You avoid doing this in
| multiple places simply because it's expensive.
|
| (Everything above is common knowledge among storage
| folks, nothing is Google specific or otherwise it has
| been covered in published articles. Alas, the way we
| would defend against bugs in storage is not public.
| Sorry.)
| 411111111111111 wrote:
| We're currently quite off-topic if I'm honest.
|
| I think the author was specifically talking about RAID-1
| redundancy and is advocating that you can leave your
| systems with RAID-0 (so no redundant drives in each
| server), as you're gonna need multiple nodes in your
| cluster anyway... so if any of your system's disks break,
| you can just let it go down and replace the disk while
| the node is offline.
|
| but despite being off-topic: redundant implementations are
| - from my experience - not used in a failover way. They're
| active at all times and load is spread if you can do
| that, so you'd likely find the inconsistencies in the
| integration test layer.
| rbanffy wrote:
| > not used in a failover way.
|
| Aurora works like that. Read replicas are on standby and
| become writable when the writable node dies or is
| replaced. They can be used, of course, but so can other
| standby streaming replicas.
| ahupp wrote:
| Not sure what to say, but this is how it works on all the
| large systems I'm familiar with.
|
| Imagine you have two servers, each with two 1TB disks (4TB
| physical storage). And you have two distinct services with
| 1TB datasets, and want some storage redundancy.
|
| One option is to put each pair of disks in a RAID-1
| configuration, and so each RAID instance holds one dataset.
| This protects against a disk failure, but not against server
| or network failures.
|
| Your other option is to put one copy of the dataset on each
| server. Now you are protected from the failure of any single
| disk, server, or the network to that server.
|
| In both cases you have 2TB of logical storage available.
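|
| (A toy way to see the difference for the server-failure case,
| with made-up names; illustration only:)
|
|     # Layout A: RAID-1 per server, one dataset per mirror pair.
|     raid1 = {"ds1": [("srv1", "d1"), ("srv1", "d2")],
|              "ds2": [("srv2", "d1"), ("srv2", "d2")]}
|     # Layout B: a plain copy of each dataset on each server.
|     spread = {"ds1": [("srv1", "d1"), ("srv2", "d1")],
|               "ds2": [("srv1", "d2"), ("srv2", "d2")]}
|
|     def survives(layout, dead_server):
|         # True if every dataset still has a copy elsewhere.
|         return all(any(srv != dead_server for srv, _ in copies)
|                    for copies in layout.values())
|
|     print(survives(raid1, "srv1"))   # False: ds1 only on srv1
|     print(survives(spread, "srv1"))  # True: ds1 also on srv2
|
| (Both layouts survive any single disk failure; only the spread
| layout also survives a whole server or its network link going
| away.)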
| wafflespotato wrote:
| You're putting yourself at risk of split-brain though (or of
| downtime due to failover, or the lack of it).
|
| In either case what you're describing isn't really the
| 'cloud' alternative.
| dan_quixote wrote:
| Yeah, that's a complication/cost of HA that a significant
| portion of industry has long accepted. Everywhere I've
| been in the last 5 years has had this assumption baked in
| across all distributed systems.
| nightfly wrote:
| Where possible, my organization tries to have services
| deployed in sets of three (and to require a quorum)
| to reduce/eliminate split-brain situations. And we're
| very small scale.
| [deleted]
| lurkerasdfh8 wrote:
| Even if the author's assumptions are true and valid, this is like
| saying mirroring RAID is a safe backup option :)
|
| It is not. If you have 10 DB hosts with the same brand of NVMe
| drives, and they fail under a workload, what good is it that
| you have 10 hot-hot failover hosts? You just bought yourself a
| few days, or hours if you are especially unlucky and using
| consumer-grade drives.
| tyingq wrote:
| Yep. Mirroring happily mirrors mistakes very well, and very
| quickly.
| alisonkisk wrote:
| Replication also replicates mistakes.
| isotopp wrote:
| Correct.
|
| For that you have time delayed replicas and of course
| backups. But time delayed replicas are usually much
| faster to restore than a backup.
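|
| (As a sketch of what a time-delayed replica looks like in
| practice - assuming MySQL 8 with GTIDs enabled, an existing
| replication user, and mysql-connector-python; host names and
| credentials here are placeholders:)
|
|     import mysql.connector
|
|     # Keep this replica one hour behind the source, so a bad
|     # DROP TABLE can be caught before it is applied here too.
|     conn = mysql.connector.connect(
|         host="replica-1", user="admin", password="secret")
|     cur = conn.cursor()
|     cur.execute("STOP REPLICA")
|     cur.execute(
|         "CHANGE REPLICATION SOURCE TO "
|         "SOURCE_HOST='primary-1', SOURCE_USER='repl', "
|         "SOURCE_PASSWORD='secret', SOURCE_AUTO_POSITION=1, "
|         "SOURCE_DELAY=3600")
|     cur.execute("START REPLICA")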
| kardos wrote:
| Not following the leaf-and-spine figure. 40x10G is 400G, but
| 4x40G is 160G, so how is this completely oversubscription free?
|
| Edit: followed the link, it is 2.5:1 oversubscribed
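|
| (The arithmetic, spelled out, with the port counts from the
| figure:)
|
|     downlink_gbps = 40 * 10  # 40 access ports at 10G = 400G
|     uplink_gbps = 4 * 40     # 4 uplink ports at 40G  = 160G
|     print(downlink_gbps / uplink_gbps)  # 2.5, i.e. 2.5:1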
| walrus01 wrote:
| it's worth noting that 40GbE is an absolute dead end in terms
| of technology, in the ISP world. Look at how many 40GbE members
| there are on many major IXes and other things. It's super cheap
| to buy an older Arista switch on ebay with N x 10GbE ports and
| a few 40GbE uplinks, because they're all getting pulled from
| service and sold off cheap. The upgrade path for 10GbE is
| 100GbE.
|
| major router and switch manufacturers aren't even making new
| line cards with multiple 40GbE ports on them anymore, they're
| tech from 6-7 years ago. You can buy either something with
| dense 10GbE SFP+ ports, or something with QSFP/QSFP28 or
| similar 100GbE ports.
|
| If you want something for a test/development lab, or are on a
| _very_ tight budget, sure.
| bluedino wrote:
| Meanwhile some budget homelabbers are jumping on the 2.5Gb
| bandwagon
| isotopp wrote:
| The image is from 2012 and describes a topology, but not a
| technology.
|
| Today you'd rather use Arista 7060, 7050CX or Juniper 5200 as
| ToR. You'd not build 1:1, but plan for it in terms of ports and
| cable space, then add capacity as needed.
|
| Almost nobody in an Enterprise environment actually needs 1:1,
| unlike hosters or hyperscalers renting nodes to others. Even
| then you'd probably be able to get away with a certain
| amount of oversubscription.
___________________________________________________________________
(page generated 2021-06-11 23:01 UTC)