[HN Gopher] Seagate Creates an NVMe Hard Disk Drive
___________________________________________________________________
Seagate Creates an NVMe Hard Disk Drive
Author : drewrem11
Score : 64 points
Date : 2021-11-13 12:56 UTC (1 day ago)
(HTM) web link (www.pcmag.com)
(TXT) w3m dump (www.pcmag.com)
| joenathanone wrote:
| >"Hence, using the faster NVME protocol may seem rather
| pointless."
|
| Isn't it the interface that is faster and not the protocol? PCIe
| vs SATA
|
| Edit: after reading more, this article is littered with
| inaccuracies
| wtallis wrote:
| It's both. Basic things like submitting a command to the drive
| require fewer round trips with NVMe than AHCI+SATA, allowing
| for lower latency and lower CPU overhead. But the raw
| throughput advantage of multiple lanes of PCIe each running at
| 8Gbps or higher compared to a single SATA link at 6Gbps is far
| more noticeable.
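|
| (To put rough numbers on that - a quick back-of-the-envelope in
| Python, using the published line rates and encodings; real-world
| figures land a bit lower once protocol overhead is counted:)
|
|   # Rough usable-throughput comparison: SATA III vs. a few PCIe
|   # gen3 link widths. Line rates and encodings are the published
|   # figures; treat the results as upper bounds.
|   def usable_gbps(line_rate_gbps, encoding_efficiency):
|       return line_rate_gbps * encoding_efficiency
|
|   sata3 = usable_gbps(6.0, 8 / 10)        # 8b/10b    -> ~4.8 Gbps (~600 MB/s)
|   pcie3_x1 = usable_gbps(8.0, 128 / 130)  # 128b/130b -> ~7.9 Gbps (~985 MB/s)
|   pcie3_x4 = 4 * pcie3_x1                 # ~31.5 Gbps (~3.9 GB/s)
|
|   for name, gbps in [("SATA III", sata3), ("PCIe 3.0 x1", pcie3_x1),
|                      ("PCIe 3.0 x4", pcie3_x4)]:
|       print(f"{name}: {gbps:.1f} Gbps usable (~{gbps / 8:.2f} GB/s)")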
| joenathanone wrote:
| I get that, but with NVMe being designed from the ground up
| specifically for SSDs, wouldn't using it for an HDD present
| extra overhead for the controller,
| negating any theoretical protocol advantages?
| wtallis wrote:
| NVMe as originally conceived was still based around the
| block storage abstraction implemented by hard drives. Any
| SSD you can buy at retail is still fundamentally emulating
| classic hard drive behavior, with some optional extra
| functionality to allow the host and drive to cooperate
| better (eg. Trim/Deallocate). But out of the box, you're
| still dealing with reading and writing to 512-byte LBAs, so
| there's not actually much that needs to be added back in to
| make NVMe work well for hard drives.
|
| The low-level advantages of NVMe 1.0 were mostly about
| reducing overhead and improving scalability in ways that
| were not strictly necessary when dealing with mechanical
| storage and were not possible without breaking
| compatibility with old storage interfaces. Nothing about
| eg. the command submission and completion queue structures
| inherently favors SSDs over hard drives, except that
| allowing multiple queues per drive each supporting queue
| lengths of hundreds or thousands of commands is a bit silly
| in the context of a single hard drive (because you never
| actually want the OS to enqueue 18 hours worth of IO at
| once).
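|
| (Back-of-the-envelope on the "18 hours" point - the 128 queues
| and ~120 random IOPS below are assumed round numbers for
| illustration, not any real drive's spec:)
|
|   # How long would a hard drive take to drain a fully loaded set of
|   # NVMe queues? Assume 128 I/O queues of 65,536 entries each (the
|   # spec allows up to 65,535 such queues) and ~120 random IOPS.
|   queues = 128
|   entries_per_queue = 65_536
|   random_iops = 120
|
|   outstanding = queues * entries_per_queue
|   hours = outstanding / random_iops / 3600
|   print(f"{outstanding:,} queued commands ~ {hours:.0f} hours of random IO")
|   # -> roughly 19 hours, hence the "silly for a single hard drive" point.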
| londons_explore wrote:
| > because you never actually want the OS to enqueue 18
| hours worth of IO at once
|
| As a thought experiment, I think there _are_ use cases for
| this kind of thing for a hard drive.
|
| The very nature of a hard drive is that sometimes
| accessing certain data happens to be very cheap - for
| example, if the head just happens to pass over a block of
| data on the way to another block of data I asked to read.
| In that case, the first read was 'free'.
|
| If the drive API could represent this, then very low
| priority operations, like reading and compressing dormant
| data, defragmentation, error checking existing data,
| rebuilding RAID arrays etc. might benefit from such a
| long queue. Pretty much, a super long queue of "read this
| data only if you can do so without delaying the actual
| high priority queue".
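|
| (A toy model of the idea - the integer "tracks" and the "free if
| it's on the way" rule are invented purely for illustration:)
|
|   # Serve a low-priority read only if the head sweeps over it on the
|   # way to the next high-priority target.
|   def pick_free_reads(head_pos, next_target, background_queue):
|       lo, hi = sorted((head_pos, next_target))
|       return [track for track in background_queue if lo <= track <= hi]
|
|   background = [50, 200, 450, 870, 990]   # dormant-data scrub requests
|   freebies = pick_free_reads(head_pos=100, next_target=900,
|                              background_queue=background)
|   print(freebies)   # [200, 450, 870] - reads picked up "for free"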
| wtallis wrote:
| When a drive only has one actuator for all of the heads,
| there's only a little bit of throughput to be gained from
| Native Command Queueing, and that only requires a dozen
| or so commands in the queue. What you're suggesting goes
| a little further than just plain NCQ, but I'd be
| surprised if it could yield more than another 5%
| throughput increase even in the absence of high-priority
| commands.
|
| But the big problem with having the drive's queue contain
| a full second or more worth of work (let alone the
| _hours_ possible with NVMe at hard drive speeds) is that
| you start needing the ability to cancel or re-order/re-
| prioritize commands that have already been sent to the
| drive, unless you're working in an environment with
| absolutely no QoS targets whatsoever. The drive is the
| right place for scheduling IO at the millisecond scale,
| but over longer time horizons it's better to leave things
| to the OS, which may be able to fulfill a request using a
| different drive in the array, or provide some
| feedback/backpressure to the application, or simply have
| more memory available for buffering and combining
| operations.
| [deleted]
| vmception wrote:
| What other connectors are coming down the pipeline or may
| currently be in draft specification phase?
|
| Admittedly, I totally did not know NVMe drives were becoming a thing
| until a year ago, as I had been in the laptop-only space for a
| while or didn't need to optimize storage speed when connecting an
| existing drive to a secondhand motherboard.
|
| I like being ahead of the curve and am now curious what's next
| ksec wrote:
| >I like being ahead of the curve and am now curious what's next
|
| Others correct me if I am wrong.
|
| NVMe in itself is an interface specification; people often use
| the term NVMe when they really mean M.2, the connector.
|
| You won't get a new connector in the pipeline. But M.2 is
| essentially just four lanes of PCI Express, so every PCIe
| update carries over: currently 4.0, with 5.0 around the corner,
| 6.0 in final draft and 7.0 possibly within the next 3-4 years.
| So you can expect 14GB/s SSDs soon, 28GB/s in ~2024 and 50GB/s
| within this decade, assuming we can somehow get SSD controller
| power usage down to a reasonable level.
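|
| (Roughly where those numbers come from - a sketch using the raw
| per-generation transfer rates; the 7.0 figure assumes the doubling
| cadence holds, and real SSDs land below these ceilings once
| encoding, protocol and controller overhead are counted:)
|
|   # Why an M.2 (x4) slot roughly doubles in bandwidth each PCIe
|   # generation. Transfer rates per the PCI-SIG roadmap.
|   transfer_rates_gt_s = {"3.0": 8, "4.0": 16, "5.0": 32,
|                          "6.0": 64, "7.0 (projected)": 128}
|
|   for gen, gt in transfer_rates_gt_s.items():
|       x4_gb_s = gt * 4 / 8    # one bit per transfer per lane, 8 bits/byte
|       print(f"PCIe {gen} x4: ~{x4_gb_s:.0f} GB/s raw")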
| vmception wrote:
| Hmm, insightful. Yes, I recall noticing that
| connector/specification distinction when I was trying to get the
| right size cards; I had figured NVMe and M.2 were just synonyms,
| but I see the cause for the correlation now.
|
| So the NVMe card I added to an old motherboard's PCI-E slot
| is really just PCI-E on PCI-E? yo dawg
| wtallis wrote:
| > So the NVMe card I added to an old motherboard's PCI-E
| slot is really just PCI-E on PCI-E?
|
| Assuming that's a PCIe to M.2 adapter card, it's just
| rearranging the wires to a more compact connector. There's
| no nesting or layering of any protocols. Electrically,
| nothing changed about how the PCIe signals are carried
| (though M.2 lacks the 12V power supply), and either way you
| have NVMe commands encapsulated in PCIe packets (exactly
| analogous to how your network may have IP packets
| encapsulated in Ethernet frames).
| wtallis wrote:
| In the server space, SAS, U.2 and U.3 connectors are
| mechanically compatible with each other and partially
| compatible with SATA connectors. U.3 is probably the dead end
| for that family, but they won't disappear completely for a long
| time.
|
| Traditional PCIe add-in cards (PCIe CEM connector) are still
| around and also not going to be disappearing anytime soon, but
| are in decline as many use cases have switched over to other
| connectors and form factors, particularly for the sake of
| better hot-swap support.
|
| M.2 (primarily SSDs) is in rapid decline in the server space.
| It may hang on for a while for boot drives, but for everything
| else you want hot-swap and better power delivery (12V rather
| than 3.3V).
|
| The up-and-coming connector is SFF-TA-1002, used in the EDSFF
| form factors and a few other applications like the latest OCP
| NIC form factor. Its smaller configurations are only a bit
| larger than M.2, and the wider versions are quite a bit denser
| than regular PCIe add-in card slots. EDSFF provides a variety
| of form factors suitable for 1U and 2U servers, replacing 2.5"
| drives. The SFF-TA-1002 connector can also be used as a direct
| replacement for the PCIe CEM connector, but I'm not sure if
| that's actually going to happen anytime soon.
|
| I haven't seen any sign that EDSFF or SFF-TA-1002 will be
| showing up in consumer systems. Existing solutions like M.2 and
| PCIe CEM are good enough for now and the foreseeable future.
| The older connectors sometimes need to have tolerances
| tightened up to support higher signal rates, but so far a
| backwards-incompatible redesign hasn't been necessary to
| support newer generations of PCIe (though the practical
| distances usable without retimers have been decreasing).
| h2odragon wrote:
| Disks have had complex CPUs on them for a while; might as well go
| full mainframe, admit they're smart storage subsystems and put
| them on the first-class bus. Is "DASD" still an IBM trademark?
|
| Of course there's a long history of "multiple interface" drives,
| which are always ugly hacks that turn up as rare collectors' items
| and examples of boondoggles.
| DaiPlusPlus wrote:
| > and admit they're smart storage subsystems and put them on
| the first class bus. is "DASD" still an IBM trademark?
|
| Y'know that eventually we'll be running everything off Intel's
| Optane non-volatile RAM: we won't have block-addressable
| storage anymore, _everything_ will be directly byte-
| addressable. All of the storage abstractions that have popped
| up over the past decades (tracks, heads, cylinders, ew, block-
| addressing, unnecessarily large block sizes, etc.) will be
| obsolete because we'll already have _perfect-storage_.
|
| It's not quite DASD, but it's much better.
| trasz wrote:
| Optane didn't exactly take the market by storm. Also: flash
| memory doesn't work this way; it's inherently organized into
| blocks/pages.
| DaiPlusPlus wrote:
| The "Optane" that Intel released as their 3DXPoint NVMe
| brand, (and quietly withdrew it recently) isn't the same
| Optane as their byte-addressable non-volatile storage+RAM
| combo. It isn't Flash memory with blocks/pages, it really
| is byte-addressable:
| https://dl.acm.org/doi/10.1145/3357526.3357568
|
| "true" Optane hasn't taken over the scene because, to my
| knowledge, there's no commercially supported OS that's
| built around a unified memory model (heck, why not have a
| single memory-space?) for both storage and memory.
|
| We can't even write software that can just reanimate itself
| from a process image that would be automagically persisted
| in the unified storage model. We've got a long way to go
| before we'll see operating systems and applications that
| take advantage of it.
| spijdar wrote:
| But RAM is itself accessed in blocks. The process is
| hidden from software, but memory is always fetched in word-
| aligned blocks. It doesn't contradict your point, but
| just pointing out that even DRAM is pulled in chunks not
| unlike traditional drives (if you squint)
|
| (Of course, getting those chunks down to cache line sizes
| does open up a lot of possibilities...)
| trasz wrote:
| Well, yeah, the cacheline can be considered a kind of
| 64-byte block. But that blocking isn't inherent to how RAM
| works - you could access DRAM in words if you wished to;
| it's just that it doesn't make sense because of the CPU
| cache. For flash, the blocks (and pages) are
| inherent to its design and there is no way around it.
|
| Also, RAM "block size" is 64B, while for flash it's more
| like 4kB. CPU cache will "deblock" the 64B blocks, but it
| can't efficiently do it for 4kB ones.
|
| And then there's the speed. Does replacing PCIe with a
| memory bus actually make a performance difference that's
| measurable given flash latency?
| rlkf wrote:
| > commercially supported OS that's built around a unified
| memory model
|
| Doesn't OS/400 work that way? (Of course then there is
| the question to which degree "commercially" should imply
| "readily open to third-party software and hardware
| vendors")
| t0mas88 wrote:
| NVMe also does away with the controller knowing about the disk
| layout and addressing, which may make sense for future disks and
| ever-increasing cache sizes. At some point you probably want to
| put all the logic in the drive itself to optimise it (as SSDs
| already do)
| trasz wrote:
| SCSI already got rid of knowledge of disk layout/addressing
| some 30 years ago.
| wmf wrote:
| Hasn't LBA been around since 1990 or so?
| wtallis wrote:
| Even the old cylinder/head/sector addressing almost never
| matched the real hard drive geometry, because the floppy-
| oriented BIOS CHS routines used a different number of bits
| for each address component than ATA did, so translation was
| required even for hard drives with capacities in the hundreds
| of MB.
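|
| (The translation itself is simple arithmetic; the geometry below
| is the classic translated BIOS limit, not any particular drive's
| real layout:)
|
|   # Classic CHS -> LBA translation; sectors are numbered from 1.
|   def chs_to_lba(c, h, s, heads_per_cyl, sectors_per_track):
|       return (c * heads_per_cyl + h) * sectors_per_track + (s - 1)
|
|   # Typical translated BIOS geometry: 1024 cylinders, 255 heads,
|   # 63 sectors per track.
|   print(chs_to_lba(0, 0, 1, 255, 63))        # 0, first sector of the disk
|   print(chs_to_lba(1023, 254, 63, 255, 63))  # 16,450,559 -> the old ~8 GB limit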
| oneplane wrote:
| Indeed, ideally the controller or HBA should really just
| provide the fabric and nothing more. A bit like Thunderbolt and
| USB4.
| oneplane wrote:
| Like the article explains, it makes sense if you can do this and
| reduce the number of now-'legacy' interfaces.
|
| We used to do IDE emulation over SATA for a while, then we got
| AHCI over SATA and AHCI over other fabrics. Makes sense to stop
| carrying all the legacy loads all the time. For people that
| really need it for compatibility reasons we still have the
| vortex86 style solutions that generally fit the bill as well as
| integrated controllers that do PCIe-to-PCI bridging with a
| classic IDE controller attached to that converted PCI bus.
| Options will stay, but cutting legacy by default makes sense to
| me. Except UART of course, that can stay forever.
|
| Edit: I stand corrected, AHCI (like the name implies: Advanced
| Host Controller Interface) is for the communication up to the
| controller. Essentially the ATA commands continue to be sent to
| the drive, but the difference is between commanding the controller
| over IDE or AHCI modes. This is also why the controller then
| needs to know about the drive, whereas an NVMe controller doesn't
| need to know about the drive, because the ATA protocol no longer
| needs to be understood by the controller (as posted here elsewhere).
| masklinn wrote:
| > then we got AHCI over SATA
|
| AHCI is the native SATA mode; AFAIK explicit mentions of AHCI
| mostly concern non-SATA interfaces (usually M.2, because an
| M.2 SSD can be SATA, AHCI over PCIe, or NVMe).
| wtallis wrote:
| AHCI is the driver protocol used for communication between
| the CPU/OS and the SATA host bus adapter. It stops there, and
| nothing traveling over the SATA cable can be correctly called
| AHCI.
|
| You can have fully standard SATA communication happening
| between a SATA drive and a SATA-compatible SAS HBA that uses
| a proprietary non-AHCI driver interface, and from the drive's
| end of the SATA cable this situation is completely
| indistinguishable from using a normal SATA HBA.
|
| Likewise, you can have AHCI communication to a PCIe SSD and
| the OS will _think_ it's talking to a single-port SATA HBA,
| with the peculiarity that sustained transfers in excess of
| 6Gbps are possible.
| AussieWog93 wrote:
| >It stops there, and nothing traveling over the SATA cable
| can be correctly called AHCI.
|
| Informally, a lot of BIOSes (another informality there!)
| would give you the choice back in the day between IDE and
| AHCI when it came to communicating with the drive.
|
| I think this is where most of the confusion came from.
| wtallis wrote:
| That choice was largely about which drivers your OS
| included. The IDE compatibility mode meant you could use
| an older OS that didn't include an AHCI driver. The
| toggle didn't actually change anything about the
| communication between the chipset and the drive, but
| _did_ change how the OS saw the storage controller built
| into the chipset on the motherboard.
|
| (Later, a third option appeared for proprietary RAID
| modes that required vendor-specific drivers, and that
| eventually led to more insanity and user
| misunderstandings when NVMe came onto the scene.)
| KingMachiavelli wrote:
| Would this reduce the need to have expensive and power-hungry
| RAID/HBA cards? I would assume splitting NVMe/PCIe is a lot
| simpler than PCIe to SATA.
| toast0 wrote:
| Seems like it. The article mentions a PCIe switch, but PCIe
| bifurcation may also be an option. (That's splitting a multi-lane
| slot into multiple slots; it requires system firmware support,
| though.)
| formerly_proven wrote:
| Bifurcation has never been a thing on desktop platforms and
| even most entry-level (single socket, desktop-equivalent)
| servers don't support it. It seems to be reserved for HEDT
| and real server platforms. (This is of course purely a market
| segmentation decision by Intel/AMD).
| wtallis wrote:
| PCIe bifurcation works fine on AMD consumer platforms. I've
| used a passive quad-M.2 riser card on an AMD B550
| motherboard with no trouble other than changing a firmware
| setting to turn on bifurcation. It's only Intel that is
| strict about this aspect of product segmentation.
| toast0 wrote:
| My A520 mini-ITX board supports it; can't get any more
| desktop than that. Although it has limited options: I
| think I can do either two x8, or one x8 and two x4. For this,
| it looks like each drive is expected to be x1, so you'd
| want one x16 to 16 x1s. It's doable, but not without
| mucking about in the firmware (either by the OEM, or
| dedicated enthusiasts), so a PCIe switch is probably
| advisable.
| formerly_proven wrote:
| Ah I see. It seems like previously bifurcation was not
| qualified for anything but the X-series chipset, but in
| the 500 series it's qualified for all. On top of that, it
| seems like some boards just allowed it regardless in
| prior generations.
|
| Another complication is of course that the non-PEG slots
| on the (non-HEDT) platforms are usually electrically only
| x4 or x1, so bifurcation really only makes sense in the
| PEG.
| wtallis wrote:
| PCIe root ports in CPUs are generally designed to provide
| an x16 with bifurcation down to x4x4x4x4 (or merely
| x8x4x4 for Intel consumer CPUs). Large-scale PCIe
| switches also commonly support bifurcation only down to
| x4 or sometimes x2, though x1 may start catching on with
| PCIe gen5.
|
| Smaller PCIe switches and motherboard chipsets usually
| support link widths from x4 down to x1. Treating each
| lane individually goes hand in hand with the fact that
| many of the lanes provided by a motherboard chipset can
| be reconfigured between some combination of PCIe, SATA
| and USB: they design a multi-purpose PHY, put down one or
| two dozen copies of it at the perimeter of the die, and
| connect them to an appropriate variety of MACs.
| formerly_proven wrote:
| Yeah, but what I meant above was that if you only have an
| x4 slot electrically, sticking in an x16 -> 4x M.2 riser
| isn't going to do a whole lot, because the 12 lanes of 3
| out of 4 slots aren't hooked up to anything. So in this
| scenario you'd really want a riser with a switch in it
| instead (which are more expensive than almost all
| motherboards).
|
| So on the consumer platforms that give you two PEGs, the best
| you could do while still having a GPU is to stick that riser
| in the second PEG and use the x8/x8 split. Now the
| question becomes whether the UEFI allows you to use the
| x8/x8 bifurcation meant for dual GPU or similar use in an
| x8/(x4+x4) triple bifurcation kind of setup.
|
| Realistically this entire thing just doesn't make a lot
| of sense on the consumer platforms because they just
| don't have enough PCIe lanes out of the CPU. Intel used
| to be slightly worse here with 20 (of which 4 are
| reserved for the PCH but you know that), while AM4 has 28
| (4 for the I/O hub again). On an HEDT platform with 40+
| lanes though...
|
| (When I say bifurcation I mean bifurcation of a slot on
| the mainboard, not the various ways the slots and ports
| on the board itself can be configured, though that's
| technically bifurcation as well (or even switching
| protocols).)
| wtallis wrote:
| > Yeah, but what I meant above was that if you only have
| an x4 slot electrically, sticking in an x16 -> 4x M.2
| riser isn't going to do a whole lot, because the 12 lanes
| of 3 out of 4 slots aren't hooked up to anything. So in
| this scenario you'd really want a riser with a switch in
| it instead (which are more expensive than almost all
| motherboards).
|
| True; but given how PCIe speeds are no longer stalled, we
| may soon see motherboards offering an x4 slot that can be
| operated as x1x1x1x1. Currently, the only risers you're
| likely to find that split a slot into independent x1
| ports are intended for crypto mining, and they require
| switches. A passive (or retimer-only) quad-M.2 riser that
| only provides one PCIe lane per drive currently sounds a
| bit bandwidth-starved and wouldn't work with current
| motherboards. But given PCIe gen5 SSDs or widespread
| availability of PCIe-native hard drives, those uses for
| an x4 slot will start to make sense.
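|
| (Back-of-the-envelope for why x1 per hard drive is plenty - the
| ~280 MB/s sustained rate is an assumed figure for a current
| high-capacity drive, not a quoted spec:)
|
|   # One PCIe lane vs. one hard drive: even a single gen3 lane has headroom.
|   hdd_sustained_mb_s = 280                   # assumed outer-track rate
|   pcie3_x1_mb_s = 8_000 * (128 / 130) / 8    # ~985 MB/s usable
|   print(f"PCIe 3.0 x1 ~ {pcie3_x1_mb_s:.0f} MB/s, "
|         f"about {pcie3_x1_mb_s / hdd_sustained_mb_s:.1f}x one drive's rate")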
| wtallis wrote:
| PCIe switches are a lot simpler and more standardized than RAID
| and HBA controllers, but I'm not sure they're any cheaper for
| similar bandwidth and port counts. Broadcom/Avago/PLX and
| Microchip/Microsemi are the only two vendors for large-scale
| current generation PCIe switches, and starting with PCIe gen3
| they decided to price them _way_ out of the consumer market,
| contributing to the disappearance of multi-GPU from gaming PCs.
| inetknght wrote:
| Does this have the same reliability as the rest of Seagate's
| lineup? Or is this actually something that isn't a ripoff?
| wmf wrote:
| Obligatory reminder that there are only three hard disk vendors
| and all of them have made bad drives at one time.
| thijsvandien wrote:
| To be specific: Seagate, Toshiba and Western Digital.
| mastax wrote:
| Could you use multiple NVMe namespaces to represent the separate
| actuators in a multiple-actuator drive? Would there be a benefit?
| Do different namespaces get separate command queues or whatever?
| wtallis wrote:
| NVMe supports multiple queues even for a single namespace, and
| multiple queues are used more for efficiency on the software
| side (one queue per CPU core) than for exposing the parallelism
| of the storage hardware.
|
| There are several NVMe features intended to expose some
| information about the underlying segmentation and allocation of
| the storage media, for the sake of QoS. Since current multi-
| actuator drives are merely multiple actuators per spindle but
| still only one head per platter surface, this split could be exposed as
| separate namespaces or separate NVM sets or separate endurance
| groups. If we ever see multiple heads per platter come back (a
| la Conner Chinook), that would be best abstracted with multiple
| queues.
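|
| (A sketch of what a namespace-per-actuator split might look like
| from the host's side - the namespace IDs and LBA counts here are
| invented for illustration, not taken from any shipping drive:)
|
|   # A dual-actuator drive exposed as two NVMe namespaces, one per
|   # actuator, so the host can schedule each half independently.
|   from dataclasses import dataclass
|
|   @dataclass
|   class Namespace:
|       nsid: int
|       lba_count: int   # 512-byte LBAs
|       actuator: int    # which head-stack half serves this range
|
|   dual_actuator_drive = [
|       Namespace(nsid=1, lba_count=19_532_873_728, actuator=0),  # ~10 TB
|       Namespace(nsid=2, lba_count=19_532_873_728, actuator=1),  # ~10 TB
|   ]
|
|   for ns in dual_actuator_drive:
|       tb = ns.lba_count * 512 / 1e12
|       print(f"nsid {ns.nsid}: {tb:.1f} TB on actuator {ns.actuator}")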
___________________________________________________________________
(page generated 2021-11-14 23:00 UTC)