[HN Gopher] The Next Backblaze Storage Pod
___________________________________________________________________
The Next Backblaze Storage Pod
Author : TangerineDream
Score : 189 points
Date : 2021-06-24 16:11 UTC (6 hours ago)
(HTM) web link (www.backblaze.com)
(TXT) w3m dump (www.backblaze.com)
| [deleted]
| dragontamer wrote:
 | I have to imagine that by making Storage Pods 1.0 through 6.0,
 | they maybe "encouraged" Dell (and other manufacturers) to see
 | this particular 60+ hard drive server as a good idea.
|
| And now that multiple "storage pod-like" systems exist in the
| marketplace (not just Dell, but also Supermicro) selling 60-bay
| or 90-bay 3.5" Hard drive storage servers in 4U rack form
| factors, there's not much reason to build their own?
|
 | At least, that's my assumption. After all, if the server chassis
 | is a commodity now (and it absolutely is), there's no point making
 | custom small runs for a hypothetical Storage Pod 7. Economies of
 | scale are too big a benefit (worst case scenario: it's now Dell's
 | or Supermicro's problem rather than Backblaze's).
|
| EDIT: I admit that I don't really work in IT, I'm just a
| programmer. So I don't really know how popular 4U / ~60 HDD
| servers were before Backblaze Storage Pod 1.0
| bluedino wrote:
 | You'd think they could build an ARM-powered, credit card sized
| controller for them with a disk breakout card and network IO.
| PC motherboard and full-sized cards seem like overkill.
| francoisfeugeas wrote:
| A French SDS company, now owned by OVH, did exactly that a
 | few years ago: https://fr.slideshare.net/openio/openio-
| serverless-storage-7...
|
| I don't think they actually sold any.
| yabones wrote:
| I have seen some people build Ceph clusters using the HC2
| board [1] before. I'm not sure what the performance is
| like, but it seems like a neat way to scale out storage.
| The only real shortcoming is that there's a single NIC...
| If there were two, you could use an HA stack for your
| network and have a very robust system for very cheap.
|
 | [1] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
| dragontamer wrote:
| They're running a fair bit of math (probably Reed Solomon
| matrix multiplications for error correction) over all the
| data accesses.
|
| Given the bandwidth of 60+ hard drives (150MB/s per hard
| drive x 60 == 9GB/s in/out), I'm pretty sure you need a
| decent CPU just to handle the PCIe traffic. At least PCIe 3.0
| x16, just for the hard drives. And then another x16 for
| network connections (multiple PHY for Fiber in/out that can
| handle that 9GB/s to a variety of switches).
|
| We're looking at PCIe 3.0 x32 just for HDDs and Networking.
| Throw down a NVMe-cache or other stuff and I'm not seeing any
| kind of small system working out here.
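 |
 | (A rough back-of-envelope version of that lane math, in Python.
 | The ~0.985 GB/s of usable bandwidth per PCIe 3.0 lane per
 | direction is an approximation, and the 150MB/s per drive is the
 | figure assumed above, not a measured number:)
 |
 |   # back-of-envelope: aggregate HDD throughput vs. PCIe 3.0 lanes
 |   drives = 60
 |   gb_s_per_drive = 0.150                   # ~150 MB/s sequential per HDD
 |   hdd_total_gb_s = drives * gb_s_per_drive         # ~9 GB/s in or out
 |
 |   lane_gb_s = 0.985                        # usable per PCIe 3.0 lane, per direction
 |   lanes_for_hdds = hdd_total_gb_s / lane_gb_s      # ~9.1 lanes
 |   lanes_for_network = lanes_for_hdds               # same traffic again out the NICs
 |
 |   print(f"{hdd_total_gb_s:.1f} GB/s, "
 |         f"~{lanes_for_hdds + lanes_for_network:.0f} lanes round-trip")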
|
| ---------
|
| Then the math comes in: matrix multiplications over every bit
| of data to verify checksums and reed-solomon error correction
| starts to get expensive. Maybe if you had an FPGA or some
| kind of specialist DSP (lol GPUs maybe, since they're good at
| matrix multiplication), you can handle the bandwidth. But it
| seems nontrivial to me.
|
| Server CPU seems to be a cheap and simple answer: get the
| large number of PCIe I/O lanes plus a beefy CPU to handle the
| calculations. Maybe a cheap CPU with many I/O lanes going to
| a GPU / FPGA / ASIC for the error checking math, but...
| specialized chips cost money. I don't think a cheap low-power
| CPU would be powerful enough to perform real-time error
| correction calculations over 9GBps of data.
|
| --------
|
| We can leave Backblaze's workload and think about typical SAN
| or NAS workloads too. More I/O is needed if you add NVMe
 | storage to cache hard drive reads/writes, and tons of RAM is
 | needed if you plan to dedup.
| gpm wrote:
| I'm not familiar with the algorithms, but matrix
| multiplication sounds well suited towards GPUs. I wonder if
| you could get away with a much cheaper CPU and a cheaper
| GPU for less cost?
| dragontamer wrote:
| But the main issue with GPUs (or FPGAs / ASICs) is now
| you need to send 9GBps to some other chip AND back again.
|
| Which means 9GBps downstream (to be processed by the GPU)
| + 9GBps upstream (GPU is done with the data), or a total
| bandwidth of 18GBps aggregate to the GPU / FPGA / ASIC /
| whatever coprocessor you're using.
|
| So that's what? Another 32x lanes of PCIe 3.0? Maybe a
| 16x PCIE 4.0 GPU can handle that kind of I/O... but you
| can see that moving all this data around is non-trivial,
| even if we assume the math is instantaneous.
|
| ---------
|
| Practically speaking, it seems like any CPU with enough
| PCIe bandwidth to handle this traffic is a CPU beefy
| enough to seemingly run the math.
| Dylan16807 wrote:
| PCIE 3.0 is 1GB/s per lane _in each direction_. A 3.0 8x
| link would do a good job of saturating the drives. And
| basically any CPU could run 8x to the storage controllers
| and 8x to a GPU. Get any Ryzen chip and you can run 4
| lanes directly to a network card too.
| nine_k wrote:
| If only the ASIC on the HDD could run these computations
| and correct bit errors right during data transfers!
| dragontamer wrote:
| The HDD ASIC certainly is doing those computations.
|
| The issue is that Backblaze has a 2nd layer of error
| correction codes. This 2nd layer of error correction
| codes needs to be calculated somewhere. If enough errors
| come from a drive, the administrators take down the box
| and replace the hard-drives and resilver the data.
|
| Backblaze physically distributes the data over 20
| separate computers in 20 separate racks. Some computer
| needs to run the math to "Combine" the data (error
| correction + checksums and all) back into the original
 | data on every single read. No single hard drive can do
 | this math on its own, because the data has been dispersed
 | across so many different computers.
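 |
 | (A minimal Python sketch of that shard-and-reconstruct flow.
 | Real Reed-Solomon coding -- Backblaze has elsewhere described
 | something like 17 data + 3 parity shards spread across 20 pods
 | -- tolerates several missing shards; this toy version uses a
 | single XOR parity shard and tolerates only one, but the read
 | path has the same shape: fetch the surviving shards, recompute
 | whatever is missing:)
 |
 |   def xor_bytes(a: bytes, b: bytes) -> bytes:
 |       return bytes(x ^ y for x, y in zip(a, b))
 |
 |   def encode(data: bytes, k: int) -> list:
 |       """Split data into k equal shards plus one XOR parity shard."""
 |       shard_len = -(-len(data) // k)                 # ceil(len/k)
 |       padded = data.ljust(shard_len * k, b"\0")
 |       shards = [padded[i*shard_len:(i+1)*shard_len] for i in range(k)]
 |       parity = shards[0]
 |       for s in shards[1:]:
 |           parity = xor_bytes(parity, s)
 |       return shards + [parity]                       # k data + 1 parity
 |
 |   def reconstruct(shards: list, missing: int) -> bytes:
 |       """Rebuild one missing shard by XORing all the survivors."""
 |       out = None
 |       for i, s in enumerate(shards):
 |           if i == missing:
 |               continue
 |           out = s if out is None else xor_bytes(out, s)
 |       return out
 |
 |   shards = encode(b"an object spread over many pods", k=4)
 |   lost_index, lost_shard = 2, shards[2]
 |   shards[lost_index] = None                          # a pod drops out
 |   assert reconstruct(shards, missing=lost_index) == lost_shard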
| dwild wrote:
| > Given the bandwidth of 60+ hard drives (150MB/s per hard
| drive x 60 == 9GB/s in/out)
|
| Given their scale and goal, it would be pretty wasteful to
| build it to max the writing speed of all hard drives.
 | Considering you rarely write to a given pod, you would be
 | better off getting a fraction of that speed and writing to
 | multiple pods at the same time to get the required peak
 | performance.
|
 | In fact, it actually makes much more sense to put that
 | math on some ingest server, and these hard drive servers
 | would simply write the resulting data. It makes it much
 | easier and faster to divide it over 20 pods like they
 | currently do.
| bluedino wrote:
 | Definitely limited by the 1Gbps or even 10Gbps network
 | connection.
| e12e wrote:
| Any pod like this would normally have at least 1x40gbps
| uplink minimum?
|
 | Like most blade setups (random example): https://www.storagereview.com/review/supermicro-x11-microbla...
| dragontamer wrote:
| Storage Pod 6.0 seems to be 2x10Gbps Ethernet:
| https://www.backblaze.com/blog/open-source-data-storage-
| serv...
| nine_k wrote:
| Read load alone can be pretty high.
|
| And no, you want to calculate checksums and fix bit
| errors right here in the RAM buffers you just read or
| received, because at such scales hardware is not error-
| free.
| scottlamb wrote:
| > They're running a fair bit of math (probably Reed Solomon
| matrix multiplications for error correction) over all the
| data accesses.
|
| Do those run on this machine? I imagine backblaze has
| redundancy at the cluster level rather than machine level.
| That allows them to lose a single machine without any data
| becoming unavailable. It also means we shouldn't assume the
| erasure code calculations happen on a machine with 60
| drives attached. That's still possible but alternatively
 | the client [1] could do those calculations and the drive
 | machines could simply read and write raw chunks. This
| can mean less network bandwidth [2] and better load
| balancing (heavier calculations done further from the
| stateful component).
|
| [1] Meaning a machine handling a user-facing request or re-
| replication after drive/machine loss.
|
| [2] Assume data is divided into slices that are
| reconstructed from N/M chunks, such that each chunk is
| smaller than its slice. [3] On read, the client-side
| erasure code design means N chunk transfers from drive
| machine to client. If instead the client queries one of the
| relevant drive machines, that machine has to receive N-1
| chunks from the others and send back a full slice. (Similar
| for writes.) More network traffic on the drive machine and
| across the network in total, less on the client.
|
| [3] This assumption might not make sense if they care more
| about minimizing seeks on read than minimizing bytes
| stored. Then they might have at least one full copy that
| doesn't require accessing the others.
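 |
 | (Hypothetical numbers to make footnote [2] concrete -- the 17
 | data chunks and the 10 MiB slice size are assumptions, not
 | anything from the article:)
 |
 |   slice_bytes = 10 * 2**20          # one 10 MiB slice (assumed)
 |   n_data = 17                       # data chunks per slice (assumed)
 |   chunk_bytes = slice_bytes / n_data
 |
 |   # Client-side decode: client pulls N chunks straight from the
 |   # drive machines and reconstructs the slice itself.
 |   client_side = n_data * chunk_bytes                    # == slice_bytes
 |
 |   # Drive-machine-side decode: one drive machine gathers N-1
 |   # chunks from its peers, then ships the whole slice onward.
 |   drive_side = (n_data - 1) * chunk_bytes + slice_bytes
 |
 |   print(client_side / 2**20, drive_side / 2**20)        # ~10.0 vs ~19.4 MiB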
| morei wrote:
| That's not really how it's done.
|
 | RS is normally used as an erasure code: It's used when writing
| (to compute code blocks), and when reading _only when data
| is missing_. Checksums are used to detect corrupt data,
| which is then treated as missing and RS used to reconstruct
| it. Using RS to detect/correct corrupt data is very
| inefficient.
|
| Checksums are also normally free (CRC + memcpy on most
| modern CPUs runs in the same time that memcpy does: it's
| entirely memory bound).
|
| The generation of code blocks is also fairly cheap:
| Certainly no large matrix multiplications! This is because
| the erasure code generally only spans a small number of
| blocks (e.g. 10 data blocks), so every code byte is only
| dependent on 10 data bytes. The math for this is reasonably
| simple, and further simplified with some reasonable sized
| look-up tables.
|
| That's not to say that there is no CPU needed, but it's
| really not all that much, certainly nothing that needs
| acceleration support.
| bluedino wrote:
| Xeon E5-1620 last I saw
| pram wrote:
| They existed for other applications, like NetApp and the ZFS
| Appliance. Long, long before 2009.
| [deleted]
| walrus01 wrote:
| 3U and 4U x86 whitebox servers designed for any standard
| 12"x13" motherboard, where the entire front panel was hotswap
| 3.5" HDD bays were already a thing many, many years before
| backblaze existed.
|
| What wasn't really a thing was servers with hotswap HDD trays
| on both ends (like the supermicros) and things that were
| designed with vertical hard drives dropped down from a top-
| opening lid to achieve even higher density.
| briffle wrote:
| The Sun X4500 "thumper" server had 48 drives, if I remember
| correctly, and came out in 2006ish.
|
| It had hot-swap SATA disks (up to 512GB disks initially!) and
| was actually pretty cool and forward thinking
|
| https://web.archive.org/web/20061128164442/http://www.sun.co.
| ..
| zrail wrote:
| IIRC the company I was at at the time had a set of thumpers
| for something. Maybe a SAN?
|
| They were incredibly cool for the time.
| notyourday wrote:
| That's not the backblaze design. The backblaze design is that
| the drives are individually hot-swappable without a tray. 60
| commodity SATA drives that can be removed and serviced
| individually while a 4U server continues to operate normally
| is pretty amazing.
| dangerboysteve wrote:
| The company which manufactures the metal cases created a
| spinoff company called 45Drives which sells commercially
| supported pods.
| notyourday wrote:
| They are fantastic.
| Wassight wrote:
| Just a reminder that backblaze uses dark patterns for their
| account cancellations. You'll never be able to use all of the
| time you pay for.
| foodstances wrote:
| Did the surge in Chia mining affect global hard drive prices at
| all?
| that_lurker wrote:
| Not yet, but when Chia pools become available the mining will
 | take off and HDD prices will most likely rise.
| richwater wrote:
| Chia is a literal scam.
|
| There's a massive amount of premined chia controlled by the
| "chia strategic reserve".
|
| It will take a decade for the amount of mined value to equal
| the pre-mined value.
| josefresco wrote:
| I looked into Chia mining as a hobby and was directed to
| Burstcoin. I don't know much about either, but Burstcoin
| advocates claim it's the "better" PoC.
|
 | Note: I went to double-check something on the Burstcoin
 | website and realized that today, June 24th, they changed
 | their name to Signum - https://www.burst-coin.org/
| d33lio wrote:
| Not for hyperscalers like BackBlaze. They have contracts with
| specific purchase quotas and guaranteed price deltas. Chia has
| certainly affected prices on the secondary markets, there
| hasn't been a better time in the past decade to be a secondary
| server "junk" seller on eBay! NetApp 4246 JBOD's are going for
| $1000! Absolutely insane!
| sliken wrote:
| I've been watching drive prices, it's hard to say exactly why,
| but around mid April disk prices at Newegg and amazon jumped
| significantly. One drive that had been $300, jumped to $400,
 | $500, and even spiked to $800 for a bit. By June 1st it had
 | dropped to $550, and only this week has it dropped to $400.
 | Still above the original $300, but at least the premium is not
 | terribly painful.
| ev1 wrote:
 | Chia mining-before-transactions-are-released -> Chia
 | transactions released -> Chia price starts high -> Chia
 | price halves shortly after; turns out it's virtually
 | impossible for US users to get on any of the exchanges
 | handling XCH
| infogulch wrote:
| If low volume is a problem for manufacturers because you don't
| need that much, the obvious solution is to increase volume by
| selling them. Of course that would introduce even more problems
| to solve, but at least volume wouldn't be one of them.
| igravious wrote:
| fta: "Right after we introduced Storage Pod 1.0 to the world,
| we had to make a decision as to whether or not to make and sell
| Storage Pods in addition to our cloud-based services. We did
| make and sell a few Storage Pods--we needed the money--but we
| eventually chose software."
| jleahy wrote:
 | They already sell them, but personally I thought they were too
 | expensive; probably they were adding a mark-up on their build
 | price when selling them.
| bluedino wrote:
| I don't think Backblaze ever sold them, but 45Drives does.
| I'm not sure if they assemble them for BB or if they were
| just using their published design.
| igravious wrote:
| they did at one point, fta: "Right after we introduced
| Storage Pod 1.0 to the world, we had to make a decision as
| to whether or not to make and sell Storage Pods in addition
| to our cloud-based services. We did make and sell a few
| Storage Pods--we needed the money--but we eventually chose
| software."
| jleahy wrote:
| Indeed, I was thinking of 45Drives.
| igravious wrote:
| Not any more they don't, fta: "Right after we introduced
| Storage Pod 1.0 to the world, we had to make a decision as to
| whether or not to make and sell Storage Pods in addition to
| our cloud-based services. We did make and sell a few Storage
| Pods--we needed the money--but we eventually chose software."
| ineedasername wrote:
| Servicing & warranty management for hardware sales is a very
| different business than their core competency.
| ajaimk wrote:
 | It's actually interesting to me that Backblaze has reached a
 | size where global logistics plays a bigger part in costs than
 | the actual servers. (And the servers got cheaper).
|
| Also, Dell and Supermicro have storage servers inspired by the BB
| Pods.
|
 | Glad to see this scrappy company hit this amount of scale; a long
 | way from shucking hard drives.
| andrewtbham wrote:
| It's really amazing they made their own for so long... hardware
| is a commodity business.
| bluedino wrote:
| The original storage pod was only 1/7th as much as a Dell
| solution, and that didn't include any labor, software, blah
| blah blah.
|
| They're still only buying the assembled hardware from Dell.
| jasode wrote:
| _> hardware is a commodity business._
|
| The _hardware components_ of the Backblaze Pod are commodities
| but the entire finished unit is not a commodity. E.g. the rough
| equivalent from 45drives is not a commodity:
| https://www.45drives.com/products/storinator-xl60-configurat...
| ineedasername wrote:
| I guess it's kind of like building your own gaming PC: Even
| paying retail prices for the parts, you can build your own for
| significantly cheaper than a comparable pre-built system. Since
| their business model is "extremely cheap unlimited backup
| storage" they had to go it alone, but now there are more COTS
| options similar to their needs.
| wilhil wrote:
| I'm curious what is being used for the drives (and to a lesser
| extent, memory) - Dell or OEM and how does support work?
|
 | We sell a lot of Dell, and for base models it is very economical
 | compared to self-built.
|
 | The moment we add a few high-capacity hard drives or memory,
 | however, all bets are off and it's usually 1.75-4x the price of
 | a white box part.
|
 | I get not supporting the part itself, but I've had them refuse to
 | support a RAID card error (corrupt memory) after they saw we had
 | a third-party drive.... we only buy a handful of servers a month -
 | I can imagine this possibly being a huge problem for Backblaze
 | though...
| ocdtrekkie wrote:
 | Flash storage especially goes through the roof with enterprise
 | purchasing. I've bought the drive trays and used consumer SSDs
| in servers more than a few times with no real ill effects where
| SATA is acceptable. If you need SAS, you just need to accept
| the pain that is about to come when you order.
| wfleming wrote:
| Backblaze has usually sourced their own hard drives, and I
| suspect they still are/will. (The post didn't seem to indicate
| otherwise.)
|
| Every year they post a summary of what models they're working
| with and how they perform, which is usually good reading. This
| is last year's: https://www.backblaze.com/blog/backblaze-hard-
| drive-stats-fo....
| ineedasername wrote:
| _The post didn 't seem to indicate otherwise_
|
| That appeared to depend on whether the vendor imposed massive
| markups on the drives. However they also mentioned service
| etc.: If they struck a deal with Dell, then Dell might be
| perfectly happy to sell the servers at a very modest profit
| while making their money on the service agreement.
| d33lio wrote:
| _But can it farm Chia?_
|
| Always cool to get insights into the business and technical
| challenges at Backblaze!
| amelius wrote:
| Title is misleading as there is no next Backblaze storage pod,
| and there never will be, according to the article.
| wtallis wrote:
| > and there never will be, according to the article.
|
| The article doesn't say that. It says:
|
| > So the question is: Will there ever be a Storage Pod 7.0 and
| beyond? We want to say yes. We're still control freaks at
| heart, meaning we'll want to make sure we can make our own
| storage servers so we are not at the mercy of "Big Server Inc."
| In addition, we do see ourselves continuing to invest in the
| platform so we can take advantage of and potentially create
| new, yet practical ideas in the space (Storage Pod X anyone?).
| So, no, we don't think Storage Pods are dead, they'll just have
| a diverse group of storage server friends to work with.
| ceejayoz wrote:
| "The Next Backblaze Storage Pod" is commercially available
| storage from Dell. It's a little clickbaity, but it's both a)
| the title of the article, which HN encourages using and b)
| accurate.
| choppaface wrote:
| > That's a trivial number of parts and vendors for a hardware
| company, but stating the obvious, Backblaze is a software
| company.
|
| Stating the obvious: Backblaze wants investors to value them like
| a SaaS company. This blog post suggests they're more of a
| logistics and product company-- huge capex and depreciating
| assets on hand. As a customer, I like their product, but they're
| no Dropbox. If they would allow personal NAS then I could see
| them being a software company.
| ahmedalsudani wrote:
 | It's easy to buy a bunch of hard drives and connect them in a
 | data center. Managing petabytes per user for thousands of users is
| the hard part, and it's a software problem.
|
| BackBlaze is definitely a SaaS company... though the quality of
| their offering certainly lags behind Dropbox, both in terms of
| feature set and user experience. They're also in a very
| competitive industry. Storage/backup is basically a commodity
| nowadays.
| ricardobeat wrote:
| I sync my NAS to Backblaze B2 without any issues, and pricing
| is great.
| chx wrote:
| For me the Big Deal is Backblaze B2. Especially when fronted by
| Cloudflare -- zero traffic costs. Storage is cheap as far as
 | cloud storage providers go, and traffic is decidedly the
| cheapest possible.
| wmf wrote:
| Dropbox owns more hardware than Backblaze.
| edgeform wrote:
| Always love these articles, look forward to them without knowing
| it.
|
| This one is particularly interesting as they discuss the
| logistical challenges of their own success in having to build
| more and more Storage Pods.
|
| As always, a super fascinating read worth your time.
| jleahy wrote:
 | I'm surprised Dell don't make them buy hard drives from them at a
| substantial mark-up, as they allude to vendors doing earlier in
| the story.
| tpetry wrote:
 | Or maybe Dell forces them to buy their drives, but their markup
 | is not as high as their competitors'?
| gnopgnip wrote:
 | If you are buying in bulk the pricing can be a lot better
 | than the list prices. But generally there isn't a problem
 | buying a server without a drive: Dell will still support the
 | server for non-disk-related warranty issues, and you don't
 | need special firmware or disks.
| foobarbazetc wrote:
| Dell (and, really, all server providers apart from
| Supermicro) have crazy markups on storage.
|
| It's where they make most of their margin.
|
| And then, most of the time, the drives they sell come on
| custom sleds that they don't sell separately as a form of
| DRM/lock in.
|
| Then you get a nice little trade on Chinese-made sleds that
| sort of work, but not for anything recent like hot swap NVMe
| drives.
|
| I'm sure BB were able to negotiate down a lot (Dell usually
| come down 50% off the list price if you press them hard
| enough for one off projects), but... yeah. That's how it
| generally goes.
| jleahy wrote:
 | The default markup is awful; check the Dell website. I'd
 | describe the process of buying a drive from Dell as a bit like
 | getting mugged.
| Analemma_ wrote:
| I imagine if you're buying sixty pods a month every month
| you have some leverage with Dell to get better prices,
| especially if you have a demonstrated ability to just walk
| away and build your own if you don't like their offer.
| gpm wrote:
| Dell's not in the best negotiating position here, given that
| "build our own" is a valid alternative.
| ineedasername wrote:
 | Dell may be very happy to have a high volume customer with very
| standardized & predictable needs, and so they're happy with
| modest markups & extra profit on the service agreements, which
| is a nice benefit for Backblaze since building their own pods
| doesn't give them any service guarantee/warranty.
| bluedino wrote:
| Any idea what Dell is actually selling them? The DVR's we buy
| (Avigilon) are white Dell 7x0's with a custom white bezel, but
| those only fit 18 3.5" drives.
| narism wrote:
| Dell's densest server is the PowerEdge XE7100 [1] (100 3.5"
 | drives in 5U), but the bezel cover picture looks more like a
 | standard 2U, maybe an R740xd2 (26 3.5" in 2U).
|
| [1] https://www.delltechnologies.com/asset/en-
| ae/products/server...
|
| https://www.servethehome.com/dell-emc-poweredge-xe7100-100-d...
| erikpt-work wrote:
| Could be the MD3060e with an R650 server?
|
| https://i.dell.com/sites/doccontent/shared-content/data-shee...
| toomuchtodo wrote:
| Your link 404s, I think it's the extra character on the end.
| maxclark wrote:
| I'd love to know this as well. Dell doesn't have anything
| remotely close to what Backblaze was designing/building
| themselves.
|
| So did they do something custom (unlikely at this volume) or
| did Backblaze change their hardware approach?
| philjohn wrote:
| They do since last year - https://www.servethehome.com/dell-
| emc-poweredge-xe7100-100-d...
| ineedasername wrote:
| Not quite the same, but they do have something like the Pods,
| but a bit more modular:
|
| It's their PowerEdge MX platform, which allows you to slot in
| different "sleds" for storage/compute etc. as needed. It can
 | take 7 storage sleds for a total of 112 drives per chassis.
| brandon wrote:
| Based on the pictured bezel, it looks like they've got three
| rows 3.5" 14TB SATA drives up front in 14th generation
| carriers. Best guess would be something like an R740XD2 which
| has 26 total drive bays per 2U.
| wrikl wrote:
| The author recently commented:
| https://www.backblaze.com/blog/next-backblaze-storage-pod/#c...
|
| It's apparently the "Dell PowerEdge R740xd2 rack server".
| chx wrote:
| https://ifworlddesignguide.com/entry/281015-poweredge-r740xd.
| .. super interesting design.
| encryptluks2 wrote:
| I can't say that I'm surprised, and honestly anyone can open
| source the architecture for a storage array of comparable use. I
| think the only unique thing here really is the chassis, but there
| are plenty of whitebox vendors that sell storage chassis. You may
 | not get as many drives in one, but the other components in
 | these things are usually pretty cheap minus the storage. I
| don't really see this being a loss in the community at all, and
| maybe someone else will get creative and build something better.
___________________________________________________________________
(page generated 2021-06-24 23:00 UTC)