[HN Gopher] My 71 TiB ZFS NAS After 10 Years and Zero Drive Fail...
___________________________________________________________________
My 71 TiB ZFS NAS After 10 Years and Zero Drive Failures
Author : louwrentius
Score : 400 points
Date : 2024-09-13 23:21 UTC (23 hours ago)
(HTM) web link (louwrentius.com)
(TXT) w3m dump (louwrentius.com)
| ggm wrote:
| There have been drives where power cycling was hazardous. So,
| whilst I agree with the model, it shouldn't be assumed this is
| always good, all the time, for all people. Some SSDs need to be
| powered periodically. The duty cycle for a NAS probably meets
| that burden.
|
| Probably good, and definitely cheaper power costs. Those
| extra-grease-on-the-axle drives were a blip in time.
|
| I wonder if Backblaze does a drive on/off lifetime stats model? I
| think they are in the always-on problem space.
| louwrentius wrote:
| > There have been drives where power cycling was hazardous.
|
| I know about this story from 30+ years ago. It may have been
| true then. It may even be true now.
|
| Yet, in my case, I don't power cycle these drives often - at
| most a few times a month. I can't say or prove whether it's a
| huge risk; I only believe it's not. I have accepted this risk
| for 15+ years.
|
| Update: remember that hard drives have an option to spin down
| when idle. So hard drives can handle many spinups a day.
| neilv wrote:
| In the early '90s, some Quantum 105S hard drives had a
| "stiction" problem, and were shipped with Sun SPARCstations.
|
| IME at the time, power off a bunch of workstations, such as
| for building electrical work, and probably at least one of
| them wouldn't spin back up the next business day.
|
| Pulling the drive sled, and administering percussive
| maintenance against the desktop, could work.
|
| https://sunmanagers.org/1992/0383.html
| ghaff wrote:
| Stiction was definitely a thing back in that general period
| when you'd sometimes knock a drive to get it back to
| working again.
| ggm wrote:
| I debated posting because it felt like shit-stirring. I think
| overwhelmingly what you're doing is right. And if a remote
| power-on, e.g. WOL, works on the device, so much the better. If
| I could wish for one thing, it's mods to code or documentation
| of how to handle drive power-down on ZFS. The rumour mill says
| ZFS doesn't like spindown.
| louwrentius wrote:
| I did try using HDD spindown on ZFS, but I remember (it was a
| long time ago) that I encountered too many vague errors that
| scared me, and I just disabled spindown altogether.
| Kirby64 wrote:
| What is there to handle? I have a ZFS array that works just
| fine with hard drives that automatically spin down. ZFS
| handles this without an issue.
|
| The main gotchas tend to be: if you use the array for many
| things, especially stuff that throws off log files, you
| will constantly be accessing that array and resetting the
| spin down timers. Or you might be just at the threshold for
| spindown and you'll put a ton of cycles on it as it bounces
| from spindown to access to spin up.
|
| For a static file server (rarely accessed backups or
| media), partitioned correctly, it works great.
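|
| If anyone wants to experiment with the timers, hdparm sets the
| standby timeout per drive (device name is a placeholder; -S
| values 1-240 are multiples of 5 seconds, so 240 = 20 minutes,
| 0 = never):
|
|   hdparm -S 240 /dev/sdX   # spin down after 20 minutes idle
|   hdparm -C /dev/sdX       # report power state without waking it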
| bluedino wrote:
| Long ago I had a client who could have been an episode of "IT
| Nightmares".
|
| They used internal 3.5" hard drives along with USB docks to back
| up a couple of Synology devices... It seemed like 1 in 10 times,
| when you put a drive back in the dock to restore a file or make
| another backup, the drive wouldn't power back up.
| 2OEH8eoCRo0 wrote:
| > The 4 TB HGST drives have roughly 6000 hours on them after ten
| years.
|
| So they mostly sit idle? Mine are about ~4 years old with ~35,000
| hours.
| louwrentius wrote:
| To quote myself from a few lines below that fragment:
|
| > My NAS is turned off by default. I only turn it on (remotely)
| when I need to use it.
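|
| (For anyone wondering how the remote power-on part can work: if
| the box supports Wake-on-LAN, it's a one-liner from another
| machine on the LAN - the MAC address below is a placeholder:
|
|   wakeonlan 00:11:22:33:44:55
|   # or: etherwake -i eth0 00:11:22:33:44:55   (needs root)
|
| Other setups use IPMI or a smart plug instead.)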
| markoman wrote:
| I loved the idea around the 10Gb NICs but wondered what model
| switch you are connecting to? And what throughput is this
| topology delivering?
| louwrentius wrote:
| I don't have a 10Gb switch. I connect this server directly
| to two other machines as they all have 2 x 10Gbit. This NAS
| can saturate 10Gbit but the other side can't, so I'm stuck
| at 500-700 MB/s, I haven't measured it in a while.
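|
| (Measuring the raw link is easy enough with iperf3 whenever I
| get around to it again - hostname here is a placeholder:
|
|   iperf3 -s                    # on the NAS
|   iperf3 -c nas.local -P 4     # on the client, 4 parallel streams
|
| Actual file transfers come out lower because disks and protocol
| overhead get in the way.)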
| russfink wrote:
| What does one do with all this storage?
| flounder3 wrote:
| This is a drop in the bucket for photographers, videographers,
| and general backups of RAW / high resolution videos from mobile
| devices. 80TB [usable] was "just enough" for my household in
| 2016.
| patchymcnoodles wrote:
| Exactly that. I'm not even shooting in ProRes or similar
| "raw" video. But one video project easily takes 3TB. And I'm
| not even a professional.
| jiggawatts wrote:
| Holy cow, what are you shooting with!?
|
| I have a Nikon Z8 that can output up to 8.3K @ 60 fps raw
| video, and my biggest project is just 1 TB! Most are on the
| order of 20 GB, if that.
| patchymcnoodles wrote:
| I use a Sony a1, my videos are 4K 100fps. But on the last
| project I also had an Insta360 X4 shooting B-roll. So on some
| days that adds up a lot.
| andrelaszlo wrote:
| Especially since it's mostly turned off and it seems like the
| author is the only user
| rhcom2 wrote:
| For me I have about 35TB and growing in Unraid for
| Plex/Torrents/Backups/Docker
| tbrownaw wrote:
| It's still not enough to hold a local copy of sci-hub, but
| could probably hold quite a few recorded conference talks (or
| similar multimedia files) or a good selection of huggingface
| models.
| Nadya wrote:
| If you prefer to own media instead of streaming it, are into
| photography, video editing, 3D modelling, any AI-related stuff
| (models add up) or are a digital hoarder/archivist you blow
| through storage rather quickly. I'm sure there are some other
| hobbies that routinely work with large file sizes.
|
| Storage is cheap enough that rather than deleting thousands of
| photos and never being able to look at them again, I'd rather
| buy another drive. I'd rather have the RAW of an 8-year-old
| photo that I overlooked and later decide I really like and want
| to edit/work with than an 87 kB resized and compressed JPG of
| the same file. Same for a mostly-edited 240GB video file. What
| if I want or need to make some changes to it in the future? I
| may as well hold onto it rather than have to re-edit the video,
| or re-shoot it if the original footage was also deleted.
|
| Content creators have deleted their content often enough that if
| I enjoyed a video and think future me might enjoy rewatching it,
| I download it rather than trust that I can still watch it in the
| future. Sites have been taken offline frequently enough that I
| download things. News sites keep restructuring and breaking all
| their old article links, so I download the articles locally. JP
| artists are notorious enough for deleting their entire accounts
| and restarting under a new alias that I routinely archive entire
| Pixiv/Twitter accounts if I like their art, as there is no
| guarantee it will still be there to enjoy the next day.
|
| It all adds up and I'm approaching 2 million well-organized and
| (mostly) tagged media files in my Hydrus client [0]. I have
| many scripts to automate downloading and tagging content for
| these purposes. I very, very rarely delete things. My most
| frequent reason for deleting anything is "found in higher
| quality" which conceptually isn't _really_ deleting.
|
| Until storage costs become unreasonable I don't see my habits
| changing anytime soon. On the contrary - storage keeps getting
| cheaper and cheaper and new formats keep getting created to
| encode data more and more efficiently.
|
| [0] https://hydrusnetwork.github.io/hydrus/index.html
| denkmoon wrote:
| avoid giving disney money
| complex1314 wrote:
| Versioned datasets for machine learning.
| fiddlerwoaroof wrote:
| This sounds to me like it's just a matter of luck and not really
| a model to be imitated.
| louwrentius wrote:
| It's likely that you are right and that I misjudged the
| likelihood of this result being special.
|
| You can still imitate the model to save money on power, but your
| drives may not last longer; there's indeed no evidence for that.
| turnsout wrote:
| I've heard the exact opposite advice (keep the drives running to
| reduce wear from power cycling).
|
| Not sure what to believe, but I like having my ZFS NAS running so
| it can regularly run scrubs and check the data. FWIW, I've run my
| 4 drive system for 10 years with 2 drive failures in that time,
| but they were not enterprise grade drives (WD Green).
| louwrentius wrote:
| Hard drives are often configured to spin down when idle for a
| certain time. This can cause many spinups and spindowns per
| day. So I don't buy this at all. But I don't have supporting
| evidence to back up this notion.
| turnsout wrote:
| I think NAS systems in particular are often configured to
| prevent drives from spinning down
| max-ibel wrote:
| There seems to be a huge difference between spin-down while
| NAS is up vs shutting the whole NAS down and restart. When I
| start my NAS, it takes a bunch of time to be back up: it
| seems to do a lot of checking on/ syncing off the drives and
| puts a fair amount of load on them (same is true for CPU as
| well, just look at your CPU load right after startup).
|
| OTOH, when the NAS spins up a single disk again, I haven't
| notice any additional load. Presumably, the read operation
| just waits until the disk is ready.
| Wowfunhappy wrote:
| > Hard drives are often configured to spin down when idle for
| a certain time. This can cause many spinups and spindowns per
| day.
|
| I was under the impression that this _was_ , in fact, known
| to reduce drive longevity! It is done anyway in order to save
| power, but the reliability tradeoff is known.
|
| No idea where I read that though, I thought it was "common
| knowledge" so maybe I'm wrong.
| Dalewyn wrote:
| >Not sure what to believe
|
| Keep them running.
|
| Why?:
|
| * The read/write heads experience next to no wear while they are
| floating above the platters. They physically land on ramps or on
| landing zones on the platters themselves when turned off;
| landing and takeoff are by far the most wear the heads will
| suffer.
|
| * Following on the above, in the worst case the read/write
| heads might be torn off during takeoff due to stiction.
|
| * Bearings will last longer; they might also seize up if left
| stationary for too long. Likewise the drive motor.
|
| * The rush of current when turning on is an electrical
| stressor, no matter how minimal.
|
| The only reasons to turn your hard drives off are to save
| power, reduce noise, or transport them.
| tomxor wrote:
| > Keep them running [...] Bearings will last longer; they
| might also seize up if left stationary for too long. Likewise
| the drive motor.
|
| All HDD failures I've ever seen in person (5 across 3 decades)
| were bearing failures, in machines that were almost always on
| with drives spun up. It's difficult to know for
| sure without proper A-B comparisons, but I've never seen a
| bearing failure in a machine where drives were spun down
| automatically.
|
| It also seems intuitive that for mechanical bearings the
| longer they are spun up the greater the wear and the greater
| the chance of failure.
| manwe150 wrote:
| I think I have lost half a dozen hard drives (and a couple of
| DVD-RW drives) over the decades because they sat in a box on a
| shelf for a couple of years. (I recall that one started working
| again with a higher-amperage 12V supply, but only long enough to
| copy off most of the data.)
| mulmen wrote:
| My experience with optical drives is similar to ink jet
| printers. They work once when new then never again.
| telgareith wrote:
| Citations needed.
|
| Counterpoints for each: Heads don't suffer wear when parking.
| The armature does.
|
| If the platters are not spinning fast enough, or the air
| density is too low: the heads will crash into the sides of
| the platters.
|
| The main wear on platter bearings is vibration; it takes an
| extremely long time for the lube to "gum up," if that's still a
| thing at all. I suspect it used to happen because they were
| petroleum distillate lubes, so the shorter chains would
| evaporate/sublimate, leaving longer, more viscous chains. Or
| straight-up polymerize.
|
| With fully synthetic PAO oils and other options, they won't do
| that anymore.
|
| What inrush? They're polyphase steppers. The only reason for
| inrush is that the engineers didn't think it'd affect lifetime.
|
| Counter: turn your drives off. The saved power of 8 drives being
| off half the day easily totals $80 a year - enough to replace
| all but the highest-capacity drives.
| akira2501 wrote:
| Using power creates heat. Thermal cycles are never good.
| Heating parts up and cooling them down often reduces their
| life.
| bigiain wrote:
| > The only reasons to turn your hard drives off are to save
| power, reduce noise, or transport them.
|
| One reason some of my drives get powered down 99+% of the time
| is that it's a way to guard against the whole network getting
| cryptolockered. I have a weekly backup run by a script that
| powers up a pair of raid1 usb drives, does an incremental
| no-delete backup, then unmounts and powers them back down again.
| Even in a busy week they're rarely running for more than an hour
| or two. I'd have to be unlucky enough for the power-up script
| not to detect the cryptolocker (it checks md5 hashes of a few
| "canary files") and to power up the backup drives anyway. I
| figure that's a worthwhile reason to spin them down weekly...
| philjohn wrote:
| Yes - although it's worth bearing in mind the number of
| load/unload cycles a drive is rated for over its lifetime.
|
| In the case of the IronWolf NAS drives in my home server,
| that's 600,000.
|
| I spin the drives down after 20 minutes of no activity, which I
| feel is a good balance between having them be too thrashy and
| saving energy. After 3 years I'm at about 60,000 load/unload
| cycles.
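|
| The running totals are easy to check with smartmontools (device
| name is a placeholder; attribute names can vary a bit by
| vendor):
|
|   smartctl -A /dev/sdX | grep -E 'Load_Cycle|Start_Stop|Power_On_Hours'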
| CTDOCodebases wrote:
| I think a lot of the advice around keeping the drives running
| is about avoiding wear caused by spin downs and startups i.e.
| keeping the "Start Stop Cycles" low.
|
| Theres a difference between spinning a drive up/down once or
| twice a day and spinning it down every 15 minutes or less.
|
| Also WD Green drives are not recommended for NAS usage. I know
| in the past they used to park the read/write head every few
| seconds or so which is fine if data is being accessed
| infrequently but continuously however a server this can result
| in continuous wear which leads to premature failure.
| larusso wrote:
| There used to be some tutorials going around to flash the
| firmware to turn Greens into Reds, I believe, which simply
| disables the head parking.
| cm2187 wrote:
| Agree. I do weekly backups; the backup NAS is only switched on
| and off 52 times a year. After 5-6 years the disks are probably
| close to new in terms of usage vs. disks that have been running
| continuously over that same period.
|
| Which leads to another strategy: swap the primary and the backup
| after 5 years to get a good 10 years out of the two NASes.
| foobarian wrote:
| > regularly run scrubs and check the data
|
| Does this include some kind of built-in hash/checksum system to
| record e.g. md5 sums of each file and periodically test them? I
| have a couple of big drives for family media I'd love to
| protect with a bit more assurance than "the drive did not
| fail".
| Filligree wrote:
| It's ZFS, so that's built-in. A scrub does precisely that.
| adastra22 wrote:
| Yes, zfs includes file-level checksums.
| giantrobot wrote:
| Block level checksums.
| jclulow wrote:
| This is not strictly accurate. ZFS records checksums of the
| records of data that make up the file storage. If you want
| an end to end file-level checksum (like a SHA-256 digest of
| the contents of the file) you still need to layer that on
| top. Which is not to say it's bad, and it's certainly
| something I rely on a lot, but it's not quite the same!
| ghostly_s wrote:
| https://en.wikipedia.org/wiki/ZFS?wprov=sfti1#Resilvering_an.
| ..
| Cyph0n wrote:
| Yep, ZFS reads everything in the array and validates
| checksums. ZFS (at least on Linux) ships with scrub systemd
| timers: https://openzfs.github.io/openzfs-
| docs/man/master/8/zpool-sc...
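|
| Kicking one off manually and checking the result is just (pool
| name is a placeholder):
|
|   zpool scrub tank         # read every block and verify its checksum
|   zpool status -v tank     # progress, plus any files with errors
|
| Recent OpenZFS releases also ship zfs-scrub-weekly@ /
| zfs-scrub-monthly@ timer units that can be enabled per pool.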
| bongodongobob wrote:
| This is completely dependent on access frequency. Do you have a
| bunch of different people accessing many files frequently? Are
| you doing frequent backups?
|
| If so, then yes, keeping them spinning may help improve lifespan
| by reducing frequent disk jerk. This is really only applicable
| when you're at a pretty consistent high load and you're trying
| to prevent your disks from spinning up and down every few
| minutes or something.
|
| For a homelab, you're probably wasting way more money in
| electricity than you are saving in disk maintenance by leaving
| your disks spinning.
| mvanbaak wrote:
| The 'secret' is not that you turn them off. It's simply luck.
|
| I have 4TB HGST drives running 24/7 for over a decade. OK, not
| 24 hours a day but 8, and also 0 failures. But I'm also lucky,
| like you. Some of the people I know have several RMAs with the
| same drives, so there's that.
|
| My main question is: what is it that takes 71TB but can be
| turned off most of the time? Is this the server where you store
| backups?
| louwrentius wrote:
| It can be luck, but with 24 drives, it feels very lucky.
| Somebody with proper statistics knowledge can probably
| calculate, with a guesstimated 1% yearly failure rate, how
| likely it would be to have all 24 drives remaining.
|
| And remember, my previous NAS with 20 drives also didn't have
| any failures. So N=44; how lucky must I be?
|
| It's for residential usage, and if I need some data, I often
| just copy it over 10Gbit to a system that uses much less power
| and this NAS is then turned off again.
| the_gorilla wrote:
| We don't really have to guess. Backblaze posted their stats
| for 4 TB HGST drives for 2024, and of their 10,000 drives, 5
| failed. If OP's 2014 4 TB HGST drives are anything like this,
| then this is just snake oil and magic rituals and it doesn't
| really matter what you do.
| louwrentius wrote:
| 5 drives failed in Q1 2024. 8 died in Q2. That is still a
| very low failure rate.
| renewiltord wrote:
| Drives have a bathtub curve, but if you want you can be
| conservative and estimate first-year failure rates throughout.
| So that's p=5/10000 for drive failure. So the chance of no
| failure per year (because of our assumption) is 1-p. So the
| chance of no failure over ten years is (1-p)^10, or about 99.5%.
| louwrentius wrote:
| Is that for one drive, or for all 24 drives to survive 10
| years?
| dn3500 wrote:
| That's for one drive. For all 24 it's about 88.7%.
| toast0 wrote:
| > If OP's 2014 4 TB HGST drives are anything like this,
| then this is just snake oil and magic rituals and it
| doesn't really matter what you do.
|
| It might matter what you do, but we only have public data
| for people in datacenters. Not a whole lot of people with
| 10,000 drives are going to have them mostly turned off, and
| none of them shared their data.
| formerly_proven wrote:
| Those are different drives though; they're MegaScale DC 4000
| while OP is using 4 TB Deskstars. Not sure if they're basically
| the same (probably). I've also had a bunch of these 4TB
| MegaScale drives and absolutely no problems whatsoever in about
| 10 years as well. They run very cool, too (I think they're 5400
| rpm, not 7200 rpm).
|
| The main issue with drives like these is that 4 TB is just so
| little storage compared to 16-20 TB class drives, it kinda gets
| hard to justify the backplane slot.
| manquer wrote:
| The failure rate is not truly random with a nice normal
| distribution of failures over time. There are sometimes higher
| rates in specific batches, or they can start failing all
| together, etc.
|
| Backblaze reports are always interesting insights into how
| consumer drives behave under constant load.
| dastbe wrote:
| It's (1-p)^(24*10), where p is the drive failure rate per year
| (assuming it doesn't go up over time). So at 1% that's about 9%,
| or a 1-in-10 chance of this result. Not exactly great, but not
| impossible.
|
| The Backblaze rates are all over the place, but it does appear
| they have drives at this rate or lower:
| https://www.backblaze.com/blog/backblaze-drive-stats-
| for-q2-...
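|
| Quick sanity check of that number:
|
|   awk 'BEGIN { p = 0.01; printf "%.3f\n", (1 - p)^(24 * 10) }'   # 0.090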
| louwrentius wrote:
| I can see that; my assumption that my result (no drive failure
| over 10 years) was rare is wrong. So I've updated my blog about
| that.
| CTDOCodebases wrote:
| I'm curious what the "Start/Stop cycle count" is on these drives
| and roughly how many times per week/day you are accessing the
| server.
| monocasa wrote:
| In fact, the conventional wisdom for a long time was to not
| turn them off if you want longevity. Bearings seize when cold
| for instance.
| ryanjshaw wrote:
| > What is it that takes 71TB but can be turned off most of the
| time?
|
| Still waiting for somebody to explain this to me as well.
| leptons wrote:
| I have a 22TB RAID10 system out in my detached garage that
| works as an "off-site" backup server for all my other
| systems. It stays off most of the time. It's on when I'm
| backing up data to it, or if it's running backups to LTO
| tape. Or it's on when I'm out in the garage doing whatever
| project, I use it to play music and look up stuff on the web.
| Otherwise it's off, most of the time.
| naming_the_user wrote:
| For what it's worth this isn't that uncommon. Most drives fail in
| the first few years, if you get through that then annualized
| failure rates are about 1-2%.
|
| I've had the (small) SSD in a NAS fail before any of the drives
| due to TBW.
| rkagerer wrote:
| I have a similar-sized array which I also only power on nightly
| to receive backups, or occasionally when I need access to it for
| a week or two at a time.
|
| It's a whitebox RAID6 running NTFS (tried ReFS, didn't like it),
| and has been around for 12+ years, although I've upgraded the
| drives a couple times (2TB --> 4TB --> 16TB) - the older Areca
| RAID controllers make it super simple to do this. Tools like Hard
| Disk Sentinel are awesome as well, to help catch drives before
| they fail.
|
| I have an additional, smaller array that runs 24x7, which has
| been through similar upgrade cycles, plus a handful of clients
| with whitebox storage arrays that have lasted over a decade.
| Usually the client ones are more abused (poor temperature control
| when they delay fixing their serveroom A/C for months but keep
| cramming in new heat-generating equipment, UPS batteries not
| replaced diligently after staff turnover, etc...).
|
| Do I notice a difference in drive lifespan between the ones that
| are mostly off vs. the ones that are always on? Hard to say.
| It's too small a sample size, and possibly too much variance in
| 'abuse' between them. But I've definitely seen a failure-rate
| differential between the ones that have been maintained and kept
| cool vs. those allowed to get hotter than is healthy.
|
| I _can_ attest those 4TB HGST drives mentioned in the article
| were tanks. Anecdotally, they're the most reliable ones I've
| ever owned. And I have a more reasonable sample size there, as I
| was buying dozens at a time for various clients back in the day.
| louwrentius wrote:
| I bought the HGSTs specifically because they showed good stats
| in those Backblaze drive stats that they just started to
| publish back then.
| lostmsu wrote:
| I have a mini PC + 4x external HDDs (I always bought used) on
| Windows 10 with ReFS since probably 2016 (recently upgraded to
| Win 11), maybe earlier. I don't bother powering off.
|
| The only time I had problems was when I tried to add a 5th disk
| using a USB hub, which caused drives attached to the hub to get
| disconnected randomly under load. This actually happened with 3
| different hubs, so I've since stopped trying to expand that
| monstrosity and just replace drives with larger ones instead.
| Don't use hubs for storage; the majority of them are shitty.
|
| Currently ~64TiB (less with redundancy).
|
| Same as OP. No data loss, no broken drives.
|
| A couple of years ago I also added an off-site 46TiB system with
| similar software, but a regular ATX with 3 or 4 internal drives
| because the spiderweb of mini PC + dangling USBs + power supplies
| for HDDs is too annoying.
|
| I do weekly scrubs.
|
| Some notes: https://lostmsu.github.io/ReFS/
| anjel wrote:
| Can't help but wonder how much electricity would have been
| consumed if you had left it on 24/7 for ten years...
| louwrentius wrote:
| On the original blog it states that the machine used 200W idle.
| That's 4.8 kWh a day, or 17,520 kWh over 10 years. At around
| 0.30 euro per kWh, that's 5K+ euros if I'm not mistaken.
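|
| Quick check of that arithmetic:
|
|   awk 'BEGIN { kwh = 0.200*24*365*10; printf "%d kWh, ~%d euro\n", kwh, kwh*0.30 }'
|   # -> 17520 kWh, ~5256 euro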
| nine_k wrote:
| Why wonder, let's approximate.
|
| A typical 7200 rpm disk consumes about 5W when idle. For 24
| drives, it's 120W. Rather substantial, but not an electric
| kettle level. At $0.25 / kWh, it's $0.72 / day, or about $22 /
| mo, or slightly more than $260 / year. But this is only the
| disks; the CPU + mobo can easily consume half as much on
| average, so it would be more like $30-35 / mo.
|
| And if you have electricity at a lower price, the numbers
| change accordingly.
|
| This is why my ancient NAS uses 5400 RPM disks, and a future
| upgrade could use even slower disks if these were available.
| The reading bandwidth is multiplied by the number of disks
| involved.
| louwrentius wrote:
| Because a lot of people pointed out that it's not that unlikely
| that all 24 drives survive without failure over 10 years, given
| the low failure rate reported by Backblaze Drive Stats Reports,
| I've also updated the article with that notion.
| Jedd wrote:
| Surprised to not find 'ecc' on that page.
|
| I know it's not a guarantee of no-corruption, and ZFS without
| ECC is probably no more dangerous than any other file system
| without ECC, but if data corruption is a major concern for you,
| and you're building out a _pretty hefty_ system like this, I
| can't imagine not using ECC.
|
| Slow on-disk data corruption resulting from gradual and near-
| silent RAM failures may be like doing regular 3-2-1 backups --
| you either mitigate against the problem because you've been stung
| previously, or you're in that blissful pre-sting phase of your
| life.
|
| EDIT: I found TFA's link to the original build out - and happily
| they are in fact running a Xeon with ECC. Surprisingly it's a
| 16GB box (I thought ZFS was much hungrier on the RAM : disk
| ratio.) Obviously it hasn't helped for physical disk failures,
| but the success of _the storage array_ owes a lot to this
| component.
| louwrentius wrote:
| The system is using ECC and I specifically - unrelated to ZFS -
| wanted to use ECC memory to reduce risk of data/fs corruption.
| I've also added 'ecc' to the original blog post to clarify.
|
| Edit: ZFS for home usage doesn't need a ton of RAM as far as
| I've learned. There is the 1 GB of RAM per 1TB of storage rule
| of thumb, but that was for a specific context. Maybe the ill-
| fated data deduplication feature, or was it just to sustain
| performance?
| rincebrain wrote:
| It was a handwavey rule of estimation for dedup, handwavey
| because dedup scales on number of records, which is going to
| vary wildly by recordsize.
| InvaderFizz wrote:
| Additionally unless it's changed in the last six years, you
| should pretend ZFS dedupe doesn't exist.
| rincebrain wrote:
| Not in a stable release yet, but check out
| https://github.com/openzfs/zfs/discussions/15896 if you
| have a need for that.
| Jedd wrote:
| Thanks, and all good - it was my fault for not following the
| link in this story to your post about the actual build,
| before starting on my mini-rant.
|
| I'd heard the original ZFS memory estimations were somewhat
| exuberant, and recommendations had come down a lot since the
| early days, but I'd imagine given your usage pattern -
| powered on periodically - a performance hit for whatever
| operations you're doing during that time wouldn't be
| problematic.
|
| I used to use mdadm for software RAID, but for several years
| now my home boxes are all hardware RAID. LVM2 provides the
| other features I need, so I haven't really ever explored zfs
| as a replacement for both - though everyone I know that uses
| it, loves it.
| hinkley wrote:
| Accidentally unplugged my raid 5 array and thought I damaged
| the raid card. Hours after boot I'd get problems. I glitched a
| RAM chip and the array was picking it up as disk corruption.
| Filligree wrote:
| It's difficult as a home user to find ECC memory, harder to
| make sure it actually works in your hardware configuration, and
| near-impossible to find ECC memory that doesn't require lower
| speeds than what you can get for $50 on amazon.
|
| I would very much like to put ECC memory in my home server, but
| I couldn't figure it out this generation. After four hours I
| decided I had better things to do with my time.
| Jedd wrote:
| Indeed. I'd started to add an aside to the effect of 'ten years
| ago it was probably _easier_ to go ECC'. I'll add it here
| instead.
|
| A decade ago, if you wanted ECC your choice was basically Xeon,
| and all Xeon motherboards would accept ECC.
|
| I agree that these days it's much more complex, since you are
| ineluctably going to get sucked into the despair-spiral of
| trying to work out what combination of Ryzen + motherboard + ECC
| RAM will give you _actual, demonstrable_ ECC (with correction,
| not just detection).
| rpcope1 wrote:
| Sounds like the answer is to just buy another Xeon then,
| even if it's a little older and maybe secondhand. I think
| there's a reason the vast majority of Supermicro
| motherboards are still just Intel only.
| Filligree wrote:
| You might also need performance. Or efficiency.
| throw0101c wrote:
| > _This NAS is very quiet for a NAS (video with audio)._
|
| Big (large radius) fans can move a lot of air even at low RPM.
| And be much more energy efficient.
|
| Oxide Computer, in one of their presentations, talks about using
| 80mm fans, as they are quiet and (more importantly) don't use
| much power. They observed, in other servers, as much as 25% of
| the power went just to powering the fans, versus the ~1% of
| theirs:
|
| * https://www.youtube.com/shorts/hTJYY_Y1H9Q
|
| * https://www.youtube.com/watch?v=4vVXClXVuzE
| louwrentius wrote:
| +1 for mentioning 0xide. I love that they went this route and
| that stat is interesting. I hate the typical DC high RPM small
| fan whine.
|
| I also hope that they do something 'smart' when they control
| the fan speed ;-)
| mkeeter wrote:
| It's moderately smart - there's a PID loop with per-component
| target temperatures, so it's trying not to do more work than
| necessary.
|
| (source: I wrote it, and it's all published at https://github
| .com/oxidecomputer/hubris/tree/master/task/the... )
|
| We also worked with the fan vendor to get parts with a lower
| minimum RPM. The stock fans idle at about 5K RPM, and ours
| idle at 2K, which is already enough to keep the system cool
| under light loads.
| louwrentius wrote:
| Ha! thanks a lot for sharing, love it. Nice touch to use
| low idle RPM fans.
|
| Same thing for my ancient NAS: after boot, the fans run at
| idle for hours and the PID controller just doesn't have to
| do anything at all.
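|
| The core of such a fan script boils down to something like this
| proportional-integral sketch (the PWM path, gains and smartctl
| parsing are placeholders and need adapting to the actual
| hardware):
|
|   #!/bin/bash
|   TARGET=38                              # hottest-drive target, deg C
|   PWM=/sys/class/hwmon/hwmon2/pwm1       # fan PWM node (placeholder)
|   KP=8; KI=1; integral=0                 # controller gains
|   while sleep 30; do
|       # hottest drive temperature, SMART attribute 194
|       temp=$(for d in /dev/sd?; do
|                  smartctl -A "$d" | awk '$1 == 194 {print $10}'
|              done | sort -n | tail -1)
|       err=$(( temp - TARGET ))
|       integral=$(( integral + err ))
|       (( integral >  30 )) && integral=30     # anti-windup clamp
|       (( integral < -30 )) && integral=-30
|       pwm=$(( 80 + KP * err + KI * integral ))    # 80 = idle duty
|       (( pwm > 255 )) && pwm=255
|       (( pwm <  60 )) && pwm=60
|       echo "$pwm" > "$PWM"
|   done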
| sss111 wrote:
| just curious, are you associated with them, as these are very
| obscure youtube videos :D
|
| Love it though, even the reduction in fan noise is amazing. I
| wonder why nobody had thought of it before, it seems so simple.
| throw0101c wrote:
| > _just curious, are you associated with them, as these are
| very obscure youtube videos :D_
|
| Unassociated, but tech-y videos are often recommended to me, and
| these videos got pushed to me. (I have viewed other, unrelated
| Tech Day videos, so that's probably why I got that short. Also
| an old Solaris admin, so aware of Cantrill, especially his
| rants.)
|
| > _Love it though, even the reduction in fan noise is
| amazing. I wonder why nobody had thought of it before, it
| seems so simple._
|
| Depends on the size of the server: you can't really fit bigger
| fans in 1U or even 2U pizza boxes. And for general-purpose
| servers, I'm not sure how many 4U+ systems are purchased --
| perhaps more now that GPU cards may be a popular add-on.
|
| For a while chassis systems (e.g., HP c7000) were popular, but
| I'm not sure how they are doing nowadays.
| baby_souffle wrote:
| > I'm not sure how many 4U+ systems are purchased -- perhaps
| more now that GPU cards may be a popular add-on.
|
| Going from what I see at eCycle places, 4U dried up years ago.
| Everything is either 1 or 2U, or massive blade receptacles (10+
| U).
|
| We (the home-lab-on-a-budget people) may see a return to 4U now
| that GPUs are in vogue, but I'd bet that the hyperscalers are
| going to drive that back down to something like 3U with water
| cooling over the longer term.
|
| We may also see similar with storage systems; it's only a matter
| of time before SSDs get "close enough" to spinning rust on the
| $/gig/unit-volume metrics.
| daemonologist wrote:
| Interesting - I'm used to desktop/workstation hardware where
| 80mm is the _smallest_ standard fan (aside from 40mm's in the
| near-extinct Flex ATX PSU), and even that is kind of rare.
| Mostly you see 120mm or 140mm.
| mustache_kimono wrote:
| > 80mm is the smallest standard fan (aside from 40mm's in the
| near-extinct Flex ATX PSU)
|
| Those 40mm PSU fans, and the PSU, are what they are replacing
| with a DC bus bar.
| throw0101c wrote:
| > _Those 40mm PSU fans, and the PSU, are what they are
| replacing with a DC bus bar._
|
| DC (power) in the DC (building) isn't anything new: the
| telco space has used -48V (nominal) power for decades. Do a
| search for (say) "NEBS DC power" and you'll get a bunch of
| stuff on the topic.
|
| Lots of chassis-based systems centralized the AC-DC power
| supplies.
| globular-toast wrote:
| Yeah. In a home environment you should absolutely use desktop
| gear. I have 5 80mm and one 120mm PWM fans in my NAS and they
| are essentially silent as they can't be heard over the sound
| of the drives (which is essentially the noise floor for a
| NAS).
|
| It is necessary to use good PWM fans though if concerned
| about noise as cheaper ones can "tick" annoyingly. Two brands
| I know to be good in this respect are Be Quiet! and Noctua.
| DC would in theory be better but most motherboards don't
| support it (would require an external controller and thermal
| sensors I think).
| chiph wrote:
| My Synology uses two 120mm fans and you can barely hear them
| (it's on the desk next to me). I'm sold on the idea of moving
| more volume at less speed.
|
| (which I understand can't happen in a 1U or 2U chassis)
| bearjaws wrote:
| I feel like 10 years is when my drives started failing the most.
|
| I run an 8x8TB array with raidz2 redundancy. Initially it was an
| 8x2TB array, but drives started failing once every 4 months;
| after 3 drives failed I upgraded the remaining ones.
|
| The only downside to hosting your own is power consumption. OS
| upgrades have been surprisingly easy.
| leighleighleigh wrote:
| Regarding the custom PID controller script: I could have sworn
| the Linux kernel had a generic PID controller available as a
| module, which you could set up via the device tree, but I can't
| seem to find it! (Grepping for 'PID' doesn't provide very
| helpful results, lol.)
|
| I think it was used on nVidia Tegra systems, maybe? I'd be
| interested to find it again, if anyone knows. :)
| ewalk153 wrote:
| Maybe related to this?
|
| https://github.com/torvalds/linux/blob/master/tools/thermal/...
| rnxrx wrote:
| In my experience the environment where the drives are running
| makes a huge difference in longevity. There's a ton more
| variability in residential contexts than in data center (or even
| office) space. Potential temperature and humidity variability is
| a notable challenge but what surprised me was the marked effect
| of even small amounts of dust.
|
| Many years ago I was running an 8x500G array in an old Dell
| server in my basement. The drives were all factory-new Seagates -
| 7200RPM and may have been the "enterprise" versions (i.e. not
| cheap). Over 5 years I ended up averaging a drive failure every 6
| months. I ran with 2 parity drives, kept spares around and RMA'd
| the drives as they broke.
|
| I moved houses and ended up with a room dedicated to lab stuff.
| With the same setup I ended up going another 5 years without a
| single failure. It wasn't a surprise that the new environment was
| better, but it was surprising how _much_ better a cleaner, more
| stable environment ended up being.
| stavros wrote:
| How does dust affect things? The drives are airtight.
| kenhwang wrote:
| They're airtight now (at the high end or enterprise level). Not
| very long ago they weren't airtight, and had filters to regulate
| the air exchange.
| Kirby64 wrote:
| They're not airtight in the true sense (besides the helium
| filled ones nowadays), but every drive made in the past...
| 30? 40 years is airtight in the sense that no dust can ever
| get into the drive. There's a breather hole somewhere (with
| a big warning to not cover it!) to equalize pressure, and a
| filter that doesn't allow essentially any particles in.
| kenhwang wrote:
| No dust is supposed to get "in" the drive, but dust can
| very well clog the breather hole and cause pressure
| issues that could kill the drive.
| Kirby64 wrote:
| Unless you're moving the altitude of the drive
| substantially after it's already clogged, how would this
| happen? There's no air exchange on hard drives.
| kenhwang wrote:
| Unless your drives are in a perfectly controlled
| temperature, humidity, and atmospheric pressure
| environment, those will all impact the internal pressure.
| Temperature being the primary concern because drives do
| get rather warm internally while operating.
| Kirby64 wrote:
| Sure, it has some impact, but we're not talking about
| anything too crazy. And that also assumes full total
| clogging of all pores... which is unlikely to happen. You
| won't have perfect sealing and pressure will just
| equalize.
| deafpolygon wrote:
| Everything else isn't. The dust can get into power supplies and
| cause irregularities.
| Loughla wrote:
| Are your platters open to air? Or was it the cooling system?
| I'm confused.
| ylee wrote:
| >Many years ago I was running an 8x500G array in an old Dell
| server in my basement. The drives were all factory-new Seagates
| - 7200RPM and may have been the "enterprise" versions (i.e. not
| cheap). Over 5 years I ended up averaging a drive failure every
| 6 months. I ran with 2 parity drives, kept spares around and
| RMA'd the drives as they broke.
|
| Hah! I had a 16x500GB Seagate array and also averaged an RMA
| every six months. I think there was a firmware issue with that
| generation.
| kalleboo wrote:
| A drive failure every 6 months almost sounds more like dirty
| power than dust. I've always kept my NAS/file servers in dusty
| residential environments (I have a nice fuzzy gray Synology logo
| visible right now) and never seen anything like that.
| bitexploder wrote:
| Drives are sealed anyway. Humidity maybe. Dust can't really
| get in. Power or bad batch of drives.
| rkagerer wrote:
| Don't know the details, but dust could have been impeding
| the effectiveness of his fans or clumping to create other
| hotspots in the system (including in the PSU).
| userbinator wrote:
| Except for the helium-filled ones, they aren't sealed;
| there is a very fine filter that equalises atmospheric
| pressure. (This is also why they have a maximum operating
| altitude --- the head needs a certain amount of atmospheric
| pressure to float.)
| aaronmdjones wrote:
| Yup, this is why the label will say something along the
| lines of "DO NOT COVER DRIVE HOLES".
| bitexploder wrote:
| How does the helium stay in if it is not sealed? I am not
| familiar with hard drive construction, but helium is
| notoriously good at escaping.
| lorax wrote:
| I think he meant in general drives aren't sealed, except
| the helium ones are sealed.
| bitexploder wrote:
| Oh, I see. Makes sense. I wonder if dust really can
| infiltrate a drive? Hmm.
| daniel-s wrote:
| Does dust matter for SSD drives?
| earleybird wrote:
| Only when checking for finger prints :-)
| sega_sai wrote:
| It is most likely the model's fault. I once had a machine with
| 36 Seagate ST3000DM001 drives; they were failing almost once a
| month -- see the annual failure rate here:
| https://www.backblaze.com/blog/best-hard-drive-q4-2014/
| mapt wrote:
| "Do you think that's air you're breathing?"
|
| This is no longer much of an issue with sealed, helium filled
| drives, if it ever was.
| 8n4vidtmkvmk wrote:
| I run a similar but less sophisticated setup. About 18 TiB now,
| and I run it 16 hours a day. I let it sleep 8 hours per night so
| that it's well rested in the morning. I just do this on a cron
| because I'm not clever enough to SSH into a turned off (and
| unplugged!) machine.
|
| 4 drives: 42k hours (4.7 years), 27k hours (3 years), 15k hours
| (1.6 years), and the last drive I don't know because apparently
| it isn't SMART.
|
| 0 errors according to scrub process.
|
| ... but I guess I can't claim 0 HDD failures. There has been 1 or
| 2, but not for years now. Knock on wood. No data loss because of
| mirroring. I just can't lose 2 in a pair. (Never run RAID5 BTW,
| lost my whole rack doing that)
| BenjiWiebe wrote:
| Looks like you're quite clever actually, if you can get cron to
| run on a powered off unplugged machine.
|
| I think I'm missing something.
| alanfranz wrote:
| Some bios and firmware support turning on at a certain time.
| Maybe cron was a way to simplify.
| bigiain wrote:
| I use a wifi controlled powerpoint to power up and down a
| pair of raid1 backup drives.
|
| A weekly cronjob on another (always on) machine does some
| simple tests (md5 checksums of "canary files" on a few
| machines on the network) then powers up and mounts the
| drives, runs an incremental backup, waits for it to finish,
| then unmounts and powers them back down. (There's also a
| double-check cronjob that runs 3 hours later that confirms
| they are powered down, and alerts me if they aren't. My
| incrementals rarely take more than an hour.)
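|
| In outline, the script is just something like this (the
| smart-plug command, paths and hash file are placeholders for my
| actual setup):
|
|   #!/bin/bash
|   set -e
|   # 1. canary check: known-static files must still hash the same
|   md5sum --quiet -c /root/canary.md5 || {
|       echo "canary mismatch - possible cryptolocker, not powering up" \
|           | mail -s "backup aborted" root
|       exit 1
|   }
|   # 2. power up the backup drives and give them time to appear
|   power-switch on backup-drives && sleep 60
|   mount /mnt/backup
|   # 3. incremental copy; no --delete, so nothing ever disappears here
|   rsync -a /data/ /mnt/backup/data/
|   # 4. unmount and power back down
|   umount /mnt/backup
|   power-switch off backup-drives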
| 8n4vidtmkvmk wrote:
| Just power, not unplugged. It's simply:
|
| 0 2 * * * /usr/sbin/rtcwake -m off -s 28800  # off from 2am to 10am
|
| "and unplugged" was referring to OP's setup, not mine
| lvl155 wrote:
| I have a similar approach but I don't use ZFS. It's a bit
| superfluous, especially if you're using your storage
| periodically (turning it on and off). I use redundant NVMe
| drives in two stages and periodically save important data onto
| multiple HDDs (cold storage). Worth noting: it's important to
| prune your data.
|
| I also do not back up photos and videos locally. It's a major
| headache and they just take up a crap ton of space when Amazon
| Prime will give you photo storage for free.
|
| Anecdotally, the only drives that failed on me were
| enterprise-grade HDDs. And they all failed within a year, in an
| always-on system. I also think RAIDs are over-utilized and
| frankly a big money pit outside of enterprise-level
| environments.
| ed_mercer wrote:
| > Losing the system due to power shenanigans is a risk I accept.
|
| A UPS provides more than just that: it delivers constant energy
| without fluctuations and thus makes your hardware last longer.
| vunderba wrote:
| Yeah, this definitely caused me to raise an eyebrow. A UPS
| covers brownouts and obviously the occasional temporary power
| outage. All those drives spinning at full speed suddenly coming
| to a grinding halt as the power is cut, and you're quibbling
| over a paltry additional 10 watts? I can only assume that the
| data is not that important.
| flemhans wrote:
| As part of resilience testing I've been turning off our 24-drive
| backup array daily for two years, by flicking the wall switch.
| So far nothing has happened.
| snvzz wrote:
| >but for residential usage, it's totally reasonable to accept the
| risk.
|
| Polite disagree. Data integrity is the natural expectation humans
| have from computers, and thus we should stick to filesystems with
| data checksums such as ZFS, as well as ECC memory.
| naming_the_user wrote:
| I agree.
|
| I think that the author may not have experienced these sorts of
| errors before.
|
| Yes, the average person may not care about experiencing a
| couple of bit flips per year and losing the odd pixel or block
| of a JPEG, but they will care if some cable somewhere or
| transfer or bad RAM chip or whatever else manages to destroy a
| significant amount of data before they notice it.
| AndrewDavis wrote:
| I had a significant data loss years ago.
|
| I was young and only had a desktop, so all my data was there.
|
| So I purchased a 300GB external USB drive to use for periodic
| backups. It was all manually copying/pasting files across with
| no real schedule, but it was fine for the time and life was
| good.
|
| Over time my data grew and the 300GB drive wasn't large enough
| to store it all. For a while some of it wasn't backed up (I was
| young, with much less disposable income).
|
| Eventually I purchased a 500GB drive.
|
| But what I didn't know is my desktop drive was dying. Bits
| were flipping, a lot of them.
|
| So when I did my first backup with the new drive I copied all
| my data off my desktop along with the corruption.
|
| It was months before I realised a huge amount of my files
| were corrupted. By that point I'd wiped the old backup drive
| to give to my Mum to do her own backups. My data was long
| gone.
|
| Once I discovered ZFS I jumped on it. It was the exact thing
| that would have prevented this because I could have detected
| the corruption when I purchased the new backup drive and did
| the initial backup to it.
|
| (I made up the drive sizes because I can't remember, but the
| ratios will be about right).
| 369548684892826 wrote:
| There's something disturbing about the idea of silent data loss;
| it totally undermines the peace of mind of having backups. ZFS
| is good, but you can also just run rsync periodically with the
| checksum and dry-run args and check the output for diffs.
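|
| Something along these lines (paths are placeholders; -c forces a
| full read and checksum compare, so it's slow but catches silent
| corruption that size/mtime checks would miss):
|
|   rsync -a -c -i -n /tank/photos/ /mnt/backup/photos/
|
| Lines with a 'c' flag mean the file content itself differs;
| other flags are just metadata.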
| willis936 wrote:
| It happens all the time. Have a plan, perform fire drills. It's
| a lot of time and money, but there's no feeling quite like
| unfucking yourself by getting your lost, fragile data back.
| lazide wrote:
| The challenge with silent data loss is your backups will
| eventually not have the data either - it will just be
| gone, silently.
|
| After having that happen a few times (pre-ZFS), I started
| running periodic find | md5sum > log.txt type jobs and
| keeping archives.
|
| It's caught more than a few problems over the years, and allows
| manual double-checking even when using things like ZFS. In
| particular, some tools/settings just aren't sane to use to copy
| large data sets, and I only discovered that when... some of it
| didn't make it to its destination.
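|
| For anyone wanting to do the same, it's roughly (paths are
| placeholders):
|
|   cd /tank/archive
|   find . -type f -print0 | sort -z | xargs -0 md5sum \
|       > ~/hashlogs/archive-$(date +%F).md5
|
|   # later, check an old log against the live filesystem and
|   # print only the failures:
|   md5sum -c ~/hashlogs/archive-2024-01-01.md5 | grep -v ': OK$'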
| AndrewDavis wrote:
| Absolutely, if you can't use a filesystem with checksums
| (zfs, btrfs, bcachefs) then rsync is a great idea.
|
| I think filesystem checksums have one big advantage vs
| rsync. With rsync if there's a difference it isn't clear
| which one is wrong.
| bombcar wrote:
| I have an MP3 file that still skips to this day, because
| a few frames were corrupted on disk twenty years ago.
|
| I could probably find a new copy of it online, but that
| click is a good reminder about how backups aren't just
| copies but have to be verified.
| taneq wrote:
| My last spinning hard drive failed silently like this, but
| I didn't lose too much data... I think. The worst part is
| not knowing.
| mustache_kimono wrote:
| > Data integrity is the natural expectation humans have from
| computers
|
| I've said it once, and I'll say it again: the only reason ZFS
| isn't the norm is because we all once lived through a
| primordial era when it didn't exist. No serious person
| designing a filesystem today would say it's okay to misplace
| your data.
|
| Not long ago, on this forum, someone told me that ZFS is only
| good because _it had no competitors_ in its space. Which is
| kind of like saying the heavyweight champ is only good because
| no one else could compete.
| wolrah wrote:
| To paraphrase, "ZFS is the worst filesystem, except for all
| those other filesystems that have been tried from time to
| time."
|
| It's far from perfect, but it has no peers.
|
| I spent many years stubbornly using btrfs and lost data
| multiple times. Never once did the redundancy I had
| supposedly configured actually do anything to help me. ZFS
| has identified corruption caused by bad memory and a bad CPU
| and let me know immediately which files were damaged.
| Too wrote:
| The reason ZFS isn't the norm is because it historically was
| difficult to set up. Outside of NAS solutions, it's only
| since Ubuntu 20.04 it has been supported out of the box on
| any high profile customer facing OS. The reliability of the
| early versions was also questionable, with high zsys CPU usage
| and sometimes arcane commands needed to rebuild pools.
| Anecdotally, I've had to support lots of friends with ZFS
| issues, never so with other file systems. The data always comes
| back; it's just that it needs petting.
|
| Earlier, there used to be a lot of fear around the license, with
| Torvalds advising against its use, both for that reason and for
| lack of maintainers. Now I believe that has been mostly ironed
| out and it should be less of an issue.
| mustache_kimono wrote:
| > The reason ZFS isn't the norm is because it historically
| was difficult to set up. Outside of NAS solutions, it's
| only since Ubuntu 20.04 it has been supported out of the
| box on any high profile customer facing OS.
|
| In this one very narrow sense, we are agreed, if we are
| talking about Linux on root. IMHO it should also have been
| virtually everywhere else. It should have been in MacOS,
| etc.
|
| However, I think your particular comment may miss the
| forest for the trees. Yes, ZFS was difficult to set up for
| Linux, because Linux people disfavored its use (which you
| do touch upon later).
|
| People sometimes imagine that purely technical considerations
| govern the technical choices of remote groups. However, I think
| when people say "all tech is political" in the culture-warring
| American politics sense, they may be right, but they are
| absolutely right in the small-ball open source politics sense.
|
| Linux communities were convinced not to include or build
| ZFS support. Because licensing was a problem. Because btrfs
| was coming and would be better. Because Linus said ZFS was
| mostly marketing. So they didn't care to build support. Of
| course, this was all BS or FUD or NIH, but it was what happened
| - not that ZFS had a new and different recovery tool, or was
| less reliable in the arbitrary past. It was because the Linux
| community engaged in its own (successful) FUD campaign against
| another FOSS project.
| zvr wrote:
| Was there any change in the license that made you believe that
| it should be less of an issue?
|
| Or do you think people simply stopped paying attention?
| Too wrote:
| Canonical had a team of lawyers deeply review the license in
| 2016. It's beyond my legal skills to say whether the conclusion
| made it more or less of an issue; at least the boundaries should
| now be more clear, for those who understand these matters
| better.
|
| https://canonical.com/blog/zfs-licensing-and-linux
|
| https://softwarefreedom.org/resources/2016/linux-kernel-
| cddl...
| hulitu wrote:
| > The reason ZFS isn't the norm is because it historically
| was difficult to set up.
|
| Has this changed? ZFS comes with a BSD view of the world (i.e.
| slices). It also needed a sick amount of RAM to function
| properly.
| KMag wrote:
| > No serious person designing a filesystem today would say
| it's okay to misplace your data.
|
| Former LimeWire developer here... the LimeWire splash screen
| at startup was due to experiences with silent data
| corruption. We got some impossible bug reports, so we created
| a stub executable that would show a splash screen while
| computing the SHA-1 checksums of the actual application DLLs
| and JARs. Once everything checked out, that stub would use
| Java reflection to start the actual application. After moving
| to that, those impossible bug reports stopped happening. With
| 60 million simultaneous users, there were always some of them
| with silent disk corruption that they would blame on
| LimeWire.
|
| When Microsoft was offering free Win7 pre-release install
| ISOs for download, I was having install issues. I didn't want
| to get my ISO illegally, so I found a torrent of the ISO, and
| wrote a Python script to download the ISO from Microsoft, but
| use the torrent file to verify chunks and re-download any
| corrupted chunks. Something was very wrong on some device
| between my desktop and Microsoft's servers, but it eventually
| got a non-corrupted ISO.
|
| It annoys me to no end that ECC isn't the norm for all
| devices with more than 1 GB of RAM. Silent bit flips are just
| not okay.
|
| Edit: side note: it's interesting to see the number of
| complaints I still see from people who blame hard drive
| failures on LimeWire stressing their drives. From very early
| on, LimeWire allowed bandwidth limiting, which I used to keep
| heat down on machines that didn't cool their drives properly.
| Beyond heat issues that I would blame on machine vendors,
| failures from write volume I would lay at the feet of drive
| manufacturers.
|
| Though, I'm biased. Any blame for drive wear that didn't fall
| on either the drive manufacturers or the filesystem
| implementers not dealing well with random writes would
| probably fall at my feet. I'm the one who implemented
| randomized chunk order downloading in order to rapidly
| increase availability of rare content, which would increase
| the number of hard drive head seeks on non-log-based
| filesystems. I always intended to go back and (1) use
| sequential downloads if tens of copies of the file were in
| the swarm, to reduce hard drive seeks and (2) implement
| randomized downloading of rarest chunks first, rather than
| the naive randomization in the initial implementation. I say
| naive, but the initial implementation did have some logic to
| randomize chunk download order in a way to reduce the size of
| the messages that swarms used to advertise which peers had
| which chunks. As it turns out, there were always more
| pressing things to implement and the initial implementation
| was good enough.
|
| (Though, really, all read-write filesystems should be copy-
| on-write log-based, at least for recent writes, maybe having
| some background process using a count-min-sketch to estimate
| locality for frequently read data and optimize read locality
| for rarely changing data that's also frequently read.)
|
| Edit: Also, it's really a shame that TCP over IPv6 doesn't
| use CRC-32C (to intentionally use a different CRC polynomial
| than Ethernet, to catch more error patterns) to end-to-end
| checksum data in each packet. Yes, it's a layering
| abstraction violation, but IPv6 was a convenient point to
| introduce a needed change. On the gripping hand, it's
| probably best in the big picture to raise flow control,
| corruption/loss detection, retransmission (and add forward
| error correction) in libraries at the application layer (a la
| QUIC, etc.) and move everything to UDP. I was working on
| Google's indexing system infra when they switched
| transatlantic search index distribution from multiple
| parallel transatlantic TCP streams to reserving dedicated
| bandwidth from the routers and blasting UDP using rateless
| forward error codes. Provided that everyone is implementing
| responsible (read TCP-compatible) flow control, it's really
| good to have the rapid evolution possible by just using UDP
| and raising other concerns to libraries at the application
| layer. (N parallel TCP streams are useful because they
| typically don't simultaneously hit exponential backoff, so
| for long-fat networks, you get both higher utilization and
| lower variance than a single TCP stream at N times the
| bandwidth.)
| pbhjpbhj wrote:
| It sounds like a fun comp sci exercise to optimise the algo
| for randomised block download to reduce disk operations but
| maintain resilience. Presumably it would vary significantly
| by disk cache sizes.
|
| It's not my field, but my impression is that it would be
| equally resilient to just randomise the start block (adjust
| spacing of start blocks according to user bandwidth?) then
| let users just run through the download serially; maybe
| stopping when they hit blocks that have multiple sources
| and then skipping to a new start block?
|
| It's kinda mind-boggling to me to think of all the
| processes that go into a 'simple' torrent download at the
| logical level.
|
| If AIs get good enough before I die, then asking them to
| create simulations of silly things like this will probably
| keep me happy for all my spare time!
| KMag wrote:
| For the completely randomized algorithm, my initial
| prototype was to always download the first block if
| available. After that, if fewer than 4 extents
| (continuous ranges of available bytes) were downloaded
| locally, randomly chose any available block. (So, we
| first get the initial block, and 3 random blocks.) If 4
| or more extents were available locally, then always try
| the block after the last downloaded block, if available.
| (This is to minimize disk seeks.) If the next block isn't
| available, then the first fallback was to check the list
| of available blocks against the list of next blocks for
| all extents available locally, and randomly choose one of
| those. (This is to choose a block that hopefully can be
| the start of a bunch of sequential downloads, again
| minimizing disk seeks.) If the first fallback wasn't
| available, then the second fallback was to compute the
| same thing, except for the blocks before the locally
| available extents rather than the blocks after. (This is
| to avoid increasing the number of locally available
| extents if possible.) If the second fallback wasn't
| available, then the final fallback was to randomly
| uniformly pick one of the available blocks.
|
| Trying to extend locally available extents if possible
| was desirable because peers advertised block availability
| as pairs of <offset, length>, so minimizing the number of
| extents minimized network message sizes.
|
| This initial prototype algorithm (1) minimized disk seeks
| (after the initial phase of getting the first block and 3
| other random blocks) by always downloading the block
| after the previous download, if possible. (2) Minimized
| network message size for advertising available extents by
| extending existing extents if possible.
|
| Unfortunately, in simulation this initial prototype
| algorithm biased availability of blocks in rare files,
| biasing in favor of blocks toward the end of the file.
| Any bias is bad for rapidly spreading rare content, and
| bias in favor of the end of the file is particularly bad
| for audio and video file types where people like to start
| listening/watching while the file is still being
| downloaded.
|
| Instead, the algorithm in the initial production
| implementation was to first check the file extension
| against a list of extensions likely to be accessed by the
| user while still downloading (mp3, ogg, mpeg, avi, wma,
| asf, etc.).
|
| For the case where the file extension indicates the user
| is unlikely to access the content until the download is
| finished (the general case algorithm), look at the number
| of extents (contiguous ranges of bytes the user already
| has). If the number of extents is less than 4, pick any
| block randomly from the list of blocks that peers were
| offering for download. If there are 4 or more extents
| available locally, for each end of each extent available
| locally, check the block before it and the block after it
| to see if they're available for download from peers. If
| this list of available adjacent blocks is non-empty, then
| randomly choose one of those adjacent blocks for download.
| If the list of available adjacent blocks is empty, then
| uniformly randomly choose one of the blocks available
| from peers.
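|
| A rough sketch of that general-case selection in Python
| (my own reconstruction from the description above, not
| the original LimeWire code; "have" is the set of verified
| local block indices, "offered" the blocks peers advertise):
|
|     import random
|
|     def extents(have):
|         # contiguous runs of locally available blocks
|         runs, run = [], []
|         for b in sorted(have):
|             if run and b == run[-1] + 1:
|                 run.append(b)
|             else:
|                 if run:
|                     runs.append((run[0], run[-1]))
|                 run = [b]
|         if run:
|             runs.append((run[0], run[-1]))
|         return runs
|
|     def pick_block(have, offered):
|         wanted = offered - have
|         if not wanted:
|             return None
|         runs = extents(have)
|         if len(runs) < 4:
|             return random.choice(sorted(wanted))
|         # prefer blocks touching an existing extent: fewer
|         # seeks, and fewer <offset, length> pairs to send
|         adjacent = {b for start, end in runs
|                     for b in (start - 1, end + 1)
|                     if b in wanted}
|         return random.choice(sorted(adjacent or wanted))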
|
| In the case of file types likely to be viewed while being
| downloaded, it would download from the front of the file
| until the download was 50% complete, and then randomly
| either download the first needed block, or else use the
| previously described algorithm, with the probability of
| using the previous (randomized) algorithm increasing as
| the percentage of the download completed increased. There
| was also some logic to get the last few chunks of files
| very early in the download for file formats that required
| information from a file footer in order to start using
| them (IIRC, ASF and/or WMA relied on footer information
| to start playing).
|
| Internally, there was also logic to check if a chunk was
| corrupted (using a Merkle tree built with the Tiger hash
| algorithm). We would ignore the corrupted chunks when
| calculating the percentage completed, but would remove
| corrupted chunks from the list of blocks we needed to
| download, unless such removal resulted in an empty list
| of blocks needed for download. In this way, we would
| avoid re-downloading corrupted blocks unless we had
| nothing else to do. This would avoid the case where one
| peer had a corrupted block and we just kept re-requesting
| the same corrupted block from the peer as soon as we
| detected corruption. There was some logic to alert the
| user if too many corrupted blocks were detected and give
| the user options to stop the download early and delete
| it, or else to keep downloading it and just live with a
| corrupted file. I felt there should have been a third
| option to keep downloading until we had a full-but-corrupt
| download, retry downloading every corrupt block
| once, and then re-prompt the user if the file was still
| corrupt. However, this option would have resulted in more
| wasted bandwidth and likely resulted in more user
| frustration due to some of them hitting "keep trying"
| repeatedly instead of just giving up as soon as it was
| statistically unlikely they were going to get a non-
| corrupted download. Indefinite retries without prompting
| the user were a non-starter due to the amount of
| bandwidth they would waste.
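|
| The corrupted-chunk bookkeeping boils down to simple set
| maintenance; a paraphrase of that policy in Python (again
| mine, not the original code), where "good" is the set of
| verified blocks and "corrupted" the ones that failed the
| Merkle check:
|
|     def blocks_to_request(all_blocks, good, corrupted):
|         # leave corrupted blocks out so we don't keep
|         # re-requesting them, unless nothing else is left
|         missing = all_blocks - good
|         preferred = missing - corrupted
|         return preferred if preferred else missing
|
|     def percent_complete(all_blocks, good):
|         # corrupted chunks never count toward progress
|         return 100.0 * len(good) / len(all_blocks)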
| KMag wrote:
| How are the memory overheads of ZFS these days? In the old
| days, I remember balking at the extra memory required to run
| ZFS on the little ARM board I was using for a NAS.
| doublepg23 wrote:
| That was always FUD more or less. ZFS uses RAM as its
| primary cache...like every other filesystem, so if you
| have very little RAM for caching, the performance will
| degrade...like every other filesystem.
| KMag wrote:
| But if you have a single board computer with 1 GB of RAM
| and several TB of ZFS, will it just be slow, or actually
| not run? Granted, my use case was abnormal, and I was
| evaluating in the early days when there were both license
| and quality concerns with ZFS on Linux. However, my
| understanding at the time was that it wouldn't actually
| work to have several TB in a ZFS pool with 1 GB of RAM.
|
| My understanding is that ZFS has its own cache apart from
| the page cache, and the minimum cache size scales with
| the storage size. Did I misunderstand, or is my information
| outdated?
| homebrewer wrote:
| > will it just be slow
|
| This. I use it on a tiny backup server with only 1 GB of
| RAM and a 4 TB HDD pool, it's fine. Only one machine
| backs up to that server at a time, and they do that at
| network speed (which is admittedly only 100 Mb/s, but it
| would presumably go somewhat higher with a faster network).
| Restore also runs ok.
| KMag wrote:
| Thanks for this. I initially went with xfs back when
| there were license and quality concerns with zfs on Linux
| before btrfs was a thing, and moved to btrfs after btrfs
| was created and matured a bit.
|
| These days, I think I would be happier with zfs and one
| RAID-Z pool across all of the disks instead of individual
| btrfs partitions or btrfs on RAID 5.
| BSDobelix wrote:
| >That was always FUD more or less
|
| Thank you thank you, exactly this! And additionally that
| cache is compressed. In the days of 4GB machines ZFS was
| overkill but today...no problem.
| magicalhippo wrote:
| > That was always FUD more or less.
|
| To give some context: ZFS supports de-duplication, and
| until fairly recently, the de-duplication data structures
| _had_ to be resident in memory.
|
| So if you used de-duplication earlier, then yes, you
| absolutely _did_ need a certain amount of memory per byte
| stored.
|
| However, there is absolutely no requirement to use de-
| duplication, and without it the memory _requirements_ are
| just a small, fairly fixed amount.
|
| It'll store writes in memory until it commits them in a
| so-called transaction group, so you need to have room for
| that. But the limits on a transaction group are
| configurable, so you can lower the defaults.
| doublepg23 wrote:
| I don't think I came across anyone suggesting zfs dedupe
| without insisting that it was effectively broken except
| for very specific workloads.
| hulitu wrote:
| > the only reason ZFS isn't the norm is because we all once
| lived through a primordial era when it didn't exist.
|
| There were good filesystems before ZFS. I would love to have
| a versioning filesystem like Apollo had.
| Timshel wrote:
| Just checked my scrub history: for 20TB on consumer hardware
| over the last two years it repaired data twice, around 2 and
| 4 blocks each time.
|
| So not much, but with a special kind of luck those could have
| been on an encrypted archive ^^.
| nolok wrote:
| That's all fine and good until that one random lone broken
| block stops you from opening that file you really need.
| lazide wrote:
| Or in my case, a key filesystem metadata block that ruins
| everything. :s
| pbhjpbhj wrote:
| I only know about FAT but these "key file metadata
| blocks" are redundant, so you need really special double-
| plus bad luck to do that.
| Szpadel wrote:
| So I can consider myself very lucky and unlucky at the
| same time. I had data corruption on a ZFS filesystem that
| destroyed the whole pool to an unrecoverable state (ZFS
| was segfaulting while trying to import, and all the ZFS
| recovery features were crashing the ZFS module and
| required a reboot). The lucky part is that this happened
| just after (something like the next day) I had migrated
| the whole pool to another (bigger) server/pool, so that
| system was already scheduled for a full disk wipe.
| howard941 wrote:
| This happened to me too. The root cause was a bad memory
| stick.
| lazide wrote:
| It was ext4, and I've had it happen two different times -
| in fact, I've never seen it happen in a 'good', recoverable
| way.
|
| It triggered a kernel panic in every machine that I
| mounted it in, and it wasn't a media issue either. Doing
| a block level read of the media had zero issues and
| consistently returned the exact same data the 10 times I
| did it.
|
| Notably, I had the same thing happen using btrfs due to
| power issues on a Raspberry Pi (partially corrupted
| writes resulting in a completely unrecoverable
| filesystem, despite it being in 2x redundancy mode).
|
| Should it be impossible? Yes. Did it definitely, 100% for
| sure happen? You bet.
|
| I never actually lost data on ZFS, and I've done some
| terrible things to pools before that took quite a while to
| unbork, including running it under heavy write load with
| a machine with known RAM problems and no ECC.
| hulitu wrote:
| Good to know. Ext2 is much more robust against
| corruption. Or at least it was 10 years ago when I had
| kernel crashes or power failures.
| mrjin wrote:
| What you need is backup. RAID is not backup and is not for
| most home/personal users. I learnt that the hard way. Now
| my NAS uses simple volumes only; after all, I really don't
| have many things I cannot lose on it. If it's something
| really important, I have multiple copies on different
| drives, and some offline cold backup. So now if any of my
| NAS drives is about to fail, I can just copy out the data
| and replace the drive, instead of spending weeks trying to
| rebuild the RAID and ending with a total loss as multiple
| drives failed in a row. The funny thing is that, after
| moving to the simple-volumes approach, I haven't had a
| drive with even a bad sector since.
| nolok wrote:
| Oh I have backups myself. But the parent is more or less
| talking about a 71TiB NAS for residential usage and being
| able to ignore the bit rot; in that context such a person
| probably wouldn't have backups.
|
| Personally I have long since moved from RAID 5/6 to
| RAID 1 or 10 with versioned backups; at some level of
| data, RAID 5/6 just isn't cutting it anymore in case
| anything goes slightly wrong.
| louwrentius wrote:
| Most people don't run ZFS on their laptop or desktop, so let's
| not pretend it's such a huge deal.
| BSDobelix wrote:
| Most people run Windows on their laptop (without ReFS), and
| many people use paid data recovery services if something
| "important" goes missing or gets corrupted.
|
| >let's not pretend it's such a huge deal
|
| Depends on the importance of your data, right?
| louwrentius wrote:
| I bet even 99.9% of HN visitors don't run ZFS on their
| laptop/desktop. Basically we all take this risk except for
| a few dedicated nerds.
|
| Everything has a price, and people like to have their
| files uncorrupted, but not at any cost.
| rabf wrote:
| I find this thinking difficult to reconcile. When I set
| up my workstation it does usually take me half a day to
| sort out an encrypted, mirrored rootfs volume with
| zfsbootmenu + Linux, but after that it's all set for the
| next decade. A small price for the peace of mind it
| affords.
| mytailorisrich wrote:
| I think this is over the top for standard residential usage.
|
| Make sure you have good continuous backups, and perhaps RAID 1
| on your file server (if you have one) to save effort in case a
| disk fails, and you are more than covered.
| rollcat wrote:
| > we should stick to filesystems with data checksums such as
| ZFS, as well as ECC memory.
|
| While I don't disagree with this statement, consider the
| reality:
|
| - APFS has metadata checksums, but no data checksums. WTF
| Apple?
|
| - Very few Linux distributions ship zfs.ko (&spl.ko); those
| that do, theoretically face a legal risk (any kernel
| contributor could sue them for breaching the GPL); rebuilding
| the driver from source is awkward (even with e.g. DKMS), pulls
| more power, takes time, and may randomly leave your system
| unbootable (YES it happened to me once).
|
| - Linux itself explicitly treats ZFS as unsupported; loading
| the module taints the kernel.
|
| - FreeBSD is great, and is actually making great progress
| catching up with Linux on the desktop. Still, it is a catch-up
| game. I also don't want to install a system that needs to
| install another guest system to actually run the programs I
| need.
|
| - There are no practical alternatives to ZFS that even come
| close; sibling comment complains about btrfs data loss. I never
| had the guts to try btrfs in production after all the horror
| stories I've heard over the decade+.
|
| - ECC memory on laptops is practically unheard of, save for a
| couple niche Thinkpad models; and comes with large premiums on
| desktops.
|
| What are the _practical_ choices for people who _do not want
| to_ cosplay as sysadmins?
| louwrentius wrote:
| I feel that on HN people tend to be a bit pedantic about
| topics like data integrity, and in business settings I
| actually agree with them.
|
| But for residential use, risks are just different and as you
| point out, you have no options except to only use a desktop
| workstation with ECC. People like/need laptops so that's not
| realistic for most people. "Just run Linux/FreeBSD with ZFS"
| isn't reasonable advice to me.
|
| What I feel most strongly about is that it's all about
| circumstances, context and risk evaluation. And I see so many
| blanket absolutist statements that don't think about the
| reality of life and people's circumstances.
| nubinetwork wrote:
| > - Very few Linux distributions ship zfs.ko (&spl.ko); those
| that do, theoretically face a legal risk (any kernel
| contributor could sue them for breaching the GPL)
|
| > - Linux itself explicitly treats ZFS as unsupported;
| loading the module taints the kernel.
|
| So modify the ZFS source so it appears as an external GPL
| module... just don't tell anyone or distribute it...
|
| I can't say much about dracut or having to build the module
| from source... as a Gentoo user, I do it about once a month
| without any issues...
| rollcat wrote:
| > So modify the ZFS source
|
| Way to miss the point
| nubinetwork wrote:
| Not really... the complaint was over licensing and
| tainting the kernel... so just tell the kernel it's not a
| CDDL module... problem solved.
| rollcat wrote:
| > _What are the practical choices for people who do not
| want to cosplay as sysadmins?_
|
| The specific complaint is not at all about the kernel
| identifying itself as tainted, the specific complaint is
| about the kernel developers' unyielding unwillingness to
| support any scenario where ZFS is concerned, thus leaving
| one with even more "sysadmin duties". I want to _use_ my
| computer, not _serve_ it.
| defrost wrote:
| Off the shelf commodity home NAS systems with ZFS onboard?
|
| eg: https://www.qnap.com/en-au/operating-system/quts-hero
|
| IIRC (been a while since I messed with QNAP) QuTS hero would
| be a modded Debian install with ZFS baked in and a web based
| admin dashboard.
|
| https://old.reddit.com/r/qnap/comments/15b9a0u/qts_or_quts_h.
| ..
|
| As a rule of thumb (IMHO) steer clear of commodity NAS cloud
| add-ons; such things attract ransomware hackers like flies to
| a tip, whether it's QNAP, Synology, or InsertVendorHere.
| freeone3000 wrote:
| "Tainting" the kernel doesn't affect operations, though.
| You're not allowed to redistribute it with changes -- but
| you, as an entity, can freely use ZFS and the kernel together
| without restriction. Linux plus zfs works fine.
| simoncion wrote:
| > I never had the guts to try btrfs in production after all
| the horror stories I've heard over the decade+.
|
| I've been running btrfs as the primary filesystem for all of
| my desktop machines since shortly after the on-disk format
| stabilized and the extX->btrfs in-place converter appeared
| [0], and for my home servers for the past ~five years. In the
| first few years after I started using it on my desktop
| machines, I had four or five "btrfs shit the bed and trashed
| some of my data" incidents. I've had zero issues in the past
| ~ten years.
|
| At $DAYJOB we use btrfs as the filesystem for our CI workers
| and have been doing so for years. Its snapshot functionality
| makes creating the containers for CI jobs instantaneous, and
| we've had zero problems with it.
|
| I can think of a few things that might separate me from the
| folks who report issues that they've had within the past
| five-or-ten years:
|
| * I don't use ANY of the built-in btrfs RAID stuff.
|
| * I deploy btrfs ON TOP of LVM2 LVs, rather than using its
| built-in volume management stuff. [1]
|
| * I WAS going to say "I use ECC RAM", but one of my desktop
| machines does not and can never have ECC RAM, so this isn't
| likely a factor.
|
| The BTRFS features I use at home are snapshotting (for
| coherent point-in-time backups), transparent compression, the
| built-in CoW features, and the built-in checksumming
| features.
|
| At work, we use all of those except for compression, and
| don't use snapshots for backup but for container volume
| cloning.
|
| [0] If memory serves, this was around the time when the OCZ
| Vertex LE was hot, hot shit.
|
| [1] This has actually turned out to be a really cool
| decision, as it has permitted me to do low- or no- downtime
| disk replacement or repartitioning by moving live data off of
| local PVs and on to PVs attached via USB or via NBD.
| curt15 wrote:
| >At $DAYJOB we use btrfs as the filesystem for our CI
| workers and have been doing so for years. Its snapshot
| functionality makes creating the containers for CI jobs
| instantaneous, and we've had zero problems with it.
|
| $DAYJOB == "facebook"?
| simoncion wrote:
| > $DAYJOB == "facebook"?
|
| Nah. AFAIK, we don't have any Linux kernel gurus on the
| payroll... we're ordinary users just like most everyone
| else.
| why_only_15 wrote:
| I'm confused about optimizing 7 watts as important -- rough
| numbers, 7 watts is 61 kWh/y. If you assume US-average prices of
| $0.16/kWh that's about $10/year.
|
| edit: looks like for the Netherlands (where he lives) this is
| more significant -- at $0.50/kWh average, that's roughly $31/year
| yumraj wrote:
| CA averages ~$0.5/kWh.
|
| Where in the US matters a lot. I think OR is pretty cheap.
| louwrentius wrote:
| Although outside the scope of the article, I try to keep a
| small electricity usage footprint.
|
| My lowest is 85 kWh in an apartment in a month, but that was
| because of perfect solar weather.
|
| I average around 130 kWh a month now.
| ztravis wrote:
| Around 12 years ago I helped design and set up a 48-drive, 9U,
| ~120TB NAS in the Chenbro RM91250 chassis (still going strong!
| but plenty of drive failures along the way...). This looks like
| it's probably the 24-drive/4U entry in the same line (or
| similar). IIRC the fans were very noisy in their original hot-
| swappable mounts but replacing them with fixed (screw) mounts
| made a big difference. I can't tell from the picture if this has
| hot-swappable fans, though - I think I remember ours having
| purple plastic hardware.
| hi_hi wrote:
| I've had the exact same NAS for over 15 years. It's had 5 hard
| drives replaced, 2 new enclosures and 1 new power supply, but
| it's still as good as new...
| matheusmoreira wrote:
| > It's possible to create the same amount of redundant storage
| space with only 6-8 hard drives with RAIDZ2 (RAID 6) redundancy.
|
| I've given up on striped RAID. Residential use requires easy
| expandability to keep costs down. Expanding an existing parity
| stripe RAID setup involves failing every drive and slowly
| replacing them one by one with bigger capacity drives while the
| whole array is in a degraded state and incurring heavy I/O load.
| It's easier and safer to build a new one and move the data over.
| So you pretty much need to buy the entire thing up front which is
| expensive.
|
| Btrfs has a flexible allocator which makes expansion easier but
| btrfs just isn't trustworthy. I spent years waiting for RAID-Z
| expansion only for it to end up being a suboptimal solution that
| leaves the array in some kind of split parity state, old data in
| one format and new data in another format.
|
| It's just _so_ tiresome. Just give up on the "storage
| efficiency" nonsense. Make a pool of double or triple mirrors
| instead and call it a day. It's simpler to set up, easier to
| understand, more performant, allows heterogeneous pools of drives
| which lowers risk of systemic failure due to bad batches, gradual
| expansion is not only possible but actually easy and doesn't take
| literal weeks to do, avoids loading the entire pool during
| resilvering in case of failures, and it offers so much redundancy
| the only way you'll lose data is if your house literally burns
| down.
|
| https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs...
| louwrentius wrote:
| I dislike that article/advice because it's dishonest /
| downplaying a limitation of ZFS and advocating that people
| should spend a lot more money, which may well not be necessary
| at all.
| matheusmoreira wrote:
| What limitation is it downplaying? I would like to know if
| there are hidden downsides to the proposed solution.
|
| Compared to a RAID setup, this requires a lot less money.
| It's really good for residential use.
| orbital-decay wrote:
| Do you have a drive rotation schedule?
|
| 24 drives. Same model. Likely the same batch. Similar wear.
| Imagine most of them failing at the same time, and the rest
| failing as you're rebuilding it due to the increased load,
| because they're already almost at the same point.
|
| Reliable storage is tricky.
| winrid wrote:
| This. I had just two drives in RAID 1, and the 2nd drive failed
| _immediately_ after resilvering a new drive to re-create the
| array. Very lucky :D
| otras wrote:
| Reminds me of the HN outage where two SSDs both failed after
| 40k hours: https://news.ycombinator.com/item?id=32031243
| throwaway48476 wrote:
| That's a firmware bug, not wear.
| hawk_ wrote:
| Yes and risk management dictates diversification to
| mitigate this kind of risk as well.
| tcdent wrote:
| bug or feature?
| generalizations wrote:
| For one reason or another, the drives tended to age out at
| the same time. Firmware bugs are just hardware failures for
| solid state devices.
| sschueller wrote:
| Reminds me of the time back in the day when Dell shipped us a
| server with drives whose serial numbers were consecutive.
|
| Of course both failed at the same time and I spent an
| all-nighter doing a restore.
| jll29 wrote:
| I ordered my NAS drives on Amazon; to avoid getting the same
| batch (all consecutive serial numbers) I used amazon.co.uk
| for one half and amazon.de for the other half of them. One
| could also stage the orders in time.
| orbital-decay wrote:
| Yeah, the risk of the rest of the old drives failing under
| high load while rebuilding/restoring is also very real, so
| staging is necessary as well.
|
| I don't exactly hoard data by the dozens of terabytes, but I
| rotate my backup drives every few years, with a 2-year
| difference between them.
| Tempest1981 wrote:
| Back in the day, I remember driving to different Fry's and
| Central Computers stores to get a mix of manufacturing
| dates.
| madduci wrote:
| That's why you buy different drives from different stores, so
| you can reduce the chances of getting HDDs from the same batch.
| flemhans wrote:
| Drive like a maniac to the datacenter to shake 'em up a bit
| louwrentius wrote:
| I bought the drives in several batches from 2 or 3 different
| shops.
| londons_explore wrote:
| Software bugs might cause that (e.g. the drive fails after
| exactly 1 billion I/O operations due to some counter
| overflowing). But hardware wear probably won't be as
| consistent.
| lazide wrote:
| That depends entirely on how good their QA and manufacturing
| quality is - the better it is, the more likely, eh?
|
| Especially in an array where it's possible every drive
| operation will be identical between 2 or 3 different drives.
| Tinned_Tuna wrote:
| I've seen this happen to a friend. Back in the noughties they
| built a home NAS similar to the one in the article, using fewer
| (smaller) drives. It was in RAID5 configuration. It lasted
| until one drive died and a second followed it during the
| rebuild. Granted, it wasn't using ZFS, there was no regular
| scrubbing, 00s drive failure rates were probably different, and
| they didn't power it down when not using it. The point is the
| correlated failure, not the precise cause.
|
| Usual disclaimers, n=1, rando on the internet, etc.
| layer8 wrote:
| This is the reason why I would always use RAID 6. A second
| drive failing during rebuild is a significant risk.
| SlightlyLeftPad wrote:
| You're far better off having two RAID arrays, one as a daily
| backup of progressive snapshots that only turns on
| occasionally to back up and is off the rest of the time.
| tedk-42 wrote:
| Really a non-article, as it feels like an edge case for usage.
|
| It's not on 24/7.
|
| No mention of I/O metrics or data stored.
|
| For all we know, OP is storing their photos and videos and never
| actually needs to have 80% of the drives on and
| connected.
| louwrentius wrote:
| It's 80% full. I linked to the original article about the
| system for perf stats (sequential)
| fulafel wrote:
| Regular reminder: RAID (and ZFS) don't replace backups. It's an
| availability solution to reduce downtime in the event of disk
| failure. Many things can go wrong with your files and filesystem
| besides disk failure (e.g. user error, userspace software/script
| bugs, driver or FS or hardware bugs, ransomware, etc.)
|
| The article mentions backups near the end, saying e.g. "most
| of the data is not important" and the "most important" data is
| backed up. Feeling lucky, I guess.
| louwrentius wrote:
| I'm well aware of the risks, I just accept them.
|
| You shouldn't ever do what I do if you really care about your
| data.
| lonjil wrote:
| ZFS can help you with backups and data integrity beyond what
| RAID provides, though. For example, I back up to another
| machine using zfs's snapshot sending feature. Fast and
| convenient. I scrub my machine and the backup machine every
| week, so if any data has become damaged beyond repair on my
| machine, I know pretty quickly. Same with the backup machine.
| And because of the regular integrity checking on my machine,
| it's very unlikely that I accidentally back up damaged data.
| And finally, frequent snapshots are a great way to recover from
| software and some user errors.
|
| Of course, there are still dangers, but ZFS without backup is a
| big improvement over RAID, and ZFS with backups is a big
| improvement over most backup strategies.
| yread wrote:
| Nowadays you could almost fit all that on a single 61TB SSD and
| not bother with 24 disks
| tmikaeld wrote:
| And lose all of it when it fails.
| ffsm8 wrote:
| > _Losing the system due to power shenanigans is a risk I
| accept._
|
| There is another (very rare) failure a UPS protects against, and
| that's irregularities in the incoming power.
|
| You can get a spike (up or down, both can be destructive) if
| there is construction in your area and something happens with the
| electricity, or lightning hits a pylon close enough to your
| house.
|
| The first job I worked at had multiple servers die like that,
| roughly 10 years ago. It's the only time I've ever heard of
| such an issue, however.
|
| To my understanding, a UPS protects against such spikes as well,
| as it will die before letting your servers get damaged.
| Gud wrote:
| Electronics are absolutely sensitive to this.
|
| Please use filters.
| bboygravity wrote:
| Filters won't help against prolonged periods of higher/lower
| voltages though.
| Gud wrote:
| Voltages should be normalised before they hit the server's
| PSU.
| dist-epoch wrote:
| But computer equipment uses switched-mode power supplies,
| which don't care much about the input voltage, as long as
| there is enough power.
| int0x29 wrote:
| Isn't this what a surge protector is for?
| bayindirh wrote:
| Yes. In most cases, assuming you live in a 220V country, a
| surge protector will absorb the upwards spike, and the
| voltage range (a universal PSU can go as low as 107V) will
| handle the brownout voltage dip.
| Kerb_ wrote:
| Pretty sure surge protectors are less effective against dips
| than they are against spikes.
| acstapleton wrote:
| Nothing is really going to protect you from a direct
| lightning strike. Lightning strikes are on the order of
| millions of volts and thousands of amps. It will arc between
| circuits that are close enough and it will raise the ground
| voltage by thousands of volts too. You basically need a
| lightning rod grounded deep into the earth to prevent it hitting
| your house directly and then you're still probably going to
| deal with fried electronics (but your house will survive).
| Surge protectors are for faulty power supplies and much
| milder transient events on the grid and maybe a lightning
| strike a mile or so away.
| Wowfunhappy wrote:
| Would a UPS protect against that either, though?
| Kirby64 wrote:
| No. Current will find a way. Lightning will destroy
| things you didn't even think would be possible to
| destroy.
| Wowfunhappy wrote:
| So I'm still left with int0x29's original question:
| "Isn't this [an electricity spike that a UPS could
| protect against] what a surge protector is for?"
| danw1979 wrote:
| I've had firsthand experience of a lightning strike hitting
| some gear that I maintained...
|
| My parents' house got hit right on the TV antenna, which was
| connected via coax down to the booster/splitter unit in the
| comms cupboard ... then somehow it got onto the nearby network
| patch panel and fried every wired ethernet controller attached
| to the network, including those built into switch ports, APs,
| etc. In the network switch, the current destroyed the device's
| power supply too, as it was trying to get to ground, I guess.
|
| Still a bit of a mystery how it got from the coax to the cat5.
| Maybe a close parallel run the electricians put in somewhere?
|
| Total network refit required, but thankfully there were no
| wired computers on site... I can imagine storage devices
| wouldn't have fared very well.
| nuancebydefault wrote:
| Well, this is a different order of magnitude than 'spikes on
| the net'. The electrical field is so intense that current will
| easily cross large air gaps.
| manmal wrote:
| We've had such spikes in an old apartment we were living in. I
| had no servers back then, but LED lamps annoyingly failed every
| few weeks. It was an old building from the 60s and our own
| apartment had some iffy quick fixes in the installation.
| JonChesterfield wrote:
| Lightning took out a modem and some nearby hardware here about
| a week ago. Residential. The distribution of dead vs damaged vs
| nominally unharmed hardware points very directly at the copper
| wire carrying vdsl. Modem was connected via ethernet to
| everything else.
|
| I think the proper fix for that is probably to convert to
| optical, run along a fibre for a bit, then convert back. It
| seems likely that electricity will take a different route in
| preference to the glass. That turns out to be
| disproportionately annoying to spec (not a networking guy, gave
| up after an hour trying to distinguish products) so I've put a
| wifi bridge between the vdsl modem and everything else.
| Hopefully that's the failure mode contained for the next storm.
|
| Mainly posting because I have a ZFS array that was wired to the
| same modem as everything else. It seems to have survived the
| experience but that seems like luck.
| louwrentius wrote:
| True, this is also what I mean by power shenanigans.
|
| My server is off most of the time, disconnected. But even if
| it wasn't, I just accept the risk.
| ragebol wrote:
| Assuming you live in the Netherlands, judging just by your
| name: our power grid is pretty damn reliable, with few
| shenanigans. I'd take that risk indeed.
| louwrentius wrote:
| Yes, I'm in NL, indeed our grid is very reliable.
| deltarholamda wrote:
| This depends very much on the type of UPS. Big, high-dollar
| (online, double-conversion) UPSes convert the AC to DC and
| back to AC, which gives amazingly clean, pure sine wave power.
|
| The $99 850VA APC you get from Office Depot does not do this.
| It switches from AC to battery very quickly, but it doesn't
| really do power conditioning.
|
| If you can afford the good ones, they genuinely improve
| reliability of your hardware over the long term. Clean power is
| great.
| sneak wrote:
| My home NAS is about 200TB, runs 24/7, is very loud and power
| inefficient, does a full scrub every Sunday, and also hasn't had
| any drive failures. It's only been 4 or 5 years, however.
| tie-in wrote:
| We've been using a multi-TB PostgreSQL database on ZFS for quite
| a few years in production and have encountered zero problems so
| far, including no bit flips. In case anyone is interested, our
| experience is documented here:
|
| https://lackofimagination.org/2022/04/our-experience-with-po...
| lifeisstillgood wrote:
| My takeaway is that there is a difference between residential and
| industrial usage, just as there is a difference between
| residential car ownership and 24/7 taxi / industrial use
|
| And that no matter how amazing the industrial revolution has
| been, we can build reliability at the residential level but not
| the industrial level.
|
| And certainly at the price points.
|
| The whole "At FAANG scale" is a misnomer - we aren't supposed to
| use residential quality (possibly the only quality) at that scale
| - maybe we are supposed to park our cars in our garages and drive
| them on a Sunday
|
| Maybe we should keep our servers at home, just like we keep our
| insurance documents and our notebooks
| bofadeez wrote:
| I might be interested in buying storage at 1/10 of the price if
| the only tradeoff was a 5 minute wait to power on a hard drive.
| tobiasbischoff wrote:
| Let me tell you, powering these drives on and off is far more
| dangerous than just keeping them running. 10 years is well
| within the MTBF of these enterprise drives. (I worked for 10
| years as an enterprise storage technician, I saw a lot of sh*.)
| manuel_w wrote:
| Discussions on checksumming filesystems usually revolve around
| ZFS and BTRFS, but does anyone have experience with bcachefs?
| It's upstreamed in the Linux kernel, I learned, and is supposed
| to have full checksumming. The author also seems to take
| filesystem responsibility seriously.
|
| Is anyone using it around here?
|
| https://bcachefs.org/
| olavgg wrote:
| It is marked experimental, and since it was merged into the
| kernel there have been a few major issues that have been
| resolved. I wouldn't risk production data on it, but for a home
| lab it could be fine. But you need to ask yourself, how much
| time are you willing to spend if something should go wrong? I
| have also been running ZFS for 15+ years, and I've seen a lot
| of crap because of bad hardware. But with good enterprise
| hardware it has been working flawlessly.
| clan wrote:
| That was a decision Linus regretted[1]. There has been some
| recent discussion about this here on Hacker News[2].
|
| [1] https://linuxiac.com/torvalds-expresses-regret-over-
| merging-...
|
| [2] https://news.ycombinator.com/item?id=41407768
| Ygg2 wrote:
| Context. Linus regrets it because bcachefs doesn't have the
| same commitment to stability as Linux.
|
| Kent wants to fix a bug with a large PR.
|
| Linus doesn't want to merge and review a PR that touches so
| many non-bcachefs things.
|
| They're both right in a way. Kent wants bcachefs to be
| stable/work well, Linus wants Linux to be stable.
| teekert wrote:
| Edit: replied to wrong person. I agree with you.
|
| Kent from bcachefs was just late in the cycle, somewhere in
| rc5. That was indeed too late for such a huge push of new
| code touching so many things.
|
| There is some tension but there is no drama and implying so
| is annoying.
|
| Bcachefs is going places, I think I'd already choose it
| over btrfs atm.
| homebrewer wrote:
| As usual, the top comments in that submission are very
| biased. I think HN should sort comments in a random order in
| every polarizing discussion. Anyone reading this, do yourself
| a favor and dig through both links, or ignore the parent's
| comment altogether.
|
| Linus "regretted" it in the sense "it was a bit too early
| because bcachefs is moving at such a fast speed", and not in
| the sense "we got a second btrfs that eats your data for
| lunch".
|
| Please provide context and/or short human-friendly
| explanation, because I'm pretty sure most readers won't go
| further than your comment and will remember it as "Linus
| regrets merging bcachefs", helping spread FUD for years down
| the line.
| clan wrote:
| Well. Point taken. You have an important core of truth to
| your argument about polarization.
|
| But...
|
| Strongly disagree.
|
| I think that is a very unfair reading of what I wrote. I
| feel that you might have a bias which shows but that would
| be the same class of ad hominem as you have just displayed.
| That is why I chose to react even though it might be wise
| to let sleeping dogs lie. We should minimize polarization
| but not to a degree where we cannot have civilized
| disagreement. You are then doing exactly what you preach
| not to do. Is that then FUD with FUD on top? Two wrongs
| make a right?
|
| I was reacting to the implicit approval in mentioning that
| it had been upstreamed in the kernel. The reason for the
| first link. Regrets were clearly expressed.
|
| Another HN trope is rehashing the same discussions over and
| over again. That was the reason for the second link. I
| would like to avoid yet another discussion on a topic which
| was brought up less than 14 days ago. Putting that more
| bluntly would have been impolite and polarizing. Yet here I
| am.
|
| The sad part is that my point got through to you loud and
| clear. Sad because rather than simply dismissing as
| polarizing that would have been a great opener for a
| discussion. Especially in the context of ZFS and
| durability.
|
| You wrote:
|
| > Linus "regretted" it in the sense "it was a bit too early
| because bcachefs is moving at such a fast speed", and not
| in the sense "we got a second btrfs that eats your data for
| lunch".
|
| If you allow me a little lighthearted response. The first
| thing which comes to mind was the "They're the same
| picture" meme[1] from The Office. Some like to move quickly
| and break things. That is a reasonable point of view. But
| context matters. For long term data storage I am much more
| conservative. So while you might disagree; to me it is the
| exact same picture.
|
| Hence I very much object to what I feel is an ad hominem
| attack because your own worldview was not reflected
| suitably in my response. It is fair critique that you feel
| it is FUD. I do however find it warranted for a filesystem
| which is marked experimental. It might be the bee's knees
| but in my mind it is not ready for mainstream use. Yet.
|
| That is an important perspective for the OP to have. If the
| OP just want to play around all is good. If the OP does not
| mind moving quickly and break things, fine. But for
| production use? Not there yet. Not in my world.
|
| Telling people to ignore my comment because you know people
| cannot be bothered to actually read the links? And then
| lecturing me that people might take the wrong spin on it?
| Please!
|
| [1] https://knowyourmeme.com/memes/theyre-the-same-picture
| Novosell wrote:
| You're saying this like the takeaway of "Linus regrets
| merging bcachefs" is unfair when the literal quote from
| Linus is "[...] I'm starting to regret merging bcachefs."
| And earlier he says "Nobody sane uses bcachefs and expects
| it to be stable[...]".
|
| I don't understand how you can read Linus' response and
| think "Linus regrets merging bcachefs" is an unfair
| assessment.
| CooCooCaCha wrote:
| After reading the email chain I have to say my enthusiasm for
| bcachefs has diminished significantly. I had no idea Kent was
| _that_ stubborn and seems to have little respect for Linus or
| his rules.
| eru wrote:
| I'm using it. It's been ok so far, but you should have all your
| data backed up anyway, just in case.
|
| I'm trying a combination where I have an SSD (of about 2TiB) in
| front of a big hard drive (about 8 TiB) and using the SSD as a
| cache.
| rollcat wrote:
| Can't comment on bcachefs (I think it's still early), but I've
| been running with bcache in production on one "canary" machine
| for years, and it's been rock-solid.
| ffsm8 wrote:
| I tried it out on my homelab server right after the merge into
| the Linux kernel.
|
| It took roughly one week for the whole RAID to stop mounting
| because of the journal (8 HDDs, 2 SSD write cache, 2 NVMe
| read cache).
|
| The author responded on Reddit within a day, I tried his fix,
| (which meant compiling the Linux kernel and booting from that),
| but his fix didn't resolve the issue. He sadly didn't respond
| after that, so I wiped and switched back to a plain mdadm
| RAID after a few days of waiting.
|
| I had everything important backed up, obviously (though I did
| lose some unimportant data), but it did remind me that bleeding
| edge is indeed ... Unstable
|
| The setup process and features are fantastic, however; simply
| being able to add a disk and flag it as read/write cache feels
| great. I'm certain I'll give it another try in a few years,
| after it has had some more time in the oven.
| iforgotpassword wrote:
| New filesystems seem to have a chicken-and-egg problem
| really. It's not like switching from Nvidia's proprietary
| drivers to nouveau and then back if it turns out they don't
| work that well. Switching filesystems, especially in larger
| raid setups where you desperately need more testing and real
| world usage feedback, is pretty involved, and even if you
| have everything backed up it's pretty time consuming
| restoring everything should things go haywire.
|
| And even if you have the time and patience to be one of these
| early adopters, debugging any issues encountered might also
| be difficult, as ideally you want to give the devs full
| access to your filesystem for debugging and attempted fixes,
| which is obviously not always feasible.
|
| So anything beyond the most trivial setups and usage patterns
| gets a minuscule amount of testing.
|
| In an ideal world, you'd nail your FS design first try, make
| no mistakes during implementation and call it a day. I'd like
| to live in an ideal world.
| mdaniel wrote:
| > In an ideal world, you'd nail your FS design first try,
| make no mistakes during implementation and call it a day
|
| Crypto implementations and FS implementations strike me as
| the ideal audience for actually investing the mental energy
| in the healthy ecosystem we have of modeling and
| correctness verification systems
|
| Now, I readily admit that I could be talking out of my ass,
| given that I've not tried to use those verification systems
| in anger, as I am not in the crypto (or FS) authoring space
| but AWS uses formal verification for their ... fork? ... of
| BoringSSL et al https://github.com/awslabs/aws-lc-
| verification#aws-libcrypto...
| orbital-decay wrote:
| A major chunk of storage reliability is all these weird
| and unexpected failure modes and edge cases which are not
| possible to prepare for, let alone write fixed specs for.
| Software correctness assumes the underlying system
| behaves correctly and stays fixed, which is not the case.
| You can't trust the hardware and the systems are too
| diverse - this is the worst case for formal verification.
| DistractionRect wrote:
| I'm optimistic about it, but probably won't switch over my home
| lab for a while. I've had quirks with my (now legacy) zsys +
| zfs on root for Ubuntu, but since it's a common config, widely
| used for years, it's pretty easy to find support.
|
| I probably won't use bcachefs until a similar level of
| adoption/community support exists.
| wazoox wrote:
| I currently support many NAS servers in the 50TB - 2PB range,
| many of them being 10, 12, and up to 15 years old for some of
| them. Most of them still run with their original power supplies,
| motherboards and most of their original (HGST -- now WD --
| UltraStar) drives, though of course a few drives have failed for
| some of them (but not all).
|
| 2, 4, 8TB HGST UltraStar disks are particularly reliable. All of
| my desktop PCs currently host mirrors of 2009-vintage 2 TB
| drives that I got when they were put out of service. I have
| heaps of spare, good 2 TB drives (and a few hundred still
| running in production after all these years).
|
| For some reason 14TB drives seem to have a much higher failure
| rate than Helium drives of all sizes. On a fleet of only about 40
| 14 TB drives, I had more failures than on a fleet of over 1000
| 12 and 16 TB drives.
| n3storm wrote:
| After 20 years of using ext3 and ext4, I have only lost data
| when ffing around with parted and dd.
| rollcat wrote:
| There's an infinite amount of ways you can lose data. Here's
| one of my recent stories:
| https://www.rollc.at/posts/2022-05-02-disaster-recovery/
|
| Just among my "personal" stuff, over the last 12 years I've
| completely lost 4 hard drives due to age/physical failure. ZFS
| made it a no-deal twice, and aided with recovery once (once
| I'd dd'd what was left of one drive, zvol snapshots made
| "risky" experimentation cheap & easy).
| Tepix wrote:
| Having 24 drives probably offers some performance advantages, but
| if you don't require them, having a 6-bay NAS with 18TB disks
| instead would offer a ton of advantages in terms of power usage,
| noise, space required, cost and reliability.
| louwrentius wrote:
| I agree, that's what I would do today if I wanted the same
| amount of space, as I stated in the article.
| foepys wrote:
| I'd want more redundancy in that case. With such large HDDs a
| ZFS resilver could kill another disk and then you would lose your
| data.
| prepend wrote:
| 18TB drives didn't exist back when this setup was designed.
|
| Of course a 3 bay with 128TB drives would also be superior, but
| this comment only makes sense in a few years.
| prepend wrote:
| I wish he had talked more about his movie collection. I'm
| interested in the methods of selecting initial items as well as
| ones that survive in the collection for 10+ years.
| not_the_fda wrote:
| My server isn't nearly as big as his, but my collection is
| mostly Criterion https://www.criterion.com/closet-picks
| lvl155 wrote:
| Off topic but why do people run Plex and movies locally in
| 2024? Is "iTunes" that bad?
| prepend wrote:
| I don't know about other people but I run Plex because it
| lets me run my home movie collection through multiple clients
| (Apple TV, browser, phones, etc). iTunes works great with
| content bought from Apple, but is useless when playing your
| own media files from other sources.
|
| I just want every Disney movie available so my kids can watch
| without bothering me.
| bevenhall wrote:
| I wouldn't call a single-user computer a server if you can "Turn
| the server off when you're not using it." Not really a challenge
| or achievement.
| lawaaiig wrote:
| Regarding the intermittent power cutoffs during boot, it should
| be noted that the drives pull power from the 5V rail on startup:
| comparable drives typically draw up to 1.2A each, so 24 drives
| come to roughly 29A. Combined with the maximum load of 25A on
| the 5V rail (Seasonic Platinum 860W), it's likely you'll
| experience power failures during boot if staggered spinup is
| not used.
| 383toast wrote:
| Can someone explain why a single geolocated node makes sense for
| data storage? If there's a house fire for example, wouldn't all
| the data be lost?
| Fastjur wrote:
| I'm guessing that the 71 TiB is mostly used for media, as in,
| plex/jellyfin media, which is sad to lose but not
| unrecoverable. How would one ever store that much personal
| data? I hope they have an off-site backup for the all-important
| unrecoverable data like family photos and whatnot.
| leptons wrote:
| I have about 80TB (my wife's data and mine) backed up
| to LTO5 tape. It's pretty cheap to get refurb tape drives on
| ebay. I pay about $5.00/TB for tape storage, not including
| the ~$200 for the LTO drive and an HBA card, so it was pretty
| economical.
| Fastjur wrote:
| Wow that is surprisingly cheap actually.
| leptons wrote:
| I get email alerts from ebay for anything related to
| LTO-5, and I only buy the cheap used tapes. They are
| still fine; most of them are very low use. The tapes
| actually have a chip in them that stores usage data like
| a car's odometer, so you can know how much a tape has
| been used. So far I trust tapes more than I'd trust a
| refurb hard drive for backups. And I also really like
| that the tapes have a write-protect notch on them, so
| once I write my backup data, there's no risk of having a
| tape accidentally get erased, unlike if I plugged in a
| hard drive and maybe there's some ransomware virus that
| automatically fucks with any hard drive that gets
| plugged in. It's just one less thing to worry about.
| Saris wrote:
| I'm curious what your use case is for 71TB of data where you can
| also shut it down most of the time?
|
| My NAS is basically constantly in use, between video footage
| being dumped and then pulled for editing, uploading and editing
| photos, keeping my devices in sync, media streaming in the
| evening, and backups from my other devices at night..
| drzzhan wrote:
| It's always nice to know that people can store their data for so
| long. In my research lab, we still only use separate external
| HDDs due to budget reasons. Last year 4 (out of 8) drives failed
| and we lost the data. I guess we mainly work with public data so
| it is not a big deal. But it is a dream of mine to do research
| without such worries. I do keep backups for my own stuff, though
| I'm the only one in my lab who does.
___________________________________________________________________
(page generated 2024-09-14 23:01 UTC)