[HN Gopher] The No-Order File System (2012)
___________________________________________________________________
The No-Order File System (2012)
Author : harporoeder
Score : 96 points
Date : 2021-01-25 07:42 UTC (1 days ago)
(HTM) web link (pages.cs.wisc.edu)
(TXT) w3m dump (pages.cs.wisc.edu)
| jzer0cool wrote:
| > Modern file systems use ordering points to maintain consistency
| in the face of system crashes.
|
| Could someone shed more light? In modern or older O/S, what
| exactly happens during the crash?
| tyingq wrote:
| I'd put it in 3 buckets of "not journaled", "metadata only
| journaled", and "data and metadata journaled".
|
| In the "not journaled" bucket, you have to hope something like
| fsck can make sense of the state of the filesystem. Often, the
| filesystem could be so "fscked" that you can't mount/boot.
|
| In the "metadata only journaled" bucket, if you use ordering,
| meaning the data is written before the metadata, you will have a
| consistent filesystem that can be mounted, albeit possibly
| missing some data.
|
| In the last bucket, you're also getting a consistent filesystem
| that can be mounted, but with less data loss. At the cost of
| some performance.
|
| I think they are saying both buckets 2 and 3 are "modern". I
| certainly encountered lots of fsck giving up in the 90's on
| "bucket 1" type filesystems.
| jzer0cool wrote:
| Thanks! Is bucket 1 also related to a hard drive that won't
| boot? Would you eventually be able to recover the MBR and the
| proper boot location, and get back to the OS?
| tyingq wrote:
| Yes, "won't boot" meaning the root filesystem isn't
| recoverable, can't fully boot without that. You could have,
| for example, a working /boot, but a dead /. In that case,
| the kernel will boot up and then panic when it can't mount
| root.
|
| A bad mbr is also possible, for reasons unrelated to any
| filesystem. And a journaling filesystem doesn't always help
| with drive errors, etc. I was trying to scope down to
| filesystem issues due to an unorderly shutdown, where the
| disk drive itself is fine.
| ro_bit wrote:
| Title could use a (2012)
| not2b wrote:
| The publications they point to are from 2012 and 2013. Did this
| work ever go anywhere after that?
| alextheparrot wrote:
| Remzi and Andrea [0] have a pretty well regarded OS book used to
| teach the OS course at Wisconsin [1-2]. Hope they're doing well,
| Andrea's course teaching Scratch to 4th and 5th graders at a
| local elementary school was a highlight of my time there.
|
| [0] I would use honorifics, but they are married which makes it a
| bit confusing.
|
| [1] Which I never took but was well regarded when I was an
| undergrad
|
| [2] http://pages.cs.wisc.edu/~remzi/OSTEP/
| pmiller2 wrote:
| If you think that's confusing, I had two professors in grad
| school who were married and shared the same last name. She had
| a PhD when I started there, so I called her "Doctor," but he
| didn't, so I called him "Steve." Well, Steve later got his PhD,
| so I was incredibly confused at that point.
|
| We collectively resolved the dilemma by all agreeing to be on a
| first name basis. Now that I've spent significant time in
| industry and have worked with more people with PhDs than were
| in my department in grad school, I've come to realize I was
| being a little bit silly. In my experience, at least once you
| get to the graduate level, anybody who insists on being called
| "doctor" seems a little full of it to me. That said, I still
| think it's appropriate for undergrads to call their professors
| "doctor," when applicable, and I was always careful to do so
| whenever I was in the presence of any undergrads.
| leetcrew wrote:
| I always just called my professors "professor {lastname}".
| that way I didn't have to keep track of who was a grad
| student or lecturer with a masters and who actually had a
| phd.
| skissane wrote:
| That works in the US, given in the US students call all
| academics "professor".
|
| However, in other countries, you would not call someone
| "professor X" unless they actually had the word "professor"
| in their job title. Here in Australia, a lot of academics
| don't - you start out as a "Lecturer", then get promoted to
| "Senior Lecturer", then "Associate Professor", then finally
| "Professor", and calling a lecturer "professor" is not
| done. And you certainly wouldn't use the word "professor"
| when addressing a PhD student.
| pmiller2 wrote:
| It's not correct at all for graduate student instructors,
| though.
| vxNsr wrote:
| > _anybody who insists on being called "doctor" seems a
| little full of it_
|
| Careful, this apparently is blasphemy these days.
| thewakalix wrote:
| Professors and politicians are fairly disjoint.
| Ericson2314 wrote:
| All I ask from storage: please give me a b-epsilon tree for
| content-addressed data. I will handle naming, mutation, etc.
| myself.
|
| The number one problem is all the current abstractions are
| stupefied, with everyone conway's law-ing around everyone else.
| Animats wrote:
| That's progress.
|
| I've previously suggested that operating systems should have
| stronger file integrity guarantees. "Unit" files (rewriting
| replaces the whole file atomically, no reader ever sees a
| partially written file). That's the default. "Log" files (always
| end at a clean end point, don't tail off into junk). "Temp" files
| (disappear on reboot). And, for databases, "Managed" files.
|
| Managed files have more I/O functions. In particular, you get two
| completion events on writes - "copy complete" (the caller can
| reuse the buffer) and "safely stored" (the data has reached its
| final resting place, all links are complete, etc.). Programs like
| databases would use that. Those are the semantics databases want,
| and struggle to get by flushing, waiting, and various
| workarounds.
|
| When I mention this, what usually happens is that people get lost
| in complicated workarounds for simulating unit files. Different
| approaches are needed for Linux, Windows, NTFS, and various VM
| systems. This should Just Work.
|
| This isn't my invention; it's from Popek's kernel in 1985 at
| UCLA, later seen as UCLA-Locus and as an IBM product. They had
| explicit commit and revert functions for file systems. I'd
| suggest having the default be commit on normal close or normal
| program exit, but if the program aborts or crashes or is killed,
| unit files don't commit and remain unchanged.
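A minimal sketch of the "unit file" behavior described above, emulated in user space rather than provided by the OS. The helper name `unit_file` and the `.unit-` temp prefix are hypothetical. It commits on a clean exit and reverts if the writer raises, relying on `os.replace` being an atomic rename within one directory. Durability (fsync) is deliberately left out here; this only illustrates the commit/revert semantics.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def unit_file(path):
    """Hypothetical helper: write to a temp file, commit on clean exit, revert on error."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".unit-")
    f = os.fdopen(fd, "wb")
    try:
        yield f
        f.flush()
        f.close()
        # Commit: readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Revert: drop the partial write and leave the original file untouched.
        f.close()
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Used as `with unit_file("config.txt") as f: f.write(b"...")`, an exception inside the block leaves the previous contents of `config.txt` in place.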
| jcranmer wrote:
| POSIX filesystem semantics are one of the things where POSIX
| just plain got stuff wrong. And it isn't helped by the fact
| that a lot of people want to game benchmarks by slightly lying
| about their durability compliance, so you really have to jump
| through hoops to make sure you actually achieve durability.
|
| I do agree with you that we need a better way to interact with
| the filesystem with regards to integrity and durability,
| although I'm not sure we entirely agree on how that would look.
| The idea of multiple modes makes a lot of sense:
|
| * File-atomic mode. This is I believe trivial to implement in
| the filesystem layer for all filesystems, and the basic idea of
| this mode has existed for decades. When a reader opens a file
| for reading, it will never see any other writes to the file. A
| writer will only update the file when it closes the file [1],
| at which point any _new_ reader will see only the new file
| created. The code is intrinsically safe in the face of multiple
| processes interacting with the file, and is probably the
| semantics most people would prefer in that situation.
|
| * Append-only transactional files. Here, you can't random-
| access write into the file (but you can random-access read),
| only write at the end or truncate the file. A writer designates
| the text to append to the file as atomic blocks: the reader
| will only atomically see or not see the block [2]. If the file
| is truncated, all readers see the original contents of the file
| until they close.
|
| * Raw files. Don't pretend that a file is a stream of bytes.
| Instead, expose it as a set of blocks that can be atomically
| updated (including atomically adding or removing blocks from
| the file at different places). I don't know filesystem
| semantics to give any good details here, but my understanding
| is that databases basically try to get these semantics today,
| and that getting good guarantees on fully random-access
| read/write semantics is effectively impossible anyways.
|
| There does feel to me to be a bit of a hole here, where you
| basically get no multiprocess interactions via files unless you
| completely change how your code works, but I'm not sure it's
| entirely feasible to have a middle ground here. You can
| probably get close enough for most needs with a way to be
| notified and reopen the file in file-atomic mode, and anything
| where that's not sufficient probably needs you to go to raw
| files to really get the guarantees you want.
|
| In addition to the basic file I/O issues, there also needs to
| be a way to be more transactional with directories, I think.
| Using paths as the basis for filesystem issues is already
| opening up programmers to time-of-check-time-of-use attacks
| today, and moving to a file descriptor-based approach for
| directory manipulation would solve that while opening up the
| possibility for better transactional support on the directory
| level.
|
| The other issue is durability. Most applications in the first
| two modes would probably be fine with an optional durability:
| the result of an unexpected power outage would be a file that
| is out of date, but not corrupt. The filesystem could provide
| an optional callback on commit that returns when it is
| durably committed, which would handle those cases where you
| do actually need to make sure that the data will be committed
| on unexpected power outage. And the simple semantics of the
| first two modes means that providing durability reliably should
| be easy for filesystems.
|
| [1] This also suggests that there should be a way to abort the
| write.
|
| [2] You can also see how a file-atomic reader can interact with
| an append-only writer: the filesystem layer needs to remember
| the size the reader first saw and pretend that's the EOF, but
| otherwise there's no issue. And append-only readers will act as
| a file-atomic reader with respect to a file-atomic writer. The
| interactions make sense, that's a good sign for the model!
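The append-only mode with atomic blocks is not something current filesystems expose. A rough user-space approximation, under the assumption that each block is framed with a length and a CRC32, is sketched below. The names `append_block` and `read_blocks` and the header layout are hypothetical. A reader stops at the first incomplete or corrupt block, so it either sees a whole block or does not see it at all.

```python
import struct
import zlib

# Assumed framing: 4-byte little-endian payload length, then 4-byte CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_block(f, payload):
    """Append one framed block to a binary file object opened for appending."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)


def read_blocks(data):
    """Return every complete, checksum-valid block; stop at a torn or corrupt tail."""
    blocks = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partially written block: treat it as not present
        blocks.append(payload)
        offset = start + length
    return blocks
```

This only covers the "see the block or not" part of the idea; the snapshot behavior for readers across truncation would still need support from the filesystem layer.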
| rodgerd wrote:
| Meanwhile ext4 is plowing even further down the path of "we get
| great benchmarks if we make data integrity a userspace problem
| LOL"
| Dylan16807 wrote:
| How so? They had that recent batch of safe optimizations, and
| trying to make fsync act closer to what it actually says is a
| pretty good thing.
| wmf wrote:
| The fact that write(), close(), rename() isn't safe unless
| you fsync is pretty annoying.
| Dylan16807 wrote:
| It is, but also isn't new.
| the8472 wrote:
| a) posix doesn't guarantee or require it b) ext4 actually
| provides a workaround for that kind of broken software
| with its _auto_da_alloc_ behavior, which is on by
| default.
| Dylan16807 wrote:
| What posix actually requires is an awful mess and just as
| broken if not moreso.
|
| Oh, you just said to sync the file, not the directory?
| Too bad, data's gone.
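The sequence discussed in this subthread (write, fsync the file, rename, then fsync the directory) can be sketched as follows. This is a POSIX-oriented illustration, not a guarantee for every filesystem or drive. The function name `durable_replace` and the `.tmp` suffix are hypothetical.

```python
import os


def durable_replace(path, data):
    """Replace `path` with `data`, syncing both the file and its directory entry."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-name convention

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the file contents to storage

    os.replace(tmp_path, path)  # atomic rename over the old file

    # Sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the first fsync risks a renamed but empty or partial file after a crash; skipping the directory fsync risks the rename itself being lost.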
| josephg wrote:
| I agree - although I see all of those cases as special cases of
| the same general action: filesystem operations are state
| machine transitions.
|
| I've been playing with the idea lately that the OS could
| support general transformation functions (eg "insert an 'A'
| character at this location", "append this log entry",
| "overwrite this byte range", etc.). Those transformation
| functions could be generic or written in wasm/BPF. The
| filesystem can process operations like this efficiently by
| storing them to a log and periodically flushing. (Or whatever
| makes sense for the operations).
|
| Having a completion API that separates "buffer can be reused"
| and "persistently flushed" is great, and should be available
| for all filesystem operations. Not just for databases!
|
| If an application wants to use traditional posix semantics,
| write() calls can just be one of the supported operation types.
|
| And as a bonus, filesystem watching can become highly granular
| - you could subscribe to the stream of semantic changes to a
| file!
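To make the "operations as state machine transitions" idea concrete, here is a toy model in which a file's contents are derived by replaying a log of operations. The operation names and tuple layout are assumptions made for illustration, and nothing here corresponds to an existing OS interface.

```python
def apply_op(state, op):
    """Apply one logged operation to a bytearray in place."""
    kind = op[0]
    if kind == "insert":
        _, offset, data = op
        state[offset:offset] = data  # shift the tail right and insert
    elif kind == "append":
        state.extend(op[1])
    elif kind == "overwrite":
        _, offset, data = op
        state[offset:offset + len(data)] = data  # replace the same number of bytes
    else:
        raise ValueError("unknown operation: %r" % (kind,))


def replay(log, initial=b""):
    """Rebuild file contents by replaying every operation in order."""
    state = bytearray(initial)
    for op in log:
        apply_op(state, op)
    return bytes(state)


log = [
    ("append", b"hello world"),
    ("insert", 0, b"A"),
    ("overwrite", 1, b"J"),
]
contents = replay(log)
```

A watcher in this model would subscribe to the log entries themselves rather than to a generic "file changed" event.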
| skissane wrote:
| > This isn't my invention; it's from Popek's kernel in 1985 at
| UCLA, later seen as UCLA-Locus and as an IBM product
|
| Which IBM product? I am guessing AIX PS/2 and AIX/370, since
| those are the two operating systems Locus Computing developed
| for IBM.
| tentacleuno wrote:
| Didn't Plan9 do something like this?
| jxy wrote:
| plan 9 likes those WORM filesystems, Venti, cwfs, or this early
| incantation https://9p.io/sys/doc/fs/fs.html
|
| I guess you can modify the WORM fs to do the same as this No-
| Order fs does.
| pwinnski wrote:
| It's a fascinating concept, but nothing seems to have happened in
| the eight years since.
|
| This seems like it might gain a bit of performance in exchange
| for slightly-higher filesystem overhead, though it's not clear
| there are any stats on either. It's also not clear how exactly
| performance increases would surface: are reads slower but writes
| faster?
| pwinnski wrote:
| I clicked through to read the paper, and some of my questions
| are answered there, but not (from my initial skim) all.
|
| My intuition was wrong: reads are generally speedy, while
| writes are a bit slower than ext3.
| tytso wrote:
| I can think of a number of reasons why this may not have gotten a
| lot of traction in the intervening eight years. One is that
| DIF/DIX disks are not easily available at reasonable prices.
|
| The other potential shortcoming is that it provides substantially
| weaker consistency guarantees than what people are used to. After
| all of the scan threads are finished with the recovery, yes, the
| file system metadata will be self-consistent; but that's all you
| can count upon.
|
| Suppose that a particular file is getting updated at the time of
| the crash. There might be several blocks that were newly
| allocated right before the crash, and a larger number of data
| blocks that were getting overwritten right before the crash.
| There is no guarantee which set of data blocks will be persistent
| across the reboot, and which newly allocated blocks will actually
| be attached to the file. There might also be newly allocated
| blocks that were attached to the file, but whose data was never
| written to the block, such that stale data (the previous contents
| of the block, which might be another user's medical data, or
| private e-mail, etc.) would become visible in the file after the
| crash.
| gray_-_wolf wrote:
| > Sorry! The URL you requested was not found on our server.
|
| The link seems to be dead :/
| bytematic wrote:
| Works for me
| chrismorgan wrote:
| 404 for me on accessing from both India and Australia.
| [deleted]
___________________________________________________________________
(page generated 2021-01-26 23:00 UTC)