[HN Gopher] The No-Order File System (2012)
       ___________________________________________________________________
        
       The No-Order File System (2012)
        
       Author : harporoeder
       Score  : 96 points
       Date   : 2021-01-25 07:42 UTC (1 days ago)
        
 (HTM) web link (pages.cs.wisc.edu)
 (TXT) w3m dump (pages.cs.wisc.edu)
        
       | jzer0cool wrote:
       | > Modern file systems use ordering points to maintain consistency
       | in the face of system crashes.
       | 
       | Could someone shed more light? In modern or older O/S, what
       | exactly happens during the crash?
        
         | tyingq wrote:
         | I'd put it in 3 buckets of "not journaled", "metadata only
         | journaled", and "data and metadata journaled".
         | 
         | In the "not journaled" bucket, you have to hope something like
         | fsck can make sense of the state of the filesystem. Often, the
         | filesystem could be so "fscked" that you can't mount/boot.
         | 
         | In the "metadata only journaled" bucket, if you use
         | order...meaning writing the data before you write the metadata,
         | you will have a consistent filesystem that can be mounted,
         | albeit possibly missing some data.
         | 
         | In the last bucket, you're also getting a consistent filesystem
         | that can be mounted, but with less data loss. At the cost of
         | some performance.
         | 
         | I think they are saying both buckets 2 and 3 are "modern". I
         | certainly encountered lots of fsck giving up in the 90's on
         | "bucket 1" type filesystems.
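         | 
         | A toy sketch of why the ordering in bucket 2 matters, assuming
         | a made-up two-step "disk" (the step names and the consistent()
         | check are illustrative, not any real filesystem's on-disk
         | format): if data always lands before the metadata that points
         | at it, a crash at any step leaves no dangling pointer.

```python
def run(steps, crash_after):
    """Apply only the first `crash_after` steps, then 'crash'."""
    disk = {"data": {}, "inode": []}
    for kind, block, payload in steps[:crash_after]:
        if kind == "data":
            disk["data"][block] = payload      # data block reaches the disk
        else:
            disk["inode"].append(block)        # metadata now points at the block
    return disk


def consistent(disk):
    """Every block the inode points at must actually have been written."""
    return all(block in disk["data"] for block in disk["inode"])


# Ordered: data first, then the metadata that references it.
ordered = [("data", 7, b"hello"), ("meta", 7, None)]
# Reversed: metadata first, so a crash can expose an unwritten block.
reversed_order = [("meta", 7, None), ("data", 7, b"hello")]

ordered_ok = all(consistent(run(ordered, k)) for k in range(len(ordered) + 1))
reversed_ok = all(
    consistent(run(reversed_order, k)) for k in range(len(reversed_order) + 1)
)
```

         | In this toy model ordered_ok holds at every crash point, while
         | the reversed sequence fails when the crash lands between the
         | two steps.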
        
           | jzer0cool wrote:
           | Thanks! Is bucket 1 also related to a hard drive that won't
           | boot? Would you eventually be able to recover the MBR and the
           | proper boot location for the OS?
        
             | tyingq wrote:
             | Yes, "won't boot" meaning the root filesystem isn't
             | recoverable, can't fully boot without that. You could have,
             | for example, a working /boot, but a dead /. In that case,
             | the kernel will boot up and then panic when it can't mount
             | root.
             | 
             | A bad mbr is also possible, for reasons unrelated to any
             | filesystem. And a journaling filesystem doesn't always help
             | with drive errors, etc. I was trying to scope down to
             | filesystem issues due to an unorderly shutdown, where the
             | disk drive itself is fine.
        
       | ro_bit wrote:
       | Title could use a (2012)
        
       | not2b wrote:
       | The publications they point to are from 2012 and 2013. Did this
       | work ever go anywhere after that?
        
       | alextheparrot wrote:
       | Remzi and Andrea [0] have a pretty well regarded OS book used to
       | teach the OS course at Wisconsin [1-2]. Hope they're doing well,
       | Andrea's course teaching Scratch to 4th and 5th graders at a
       | local elementary school was a highlight of my time there.
       | 
       | [0] I would use honorifics, but they are married which makes it a
       | bit confusing.
       | 
       | [1] Which I never took but was well regarded when I was an
       | undergrad
       | 
       | [2] http://pages.cs.wisc.edu/~remzi/OSTEP/
        
         | pmiller2 wrote:
         | If you think that's confusing, I had two professors in grad
         | school who were married and shared the same last name. She had
         | a PhD when I started there, so I called her "Doctor," but he
         | didn't, so I called him "Steve." Well, Steve later got his PhD,
         | so I was incredibly confused at that point.
         | 
         | We collectively resolved the dilemma by all agreeing to be on a
         | first name basis. Now that I've spent significant time in
         | industry and have worked with more people with PhDs than were
         | in my department in grad school, I've come to realize I was
         | being a little bit silly. In my experience, at least once you
         | get to the graduate level, anybody who insists on being called
         | "doctor" seems a little full of it to me. That said, I still
         | think it's appropriate for undergrads to call their professors
         | "doctor," when applicable, and I was always careful to do so
         | whenever I was in the presence of any undergrads.
        
           | leetcrew wrote:
           | I always just called my professors "professor {lastname}".
           | that way I didn't have to keep track of who was a grad
           | student or lecturer with a masters and who actually had a
           | phd.
        
             | skissane wrote:
             | That works in the US, given in the US students call all
             | academics "professor".
             | 
             | However, in other countries, you would not call someone
             | "professor X" unless they actually had the word "professor"
             | in their job title. Here in Australia, a lot of academics
             | don't - you start out as a "Lecturer", then get promoted to
             | "Senior Lecturer", then "Associate Professor", then finally
             | "Professor", and calling a lecturer "professor" is not
             | done. And you certainly wouldn't use the word "professor"
             | when addressing a PhD student.
        
               | pmiller2 wrote:
               | It's not correct at all for graduate student instructors,
               | though.
        
           | vxNsr wrote:
           | > _anybody who insists on being called "doctor" seems a
           | little full of it_
           | 
           | Careful, this apparently is blasphemy these days.
        
             | thewakalix wrote:
             | Professors and politicians are fairly disjoint.
        
       | Ericson2314 wrote:
       | All I ask from storage: please give me a b-epsilon tree for
       | content-addressed data. I will handle naming, mutation, etc.
       | myself.
       | 
       | The number one problem is all the current abstractions are
       | stupefied, with everyone conway's law-ing around everyone else.
        
       | Animats wrote:
       | That's progress.
       | 
       | I've previously suggested that operating systems should have
       | stronger file integrity guarantees. "Unit" files (rewriting
       | replaces the whole file atomically, no reader ever sees a
       | partially written file). That's the default. "Log" files (always
       | end at a clean end point, don't tail off into junk). "Temp" files
       | (disappear on reboot). And, for databases, "Managed" files.
       | 
       | Managed files have more I/O functions. In particular, you get two
       | completion events on writes - "copy complete" (the caller can
       | reuse the buffer) and "safely stored" (the data has reached its
       | final resting place, all links are complete, etc.). Programs like
       | databases would use that. Those are the semantics databases want,
       | and struggle to get by flushing, waiting, and various
       | workarounds.
       | 
       | When I mention this, what usually happens is that people get lost
       | in complicated workarounds for simulating unit files. Different
       | approaches are needed for Linux, Windows, NTFS, and various VM
       | systems. This should Just Work.
       | 
       | This isn't my invention; it's from Popek's kernel in 1985 at
       | UCLA, later seen as UCLA-Locus and as an IBM product. They had
       | explicit commit and revert functions for file systems. I'd
       | suggest having the default be commit on normal close or normal
       | program exit, but if the program aborts or crashes or is killed,
       | unit files don't commit and remain unchanged.
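       | 
       | A minimal userspace sketch of the "unit file" commit/revert
       | behavior described above, assuming a POSIX-like system. The
       | helper name unit_file and the ".unit-" temp prefix are
       | hypothetical; this emulates the semantics with a temp file plus
       | rename and is not how Locus implemented it.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def unit_file(path):
    """Hypothetical helper: write to a temp file, commit only on clean exit."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".unit-")
    f = os.fdopen(fd, "wb")
    try:
        yield f
        f.flush()
        os.fsync(f.fileno())        # push the new contents to stable storage
        f.close()
        os.replace(tmp_path, path)  # commit: readers see old or new, never a mix
    except BaseException:
        # Revert: the body raised, so discard the temp file and leave `path` alone.
        f.close()
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

       | If the process is killed outright, the except branch never runs
       | and the temp file is left behind, but the target is still
       | unchanged, which matches the "don't commit" half of the idea.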
        
         | jcranmer wrote:
         | POSIX filesystem semantics are one of the things where POSIX
         | just plain got stuff wrong. And it isn't helped by the fact
         | that a lot of people want to game benchmarks by slightly lying
         | about their durability compliance, so you really have to jump
         | through hoops to make sure you actually achieve durability.
         | 
         | I do agree with you that we need a better way to interact with
         | the filesystem with regards to integrity and durability,
         | although I'm not sure we entirely agree on how that would look.
         | The idea of multiple modes makes a lot of sense:
         | 
         | * File-atomic mode. This is I believe trivial to implement in
         | the filesystem layer for all filesystems, and the basic idea of
         | this mode has existed for decades. When a reader opens a file
         | for reading, it will never see any other writes to the file. A
         | writer will only update the file when it closes the file [1],
         | at which point any _new_ reader will see only the new file
         | created. The code is intrinsically safe in the face of multiple
         | processes interacting with the file, and is probably the
         | semantics most people would prefer in that situation.
         | 
         | * Append-only transactional files. Here, you can't random-
         | access write into the file (but you can random-access read),
         | only write at the end or truncate the file. A writer designates
         | the text to append to the file as atomic blocks: the reader
         | will only atomically see or not see the block [2]. If the file
         | is truncated, all readers see the original contents of the file
         | until they close.
         | 
         | * Raw files. Don't pretend that a file is a stream of bytes.
         | Instead, expose it as a set of blocks that can be atomically
         | updated (including atomically adding or removing blocks from
         | the file at different places). I don't know filesystem
         | semantics to give any good details here, but my understanding
         | is that databases basically try to get these semantics today,
         | and that getting good guarantees on fully random-access
         | read/write semantics is effectively impossible anyways.
         | 
         | There does feel to me to be a bit of a hole here, where you
         | basically get no multiprocess interactions via files unless you
         | completely change how your code works, but I'm not sure it's
         | entirely feasible to have a middle ground here. You can
         | probably get close enough for most needs with a way to be
         | notified and reopen the file in file-atomic mode, and anything
         | where that's not sufficient probably needs you to go to raw
         | files to really get the guarantees you want.
         | 
         | In addition to the basic file I/O issues, there also needs to
         | be a way to be more transactional with directories, I think.
         | Using paths as the basis for filesystem issues is already
         | opening up programmers to time-of-check-time-of-use attacks
         | today, and moving to a file descriptor-based approach for
         | directory manipulation would solve that while opening up the
         | possibility for better transactional support on the directory
         | level.
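         | 
         | A small sketch of the file descriptor-based approach, assuming
         | Linux and Python's dir_fd support (the openat/renameat
         | family). The helper name replace_in_dir and the ".tmp" suffix
         | are hypothetical. Names are resolved relative to an open
         | directory handle, so renaming the directory's path mid-way
         | does not redirect the operation.

```python
import os


def replace_in_dir(dir_fd: int, name: str, data: bytes) -> None:
    """Hypothetical helper: replace `name` inside the directory behind dir_fd."""
    tmp = name + ".tmp"
    # Open relative to the directory handle instead of re-resolving a path.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644, dir_fd=dir_fd)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # Both names are looked up relative to the same open directory.
    os.rename(tmp, name, src_dir_fd=dir_fd, dst_dir_fd=dir_fd)
```

         | This only removes the path re-resolution race; it adds no
         | transactional guarantees across several directory operations.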
         | 
         | The other issue is durability. Most applications in the first
         | two modes would probably be fine with an optional durability:
         | the result of an unexpected power outage would be a file that
         | is out of date, but not corrupt. The filesystem could provide
         | an optional callback on commit that returns when it is
         | durably committed, which would handle those cases where you
         | do actually need to make sure that the data will be committed
         | on unexpected power outage. And the simple semantics of the
         | first two modes means that providing durability reliably should
         | be easy for filesystems.
         | 
         | [1] This also suggests that there should be a way to abort the
         | write.
         | 
         | [2] You can also see how a file-atomic reader can interact with
         | an append-only writer: the filesystem layer needs to remember
         | the size the reader first saw and pretend that's the EOF, but
         | otherwise there's no issue. And append-only readers will act as
         | a file-atomic reader with respect to a file-atomic writer. The
         | interactions make sense, that's a good sign for the model!
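         | 
         | One way to approximate the append-only "atomic block" mode in
         | userspace today is to frame each block with a length and a
         | checksum, so a reader treats a torn tail as absent. This is a
         | sketch under that assumption; the header layout and the names
         | append_block and read_blocks are made up for illustration.

```python
import struct
import zlib

# Hypothetical frame header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_block(f, payload: bytes) -> None:
    """Append one framed block to a binary file-like object."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)


def read_blocks(data: bytes):
    """Return every complete block; stop at the first torn or corrupt one."""
    blocks, pos = [], 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or damaged block: treat it as not written
        blocks.append(payload)
        pos = start + length
    return blocks
```

         | A block is either returned whole or not at all, which is the
         | reader-side half of the guarantee; the filesystem-level version
         | proposed above would not need the checksum pass.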
        
         | rodgerd wrote:
         | Meanwhile ext4 is plowing even further down the path of "we get
         | great benchmarks if we make data integrity a userspace problem
         | LOL"
        
           | Dylan16807 wrote:
           | How so? They had that recent batch of safe optimizations, and
           | trying to make fsync act closer to what it actually says is a
           | pretty good thing.
        
             | wmf wrote:
             | The fact that write(), close(), rename() isn't safe unless
             | you fsync is pretty annoying.
        
               | Dylan16807 wrote:
               | It is, but also isn't new.
        
               | the8472 wrote:
               | a) posix doesn't guarantee or require it b) ext4 actually
               | provides a workaround for that kind of broken software
               | with its _auto_da_alloc_ behavior, which is on by
               | default.
        
               | Dylan16807 wrote:
               | What posix actually requires is an awful mess and just as
               | broken if not moreso.
               | 
               | Oh, you just said to sync the file, not the directory?
               | Too bad, data's gone.
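               | 
               | A sketch of the full sequence being complained about,
               | assuming Linux/POSIX behavior: fsync the file before
               | the rename, then fsync the containing directory. The
               | function name durable_replace and the ".tmp" suffix
               | are hypothetical.

```python
import os


def durable_replace(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` so the update survives a crash."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write fewer bytes than asked
            view = view[os.write(fd, view):]
        os.fsync(fd)                     # 1. file contents reach stable storage
    finally:
        os.close(fd)
    os.rename(tmp_path, path)            # 2. swap the new file into place
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                 # 3. persist the directory entry itself
    finally:
        os.close(dir_fd)
```

               | Skipping step 1 risks the rename landing before the
               | data does; skipping step 3 risks losing the rename.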
        
         | josephg wrote:
         | I agree - although I see all of those cases as special cases of
         | the same general action: filesystem operations are state
         | machine transitions.
         | 
         | I've been playing with the idea lately that the OS could
         | support general transformation functions (eg "insert an 'A'
         | character at this location", "append this log entry",
         | "overwrite this byte range", etc.). Those transformation
         | functions could be generic or written in wasm/BPF. The
         | filesystem can process operations like this efficiently by
         | storing them to a log and periodically flushing. (Or whatever
         | makes sense for the operations).
         | 
         | Having a completion API that separates "buffer can be reused"
         | and "persistently flushed" is great, and should be available
         | for all filesystem operations. Not just for databases!
         | 
         | If an application wants to use traditional posix semantics,
         | write() calls can just be one of the supported operation types.
         | 
         | And as a bonus, filesystem watching can become highly granular
         | - you could subscribe to the stream of semantic changes to a
         | file!
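         | 
         | A rough sketch of a completion API that separates "buffer can
         | be reused" from "persistently flushed", emulated in userspace
         | with a worker thread and two futures. The name managed_write
         | is hypothetical, and no existing OS interface is implied.

```python
import os
import threading
from concurrent.futures import Future


def managed_write(fd: int, data):
    """Hypothetical API: returns (copied, durable) futures for one write."""
    copied, durable = Future(), Future()

    def worker():
        try:
            private = bytes(data)            # take a private copy of the buffer
            copied.set_result(len(private))  # caller may now reuse its buffer
            view = memoryview(private)
            while view:                      # handle short writes
                view = view[os.write(fd, view):]
            os.fsync(fd)                     # wait for stable storage
            durable.set_result(len(private)) # data is now safely stored
        except BaseException as exc:
            for fut in (copied, durable):
                if not fut.done():
                    fut.set_exception(exc)

    threading.Thread(target=worker, daemon=True).start()
    return copied, durable
```

         | A caller that only needs its buffer back waits on the first
         | future; a database-style caller waits on the second.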
        
         | skissane wrote:
         | > This isn't my invention; it's from Popek's kernel in 1985 at
         | UCLA, later seen as UCLA-Locus and as an IBM product
         | 
         | Which IBM product? I am guessing AIX PS/2 and AIX/370, since
         | those are the two operating systems Locus Computing developed
         | for IBM.
        
       | tentacleuno wrote:
       | Didn't Plan9 do something like this?
        
         | jxy wrote:
         | plan 9 likes those WORM filesystems, Venti, cwfs, or this early
         | incarnation https://9p.io/sys/doc/fs/fs.html
         | 
         | I guess you can modify the WORM fs to do the same as this No-
         | Order fs does.
        
       | pwinnski wrote:
       | It's a fascinating concept, but nothing seems to have happened in
       | the eight years since.
       | 
       | This seems like it might gain a bit of performance in exchange
       | for slightly-higher filesystem overhead, though it's not clear
       | there are any stats on either. It's also not clear how exactly
       | performance increases would surface: are reads slower but writes
       | faster?
        
         | pwinnski wrote:
         | I clicked through to read the paper, and some of my questions
         | are answered there, but not (from my initial skim) all.
         | 
         | My intuition was wrong: reads are generally speedy, while
         | writes are a bit slower than ext3.
        
       | tytso wrote:
       | I can think of a number of reasons why this may not have gotten a
       | lot of traction in the intervening eight years. One is that
       | DIF/DIX disks are not easily available at reasonable prices.
       | 
       | The other potential shortcoming is that it provides substantially
       | weaker consistency guarantees than what people are used to. After
       | all of the scan threads are finished with the recovery, yes, the
       | file system metadata will be self-consistent; but that's all you
       | can count upon.
       | 
       | Suppose that a particular file is getting updated at the time of
       | the crash. There might be several blocks that were newly
       | allocated right before the crash, and a larger number of data
       | blocks that were getting overwritten right before the crash.
       | There is no guarantee which set of data blocks will be persistent
       | across the reboot, and which newly allocated blocks will actually
       | be attached to the file. There might also be newly allocated
       | blocks that were attached to the file, but the data might not be
       | written to the block, such that stale data (the previous contents
       | of the block, which might be another user's medical data, or
       | private e-mail, etc.) would become visible in the file across
       | the crash.
        
       | gray_-_wolf wrote:
       | > Sorry! The URL you requested was not found on our server.
       | 
       | The link seems to be dead :/
        
         | bytematic wrote:
         | Works for me
        
           | chrismorgan wrote:
           | 404 for me on accessing from both India and Australia.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-01-26 23:00 UTC)