[HN Gopher] Understanding ZFS Channel Programs
       ___________________________________________________________________
        
       Understanding ZFS Channel Programs
        
       Author : rodrigo975
       Score  : 58 points
       Date   : 2021-09-09 06:54 UTC (16 hours ago)
        
 (HTM) web link (klarasystems.com)
 (TXT) w3m dump (klarasystems.com)
        
       | neonate wrote:
       | http://web.archive.org/web/20210908231536/https://klarasyste...
        
       | Rygian wrote:
       | I find it surprising that a filesystem feature is so tightly
       | coupled to a specific programming language. Why is it not an API
       | that lets you begin/commit a "channel program" from any language,
        | then execute zfs commands within that session? Surely speed of
        | execution is not that big a constraint.
        
         | hossbeast wrote:
         | Or, why isn't it leveraging ebpf? Seems like that would be a
         | better fit here than lua.
        
           | monocasa wrote:
           | eBPF has significantly greater restrictions on the shape of
           | code that can be run, since it needs to be able to run in
           | interrupt context.
        
           | 1over137 wrote:
           | Isn't ebpf linux only? ZFS is not.
        
           | IshKebab wrote:
           | That's Linux-specific.
        
         | derefr wrote:
         | Think of it like a GPU compute shader. Like a shader, you have
         | to submit "the whole thing" to the kernel at once, so that it
         | has the program _available_ all at once in its own context, and
          | can guarantee it won't get blocked waiting for your app to
         | submit "the rest of it."
         | 
         | To enable the client program to submit "the whole thing", you
         | really need to represent that sequence of operations as some
         | kind of _instruction stream_ loaded into a buffer -- either a
         | text-based one (source code in some language), or bytecode in
         | some ISA.
         | 
         | (Why not capture a sequence of API operations that do nothing
         | when called, but instead build such an instruction stream;
         | where "committing" them actually runs the instruction stream?
         | Y'know, like a Redis pipeline? Because a "pipeline" like that
         | can't _branch_ based on the result of executing part of it,
          | unless branching/looping/etc. are also implemented as
         | operations in this hypothetical fluent-API-DSL. At which point,
         | you're just programming with the semantics of another language,
         | with extra steps.)
         | 
         | And if a static analysis pass is going to occur to check the
         | code for runtime complexity, then you're already transforming
         | the input into a different verified output with an intermediate
         | AST-like structure; so at that point you _may as well_ accept a
          | text-based format and expose a "compile" system call. Just as
         | GPU drivers do for compute shaders.
         | 
         | I'm guessing that there's probably an internal ZFS API allowing
         | you to instead pass in some kind of bytecode (LuaJIT bytecode?)
         | and just have it directly static-analyzed and plopped into a
         | kernel handle, rather than parsed, then static-analyzed, then
         | codegen'ed. They'd definitely need something like that for
         | testing.
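          | 
          | (For reference, the shipped interface works just like
          | that: you hand the entire Lua source to the kernel in one
          | ioctl via the `zfs program` subcommand, and it gets
          | checked before it runs. A minimal sketch; the pool and
          | dataset names are made up:)
          | 
          |     -- hello.zcp: the whole program crosses into the
          |     -- kernel at once. Run as:
          |     --   zfs program tank hello.zcp tank/somefs
          |     args = ...
          |     argv = args["argv"]
          |     if zfs.exists(argv[1]) then
          |         return argv[1] .. " exists"
          |     end
          |     return argv[1] .. " does not exist"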
        
           | josephg wrote:
           | This is a tricky problem, but it's something all good
           | transactional databases solve.
           | 
           | For example, the way this is handled in foundationdb is that
           | you create a transaction, then within that transaction issue
           | a series of read and write calls. The read calls are executed
           | immediately, and they see the database as if other concurrent
           | changes weren't happening. The write calls get batched up and
           | at the end of the transaction block they are submitted
           | together - with information about which keys were read. On
           | submission, the transaction block can fail with E_RETRY -
           | which means a conflict happened, the transaction was entirely
           | aborted and the userland program should run the transaction
           | again from the start.
           | 
           | I'd absolutely adore an API like this for filesystem
           | operations. There are all sorts of half baked, dirty
           | workarounds to get around the lack of atomicity in posix
           | filesystems. Eg, that whole "write a file then rename it"
           | thing - which almost nobody implements correctly because it's
           | error prone and most correct solutions on top of posix have
           | terrible performance.
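            | 
            | (To make the shape concrete, here's what such an API
            | could look like. To be clear, this is entirely
            | hypothetical: fs_txn, txn:read, txn:write and E_RETRY
            | are invented names, sketched in Lua just to match the
            | rest of the thread.)
            | 
            |     -- Hypothetical transactional-filesystem sketch:
            |     -- reads are snapshot-isolated, writes are
            |     -- buffered, commit fails with E_RETRY on conflict.
            |     local function atomic_replace(path, contents)
            |         while true do
            |             local txn = fs_txn.begin()
            |             local old = txn:read(path)  -- immediate read
            |             txn:write(path, contents)   -- buffered write
            |             local err = txn:commit()    -- submit batch
            |             if err == nil then
            |                 return old              -- committed
            |             elseif err ~= E_RETRY then
            |                 error(err)              -- real failure
            |             end
            |             -- E_RETRY: conflict; rerun from the start
            |         end
            |     end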
        
             | derefr wrote:
             | I don't think you understand the point of the constraints
             | ZFS has put in place here. They're not there for
             | transactional isolation per se. They're there because these
             | ZFS maintenance operations run as _synchronous system calls
             | that take per-filesystem global mutex locks_. _Blocking_
             | within one, with those locks acquired, and thunking back
             | down to userspace, would mean that 1. a CPU core, and 2.
             | the filesystem itself, would both be tied up _indefinitely_
             | until that callback thunk returns. If it ever does!
             | 
             | Picture a database server where once you begin a
             | transaction, the whole DB server goes single-threaded and
             | doesn't serve any other clients until the TX completes.
             | Fundamentally, at least for its more arcane global
             | operations, that's exactly what a filesystem is. Usually
             | this doesn't matter, because even these arcane filesystem
             | operations take on the order of microseconds to complete
             | (just with lots of non-locked pre/post execution overhead
             | that inflates the total syscall time.) But a userspace
             | program can do arbitrary-much stuff in a callback. It can
             | even just get into an infinite loop, and never get back to
              | the kernel. This is why there _isn't such a thing_ as a
             | system call with userspace callbacks! (Or any API
             | equivalent to one--e.g. one with tx_begin + tx_commit
             | calls, where tx_begin acquires a global kernel lock before
             | returning to userspace, under the expectation that
             | userspace call tx_commit to release the kernel lock.)
             | 
              | On the other hand, submitting a "callback program" as a
              | whole to the kernel allows the kernel to statically
              | analyze the behavior of the "callback program" in
              | advance; and only if it is determined to be "simple"
              | (e.g. if it predictably halts after a short time, as a
              | static property) will the kernel issue the caller a
              | capability for actually running that "callback
              | program."
             | 
             | This is how compute shaders work. This is how eBPF works.
             | And this is (apparently) how this ZFS Lua thing works. They
             | all do it for similar reasons: to ensure that the program
              | is a complete, _soft-realtime practical_ unit of work to be
              | executed in some bounded-time context.
             | 
             | You do get atomicity from this, but it's not MVCC
             | atomicity. It's more like Redis's atomicity: during the
             | execution of a ZFS Lua program, all other clients making
             | system calls against the filesystem will wait on acquiring
             | the filesystem mutex, and so will block until the program
             | completes. There's never a case where other activity might
             | "interleave" with the Lua program's execution, and cause it
             | to retry. There is no other activity. The only case in
             | which the Lua program can actually fail (and so get rolled
             | back), is if it dies/is killed in the middle of execution,
              | due to e.g. a hardware power cut. _That's_ the kind of
             | atomicity that ZFS is concerned about--guaranteeing that
             | changes that make it into the filesystem journal, are
             | complete valid change units. In this case, one Lua program
             | is one change unit in the journal.
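              | 
              | (The channel program API leans into this model, too:
              | per the zfs-program man page, every zfs.sync.*
              | operation has a zfs.check.* twin that dry-runs it, so
              | a program can verify a whole batch up front and only
              | then apply it, keeping the journaled change unit
              | all-or-nothing. A sketch, with made-up dataset names:)
              | 
              |     -- Dry-run every destroy first; only mutate if
              |     -- the whole batch would succeed.
              |     targets = {"tank/a", "tank/b"}
              |     for i = 1, #targets do
              |         if zfs.check.destroy(targets[i]) ~= 0 then
              |             return "cannot destroy " .. targets[i]
              |         end
              |     end
              |     for i = 1, #targets do
              |         zfs.sync.destroy(targets[i])
              |     end
              |     return "destroyed all targets"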
        
               | grantwu wrote:
               | > Blocking within one, with those locks acquired, and
               | thunking back down to userspace, would mean that 1. a CPU
               | core, and 2. the filesystem itself, would both be tied up
               | indefinitely until that callback thunk returns. If it
               | ever does!
               | 
               | In this theoretical design, you could just block all
               | other administrative modifications. I don't think you
               | need to tie up an entire CPU core, and I'm fairly sure
               | that these zfs operations don't block regular reads and
               | writes.
               | 
               | I think you had it right in your initial comment. There's
                | no good way to express branching with an implementation
               | which incrementally submits operations to be committed as
               | a batch. You'd have to take an admin lock on an entire
               | zpool.
        
               | derefr wrote:
               | The CPU core is tied up because the original thread would
               | still be "parked" in the middle of a system call. There
               | are ways to deschedule both userspace and kernel threads,
               | but there is no mechanism to deschedule a userspace
                | thread _while_ it's executing in the middle of kernel
               | mode because of a syscall.
               | 
               | Think of it like trying to deschedule a userspace thread
               | in the middle of it having jumped to kernelspace because
               | of an interrupt handler. Just wouldn't work; not a state
               | that can be cleanly represented during a context switch
               | with a PUSHA etc.
        
         | magicalhippo wrote:
         | Given that the whole point is to perform script-like operations
         | as an atomic unit in the kernel, it seems to me you'd be
         | reinventing a lot of script-like stuff by exposing a "bare" API
         | for generating a "program" to be run.
         | 
         | For example, consider what the API must support to make a
         | channel program that lists all the child datasets of a given
         | dataset, and takes a snapshot of each child dataset that has a
         | certain ZFS property.
         | 
         | The looping cannot be done by the user program, as that would
         | introduce edge cases when datasets are created or destroyed
         | between listing and generating the snapshot commands.
         | 
         | As such, picking Lua as the API isn't a bad choice I think.
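          | 
          | (For concreteness, that example maps onto the documented
          | Lua API roughly as below. The user property name
          | com.example:backup is made up; you'd run it with something
          | like `zfs program tank snap_tagged.zcp tank/home nightly`.)
          | 
          |     -- Snapshot every child of argv[1] whose (made-up)
          |     -- com.example:backup property is "on", all within
          |     -- one atomic channel program.
          |     args = ...
          |     argv = args["argv"]
          |     root = argv[1]
          |     snapname = argv[2]
          |     taken = {}
          |     for child in zfs.list.children(root) do
          |         prop = zfs.get_prop(child, "com.example:backup")
          |         if prop == "on" then
          |             err = zfs.sync.snapshot(child .. "@" .. snapname)
          |             if err == 0 then
          |                 taken[child] = snapname
          |             end
          |         end
          |     end
          |     return taken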
        
         | Hackbraten wrote:
         | I'm not familiar with the feature at all but I wonder whether
         | your suggested alternative API would be powerful enough to
         | atomically do things like:
         | 
          | 1. Find the latest three snapshots, and then
          | 2. delete those three snapshots.
         | 
         | With a generic API, that'd take two calls, with a possible
         | TOCTOU in between so it wouldn't be atomic.
         | 
         | But if you send that same program to the ZFS subsystem, and
         | have it evaluate and execute the script, the subsystem would be
         | able to guarantee atomicity, right?
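          | 
          | (Something like that, I'd guess. A sketch against the
          | documented Lua API, assuming the table library, which I
          | believe the in-kernel interpreter ships; the listing and
          | the destroys happen inside one atomic program, so there's
          | no TOCTOU window:)
          | 
          |     -- Destroy the three most recent snapshots of argv[1].
          |     args = ...
          |     argv = args["argv"]
          |     snaps = {}
          |     for s in zfs.list.snapshots(argv[1]) do
          |         table.insert(snaps, {name = s,
          |             ctime = zfs.get_prop(s, "creation")})
          |     end
          |     -- Sort newest first; "creation" is an integer.
          |     table.sort(snaps, function(a, b)
          |         return a.ctime > b.ctime
          |     end)
          |     destroyed = {}
          |     for i = 1, #snaps do
          |         if i > 3 then break end
          |         if zfs.sync.destroy(snaps[i].name) == 0 then
          |             table.insert(destroyed, snaps[i].name)
          |         end
          |     end
          |     return destroyed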
        
           | throwaway2048 wrote:
           | it seems pretty trivial to just have some kind of "begin/end
           | atomic section" flag or function rather than jamming in a
            | whole special-purpose scripting environment.
        
             | infogulch wrote:
              |     begin()
              |     loop {}
        
               | Hackbraten wrote:
               | > If the channel program exhausts an instruction or
               | memory limit, a fatal error will be generated and the
               | program will be stopped
        
             | Hackbraten wrote:
             | But what is the API supposed to do when it comes across a
             | `begin` flag? Stop all filesystem activity in all programs
             | until yours sends the `end` flag? What if that flag never
             | arrives, e.g. due to a bug in your client code?
        
       | gbrown_ wrote:
       | > kernel operations are faster than userland operations
       | 
        | Call me a pedant, but this grinds my gears. The speed at
        | which my CPU carries out instructions is not determined by
        | whether it's in a kernel or user context. Avoiding switches
        | between those contexts certainly does mean less work, though.
        
       | openasocket wrote:
        | I have a special place in my heart for these sorts of "weird"
        | features, which seem really exotic compared to the standard POSIX
        | APIs we're all used to. I just enjoy thinking about alternatives
       | and new ways of doing and architecting things. Even if I can't
       | think of a practical use for such a feature, I always appreciate
       | the novelty. It makes me wonder about the future. Are the typical
       | POSIX-style APIs still going to be dominant 50 or 100 years from
       | now, or will new APIs and ideas start to replace them?
        
       | IshKebab wrote:
       | Does this mean there's a Lua interpreter in ZFS? That's pretty
       | wild!
        
         | ducktective wrote:
         | Try pretty _scary_
        
         | 0x000000001 wrote:
         | Appears so
         | 
         | https://svnweb.freebsd.org/base/head/sys/cddl/contrib/openso...
        
       | rbanffy wrote:
       | Not to be confused with
       | https://en.wikipedia.org/wiki/Channel_I/O#Channel_program
        
         | monocasa wrote:
         | As the article says, it's a direct reference to mainframe style
         | channel I/O.
        
           | rbanffy wrote:
           | I know, but it felt very different.
           | 
           | For starters, it doesn't run on the channel subsystem, but
           | uses the CPU for it. Also, channel programs (the IBM ones)
           | can do a lot of stuff - IBM's ISAM uses self-modifying
           | channel programs. These channels can't be that flexible
           | because they run within the kernel and being too flexible
           | would be a security risk.
        
             | monocasa wrote:
             | > For starters, it doesn't run on the channel subsystem,
             | but uses the CPU for it.
             | 
             | Low/mid range mainframes don't really make sense anymore,
             | but channel programs did run on the main CPU on quite a few
             | low/mid range mainframe systems.
             | 
             | > Also, channel programs (the IBM ones) can do a lot of
             | stuff - IBM's ISAM uses self-modifying channel programs.
             | These channels can't be that flexible because they run
             | within the kernel and being too flexible would be a
             | security risk.
             | 
             | There's no reason this couldn't do that either as long as
             | it's run in a preemptable context (which I'm pretty sure is
             | the case from looking at the example program). Schemes like
             | eBPF mainly control code flow in pursuit of the goal of
              | running in non-preemptable contexts like interrupts.
        
       ___________________________________________________________________
       (page generated 2021-09-09 23:01 UTC)