[HN Gopher] Understanding ZFS Channel Programs
___________________________________________________________________
Understanding ZFS Channel Programs
Author : rodrigo975
Score : 58 points
Date : 2021-09-09 06:54 UTC (16 hours ago)
(HTM) web link (klarasystems.com)
(TXT) w3m dump (klarasystems.com)
| neonate wrote:
| http://web.archive.org/web/20210908231536/https://klarasyste...
| Rygian wrote:
| I find it surprising that a filesystem feature is so tightly
| coupled to a specific programming language. Why is it not an API
| that lets you begin/commit a "channel program" from any language,
| then execute zfs commands within the session? Surely speed of
| execution is not so big of a constraint.
| hossbeast wrote:
| Or, why isn't it leveraging eBPF? Seems like that would be a
| better fit here than Lua.
| monocasa wrote:
| eBPF has significantly greater restrictions on the shape of
| code that can be run, since it needs to be able to run in
| interrupt context.
| 1over137 wrote:
| Isn't eBPF Linux-only? ZFS is not.
| IshKebab wrote:
| That's Linux-specific.
| derefr wrote:
| Think of it like a GPU compute shader. Like a shader, you have
| to submit "the whole thing" to the kernel at once, so that it
| has the program _available_ all at once in its own context, and
| can guarantee it won't get blocked waiting for your app to
| submit "the rest of it."
|
| To enable the client program to submit "the whole thing", you
| really need to represent that sequence of operations as some
| kind of _instruction stream_ loaded into a buffer -- either a
| text-based one (source code in some language), or bytecode in
| some ISA.
|
| (Why not capture a sequence of API operations that do nothing
| when called, but instead build such an instruction stream;
| where "committing" them actually runs the instruction stream?
| Y'know, like a Redis pipeline? Because a "pipeline" like that
| can't _branch_ based on the result of executing part of it,
| unless branching/looping/etc. are also implemented as
| operations in this hypothetical fluent-API-DSL. At which point,
| you're just programming with the semantics of another language,
| with extra steps.)
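|
| (As a concrete sketch of that limitation -- using the
| zfs.exists/zfs.sync.* calls documented in zfs-program(8), with
| made-up dataset names -- here's a mid-program branch that a
| queued pipeline couldn't express:
|
|     -- recreate a snapshot: the destroy happens only if the
|     -- check made inside the same program sees the snapshot
|     if zfs.exists("pool/data@nightly") then
|         zfs.sync.destroy("pool/data@nightly")
|     end
|     zfs.sync.snapshot("pool/data@nightly")
|
| No other client can slip in between the check and the two sync
| operations.)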
|
| And if a static analysis pass is going to occur to check the
| code for runtime complexity, then you're already transforming
| the input into a different verified output with an intermediate
| AST-like structure; so at that point you _may as well_ accept a
| text-based format and expose a "compile" system call. Just as
| GPU drivers do for compute shaders.
|
| I'm guessing that there's probably an internal ZFS API allowing
| you to instead pass in some kind of bytecode (LuaJIT bytecode?)
| and just have it directly static-analyzed and plopped into a
| kernel handle, rather than parsed, then static-analyzed, then
| codegen'ed. They'd definitely need something like that for
| testing.
| josephg wrote:
| This is a tricky problem, but it's something all good
| transactional databases solve.
|
| For example, the way this is handled in foundationdb is that
| you create a transaction, then within that transaction issue
| a series of read and write calls. The read calls are executed
| immediately, and they see the database as if other concurrent
| changes weren't happening. The write calls get batched up and
| at the end of the transaction block they are submitted
| together - with information about which keys were read. On
| submission, the transaction block can fail with E_RETRY -
| which means a conflict happened, the transaction was entirely
| aborted and the userland program should run the transaction
| again from the start.
|
| I'd absolutely adore an API like this for filesystem
| operations. There are all sorts of half-baked, dirty
| workarounds to get around the lack of atomicity in POSIX
| filesystems. E.g., that whole "write a file then rename it"
| thing - which almost nobody implements correctly because it's
| error-prone, and most correct solutions on top of POSIX have
| terrible performance.
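|
| A sketch of that shape in Lua -- begin_tx/read/write/commit
| here are hypothetical stand-ins for a real binding, not fdb's
| actual API:
|
|     -- optimistic transaction: retry from the top on conflict
|     while true do
|         local tx = begin_tx()
|         local n = tx:read("counter")   -- reads see a snapshot
|         tx:write("counter", n + 1)     -- writes are buffered
|         local ok, err = tx:commit()    -- conflicts surface here
|         if ok then break end
|         if err ~= "E_RETRY" then error(err) end
|     end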
| derefr wrote:
| I don't think you understand the point of the constraints
| ZFS has put in place here. They're not there for
| transactional isolation per se. They're there because these
| ZFS maintenance operations run as _synchronous system calls
| that take per-filesystem global mutex locks_. _Blocking_
| within one, with those locks acquired, and thunking back
| down to userspace, would mean that 1. a CPU core, and 2.
| the filesystem itself, would both be tied up _indefinitely_
| until that callback thunk returns. If it ever does!
|
| Picture a database server where once you begin a
| transaction, the whole DB server goes single-threaded and
| doesn't serve any other clients until the TX completes.
| Fundamentally, at least for its more arcane global
| operations, that's exactly what a filesystem is. Usually
| this doesn't matter, because even these arcane filesystem
| operations take on the order of microseconds to complete
| (just with lots of non-locked pre/post execution overhead
| that inflates the total syscall time.) But a userspace
| program can do arbitrary-much stuff in a callback. It can
| even just get into an infinite loop, and never get back to
| the kernel. This is why there _isn't such a thing_ as a
| system call with userspace callbacks! (Or any API
| equivalent to one--e.g. one with tx_begin + tx_commit
| calls, where tx_begin acquires a global kernel lock before
| returning to userspace, under the expectation that
| userspace will call tx_commit to release the kernel lock.)
|
| On the other hand, submitting a "callback program" as a
| whole to the kernel allows the kernel to statically
| analyze the behavior of the "callback program" in advance;
| and only if it is determined to be "simple" (e.g. if it
| predictably halts after a short time, as a static property)
| will the kernel issue the caller a capability for actually
| running that "callback program."
|
| This is how compute shaders work. This is how eBPF works.
| And this is (apparently) how this ZFS Lua thing works. They
| all do it for similar reasons: to ensure that the program
| is a complete, _soft-realtime practical_ unit of work to be
| executed in some bounded-time context.
|
| You do get atomicity from this, but it's not MVCC
| atomicity. It's more like Redis's atomicity: during the
| execution of a ZFS Lua program, all other clients making
| system calls against the filesystem will wait on acquiring
| the filesystem mutex, and so will block until the program
| completes. There's never a case where other activity might
| "interleave" with the Lua program's execution, and cause it
| to retry. There is no other activity. The only case in
| which the Lua program can actually fail (and so get rolled
| back) is if it dies or is killed in the middle of execution,
| due to e.g. a hardware power cut. _That's_ the kind of
| atomicity that ZFS is concerned about--guaranteeing that
| changes that make it into the filesystem journal are
| complete, valid change units. In this case, one Lua program
| is one change unit in the journal.
| grantwu wrote:
| > Blocking within one, with those locks acquired, and
| thunking back down to userspace, would mean that 1. a CPU
| core, and 2. the filesystem itself, would both be tied up
| indefinitely until that callback thunk returns. If it
| ever does!
|
| In this theoretical design, you could just block all
| other administrative modifications. I don't think you
| need to tie up an entire CPU core, and I'm fairly sure
| that these zfs operations don't block regular reads and
| writes.
|
| I think you had it right in your initial comment. There's
| no good way to express branching with an implementation
| that incrementally submits operations to be committed as
| a batch. You'd have to take an admin lock on an entire
| zpool.
| derefr wrote:
| The CPU core is tied up because the original thread would
| still be "parked" in the middle of a system call. There
| are ways to deschedule both userspace and kernel threads,
| but there is no mechanism to deschedule a userspace
| thread _while_ it's executing in the middle of kernel
| mode because of a syscall.
|
| Think of it like trying to deschedule a userspace thread
| in the middle of it having jumped to kernelspace because
| of an interrupt handler. Just wouldn't work; not a state
| that can be cleanly represented during a context switch
| with a PUSHA etc.
| magicalhippo wrote:
| Given that the whole point is to perform script-like operations
| as an atomic unit in the kernel, it seems to me you'd be
| reinventing a lot of script-like stuff by exposing a "bare" API
| for generating a "program" to be run.
|
| For example, consider what the API must support to make a
| channel program that lists all the child datasets of a given
| dataset, and takes a snapshot of each child dataset that has a
| certain ZFS property.
|
| The looping cannot be done by the user program, as that would
| introduce edge cases when datasets are created or destroyed
| between listing and generating the snapshot commands.
|
| As such, picking Lua as the API isn't a bad choice, I think.
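|
| For reference, that program is only a handful of lines using
| the zfs.list/zfs.get_prop/zfs.sync calls from zfs-program(8)
| (the property name and snapshot suffix below are invented):
|
|     -- run as: zfs program <pool> snap_tagged.lua <parent>
|     args = ...
|     argv = args["argv"]
|
|     for child in zfs.list.children(argv[1]) do
|         -- zfs.get_prop returns (value, source); keep the value
|         local val = zfs.get_prop(child, "com.example:autosnap")
|         if val == "on" then
|             local err = zfs.sync.snapshot(child .. "@auto")
|             if err ~= 0 then
|                 error("snapshot of " .. child .. " failed: " .. err)
|             end
|         end
|     end
|
| The listing and the snapshots all commit as one atomic unit,
| which closes the create/destroy race described above.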
| Hackbraten wrote:
| I'm not familiar with the feature at all but I wonder whether
| your suggested alternative API would be powerful enough to
| atomically do things like:
|
| 1. Find the latest three snapshots, and then
| 2. delete those three snapshots.
|
| With a generic API, that'd take two calls, with a possible
| TOCTOU in between, so it wouldn't be atomic.
|
| But if you send that same program to the ZFS subsystem, and
| have it evaluate and execute the script, the subsystem would be
| able to guarantee atomicity, right?
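|
| (For what it's worth, here's a sketch of exactly that, using
| zfs.list.snapshots/zfs.get_prop/zfs.sync.destroy from
| zfs-program(8), sorting by the createtxg property to find the
| newest snapshots:
|
|     -- run as: zfs program <pool> prune_latest.lua <dataset>
|     args = ...
|     argv = args["argv"]
|
|     -- collect every snapshot of the dataset
|     local snaps = {}
|     for s in zfs.list.snapshots(argv[1]) do
|         table.insert(snaps, s)
|     end
|
|     -- newest first, by creation transaction group
|     local function txg(s)
|         local v = zfs.get_prop(s, "createtxg")
|         return tonumber(v)
|     end
|     table.sort(snaps, function(a, b) return txg(a) > txg(b) end)
|
|     -- destroy the three newest within the same atomic unit
|     for i = 1, math.min(3, #snaps) do
|         local err = zfs.sync.destroy(snaps[i])
|         if err ~= 0 then error("destroy failed: " .. err) end
|     end
|
| Both steps happen inside one program, so nothing can change
| between the "find" and the "delete".)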
| throwaway2048 wrote:
| it seems pretty trivial to just have some kind of "begin/end
| atomic section" flag or function rather than jamming in a
| whole special-purpose scripting environment.
| infogulch wrote:
| begin() loop {}
| Hackbraten wrote:
| > If the channel program exhausts an instruction or
| memory limit, a fatal error will be generated and the
| program will be stopped
| Hackbraten wrote:
| But what is the API supposed to do when it comes across a
| `begin` flag? Stop all filesystem activity in all programs
| until yours sends the `end` flag? What if that flag never
| arrives, e.g. due to a bug in your client code?
| gbrown_ wrote:
| > kernel operations are faster than userland operations
|
| Call me a pedant, but this grinds my gears. The speed at which my
| CPU carries out instructions is not determined by whether it's in
| a kernel or user context. Avoiding switches between those contexts
| certainly means less work, though.
| openasocket wrote:
| I have a special place in my heart for these sorts of "weird"
| features, which seem really exotic compared to the standard POSIX
| APIs we're all used to. I just enjoy thinking about alternatives
| and new ways of doing and architecting things. Even if I can't
| think of a practical use for such a feature, I always appreciate
| the novelty. It makes me wonder about the future. Are the typical
| POSIX-style APIs still going to be dominant 50 or 100 years from
| now, or will new APIs and ideas start to replace them?
| IshKebab wrote:
| Does this mean there's a Lua interpreter in ZFS? That's pretty
| wild!
| ducktective wrote:
| Try pretty _scary_
| 0x000000001 wrote:
| Appears so
|
| https://svnweb.freebsd.org/base/head/sys/cddl/contrib/openso...
| rbanffy wrote:
| Not to be confused with
| https://en.wikipedia.org/wiki/Channel_I/O#Channel_program
| monocasa wrote:
| As the article says, it's a direct reference to mainframe style
| channel I/O.
| rbanffy wrote:
| I know, but it felt very different.
|
| For starters, it doesn't run on the channel subsystem, but
| uses the CPU for it. Also, channel programs (the IBM ones)
| can do a lot of stuff - IBM's ISAM uses self-modifying
| channel programs. These channels can't be that flexible
| because they run within the kernel and being too flexible
| would be a security risk.
| monocasa wrote:
| > For starters, it doesn't run on the channel subsystem,
| but uses the CPU for it.
|
| Low/mid range mainframes don't really make sense anymore,
| but channel programs did run on the main CPU on quite a few
| low/mid range mainframe systems.
|
| > Also, channel programs (the IBM ones) can do a lot of
| stuff - IBM's ISAM uses self-modifying channel programs.
| These channels can't be that flexible because they run
| within the kernel and being too flexible would be a
| security risk.
|
| There's no reason this couldn't do that either as long as
| it's run in a preemptable context (which I'm pretty sure is
| the case from looking at the example program). Schemes like
| eBPF mainly control code flow in pursuit of the goal of
| running in non-preemptable contexts like interrupts.
___________________________________________________________________
(page generated 2021-09-09 23:01 UTC)