[HN Gopher] BPF for storage: an exokernel-inspired approach [pdf]
       ___________________________________________________________________
        
       BPF for storage: an exokernel-inspired approach [pdf]
        
       Author : gbrown_
       Score  : 90 points
       Date   : 2021-03-27 12:31 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jankotek wrote:
        | Many databases can bypass the filesystem and use a block
        | device (partition) directly. But modifying NVMe drivers is a
        | pretty novel approach. Maybe Redis compiled into the Linux
        | kernel is not a bad idea :)
        
       | appleflaxen wrote:
       | Holy crap this is a cool idea.
        
       | musicale wrote:
       | This kind of nails what has been obvious to me for a long time:
       | eBPF is basically the revenge of exokernels.
        
       | nynx wrote:
       | Exokernels coming back would be so cool.
       | 
       | I wrote a prototype kernel/os [0] a few years back that ran wasm
       | in kernel mode. I'd love to see that idea pursued further.
       | 
       | [0]: https://github.com/nebulet/nebulet
        
         | jFriedensreich wrote:
          | I really loved the idea of nebulet! Do you think it is
          | possible to reach the same security guarantees as eBPF with
          | a feature set like wasm? What other big challenges did you
          | encounter?
        
           | nynx wrote:
            | I think the main issue is that eBPF cannot have unbounded
            | loops, which restricts the programs you can write. Wasm
            | does not have that restriction, so you cannot prove that
            | a wasm program will terminate.
        
             | rkeene2 wrote:
             | To be clear, eBPF programs can have loops -- they're just
             | jumps that have negative offsets (which are signed 16-bit
             | numbers), but for security reasons many verifiers do not
             | allow them so they can ensure that the program halts.
        
       | kijiki wrote:
       | Everything old is new again.
       | 
       | https://en.wikipedia.org/wiki/Channel_I/O
        
       | hardwaresofton wrote:
       | I wonder if Unikernels would also see this speedup, though
       | they're so much harder to implement/use that I assume BPF-powered
       | storage will see mass adoption before unikernels do...
        
       | wtallis wrote:
       | There's also some interest in making eBPF the standard for
       | computational storage [1]: offloading processing to the SSD
       | controller itself. Many vendors have found that it's easy to add
       | a few extra ARM cores or some fixed-function accelerators to a
       | high-end enterprise SSD controller, but the software ecosystem is
       | lacking and fragmented.
       | 
       | This work may be a very complementary approach. Using eBPF to
       | move some processing into the kernel should make it easier to
       | later move it off the CPU entirely and into the storage device
       | itself.
       | 
       | [1] https://nvmexpress.org/bringing-compute-to-storage-with-
       | nvm-...
        
         | toolslive wrote:
          | Wasn't it the other way around? The vendors got extra
          | cores from their suppliers and started asking themselves
          | what to do with them?
        
           | wtallis wrote:
           | By "vendors", I meant drive and controller vendors rather
           | than server/appliance vendors. They're looking for ways to
           | differentiate their products and offer more value add, but
           | extending an SSD's functionality beyond mere block storage
           | (potentially with transparent compression or encryption)
           | requires a lot of buy-in from end users who need to write
           | software to use such capabilities.
        
             | toolslive wrote:
                | I meant the Seagates and WDs of this world. Around
                | 2014 they also had HDDs with an extra core, which I
                | suspect they just got for free from their supplier
                | with the message "look, we don't make single-core
                | CPUs anymore, here's a dual core for the same
                | price".
        
               | monocasa wrote:
                | WD and Seagate are their own chip vendors. They
                | design their own custom SoCs.
        
               | toolslive wrote:
                | In the case of Seagate's Kinetic drives, one core was
                | used by the controller and the other 'extra' core was
                | used to manage a key-value store on the drive. These
                | were ARM processors; I don't think they make those
                | themselves.
        
               | monocasa wrote:
               | You would be wrong. I work with guys who used to do SoC
               | design for both Seagate and WD.
               | 
                | This is how WD was able to jump to RISC-V so quickly;
                | they didn't have any suppliers they needed to
                | negotiate with, and the market for Cortex-R-style
                | in-order, single-issue, fully pipelined real-time
                | cores is sort of a commodity.
        
               | toolslive wrote:
               | Well, that's why I was asking. Thx!
        
         | mgerdts wrote:
         | The draft spec is here:
         | 
         | https://www.snia.org/sites/default/files/technical_work/Publ...
        
         | hinkley wrote:
         | Can I get Postgres running on a RAID controller while we're at
         | it?
        
           | lucasvr_br wrote:
           | Samsung's SmartSSD comes with a Xilinx FPGA accelerator that
           | can be used to execute code without having to move data
           | outside the device. And their open source SDK includes
           | domain-specific routines including databases.
           | 
           | See https://xilinx.github.io/Vitis_Libraries/ for details on
           | their software stack.
        
           | sp332 wrote:
           | It's not SQL, but Samsung made a Key Value SSD that uses
           | short keys instead of "addresses" to index blocks of data.
        
           | lambda_obrien wrote:
           | Imagine an SSD serving up query responses! I love it.
        
             | wmf wrote:
             | Not too far from Netezza.
        
         | nynx wrote:
            | I'd prefer it to use SPIR-V or wasm. eBPF is
            | intentionally an extremely limited language.
        
           | KMag wrote:
            | But trivially provable termination is a really nice
            | property if you don't want your hypervisor to have to
            | trust every guest kernel.
        
           | benlwalker wrote:
            | eBPF the bytecode is not particularly limited; you can
            | even parse complex formats like Arrow or Parquet. The
            | Linux kernel overlays a verifier on top of it which adds
            | all sorts of draconian restrictions (for good reason).
            | When people talk of eBPF they don't always mean to
            | include the Linux verifier limitations as well. In
            | particular, the nvmexpress working group link in the
            | parent post does not say one way or the other.
        
             | nynx wrote:
             | Why not use a different bytecode that is already more
             | common, in that case?
        
               | monocasa wrote:
                | Because eBPF is designed for verification under
                | constrained circumstances like kernels, while still
                | allowing easy JITing that is almost a 1-to-1
                | translation: no register allocation needed,
                | two-address instructions, etc.
        
         | pjmlp wrote:
         | So basically back to SCSI.
        
           | wtallis wrote:
           | I'm not seeing any new parallels to SCSI, beyond the
           | similarities that have long existed between NVMe and SCSI.
           | What kind of programmable functionality over SCSI are you
           | referring to?
        
             | pjmlp wrote:
              | SCSI devices use their own controllers that receive a
              | set of commands and take it from there for data
              | retrieval operations.
             | 
             | https://en.wikipedia.org/wiki/SCSI_command
        
               | wtallis wrote:
               | SCSI uses simple fixed command sets, the same as standard
               | NVMe drives. As far as I'm aware, SCSI doesn't have any
               | way to chain commands together, so you can't really do
               | anything more complex than predefined commands like copy
               | or compare and write. It's nothing at all analogous to an
               | actual programmable offload engine like you'd get with
               | eBPF, and all the semi-advanced SCSI commands that are
               | actually applicable to SSDs already have equivalents in
               | the base NVMe command set.
        
       ___________________________________________________________________
       (page generated 2021-03-27 23:01 UTC)