[HN Gopher] BPF for storage: an exokernel-inspired approach [pdf]
___________________________________________________________________
BPF for storage: an exokernel-inspired approach [pdf]
Author : gbrown_
Score : 90 points
Date : 2021-03-27 12:31 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jankotek wrote:
| Many databases can bypass the filesystem and use block devices
| (partitions) directly. But modifying NVMe drivers is a pretty
| novel approach. Maybe Redis compiled into the Linux kernel is not
| a bad idea :)
| appleflaxen wrote:
| Holy crap this is a cool idea.
| musicale wrote:
| This kind of nails what has been obvious to me for a long time:
| eBPF is basically the revenge of exokernels.
| nynx wrote:
| Exokernels coming back would be so cool.
|
| I wrote a prototype kernel/os [0] a few years back that ran wasm
| in kernel mode. I'd love to see that idea pursued further.
|
| [0]: https://github.com/nebulet/nebulet
| jFriedensreich wrote:
| I really loved the idea of nebulet! Do you think it is possible
| to reach the same security guarantees of eBPF with a feature
| set like wasm? Other big challenges you encountered?
| nynx wrote:
| I think the main issue is that eBPF cannot have loops. It
| restricts the programs you can write. Wasm does not have that
| property, so you cannot prove that it will complete.
| rkeene2 wrote:
| To be clear, eBPF programs can have loops -- they're just
| jumps that have negative offsets (which are signed 16-bit
| numbers), but for security reasons many verifiers do not
| allow them so they can ensure that the program halts.
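To make the point about negative jump offsets concrete, here is a minimal Python sketch that scans raw eBPF bytecode for backward jumps. It assumes the standard 8-byte instruction layout (8-bit opcode, two 4-bit register fields, signed 16-bit offset, signed 32-bit immediate) in little-endian order. The helper names `encode` and `find_back_edges` are made up for illustration, and this is far simpler than what a real verifier does.

```python
import struct

# One eBPF instruction: opcode (u8), regs (u8), offset (s16), immediate (s32).
INSN = struct.Struct("<BBhi")

BPF_JMP, BPF_JMP32 = 0x05, 0x06   # instruction classes (low 3 bits of opcode)
BPF_CALL, BPF_EXIT = 0x80, 0x90   # jump-class operations that are not branches


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Hypothetical helper: pack one instruction into 8 bytes."""
    return INSN.pack(opcode, (src << 4) | dst, off, imm)


def find_back_edges(code):
    """Return (pc, target) pairs for every branch with a negative offset."""
    edges = []
    for pc in range(len(code) // INSN.size):
        opcode, _regs, off, _imm = INSN.unpack_from(code, pc * INSN.size)
        is_jump = (opcode & 0x07) in (BPF_JMP, BPF_JMP32)
        is_branch = is_jump and (opcode & 0xF0) not in (BPF_CALL, BPF_EXIT)
        if is_branch and off < 0:
            # Branch targets are relative to the *next* instruction.
            edges.append((pc, pc + 1 + off))
    return edges


# A countdown loop: r0 = 10; do { r0 -= 1 } while (r0 != 0); exit
loop = b"".join([
    encode(0xB7, dst=0, imm=10),        # 0: mov r0, 10
    encode(0x17, dst=0, imm=1),         # 1: sub r0, 1
    encode(0x55, dst=0, off=-2, imm=0), # 2: jne r0, 0, -2  (back to insn 1)
    encode(0x95),                       # 3: exit
])

# The same program without the branch has no back-edges.
straight = b"".join([
    encode(0xB7, dst=0, imm=10),
    encode(0x17, dst=0, imm=1),
    encode(0x95),
])

print(find_back_edges(loop))
print(find_back_edges(straight))
```

A verifier that wants a cheap termination guarantee can simply reject any program for which a scan like this finds a back-edge, which is the restriction described above.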
| kijiki wrote:
| Everything old is new again.
|
| https://en.wikipedia.org/wiki/Channel_I/O
| hardwaresofton wrote:
| I wonder if Unikernels would also see this speedup, though
| they're so much harder to implement/use that I assume BPF-powered
| storage will see mass adoption before unikernels do...
| wtallis wrote:
| There's also some interest in making eBPF the standard for
| computational storage [1]: offloading processing to the SSD
| controller itself. Many vendors have found that it's easy to add
| a few extra ARM cores or some fixed-function accelerators to a
| high-end enterprise SSD controller, but the software ecosystem is
| lacking and fragmented.
|
| This work may be a very complementary approach. Using eBPF to
| move some processing into the kernel should make it easier to
| later move it off the CPU entirely and into the storage device
| itself.
|
| [1] https://nvmexpress.org/bringing-compute-to-storage-with-nvm-...
| toolslive wrote:
| Wasn't it the other way around? The vendors got extra cores
| from their suppliers and started asking themselves what to do
| with them?
| wtallis wrote:
| By "vendors", I meant drive and controller vendors rather
| than server/appliance vendors. They're looking for ways to
| differentiate their products and offer more value add, but
| extending an SSD's functionality beyond mere block storage
| (potentially with transparent compression or encryption)
| requires a lot of buy-in from end users who need to write
| software to use such capabilities.
| toolslive wrote:
| I meant the Seagates and WDs of this world. Around 2014
| they also had HDDs with an extra core, where I suspect they
| just got that for free from their supplier with the message
| "look, we stopped making these single core CPUs anymore,
| here's a dual core for the same price".
| monocasa wrote:
| WD and Seagate are their chip vendors. They design their
| own custom SoCs.
| toolslive wrote:
| In the case of Seagate's Kinetic drives, one core was used by
| the controller, the other 'extra' CPU was used to manage
| a key-value store on the drive. These were ARM
| processors. I don't think they make these themselves.
| monocasa wrote:
| You would be wrong. I work with guys who used to do SoC
| design for both Seagate and WD.
|
| This is how WD was able to jump to RISC-V so quickly; they
| didn't have any suppliers they needed to negotiate with,
| and the market for Cortex-R-style in-order, single-issue,
| fully pipelined real-time cores is sort of a commodity.
| toolslive wrote:
| Well, that's why I was asking. Thx!
| mgerdts wrote:
| The draft spec is here:
|
| https://www.snia.org/sites/default/files/technical_work/Publ...
| hinkley wrote:
| Can I get Postgres running on a RAID controller while we're at
| it?
| lucasvr_br wrote:
| Samsung's SmartSSD comes with a Xilinx FPGA accelerator that
| can be used to execute code without having to move data
| outside the device. And their open source SDK includes
| domain-specific routines including databases.
|
| See https://xilinx.github.io/Vitis_Libraries/ for details on
| their software stack.
| sp332 wrote:
| It's not SQL, but Samsung made a Key Value SSD that uses
| short keys instead of "addresses" to index blocks of data.
| lambda_obrien wrote:
| Imagine an SSD serving up query responses! I love it.
| wmf wrote:
| Not too far from Netezza.
| nynx wrote:
| I'd prefer it to use spirv or wasm. eBPF is intentionally an
| extremely limited language.
| KMag wrote:
| But trivially provable termination is a really nice property
| if you don't want your hypervisor to have to trust every guest
| kernel.
| benlwalker wrote:
| eBPF the bytecode is not particularly limited. You can parse
| complex formats like Arrow or Parquet even. The Linux kernel
| overlays a verifier on top which adds all sorts of draconian
| restrictions (for good reason). When people talk of eBPF they
| don't always mean to include the Linux verifier limitations
| as well. In particular, that nvmexpress working group link in
| the parent post does not say one way or the other.
| nynx wrote:
| Why not use a different bytecode that is already more
| common, in that case?
| monocasa wrote:
| Because eBPF is designed for verification under
| constrained circumstances like kernels, while allowing easy
| JITing that's almost a 1-to-1 translation. Stuff like not
| needing register allocation, being two-address, etc.
| pjmlp wrote:
| So basically back to SCSI.
| wtallis wrote:
| I'm not seeing any new parallels to SCSI, beyond the
| similarities that have long existed between NVMe and SCSI.
| What kind of programmable functionality over SCSI are you
| referring to?
| pjmlp wrote:
| SCSI uses its own controllers that receive a set of
| commands and take it from there for the data retrieval
| operations.
|
| https://en.wikipedia.org/wiki/SCSI_command
| wtallis wrote:
| SCSI uses simple fixed command sets, the same as standard
| NVMe drives. As far as I'm aware, SCSI doesn't have any
| way to chain commands together, so you can't really do
| anything more complex than predefined commands like copy
| or compare and write. It's nothing at all analogous to an
| actual programmable offload engine like you'd get with
| eBPF, and all the semi-advanced SCSI commands that are
| actually applicable to SSDs already have equivalents in
| the base NVMe command set.
___________________________________________________________________
(page generated 2021-03-27 23:01 UTC)