[HN Gopher] Open-source drivers according to Habana
___________________________________________________________________
Open-source drivers according to Habana
Author : Aissen
Score : 73 points
Date : 2022-03-31 14:42 UTC (8 hours ago)
(HTM) web link (threedots.ovh)
(TXT) w3m dump (threedots.ovh)
| naoqj wrote:
| ...and then they'll complain that corporations would rather
| maintain their own forks of the kernel.
| Aissen wrote:
| ...and their customers will simply not buy their product if it
| needs a custom kernel.
| yboris wrote:
| Do I understand this right? It seems like Habana has some super-
| efficient and fast ML hardware, but you can't just drop in an ML
| project and start using it? For example, only a subset of
| TensorFlow or PyTorch is supported?
|
| Is that right? If you want to use their hardware, you need to
| jump through some hoops?
|
| https://docs.habana.ai/en/latest/Tensorflow_User_Guide/Tenso...
|
| https://docs.habana.ai/en/latest/PyTorch_User_Guide/PyTorch_...
| j16sdiz wrote:
| If you want to use that, you need the closed-source driver.
|
    | The open-source driver is the bare minimum needed to make the
    | Linux kernel developers happy.
| yboris wrote:
| But my point is that even if you use some closed-source
| driver or whatever is required to "set up" your project - you
      | still can't re-use your regular code wholesale, but will
      | have to modify it to fit within whatever features they
      | support.
| Right?
| mirker wrote:
| The API of Tensorflow and PyTorch is quite large. Even
| Tensorflow has a dialect specifically for TPU, so the
        | notion of accelerator independence (especially when
        | considering performance) has not yet been realized, though
        | it remains a work in progress.
|
| Anyway, how could you reuse your code wholesale when one of
| the most common operations is calling ".cuda()" on a
| tensor?
| my123 wrote:
        | Unsupported layers will transparently fall back to CPU
        | execution. You can then choose to implement those that
        | can be implemented as TPC kernels. It's fundamentally much
        | less flexible than a GPU.
|
| Efficiency of Habana hardware isn't that great either, but
| that's another story... (they're still using TSMC 16nm in
| 2022 notably). Where Habana has an advantage in some
| workloads is cost.
| sjmm1989 wrote:
| Okay, I clearly don't understand everything going on in this one.
| Why? There must be something going on that explains why we are
| letting people get away with breaking the rules the rest of us
  | are supposed to follow. Must be. Why?
|
| Because if I understand anything at all about open source, it's
| that you don't get to post closed source as if it is open source;
| even if you try to resort to shenanigans and trickery to get
| around the rules.
|
| So my question is this. Why are we letting Intel get away with
| loopholing the rules the rest of us would have to follow? Seems
| to me the best thing to do here is punish them like the petulant
| child they are being. Erase their code, and tell them to politely
| fuck off. Or at least more politely than Linus Torvalds did with
| Nvidia.
|
| Also, I don't see why letting them put their code up is of any
| use in the first place if they are just going to do things like
| this to essentially break it. Like seriously folks, what in the
| ever living fuck?
|
| If it were up to me, I'd have people hacking them just to show
| them their place, and putting EVERYTHING up online for all to
| use. And that would be after doing something like sending
  | evidence of wrongdoing to the justice departments of every
  | single nation that wants to take a chew out of them (of which
  | there are likely many).
|
| Why?
|
  | Because the fact that we allow companies like Microsoft and
  | Intel to continuously get away with all their bullshit is
  | exactly why they keep on trying to pull more bullshit. It's not
  | rocket science, folks.
|
| Time to apply the brake to their bullshitmobile and hard.
| Topgamer7 wrote:
| > If it were up to me, I'd have people hacking them
|
| Cut this garbage shit out. Hack on the driver implementation
| instead.
|
| > Because if I understand anything at all about open source,
| it's that you don't get to post closed source as if it is open
| source
|
    | If it was upstreamed into the Linux tree, then they have
    | open-sourced it with the appropriate license. So the driver
    | is not fully functional, but what they have included could be
    | the building blocks for someone to add the additional
| functionality.
|
| It is a shitty practice, and if you want to drive adoption,
| this isn't going to do it.
|
| At least they didn't make a marketing statement that they're
| open source purveyors and software saviors.
| j16sdiz wrote:
| Background story: https://lwn.net/Articles/867168/
|
  | TLDR: Intel wants some huge DRI-related changes in the Linux
  | kernel. The DRI maintainers insist there needs to be at least
  | one open user-mode user. So we have this proof-of-concept
  | driver.
| drewg123 wrote:
| Thanks for this, without this I had no clue what the article
| was referring to.
|
| It would be nice if there was a fairly stable kernel API (or
| ABI) so drivers like this didn't _have_ to be in the kernel.
| Out of tree drivers are a nightmare to maintain.
|
| I maintained some out-of-tree drivers for years at Myricom
| (version of myri10ge with valuable features nak'ed by netdev,
| MX HPC drivers). Doing this was a massive PITA. Pretty much
| every minor release brought with it some critical function
| changing the number of arguments, changing names, etc. RHEL
| updates were my own special version of hell, since their kernel
    | X.Y.Z in no way resembled the upstream X.Y.Z. Supporting
    | 2.6.9 through 3.x got so bad that the shim layer for the Linux
| driver was almost as big as the entire FreeBSD driver (where
| nobody cared that I implemented LRO).
| 10000truths wrote:
| This is by design. Linux doesn't _want_ to pay the
| maintenance and performance costs of guaranteeing a stable
| in-kernel API /ABI:
|
| https://www.kernel.org/doc/Documentation/process/stable-
| api-...
| charcircuit wrote:
| You can make the same argument about user space
| compatibility. It's extra work and may prevent some
| improvements, but it's nice to have.
| josephcsible wrote:
| > It would be nice if there was a fairly stable kernel API
| (or ABI) so drivers like this didn't _have_ to be in the
| kernel.
|
| Why? We _want_ as many drivers as possible to be in the
| kernel.
| aseipp wrote:
| There's another preceding case here, from 2020, involving
| Qualcomm submitting a similar driver (and to some extent
| Microsoft), which is quietly linked to and worth looking at for
| some other history: https://lwn.net/Articles/821817/
|
| The situation there is that Habana already had their driver
| submitted very early I suppose, and I guess the resistance
| wasn't high enough at the time to keep it out. Qualcomm later
| came around and their own AI 100 driver was rejected, on
| similar grounds that would have kept Habana out, had they been
| applied at the time. (Airlie even called out Greg at this
| time.)
|
| The later scuffle (your OP link) is because the Habana driver
| eventually wanted to adopt DMA-BUF and P2P-DMA support in the
| driver, which the original developers intended for the GPU
    | subsystem, so they considered this over the line, because the
    | criterion for new GPU drivers is "a testable open source
| userspace". So, that work was rejected, but the driver itself
| wasn't pulled entirely. Just that particular series of patches
| was not applied.
|
| Microsoft had a weirdly similar case where they wanted a
| virtualization driver for Linux that would effectively pass GPU
| compute through to Windows hosts running under HyperV, for the
| purposes of running Machine Learning compute workloads -- not
| graphics. (The underlying Windows component to handle these
| tasks _does_ use DirectX, but only the DirectCompute part of
    | it.) But it wasn't rejected out of hand on the same
| principle; it's more like a VFIO passthrough device
| conceptually, and didn't need to use any DRI/DRM specific
| subsystems to accomplish that. But the basic outline is the
| same where the userspace component would be closed source, so
| the driver is just connecting a binary blob to a binary blob.
| It doesn't use any deeply involved APIs, but it's also not very
| useful for anyone except the WSL team. It's a bit of an
| inbetween case where it isn't quite the same thing, but it's
| not _not_ the same thing. Strange one.
|
| As of right now, looking at upstream:
|
| - Habana now has DMA-BUF support, as of late last year, so
| presumably the minimal userspace given above was "good enough"
| for upstream, since they can presumably at least run minimal
| testing on the driver paths:
| https://github.com/torvalds/linux/commit/a9498ee575fa116e289...
|
| - Microsoft's DXGI/whatever-it's-called driver for compute is
| still not upstream, but I think they ship it with their custom-
| by-default WSL2 kernel (`wsl -e uname -a` gives me
| `5.10.16.3-microsoft-standard-WSL2` right now). It was not
| rejected out of hand but they also didn't seem to mind if it
    | didn't land immediately. I have no idea what its status is.
|
| - Qualcomm's driver for AI 100 was completely rejected
| immediately and I do not know of any further attempts to
| upstream it.
|
| - And there are probably even more cases of this. I believe
| Xilinx has a driver for their (similarly closed) compiler +
| runtime stack included in Vitis, and I doubt it's going
    | upstream soon (xocl/xcmgmnt).
|
| So the rules in general aren't particularly conclusive. But it
| looks like most accelerator designs will eventually fall under
| the rules of the graphics subsystem, if they seek to scale
| through P2P/DMA designs. As a result of that, a lot of people
| will probably get blocked, but Habana to some extent got a
| first-mover advantage, I think.
|
    | Arguably, people who want to complain about SynapseAI Core
    | being unsuitable for production use should, to some extent,
    | also place a bit of the blame on the Linux developers, if
    | they consider the drivers a problem. I think this isn't an
    | unreasonable position.
|
| But ultimately this comes down to there being two different
    | desires among people: the kernel developers' concerns _aren't_
| that every userspace stack for every accelerator, shipped to
| every production user, is fully open source. That might be the
| concern of some people who are _users_ of the kernel and Linux
| (including some kernel developers themselves), but not "them"
| at large. Their concern might be more accurately stated as:
| they have enough tooling and information to maintain their own
| codebase and APIs reliably, given the hardware drivers they
| have. These are not the same objective, and this is a good
| example of that.
| AshamedCaptain wrote:
| Kernel maintainers tend to refuse drivers that only work with
  | proprietary user-space, so I guess this is just one way to
  | work around that.
| hansendc wrote:
| It's not just drivers. It's really about ensuring that the
| folks that maintain the kernel have a way to test the code they
| maintain. The reasons that we (the kernel maintainers) have for
| this requirement are varied. But, for me, it's really nice to
| have at least one open source implementation that can _test_
| the kernel code. Without that, the kernel code can bit rot too
| easily.
|
| Even better is if an open source implementation is _in_ the
    | kernel tree, like in tools/testing/selftests. That makes it
| even less likely that the kernel code gets broken.
|
| Disclaimer: I work on Linux at Intel, although not on drivers
| like this Habana one.
| 10000truths wrote:
| Or they could pull an NVidia, and dedicate a whole in-house
| kernel team to maintaining an out-of-tree kernel module.
| my123 wrote:
| For NVIDIA? More than one.
|
| The Tegra stack uses a totally separate out-of-tree but GPLv2
| kernel module (which also works on some dGPU SKUs). It's
| available at https://nv-tegra.nvidia.com/r/gitweb?p=linux-
| nvgpu.git;a=sum...
|
| And then there's the partially closed kernel module stack,
| which is a different code base...
| eikenberry wrote:
    | One of the points of having the drivers in the kernel is that
    | it means the kernel can actually run on that hardware. In
    | addition to allowing for testing, as others have pointed out,
    | it is also a way to make sure that drivers aren't used to
    | restrict access to the hardware. It ensures the freedom of the
    | platform.
___________________________________________________________________
(page generated 2022-03-31 23:01 UTC)