[HN Gopher] Triton: Open-Source GPU Programming for Neural Networks
___________________________________________________________________
Triton: Open-Source GPU Programming for Neural Networks
Author : ag8
Score : 143 points
Date : 2021-07-28 16:18 UTC (2 hours ago)
(HTM) web link (www.openai.com)
(TXT) w3m dump (www.openai.com)
| thebruce87m wrote:
  | Unfortunate name clash with NVIDIA's Triton Inference Server:
| https://developer.nvidia.com/nvidia-triton-inference-server
| [deleted]
| polynomial wrote:
| My first thought exactly. This will cause nothing but confusion
| and Triton (the inference server) is well integrated into the
| space.
|
| So it's especially weird to see it coming from OpenAI, and not
| a more random startup. It honestly makes no sense they would
| deliberately do this, unless there is some secret cult of
| Triton that is going on in the Bay Area world of AI/ML.
| 6gvONxR4sf7o wrote:
  | The author commented on reddit about that (https://www.reddit.com/r/MachineLearning/comments/otdpkx/n_i...)
|
| > PS: The name Triton was coined in mid-2019 when I released my
| PhD paper on the subject
  | (http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tille...).
  | I chose not to rename the project when the Triton
| inference server came out a year later since it's the only
| thing that ties my helpful PhD advisors to the project.
| shubuZ wrote:
| I have found writing CUDA code is much simpler than writing
| correct multi-threaded AVX2/AVX-512 code.
| dragontamer wrote:
| If you need CPU-side SIMD, then try ispc:
| https://ispc.github.io/
|
  | It's pretty much the OpenCL model, except it compiles into AVX2
  | / AVX-512 code. Very similar to CUDA / OpenCL style programming.
  | It's not single-source like CUDA, but it largely accomplishes
  | the same programming model IMO.
| gnufx wrote:
  | Why not a standard? OpenMP covers pretty much all of C(++) and
  | Fortran, and has offload support inspired by the needs of the
  | Sierra supercomputer.
| mvanaltvorst wrote:
| As a sidenote, these SVG graphs are absolutely beautiful. The
| flow diagram is even responsive on mobile!
| riyadparvez wrote:
| I am confused. Is it another competitor of Tensorflow, JAX, and
| Pytorch? Or something else?
| peytoncasper wrote:
  | I believe this is more of an optimization layer to be used by
  | libraries like Tensorflow and JAX -- a simplification of the
  | interaction with traditional CUDA code.
|
| I imagine these libraries and possibly some users would
| implement libraries on top of this language and reap some of
| the optimization benefit without having to maintain low-level
| CUDA specific code.
| blueblisters wrote:
| So is this similar to XLA?
| jpf0 wrote:
| XLA is domain-specific compiler for linear algebra. Triton
| generates and compiles an intermediate representation for
| tiled computation. This IR allows more general functions
| and also claims higher performance.
|
| obligatory reference to the family of work:
| https://github.com/merrymercy/awesome-tensor-compilers
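  A minimal NumPy sketch of the "tiled computation" idea described above --
  this is not Triton's actual API, just an illustration of the programming
  model: the kernel body works on fixed-size blocks, and an outer grid stands
  in for launching one program instance per output tile.

  ```python
  import numpy as np

  BLOCK = 4  # tile size; a real kernel picks this to fit GPU shared memory

  def matmul_tiled(A, B):
      """Blocked matrix multiply: each (i, j) iteration plays the role of
      one GPU 'program' computing one BLOCK x BLOCK output tile."""
      M, K = A.shape
      K2, N = B.shape
      assert K == K2 and M % BLOCK == 0 and N % BLOCK == 0 and K % BLOCK == 0
      C = np.zeros((M, N), dtype=A.dtype)
      for i in range(0, M, BLOCK):        # the CPU loops stand in for
          for j in range(0, N, BLOCK):    # the parallel launch grid
              acc = np.zeros((BLOCK, BLOCK), dtype=A.dtype)
              for k in range(0, K, BLOCK):
                  # accumulate one tile-sized partial product at a time
                  acc += A[i:i+BLOCK, k:k+BLOCK] @ B[k:k+BLOCK, j:j+BLOCK]
              C[i:i+BLOCK, j:j+BLOCK] = acc
      return C
  ```

  The point of a tile-level IR is that the compiler can see the whole
  tile-sized operation at once and schedule memory movement for it, rather
  than reasoning about individual scalar threads.
  
  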
| giacaglia wrote:
| OpenAI keeps innovating. Amazing to see the speed of execution of
| the team
| ipsum2 wrote:
| This guy developed Triton for his PhD thesis, and OpenAI hired
| him to continue working on it. Doesn't really seem fair to give
| all the innovation credit to OpenAI.
|
| See:
| https://www.reddit.com/r/MachineLearning/comments/otdpkx/n_i...
| notthedroids wrote:
| Does Triton support automatic differentiation? I don't see that
| feature in a quick poke through the docs.
|
| If it does compile to LLVM, I suppose it can use Enzyme
| https://enzyme.mit.edu/
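  For readers unfamiliar with how tools like Enzyme work, here is a toy
  sketch of forward-mode automatic differentiation via dual numbers -- purely
  illustrative of the concept, and unrelated to Triton's or Enzyme's actual
  internals:

  ```python
  class Dual:
      """Carries a value and its derivative together through arithmetic."""
      def __init__(self, val, dot=0.0):
          self.val, self.dot = val, dot

      def __add__(self, o):
          o = o if isinstance(o, Dual) else Dual(o)
          return Dual(self.val + o.val, self.dot + o.dot)
      __radd__ = __add__

      def __mul__(self, o):
          # product rule: (uv)' = u'v + uv'
          o = o if isinstance(o, Dual) else Dual(o)
          return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
      __rmul__ = __mul__

  def derivative(f, x):
      """Evaluate f at x while seeding the derivative slot with 1."""
      return f(Dual(x, 1.0)).dot

  # d/dx (3x^2 + 2x) = 6x + 2, so at x = 2 the derivative is 14
  ```

  Enzyme operates very differently (it transforms LLVM IR directly, which is
  why a language that lowers to LLVM could use it), but the input/output
  contract is the same: a function in, its derivative out.
  
  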
| croes wrote:
  | Too bad it's CUDA. Sooner or later this will become a problem,
  | because you are depending on the benevolence of a single
  | manufacturer.
| shubuZ wrote:
  | Which other hardware vendor provides the level of performance
  | that Nvidia's GPUs provide? Wasn't that dependence on a single
  | (or a couple of) manufacturer(s) just as true in the '90s and
  | the 2020s?
| hhh wrote:
| Tenstorrent.
| croes wrote:
  | It's not about performance but open standards. Remember Oracle
  | vs. Google; at some point in the future NVidia could decide to
  | get money out of CUDA.
| dragontamer wrote:
| AMD's MI100 is slightly faster than NVidia A100 for double-
| precision FLOPs at slightly lower costs. Good enough for Oak
| Ridge National Labs (Frontier Supercomputer), to say the
| least.
|
  | NVidia is faster at 4x4 16-bit matrix multiplications (common
  | in Tensor / Deep Learning stuff), but the MI100 still has 4x4
  | 16-bit matrix multiplication instructions and acceleration.
  | It's not far behind, and the greater 64-bit FLOPs are enough
  | to win in scientific fields.
| dragontamer wrote:
| AMD's ROCm 4.0 now supports cooperative groups, which is
| probably one of the last major holdouts for CUDA compatibility.
|
  | There's still the 64-wide wavefront (for AMD CDNA cards)
  | instead of the 32-wide wavefront (for CUDA). But AMD even has
  | 4x4 half-float matrix multiplication instructions in ROCm (for
  | the MI100, the only card that supports the matrix-
  | multiplication / tensor instructions).
|
| ---------
|
  | I think CUDA vs OpenCL is over. ROCm from AMD has its
  | restrictions, but... it really is easier to program than
  | OpenCL. It's a superior model: having a single language that
  | supports both CPU and GPU code is just easier than switching
  | between C++ and OpenCL (where data structures can't be shared
  | as easily).
|
| -----------
|
| The main issue with AMD is that they're cutting support for
| their older cards. The cheapest card you can get that supports
| ROCm is Vega56 now... otherwise you're basically expected to go
| for the expensive MI-line (MI50, MI100).
| meragrin_ wrote:
| > The main issue with AMD is that they're cutting support for
| their older cards.
|
| No, their main issue is not properly supporting ROCm in
| general. No Windows support at all? It still feels like they
| don't know whether they want to continue investing in ROCm
| long term.
| dragontamer wrote:
| I'd assume that they're gonna support ROCm as long as the
| Frontier deployment at Oak Ridge National Labs is up. ORNLs
| isn't exactly a customer you want to piss off.
| zozbot234 wrote:
| The new contest is not CUDA vs. OpenCL but CUDA vs. Vulkan
| Compute. As support for Vulkan in hardware becomes more
| widespread, it makes more and more sense to just standardize
| on it for all workloads. The programming model is quite
| different between the two (kernels vs. shaders) and OpenCL
| 2.x has quite a few features that are not in Vulkan, but the
| latest version of OpenCL has downgraded many of these to
| extensions.
| dragontamer wrote:
| CUDA programmers choose CUDA because when you make a struct
| FooBar{}; in CUDA, it works on both CPU-side and GPU-side.
|
  | Vulkan / OpenCL / etc. etc. don't have any data-structure
  | sharing like that with the host code. It's a point of
  | contention that makes anything more complicated than a
  | 3-dimensional array hard to share.
|
  | Yeah, Vulkan / OpenCL have all sorts of pointer-sharing
  | arrangements (Shared Virtual Memory) or whatnot. But it's
  | difficult to use in practice, because they keep the concepts
  | of "GPU" code separate from "CPU" code.
|
| ---------
|
  | When you look at things such as Triton, you see that people
  | want to unify the CPU-and-GPU code into a single code base.
  | Look at and read these Triton examples: they're just Python
  | code, sitting inside the rest of the CPU-side Python code.
  |
  | I think people are realizing that high-level code can flow
  | between the two execution units (CPU or GPU) without changing
  | the high-level language. The compiler works hard to generate
  | code for both systems, but it's better for the compiler to do
  | that work than for the programmer to work on integration.
| yumraj wrote:
| Sure, but if this abstraction layer becomes popular, then it
| becomes much easier to support other GPUs without requiring
| client libraries to change, which is a much harder problem.
| dragontamer wrote:
| A big reason why CUDA is popular with compilers is that the
| PTX assembly-ish language is well documented and reasonable.
|
| Compilers generate PTX, then the rest of the CUDA
| infrastructure turns PTX into Turing machine code, or Ampere
| machine code, or Pascal machine code.
|
  | In theory, SPIR-V should do the same job, but it's just not as
  | usable right now. In the meantime, getting it to work on PTX
  | is easier, and then there's probably hope (in the far future)
  | of moving to SPIR-V if that ever actually takes off.
|
| I'm not a developer on Triton, but that'd be my expectation.
| boulos wrote:
| Folks might find the author's research paper [1] while at Harvard
| more informative. This is a great high-level description, but if
| you want more detail, I recommend the paper.
|
| [1] https://dl.acm.org/doi/abs/10.1145/3315508.3329973
| xmaayy wrote:
  | I wonder if this can be used for graphics programming. Shaders
  | are notoriously hard to write correctly, and this seems like
  | it might provide an easier gateway than GLSL.
___________________________________________________________________
(page generated 2021-07-28 19:00 UTC)