[HN Gopher] New exponent functions that make SiLU and SoftMax 2x...
       ___________________________________________________________________
        
       New exponent functions that make SiLU and SoftMax 2x faster, at
       full accuracy
        
       Author : weinzierl
       Score  : 136 points
       Date   : 2024-05-15 19:57 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | koe123 wrote:
       | Perhaps somewhat off topic, does anyone know how stuff like ggml
       | compares to runtimes (tensorflow lite, onnxruntime, etc.)?
        
         | ilaksh wrote:
         | On what hardware exactly?
        
           | koe123 wrote:
           | Probably should have specified that! I'm referring to cpu
           | inference.
        
         | refulgentis wrote:
          | I'm intimately familiar; I maintain ONNX and llama.cpp Flutter
          | libraries across all 6 True Platforms.
         | 
         | Quick opinionated TL;DR:
         | 
          | - llama.cpp for LLMs; it can also run Whisper via its core
          | dependency, GGML.
         | 
         | - ONNX for everything else.
         | 
          | - TF is the Apple of ML: it's great if you're completely wedded
          | to the Google ML ecosystem. Virtually dead outside that.
          | (Something absurd, like 94%, of HF models are PyTorch.)
         | 
          | - The only chance I'd have to do a direct comparison of
          | inference performance is Whisper in ONNX vs. GGML; someone got
          | my llama.cpp lib running with Whisper and didn't report a
          | significant perf difference.
        
           | svnt wrote:
           | What are the True Platforms, in this case?
        
             | crthpl wrote:
             | I'm guessing MacOS, Linux, Android, Windows, iOS, and Web?
        
               | refulgentis wrote:
               | Correct
        
           | a_wild_dandan wrote:
           | What're your thoughts on MLX? It's been phenomenal on my MBP.
        
             | refulgentis wrote:
              | No time* to try it, unfortunately :( Sounds great, though.
              | Mac just kicks the pants off every other platform on local
              | inference thanks to Metal, and I imagine MLX extends that
              | lead to the point where Qualcomm/Google has to make a
              | serious investment in open-source acceleration. The
              | cheapest iPhone from 2022 kicks the most expensive Android
              | from 2023 (Pixel Fold) around the block, 2x on inference:
              | 12 tkns/s vs. 6/s.
             | 
             | * it sounded like a _great_ idea to do an OpenAI LLM x
             | search app. Then it sounded like a great idea to add
             | embeddings locally for privacy (thus, ONNX). Then it
             | sounded like a great idea to do a local LLM (thus,
             | llama.cpp). Then it sounded like a great idea to
             | differentiate by being on all platforms, supported equally.
              | Really taxing. Think I went too far this time. It works,
              | but jeez, the workload... Hopefully, after release, it
              | turns out the maintenance load is relatively low.
        
           | catgary wrote:
            | I don't think that's a terribly fair description of Google -
            | AWS's chips (Inferentia and Trainium) both have robust
            | XLA/JAX support. Plus JAX now exports MLIR, which makes for a
            | really compelling JAX -> IREE pipeline: JAX models can more
            | or less be deployed anywhere, even on bare-metal embedded
            | devices.
        
             | refulgentis wrote:
              | You're right. If you need to go from data => model running
              | in a web app, I'd do TF - the inartful Apple analogy is
              | meant to indicate that: great vertical integration.
             | 
             | For local inference of an existing model, TF Lite pales in
              | comparison to ONNX. ONNX goes out of its way to help you get
             | ~anything running ~anywhere on the best accelerator
             | available on the platform.* AFAIK TF Lite only helps if
             | your model was in TF.
             | 
             | And there simply isn't an LLM scene for TensorFlow, so it
             | "loses" to llama.cpp for that. There isn't an ONNX LLM
             | scene either, though. (see below)
             | 
              | * There's one key exception... until recently... LLMs!
              | ONNX's model format had a size limit due to protobuf; IIRC
              | it was 2-4 GB. Part of the Phi-3 announcement was a library
              | they've been stubbing out on top of ONNX that's more
              | specialized for LLMs. That being said, I haven't seen any
              | LLMs in it except Phi-3, and it's an absolute mess: the
              | library was announced weeks ahead of when it was planned
              | to be released, and then throw in the standard 6-week
              | slippage - I'm probably not trying it again until June.
        
       | a_wild_dandan wrote:
       | (in llama.cpp, for CPU)
        
         | jart wrote:
          | I developed this originally for llamafile, where it was
          | included in the last two releases:
          | https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.2
          | Now we're upstreaming it to
         | the llama.cpp project. There are other performance enhancements
         | you can currently only get from llamafile, such as Kawrakow's
         | work making K quants go much faster.
        
           | breakingcups wrote:
           | Is that just because nobody has made an effort yet to port
           | them upstream, or is there something inherently difficult
           | about making those changes work in llama.cpp?
        
       | SeanAnderson wrote:
       | off topic, but sheesh. I was skimming this and thought, "This
       | seems like a crazy optimization. It's complex and the code in
       | question has had a ton of eyes on it already." and then I saw the
       | contributor and was like "Of course it's jart. It's _always_ jart
        | with the crazy good solutions."
       | 
       | well done :)
        
         | mhh__ wrote:
         | > adapted from arm limited optimized routine
         | 
         | Shoulders of giants and all that
        
         | neonsunset wrote:
         | Mostly looks scary* because that's just how it is with
         | intrinsics syntax in C and C++. As with many things there, this
         | pain is mostly self-inflicted.
         | 
          | There are C++ libraries that allow for C#-style SIMD and
          | hardware intrinsics syntax, as far as I'm aware. The
          | disadvantage is that you can't directly look up the mnemonics
          | in the ISA documentation.
         | 
          | *not to seem as if I'm dismissing the importance of the work
          | done there, just highlighting that it could have been much
          | more accessible to a wider audience. But I'm not gonna suggest
          | something that everyone here would consider preposterous, such
          | as rewriting the inference back-end in C#, just yet.
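         
        A minimal sketch of the wrapper style mentioned above, assuming a
        libstdc++ recent enough to ship <experimental/simd>; the names here
        are illustrative and this is not the code from the change under
        discussion:
         
            // Elementwise a*b + c written with overloaded operators on a
            // portable SIMD type instead of raw intrinsics.
            #include <experimental/simd>
            namespace stdx = std::experimental;
         
            stdx::native_simd<float> muladd(stdx::native_simd<float> a,
                                            stdx::native_simd<float> b,
                                            stdx::native_simd<float> c) {
                return a * b + c;
            }
         
            // The raw AVX2 spelling of the same operation would be
            // _mm256_fmadd_ps(a, b, c), which is what you would grep for
            // in the Intel ISA reference: the trade-off the comment notes.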
        
       | mhh__ wrote:
       | > replaces short[65536] look up table
       | 
        | Is that not quite dim to begin with (having a LUT the size of
        | the whole L1 cache), or does it work surprisingly well because
        | of some probabilistic fudging?
        
         | andy99 wrote:
          | "Why you (probably) shouldn't use a lookup table"
          | (https://specbranch.com/posts/lookup-tables/) gives some
          | discussion of when one is generally appropriate. My narrow
          | experience is that you can do a lot of real-time calculation
          | before a lookup becomes the faster option.
        
         | Tuna-Fish wrote:
          | The lookup table does surprisingly well because the workload
          | is otherwise extremely cache-hostile, and it doesn't really
          | matter if you blow up your L1 cache: none of the data you
          | evicted to make room for the LUT was ever going to be reused
          | anyway.
         | 
         | ML loads in general are streaming loads that linearly load the
         | entire dataset for every iteration.
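         
        For context, a sketch of the general technique behind the
        short[65536] table being discussed: precompute the result for every
        possible fp16 bit pattern and index by reinterpreting the input's
        bits. Names are illustrative, the code assumes a compiler with the
        _Float16 extension, and ggml's actual table differs in detail:
         
            #include <cmath>
            #include <cstdint>
            #include <cstring>
         
            // One fp16 result per possible fp16 input: 65536 * 2 bytes =
            // 128 KiB, comparable to or larger than an L1 data cache.
            static _Float16 exp_table[1 << 16];
         
            static void init_exp_table() {
                for (uint32_t i = 0; i < (1u << 16); ++i) {
                    uint16_t bits = (uint16_t) i;
                    _Float16 x;
                    std::memcpy(&x, &bits, sizeof x);     // bits -> fp16 value
                    exp_table[i] = (_Float16) std::exp((float) x);
                }
            }
         
            // Each exp() inside SoftMax/SiLU becomes a single table load,
            // at the cost of keeping the whole table hot in cache.
            static inline float exp_f16(_Float16 x) {
                uint16_t bits;
                std::memcpy(&bits, &x, sizeof bits);
                return (float) exp_table[bits];
            }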
        
       | mysteria wrote:
        | How much do these SiLU and SoftMax improvements affect LLM
        | inference speed as a whole? Correct me if I'm wrong, but I feel
        | that this change will only have a small effect, as the majority
        | of the time is spent doing matrix multiplications.
        
       | lxe wrote:
       | At this point, is gguf/llama.cpp a more performant solution for
       | unbatched inference on CUDA devices, or is
       | exllamav2+flashattention still reigning supreme?
        
       | mjcohen wrote:
        | About 20 years ago, I was programming for the Hughes radar
        | signal processor, a highly parallel pipelined machine which
        | accounted for much of Hughes' success in radar processing.
        | Anyway, I needed to compute e^x for 0 < x < 1. The processor had
        | a multiply, so I used four 256-entry tables of e^x, one for each
        | possible 8-bit value in the 4 blocks of the 32-bit word, and
        | multiplied the four results to get the final value. It was about
        | 5 times as fast as the previous best e^x routine. That machine
        | was fun! It is obsolete now, but for many years it could process
        | radar signals faster than processors that were nominally many
        | times faster.
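         
        A hedged reconstruction of the decomposition described above, not
        the original radar-processor code: treat x in [0,1) as a 32-bit
        fixed-point fraction, split it into four bytes b3..b0 (most to
        least significant), and use
        e^x = e^(b3*2^-8) * e^(b2*2^-16) * e^(b1*2^-24) * e^(b0*2^-32),
        so four table lookups and a few multiplies replace the exponential.
        Doubles and std::exp here stand in for whatever the hardware used.
         
            #include <cmath>
            #include <cstdint>
         
            // exp_tab[k][b] = e^(b * 2^(-8*(k+1))), k = 0..3, b = 0..255.
            static double exp_tab[4][256];
         
            static void init_exp_tables() {
                for (int k = 0; k < 4; ++k) {
                    const double scale = std::ldexp(1.0, -8 * (k + 1));
                    for (int b = 0; b < 256; ++b)
                        exp_tab[k][b] = std::exp(b * scale);
                }
            }
         
            // bits encodes x = bits / 2^32, with 0 <= x < 1.
            static double exp_fixed(uint32_t bits) {
                double r = 1.0;
                for (int k = 0; k < 4; ++k) {
                    // Most significant byte pairs with the 2^-8 table.
                    const unsigned byte = (bits >> (8 * (3 - k))) & 0xFF;
                    r *= exp_tab[k][byte];  // product of the four table values
                }
                return r;
            }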
        
       | KaoruAoiShiho wrote:
        | Does this help the gguf use case of partial offloading to the
        | GPU? It'll make the CPU part faster too?
        
       ___________________________________________________________________
       (page generated 2024-05-15 23:00 UTC)