[HN Gopher] Web LLM - WebGPU Powered Inference of Large Language...
       ___________________________________________________________________
        
       Web LLM - WebGPU Powered Inference of Large Language Models
        
       Author : summarity
       Score  : 101 points
       Date   : 2023-04-15 18:42 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cahoot_bird wrote:
        | Do LLMs have a way around the high-end GPU requirements, or can
        | CPU code potentially be optimized much further somehow?
        | 
        | This is the only thing I can think of, since not everyone will
        | have the latest high-end GPUs to run such software...
        
         | verdverm wrote:
          | Check out llama.cpp
        
           | cahoot_bird wrote:
           | Looks like this was hacked together pretty quickly. This in
           | CPU is exactly what needs optimized to run on more devices,
           | if that's even possible..
           | 
           | I guess it will take hardware and software a while to catch
           | up to compete with ChatGPT..
        
             | verdverm wrote:
             | If you look at the news, yes it came together quickly, but
             | it has also gotten a lot of performance upgrades which have
             | made significant improvements.
        
         | m1el wrote:
         | If you're doing inference on neural networks, each weight has
         | to be read at least once per token. This means you're going to
         | read at least the size of the entire model, per token, at least
         | once during inference. If your model is 60GB, and you're
         | reading it from the hard drive, then your bare minimum time of
         | inference per token will be limited by your hard drive read
         | throughput. Macbooks have ~4GB/s sequential read speed. Which
         | means your inference time per token will be strictly more than
         | 15 seconds. If your model is in RAM, then (according to Apple's
         | advertising) your memory speed is 400GB/s, which is 100x your
         | hard drive speed, and just the memory throughput will not be as
         | much of a bottleneck here.
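          | 
          | A rough back-of-the-envelope sketch of that bound, using the
          | same illustrative numbers as above (not a measurement of any
          | real machine):
          | 
          |   // Lower bound on per-token latency: every weight has to be
          |   // streamed from wherever it lives at least once per token.
          |   function minSecondsPerToken(modelBytes: number,
          |                               bytesPerSec: number): number {
          |     return modelBytes / bytesPerSec;
          |   }
          | 
          |   const GB = 1e9;
          |   // Streaming 60GB from disk at 4GB/s: at least ~15s per token.
          |   console.log(minSecondsPerToken(60 * GB, 4 * GB));
          |   // The same model in 400GB/s RAM: at least ~0.15s per token.
          |   console.log(minSecondsPerToken(60 * GB, 400 * GB));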
        
           | rahimnathwani wrote:
           | Your answer applies equally to GPU and CPU, no?
           | 
           | The comment to which you replied was asking about the need
           | for a GPU, not the need for a lot of RAM.
        
         | ratg13 wrote:
         | There will be LLM specific chips coming to market soon which
         | will be specialized to the task.
         | 
         | Tesla already has already been creating AI chips for their FSD
         | features in their vehicles. Over the next years, everyone will
         | be racing to be the first to put out LLM specific chips, with
         | AI specific hardware devices following.
        
         | brucethemoose2 wrote:
         | The next generation of Intel/AMD IGPs operating out of RAM
         | should be quite usable.
        
       | junrushao1994 wrote:
        | To clarify, running this WebLLM demo doesn't need a MacBook Pro
        | that costs $3.5k :-)
        | 
        | WebGPU supports multiple backends: besides Metal on Apple
        | Silicon, it offloads to Vulkan, DirectX, etc. That means a
        | Windows laptop with Vulkan support should work. My 2019 Intel
        | MacBook with an AMD GPU works as well. And of course, NVIDIA
        | GPUs too!
        | 
        | Our model is int4 quantized and is 4GB in size, so it doesn't
        | need 64GB of memory either. Somewhere around 6GB should suffice.
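        | 
        | For the curious, a minimal sketch of the kind of WebGPU check a
        | page can do before pulling down the ~4GB of weights (this
        | assumes the @webgpu/types TypeScript declarations; the exact
        | limit worth requiring depends on the model):
        | 
        |   async function hasUsableWebGPU(): Promise<boolean> {
        |     // navigator.gpu only exists in WebGPU-enabled browsers.
        |     if (!("gpu" in navigator)) return false;
        |     const adapter = await navigator.gpu.requestAdapter();
        |     if (!adapter) return false;
        |     // The int4 model needs a few GB of buffer space, so check
        |     // the adapter limits before committing to the download.
        |     // 1 GiB here is an illustrative threshold, not the real one.
        |     return adapter.limits.maxBufferSize >= (1 << 30);
        |   }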
        
         | bazmattaz wrote:
          | Sorry, non-technical person here: has this been benchmarked
          | against ChatGPT? Do you have any idea how it performs compared
          | to GPT-3 or GPT-4?
        
           | junrushao1994 wrote:
            | It is unfortunately non-trivial to quantitatively evaluate
            | the performance against ChatGPT :-(
            | 
            | We didn't do much evaluation because there isn't much
            | innovation on the model side; instead, we are demoing the
            | possibility of running an end-to-end model on ordinary
            | client GPUs via WebGPU, without any server resources.
        
         | slimsag wrote:
         | Nice work! I knew it wouldn't be long before someone put this
         | together :)
         | 
          | I'm curious whether, given a different language (like Zig)
          | with WebGPU access, you could easily translate that last mile
          | of code to execute there. Specifically, I wonder if I could do
          | it myself, and if you could give me an overview of where the
          | code for "Universal deployment" in your diagram actually
          | lives.
         | 
         | I found llm_chat.js, but it seems that doesn't include the
         | logic necessary for building WGSL shaders? Am I wrong or does
         | that happen elsewhere like in the TVM runtime? How much is
         | baked into llm_chat.wasm, and where is the source for that?
        
           | crowwork wrote:
           | The WGSL are generated and compiled through TVM and embedded
           | into the wasm.
           | 
           | I think what you mean is wgpu native support. At the moment
           | the web gpu runtime dispatches to the js webgpu environment.
           | Once TVM runtime comes with wgpu native support (like the
           | current ones in vulkan or metal), then it is possible to
           | leverage any wgpu native runtime like what Zig provide.
           | 
           | Additionally, currently tvm natively support targets like
           | vulkan, metal directly which allows targeting these other
           | platforms
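            | 
            | To make that concrete, a minimal sketch of how a generated
            | WGSL kernel gets turned into a compute pipeline through the
            | JS WebGPU environment (the kernel body here is a toy
            | placeholder, not actual TVM output):
            | 
            |   const kernel = /* wgsl */ `
            |     @group(0) @binding(0)
            |     var<storage, read_write> data: array<f32>;
            |     @compute @workgroup_size(64)
            |     fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
            |       data[gid.x] = data[gid.x] * 2.0;
            |     }`;
            | 
            |   function buildPipeline(device: GPUDevice) {
            |     const mod = device.createShaderModule({ code: kernel });
            |     return device.createComputePipeline({
            |       layout: "auto",
            |       compute: { module: mod, entryPoint: "main" },
            |     });
            |   }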
        
             | slimsag wrote:
             | OK that makes sense; so basically if I want to give this a
             | shot then I would just need to read llm_chat.js and the TVM
             | docs, and translate llvm_chat.js to my language of choice
             | effectively?
        
               | crowwork wrote:
               | I think instead what would be needed is a wgpu native
               | runtime support for TVM. Like the implementations in tvm
               | vulkan, then it will be naturally link to any runtime
               | that provides webgpu.h
               | 
               | Then yah the llm_chat.js would be high-level logic that
               | targets the tvm runtime, and can be implemented in any
               | language that tvm runtime support(that includes, js,
               | java, c++ rust etc).
               | 
               | Support webgpu native is an interesting direction. Feel
               | free to open a thread in tvm discuss forum and perhaps
               | there would be fun things to collaborate in OSS
        
               | summarity wrote:
               | How big is the "runtime" part? My use case would
               | basically be: run this in a native app that links against
               | webgpu (wgpu or dawn). Is there a reference
               | implementation for this runtime that one could study?
        
       | [deleted]
        
       | auggierose wrote:
       | Just tried out the demo, finally something that runs out of the
       | box on my iMac Pro! This old 16GB card can finally breathe some
       | LLM air!!
        
       | hongkonger wrote:
       | You guys are awesome. Both Web LLM and Web Stable Diffusion demos
       | work on my Intel i3-1115G4 laptop with only 5.9GB of shared GPU
       | memory.
        
       | bhouston wrote:
        | It is funny how we have the WebNN API coming, but it is so slow
        | to arrive that we are misusing graphics APIs to do NN again.
        | Anyone have a clue when the WebNN API will arrive? It would be
        | significantly more power-efficient than using WebGPU/WebGL.
       | 
       | https://www.w3.org/TR/webnn/
       | 
       | Seems to be in development over at Chrome:
       | https://chromestatus.com/feature/5738583487938560
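        | 
        | For reference, a rough sketch of what the entry points in that
        | draft look like (sketched from the spec above; nothing ships in
        | stable browsers yet, and the details keep changing):
        | 
        |   // Hypothetical usage of the WebNN draft API.
        |   const context = await navigator.ml.createContext();
        |   const builder = new MLGraphBuilder(context);
        |   // ...then describe the network with builder ops, build an
        |   // MLGraph, and let the browser pick the most efficient
        |   // backend (CPU, GPU, or a dedicated NPU), which is where the
        |   // power-efficiency win would come from.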
        
       | brucethemoose2 wrote:
       | > For example, the latest MacBook Pro can have more than 60G+
       | unified GPU RAM that can be used to store the model weights and a
       | reasonably powerful GPU to run many workloads.
       | 
       | ...for $3.5K minimum, according to the Apple website :/
       | 
        | Is there any chance WebGPU could utilize the matrix instructions
        | shipping on newer/future IGPs? I think MLIR can do this through
        | Vulkan, which is how SHARK is so fast at Stable Diffusion on the
        | AMD 7900 series, but I know nothing about WebGPU's restrictions
        | or Apache TVM.
        
         | satvikpendem wrote:
         | Getting 64 GB of VRAM for $3.5k is a lot cheaper than buying
         | the equivalent Nvidia discrete GPUs.
        
           | summarity wrote:
            | It's also interesting that this opens up full saturation of
            | Apple Silicon (minus the ANE): GGML can run on the CPU,
            | using NEON and AMX, while another instance could run via
            | Metal on the GPU using MLC/Dawn. The two couldn't share (the
            | same) memory at the moment, though.
        
             | brucethemoose2 wrote:
              | The GPU's energy per ML task is so much lower than the
              | CPU's that you'd probably get better performance running
              | everything on the GPU.
             | 
             | I think some repos have tried splitting things up between
             | the NPU and GPU as well, but they didn't get good
             | performance out of that combination? Not sure why, as the
             | NPU is very low power.
        
           | verdverm wrote:
           | You can get it for a lot less from https://frame.work
           | 
            | But 64GB of RAM is not the same as GPU memory; apples and
            | oranges.
        
             | satvikpendem wrote:
             | Where does Framework offer 64 GB of VRAM? By VRAM I am
             | referring to GPU RAM, yes.
        
               | brucethemoose2 wrote:
               | Technically any newish laptop with 64GB of RAM has 64GB
               | of "VRAM," but right now the Apple M series and AMD 7000
               | series are the only IGPs with any significant ML power.
        
               | sroussey wrote:
               | I'm not sure what you mean. Typically, an iGPU slices off
               | part of RAM for the GPU at boot time, which means it's
               | fixed and not shared. When did this change?
        
               | brucethemoose2 wrote:
                | It's fixed at boot, but (on newer IGPs) it can grow
                | beyond the initial allocation.
        
               | KeplerBoy wrote:
               | Doesn't it all boil down to bandwidth?
               | 
                | AMD's IGPs are way less attractive because they use
                | rather slow DDR4/5 memory, while the M2 has blazing-fast
                | memory integrated into the package.
                | 
                | We're talking about 50 GB/s vs 400 GB/s. Nvidia's A100
                | has over 1,500 GB/s.
                | 
                | Memory bandwidth is usually the bottleneck in GPU
                | performance, as many kernels are memory-bound (look up
                | the roofline performance model).
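                | 
                | A tiny sketch of that roofline bound (the peak-compute
                | number is a placeholder, not a measurement of any
                | particular chip):
                | 
                |   // Attainable throughput is capped either by raw
                |   // compute or by bandwidth times the kernel's
                |   // arithmetic intensity (FLOPs per byte moved).
                |   function roofline(peak: number, bw: number,
                |                     ai: number): number {
                |     return Math.min(peak, bw * ai);
                |   }
                | 
                |   // A matrix-vector product (single-token decoding)
                |   // does ~1 FLOP per byte with fp16 weights, so it
                |   // sits on the bandwidth-limited side of the roof.
                |   console.log(roofline(10e12, 50e9, 1));  // 50 GFLOP/s
                |   console.log(roofline(10e12, 400e9, 1)); // 400 GFLOP/s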
        
               | brucethemoose2 wrote:
                | The AMD 6000 series has 128-bit LPDDR5 as an option, and
                | the 7000 series has LPDDR5X. This is similar to the
                | M1/M2.
               | 
               | The Pro/Max have double/quadruple that bus width. But
               | they are much bigger/more expensive chips.
        
           | brucethemoose2 wrote:
            | Maybe Intel will start offering cheap, high-capacity Arc
            | dGPUs as a power play? That would certainly be disruptive.
            | 
            | But yeah, AMD/Nvidia are never going to offer huge memory
            | pools affordably on dGPUs.
        
         | summarity wrote:
          | That would need support in Dawn:
          | https://dawn.googlesource.com/dawn
          | 
          | Dawn and WebIDL are also an easy way to add GPU support to any
          | application that can link C code (or use it via a lib). And
          | Google maintains the compiler layer for the GPU backends
          | (Metal, DX, Vulkan, ...). This is going to be a great leap
          | forward for GPGPU in many apps.
        
           | brucethemoose2 wrote:
           | Hmmm is there an issue tracker for Dawn matrix acceleration
           | in... Vulkan or D3D12, I guess? Diving into these software
           | stacks is making my head hurt.
        
       | dwheeler wrote:
       | > Thanks to the open-source efforts like LLaMA ...
       | 
        | Is the LLaMA model really open source now? Last I checked, it
        | was only licensed for non-commercial use, which isn't open
        | source, at least not by the usual software definition. Have they
        | changed the license? Are people depending on "databases can't be
        | copyrighted"? Are people just presuming they won't get caught?
       | 
       | There's lots of OSS that can use LLaMA but that's different from
       | the model itself.
       | 
        | This is a genuine question; people are making assertions, but I
        | can't find evidence for them.
        
         | Anduia wrote:
         | Not all implementations of it use the same license. See for
         | example lit-llama by Lightning AI:
         | 
         | https://news.ycombinator.com/item?id=35344787
        
       ___________________________________________________________________
       (page generated 2023-04-15 23:00 UTC)