[HN Gopher] Web LLM - WebGPU Powered Inference of Large Language...
___________________________________________________________________
Web LLM - WebGPU Powered Inference of Large Language Models
Author : summarity
Score : 101 points
Date : 2023-04-15 18:42 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cahoot_bird wrote:
| Do LLMs have a way around the high end GPU requirements, or can
| CPU code potentially be much more optimized somehow?
|
| This is the only thing I can think of; not everyone will have the
| latest high-end GPUs to run such software...
| verdverm wrote:
| Check out LLAMA-CPP
| cahoot_bird wrote:
| Looks like this was hacked together pretty quickly. Running this
| on the CPU is exactly what needs to be optimized to reach more
| devices, if that's even possible...
|
| I guess it will take hardware and software a while to catch up
| and compete with ChatGPT...
| verdverm wrote:
| If you look at the news, yes, it came together quickly, but it
| has also received a lot of performance upgrades that have made
| significant improvements.
| m1el wrote:
| If you're doing inference on a neural network, each weight has to
| be read at least once per token, which means you read at least
| the size of the entire model for every token you generate. If
| your model is 60GB and you're reading it from the hard drive,
| then your minimum time per token is limited by your drive's read
| throughput. MacBooks have ~4GB/s of sequential read speed, which
| means your inference time per token will be at least 15 seconds.
| If your model is in RAM, then (according to Apple's advertising)
| your memory speed is 400GB/s, which is 100x your hard drive
| speed, and memory throughput will not be as much of a bottleneck
| here.
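|
| As a rough back-of-the-envelope sketch of that bound (a
| hypothetical calculation, assuming the 60GB model and the
| bandwidth figures above):
|
|     // Lower bound on decode latency: every weight is read once per
|     // token, so seconds/token >= model size / memory bandwidth.
|     const modelBytes = 60e9;        // 60 GB of weights
|     const ssdBandwidth = 4e9;       // ~4 GB/s sequential SSD read
|     const ramBandwidth = 400e9;     // ~400 GB/s unified memory (Apple's figure)
|     const secondsPerToken = (bw: number) => modelBytes / bw;
|     console.log(secondsPerToken(ssdBandwidth)); // 15 s/token streaming from disk
|     console.log(secondsPerToken(ramBandwidth)); // 0.15 s/token from RAM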
| rahimnathwani wrote:
| Your answer applies equally to GPU and CPU, no?
|
| The comment to which you replied was asking about the need
| for a GPU, not the need for a lot of RAM.
| ratg13 wrote:
| There will be LLM-specific chips coming to market soon,
| specialized for the task.
|
| Tesla has already been creating AI chips for the FSD features in
| their vehicles. Over the next few years, everyone will be racing
| to be the first to put out LLM-specific chips, with AI-specific
| hardware devices following.
| brucethemoose2 wrote:
| The next generation of Intel/AMD IGPs operating out of RAM
| should be quite usable.
| junrushao1994 wrote:
| To clarify, running this WebLLM demo doesn't need a MacBook Pro
| that costs $3.5k :-)
|
| WebGPU supports multiple backends: besides Metal on Apple
| Silicon, it offloads to Vulkan, DirectX, etc. That means a
| Windows laptop with Vulkan support should work. My 2019 Intel
| MacBook with an AMD GPU works as well. And of course, NVIDIA GPUs
| too!
|
| Our model is int4-quantized and about 4GB in size, so it doesn't
| need 64GB of memory either. Somewhere around 6GB should suffice.
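|
| As a rough sketch of where numbers like that come from
| (hypothetical figures, assuming a ~7B-parameter model quantized
| to 4 bits per weight):
|
|     // Illustrative memory estimate for an int4-quantized model.
|     const params = 7e9;        // ~7B parameters (Vicuna/LLaMA-7B class)
|     const bitsPerWeight = 4;   // int4 quantization
|     const weightGB = params * bitsPerWeight / 8 / 1e9;  // = 3.5 GB of weights
|     // Activations, the KV cache, and runtime overhead come on top of the
|     // weights, which is why ~6GB of GPU-accessible memory is a safer target.
|     console.log(weightGB.toFixed(1) + " GB of weights");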
| bazmattaz wrote:
| Sorry, non-technical person here: has this been benchmarked
| against ChatGPT? Do you have any idea how it performs compared to
| GPT-3 or GPT-4?
| junrushao1994 wrote:
| It is unfortunately non-trivial to quantitatively evaluate the
| performance against ChatGPT :-(
|
| We didn't do much evaluation because there isn't much innovation
| on the model side; instead, we are demoing the possibility of
| running a model end-to-end on ordinary client GPUs via WebGPU,
| without server resources.
| slimsag wrote:
| Nice work! I knew it wouldn't be long before someone put this
| together :)
|
| I'm curious whether, given a different language (like Zig) with
| WebGPU access, you could easily translate that last mile of code
| to execute there. Specifically, I wonder if I could do it, and
| whether you could give me an overview of where the code for
| "Universal deployment" in your diagram actually lives?
|
| I found llm_chat.js, but it seems that doesn't include the
| logic necessary for building WGSL shaders? Am I wrong or does
| that happen elsewhere like in the TVM runtime? How much is
| baked into llm_chat.wasm, and where is the source for that?
| crowwork wrote:
| The WGSL shaders are generated and compiled through TVM and
| embedded into the wasm.
|
| I think what you mean is native wgpu support. At the moment the
| WebGPU runtime dispatches to the JS WebGPU environment. Once the
| TVM runtime comes with native wgpu support (like the current ones
| for Vulkan or Metal), it will be possible to leverage any native
| wgpu runtime like the one Zig provides.
|
| Additionally, TVM currently supports targets like Vulkan and
| Metal natively, which allows targeting those platforms directly.
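|
| As a minimal sketch (not the project's actual code) of what
| "dispatching to the JS WebGPU environment" means: TVM emits WGSL
| source, and the browser's standard WebGPU API compiles it into a
| compute pipeline at runtime, roughly like this:
|
|     // Hypothetical illustration using only the standard WebGPU JS API.
|     const adapter = await navigator.gpu.requestAdapter();
|     const device = await adapter!.requestDevice();
|
|     // Pretend this WGSL string was generated by TVM's code generator.
|     const wgsl = `
|       @group(0) @binding(0) var<storage, read_write> data: array<f32>;
|       @compute @workgroup_size(64)
|       fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
|         data[gid.x] = data[gid.x] * 2.0;
|       }`;
|
|     const module = device.createShaderModule({ code: wgsl });
|     const pipeline = device.createComputePipeline({
|       layout: "auto",
|       compute: { module, entryPoint: "main" },
|     });
|     // The runtime then binds buffers and dispatches this pipeline for
|     // each kernel call coming out of the wasm module.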
| slimsag wrote:
| OK, that makes sense; so basically, if I want to give this a
| shot, I would just need to read llm_chat.js and the TVM docs, and
| effectively translate llm_chat.js to my language of choice?
| crowwork wrote:
| I think what would be needed instead is native wgpu runtime
| support in TVM. Like the Vulkan implementation in TVM, it would
| then link naturally to any runtime that provides webgpu.h.
|
| Then yeah, llm_chat.js would be the high-level logic that targets
| the TVM runtime, and it could be implemented in any language the
| TVM runtime supports (that includes JS, Java, C++, Rust, etc.).
|
| Native WebGPU support is an interesting direction. Feel free to
| open a thread on the TVM discuss forum; perhaps there would be
| fun things to collaborate on in OSS.
| summarity wrote:
| How big is the "runtime" part? My use case would basically be:
| run this in a native app that links against WebGPU (wgpu or
| Dawn). Is there a reference implementation of this runtime that
| one could study?
| [deleted]
| auggierose wrote:
| Just tried out the demo; finally something that runs out of the
| box on my iMac Pro! This old 16GB card can finally breathe some
| LLM air!!
| hongkonger wrote:
| You guys are awesome. Both Web LLM and Web Stable Diffusion demos
| work on my Intel i3-1115G4 laptop with only 5.9GB of shared GPU
| memory.
| bhouston wrote:
| It is funny how we have the WebNN API coming, but it is so slow
| to arrive that we are misusing graphics APIs to do NN again.
| Anyone have a clue when the WebNN API will arrive? It would be
| significantly more power efficient than using WebGPU/WebGL.
|
| https://www.w3.org/TR/webnn/
|
| Seems to be in development over at Chrome:
| https://chromestatus.com/feature/5738583487938560
| brucethemoose2 wrote:
| > For example, the latest MacBook Pro can have more than 60G+
| unified GPU RAM that can be used to store the model weights and a
| reasonably powerful GPU to run many workloads.
|
| ...for $3.5K minimum, according to the Apple website :/
|
| Is there any chance WebGPU could utilize the matrix instructions
| shipping on newer/future IGPs? I think MLIR can do this through
| Vulkan, which is how SHARK is so fast at Stable Diffusion on the
| AMD 7900 series, but I know nothing about WebGPU's restrictions
| or Apache TVM.
| satvikpendem wrote:
| Getting 64 GB of VRAM for $3.5k is a lot cheaper than buying
| the equivalent Nvidia discrete GPUs.
| summarity wrote:
| It's also interesting that this opens up the full saturation
| of Apple Silicon (minus the ANE): GGML can run on the CPU,
| using NEON and AMX, while another instance could run via
| Metal on the GPU using MLC/dawn. Though the two couldn't
| share (the same) memory at the moment.
| brucethemoose2 wrote:
| The GPU's ML task energy is so much lower that you'd
| probably get better performance running everything on the
| GPU.
|
| I think some repos have tried splitting things up between
| the NPU and GPU as well, but they didn't get good
| performance out of that combination? Not sure why, as the
| NPU is very low power.
| verdverm wrote:
| You can get it for a lot less from https://frame.work
|
| But 64G of system RAM is not the same as GPU memory; apples and
| oranges
| satvikpendem wrote:
| Where does Framework offer 64 GB of VRAM? By VRAM I am
| referring to GPU RAM, yes.
| brucethemoose2 wrote:
| Technically any newish laptop with 64GB of RAM has 64GB
| of "VRAM," but right now the Apple M series and AMD 7000
| series are the only IGPs with any significant ML power.
| sroussey wrote:
| I'm not sure what you mean. Typically, an iGPU slices off
| part of RAM for the GPU at boot time, which means it's
| fixed and not shared. When did this change?
| brucethemoose2 wrote:
| It's fixed at boot, but (on newer IGPs) it can grow beyond the
| initial allocation.
| KeplerBoy wrote:
| Doesn't it all boil down to bandwidth?
|
| AMD's IGPs are way less attractive because they use rather slow
| DDR4/5 memory, while the M2 has blazing fast memory integrated
| into the package.
|
| We're talking about 50 GB/s vs 400 GB/s. Nvidia's A100 has over
| 1,500 GB/s.
|
| Memory bandwidth is usually the bottleneck in GPU performance, as
| many kernels are memory-bound (look up the roofline performance
| model).
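|
| A minimal sketch of that bandwidth-bound roofline estimate
| (illustrative only, using the ~4GB int4 model from the demo and
| the bandwidth figures quoted in this thread):
|
|     // Bandwidth-bound ceiling on decode speed:
|     // tokens/s <= memory bandwidth / bytes read per token.
|     const modelBytes = 4e9;  // ~4 GB of int4 weights
|     const bandwidths: Record<string, number> = {
|       "DDR5 IGP":      50e9,    // ~50 GB/s
|       "Apple M (Max)": 400e9,   // ~400 GB/s
|       "A100":          1500e9,  // ~1.5 TB/s HBM
|     };
|     for (const [name, bw] of Object.entries(bandwidths)) {
|       console.log(name, Math.floor(bw / modelBytes), "tokens/s upper bound");
|     }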
| brucethemoose2 wrote:
| The AMD 6000 series has 128-bit LPDDR5 as an option, and the 7000
| series has LPDDR5X. This is similar to the base M1/M2.
|
| The Pro/Max have double/quadruple that bus width, but they are
| much bigger, more expensive chips.
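|
| A rough sketch of how bus width maps to peak bandwidth (the
| 6400 MT/s transfer rate is an assumption for illustration):
|
|     // Peak bandwidth ~= bus width in bytes * transfer rate.
|     const gbPerSec = (busBits: number, megaTransfers: number) =>
|       (busBits / 8) * megaTransfers * 1e6 / 1e9;
|     console.log(gbPerSec(128, 6400));  // ~102 GB/s (base M1/M2, 128-bit LPDDR5)
|     console.log(gbPerSec(256, 6400));  // ~205 GB/s (double the bus, Pro class)
|     console.log(gbPerSec(512, 6400));  // ~410 GB/s (quadruple, Max class)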
| brucethemoose2 wrote:
| Maybe Intel will start offering cheap, high-capacity Arc dGPUs as
| a power play? That would certainly be disruptive.
|
| But yeah, AMD/Nvidia are never going to offer huge memory pools
| affordably on dGPUs.
| summarity wrote:
| That would need support in Dawn:
| https://dawn.googlesource.com/dawn
|
| Dawn and WebIDL are also an easy way to add GPU support to any
| application that can link C code (or use it via a lib). And
| Google maintains the compiler layer for the GPU frameworks
| (Metal, DX, Vulkan, ...). This is going to be a great leap
| forward for GPGPU in many apps.
| brucethemoose2 wrote:
| Hmmm is there an issue tracker for Dawn matrix acceleration
| in... Vulkan or D3D12, I guess? Diving into these software
| stacks is making my head hurt.
| dwheeler wrote:
| > Thanks to the open-source efforts like LLaMA ...
|
| Is the LLaMA model really open source now? Last I checked, it was
| only licensed for non-commercial use, which isn't open source, at
| least not by the software definition. Have they changed the
| license? Are people depending on "databases can't be
| copyrighted"? Are people just presuming they won't be caught?
|
| There's lots of OSS that can use LLaMA, but that's different from
| the model itself.
|
| This is a genuine question; people are making assertions, but I
| can't find evidence for those assertions.
| Anduia wrote:
| Not all implementations of it use the same license. See for
| example lit-llama by Lightning AI:
|
| https://news.ycombinator.com/item?id=35344787
___________________________________________________________________
(page generated 2023-04-15 23:00 UTC)