[HN Gopher] Load LLaMA Models Instantly
___________________________________________________________________
Load LLaMA Models Instantly
Author : polyrand
Score : 134 points
Date : 2023-03-17 16:39 UTC (6 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| turnsout wrote:
| "It's a ~200 LOC change that only took me a few hours and it
| worked the first time."
|
| jart is on another plane of existence--and 100% earned this flex
| superkuh wrote:
| Woo! This sounds like it will make it easier to run in normal
| mode (i.e., not interactive) and manage the chat history
| yourself if there's less penalty for a full program reload.
| Currently my perl irc bot wrapper for llama.cpp just open2's
| the program in interactive mode (-i) and reads/writes to
| stdout/stdin of llama.cpp to get the loading-time savings of
| having it manage history and keep state. In one-shot mode
| there'd still be the "extra" inference time of processing the
| full history each time instead of saving state like interactive
| mode does, but the memory load time matters just as much.
|
| For me personally this matters most because right now, when
| llama.cpp runs out of its 2048-token context, it segfaults and
| this causes difficulties. In interactive mode, if it goes off
| the rails and generates 1000 tokens of nonsense, then that
| nonsense is taking up tokens for the next line from chat. In
| normal mode, where it just runs once and all history has to be
| supplied manually, this can be avoided.
| jart wrote:
| Indeed. I'm working on persuading the author of this project
| to make it shell-scriptable too. The default behavior of the
| command should be to print only the response to a prompt to
| stdout.
| adultSwim wrote:
| I'm seeing a lot of interest in generative use cases. Has anyone
| tried LLaMA or GPT for classification?
| lxe wrote:
| I wonder if something like this is possible for CUDA/pytorch
| loading?
| jart wrote:
| I don't know about CUDA, but with PyTorch it's quite possible.
| The solution should generalize to any code that loads into
| memory created by malloc(). You'd obviously be capturing a lot
| of superfluous Python objects in your mappable model file, but
| given the size of these things it should be comparatively
| minor.
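|
| Very roughly, the capture side of that idea looks something
| like this (a heavily simplified sketch, not the actual
| llama.cpp change; the arena size and file name are made up):
| route the loader's allocations through one contiguous region,
| then dump that region to a file so later runs can mmap() it
| instead of re-running the loader.
|
|   /* sketch: arena allocator standing in for malloc() during
|      model loading, dumped to a file afterwards */
|   #include <fcntl.h>
|   #include <string.h>
|   #include <sys/mman.h>
|   #include <unistd.h>
|
|   #define ARENA_SIZE (1ULL << 30)   /* made-up: 1 GiB arena */
|
|   static char *arena;
|   static size_t arena_used;
|
|   static void *arena_alloc(size_t n) {  /* malloc() stand-in */
|     void *p = arena + arena_used;
|     arena_used += (n + 15) & ~15ULL;    /* keep 16-byte alignment */
|     return p;
|   }
|
|   int main(void) {
|     arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
|                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|     if (arena == MAP_FAILED) return 1;
|     /* ...load the model, taking allocations from arena_alloc()
|        instead of malloc()... */
|     float *weights = arena_alloc(4096 * sizeof(float));
|     memset(weights, 0, 4096 * sizeof(float));
|     /* write the populated arena out once ("model_capture.bin"
|        is a placeholder name) */
|     int fd = open("model_capture.bin", O_WRONLY | O_CREAT, 0644);
|     write(fd, arena, arena_used);
|     close(fd);
|     return 0;
|   }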
| kir-gadjello wrote:
| This is cool, but SSD read bandwidth is still the bottleneck. On
| my non-mac machine it still takes several seconds to load the
| model.
| Tepix wrote:
| This assumes (and makes it possible) that the model is already
| in the kernel's page cache.
| dougmwne wrote:
| Can someone break this down? Since this seems to be inferencing
| without having the entire model loaded into memory, is it
| possible this could be a way to relax memory requirements of the
| 65b model?
| layer8 wrote:
| I don't think so:
|
| > the gains here are mostly due to not copying memory anymore,
| and better cooperation with the kernel's page manager. We
| unfortunately aren't getting any additional gains from lazy
| page loading, since this is a dense model. To generate a single
| token, every single page in the model file needs to be loaded.
| What this means is that first runs that load from spinning disk
| are still going to be slow, even though the average case has
| greatly improved.
|
| If I understand correctly, this only provides a speed up when
| the model is already in the OS's file system cache, and in any
| case you still have to load the entire model into memory.
| jart wrote:
| Author here. It does speed things up. It's just that a 2
| second improvement doesn't mean much if the disk reads take
| 60 seconds. When it comes to disk though there's something
| far more important happening here. When you read() or
| memcpy() a 12GB file into memory, you need at least 24GB of
| RAM, because the copied memory pages are competing with the
| kernel's file cache pages. Once you go
| over that, the kernel is going to delete the file cache,
| thereby ensuring you have to do a full disk read every time.
| Using mmap() ensures the kernel knows they're both the same
| thing. That means you should be able to run models that are
| 2x larger than before, without compromising system stability.
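|
| As a minimal sketch of that difference (simplified, not the
| actual llama.cpp code; "model.bin" is a placeholder name):
|
|   #include <fcntl.h>
|   #include <stdio.h>
|   #include <sys/mman.h>
|   #include <sys/stat.h>
|   #include <unistd.h>
|
|   int main(void) {
|     int fd = open("model.bin", O_RDONLY);
|     struct stat st;
|     if (fd < 0 || fstat(fd, &st) < 0) { perror("open"); return 1; }
|
|     /* copying approach: malloc() + read() duplicates the file,
|        so the heap copy competes with the kernel's page cache:
|          char *buf = malloc(st.st_size);
|          read(fd, buf, st.st_size);                             */
|
|     /* mmap() approach: the mapping *is* the page cache, so the
|        weights are resident only once and repeated runs reuse
|        the cached pages instead of hitting the disk again */
|     unsigned char *weights = mmap(NULL, st.st_size, PROT_READ,
|                                   MAP_SHARED, fd, 0);
|     if (weights == MAP_FAILED) { perror("mmap"); return 1; }
|     printf("first byte: %d\n", weights[0]);
|     munmap(weights, st.st_size);
|     close(fd);
|     return 0;
|   }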
| Tepix wrote:
| Hi, can something along these lines also be used to speed
| up loading of models running in the GPU memory?
|
| With a GDDR6X memory bandwidth of 936 GB/s and a PCIe 4.0 x16
| bandwidth of ~32 GB/s per direction, loading something like
| 20GB into the VRAM of an RTX 3090 shouldn't take much longer
| than half a second or so, right? (assuming it is in the kernel
| cache)
| ithkuil wrote:
| Could reading without the cache (e.g. O_DIRECT) be used to
| achieve a similar effect?
| dekhn wrote:
| O_DIRECT is absurdly slow in my experience.
| jart wrote:
| Well I think that would guarantee you always hit disk.
| The goal here is to be able to run the
| llama.cpp command repeatedly, possibly in shell scripts,
| and rely on the file caches being available so that the
| command loads instantly (we're talking like 1 second
| rather than 60 seconds to run, because disk is 100x
| slower).
| adultSwim wrote:
| "We unfortunately aren't getting any additional gains from lazy
| page loading, since this is a dense model. To generate a single
| token, every single page in the model file needs to be loaded.
| What this means is that first runs that load from spinning disk
| are still going to be slow, even though the average case has
| greatly improved"
|
| https://github.com/ggerganov/llama.cpp/issues/91#issuecommen...
| jart wrote:
| You still get the gain of zero loading time. The issue with
| loading a model the old way is that 100% of the file has to
| be copied into memory. This change means we don't have to
| copy at all. The file cache pages can be made directly accessible
| to the matmul ops. You could have ten of these processes
| running at once, and they'd all share the same memory.
| However those pages still have to be pulled into memory. What
| I was hoping for, when I said that, is it's possible in some
| cases for mmap() to make things even faster than this change
| managed to accomplish. Some scientific computing applications
| use sparse datasets. If you skip loading, and use mmap(),
| then the system will only load pages as-needed on a 4096-byte
| basis. If the data were actually sparsely used, then with
| mmap() you could for instance load a 1TB file into memory on
| a system with 32GB of RAM, and no file cache, only touch a
| small portion, and it'll go fast and barely use any memory at
| all. That's not the case here sadly. It's still however a big
| improvement.
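|
| For the sparse case, a toy illustration (hypothetical file
| name, and not what happens with a dense LLaMA model): map a
| huge file and touch one location, and the kernel faults in
| only that page.
|
|   #include <fcntl.h>
|   #include <stdio.h>
|   #include <sys/mman.h>
|   #include <sys/stat.h>
|   #include <unistd.h>
|
|   int main(void) {
|     /* hypothetical huge dataset; only the touched page is read */
|     int fd = open("huge_sparse_dataset.bin", O_RDONLY);
|     struct stat st;
|     if (fd < 0 || fstat(fd, &st) < 0) { perror("open"); return 1; }
|     unsigned char *p = mmap(NULL, st.st_size, PROT_READ,
|                             MAP_SHARED, fd, 0);
|     if (p == MAP_FAILED) { perror("mmap"); return 1; }
|     /* touching one byte pulls in a single 4096-byte page, not
|        the whole file; a dense model touches every page anyway,
|        which is why lazy loading doesn't help here */
|     printf("%d\n", p[st.st_size / 2]);
|     munmap(p, st.st_size);
|     close(fd);
|     return 0;
|   }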
| dekhn wrote:
| There are some cool ideas in here. I've long been curious why
| people don't use mmap to re-use all those wonderful pages that
| got loaded (without reparsing the disk data).
| liuliu wrote:
| That's actually the point of safetensors format, although it
| needs to take care of alignment. I actually use this technique
| to mmap directly to GPU buffer on Apple systems (although it
| seems to segfault on iOS 15.x systems, only supported on 16.x
| and above).
| arthurcolle wrote:
| Can you explain how this is possible to do? Sorry, I haven't
| gotten this low-level with any of these models before and I'd
| really appreciate help understanding this.
| delusional wrote:
| This has nothing to do with the models, just standard *nix
| stuff. If you mmap the file readonly the pages can be shared
| by multiple processes without duplication since they are
| guaranteed to be the same.
| dekhn wrote:
| To run inference, you need to load a model from disk into
| RAM. Usually the model is written in a disk format that is
| convenient but needs to be parsed at runtime into a RAM-
| resident C application data structure.
|
| In this case, it looks like jart@ modified malloc to capture
| the memory generated by the loading process and serialized
| that to disk. When you run, the application calls mmap to
| make a virtual memory association with the bytes on disk, so
| any time you access RAM and it's not yet loaded, it gets
| loaded from disk. At that point it gets saved by the kernel
| in a page cache and since the files on disk don't change,
| those pages can stay in memory longer than the process. So
| when the process restarts, all those RAM requests are
| immediately mapped to already-cached virtual memory, rather
| than reading from disk.
|
| The inference library here supports a data pointer that would
| point to the memory mapped location.
|
| This is faster than relying on the kernel's disk read cache;
| in that case, you'd still need to convert the data from the
| disk format to the in-memory format.
|
| Normally the data build process is run as an external program
| that writes the mmap-ready structure to disk (an example is
| the BLAST program which writes the DNA sequence data into an
| index structure that is mmapped at runtime). But in this case
| it looks like using an instrumented malloc() helps simplify
| the process of building the disk structure.
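|
| A tiny sketch of the reload half of that idea (my own
| simplification, not jart's actual code; the file name and the
| fixed base address are made up, and mapping at a fixed address
| is just one way to keep any pointers stored inside the
| captured memory valid):
|
|   #include <fcntl.h>
|   #include <stdio.h>
|   #include <sys/mman.h>
|   #include <sys/stat.h>
|   #include <unistd.h>
|
|   /* made-up base address; the capture step and the reload
|      step just have to agree on where the region lives */
|   #define ARENA_BASE ((void *)0x200000000000ULL)
|
|   int main(void) {
|     int fd = open("llama_state.bin", O_RDONLY);  /* captured heap */
|     struct stat st;
|     if (fd < 0 || fstat(fd, &st) < 0) { perror("open"); return 1; }
|     /* MAP_FIXED puts the pages back where the data structures
|        were originally built, so interior pointers still work */
|     void *arena = mmap(ARENA_BASE, st.st_size, PROT_READ,
|                        MAP_PRIVATE | MAP_FIXED, fd, 0);
|     if (arena == MAP_FAILED) { perror("mmap"); return 1; }
|     /* ...hand the in-memory model structures straight to the
|        inference code: no parsing, no copying... */
|     munmap(arena, st.st_size);
|     close(fd);
|     return 0;
|   }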
| selfhoster11 wrote:
| Thank you for taking your time to write this up. For
| someone with a new-found interest in AI, this is invaluable
| information.
| arthurcolle wrote:
| Thank you for taking the time to write this out. Very
| helpful for understanding. I remember using malloc to build
| out C data structures in my coursework but I must admit I
| haven't really done much practical work at this level.
| Thanks again, you are a scholar.
| csmpltn wrote:
| > "I've long been curious why people don't use mmap to re-use
| all those wonderful pages that got loaded"
|
| That's because the people that develop these models are often
| data scientists and have little to no experience with systems
| programming, optimizations, etc.
| dekhn wrote:
| shhh, you're getting close to exposing my guaranteed
| employment secrets
| 0cf8612b2e1e wrote:
| How does jart find the time? Just a brilliant engineer who is
| seemingly cranking out amazing projects on a regular basis.
| ipaddr wrote:
| Works at facebook and can't find work?
| Cyph0n wrote:
| A 100x developer if I've ever seen one.
| jart wrote:
| You can thank Mozilla's MIECO program, my GitHub sponsors, and
| my Patreon supporters. See my recent blog post:
| https://justine.lol/rusage/#funding It's thanks to them that
| I'm able to keep doing what I'm doing.
| meghan_rain wrote:
| I welcome all progress, but I don't see why these models aren't
| simply run on a thin Python server that loads the model into
| memory once, so that you can curl it instantly whenever you
| want.
| kkielhofner wrote:
| Because it's a waste for anything other than proof of
| concept/handful of users.
|
| It's really simple to take some python inference code and wrap
| FastAPI around it.
|
| However, inference servers exist for a reason. You'll quickly
| find that performance, VRAM usage, model management, etc.
| aren't practical with the FastAPI approach.
|
| Speaking personally, inference servers like Nvidia Triton
| deliver performance that is absolutely night and day vs the
| FastAPI approach - in many cases orders of magnitude better
| response times and requests per second.
| sankha93 wrote:
| Can you list the concrete problems a FastAPI approach will
| have, and what tools like Nvidia Triton do differently to get
| around it? I have no idea about running such models at scale.
| kkielhofner wrote:
| Sure!
|
| FastAPI loads a model statically on startup. There are some
| hacks to reload versions and new models via load balancers,
| etc., but they're just that - hacks. There are also known
| issues, with TensorFlow especially, of poor memory management
| as the request count grows.
|
| FastAPI is great but at the end of the day it's Python and
| the performance reflects that (more on this later).
|
| With Nvidia Triton you get:
|
| - Automatic support for various model frameworks/formats:
| native PyTorch/TensorFlow, ONNX, and more.
|
| - Dynamic batching. You can configure an SLA with max
| additional latency for response time, where Triton will
| queue requests from multiple clients over a given time
| window and pass them through the model as a single batch.
| If you have the VRAM (you should), it's an instant
| performance multiplier.
|
| - Even better performance: Triton can do things like
| automatically compile/convert a model to TensorRT on the
| runtime hardware. This allows you to deploy models across
| hardware families with optimized performance while not
| worrying about the specific compute architecture or dealing
| with TensorRT itself.
|
| - Optimized and efficient use of multiple GPUs.
|
| - Model version management. Triton has a model management
| API you can use to upload a new model/version and load it
| dynamically. It can hot load/reload a model and serve it
| instantly, with configuration options for always serving
| the latest model or allowing client to request a specific
| version.
|
| - Performance metrics. It has built in support for
| Prometheus.
|
| - Other tools like Model Navigator and Performance
| Analyzer. You can pass a model to these tools and they will
| try every possible model format, batch size, etc, etc
| against an actual Triton server and produce a report and
| optimized model configuration based on your selected
| parameters - requests per second, response time, etc. Even
| memory/compute utilization, power usage, and more.
|
| - Out of the box, without any of these tricks, Triton is
| faster and uses less memory, less GPU compute, and less CPU
| compute. It's written in C++ and optimized by Nvidia.
|
| - It's a single implementation (often a container) that from
| the get-go is smaller, lighter weight, and easier to manage
| than pip-installing a bunch of dependencies and the entire
| runtime framework itself. It exists solely to serve models
| and serve them well.
|
| When you add it up (as I mentioned) I've personally seen
| cases where requests per second increase by orders of
| magnitude with lower response times than a single request
| against FastAPI (or similar). Plus all of the mlops and
| metrics features.
|
| Frankly, it's pretty amazing.
| woodson wrote:
| Not GP, but what NVidia Triton can do includes
|
| - Dynamic batching while limiting latency to a set
| threshold
|
| - Running multiple instances of a model, effectively load-
| balancing inference requests.
|
| - Loading/unloading/running multiple versions of models
| dynamically, which is useful if you want to update (or roll
| back) your model while not interfering with existing
| inference requests.
|
| Its client provides async-based inference APIs, so you can
| easily put a FastAPI-based API server in front and don't
| necessarily need a queue (like Celery).
| dekhn wrote:
| AKA, "model serving".
|
| The reason is that the kernel gives you this feature and it's
| really powerful, so why not take advantage of it?
|
| During dev work you often want an easily restarted stack (code
| changes). Anyway, if you use this approach, you can just have
| the python server stay resident and then have another app mmap
| it (shared memory) instead of doing inference over an API,
| which is always awkward.
| mungoman2 wrote:
| Yes tbh this is the right answer
| eternalban wrote:
| If you want to avoid twitter this discusses the changes:
|
| https://github.com/ggerganov/llama.cpp/issues/91
___________________________________________________________________
(page generated 2023-03-17 23:01 UTC)