[HN Gopher] Load LLaMA Models Instantly
       ___________________________________________________________________
        
       Load LLaMA Models Instantly
        
       Author : polyrand
       Score  : 134 points
       Date   : 2023-03-17 16:39 UTC (6 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | turnsout wrote:
       | "It's a ~200 LOC change that only took me a few hours and it
       | worked the first time."
       | 
       | jart is on another plane of existence--and 100% earned this flex
        
       | superkuh wrote:
        | Woo! This sounds like it will make it easier to run in normal
        | mode (i.e., not interactive) and manage the chat history
        | yourself, now that there's less penalty for a full program
        | reload. Currently my perl irc bot wrapper for llama.cpp just
        | open2's the program in interactive mode (-i) and reads/writes
        | to llama.cpp's stdout/stdin to get the loading-time savings of
        | having it manage history and keep state. In one-shot mode
        | there'd still be the "extra" inference time of processing the
        | full history each time instead of saving state like interactive
        | mode does, but the memory load time matters just as much.
       | 
        | For me personally this matters most because right now, when
        | llama.cpp runs out of its 2048 tokens, it segfaults, which
        | causes difficulties. In interactive mode, if it goes off the
        | rails and generates 1000 tokens of nonsense, that nonsense
        | takes up tokens for the next line from chat. In normal mode,
        | where it runs once and all history has to be supplied manually,
        | this can be avoided.
        
         | jart wrote:
          | Indeed. I'm working on persuading the author of this project
          | to make it shell-scriptable too. The command's default
          | behavior should be to print only the response to a prompt to
          | stdout.
        
       | adultSwim wrote:
       | I'm seeing a lot of interest in generative use cases. Has anyone
       | tried LLaMA or GPT for classification?
        
       | lxe wrote:
       | I wonder if something like this is possible for CUDA/pytorch
       | loading?
        
         | jart wrote:
          | I don't know about CUDA, but with PyTorch it's quite
          | possible. The solution should generalize to any code that
          | loads data into memory created by malloc(). You'd obviously
          | be capturing a lot of superfluous Python objects in your
          | mappable model file, but given the size of these models it
          | should be comparatively minor.
        
       | kir-gadjello wrote:
       | This is cool, but SSD read bandwidth is still the bottleneck. On
       | my non-mac machine it still takes several seconds to load the
       | model.
        
         | Tepix wrote:
          | This assumes (and makes it possible) that the model is
          | already in the kernel's page cache.
        
       | dougmwne wrote:
       | Can someone break this down? Since this seems to be inferencing
       | without having the entire model loaded into memory, is it
       | possible this could be a way to relax memory requirements of the
       | 65b model?
        
         | layer8 wrote:
         | I don't think so:
         | 
         | > the gains here are mostly due to not copying memory anymore,
         | and better cooperation with the kernel's page manager. We
         | unfortunately aren't getting any additional gains from lazy
         | page loading, since this is a dense model. To generate a single
         | token, every single page in the model file needs to be loaded.
         | What this means is that first runs that load from spinning disk
         | are still going to be slow, even though the average case has
         | greatly improved.
         | 
         | If I understand correctly, this only provides a speed up when
         | the model is already in the OS's file system cache, and in any
         | case you still have to load the entire model into memory.
        
           | jart wrote:
           | Author here. It does speed things up. It's just that a 2
           | second improvement doesn't mean much if the disk reads take
            | 60 seconds. When it comes to disk, though, there's
            | something far more important happening here. When you
            | read() or memcpy() a 12GB model file into memory, you need
            | at least 24GB of RAM, because the copied memory pages are
            | competing with the kernel's file cache pages. Once you go
           | over that, the kernel is going to delete the file cache,
           | thereby ensuring you have to do a full disk read every time.
           | Using mmap() ensures the kernel knows they're both the same
           | thing. That means you should be able to run models that are
           | 2x larger than before, without compromising system stability.
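            | 
            | A minimal sketch of the idea in plain POSIX C (not the
            | actual llama.cpp patch; "model.bin" is a placeholder
            | path):
            | 
            |   #include <fcntl.h>
            |   #include <stdio.h>
            |   #include <sys/mman.h>
            |   #include <sys/stat.h>
            |   #include <unistd.h>
            | 
            |   int main(void) {
            |       /* map the weights read-only; the pointer
            |        * aliases the kernel's page cache, so no
            |        * second copy of the file is made */
            |       int fd = open("model.bin", O_RDONLY);
            |       if (fd < 0) return 1;
            |       struct stat st;
            |       if (fstat(fd, &st)) return 1;
            |       void *p = mmap(NULL, st.st_size, PROT_READ,
            |                      MAP_SHARED, fd, 0);
            |       close(fd); /* the mapping stays valid */
            |       if (p == MAP_FAILED) return 1;
            |       /* weights are readable at p; repeated runs
            |        * reuse the already-cached pages */
            |       printf("mapped %lld bytes\n",
            |              (long long)st.st_size);
            |       munmap(p, st.st_size);
            |       return 0;
            |   }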
        
             | Tepix wrote:
             | Hi, can something along these lines also be used to speed
             | up loading of models running in the GPU memory?
             | 
             | With a GDDR6X memory bandwidth of 936 GB/s and PCIe 4.0 x16
             | bandwidth of 64GB/s loading something like 20GB into the
             | VRAM of an RTX 3090 shouldn't take longer than 1/2 a second
             | or so, right? (assuming it is in the kernel cache)
        
             | ithkuil wrote:
              | Could reading without the cache (e.g. O_DIRECT) be used
              | to achieve a similar effect?
        
               | dekhn wrote:
                | O_DIRECT is absurdly slow in my experience.
        
               | jart wrote:
                | Well, I think that would guarantee you always hit
                | disk. The goal here is to be able to run the
               | llama.cpp command repeatedly, possibly in shell scripts,
               | and rely on the file caches being available so that the
               | command loads instantly (we're talking like 1 second
               | rather than 60 seconds to run, because disk is 100x
               | slower).
        
         | adultSwim wrote:
         | "We unfortunately aren't getting any additional gains from lazy
         | page loading, since this is a dense model. To generate a single
         | token, every single page in the model file needs to be loaded.
         | What this means is that first runs that load from spinning disk
         | are still going to be slow, even though the average case has
         | greatly improved"
         | 
         | https://github.com/ggerganov/llama.cpp/issues/91#issuecommen...
        
           | jart wrote:
            | You still get the gain of zero loading time. The issue with
            | loading a model the old way is that 100% of the file has to
            | be copied into memory. This change means we don't have to
            | copy at all. The file cache pages can be made directly
            | accessible to the matmul ops. You could have ten of these
            | processes running at once, and they'd all share the same
            | memory. However, those pages still have to be pulled into
            | memory. What I was hoping for, when I said that, is that in
            | some cases mmap() can make things even faster than this
            | change managed to accomplish. Some scientific computing
            | applications use sparse datasets. If you skip loading and
            | use mmap(), the system will only load pages as needed, on a
            | 4096-byte basis. If the data were actually sparsely used,
            | then with mmap() you could, for instance, load a 1TB file
            | into memory on a system with 32GB of RAM and no file cache,
            | touch only a small portion, and it would go fast and barely
            | use any memory at all. That's not the case here, sadly.
            | It's still a big improvement, however.
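            | 
            | To illustrate what lazy faulting buys you in the sparse
            | case, a rough sketch (hypothetical file name; the strided
            | loop stands in for a genuinely sparse access pattern):
            | 
            |   #include <fcntl.h>
            |   #include <stdio.h>
            |   #include <sys/mman.h>
            |   #include <sys/stat.h>
            |   #include <unistd.h>
            | 
            |   int main(void) {
            |       /* "huge.dat" could be far larger than RAM */
            |       int fd = open("huge.dat", O_RDONLY);
            |       if (fd < 0) return 1;
            |       struct stat st;
            |       if (fstat(fd, &st)) return 1;
            |       unsigned char *p = mmap(NULL, st.st_size,
            |                               PROT_READ, MAP_SHARED,
            |                               fd, 0);
            |       close(fd);
            |       if (p == MAP_FAILED) return 1;
            |       /* only the 4096-byte pages we actually touch
            |        * get faulted in from disk; the rest of the
            |        * file is never read */
            |       long sum = 0;
            |       for (off_t i = 0; i < st.st_size; i += 1 << 30)
            |           sum += p[i];   /* one byte per 1 GiB */
            |       printf("%ld\n", sum);
            |       munmap(p, st.st_size);
            |       return 0;
            |   }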
        
       | dekhn wrote:
       | There are some cool ideas in here. I've long been curious why
       | people don't use mmap to re-use all those wonderful pages that
       | got loaded (without reparsing the disk data).
        
         | liuliu wrote:
         | That's actually the point of safetensors format, although it
         | needs to take care of alignment. I actually use this technique
         | to mmap directly to GPU buffer on Apple systems (although it
         | seems to segfault on iOS 15.x systems, only supported on 16.x
         | and above).
        
         | arthurcolle wrote:
         | Can you explain how this is possible to do? Sorry I haven't
         | gotten this low level with any of these models before and I'd
         | really appreciate how to understand this.
        
           | delusional wrote:
            | This has nothing to do with the models, just standard *nix
            | stuff. If you mmap the file read-only, the pages can be
            | shared by multiple processes without duplication, since
            | they are guaranteed to be the same.
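            | 
            | A toy illustration of that sharing, assuming a
            | hypothetical "model.bin" file: the parent, the child, and
            | any unrelated process mapping the same file all read the
            | same physical page-cache pages.
            | 
            |   #include <fcntl.h>
            |   #include <stdio.h>
            |   #include <sys/mman.h>
            |   #include <sys/stat.h>
            |   #include <sys/wait.h>
            |   #include <unistd.h>
            | 
            |   int main(void) {
            |       int fd = open("model.bin", O_RDONLY);
            |       if (fd < 0) return 1;
            |       struct stat st;
            |       if (fstat(fd, &st)) return 1;
            |       const unsigned char *p =
            |           mmap(NULL, st.st_size, PROT_READ,
            |                MAP_SHARED, fd, 0);
            |       close(fd);
            |       if (p == MAP_FAILED) return 1;
            |       if (fork() == 0) {
            |           /* child: same physical pages, no copy */
            |           printf("child sees %d\n", p[0]);
            |           _exit(0);
            |       }
            |       printf("parent sees %d\n", p[0]);
            |       wait(NULL);
            |       return 0;
            |   }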
        
           | dekhn wrote:
           | To run inference, you need to load a model from disk into
           | RAM. Usually the model is written in a disk format that is
           | convenient but needs to be parsed at runtime into a RAM-
           | resident C application data structure.
           | 
           | In this case, it looks like jart@ modified malloc to capture
           | the memory generated by the loading process and serialized
           | that to disk. When you run, the application calls mmap to
            | make a virtual memory association with the bytes on disk,
            | so any time you access RAM that's not yet loaded, it gets
            | loaded from disk. At that point it gets saved by the kernel
           | in a page cache and since the files on disk don't change,
           | those pages can stay in memory longer than the process. So
           | when the process restarts, all those RAM requests are
           | immediately mapped to already-cached virtual memory, rather
           | than reading from disk.
           | 
           | The inference library here supports a data pointer that would
           | point to the memory mapped location.
           | 
           | This is faster than relying on the kernel's disk read cache;
           | in that case, you'd still need to convert the data from the
           | disk format to the in-memory format.
           | 
           | Normally the data build process is run as an external program
           | that writes the mmap-ready structure to disk (an example is
           | the BLAST program which writes the DNA sequence data into an
            | index structure that is mmapped at runtime). But in this
            | case it looks like using an instrumented malloc() helps
            | simplify the process of building the disk structure.
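            | 
            | Not jart's actual implementation, but a toy sketch of that
            | general shape: a file-backed bump allocator at a fixed
            | address, so a later run can map the same file back and
            | reuse the saved pointers. (The fixed base address, the
            | file layout, and the missing bookkeeping are simplifying
            | assumptions.)
            | 
            |   #include <fcntl.h>
            |   #include <stddef.h>
            |   #include <sys/mman.h>
            |   #include <unistd.h>
            | 
            |   /* fixed base so pointers stored inside the arena
            |    * stay valid across runs; a real implementation
            |    * must deal with ASLR and address collisions */
            |   #define ARENA_BASE ((void *)0x200000000000ULL)
            |   #define ARENA_SIZE ((size_t)1 << 30)  /* 1 GiB */
            | 
            |   static unsigned char *arena;
            |   static size_t arena_used;
            | 
            |   /* map (or create) the file that backs the arena;
            |    * writes land in the file, so the "loaded" state
            |    * persists and can simply be re-mapped next run */
            |   static int arena_open(const char *path) {
            |       int fd = open(path, O_RDWR | O_CREAT, 0644);
            |       if (fd < 0) return -1;
            |       if (ftruncate(fd, ARENA_SIZE)) {
            |           close(fd);
            |           return -1;
            |       }
            |       arena = mmap(ARENA_BASE, ARENA_SIZE,
            |                    PROT_READ | PROT_WRITE,
            |                    MAP_SHARED | MAP_FIXED, fd, 0);
            |       close(fd);
            |       return arena == MAP_FAILED ? -1 : 0;
            |   }
            | 
            |   /* bump allocator standing in for the instrumented
            |    * malloc() used during the model-loading pass;
            |    * arena_used would also need to be persisted */
            |   static void *arena_alloc(size_t n) {
            |       void *p = arena + arena_used;
            |       arena_used += (n + 15) & ~(size_t)15;
            |       return p;
            |   }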
        
             | selfhoster11 wrote:
              | Thank you for taking the time to write this up. For
             | someone with a new-found interest in AI, this is invaluable
             | information.
        
             | arthurcolle wrote:
             | Thank you for taking the time to write this out. Very
             | helpful for understanding. I remember using malloc to build
             | out C data structures in my coursework but I must admit I
             | haven't really done much practical work at this level.
             | Thanks again, you are a scholar.
        
         | csmpltn wrote:
         | > "I've long been curious why people don't use mmap to re-use
         | all those wonderful pages that got loaded"
         | 
         | That's because the people that develop these models are often
         | data scientists and have little to no experience with systems
         | programming, optimizations, etc.
        
           | dekhn wrote:
           | shhh, you're getting close to exposing my guaranteed
           | employment secrets
        
       | 0cf8612b2e1e wrote:
       | How does jart find the time? Just a brilliant engineer who is
       | seemingly cranking out amazing projects on a regular basis.
        
         | ipaddr wrote:
         | Works at facebook and can't find work?
        
         | Cyph0n wrote:
         | A 100x developer if I've ever seen one.
        
         | jart wrote:
         | You can thank Mozilla's MIECO program, my GitHub sponsors, and
         | my Patreon supporters. See my recent blog post:
         | https://justine.lol/rusage/#funding It's thanks to them that
         | I'm able to keep doing what I'm doing.
        
       | meghan_rain wrote:
        | I welcome all progress, but I don't see why these models
        | aren't simply run behind a thin Python server that loads the
        | model into memory once, so you can curl it instantly whenever
        | you want.
        
         | kkielhofner wrote:
         | Because it's a waste for anything other than proof of
         | concept/handful of users.
         | 
         | It's really simple to take some python inference code and wrap
         | FastAPI around it.
         | 
          | However, inference servers exist for a reason. You'll quickly
          | find that performance, VRAM usage, model management, etc.
          | aren't practical with the FastAPI approach.
          | 
          | Speaking personally, inference servers like Nvidia Triton
          | bring a model to performance levels that are absolutely night
          | and day vs the FastAPI approach - in many cases orders of
          | magnitude better response times and requests per second.
        
           | sankha93 wrote:
           | Can you list the concrete problems a FastAPI approach will
           | have, and what tools like Nvidia Triton do differently to get
           | around it? I have no idea about running such models at scale.
        
             | kkielhofner wrote:
             | Sure!
             | 
              | FastAPI loads a model statically on startup. There are
              | some hacks to reload versions and new models via things
              | like load balancers, etc., but they're just that - hacks.
              | There are also known issues with TensorFlow in particular
              | having poor memory management as request counts grow.
             | 
             | FastAPI is great but at the end of the day it's Python and
             | the performance reflects that (more on this later).
             | 
             | With Nvidia Triton you get:
             | 
             | - Automatic support for various model frameworks/formats:
             | native PyTorch/TensorFlow, ONNX, and more.
             | 
              | - Dynamic batching. You can configure an SLA with a
              | maximum additional response-time latency, within which
              | Triton will queue requests from multiple clients over a
              | given time period and pass them through the model as a
              | single batch. If you have the VRAM (you should), it's an
              | instant performance multiplier.
             | 
             | - Even better performance: Triton can do things like
             | automatically compile/convert a model to TensorRT on the
             | runtime hardware. This allows you to deploy models across
             | hardware families with optimized performance while not
             | worrying about the specific compute architecture or dealing
             | with TensorRT itself.
             | 
             | - Optimized and efficient use of multiple GPUs.
             | 
             | - Model version management. Triton has a model management
             | API you can use to upload a new model/version and load it
             | dynamically. It can hot load/reload a model and serve it
             | instantly, with configuration options for always serving
             | the latest model or allowing client to request a specific
             | version.
             | 
             | - Performance metrics. It has built in support for
             | Prometheus.
             | 
             | - Other tools like Model Navigator and Performance
             | Analyzer. You can pass a model to these tools and they will
             | try every possible model format, batch size, etc, etc
             | against an actual Triton server and produce a report and
             | optimized model configuration based on your selected
             | parameters - requests per second, response time, etc. Even
             | memory/compute utilization, power usage, and more.
             | 
              | - Out of the box, without any of these tricks, Triton is
              | faster and uses less memory, less GPU compute, and less
              | CPU compute. It's written in C++ and optimized by Nvidia.
             | 
             | - It's a single implementation (often container) that from
             | the get go is smaller, lighter weight, and easier to manage
             | than pip installing a bunch of dependencies and the entire
             | runtime framework itself. It exists solely to serve models
             | and serve them well.
             | 
             | When you add it up (as I mentioned) I've personally seen
             | cases where requests per second increase by orders of
             | magnitude with lower response times than a single request
             | against FastAPI (or similar). Plus all of the mlops and
             | metrics features.
             | 
             | Frankly, it's pretty amazing.
        
             | woodson wrote:
             | Not GP, but what NVidia Triton can do includes
             | 
             | - Dynamic batching while limiting latency to a set
             | threshold
             | 
             | - Running multiple instances of a model, effectively load-
             | balancing inference requests.
             | 
             | - Loading/unloading/running multiple versions of models
             | dynamically, which is useful if you want to update (or roll
             | back) your model while not interfering with existing
             | inference requests.
             | 
             | Its client provides async based inference APIs, so you can
             | easily put a FastAPI-based API server in front and don't
             | necessarily need a queue (like Celery).
        
         | dekhn wrote:
         | AKA, "model serving".
         | 
          | The reason is that the kernel gives you this feature and it's
         | really powerful, so why not take advantage of it?
         | 
         | During dev work you often want an easily restarted stack (code
         | changes). Anyway, if you use this approach, you can just have
          | the Python server stay resident and then have another app
          | mmap it (shared memory) instead of doing inference over an
          | API, which is always awkward.
        
         | mungoman2 wrote:
         | Yes tbh this is the right answer
        
       | eternalban wrote:
        | If you want to avoid Twitter, this discusses the changes:
       | 
       | https://github.com/ggerganov/llama.cpp/issues/91
        
       ___________________________________________________________________
       (page generated 2023-03-17 23:01 UTC)