[HN Gopher] GGUF, the Long Way Around
       ___________________________________________________________________
        
       GGUF, the Long Way Around
        
       Author : Tomte
       Score  : 96 points
       Date   : 2024-02-29 19:36 UTC (3 hours ago)
        
 (HTM) web link (vickiboykis.com)
 (TXT) w3m dump (vickiboykis.com)
        
       | skadamat wrote:
       | This is an excellent deep dive! Love the depth here Vicki
        
       | cooper_ganglia wrote:
       | I've been looking for a good resource on GGUF for the past week
       | or so, the timing on this is awesome! Thanks!
        
       | RicoElectrico wrote:
        | Since LLM architectures differ only slightly from one
        | another, would it make sense to just embed the model,
        | compiled to some sort of simple bytecode, right in the GGUF
        | file? Then you would only implement specific new operations
        | when researchers come up with a new model that gains enough
        | traction to be of interest.
        
         | sroussey wrote:
         | Yeah, but you want to avoid remote code execution:
         | 
         | https://www.bleepingcomputer.com/news/security/malicious-ai-...
        
           | RicoElectrico wrote:
            | The bytecode would not even need to be Turing-complete.
            | Or maybe it could take inspiration from eBPF, which gives
            | some safety guarantees. What you posted is related to the
            | design oversight in Python's pickle format.
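A minimal sketch of the pickle oversight being referenced: deserialization itself can invoke an arbitrary importable callable, because `__reduce__` lets an object specify a `(callable, args)` pair to run at load time. A harmless `list` call stands in below for something like `os.system`.

```python
import pickle

class Gadget:
    """Illustration only: any object can steer what unpickling executes."""
    def __reduce__(self):
        # Unpickling will call list(range(3)); an attacker could name
        # any importable callable here instead, e.g. os.system.
        return (list, (range(3),))

payload = pickle.dumps(Gadget())
result = pickle.loads(payload)   # executes the chosen callable on load
print(result)
```

This is why pickle is unsafe for untrusted model files, and why a deliberately restricted (non-Turing-complete, eBPF-style verified) bytecode is a different design point.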
        
             | sroussey wrote:
             | I think ONNX does what you say.
        
         | liuliu wrote:
          | Not really. We've been down that road before. Embedding the
          | computation graph in the file makes changes to the graph
          | harder (you need to make sure it stays backward compatible).
          | That's OK in general (we have ONNX already), but once you
          | have dynamic shapes, and given that the different
          | optimizations we implement are actually tied to the
          | computation graph, it's simply not optimal. (BTW, this is
          | why PyTorch just embeds the code in the .pth file: much
          | easier to keep backward compatible than a static
          | computation graph.)
        
         | rahimnathwani wrote:
          | It seems like a lot of the innovation is around training,
          | no? GGML (the library that reads the GGUF format) supports
          | these values for the required 'general.architecture' key:
          | 
          |     llama, mpt, gptneox, gptj, gpt2, bloom, falcon, rwkv
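For illustration, a rough sketch of how that metadata key is laid out on disk. The field layout and the string type code (8) follow one reading of the GGUF v3 spec, so treat them as assumptions and check ggml's documentation before relying on them.

```python
import struct

# Assumed GGUF v3 layout: magic, u32 version, u64 tensor count,
# u64 metadata-kv count, then (key, value-type, value) triples.
GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # assumed type code for string metadata values

def write_minimal_gguf(path, architecture):
    """Write a GGUF file with zero tensors and one metadata key."""
    key = b"general.architecture"
    val = architecture.encode()
    with open(path, "wb") as f:
        f.write(GGUF_MAGIC)
        f.write(struct.pack("<I", 3))    # format version
        f.write(struct.pack("<Q", 0))    # tensor count
        f.write(struct.pack("<Q", 1))    # metadata kv count
        f.write(struct.pack("<Q", len(key)) + key)     # key: u64 len + bytes
        f.write(struct.pack("<I", GGUF_TYPE_STRING))   # value type
        f.write(struct.pack("<Q", len(val)) + val)     # value: u64 len + bytes

def read_architecture(path):
    """Read back the single string metadata value."""
    with open(path, "rb") as f:
        assert f.read(4) == GGUF_MAGIC
        version, = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
        klen, = struct.unpack("<Q", f.read(8))
        key = f.read(klen).decode()
        vtype, = struct.unpack("<I", f.read(4))
        vlen, = struct.unpack("<Q", f.read(8))
        return key, f.read(vlen).decode()

write_minimal_gguf("demo.gguf", "llama")
print(read_architecture("demo.gguf"))
```

Note how everything is fixed-width little-endian integers plus length-prefixed strings, which is what makes the format easy to parse from C without a JSON library.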
        
       | tbalsam wrote:
        | Llama.cpp has a ton of clone-and-own boilerplate, I think,
        | presumably from having grown so quickly (one of their .cu
        | files is over 10k lines at the moment, roughly).
        | 
        | While I haven't dug into the model storage and distribution
        | format myself, the rewrite to GGUF for file storage seems to
        | have been a big boon to the project. Thanks Phil! Cool stuff.
        | Also, he's a really nice guy to boot. Please say hi to him
        | from Fern if you ever run into him. I mean it literally: make
        | his life a hellish barrage of nonstop greetings from Fern.
        
         | liuliu wrote:
          | I honestly think having a way to just use JSON (a la
          | safetensors) / msgpack or some other lightweight metadata
          | serializer is a better route than coming up with a new file
          | format. That's also why I just use SQLite to serialize the
          | metadata (and the tensor weights too, though that part is
          | an oversight).
        
           | andy99 wrote:
           | Gguf is cleaner to read in languages that don't have a json
           | parsing library, and works with memory mapping in C. It's
           | very appealing for minimal inference frameworks vs other
           | options.
        
             | liuliu wrote:
              | safetensors can mmap too, because the tensor data are
              | just offsets and you are free to align them however you
              | want.
              | 
              | It is hard to keep metadata minimal, and before long you
              | will start to have many different "atom"s and end up
              | with things that mov supports but mp4 doesn't, etc. (the
              | mov format is generally well-defined and easy to parse,
              | but it's a binary format, and having to write your own
              | parser is not a pleasant experience).
              | 
              | If you just want minimal dependencies, flatbuffers,
              | capnproto, and JSON are all well supported on many
              | platforms.
        
               | jart wrote:
                | mmap() requires that you map at page-aligned
                | intervals which must be congruent with the file
                | offset. You can't just round down, because some GPU
                | APIs like Metal require that the data pointers
                | themselves be page-aligned too.
        
               | liuliu wrote:
                | Yeah, safetensors separates the metadata and the
                | tensor data. The metadata holds offset references into
                | the tensor data that you are free to define yourself.
                | That way, you can create files in the safetensors
                | format where the tensor data itself sits at
                | page-aligned offsets.
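A sketch of that idea, under the assumption that a safetensors file is a little-endian u64 header length, a JSON header, then raw tensor bytes, and that the format tolerates trailing whitespace in the JSON. Padding the header pushes the data section to a page-aligned file offset, which satisfies mmap's alignment requirement without rounding down.

```python
import json, mmap, struct

PAGE = mmap.ALLOCATIONGRANULARITY  # alignment unit mmap offsets must obey

def write_aligned_safetensors(path, name, data):
    """Hypothetical sketch: pad the JSON header with spaces so the
    tensor data section starts at a page-aligned file offset."""
    header = {name: {"dtype": "U8", "shape": [len(data)],
                     "data_offsets": [0, len(data)]}}
    blob = json.dumps(header).encode()
    pad = (-(8 + len(blob))) % PAGE    # pad to the next page boundary
    blob += b" " * pad
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))  # u64 header length
        f.write(blob)                          # padded JSON header
        f.write(data)                          # raw tensor bytes
    return 8 + len(blob)                       # page-aligned data start

data_start = write_aligned_safetensors("demo.safetensors", "t", b"\x01" * 16)

# Because the offset is page aligned, the data section can be mapped
# directly: no copying, and no rounding down into the JSON header.
with open("demo.safetensors", "rb") as f:
    m = mmap.mmap(f.fileno(), 16, access=mmap.ACCESS_READ, offset=data_start)
    first = bytes(m[:4])
    m.close()
```

The same trick is what lets GPU backends that demand page-aligned data pointers (like the Metal case mentioned above) consume the mapping directly.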
        
       | andy99 wrote:
       | > GPT-Generated Unified Format
       | 
       | GG is Georgi Gerganov
        
       ___________________________________________________________________
       (page generated 2024-02-29 23:00 UTC)