[HN Gopher] Show HN: Prompts as WASM Programs
___________________________________________________________________
Show HN: Prompts as WASM Programs
AICI is a proposed common interface between LLM inference engines
(llama.cpp, vLLM, HF Transformers, etc.) and "controllers" -
programs that can constrain the LLM output according to regexp,
grammar, or custom logic, as well as control the generation process
(forking, backtracking, etc.). AICI is based on Wasm, and is
designed to be fast (runs on CPU while GPU is busy), secure (can
run in multi-tenant cloud deployments), and flexible (allow
libraries like Guidance, LMQL, Outlines, etc. to work on top of
it). We (Microsoft Research) have released it recently, and would
love feedback on the design of the interface, as well as our Rust
AICI runtime. I'm the lead developer on this project and happy to
answer any questions!
Author : mmoskal
Score : 110 points
Date : 2024-03-11 17:00 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| simonw wrote:
| This all clicked for me when I got to this example in the README:
| async def main():
|     # This is the prompt we want to run.
|     # Note how the prompt doesn't mention a number of vehicles
|     # or how to format the result.
|     prompt = "What are the most popular types of vehicles?\n"
|
|     # Tell the model to generate the prompt string, ie. let's
|     # start with the prompt "to complete"
|     await aici.FixedTokens(prompt)
|
|     # Store the current position in the token generation process
|     marker = aici.Label()
|
|     for i in range(1, 6):
|         # Tell the model to generate the list number
|         await aici.FixedTokens(f"{i}.")
|
|         # Wait for the model to generate a vehicle name and end
|         # with a new line
|         await aici.gen_text(stop_at="\n")
|
|     await aici.FixedTokens("\n")
|
|     # Store the tokens generated in a result variable
|     aici.set_var("result", marker.text_since())
|
| aici.start(main())
|
| This is a similar pattern to llama.cpp grammars - it's a way to
| directly manipulate the token generation phase, such that you can
| occasionally take over from the model and say "here we are
| outputting '1.' - then back to you for the bullet item".
|
| The most obvious usage of this is forcing a model to output valid
| JSON - including JSON that exactly matches a given schema (like
| OpenAI Functions if they were 100% reliable).
|
| That Python code is a really elegant piece of API design.
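|
| A rough sketch of that JSON idea (not from the README; it only
| assumes the pyctrl calls shown above, FixedTokens and gen_text
| with a regex plus a max_tokens cap, and a made-up function
| name):
|
|     async def car_json():
|         await aici.FixedTokens(
|             "Describe a vehicle as JSON with name and wheels.\n")
|         # Emit the fixed scaffolding of the object ourselves...
|         await aici.FixedTokens('{"name": "')
|         # ...and only let the model fill values, constrained by regex
|         await aici.gen_text(regex=r"[A-Za-z ]+", max_tokens=10)
|         await aici.FixedTokens('", "wheels": ')
|         await aici.gen_text(regex=r"[0-9]+", max_tokens=3)
|         await aici.FixedTokens("}\n")
|
|     aici.start(car_json())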
| winwang wrote:
| And now I'm waiting for the "LLM" monad.
| nerpderp82 wrote:
| Wasm is monadic.
| viksit wrote:
| could you eli5 how wasm being monadic is helpful / and how
| an llm monad would be too?
| zarathustreal wrote:
| It's just a useful way to communicate about it, giving it
| a name. If you're knee-deep in OOP land you can just call
| it a "builder"
| winwang wrote:
| I mean a lawful(ish) monadic type/api, i.e. `LLM a` in FP-
| land. Otherwise, every sequential machine/program can be
| said to be monadic.
| refulgentis wrote:
| llama.cpp is orders of magnitude easier. Rather than
| controlling token by token, with an imperative statement for
| each, we create a static grammar to describe e.g. a JSON schema.
|
| I'm honestly unsure what this offers over that, especially
| because I'm one of 3 groups with a WASM llama.cpp, and you can
| take it from me, you don't want to use it. (~3 tokens/sec with
| a 3B model on an M2 Max/Ultra/whatever they call top of line
| for MBP. About 2% of perf of Metal, and I'd bet 10% of running
| on CPU without WASM. And there's no improvement in sight)
| simonw wrote:
| I find these imperative statements much less intimidating
| than llama.cpp grammars - the Python code I copied here looks
| a lot more approachable to me than the GBNF syntax from
| https://raw.githubusercontent.com/ggerganov/llama.cpp/master...
|
| I don't think the key idea here is to run llama.cpp itself in
| WASM - it's to run LLMs in native code, but have fast custom-
| written code from end-users that can help pick the next
| token. WASM is a neat mechanism for that because many
| different languages can use it as a compile target, and it
| comes with a robust sandbox by default.
| mmoskal wrote:
| It's only the controller that runs in Wasm, not the
| inference.
|
| The pyctrl is just a sample controller; you can write a
| controller that takes any kind of grammar (e.g., a yacc
| grammar [0] - the Python code in that example is only used
| for gluing).
|
| Llama.cpp grammars were quite slow in my testing (20ms per
| token or so, compared to 2ms for the yacc grammar referenced
| above).
|
| [0] https://github.com/microsoft/aici/blob/main/controllers/pyct...
| jameshart wrote:
| Great summary - thanks. And definitely closer to the LLM api
| surface we need to really start to use these things.
|
| There's definitely a danger with this kind of code that the
| first thing this is going to generate to complete the "1."
| prompt will be something like " Truck; 2. Sedan; 3. Minivan",
| though.
| mmoskal wrote:
| You generally have to tell the LLM what you want it to
| generate and then enforce it, i.e, the LLM has to be somewhat
| aligned with the constraints. Otherwise, for example if you
| ask for JSON it will keep generating (legal) white-space or
| when you ask for C code it will say: int
| sureICanHelpYouHereIsAnExampleOfTheCodeYouWereAskingFor;
|
| In this case however, you can just do: await
| aici.gen_text(regex=r"[a-zA-Z\n]+" stop_at = "\n")
|
| Also note that there is still a lot of work to figure out how
| it's easiest for the programmer of an LLM-enabled app to
| express these things - AICI is meant to make it simple to
| implement different surface syntaxes.
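|
| For instance, a rough sketch of the list example above with
| that kind of constraint plus an aligning instruction up front
| (again only using the pyctrl calls already shown; treat it as
| illustrative, not the exact API):
|
|     async def vehicle_list():
|         # Tell the model what shape of output we expect...
|         await aici.FixedTokens(
|             "List popular vehicle types, one name per line.\n")
|         for i in range(1, 6):
|             await aici.FixedTokens(f"{i}.")
|             # ...and enforce it: letters, spaces and newline only
|             await aici.gen_text(regex=r"[a-zA-Z \n]+", stop_at="\n")
|
|     aici.start(vehicle_list())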
| giovannibonetti wrote:
| Good point. I believe it could be solved with backtracking,
| just like it is done in compilers/lexers.
| mirekrusin wrote:
| This is a great, generic idea.
|
| It's also possible to wrap it in something user-friendly à la
| [0].
|
| [0] https://github.com/ollama/ollama/issues/3019
| mmoskal wrote:
| What you describe there is a great example of a custom
| controller - could be implemented on top of pyctrl, jsctrl,
| or just natively in Rust.
| nighthawk454 wrote:
| Isn't that also similar to the (formerly?) Microsoft Guidance
| project? https://github.com/guidance-ai/guidance
| mmoskal wrote:
| We believe Guidance can run on top of AICI (we're working on
| an efficient Earley parser for that [0], together with the
| local Guidance folks). AICI is generally lower level (though
| our sample controllers are at a similar level to Guidance).
|
| [0] https://github.com/microsoft/aici/blob/main/controllers/aici...
| nighthawk454 wrote:
| Ah, that clears it up, thank you!
| dartos wrote:
| How does it compare to GBNF grammars or LMQL?
| babyshake wrote:
| Although don't you still have problems with being able to
| accurately anticipate the output for control flow purposes? In
| this example, stopping at a new line. I guess that should work
| out fine here, although if I'm understanding this paradigm
| correctly, to help ensure it you'd want to have something like
|
|     await aici.FixedTokens("Here is a bullet point list:\n\n")
|
| before entering the for-loop.
| ilaksh wrote:
| Awesome. I wonder if you could use this with a game engine or
| something. Maybe the aici module could render a Nethack screen.
| And perhaps automatically reject incorrect command keys in
| context if they are errors (if integrated deeply enough).
|
| Is it possible to combine this with some kind of reinforcement
| training system?
| reissbaker wrote:
| The easiest thing would be to use supervised finetuning on the
| LLMs you're trying to control (most open-source LLMs have some
| sort of off-the-shelf system for finetuning), combined with
| this system to control the output. I suppose there's nothing
| stopping you from writing an RL training system to alter the
| model weights other than needing to write a bunch of code,
| though... Maybe LlamaGym
| (https://github.com/KhoomeiK/LlamaGym/tree/main) could reduce
| the amount of code you need?
| gsuuon wrote:
| I'm really excited for LLMs in gaming, like an Ender's Game /
| Homeworld type game where you can shout orders and the units
| scramble around or Stellaris where you have actual discussions
| with the other factions. Local and reliable output enables that
| sort of stuff, though the perf tradeoff of LLM vs game
| rendering might be hard to deal with.
| jaan wrote:
| Does it support constrained generation during training?
|
| This is what we need for the large language models I am training
| for health care use cases.
|
| For example, constraining LLM output is currently done by
| masking, and having this Rust-based library would enable novel
| ways to train LLMs.
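|
| A minimal sketch of that masking approach (plain PyTorch, not
| AICI): disallowed tokens get their logits set to -inf before
| the loss, so training only compares the target against tokens
| the constraint permits.
|
|     import torch
|     import torch.nn.functional as F
|
|     def constrained_loss(logits, targets, allowed):
|         # logits: (batch, seq, vocab); targets: (batch, seq)
|         # allowed: bool mask (batch, seq, vocab), True = legal token
|         masked = logits.masked_fill(~allowed, float("-inf"))
|         return F.cross_entropy(
|             masked.view(-1, masked.size(-1)), targets.view(-1))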
|
| Relevant papers:
|
| https://github.com/epfl-dlab/transformers-CFG
|
| https://neurips.cc/virtual/2023/poster/70782
___________________________________________________________________
(page generated 2024-03-11 23:00 UTC)