[HN Gopher] Octopus v2: On-device language model for super agent
       ___________________________________________________________________
        
       Octopus v2: On-device language model for super agent
        
       Author : lawrencechen
       Score  : 78 points
       Date   : 2024-04-03 05:12 UTC (17 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | wanderingmind wrote:
        | I'm going to start commenting on arXiv paper links with the same
       | request.
       | 
       | 1. Show me the data
       | 
       | 2. Show me the code
       | 
       | 3. Show me the model
       | 
        | If we can't play with it and modify it easily, it doesn't belong
        | on HN.
        
         | mirekrusin wrote:
         | https://huggingface.co/NexaAIDev/Octopus-v2
        
         | smcleod wrote:
          | Yeah, I've got to agree with this. Having a link to the paper is
          | useful, but not that interesting without a demo and the source
          | code. It doesn't help that arXiv has a pretty horrible interface
          | for anyone other than people writing papers.
        
       | CGamesPlay wrote:
        | So, I guess it's a LoRA for function calls. Makes sense that this
       | would work well, and bodes well for creating really cheap request
       | routers in more advanced cloud-based situations.
        
         | iandanforth wrote:
          | It's not. They do train one version with LoRA but also train
         | three variants without.
        
       | gardnr wrote:
       | > To mitigate such errors, we propose designating functions as
       | unique functional tokens.
       | 
        | I just skimmed the paper, but this seems to be the crux of it.
        | They map each function to a single token and can then fine-tune
        | models to use the token instead of the function name. This
        | increases the accuracy of smaller LLMs and reduces the total
        | number of tokens required for prompts and generations, which is
        | where their speed gains come from.
       | 
       | The paper is worth a look just to see "Figure (2)"
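        | 
        | A minimal sketch of the mapping (my own illustration, not the
        | paper's code), using the Hugging Face tokenizer API and made-up
        | function names:
        | 
        |     from transformers import AutoTokenizer, AutoModelForCausalLM
        | 
        |     base_id = "google/gemma-2b"  # base model used in the paper
        |     tokenizer = AutoTokenizer.from_pretrained(base_id)
        |     model = AutoModelForCausalLM.from_pretrained(base_id)
        | 
        |     # Hypothetical API surface: one dedicated token per function.
        |     functions = ["take_photo", "set_alarm", "send_text"]
        |     func_tokens = [f"<func_{i}>" for i in range(len(functions))]
        |     tokenizer.add_tokens(func_tokens, special_tokens=True)
        |     model.resize_token_embeddings(len(tokenizer))
        | 
        |     # Fine-tuning targets then use the token in place of the
        |     # name, e.g. "Take a selfie" -> "<func_0>(camera='front')"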
        
         | alwa wrote:
         | Figure 2 is incredible.
         | 
         | With only passing familiarity with the norms in this kind of
         | work, the accuracy rates of all models on this benchmark suite
         | seem suspiciously (and uniformly) high. Is choosing the right
         | intention among "20 vehicle functions" or "20 Android APIs"
         | consistent with an ordinary level of ambition in this kind of
         | research these days?
        
         | jerpint wrote:
          | That's pretty clever, encoding atomic concepts as single tokens
        
       | iandanforth wrote:
        | They might even get higher accuracies with a dedicated
        | classification layer. By using the existing vocabulary they are
        | spreading the probability mass across a _much_ larger space. If
        | they stuck to N options, where N is the total number of functions
        | available to the model, I suspect they could get to 100% accuracy.
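        | 
        | As a rough sketch (mine, not theirs), restricting decoding to
        | the N functional-token ids would amount to masking the
        | next-token logits:
        | 
        |     import torch
        | 
        |     def pick_function(logits, func_token_ids):
        |         # logits: next-token logits over the full vocabulary
        |         # func_token_ids: ids of the N functional tokens
        |         mask = torch.full_like(logits, float("-inf"))
        |         mask[func_token_ids] = 0.0  # keep only the N candidates
        |         return int((logits + mask).argmax())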
       | 
       | It's also not clear whether there is sufficient ambiguity in the
       | test data for this to be a generalizable model. The difficulty
       | with "intent recognition" (which they don't mention but is what
       | this problem is called for agents like Siri) is that human
        | generated inputs vary widely and are often badly formed. If they
        | haven't done extensive evaluation with human users, and/or
        | they've constrained the functions to be quite distinct, then they
        | aren't yet tackling a hard problem; they've just got a complex
        | setting.
        
       | vessenes wrote:
       | Short summary of the paper:
       | 
       | Take Gemma-2B. Take your API. Use ChatGPT-3.5 to generate 1,000
       | "correct" API function call responses by dint of placing only
       | your API calls in the pre-prompt, then prompting it. I imagine
       | they use ChatGPT to create the request language as well. Then
       | make 1,000 "incorrect" API call responses by filling the pre-
       | prompt with functions not from your API.
       | 
       | Finetune.
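        | 
        | Roughly, I'd imagine the data-generation step looks something
        | like this (my sketch, not theirs; the prompt wording and the
        | function list passed in are made up):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        | 
        |     def make_example(api_functions, query):
        |         # "Correct" examples put your real API in the system
        |         # prompt; "incorrect" ones swap in unrelated functions.
        |         system = ("Answer by calling exactly one of these "
        |                   "functions:\n" + api_functions)
        |         resp = client.chat.completions.create(
        |             model="gpt-3.5-turbo",
        |             messages=[
        |                 {"role": "system", "content": system},
        |                 {"role": "user", "content": query},
        |             ],
        |         )
        |         return resp.choices[0].message.content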
       | 
        | Note that they use "functional tokens" in training - they convert
        | each function to a particular, previously unused token, and
       | refer to it that way. They claim this speeds up inference (I'm
       | sure it does). They don't make any claims as to whether or not it
       | changes their accuracy (I bet that it does). It definitely makes
       | the system more fragile / harder to train for large and very
       | large APIs.
       | 
        | Outcome: a highly capable _single API_ function call LLM. They
        | say you could do it with as few as 100 training inputs if you
        | really wanted.
       | 
       | I think this is interesting, but not world-shattering. I could
       | imagine building a nice little service company on it, basically
       | just "send us a git repo and you'll get a helpful function call
       | API for this version of your code which you can hook up to an API
       | endpoint / chatbot".
       | 
        | Limitations are going to be largely around Gemma-2B's skills -- a
        | 2B model isn't super sophisticated. And you can see they specify
        | "<30 tokens" for the prompt. But I imagine this could be trained
        | quickly enough that it could be part of a release CI process.
        | There are a number of libraries I use for which I would like to
        | have such a model.
       | 
       | I'd be interested in something that has general knowledge of a
       | large set of packages for a language, and could pull in /
       | finetune / MoE little models for specific repositories I'm coding
        | on. Right now I would rely on either a very large model and hope
        | its knowledge cutoff is right (Claude/GPT-4), or use up a lot of a
        | large context window. There might be some Goldilocks version in
       | the middle here which would be helpful in a larger codebase but
       | be faster and more accurate than the cloud monopoly providers.
        
         | saltsaman wrote:
          | I can see people training LoRAs this way, which allows for
         | multiple API function calls
        
           | vessenes wrote:
           | Yeah, good idea! I'm not sure if you would be successful
           | mixing LoRA + functional tokens. If you could, that would be
            | great. Then you could ship very light LoRA packs with
           | repositories.
           | 
           | Their LoRA training was _I think_ against their finetuned
           | model, not Gemma-2B directly. But, seems worth playing with
           | -- could be super useful.
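            | 
            | Something like this is the shape I have in mind (sketch
            | only; the adapter names are made up, and it assumes each
            | repo ships a PEFT LoRA adapter trained on the same base
            | model):
            | 
            |     from transformers import AutoModelForCausalLM
            |     from peft import PeftModel
            | 
            |     base = AutoModelForCausalLM.from_pretrained(
            |         "google/gemma-2b")
            | 
            |     # One adapter per repo; switch between them as needed.
            |     model = PeftModel.from_pretrained(
            |         base, "repo-a-lora", adapter_name="repo_a")
            |     model.load_adapter("repo-b-lora", adapter_name="repo_b")
            |     model.set_adapter("repo_a")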
        
       | turnsout wrote:
       | This is the frontier--tiny, specialized models like this and
       | ReALM [0], coupled to the application logic and able to run on-
       | device.
       | 
       | Eventually devices will be powerful enough to run more general
       | purpose models locally, but for high-frequency user tasks with a
       | low tolerance for error, smaller specialized models may always
       | win.
       | 
       | [0]: https://arxiv.org/abs/2403.20329
        
       | mikece wrote:
       | "What is better than one recipe for Octopus?"
       | 
       | I can't be the only person who heard that line in their head
       | instantly when reading that headline.
        
       ___________________________________________________________________
       (page generated 2024-04-03 23:02 UTC)