[HN Gopher] AI agents but they're working in big tech
       ___________________________________________________________________
        
       AI agents but they're working in big tech
        
       Author : alsima
       Score  : 55 points
       Date   : 2024-08-06 19:48 UTC (3 hours ago)
        
 (HTM) web link (alexsima.substack.com)
 (TXT) w3m dump (alexsima.substack.com)
        
       | alsima wrote:
       | If we structured AI agents like big tech org charts, which
       | company structures would perform better? Inspired by James
       | Huckle's thoughts on how organizational structures impact
       | software design, I decided to put this to the test:
       | https://bit.ly/ai-corp-agents.
        
       | simonw wrote:
       | "Research shows that AI systems with 30+ agents out-performs a
       | simple LLM call in practically any task (see More Agents Is All
       | You Need), reducing hallucinations and improving accuracy."
       | 
       | Has anyone heard of that actually playing out practically in
       | real-world applications? This article links to the paper about it
       | - https://arxiv.org/abs/2402.05120 - but I've not heard from
       | anyone who's implementing production systems successfully in that
       | way.
       | 
       | (I still don't actually know what an "agent" is, to be honest.
       | I'm pretty sure there are dozens of conflicting definitions
       | floating around out there by now.)
        
         | exe34 wrote:
         | I bet they spend 95% of their time in agile refinement
         | meetings.
        
         | alsima wrote:
          | I would check out this company, Swarms
          | (https://github.com/kyegomez/swarms), which is working with
          | enterprises to integrate multi-agent systems. But definitely
          | a great point to focus on: the research paper mentions that
          | the performance gains shrink as task complexity grows, which
          | is definitely true for SWE.
        
           | simonw wrote:
           | I totally believe that people are selling solutions around
           | this idea, what I'd like to hear is genuine success stories
           | from people who have used them in production (and aren't
           | currently employed by a vendor).
        
             | alsima wrote:
              | Hmmm, I get what you mean... I think it's hard to sell a
              | solution around this idea, but I think it will become more
              | of a common practice/performance-improvement method. James
              | Huckle on LinkedIn (https://www.l
              | inkedin.com/feed/update/urn:li:activity:7214295...)
              | mentioned that agent communication would become something
              | more like a hyperparameter to tune, which I agree with.
        
             | kgdiem wrote:
             | Midjourney uses multiple agents for determining if a prompt
             | is appropriate or not.
             | 
             | I kinda did this, too.
             | 
             | I made a 3 agent system -- one is a router that parses the
             | request and determines where to send it (to the other 2)
             | one is a chat agent and the third is an image generator.
             | 
             | If the router determines an image is requested, the chat
             | agent is tasked with making a caption to go along with the
             | image.
             | 
             | It works well enough.
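The routing setup described above can be sketched roughly like this; the `llm_*` functions are hypothetical stubs standing in for the real model calls (a real router would itself be an LLM call, not a keyword check):

```python
# Rough sketch of the router / chat / image split described above.
# All names here are hypothetical stand-ins, not a real API.

def llm_route(request: str) -> str:
    """Router: decide which agent handles the request (stubbed
    with a keyword check instead of a model call)."""
    wants_image = any(w in request.lower() for w in ("draw", "picture", "image"))
    return "image" if wants_image else "chat"

def llm_chat(request: str) -> str:
    """Chat agent (stub)."""
    return f"chat reply to: {request}"

def llm_image(request: str) -> str:
    """Image generator (stub): returns a fake image handle."""
    return f"image for: {request}"

def handle(request: str) -> dict:
    if llm_route(request) == "image":
        # As described above: the chat agent writes a caption to
        # accompany the generated image.
        return {"image": llm_image(request), "caption": llm_chat(request)}
    return {"reply": llm_chat(request)}
```

Whether each of these counts as an "agent" or just a prompt is exactly the definitional question raised downthread.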
        
               | 42lux wrote:
               | Midjourney does not use an agent system, they use a
               | single call.
        
               | kgdiem wrote:
               | I remembered reading that they used > 1, found this
               | screengrab of discord on Reddit:
               | 
               | https://www.reddit.com/r/midjourney/comments/137bj1o/new_
               | upd...
        
               | 42lux wrote:
                | It's still one call; they just use a different LLM now.
        
               | simonw wrote:
               | Is that multiple agents, or just multiple prompts? Are
               | agents and prompts the same thing?
        
               | kgdiem wrote:
                | I am fairly certain that agents are the same thing as
                | prompts (but they could also be different _).
               | 
               | Only the chat prompt/agent/whatever is connected to RAG;
               | the image generator is DALLE and the router is a one-off
               | call each time.
               | 
                | _ eg, it could be the same model with a different prompt
                | or a different model + a different prompt altogether.
               | AFAIU it's just serving a different purpose than the
               | other calls
        
         | tk90 wrote:
         | "agent" is a buzzword. All it is, is a bunch of LLM calls in a
         | while loop.
         | 
          | - _'rag' is meaningless as a concept. imagine calling a web
          | app 'database augmented programming'. [1]_
          | 
          | - _'agent' probably just means 'run an llm in a loop' [1]_
         | 
         | [1] https://x.com/atroyn/status/1819396701217870102
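For what it's worth, the pattern the tweet is describing really is that small. A minimal sketch, with `call_llm` and `run_tool` as hypothetical stubs for a model API and tool execution:

```python
# "Run an LLM in a loop": the skeletal pattern behind most "agent"
# definitions. call_llm and run_tool are hypothetical stubs; a real
# system would hit a model API and execute whatever tool it names.

def call_llm(history: list) -> dict:
    # Stub model: once it has seen one tool result, it stops.
    if any(msg.startswith("tool:") for msg in history):
        return {"done": True, "answer": "final answer"}
    return {"done": False, "tool": "search", "args": "query"}

def run_tool(name: str, args: str) -> str:
    return f"result of {name}({args})"

def agent(task: str, max_steps: int = 5):
    history = [f"task: {task}"]
    for _ in range(max_steps):          # the loop
        step = call_llm(history)
        if step["done"]:
            return step["answer"]
        history.append("tool: " + run_tool(step["tool"], step["args"]))
    return None                         # gave up
```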
        
           | alsima wrote:
            | Honestly, agree. When I first started working with agents I
            | didn't fully understand what they really were either, but I
            | eventually settled on a definition: an LLM call that
            | performs a unique function proactively ¯\_(ツ)_/¯
        
         | Imnimo wrote:
         | I don't even buy that the linked paper justifies the claim. All
         | the paper does is draw multiple samples from an LLM and take
         | the majority vote. They do try integrating their majority vote
         | algorithm with an existing multi-agent system, but it usually
         | performs _worse_ than just straightforwardly asking the model
          | multiple times (see Table 3). I don't understand how the
         | author of this article can make that claim, nor why they
         | believe the linked paper supports it. It does, however, support
         | my prior that LLM agents are snake oil.
        
           | alsima wrote:
           | Looking at Table 3: "Our [sampling and voting] method
           | outperforms other methods used standalone in most cases and
           | always enhances other methods across various tasks and LLMs",
           | which benchmark did the majority vote algorithm perform worse
           | in?
        
             | Imnimo wrote:
             | I'm not saying that sampling and majority voting performed
             | worse. I'm saying that multi-agent interaction (labeled
             | Debate and Reflection) performed worse than straightforward
             | approaches that just query multiple times. For example, the
              | Debate method combined with their voting mechanism gets
              | 0.48 on GSM8K with Llama2-13B. But majority voting with no
             | multi-agent component (Table 2) gets 0.59 on the same
             | setting. And majority voting with Chain-of-thought (Table 3
             | CoT/ZS-CoT) does even better.
             | 
             | Fundamentally, drawing multiple independent samples from an
             | LLM is _not_ "AI Agents". The only rows on those tables
             | that are AI Agents are Debate and Reflection. They provide
             | marginal improvements on only a few task/model
             | combinations, and do so at a hefty increase in
             | computational cost. In many tasks, they are significantly
             | behind simpler and cheaper alternatives.
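To make the distinction concrete, here is a minimal sketch of the paper's sampling-and-voting baseline (no debate, no reflection, just independent samples plus a majority vote). `sample_llm` is a deterministic stub standing in for repeated model queries:

```python
# Sampling and voting: draw n independent samples from the same
# model and take the most common answer. sample_llm is a stub that
# is right on 3 of every 5 calls and wrong in scattered ways.
from collections import Counter

def sample_llm(question: str, i: int) -> str:
    return "42" if i % 5 < 3 else str(i % 5)

def majority_vote(question: str, n: int) -> str:
    """Return the majority answer over n independent samples."""
    samples = [sample_llm(question, i) for i in range(n)]
    return Counter(samples).most_common(1)[0][0]
```

The vote converges on the right answer because errors are spread across different wrong answers, which is the whole mechanism; there is no inter-agent interaction anywhere in it.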
        
               | alsima wrote:
               | I see. Agree with the point about marginal improvements
               | at a hefty increase in computational cost (I touch on
               | this a bit at the end of the blog post where I mention
               | that better performance requires better tooling/base
               | models). Though I would still consider sampling and
               | voting a "multi-agent" framework as it's still performing
               | an aggregation over multiple results.
        
         | memhole wrote:
         | Thanks for this paper! Still early so not quite production, but
         | I've seen positive results on tasks scaling the number of
         | "agents". I more or less think of agents as a task focused
         | prompt and these agents are slight variations of the task. It
         | makes me think I'm running some kind of Monte Carlo simulation.
        
         | fmbb wrote:
         | Thirty recursive loops of LLMs perform better than one prompt?
         | I should hope so!
         | 
         | But how much power does it need?
        
           | alsima wrote:
           | A lot...as you might imagine the costs of running the whole
           | organization scale immensely.
        
       | aantix wrote:
       | Is there an agent framework that lives up to the hype?
       | 
       | Where you specify a top-level objective, it plans out those
       | objectives, it selects a completion metric so that it knows when
       | to finish, and iterates/reiterates over the output until
       | completion?
        
         | simonw wrote:
         | To this date, ChatGPT Code Interpreter is still the most
         | impressive implementation of this pattern that I've seen.
         | 
         | Give it a task, it writes code, runs the code, gets errors,
         | fixes bugs, tries again generally until it succeeds.
         | 
         | That's over a year old at this point, and it's not clear to me
         | if it counts as an "agent" by many people's definitions (which
         | are often frustratingly vague).
        
         | alsima wrote:
          | Well, there's Cognition AI, whose Devin recently made them a
          | unicorn startup (partnerships with Microsoft and such), but
          | true, I can't think of an agent that actually lives up to the
          | hype (I heard Devin wasn't great).
        
         | danenania wrote:
         | > Where you specify a top-level objective, it plans out those
         | objectives, it selects a completion metric so that it knows
         | when to finish, and iterates/reiterates over the output until
         | completion?
         | 
         | I built Plandex[1], which works roughly like this. The goal (so
         | far) is not to take you from an initial prompt to a 100%
         | working solution in one go, but to provide tools that help you
         | iterate your way to a 90-95% solution. You can then fill in the
         | gaps yourself.
         | 
         | I think the idea of a fully autonomous AI engineer is currently
         | mostly hype. Making that the target is good for marketing, but
         | in practice it leads to lots of useless tire-spinning and
         | wasted tokens. It's not a good idea, for example, to have the
          | LLM try to debug its own output by default. It _might_, on a
         | case-by-case basis, be a good idea to feed an error back to the
         | LLM, but just as often it will be faster for the developer to
         | do the debugging themselves.
         | 
         | 1 - https://plandex.ai
        
       | fancyfredbot wrote:
       | Can I just ask whether other people think that "agentic" is a
       | word?
       | 
        | As far as I can tell it's not in the OED or Merriam-Webster
        | dictionaries. But recently everyone's been using it, so perhaps
        | it soon will be.
        
         | alsima wrote:
          | It's the new Cerebral Valley slang, dude
        
         | sandspar wrote:
         | "Agentic" is a term of art from psychology that's diffused into
         | common usage. It dates back to the 1970s, primarily associated
         | with Albert Bandura, the guy behind the Bobo doll experiment.
         | 
         | From ChatGPT: Other examples include "heuristic," "cognitive
         | dissonance," "meta-cognition," "self-actualization," "self-
         | efficacy," "locus of control," and "archetype."
        
           | fancyfredbot wrote:
           | Thanks. Interesting! I see "Agentic state" is one where an
           | "individual perceives themselves as an agent of the authority
           | figure and is willing to carry out their commands, even if it
           | goes against their own moral code". That's ironic as most
           | LLMs have such strong safety training that it's almost
           | impossible to get them to enter such a state.
        
             | connicpu wrote:
             | In the case of LLMs, I think their training is the supreme
             | authority.
        
         | hexator wrote:
         | That's how words form
        
         | candiddevmike wrote:
         | My brain wants the word to be "authentic" or "adriatic".
        
         | raybb wrote:
         | Seems to have been for a while.
         | 
         | https://en.wiktionary.org/wiki/agentic
        
       | henning wrote:
       | - Big tech is very different from open source
       | 
        | - The original SWE-bench paper consists only of solved issues,
        | while a big part of open source is triage, follow-up,
        | clarification, and dealing with crappy issues
       | 
       | - Saying "<Technique> is all you need" when you are increasing
       | your energy usage 30-fold just to fail > 50% of the time is
       | intellectually dishonest
        
         | alsima wrote:
          | Definitely not saying multi-agents are all you need for SWE-
          | bench, haha. I touch on this at the end of the blog post,
          | where I mention that jumps in progress require better base
          | models or tooling.
        
       | det2x wrote:
       | It's interesting how long the word "agents"/"intelligent agents"
       | have been around for and how long they've been hyped up for. If
       | you go back to the 80s and 90s you will see how Microsoft was
       | hyping up "intelligent agents" in Windows but nothing ever became
       | of it[1].
       | 
       | I have yet to see an actual useful usecase for agents despite the
       | countless posts asking for examples nobody has provided one.
       | 
       | [1] https://www.wired.com/1995/09/future-forward/
        
         | sroussey wrote:
         | Or CORBA based intelligent agents in the 1990s
        
       ___________________________________________________________________
       (page generated 2024-08-06 23:00 UTC)