[HN Gopher] AI agents but they're working in big tech
___________________________________________________________________
AI agents but they're working in big tech
Author : alsima
Score : 55 points
Date : 2024-08-06 19:48 UTC (3 hours ago)
(HTM) web link (alexsima.substack.com)
(TXT) w3m dump (alexsima.substack.com)
| alsima wrote:
| If we structured AI agents like big tech org charts, which
| company structures would perform better? Inspired by James
| Huckle's thoughts on how organizational structures impact
| software design, I decided to put this to the test:
| https://bit.ly/ai-corp-agents.
| simonw wrote:
| "Research shows that AI systems with 30+ agents out-performs a
| simple LLM call in practically any task (see More Agents Is All
| You Need), reducing hallucinations and improving accuracy."
|
| Has anyone heard of that actually playing out practically in
| real-world applications? This article links to the paper about it
| - https://arxiv.org/abs/2402.05120 - but I've not heard from
| anyone who's implementing production systems successfully in that
| way.
|
| (I still don't actually know what an "agent" is, to be honest.
| I'm pretty sure there are dozens of conflicting definitions
| floating around out there by now.)
| exe34 wrote:
| I bet they spend 95% of their time in agile refinement
| meetings.
| alsima wrote:
| I would check out Swarms (https://github.com/kyegomez/swarms),
| a company that's working with enterprises to integrate
| multi-agent systems. But it's definitely a good point to focus
| on: the research paper mentions that the performance gains from
| scaling shrink as task complexity grows, which is definitely
| true for SWE.
| simonw wrote:
| I totally believe that people are selling solutions around
| this idea, what I'd like to hear is genuine success stories
| from people who have used them in production (and aren't
| currently employed by a vendor).
| alsima wrote:
| Hmmm, I get what you mean... I think it's hard to sell a
| solution around this idea, but I think it will become something
| more like a common practice/performance-improvement method.
| James Huckle on LinkedIn (https://www.linkedin.com/feed/update/urn:li:activity:7214295...)
| mentioned that agent communication would become something more
| like a hyperparameter to tune, which I agree with.
| kgdiem wrote:
| Midjourney uses multiple agents for determining if a prompt
| is appropriate or not.
|
| I kinda did this, too.
|
| I made a 3-agent system -- one is a router that parses the
| request and determines where to send it (to the other two),
| one is a chat agent, and the third is an image generator.
|
| If the router determines an image is requested, the chat
| agent is tasked with making a caption to go along with the
| image.
|
| It works well enough.
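|
| Roughly like this, if anyone's curious (a minimal sketch; the
| model names and the one-word routing protocol here are just
| illustrative, not my actual implementation):
|
|   import openai  # assumes the OpenAI Python SDK and an API key
|
|   client = openai.OpenAI()
|
|   def route(request: str) -> str:
|       # Router agent: classify the request as "chat" or "image".
|       resp = client.chat.completions.create(
|           model="gpt-4o-mini",
|           messages=[
|               {"role": "system",
|                "content": "Reply with one word: 'chat' or 'image'."},
|               {"role": "user", "content": request},
|           ],
|       )
|       return resp.choices[0].message.content.strip().lower()
|
|   def chat_agent(prompt: str) -> str:
|       # Chat agent: plain completion (RAG omitted for brevity).
|       resp = client.chat.completions.create(
|           model="gpt-4o",
|           messages=[{"role": "user", "content": prompt}],
|       )
|       return resp.choices[0].message.content
|
|   def handle(request: str):
|       if route(request) == "image":
|           # Image agent: generate, then have the chat agent caption it.
|           image = client.images.generate(model="dall-e-3", prompt=request)
|           caption = chat_agent(f"Write a short caption for: {request}")
|           return image.data[0].url, caption
|       return chat_agent(request), None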
| 42lux wrote:
| Midjourney does not use an agent system, they use a
| single call.
| kgdiem wrote:
| I remembered reading that they used > 1, found this
| screengrab of discord on Reddit:
|
| https://www.reddit.com/r/midjourney/comments/137bj1o/new_upd...
| 42lux wrote:
| It's still one call; they just use a different LLM now.
| simonw wrote:
| Is that multiple agents, or just multiple prompts? Are
| agents and prompts the same thing?
| kgdiem wrote:
| I am fairly certain that agents are the same thing as
| prompts (but could also be different*).
|
| Only the chat prompt/agent/whatever is connected to RAG;
| the image generator is DALLE and the router is a one-off
| call each time.
|
| * e.g., it could be the same model with a different prompt
| or a different model + different prompt altogether. AFAIU
| it's just serving a different purpose than the other calls.
| tk90 wrote:
| "agent" is a buzzword. All it is, is a bunch of LLM calls in a
| while loop.
|
| - _'rag' is meaningless as a concept. imagine calling a web
| app 'database augmented programming'. [1]_
|
| - _'agent' probably just means 'run an llm in a loop' [1]_
|
| [1] https://x.com/atroyn/status/1819396701217870102
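|
| A minimal sketch of that loop (the TOOL:/FINAL: protocol and
| the stub tool are made up purely for illustration):
|
|   import openai
|
|   client = openai.OpenAI()
|   TOOLS = {"search": lambda q: f"stub results for {q!r}"}
|
|   SYSTEM = ("Answer the task. To use a tool, reply "
|             "'TOOL: search <query>'. When done, reply "
|             "'FINAL: <answer>'.")
|
|   def agent(task: str, max_steps: int = 10) -> str:
|       history = [{"role": "system", "content": SYSTEM},
|                  {"role": "user", "content": task}]
|       for _ in range(max_steps):  # the while loop in question
|           reply = client.chat.completions.create(
|               model="gpt-4o", messages=history,
|           ).choices[0].message.content.strip()
|           history.append({"role": "assistant", "content": reply})
|           if reply.startswith("FINAL:"):
|               return reply.removeprefix("FINAL:").strip()
|           if reply.startswith("TOOL:"):
|               # Run the requested tool and feed the result back in.
|               name, arg = reply.removeprefix("TOOL:").strip().split(" ", 1)
|               history.append({"role": "user",
|                               "content": f"Tool result: {TOOLS[name](arg)}"})
|       return "gave up after max_steps"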
| alsima wrote:
| Honestly, agree. When I first started working with agents I
| didn't fully understand what they really were either, but I
| eventually settled on a definition: an LLM call that performs
| a unique function proactively ¯\_(ツ)_/¯
| Imnimo wrote:
| I don't even buy that the linked paper justifies the claim. All
| the paper does is draw multiple samples from an LLM and take
| the majority vote. They do try integrating their majority vote
| algorithm with an existing multi-agent system, but it usually
| performs _worse_ than just straightforwardly asking the model
| multiple times (see Table 3). I don't understand how the
| author of this article can make that claim, nor why they
| believe the linked paper supports it. It does, however, support
| my prior that LLM agents are snake oil.
| alsima wrote:
| Looking at Table 3: "Our [sampling and voting] method
| outperforms other methods used standalone in most cases and
| always enhances other methods across various tasks and LLMs",
| which benchmark did the majority vote algorithm perform worse
| in?
| Imnimo wrote:
| I'm not saying that sampling and majority voting performed
| worse. I'm saying that multi-agent interaction (labeled
| Debate and Reflection) performed worse than straightforward
| approaches that just query multiple times. For example, the
| Debate method combined with their voting mechanism gets
| 0.48 GSM8K with Llama2-13B. But majority voting with no
| multi-agent component (Table 2) gets 0.59 on the same
| setting. And majority voting with Chain-of-thought (Table 3
| CoT/ZS-CoT) does even better.
|
| Fundamentally, drawing multiple independent samples from an
| LLM is _not_ "AI Agents". The only rows on those tables
| that are AI Agents are Debate and Reflection. They provide
| marginal improvements on only a few task/model
| combinations, and do so at a hefty increase in
| computational cost. In many tasks, they are significantly
| behind simpler and cheaper alternatives.
| alsima wrote:
| I see. I agree with the point about marginal improvements
| at a hefty increase in computational cost (I touch on this
| a bit at the end of the blog post, where I mention that
| better performance requires better tooling/base models).
| Though I would still consider sampling and voting a
| "multi-agent" framework, as it still performs an
| aggregation over multiple results.
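|
| For reference, the paper's sampling-and-voting step is roughly
| this much code (a minimal sketch; the model name, n, and
| exact-match voting are my simplifications -- the paper votes
| by answer similarity for open-ended tasks):
|
|   from collections import Counter
|
|   import openai
|
|   client = openai.OpenAI()
|
|   def majority_vote(question: str, n: int = 30) -> str:
|       # Draw n independent samples at nonzero temperature...
|       answers = [
|           client.chat.completions.create(
|               model="gpt-4o-mini",
|               messages=[{"role": "user", "content": question}],
|               temperature=0.7,
|           ).choices[0].message.content.strip()
|           for _ in range(n)
|       ]
|       # ...then return the most common answer.
|       return Counter(answers).most_common(1)[0][0]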
| memhole wrote:
| Thanks for this paper! Still early, so not quite production, but
| I've seen positive results on tasks from scaling the number of
| "agents". I more or less think of an agent as a task-focused
| prompt, with the agents being slight variations on the task. It
| makes me think I'm running some kind of Monte Carlo simulation.
| fmbb wrote:
| Thirty recursive loops of LLMs perform better than one prompt?
| I should hope so!
|
| But how much power does it need?
| alsima wrote:
| A lot... as you might imagine, the costs of running the
| whole organization scale immensely.
| aantix wrote:
| Is there an agent framework that lives up to the hype?
|
| Where you specify a top-level objective, it plans out those
| objectives, it selects a completion metric so that it knows when
| to finish, and iterates/reiterates over the output until
| completion?
| simonw wrote:
| To this date, ChatGPT Code Interpreter is still the most
| impressive implementation of this pattern that I've seen.
|
| Give it a task, it writes code, runs the code, gets errors,
| fixes bugs, tries again generally until it succeeds.
|
| That's over a year old at this point, and it's not clear to me
| if it counts as an "agent" by many people's definitions (which
| are often frustratingly vague).
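|
| The core loop is easy to sketch, at least (illustrative only --
| this is not OpenAI's implementation, and running model-written
| code like this needs real sandboxing):
|
|   import subprocess
|
|   import openai
|
|   client = openai.OpenAI()
|
|   def solve(task: str, max_attempts: int = 5) -> str:
|       feedback = ""
|       for _ in range(max_attempts):
|           # Ask the model for code, including any prior error.
|           code = client.chat.completions.create(
|               model="gpt-4o",
|               messages=[{"role": "user", "content":
|                   f"Write a Python script that does: {task}\n"
|                   f"{feedback}\nReply with only the code."}],
|           ).choices[0].message.content.strip()
|           if code.startswith("```"):  # strip a markdown fence if present
|               code = code.split("\n", 1)[1].rsplit("```", 1)[0]
|           run = subprocess.run(["python", "-c", code],
|                                capture_output=True, text=True,
|                                timeout=60)
|           if run.returncode == 0:
|               return run.stdout  # it worked; hand back the output
|           # Otherwise feed the error back and try again.
|           feedback = f"Your last attempt failed with:\n{run.stderr}"
|       raise RuntimeError("no working attempt")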
| alsima wrote:
| Well, there's Cognition AI with Devin, which recently became
| a unicorn startup (partnerships with Microsoft and such), but
| true, I can't think of an agent that actually lives up to the
| hype (I heard Devin wasn't great).
| danenania wrote:
| > Where you specify a top-level objective, it plans out those
| objectives, it selects a completion metric so that it knows
| when to finish, and iterates/reiterates over the output until
| completion?
|
| I built Plandex[1], which works roughly like this. The goal (so
| far) is not to take you from an initial prompt to a 100%
| working solution in one go, but to provide tools that help you
| iterate your way to a 90-95% solution. You can then fill in the
| gaps yourself.
|
| I think the idea of a fully autonomous AI engineer is currently
| mostly hype. Making that the target is good for marketing, but
| in practice it leads to lots of useless tire-spinning and
| wasted tokens. It's not a good idea, for example, to have the
| LLM try to debug its own output by default. It _might_, on a
| case-by-case basis, be a good idea to feed an error back to the
| LLM, but just as often it will be faster for the developer to
| do the debugging themselves.
|
| 1 - https://plandex.ai
| fancyfredbot wrote:
| Can I just ask whether other people think that "agentic" is a
| word?
|
| As far as I can tell it's not in the OED or the Merriam-Webster
| dictionary. But recently everyone's using it, so perhaps it
| soon will be.
| alsima wrote:
| It's the new Cerebral Valley slang, dude
| sandspar wrote:
| "Agentic" is a term of art from psychology that's diffused into
| common usage. It dates back to the 1970s, primarily associated
| with Albert Bandura, the guy behind the Bobo doll experiment.
|
| From ChatGPT: Other examples include "heuristic," "cognitive
| dissonance," "meta-cognition," "self-actualization," "self-
| efficacy," "locus of control," and "archetype."
| fancyfredbot wrote:
| Thanks. Interesting! I see "Agentic state" is one where an
| "individual perceives themselves as an agent of the authority
| figure and is willing to carry out their commands, even if it
| goes against their own moral code". That's ironic as most
| LLMs have such strong safety training that it's almost
| impossible to get them to enter such a state.
| connicpu wrote:
| In the case of LLMs, I think their training is the supreme
| authority.
| hexator wrote:
| That's how words form
| candiddevmike wrote:
| My brain wants the word to be "authentic" or "adriatic".
| raybb wrote:
| Seems to have been for a while.
|
| https://en.wiktionary.org/wiki/agentic
| henning wrote:
| - Big tech is very different from open source
|
| - The original SWE-bench paper only consists of solved issues,
| while a big part of open source is triage, follow-up,
| clarification, and dealing with crappy issues
|
| - Saying "<Technique> is all you need" when you are increasing
| your energy usage 30-fold just to fail > 50% of the time is
| intellectually dishonest
| alsima wrote:
| Definitely not saying multi-agents is all you need for SWE-
| bench haha. I touch on this at the end of the blog post, where
| I mention jumps in progress require better base models or
| tooling.
| det2x wrote:
| It's interesting how long the terms "agents"/"intelligent
| agents" have been around and how long they've been hyped. If
| you go back to the 80s and 90s you will see how Microsoft was
| hyping up "intelligent agents" in Windows, but nothing ever
| came of it[1].
|
| I have yet to see an actually useful use case for agents;
| despite the countless posts asking for examples, nobody has
| provided one.
|
| [1] https://www.wired.com/1995/09/future-forward/
| sroussey wrote:
| Or CORBA-based intelligent agents in the 1990s.
___________________________________________________________________
(page generated 2024-08-06 23:00 UTC)