[HN Gopher] DynaSaur: Large Language Agents Beyond Predefined Ac...
       ___________________________________________________________________
        
       DynaSaur: Large Language Agents Beyond Predefined Actions
        
       Author : surprisetalk
       Score  : 118 points
       Date   : 2024-12-01 05:21 UTC (17 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | golol wrote:
       | I don't like the way LLM papers are written. LLMs receive inputs
       | and produce outputs that are best represented as plaintext with
       | some special characters. Simply showing a few examples of the
       | agent's core LLM text continuation job would explain the
        | architecture much better than figures. I can't help but feel
        | that the authors who do this are intentionally obfuscating
        | things.
        
         | exe34 wrote:
          | It's a delaying tactic, isn't it? If you're working on
          | version 2, you don't want version 1 to be so obvious that
          | somebody might scoop your version 2. I just wish the
          | reviewers would clamp down on this kind of thing.
        
         | ukuina wrote:
         | I suspect they do this to give the paper more weight than a
         | mere prompt deserves. As an example:
         | 
          | > Given a task u \in \mathcal{U} and a human-designed
          | action set \mathcal{A}_u with R \in \mathcal{A}_u, at time
          | step t, we sample a thought-action pair (h_t, a_t) \sim
          | \pi_\theta(a_t \mid \mathcal{A}_u, u, c_{t-1}) following
          | the ReAct framework (Yao et al., 2023b). Here, c_{t-1} =
          | \{(h_1, a_1, o_1), \dots, (h_{t-1}, a_{t-1}, o_{t-1})\}
          | represents the interaction history up to time t-1. The
          | action a_t is executed, and an observation o_t is returned
          | from the environment, updating the context to c_t =
          | c_{t-1} \cup \{(h_t, a_t, o_t)\}. If a_t contains a new
          | function not present in \mathcal{A}_{t-1}^g, we update the
          | generated action set by setting \mathcal{A}_t^g =
          | \mathcal{A}_{t-1}^g \cup f(a_t), where f(a_t) denotes the
          | set of functions defined in action a_t.
         | 
         | This is a roundabout way to say: "We pick an action based on
         | what's happened so far, do it, see the result, and update the
         | history. If it's something new, we add it to the list of
         | actions we can use."
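          | 
          | As a minimal Python sketch of that loop (my own
          | illustration, not the authors' code; propose, execute and
          | extract_fns are hypothetical stand-ins for the LLM call,
          | the sandbox, and the parser that pulls new function
          | definitions out of an action):
          | 
          |     def run_agent(task, actions, propose, execute,
          |                   extract_fns, max_steps=20):
          |         # actions is a set of available tools; propose,
          |         # execute and extract_fns are caller-supplied
          |         history = []  # (thought, action, observation)
          |         for _ in range(max_steps):
          |             thought, action = propose(task, actions, history)
          |             observation = execute(action)
          |             history.append((thought, action, observation))
          |             # new functions defined by the action join the
          |             # action set for subsequent steps
          |             actions = actions | extract_fns(action)
          |         return history, actions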
        
           | MattPalmer1086 wrote:
           | Hah, I had a paper published this year. My co-authors are
           | academics but I am not. Honestly, I couldn't understand the
           | first version of the paper we wrote, despite inventing the
           | algorithm it described!
           | 
            | There is definitely a certain language and a precise
            | mathematical style needed to pass review for academic
            | papers. It isn't nonsense, but it does obfuscate
            | otherwise obvious meanings.
        
           | NitpickLawyer wrote:
           | The funny thing is that we're getting closer to being able to
           | give that paragraph to an LLM and have it spit out the
           | simpler explanation.
           | 
            | This is what ChatGPT gave me for the prompt "can you
            | explain this in two sentences". It's pretty close to
            | what you wrote.
           | 
           | > The system follows the ReAct framework to decide on a
           | thought and action at each step based on the task, available
           | actions, and interaction history, updating its context with
           | the results of the action. If the action introduces new
           | functions, the system expands its action set to include these
           | new capabilities.
        
         | lbeurerkellner wrote:
         | Typically these logs are available, but hard to read or just
         | dumped in some JSON file.
         | 
          | However, there have been efforts like
          | https://explorer.invariantlabs.ai/benchmarks/ that try to
          | make agents more transparent in that way (by showing
          | interaction logs).
        
         | PunchTornado wrote:
          | I agree, and I view it as intellectual imposture. Instead
          | of saying something really simple that can give good
          | results, you obfuscate it a lot to make it sound more
          | intelligent. Reviewers shouldn't accept these kinds of
          | papers, and I think we need a Sokal moment in AI research.
        
       | throwup238 wrote:
        | It looks like the key insight here is to have the LLM
        | generate its own tools (as in GPT/Claude tool calling) via
        | Python code generation, then use cosine-similarity RAG over
        | the tool descriptions and the current problem/step to select
        | which tools are available at each step, with recent history
        | used for error correction.
       | 
        | The agent starts with some human-created tooling, like a
        | tool to read the file system or to create another tool using
        | Python code, then starts accumulating custom Python
        | functions it wrote itself, with tool-calling metadata like
        | descriptions and input/output types. At each step, if it
        | doesn't find a relevant tool, it creates a new one.
        | Apparently this improves performance on complex tasks (as
        | measured on the GAIA benchmark), with diminishing returns on
        | simpler tasks.
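        | 
        | As a rough sketch of that retrieval step (my own
        | illustration; embed() stands in for whatever embedding model
        | is used, and none of these names come from the paper):
        | 
        |     import numpy as np
        | 
        |     def cosine(a, b):
        |         return float(np.dot(a, b) /
        |                      (np.linalg.norm(a) * np.linalg.norm(b)))
        | 
        |     def select_tools(step, tools, embed, k=8):
        |         # tools maps name -> description; return the k tools
        |         # whose descriptions best match the current step
        |         query = embed(step)
        |         scored = [(cosine(query, embed(desc)), name)
        |                   for name, desc in tools.items()]
        |         return [name for _, name in
        |                 sorted(scored, reverse=True)[:k]]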
        
         | IanCal wrote:
          | I played around with making these things before; it's a
          | fun exercise. Interesting to see that this may be where
          | things are heading.
         | 
         | My example was asking for a poem about the headlines (good
         | example of info they don't have, and something that's very hard
         | to do mechanically).
         | 
         | https://news.ycombinator.com/item?id=37015591
        
           | llm_trw wrote:
            | I ended up training a BERT on nothing but Python for the
            | embedding search. The results were crap. Then I used an
            | LLM to write a new docstring for each class/function
            | definition in the training data, and the results were
            | better than state of the art.
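            | 
            | Roughly, the docstring step looked something like this
            | (my reconstruction, not the original code; describe()
            | stands in for the LLM call that writes the description):
            | 
            |     import ast
            | 
            |     def docstring_corpus(source, describe):
            |         # pair each function/class in the source with an
            |         # LLM-written description for the embedding search
            |         pairs = []
            |         for node in ast.walk(ast.parse(source)):
            |             if isinstance(node, (ast.FunctionDef,
            |                                  ast.ClassDef)):
            |                 code = ast.get_source_segment(source, node)
            |                 pairs.append((describe(code), code))
            |         return pairs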
           | 
           | There's so much wide open space to explore. It's a shame that
           | everyone is wasting their time with the biggest possible
           | models they can afford.
        
             | digdugdirk wrote:
             | Do you have any more detailed info on this process? I've
             | played around with using LLMs, but nothing in the training
             | realm. I'd love to see a writeup or guide to the process
             | you used there.
        
               | llm_trw wrote:
                | No, and it wouldn't do you much good even if I did.
                | 
                | The tools have broken again since then - thanks,
                | TensorFlow data loaders - and my code only works
                | against a version of Python that's no longer
                | supported in LTS Ubuntu/Debian 10+.
               | 
                | I have been mulling over running a subscription
                | service where you get up-to-date code that works for
                | topics like the above. If you're interested, drop me
                | a line at my profile email and I'll add you to a
                | mailing list when/if I ever get around to doing it.
        
         | mountainriver wrote:
          | This is what Voyager did a while back; it's interesting,
          | but I think it's only part of the answer.
        
         | thom wrote:
         | Seems like you could go further than this with something like
         | DSPy and start evaluating which tools contribute to successful
          | outcomes. Funny how much things start to look like Eurisko
          | as time goes on.
        
       | wokwokwok wrote:
       | This is super big news if it's real.
       | 
        | Basically, given an agent with an initial set of predefined
        | actions and a goal, they're saying "decompose this into
        | steps and pick an action to achieve each step". Pretty
        | standard stuff.
       | 
        |  _Then_ they say: hey, if you can't solve the problem with
        | those actions (i.e. you failed repeatedly when attempting to
        | solve it), write some arbitrary generic Python code and use
        | that as your action for the next step.
       | 
       | Then save that as a new generic action, and slowly build up a
       | library of actions to augment the initial set.
       | 
       | The thing is, there's no meaningful difference between the task
       | "write code to solve this task" and "write code to solve this
       | action"; if you can meaningfully generate code that can, without
       | error, perform arbitrary tasks, you've basically solved
       | programming.
       | 
       | So... that would be quite a big deal.
       | 
        | That would be a real "Devin" that would actually be able to
        | write arbitrary code to solve arbitrary problems.
       | 
       | ...which makes me a bit sceptical.
       | 
       | Still, this seems to have at least worked reasonably well (as
       | shown by being a leader on the GAIA leaderboard) so they seem to
       | have done _something_ that works, but I'm left wondering...
       | 
       | If you've figured out how to get an agent to write error free
       | deterministic code to perform arbitrary actions in a chain of
       | thought process, why are you pissing around with worrying about
       | accumulating a library of agent actions?
       | 
       | That's all entirely irrelevant and unnecessary.
       | 
       | Just generate code for each step.
       | 
       | So... something seems a bit strange around this.
       | 
       | I'd love to see a log of the actual problem / action / code
       | sequences.
        
         | Kiro wrote:
         | Devin is real. What do you mean?
         | 
         | Anyway, this is pretty standard stuff already. In all my agent
         | workflows the agents are able to write their own code and
         | execute it before passing the result to the next agent. It
         | doesn't need to be perfect since you always have an agent
         | validating the results, sending the task back if necessary.
         | 
          | I haven't read the paper beyond the synopsis, so I might
          | be missing a crucial takeaway, and I presume it has a lot
          | of additional layers.
        
           | wokwokwok wrote:
           | As evidenced by the reaction to Devin, no, it's not real.
           | 
            | There's a limit beyond which agent-generated code is, in
            | general, not reliable.
           | 
            | All of the people who claim otherwise (like the Devin
            | videos) have been shown to be fake [1] or cherry-picked.
           | 
            | Having an agent generate arbitrary code to solve
            | arbitrary problems is. Not. A. Solved. Problem.
           | 
           | Yet.
           | 
            | ...no matter how many AI bros currently claim otherwise.
           | 
            | Being able to decompose complex problems into parts
            | small enough to be solved by current models would be a
            | big deal if it were real.
           | 
           | (Because, _currently_ the SoTA can't reliably do this; this
           | should not be a remotely controversial claim to people
           | familiar with this space)
           | 
            | So, tl;dr: extraordinary claims require extraordinary
            | evidence, which is absent here as far as I can tell.
            | They specifically call out in the paper that generated
            | actions are overly specific and don't always work; but
            | as I said, it's doing well on the leaderboard, so it's
            | clearly doing _something_ that works, but there's just
            | noooooo way of seeing _what_.
           | 
           | [1] - https://www.zeniteq.com/blog/devins-demo-as-the-first-
           | ai-sof...
        
         | IanCal wrote:
         | > If you've figured out how to get an agent to write error free
         | deterministic code to perform arbitrary actions in a chain of
         | thought process
         | 
          | You don't have to have it perfect, and the more you reuse
          | things that you know work, the less you have to build each
          | time (reducing the places for errors).
         | 
         | > Just generate code for each step.
         | 
          | We don't do this as humans; we build and reuse pieces.
        
       | mosses-sandals wrote:
        | The authors basically said in this paper: let's just save
        | certain working code snippets generated by the LLM and hope
        | that they are also needed in the future, while at the same
        | time concluding that reuse of the saved code is sparse. So
        | this research paper, at this stage, is just useless.
        
         | bloomingkales wrote:
         | I don't know if it's useless, but as someone with no background
         | in ML, I've ad-hoc come up with the exact same idea playing
         | around with LLMs.
         | 
          | So, is this just a low-hanging-fruit idea that looks
          | authoritative because it's in an academic format?
        
       | quicheshore wrote:
        | This is a great application of dynamic tooling. But Figure 5
        | is kind of flawed: it's not a fair comparison when the tool
        | call you provide doesn't work. Obviously the LLM with code
        | execution capabilities will do better.
        
       | killerstorm wrote:
       | Generating code to do stuff was the idea of OpenAI Codex in 2021.
       | 
        | This paper basically just adds a cache? Not really novel, as
        | we already have Codex, Code Interpreter, etc.
        
       | adtac wrote:
        | The paper evaluates itself on the GAIA benchmark. It was my
        | first time hearing about GAIA, so I tried to evaluate
        | myself, as a human, against it.
       | 
       | Here's a level 3 question from the GAIA paper (level 3 =
       | hardest):
       | 
       | >In NASA's Astronomy Picture of the Day on 2006 January 21, two
       | astronauts are visible, with one appearing much smaller than the
       | other. As of August 2023, out of the astronauts in the NASA
       | Astronaut Group that the smaller astronaut was a member of, which
       | one spent the least time in space, and how many minutes did he
       | spend in space, rounded to the nearest minute? Exclude any
       | astronauts who did not spend any time in space. Give the last
       | name of the astronaut, separated from the number of minutes by a
       | semicolon. Use commas as thousands separators in the number of
       | minutes.
       | 
       | I timed myself solving the problem. It took me 9 minutes, 5
       | Google searches, 14 web pages, multiple Ctrl+F in these pages and
       | 1 calculator use to figure out the answer.
       | 
       | DynaSaur seems to have a 10% to 20% success rate at this level.
       | 
       | Try for yourself. This is one of the few empirically grounded
       | reference levels for how far we are from AGI.
        
         | ethbr1 wrote:
          | That seems similar to a ~7th grade reading comprehension
          | question, if all the facts were at hand.
         | 
          | Out of curiosity, if anyone knows: what's the SOTA for how
          | well LLMs actually parse (English) grammar, in the way
          | they're reading the prompt?
         | 
          | A lot of the correctness on the challenge questions seems
          | to come down to identifying key phrases and requests, i.e.
          | reading comprehension.
          | 
          | And multi-step tool use sets a higher bar than straight
          | summarization, since one must be more particular about
          | which pieces of information to focus on.
        
           | adtac wrote:
           | The question above was not preceded by anything; that was the
           | whole question. The facts are at hand in the sense that you
           | have the internet and you're allowed to use it. The hard part
           | is knowing what to search and recognising the answer when you
           | see it. This is much harder than any 7th grade comprehension
           | test I've done :)
        
       | 80hd wrote:
       | Putting this idea out there, haven't seen anyone implement it:
       | 
       | Use vector embeddings to represent each task as a story, an
       | abstraction of 1. the past, 2. the present, 3. the future - on a
       | kind of global "story map".
       | 
       | Each embedding would be generated by all available sense inputs
        | at a point in time. The most useful embedding algorithm will
        | be able to combine sight, hearing, internal monologue,
        | visual imagination, etc. into one point on a
        | high-dimensional map.
       | 
       | At each time step, find the closest successful "memory" (based on
       | embedding of 1+2+3) and do some LLM exploration to adapt the
       | memory to the new, novel situation.
       | 
       | Attempt the new "story", and do something like A* to get closer
       | to the desired "future", tweaking the story each time and
       | plotting failed attempts on the embedding map.
       | 
        | The theory being that over time, the map will become
        | populated with successful attempts, and the embedding will
        | be able to abstract between similar situations based on
        | 1+2+3.
       | 
        | I'm not the guy to implement it, and I imagine new models
        | trained with a "reasoning step" are doing a similar thing at
        | training time.
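        | 
        | A very rough sketch of the retrieve-adapt-attempt loop
        | (purely illustrative; embed, distance, adapt and attempt are
        | hypothetical callables standing in for the real components):
        | 
        |     from dataclasses import dataclass
        | 
        |     @dataclass
        |     class Memory:
        |         vector: list   # embedding of (past, present, future)
        |         story: str     # the plan/actions that were tried
        |         success: bool
        | 
        |     def solve_step(goal, state, memories,
        |                    embed, distance, adapt, attempt):
        |         # retrieve the nearest successful memory, adapt it
        |         # with the LLM, try it, record the outcome on the map
        |         query = embed(state["past"], state["present"], goal)
        |         good = [m for m in memories if m.success] or memories
        |         memory = min(good,
        |                      key=lambda m: distance(query, m.vector))
        |         plan = adapt(memory.story, state, goal)
        |         ok = attempt(plan)
        |         memories.append(Memory(query, plan, ok))
        |         return plan, ok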
        
         | johnsutor wrote:
         | Interesting idea. Similarly, recent work appears to have used
         | MCTS to explore sequential multi-agent systems (see
         | https://arxiv.org/abs/2410.10762,
         | https://arxiv.org/abs/2410.17238).
        
         | bongodongobob wrote:
         | What do you mean by a story? Like a book?
        
       ___________________________________________________________________
       (page generated 2024-12-01 23:00 UTC)