[HN Gopher] DynaSaur: Large Language Agents Beyond Predefined Ac...
___________________________________________________________________
DynaSaur: Large Language Agents Beyond Predefined Actions
Author : surprisetalk
Score : 118 points
Date : 2024-12-01 05:21 UTC (17 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| golol wrote:
| I don't like the way LLM papers are written. LLMs receive inputs
| and produce outputs that are best represented as plaintext with
| some special characters. Simply showing a few examples of the
| agent's core LLM text continuation job would explain the
| architecture much better than figures. I can't help but feel that
| the authors who do this are intentionally obfuscating things.
| exe34 wrote:
| it's a delaying tactic, isn't it? If you're working on version
| 2, you don't want version 1 to be so obvious that somebody
| might scoop your version 2. I just wish the reviewers would
| clamp down on this kind of thing.
| ukuina wrote:
| I suspect they do this to give the paper more weight than a
| mere prompt deserves. As an example:
|
| > Given a task u \in \mathcal{U} and a human-designed action
| set \mathcal{A}_u with R \in \mathcal{A}_u, at time step t, we
| sample a thought-action pair (h_t, a_t) \sim \pi_\theta(a_t
| \mid \mathcal{A}_u, u, c_{t-1}) following the ReAct framework
| (Yao et al., 2023b). Here, c_{t-1} = \{(h_1, a_1, o_1), \dots,
| (h_{t-1}, a_{t-1}, o_{t-1})\} represents the interaction
| history up to time t-1. The action a_t is executed, and an
| observation o_t is returned from the environment, updating the
| context to c_t = c_{t-1} \cup \{(h_t, a_t, o_t)\}. If a_t
| contains a new function not present in \mathcal{A}_{t-1}^g, we
| update the generated action set by setting \mathcal{A}_t^g =
| \mathcal{A}_{t-1}^g \cup f(a_t), where f(a_t) denotes the set
| of functions defined in action a_t.
|
| This is a roundabout way to say: "We pick an action based on
| what's happened so far, do it, see the result, and update the
| history. If it's something new, we add it to the list of
| actions we can use."
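|
| In Python-flavored pseudocode, that loop is roughly this (my
| own sketch of the paper, not the authors' code; llm_propose,
| exec_action, new_functions and task_finished are made-up
| helpers):
|
|     def run(task, actions, max_steps=20):
|         history = []  # c_{t-1}: (thought, action, observation)
|         for _ in range(max_steps):
|             thought, action = llm_propose(task, actions, history)
|             observation = exec_action(action)
|             history.append((thought, action, observation))
|             # functions defined inside the action become
|             # reusable actions for later steps
|             # (assuming the action set is a Python set)
|             actions = actions | new_functions(action)
|             if task_finished(observation):
|                 break
|         return history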
| MattPalmer1086 wrote:
| Hah, I had a paper published this year. My co-authors are
| academics but I am not. Honestly, I couldn't understand the
| first version of the paper we wrote, despite inventing the
| algorithm it described!
|
| There is definitely a certain language and a precise
| mathematical approach which is needed to pass review for
| academic papers. It isn't nonsense, but it does obfuscate
| otherwise obvious meanings.
| NitpickLawyer wrote:
| The funny thing is that we're getting closer to being able to
| give that paragraph to an LLM and have it spit out the
| simpler explanation.
|
| This is what ChatGPT gave me for the prompt "can you explain
| this in two sentences". It's pretty close to what you wrote.
|
| > The system follows the ReAct framework to decide on a
| thought and action at each step based on the task, available
| actions, and interaction history, updating its context with
| the results of the action. If the action introduces new
| functions, the system expands its action set to include these
| new capabilities.
| lbeurerkellner wrote:
| Typically these logs are available, but hard to read or just
| dumped in some JSON file.
|
| However, there have been efforts like
| https://explorer.invariantlabs.ai/benchmarks/ that try to make
| agents more transparent in that way (show interaction logs).
| PunchTornado wrote:
| I agree. And I view it as intellectual imposture. Instead of
| saying something really simple which can give good results, you
| obfuscate it a lot to make it sound more intelligent. Reviewers
| shouldn't accept these kinds of papers, and I'm thinking that we
| need a Sokal moment in AI research.
| throwup238 wrote:
| It looks like the key insight here is to have the LLM generate
| its own tools (as in GPT/Claude tool calling) via Python code
| generation and apply cosine similarity RAG to select which tools
| are available at each step using the tool description and the
| problem/step while using recent history to error correct.
|
| The agent starts with some human created tooling like a tool to
| read the file system or create another tool using python code,
| then starts accumulating custom Python functions it wrote itself
| with tool calling metadata like descriptions and input/output
| types. Every time it reaches a step, if it doesn't find a
| relevant tool, it creates a new one. Apparently this improves
| performance on complex tasks (via GAIA benchmark) with
| diminishing returns on simpler tasks.
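|
| A rough sketch of the selection step as I understand it (not
| their code; I'm assuming tools are dicts with a "description"
| field, embed stands in for whatever embedding model you use,
| and top_k is arbitrary):
|
|     import numpy as np
|
|     def cosine(a, b):
|         return float(np.dot(a, b) /
|                      (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     def select_tools(step, tools, embed, top_k=5):
|         # rank accumulated tools by how similar their
|         # descriptions are to the current step
|         q = embed(step)
|         scored = [(cosine(q, embed(t["description"])), t)
|                   for t in tools]
|         scored.sort(key=lambda s: s[0], reverse=True)
|         return [t for _, t in scored[:top_k]]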
| IanCal wrote:
| I played around with making these things before; it's a fun
| exercise. Interesting to see that's where things may be
| heading.
|
| My example was asking for a poem about the headlines (good
| example of info they don't have, and something that's very hard
| to do mechanically).
|
| https://news.ycombinator.com/item?id=37015591
| llm_trw wrote:
| I ended up training a BERT on nothing but Python for the
| embedding search. The results were crap. Then I used an LLM
| to write a new docstring for each class/function definition
| in the training data and the results were better than state
| of the art.
|
| There's so much wide open space to explore. It's a shame that
| everyone is wasting their time with the biggest possible
| models they can afford.
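|
| The shape of it, if anyone wants to try something similar (a
| rough sketch rather than my original code; llm_docstring and
| embed stand in for whatever model calls you use):
|
|     import ast
|
|     def extract_defs(source):
|         # yield (name, source) for each function/class def
|         kinds = (ast.FunctionDef, ast.ClassDef)
|         for node in ast.walk(ast.parse(source)):
|             if isinstance(node, kinds):
|                 seg = ast.get_source_segment(source, node)
|                 yield node.name, seg
|
|     def index_code(source, llm_docstring, embed):
|         # embed an LLM-written description of each definition
|         # rather than the raw code, then search against those
|         return [(name, src, embed(llm_docstring(src)))
|                 for name, src in extract_defs(source)]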
| digdugdirk wrote:
| Do you have any more detailed info on this process? I've
| played around with using LLMs, but nothing in the training
| realm. I'd love to see a writeup or guide to the process
| you used there.
| llm_trw wrote:
| No, and it wouldn't do you much good even if I did.
|
| The tools have broken again since then - thanks, TensorFlow
| data loaders - and my code only works against a version of
| Python that's no longer supported in LTS Ubuntu/Debian 10+.
|
| I have been mulling over running a subscription service
| where you get up-to-date code that works on topics like
| the above. If you're interested drop me a line at my
| profile email and I'll add you to a mailing list when/if
| I ever get around to doing it.
| mountainriver wrote:
| This is what Voyager did a while back; it's interesting, but I
| think it's only part of the answer.
| thom wrote:
| Seems like you could go further than this with something like
| DSPy and start evaluating which tools contribute to successful
| outcomes. Funny how much things start to look like Eurisko the
| more time goes on.
| wokwokwok wrote:
| This is super big news if it's real.
|
| Basically, given an agent with an initial set of predefined
| actions and goal, they're saying "decompose this into steps and
| pick an action to achieve each step". Pretty standard stuff.
|
| _Then_ they say, hey, if you can't solve the problem with those
| actions (i.e. failed repeatedly when attempting to solve), write
| some arbitrary generic python code and use that as your action
| for the next step.
|
| Then save that as a new generic action, and slowly build up a
| library of actions to augment the initial set.
|
| The thing is, there's no meaningful difference between the task
| "write code to solve this task" and "write code to solve this
| action"; if you can meaningfully generate code that can, without
| error, perform arbitrary tasks, you've basically solved
| programming.
|
| So... that would be quite a big deal.
|
| That would be a real "Devin" that would actually be able to write
| arbitrary code to solve arbitrary problems.
|
| ...which makes me a bit sceptical.
|
| Still, this seems to have at least worked reasonably well (as
| shown by being a leader on the GAIA leaderboard) so they seem to
| have done _something_ that works, but I'm left wondering...
|
| If you've figured out how to get an agent to write error free
| deterministic code to perform arbitrary actions in a chain of
| thought process, why are you pissing around with worrying about
| accumulating a library of agent actions?
|
| That's all entirely irrelevant and unnecessary.
|
| Just generate code for each step.
|
| So... something seems a bit strange around this.
|
| I'd love to see a log of the actual problem / action / code
| sequences.
| Kiro wrote:
| Devin is real. What do you mean?
|
| Anyway, this is pretty standard stuff already. In all my agent
| workflows the agents are able to write their own code and
| execute it before passing the result to the next agent. It
| doesn't need to be perfect since you always have an agent
| validating the results, sending the task back if necessary.
|
| I haven't read the paper beyond the synopsis, so I might be
| missing a key takeaway, and I presume it has a lot of
| additional layers.
| wokwokwok wrote:
| As evidenced by the reaction to Devin, no, it's not real.
|
| There's a limit, beyond which agent generated code is, in
| general, not reliable.
|
| All of the people who claim otherwise (like the Devin videos)
| have been shown to be fake [1] or cherry-picked.
|
| Having agents generate arbitrary code to solve arbitrary
| problems is. Not. A. Solved. Problem.
|
| Yet.
|
| ...no matter how many AI bros claim otherwise, currently.
|
| Being able to decompose complex problems into parts small
| enough to be solved by current models would be a big deal if
| it were real.
|
| (Because, _currently_ the SoTA can't reliably do this; this
| should not be a remotely controversial claim to people
| familiar with this space)
|
| So, tl;dr: extraordinary claims require extraordinary
| evidence, which is absent here as far as I can tell. They
| specifically call out in the paper that generated actions are
| overly specific and don't always work; but as I said, it's
| doing well on the leaderboard, so it's clearly doing
| _something_ that works, but there's just noooooo way of seeing
| _what_.
|
| [1] - https://www.zeniteq.com/blog/devins-demo-as-the-first-
| ai-sof...
| IanCal wrote:
| > If you've figured out how to get an agent to write error free
| deterministic code to perform arbitrary actions in a chain of
| thought process
|
| You don't have to have it perfect, and the more you reuse
| things that you know work, the less you have to build each
| time (reducing places for errors).
|
| > Just generate code for each step.
|
| We don't do this as humans, we build and reuse pieces.
| mosses-sandals wrote:
| The authors basically said in this paper: let's just save
| working code snippets generated by the LLM and hope they are
| also needed in the future, while at the same time concluding
| that the saved code is sparsely reused. So this research
| paper, at this stage, is just useless.
| bloomingkales wrote:
| I don't know if it's useless, but as someone with no background
| in ML, I've ad-hoc come up with the exact same idea playing
| around with LLMs.
|
| So, is this just a low-hanging-fruit idea that looks
| authoritative because it's in an academic format?
| quicheshore wrote:
| This is a great application of dynamic tooling. But Figure 5
| is kind of flawed: it's not a fair comparison when the tool
| calls you provide don't work. Obviously the LLM with code
| execution capabilities will do better.
| killerstorm wrote:
| Generating code to do stuff was the idea of OpenAI Codex in 2021.
|
| This paper basically just adds a cache? Not really novel as we
| already have Codex, Code Interpreter, etc.
| adtac wrote:
| The paper evaluates itself on the GAIA benchmark and it was my
| first time hearing about it, so I tried to evaluate myself as a
| human.
|
| Here's a level 3 question from the GAIA paper (level 3 =
| hardest):
|
| >In NASA's Astronomy Picture of the Day on 2006 January 21, two
| astronauts are visible, with one appearing much smaller than the
| other. As of August 2023, out of the astronauts in the NASA
| Astronaut Group that the smaller astronaut was a member of, which
| one spent the least time in space, and how many minutes did he
| spend in space, rounded to the nearest minute? Exclude any
| astronauts who did not spend any time in space. Give the last
| name of the astronaut, separated from the number of minutes by a
| semicolon. Use commas as thousands separators in the number of
| minutes.
|
| I timed myself solving the problem. It took me 9 minutes, 5
| Google searches, 14 web pages, multiple Ctrl+F in these pages and
| 1 calculator use to figure out the answer.
|
| DynaSaur seems to have a 10% to 20% success rate at this level.
|
| Try for yourself. This is one of the few empirically grounded
| reference levels for how far we are from AGI.
| ethbr1 wrote:
| That seems similar to a ~7th grade reading comprehension
| question, if all the facts were at hand.
|
| Out of curiosity, if anyone knows, what's the SOTA for how
| well LLMs actually parse (English) grammar, in the way they
| read the prompt?
|
| A lot of getting the challenge questions right seems to come
| down to identifying key phrases and requests, i.e. reading
| comprehension.
|
| And multi-step tool use sets a higher bar than straight
| summarization, as one must be more particular about which
| pieces of information to focus on.
| adtac wrote:
| The question above was not preceded by anything; that was the
| whole question. The facts are at hand in the sense that you
| have the internet and you're allowed to use it. The hard part
| is knowing what to search and recognising the answer when you
| see it. This is much harder than any 7th grade comprehension
| test I've done :)
| 80hd wrote:
| Putting this idea out there, haven't seen anyone implement it:
|
| Use vector embeddings to represent each task as a story, an
| abstraction of 1. the past, 2. the present, 3. the future - on a
| kind of global "story map".
|
| Each embedding would be generated by all available sense inputs
| at a point in time. The most useful embedding algorithm will
| be able to combine sight, hearing, internal monologue, visual
| imagination, etc. into one point on a high-dimensional map.
|
| At each time step, find the closest successful "memory" (based on
| embedding of 1+2+3) and do some LLM exploration to adapt the
| memory to the new, novel situation.
|
| Attempt the new "story", and do something like A* to get closer
| to the desired "future", tweaking the story each time and
| plotting failed attempts on the embedding map.
|
| The theory being that, over time, the map will become
| populated with successful attempts and the embedding will be
| able to abstract across similar situations based on 1+2+3.
|
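| The retrieval step would be something like this (just a
| sketch; embed stands in for whatever multimodal embedding
| model combines those senses, and a memory is a stored
| past/present/future triple):
|
|     import numpy as np
|
|     def nearest_memory(situation, memories, embed):
|         # pick the stored episode whose embedding is
|         # closest to the current situation
|         q = embed(situation)
|         def score(mem):
|             v = embed(mem)
|             denom = np.linalg.norm(q) * np.linalg.norm(v)
|             return float(np.dot(q, v) / denom)
|         return max(memories, key=score)
|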
| I'm not the guy to implement it, and I imagine new models
| training with a "reasoning step" are doing a similar thing at
| training-time.
| johnsutor wrote:
| Interesting idea. Similarly, recent work appears to have used
| MCTS to explore sequential multi-agent systems (see
| https://arxiv.org/abs/2410.10762,
| https://arxiv.org/abs/2410.17238).
| bongodongobob wrote:
| What do you mean by a story? Like a book?
___________________________________________________________________
(page generated 2024-12-01 23:00 UTC)