[HN Gopher] SOTA on swebench-verified: relearning the bitter lesson
___________________________________________________________________
SOTA on swebench-verified: relearning the bitter lesson
Author : mcflem007
Score : 11 points
Date   : 2025-01-08 21:25 UTC (1 hour ago)
(HTM) web link (aide.dev)
(TXT) w3m dump (aide.dev)
| dang wrote:
| It would be helpful to explain what this is and what's
| interesting about the updates. Anyone?
|
| Edit: URL since changed - see
| https://news.ycombinator.com/item?id=42639155
|
| ---
|
| Edit: I found these past related threads, but not much discussion
| there:
|
| _Pplx and Dbrx founder giving $1M to first OSS AI that gets 90%
| on SWE-bench_ - https://news.ycombinator.com/item?id=42413392 -
| Dec 2024 (3 comments)
|
| _We might be overestimating coding agent performance on SWE-
| Bench_ - https://news.ycombinator.com/item?id=42054973 - Nov 2024
| (1 comment)
|
| _SWE-Bench Verified_ -
| https://news.ycombinator.com/item?id=41237204 - Aug 2024 (10
| comments)
|
| _Show HN: Public and Free SWE-bench-lite evaluations_ -
| https://news.ycombinator.com/item?id=40974181 - July 2024 (1
| comment)
|
| _#1 agent on swe-bench wrote 7% of its own code_ -
| https://news.ycombinator.com/item?id=40627095 - June 2024 (1
| comment)
|
| _Aider Is SOTA for Both SWE Bench and SWE Bench Lite_ -
| https://news.ycombinator.com/item?id=40562121 - June 2024 (1
| comment)
|
| _How Aider Scored SOTA 26.3% on SWE Bench Lite_ -
| https://news.ycombinator.com/item?id=40477191 - May 2024 (1
| comment)
| amrrs wrote:
| For context:
|
| SWE-Bench (and its Verified subset) is the benchmark for
| resolving GitHub issues that the coding-focused AI companies
| are all chasing - Devin, Claude, OpenAI, all of them.
|
| A new #1 - CodeStory Midwit Agent + swe-search - has been
| crowned with a score of 62% on SWE-bench Verified, without
| even using reasoning models like OpenAI o1 or o3.
|
| More details on their approach:
| https://aide.dev/blog/sota-bitter-lesson
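|
| For the unfamiliar, each SWE-bench Verified instance pairs a
| real GitHub issue with the tests a fix must make pass. A
| minimal sketch of inspecting one, assuming the public Hugging
| Face dataset and its published field names:
|
|     # pip install datasets
|     from datasets import load_dataset
|
|     ds = load_dataset("princeton-nlp/SWE-bench_Verified",
|                       split="test")
|     inst = ds[0]
|     print(inst["repo"])               # e.g. "astropy/astropy"
|     print(inst["problem_statement"])  # the GitHub issue text
|     print(inst["FAIL_TO_PASS"])       # tests the patch must fix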
| alach11 wrote:
| This is a very impressive result. OpenAI was able to achieve
| 72% with o3, but at a very high inference-time compute cost.
|
| I'd be interested in Aide releasing more metrics - token
| counts, total expenditure, etc. - to better understand exactly
| how much test-time compute is involved here. They allude to it
| being a lot, but it would be nice to compare against OpenAI's
| o3.
| amrrs wrote:
| tbh there have been some issues with their previous reporting:
|
| https://x.com/Alex_Cuadron/status/1876017241042587964
| skp1995 wrote:
| Hey! One of the creators of Aide here.
|
| ngl, the total expenditure was around $10k. In terms of test-
| time compute, we ran up to 20 agents on the same problem to
| first understand whether the bitter-lesson paradigm of "scale
| is the answer" really holds true.
|
| The final submission ran 5 agents per problem, and the decider
| was the mean of the reward scores; per problem the cost was
| around $20.
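|
| In pseudocode, that decider step looks roughly like this
| (run_agent and score_patch are hypothetical stand-ins, just
| to show the shape of the selection, not our actual pipeline):
|
|     import random
|     from statistics import mean
|
|     def run_agent(problem: str, seed: int) -> str:
|         # Stand-in: one independent agent rollout that
|         # produces a candidate patch.
|         random.seed(seed)
|         return f"patch-{seed}"
|
|     def score_patch(problem: str, patch: str) -> float:
|         # Stand-in: reward score for a candidate patch
|         # (e.g. from a reward model or test signals).
|         return random.random()
|
|     def best_of_n(problem: str, n_agents: int = 5,
|                   n_scores: int = 3) -> str:
|         scored = []
|         for seed in range(n_agents):
|             patch = run_agent(problem, seed)
|             rewards = [score_patch(problem, patch)
|                        for _ in range(n_scores)]
|             # Decider: mean of the reward scores.
|             scored.append((mean(rewards), patch))
|         # Candidate with the highest mean reward wins.
|         return max(scored)[1]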
|
| We are going to push this scaling paradigm a bit more. My
| honest gut feeling is that swe-bench as a benchmark is ripe
| for saturation real soon:
|
| 1. These problem statements are in the training data for the
| LLMs.
|
| 2. Brute-forcing the answer the way we are doing works - we
| just proved it - so someone is going to take a better stab at
| it real soon.
| dang wrote:
| Thanks! It feels like we should switch the top link to that URL
| since it's a deeper dive into the new bit that's interesting
| here.
|
| Edit: I've done that now. Submitted URL was
| https://www.swebench.com/ and submitted title was "SWE Bench
| just got updated - new #1s".
| attentive wrote:
| This bench seems to be entirely Python-based. Are there
| similar benchmarks that test these tools on other languages?
| WorkerBee28474 wrote:
| > The biggest lesson that can be read from 70 years of AI
| research is that general methods that leverage computation are
| ultimately the most effective, and by a large margin.
|
| So, in the long run, we'll just throw more and more hardware at
| AI, forever.
|
| > The second general point to be learned from the bitter lesson
| is that the actual contents of minds are tremendously,
| irredeemably complex... we should build in only the meta-methods
| that can find and capture this arbitrary complexity.
|
| So AI will permanently involve throwing a ton of compute at a ton
| of data.
|
| I guess it's time to buy stock in computer hardware
| manufacturers.
___________________________________________________________________
(page generated 2025-01-08 23:00 UTC)