[HN Gopher] SOTA on swebench-verified: relearning the bitter lesson
       ___________________________________________________________________
        
       SOTA on swebench-verified: relearning the bitter lesson
        
       Author : mcflem007
       Score  : 11 points
       Date   : 2025-01-08 21:25 UTC (1 hour ago)
        
 (HTM) web link (aide.dev)
 (TXT) w3m dump (aide.dev)
        
       | dang wrote:
       | It would be helpful to explain what this is and what's
       | interesting about the updates. Anyone?
       | 
       | Edit: URL since changed - see
       | https://news.ycombinator.com/item?id=42639155
       | 
       | ---
       | 
       | Edit: I found these past related threads, but not much discussion
       | there:
       | 
       |  _Pplx and Dbrx founder giving $1M to first OSS AI that gets 90%
       | on SWE-bench_ - https://news.ycombinator.com/item?id=42413392 -
       | Dec 2024 (3 comments)
       | 
       |  _We might be overestimating coding agent performance on SWE-
       | Bench_ - https://news.ycombinator.com/item?id=42054973 - Nov 2024
       | (1 comment)
       | 
       |  _SWE-Bench Verified_ -
       | https://news.ycombinator.com/item?id=41237204 - Aug 2024 (10
       | comments)
       | 
       |  _Show HN: Public and Free SWE-bench-lite evaluations_ -
       | https://news.ycombinator.com/item?id=40974181 - July 2024 (1
       | comment)
       | 
       |  _#1 agent on swe-bench wrote 7% of its own code_ -
       | https://news.ycombinator.com/item?id=40627095 - June 2024 (1
       | comment)
       | 
       |  _Aider Is SOTA for Both SWE Bench and SWE Bench Lite_ -
       | https://news.ycombinator.com/item?id=40562121 - June 2024 (1
       | comment)
       | 
       |  _How Aider Scored SOTA 26.3% on SWE Bench Lite_ -
       | https://news.ycombinator.com/item?id=40477191 - May 2024 (1
       | comment)
        
       | amrrs wrote:
       | For Context:
       | 
        | SWE-Bench (and its Verified subset) is the benchmark for
        | resolving GitHub issues that coding-focused companies are all
        | chasing - Devin, Claude, OpenAI, the lot!
       | 
        | A new #1 - CodeStory Midwit Agent + swe-search - has been
        | crowned with a score of 62% on SWE-bench Verified (without even
        | using reasoning models like OpenAI o1 or o3).
       | 
        | More details on their approach:
        | https://aide.dev/blog/sota-bitter-lesson
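         
        A minimal sketch of what a SWE-bench Verified task looks like,
        assuming the HuggingFace datasets library and the publicly hosted
        princeton-nlp/SWE-bench_Verified dataset (field names follow the
        public dataset card; treat this as illustrative, not
        authoritative):
         
          # Peek at one SWE-bench Verified task instance.
          # Assumes "pip install datasets" and access to the HuggingFace
          # hub; field names follow the public dataset card.
          from datasets import load_dataset
         
          ds = load_dataset("princeton-nlp/SWE-bench_Verified",
                            split="test")
         
          task = ds[0]
          print(task["instance_id"])        # which repo/issue this is
          print(task["repo"])               # the GitHub repo to patch
          print(task["problem_statement"])  # issue text the agent resolves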
        
         | alach11 wrote:
          | This is a very impressive result. OpenAI was able to achieve
          | 72% with o3, but at a very high inference-time compute cost.
          | 
          | I'd be interested to see Aide release more metrics on token
          | counts, total expenditure, etc., to better understand exactly
          | how much test-time compute is involved here. They allude to it
          | being a lot, but it would be nice to compare against OpenAI's
          | o3.
        
           | amrrs wrote:
            | tbh there have been some issues with their previous
            | reporting:
           | 
           | https://x.com/Alex_Cuadron/status/1876017241042587964
        
           | skp1995 wrote:
           | Hey! One of the creators of Aide here.
           | 
            | ngl the total expenditure was around $10k. In terms of test-
            | time compute, we ran up to 20 agents on the same problem to
            | first understand if the bitter lesson paradigm of "scale is
            | the answer" really holds true.
            | 
            | The final submission ran 5 agents per problem, with the
            | decider based on the mean score of the rewards; per problem
            | the cost was around $20.
            | 
            | We are going to push this scaling paradigm a bit more. My
            | honest gut feeling is that swe-bench as a benchmark is ripe
            | for saturation real soon:
            | 
            | 1. These problem statements are in the training data for the
            | LLMs.
            | 
            | 2. Brute-forcing the answer the way we are doing works, and
            | we just proved it, so someone is going to take a better stab
            | at it real soon.
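         
        A minimal sketch of the best-of-N pattern described above: run
        several agents on the same problem in parallel, score each
        candidate patch a few times, and keep the one with the highest
        mean reward. run_agent and score_patch are hypothetical stand-ins
        for illustration, not Aide's actual API:
         
          # Best-of-N test-time scaling, per the approach sketched above.
          # run_agent and score_patch are hypothetical stand-ins.
          import random
          from concurrent.futures import ThreadPoolExecutor
          from statistics import mean
         
          N_AGENTS = 5  # the final submission reportedly ran 5 agents
          N_SCORES = 3  # assumed: reward samples taken per candidate
         
          def run_agent(problem: str) -> str:
              # Stand-in for a full agent rollout; a real implementation
              # would drive an LLM against the repository checkout.
              return f"candidate patch #{random.randrange(1000)}"
         
          def score_patch(problem: str, patch: str) -> float:
              # Stand-in for a reward model grading a candidate patch.
              return random.random()
         
          def solve(problem: str) -> str:
              # Run N independent agents on the same problem in parallel.
              with ThreadPoolExecutor(max_workers=N_AGENTS) as pool:
                  patches = list(pool.map(
                      lambda _: run_agent(problem), range(N_AGENTS)))
              # Keep the patch with the highest mean reward.
              return max(patches, key=lambda p: mean(
                  score_patch(problem, p) for _ in range(N_SCORES)))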
        
         | dang wrote:
         | Thanks! It feels like we should switch the top link to that URL
         | since it's a deeper dive into the new bit that's interesting
         | here.
         | 
         | Edit: I've done that now. Submitted URL was
         | https://www.swebench.com/ and submitted title was "SWE Bench
         | just got updated - new #1s".
        
       | attentive wrote:
        | This benchmark seems to be entirely Python-based. Are there
        | similar benchmarks that test these tools on other languages?
        
       | WorkerBee28474 wrote:
       | > The biggest lesson that can be read from 70 years of AI
       | research is that general methods that leverage computation are
       | ultimately the most effective, and by a large margin.
       | 
       | So, in the long run, we'll just throw more and more hardware at
       | AI, forever.
       | 
       | > The second general point to be learned from the bitter lesson
       | is that the actual contents of minds are tremendously,
       | irredeemably complex... we should build in only the meta-methods
       | that can find and capture this arbitrary complexity.
       | 
       | So AI will permanently involve throwing a ton of compute at a ton
       | of data.
       | 
       | I guess it's time to buy stock in computer hardware
       | manufacturers.
        
       ___________________________________________________________________
       (page generated 2025-01-08 23:00 UTC)