[HN Gopher] SWE-Lancer: a benchmark of freelance software engine...
       ___________________________________________________________________
        
       SWE-Lancer: a benchmark of freelance software engineering tasks
       from Upwork
        
       Author : zone411
       Score  : 9 points
       Date   : 2025-02-18 05:25 UTC (17 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | colesantiago wrote:
       | Can anyone explain how this research benefits humanity for
       | OpenAI's mission?
       | 
       | OpenAI's AGI mission statement
       | 
       | > "By AGI we mean highly autonomous systems that outperform
       | humans at most economically valuable work."
       | 
       | https://openai.com/index/how-should-ai-systems-behave/
       | 
       | I would have to admit some humility as I sort of brought this on
       | myself [1]
       | 
       | > This is a fantastic idea. Perhaps then this should be the next
       | test for these SWE Agents, in the same manner as the 'Will Smith
       | Eats Spaghetti' video tests
       | 
       | https://news.ycombinator.com/item?id=43032191
       | 
       | But curiously the question is still valid.
       | 
       | Related:
       | 
       | Sam Altman: "50¢ of compute of a SWE Agent can yield "$500 or
       | $5k of work."
       | 
       | https://news.ycombinator.com/item?id=43032098
       | 
       | https://x.com/vitrupo/status/1889720371072696554
        
       | bufferoverflow wrote:
       | And how do you evaluate if the task was completed correctly?
       | There are nearly infinite ways to solve a given software dev
       | problem, if the problem isn't trivial (and I hope they are not
       | benchmarking trivial problems).
        
         | riku_iki wrote:
         | The paper says they created e2e tests to check whether each
         | task was completed successfully.
        
       | Tiberium wrote:
       | The extremely interesting part is that 3.5 Sonnet is above o1 on
       | this benchmark, which again shows that 3.5 Sonnet is a very
       | special model that's best for real world tasks and not some one-
       | shot scripts or math. And the weirdest part is that they tested
       | the 20240620 snapshot which is objectively worse on code than the
       | newer 20241022 (so-called v2).
        
         | GaggiX wrote:
         | I understand why they did not show the results on the website.
        
       | moralestapia wrote:
       | The writing is very clearly on the wall.
       | 
       | On a non-pessimist note, I don't think the SWE role will
       | disappear, but what's the best one could do to be prepared for
       | this?
        
         | bigbones wrote:
         | There will always be "real thinking" roles in software but the
         | sheer pressure on salaries from the vastly increasing free
         | labour pool will lead to an outcome a bit like embedded
         | software development, where rates don't really match the skill
         | level. I think the most obvious strategy for the time being is
         | figuring out how to become a buyer of the services you
         | understand rather than a badly crowded-out seller.
        
       | neilv wrote:
       | "SWE-Lancer", like, running through SWEs with a lance?
        
         | marinhero wrote:
         | I was thinking about the Gears of War Lancer--an assault rifle
         | with a chainsaw attached, designed to slice Locust in half. But
         | in this case, imagine it being used on Software Engineers
         | instead of Locust.
        
       ___________________________________________________________________
       (page generated 2025-02-18 23:00 UTC)