[HN Gopher] SWE-Lancer: a benchmark of freelance software engine...
___________________________________________________________________
SWE-Lancer: a benchmark of freelance software engineering tasks
from Upwork
Author : zone411
Score : 9 points
Date : 2025-02-18 05:25 UTC (17 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| colesantiago wrote:
| Can anyone explain how this research benefits humanity for
| OpenAI's mission?
|
| OpenAI's AGI mission statement
|
| > "By AGI we mean highly autonomous systems that outperform
| humans at most economically valuable work."
|
| https://openai.com/index/how-should-ai-systems-behave/
|
| I would have to admit some humility as I sort of brought this on
| myself [1]
|
| > This is a fantastic idea. Perhaps then this should be the next
| test for these SWE Agents, in the same manner as the "Will Smith
| Eats Spaghetti" video tests
|
| https://news.ycombinator.com/item?id=43032191
|
| But curiously the question is still valid.
|
| Related:
|
| Sam Altman: 50¢ of compute of a SWE Agent can yield "$500 or
| $5k of work."
|
| https://news.ycombinator.com/item?id=43032098
|
| https://x.com/vitrupo/status/1889720371072696554
| bufferoverflow wrote:
| And how do you evaluate if the task was completed correctly?
| There are nearly infinite ways to solve a given software dev
| problem, if the problem isn't trivial (and I hope they are not
| benchmarking trivial problems).
| riku_iki wrote:
| The paper says they created e2e tests to check whether each task
| was completed successfully.
| Tiberium wrote:
| The extremely interesting part is that 3.5 Sonnet is above o1 on
| this benchmark, which again shows that 3.5 Sonnet is a very
| special model that's best for real world tasks and not some one-
| shot scripts or math. And the weirdest part is that they tested
| the 20240620 snapshot which is objectively worse on code than the
| newer 20241022 (so-called v2).
| GaggiX wrote:
| I understand why they did not show the results on the website.
| moralestapia wrote:
| The writing is very clearly on the wall.
|
| On a non-pessimist note, I don't think the SWE role will
| disappear, but what's the best one could do to be prepared for
| this?
| bigbones wrote:
| There will always be "real thinking" roles in software but the
| sheer pressure on salaries from the vastly increasing free
| labour pool will lead to an outcome a bit like embedded
| software development, where rates don't really match the skill
| level. I think the most obvious strategy for the time being is
| figuring out how to become a buyer of the services you
| understand rather than a badly crowded-out seller.
| neilv wrote:
| "SWE-Lancer", like, running through SWEs with a lance?
| marinhero wrote:
| I was thinking about the Gears of War Lancer--an assault rifle
| with a chainsaw attached, designed to slice Locust in half. But
| in this case, imagine it being used on Software Engineers
| instead of Locust.
___________________________________________________________________
(page generated 2025-02-18 23:00 UTC)