https://newsletter.getdx.com/p/benchmarking-ai-software-engineering-capabilities

Can LLMs earn $1M from real freelance coding work?

A new benchmark tests AI's ability to complete real-world software engineering tasks.

Abi Noda
Apr 16, 2025

Welcome to the latest issue of Engineering Enablement, a weekly newsletter sharing research and perspectives on developer productivity.

---------------------------------------------------------------------

This week I read SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?, a new paper from OpenAI that evaluates how well frontier AI models perform on real-world software development tasks.

My summary of the paper

In just two years, language models have advanced from solving basic textbook computer science problems to winning gold medals in international programming competitions, so it can be difficult for leaders to keep an accurate picture of AI's current capabilities. In this paper, the researchers provide the first benchmark of its kind: while previous coding benchmarks focused on isolated, self-contained problems, SWE-Lancer measures how AI models handle real-world freelance tasks, and in doing so offers a grounded view of what AI can do today.

Creating the benchmark

To measure how well AI handles real-world work, the researchers collected over 1,400 freelance jobs posted on Upwork. Each task had a real dollar value attached, ranging from $250 to $32,000, and collectively the tasks were worth over $1 million. By tying performance to actual dollar values, the study gives a clearer picture of AI's potential economic impact.

The tasks were split into two types:

1. Individual contributor software engineering tasks: "Can AI fix this bug or build this feature?" These ranged from quick 15-minute fixes to multi-week feature requests. The model was given the issue description and access to the codebase, then had to write code to solve the problem. Human engineers built tests to check whether the model's solution actually worked.

2. Engineering manager tasks: "Can AI pick the best solution?" Here the model had to review multiple freelancer submissions for a job and choose the best one, just as a hiring manager would. The correct answer was the submission the original manager picked.

To be thorough, the researchers paid 100 professional software engineers to create and verify tests for every task, and each test was triple-verified.

They measured the success of each AI model using:

* Pass rate: how many tasks it completed successfully
* Earnings: the total dollar value of the tasks it completed
* Performance variations: how results changed when models had more tries, more time, or access to tools

This methodology provides a realistic picture of how well current AI models can handle the kinds of software engineering tasks that companies actually pay humans to do.
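For a concrete sense of the scoring, here is a minimal sketch of how the two headline metrics could be computed from per-task results. The Task record and the sample data below are hypothetical, not the paper's actual harness; only the scoring idea comes from the study: a task counts only if the model's solution passes the human-written tests, and earnings are the sum of the real Upwork payouts for the tasks it solves.

```python
# Minimal sketch of SWE-Lancer-style scoring (hypothetical data structures,
# not the paper's harness): a task is "solved" only if its end-to-end tests
# pass, and earnings sum the real payouts of solved tasks.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str        # hypothetical identifier for the Upwork job
    payout_usd: float   # real dollar value attached to the job
    tests_passed: bool  # did the model's solution pass the human-written tests?

def score(tasks: list[Task]) -> tuple[float, float]:
    """Return (pass rate, total dollars earned) over a list of tasks."""
    solved = [t for t in tasks if t.tests_passed]
    pass_rate = len(solved) / len(tasks)
    earnings = sum(t.payout_usd for t in solved)
    return pass_rate, earnings

# Illustrative example: three tasks, one solved.
results = [
    Task("upwork-001", 250.0, True),
    Task("upwork-002", 1_000.0, False),
    Task("upwork-003", 32_000.0, False),
]
rate, earned = score(results)
print(f"pass rate: {rate:.1%}, earned: ${earned:,.0f}")  # pass rate: 33.3%, earned: $250
```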
Performance and findings

The researchers evaluated three frontier models on the selected tasks:

* Claude 3.5 Sonnet
* OpenAI's GPT-4o
* OpenAI's o1 with high reasoning effort

Here's what they found:

1. All models underperform human engineers. All models earned well below the full $1 million USD of possible payout. Even the best-performing model, Claude 3.5 Sonnet, earned approximately $403,000 of the possible $1 million, solving only 33.7% of all tasks.

2. All models perform better at management tasks than at coding. All models were significantly better at picking the best solution than at producing one. For example, Claude 3.5 Sonnet successfully completed 47% of management tasks but only 21.1% of implementation tasks. This suggests that AI might first aid engineering teams by helping with code reviews and architectural decisions before it can reliably write complex code.

3. Performance improves with multiple attempts. Allowing the o1 model 7 attempts instead of 1 nearly tripled its success rate, from 16.5% to 46.5%. This hints that current models may have the knowledge to solve many more problems but struggle to execute on the first try. (A sketch of the pass@k convention behind numbers like these follows this list.)

4. More computation time helps, especially on harder problems. Increasing "reasoning effort" improved o1's performance from 9.3% to 16.5%, with even bigger gains on complex tasks. This indicates that current limitations may be computational rather than fundamental.

5. Models show significant differences in capabilities. Claude 3.5 Sonnet (31.7%) drastically outperformed GPT-4o (2.4%) on UI/UX tasks, suggesting important differences in how these models handle visual and interface elements.
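A note on the "more attempts" result in finding 3: in the coding-benchmark literature, success within k attempts is conventionally reported as pass@k, and the standard unbiased estimator for it comes from OpenAI's Codex paper (Chen et al., 2021). The sketch below shows that estimator; whether SWE-Lancer computes its multi-attempt numbers exactly this way is an assumption on my part, and the sample values are illustrative, not from the paper.

```python
# Standard unbiased pass@k estimator from Chen et al. (2021), "Evaluating
# Large Language Models Trained on Code". Given n sampled solutions to a
# task, of which c pass the tests, it estimates the probability that at
# least one of k randomly drawn samples passes. This is the general
# convention, not necessarily SWE-Lancer's exact procedure.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        # Every possible k-subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers (not from the paper): 10 samples per task, 2 passing.
print(f"pass@1 = {pass_at_k(10, 2, 1):.2f}")  # 0.20: one-shot success rate
print(f"pass@7 = {pass_at_k(10, 2, 7):.2f}")  # 0.93: far higher with 7 tries
```

The design point worth noting: a large jump from pass@1 to pass@7, like o1's 16.5% to 46.5%, can reflect luck across retries as much as latent capability, since in a real workflow someone (or some test suite) still has to identify which attempt actually worked. This is exactly the concern raised in the comments below.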
Final thoughts

While today's best AI models can successfully complete some real-world tasks, they still cannot reliably handle the majority of complex projects. This study demonstrates both the incredible progress AI has made and the significant challenges that remain before teams can fully automate coding tasks. Most importantly, the benchmark introduced in this paper provides a concrete way to measure AI progress going forward, helping leaders better understand AI's current abilities and forecast its impact.

---------------------------------------------------------------------

Who's hiring right now

This week's featured DevProd & Platform job openings. See more open roles here.

* Adyen is hiring a Team Lead - Platform | Amsterdam
* Scribd is hiring a Senior Manager - Developer Tooling | Remote (US, Canada)
* Rippling is hiring a Director of Product Management - Platform | San Francisco
* UKG is hiring a Director and Sr Director of Technical Program Management | Multiple locations
* Snowflake is hiring a Director of Engineering - Test Framework | Bellevue and Menlo Park
* Lyft is hiring an Engineering Manager - DevEx | Toronto

---------------------------------------------------------------------

That's it for this week. If you know someone who would enjoy this issue, share it with them.

Discussion about this post

Sep (1h): My N=1 experiment with all publicly available solutions, including raw frontier models and solutions built specifically for AI development tasks like v0 or bolt, proves that while I can get some help, especially in terms of ideation, the more specific I become about the requirements, the less useful the results are.

Joachim Sammer (3h): I think the "more attempts" result needs clarification. It sounds like there is an improvement with more attempts, whereas more tries might simply lead to probabilistic success if the LLM gets lucky. There is also a (k) in the diagram that is not explained in the text. Commonly this stands for kilo, as in 1,000. So, the hapless engineering manager of their LLM team has to wade through thousands of results? Even 7 is bad enough...