[HN Gopher] Claude 4
___________________________________________________________________
Claude 4
Author : meetpateltech
Score : 1311 points
Date : 2025-05-22 16:34 UTC (6 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| swyx wrote:
| livestream here: https://youtu.be/EvtPBaaykdo
|
| my highlights:
|
| 1. Coding ability: "Claude Opus 4 is our most powerful model yet
| and the best coding model in the world, leading on SWE-bench
| (72.5%) and Terminal-bench (43.2%). It delivers sustained
| performance on long-running tasks that require focused effort and
| thousands of steps, with the ability to work continuously for
| several hours--dramatically outperforming all Sonnet models and
| significantly expanding what AI agents can accomplish." However,
| this is Best of N, with no transparency on the size of N or how
| they pick the best, saying "We then use an internal scoring model
| to select the best candidate from the remaining attempts." Claude
| Code is now generally available (we covered in
| http://latent.space/p/claude-code )
|
| 2. Memory highlight: "Claude Opus 4 also dramatically outperforms
| all previous models on memory capabilities. When developers build
| applications that provide Claude local file access, Opus 4
| becomes skilled at creating and maintaining 'memory files' to
| store key information. This unlocks better long-term task
| awareness, coherence, and performance on agent tasks--like Opus 4
| creating a 'Navigation Guide' while playing Pokemon." Memory
| Cookbook: https://github.com/anthropics/anthropic-
| cookbook/blob/main/t...
|
| 3. Raw CoT available: "we've introduced thinking summaries for
| Claude 4 models that use a smaller model to condense lengthy
| thought processes. This summarization is only needed about 5% of
| the time--most thought processes are short enough to display in
| full. Users requiring raw chains of thought for advanced prompt
| engineering can contact sales about our new Developer Mode to
| retain full access."
|
| 4. haha: "We no longer include the third 'planning tool' used by
| Claude 3.7 Sonnet. " <- psyop?
|
| 5. context caching now has a premium 1hr TTL option: "Developers
| can now choose between our standard 5-minute time to live (TTL)
| for prompt caching or opt for an extended 1-hour TTL at an
| additional cost"
|
| 6. https://www.anthropic.com/news/agent-capabilities-api new code
| execution tool (sandbox) and file tool
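|
| re #2: the shape of it is basically "hand Claude a read/write file
| tool and let it decide when to save notes to itself." A minimal
| sketch of what that wiring might look like (my guess at the loop,
| not the cookbook's code; the model id, tool names, and memory
| directory are placeholders):
|
|     import pathlib
|     import anthropic
|
|     MEMORY_DIR = pathlib.Path("./memory")  # hypothetical local scratch dir
|     MEMORY_DIR.mkdir(exist_ok=True)
|     client = anthropic.Anthropic()
|
|     tools = [
|         {"name": "read_memory",
|          "description": "Read a previously saved memory file.",
|          "input_schema": {"type": "object",
|                           "properties": {"filename": {"type": "string"}},
|                           "required": ["filename"]}},
|         {"name": "write_memory",
|          "description": "Create or overwrite a memory file with key info.",
|          "input_schema": {"type": "object",
|                           "properties": {"filename": {"type": "string"},
|                                          "content": {"type": "string"}},
|                           "required": ["filename", "content"]}},
|     ]
|
|     def run_tool(name, args):
|         path = MEMORY_DIR / args["filename"]
|         if name == "read_memory":
|             return path.read_text() if path.exists() else "(no such memory file)"
|         path.write_text(args["content"])  # write_memory
|         return "saved"
|
|     messages = [{"role": "user",
|                  "content": "Continue the refactor; check your notes first."}]
|     while True:
|         resp = client.messages.create(
|             model="claude-opus-4-20250514",  # assumed model id
|             max_tokens=1024,
|             tools=tools,
|             messages=messages,
|         )
|         if resp.stop_reason != "tool_use":
|             break
|         # echo the assistant turn, then return a tool_result for each call
|         messages.append({"role": "assistant", "content": resp.content})
|         messages.append({"role": "user", "content": [
|             {"type": "tool_result", "tool_use_id": b.id,
|              "content": run_tool(b.name, b.input)}
|             for b in resp.content if b.type == "tool_use"
|         ]})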
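|
| re #5: the extended TTL appears to be just an extra field on the
| cache_control block plus a beta header. Untested sketch from memory
| of the docs (the ttl field, beta flag, and model id are assumptions
| and may be off):
|
|     import anthropic
|
|     client = anthropic.Anthropic()
|     big_reference_doc = open("docs/reference.md").read()  # placeholder
|
|     resp = client.messages.create(
|         model="claude-sonnet-4-20250514",  # assumed model id
|         max_tokens=512,
|         # assumed beta flag for the 1-hour TTL option
|         extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
|         system=[{
|             "type": "text",
|             "text": big_reference_doc,
|             # default is the 5-minute ephemeral cache; "1h" opts into the
|             # premium 1-hour TTL at additional cost
|             "cache_control": {"type": "ephemeral", "ttl": "1h"},
|         }],
|         messages=[{"role": "user",
|                    "content": "Answer using the reference doc above."}],
|     )
|     print(resp.usage)  # cache creation/read token counts show cache hits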
| modeless wrote:
| Memory could be amazing for coding in large codebases. Web
| search could be great for finding docs on dependencies as well.
| Are these features integrated with Claude Code though?
| obiefernandez wrote:
| bigyabai wrote:
| > Really wish I could say more.
|
| Have you used it?
|
| I liked Claude 3.7 but without context this comes off as what
| the kids would call "glazing"
| oytis wrote:
| Game changer is table stakes now, tell us something new.
| modeless wrote:
| Ooh, VS Code integration for Claude Code sounds nice. I do feel
| like Claude Code works better than the native Cursor agent mode.
|
| Edit: How do you install it? Running `/ide` says "Make sure your
| IDE has the Claude Code extension", where do you get that?
| ChadMoran wrote:
| I've been looking for that as well.
| ChadMoran wrote:
| Run the claude CLI from inside the VS Code terminal.
|
| https://news.ycombinator.com/item?id=44064082
| k8sToGo wrote:
| When you install Claude Code (or update it), there is a .vsix
| file in the same directory as the claude binary.
| modeless wrote:
| Thanks, I found it in node_modules/@anthropic-ai/claude-
| code/vendor
| jacobzweig wrote:
| Run claude in the terminal in VSCode (or cursor) and it will
| install automatically!
| modeless wrote:
| Doesn't work in Cursor over ssh. I found the VSIX in
| node_modules/@anthropic-ai/claude-code/vendor and installed
| it manually.
| ChadMoran wrote:
| Found it!
|
| https://news.ycombinator.com/item?id=44064082
| michelb wrote:
| What would this do other than run Claude Code in the same
| directory you have open in VSC?
| modeless wrote:
| Show diffs in my editor windows, like Cursor agent mode does,
| is what I'm hoping.
| awestroke wrote:
| Use `git diff` or your in-editor git diff viewer
| modeless wrote:
| Of course I do. But I don't generally make a git commit
| between every prompt to the model, and I like to see the
| model's changes separately from mine.
| manmal wrote:
| Claude Code as a tool call, from Copilot's own agent (agent
| in an agent) seems to be working well. Peter Steinberger made
| an MCP that does this: https://github.com/steipete/claude-
| code-mcp
| rw2 wrote:
| Claude Code is a poorer version of Aider or Cline in VS Code. I
| have better results using them than using Claude Code alone.
| jbellis wrote:
| Good, I was starting to get uncomfortable with how hard Gemini
| has been dominating lately
|
| ETA: I guess Anthropic still thinks they can command a premium, I
| hope they're right (because I would love to pay more for smarter
| models).
|
| > Pricing remains consistent with previous Opus and Sonnet
| models: Opus 4 at $15/$75 per million tokens (input/output) and
| Sonnet 4 at $3/$15.
| esaym wrote:
| heh, I just wrote a small hit piece about all the disappointments
| of the models over the last year and now the next day there is a
| new model. I'm going to assume it will still get you only to 80%.
| __jl__ wrote:
| Anyone found information on API pricing?
| rafram wrote:
| It's up on their pricing page:
| https://www.anthropic.com/pricing
| burningion wrote:
| Yeah it's live on the pricing page:
|
| https://www.anthropic.com/pricing#api
|
| Opus 4 is $15/MTok in, $75/MTok out. Sonnet 4 is the same:
| $3/MTok in, $15/MTok out.
| __jl__ wrote:
| Thanks. I looked a couple minutes ago and couldn't see it.
| For anyone curious, pricing remains the same as previous
| Anthropic models.
| pzo wrote:
| surprisingly, Cursor charges only 0.75x per request for
| Sonnet 4 (compared to 1x for Sonnet 3.7)
| transcranial wrote:
| It does say "temporarily offered at a discount" when you
| hover over the model in the dropdown.
| piperswe wrote:
| From the linked post:
|
| > Pricing remains consistent with previous Opus and Sonnet
| models: Opus 4 at $15/$75 per million tokens (input/output) and
| Sonnet 4 at $3/$15.
| esaym wrote:
| > Try Claude Sonnet 4 today with Claude Opus 4 on paid plans.
|
| Wait, Sonnet 4? Opus 4? What?
| sxg wrote:
| Anthropic names its Claude models based on size/complexity:
|
| - Small: Haiku
|
| - Medium: Sonnet
|
| - Large: Opus
| jareds wrote:
| I'll look at it when this shows up on
| https://aider.chat/docs/leaderboards/. I feel like keeping up with
| all the models is a full-time job, so I just use this instead and
| hopefully get 90% of the benefit I would by manually testing out
| every model.
| evantbyrne wrote:
| Are these just leetcode exercises? What I would like to see is
| an independent benchmark based on real tasks in codebases of
| varying size.
| KaoruAoiShiho wrote:
| Aider is not just leetcode exercises I think? livecodebench
| is leetcode exercises though.
| rafram wrote:
| Aider uses a dataset of 500 GitHub issues, so not LeetCode-
| style work.
| evantbyrne wrote:
| It says right on that linked page:
|
| > Aider's polyglot benchmark tests LLMs on 225 challenging
| Exercism coding exercises across C++, Go, Java, JavaScript,
| Python, and Rust.
|
| I looked up Exercism and they appear to be story problems
| that you solve by coding on mostly/entirely blank slates,
| unless I'm missing something? That format would seem to
| explain why the models are reportedly performing so well,
| because they definitely aren't that reliable on mature
| codebases.
| jasonthorsness wrote:
| "GitHub says Claude Sonnet 4 soars in agentic scenarios and will
| introduce it as the base model for the new coding agent in GitHub
| Copilot."
|
| Maybe this model will push "Assign to Copilot" closer to the
| dream of having package upgrades and other mostly-mechanical
| stuff handled automatically. This tech could lead to a huge
| revival of older projects as the maintenance burden falls.
| BaculumMeumEst wrote:
| Anyone see news of when it's planned to go live in copilot?
| vinhphm wrote:
| The option just showed up in the Copilot settings page for me
| BaculumMeumEst wrote:
| Same! Rock and roll!
| denysvitali wrote:
| I got rate-limited in like 5 seconds. Wow
| bbor wrote:
| Turns out Opus 4 starts at their $40/mo ("Pro+") plan which
| is sad, and they serve o4-mini and Gemini as well so it's a
| bit less exclusive than this announcement implies. That
| said, I have a random question for any Anthropic-heads out
| there:
|
| GitHub says "Claude Opus 4 is hosted by Anthropic PBC.
| Claude Sonnet 4 is hosted by Anthropic 1P."[1]. What's
| _Anthropic 1P_? Based on the only Kagi result being a
| deployment tutorial[2] and the fact that GitHub negotiated
| a "zero retention agreement" with the PBC but not whatever
| "1P" is, I'm assuming it's a spinoff cloud company that
| only serves Claude...? No mention on Wikipedia or any
| business docs I could find, either.
|
| Anyway, off to see if I can access it from inside
| SublimeText via LSP!
|
| [1] https://docs.github.com/en/copilot/using-github-
| copilot/ai-m...
|
| [2] https://github.com/anthropics/prompt-eng-interactive-
| tutoria...
| l1n wrote:
| 1P = Anthropic's first-party API, i.e. not through
| Bedrock or Vertex
| Workaccount2 wrote:
| Google launched Jules two days ago, which is the Gemini
| coding agent[1]. I was pretty quickly accepted into the
| beta and you get 5 free tasks a day.
|
| So far I have found it pretty powerful; it's also the
| first time an LLM has ever stopped while working to ask
| me a question or for clarification.
|
| [1]https://jules.google/
| minimaxir wrote:
| The keynote confirms it is available now.
| jasonthorsness wrote:
| Gotta love keynotes with concurrent immediate availability
| brookst wrote:
| Not if you work there
| echelon wrote:
| That's just a few weeks of DR + prep, a feature freeze,
| and oncall with bated breath.
|
| Nothing any rank and file hasn't been through before with
| a company that relies on keynotes and flashy releases for
| growth.
|
| Stressful, but part and parcel. And well-compensated.
| ModernMech wrote:
| That's kind of my benchmark for whether or not these models are
| useful. I've got a project that needs some extensive
| refactoring to get working again. Mostly upgrading packages,
| but also it will require updating the code to some new language
| semantics that didn't exist when it was written. So far,
| current AI models can make essentially zero progress on this
| task. I'll keep trying until they can!
| tmpz22 wrote:
| And IMO it has a long way to go. There is a lot of nuance
| when orchestrating dependencies that can cause subtle errors
| in an application that are not easily remedied.
|
| For example, a lot of LLMs (I've seen it in Gemini 2.5 and
| Claude 3.7) will call non-existent methods in dynamic
| languages. While these runtime errors are often auto-fixable,
| sometimes they aren't, and breaking out of an agentic
| workflow to deep dive the problem is quite frustrating - if
| mostly because agentic coding entices us into being so lazy.
| jasonthorsness wrote:
| The agents will definitely need a way to evaluate their
| work just as well as a human would - whether that's a full
| test suite, tests + directions on some manual verification
| as well, or whatever. If they can't use the same tools as a
| human would they'll never be able to improve things safely.
| mikepurvis wrote:
| "... and breaking out of an agentic workflow to deep dive
| the problem is quite frustrating"
|
| Maybe that's the problem that needs solving then? The
| threshold doesn't _have_ to be "bot capable of doing
| entire task end to end", like it could also be "bot does
| 90% of task, the worst and most boring part, human steps in
| at the end to help with the one bit that is more tricky".
|
| Or better yet, the bot is able to recognize its own
| limitations and proactively surface these instances, be
| like hey human I'm not sure what to do in this case; based
| on the docs I think it should be A or B, but I also feel
| like C should be possible yet I can't get any of them to
| work, what do you think?
|
| As humans, it's perfectly normal to put up a WIP PR and
| then solicit this type of feedback from our colleagues; why
| would a bot be any different?
| dvfjsdhgfv wrote:
| > Maybe that's the problem that needs solving then? The
| threshold doesn't have to be "bot capable of doing entire
| task end to end", like it could also be "bot does 90% of
| task, the worst and most boring part, human steps in at
| the end to help with the one bit that is more tricky".
|
| Still, the big short-term danger is that you're left with
| code that seems to work well but has subtle bugs in it,
| and the long-term danger is that you're left with a
| codebase you're not familiar with.
| mikepurvis wrote:
| Being left with an unfamiliar codebase is always a
| concern and comes about through regular attrition,
| particularly if adequate review is not in place or
| people are cycling in and out of the org too fast for
| proper knowledge transfer (so, cultural problems
| basically).
|
| If anything, I'd bet that agent-written code will get
| better review than average because the turnaround time
| on fixes is fast and no one will sass you for nit-
| picking, so it's "worth it" to look closely and ensure
| it's done just the way you want.
| soperj wrote:
| > if mostly because agentic coding entices us into being so
| lazy.
|
| Any coding I've done with Claude has been asking it to
| build specific methods; if you don't understand what's
| actually happening, then you're building something that's
| unmaintainable. I feel like it reduces typing and syntax
| errors, but sometimes it leads me down the wrong path.
| yosito wrote:
| Personally, I don't believe AI is ever going to get to that
| level. I'd love to be proven wrong, but I really don't
| believe that an LLM is the right tool for a job that requires
| novel thinking about out of the ordinary problems like all
| the weird edge cases and poor documentation that comes up
| when trying to upgrade old software.
| 9dev wrote:
| Actually, I think the opposite: Upgrading a project that
| needs dependency updates to new major versions--let's say
| Zod 4, or Tailwind 3--requires reading the upgrade guides
| and documentation, and transferring that into the project.
| In other words, transforming text. It's thankless, stupid
| toil. I'm very confident I will not be doing this much more
| often in my career.
| mikepurvis wrote:
| Absolutely, this should be exactly the kind of task a bot
| should be perfect for. There's no abstraction, no design
| work, no refactoring, no consideration of stakeholders,
| just finding instances of whatever is old and busted and
| changing it for the new hotness.
| dvfjsdhgfv wrote:
| So that was my conviction, too. However, in my tests it
| seems like upgrading to a version a model hasn't seen is
| for some reason problematic, in spite of giving it the
| complete docs, examples of new API usage etc. This
| happens even with small snippets, even though they can
| deal with large code fragments with older APIs they are
| very "familiar" with.
| mikepurvis wrote:
| Okay so less of a "this isn't going to work at all" and
| more just not ready for prime-time yet.
| cardanome wrote:
| Theoretically we don't even need AI. If semantics were
| defined well enough and maintainers actually cared about
| and properly tracked breaking changes, we could
| have tools that automatically upgrade our code. Just a
| bunch of simple scripts that perform text
| transformations.
|
| The problem is purely social. There are language
| ecosystems where great care is taken to not break stuff
| and where you can let your project rot for a decade or
| two and still come back to and it will perfectly compile
| with the newest release. And then there is the JS world
| where people introduce churn just for the sake of their
| ego.
|
| Maintaining a project is orders of magnitude more
| complex than creating a new green field project. It takes
| a lot of discipline. There is just a lot, a lot of
| context to keep in mind that really challenges even the
| human brain. That is why we see so many useless rewrites
| of existing software. It is easier, more exciting and
| most importantly something to brag about on your CV.
|
| AI will only cause more churn because it makes it easier
| to create more churn. Ultimately leaving humans with more
| maintenance work and less fun time.
| afavour wrote:
| > and maintainers actually concerned about and properly
| tracking breaking changes we could have tools that
| automatically upgrade our code
|
| In some cases perhaps. But breaking changes aren't
| usually "we renamed methodA to methodB", it's "we changed
| the functionality for X,Y, Z reasons". It would be very
| difficult to somehow declaratively write out how someone
| changes their code to accommodate for that, it might
| change their approach entirely!
| mdaniel wrote:
| There are programmatic upgrade tools; some projects even
| ship them right now: https://github.com/codemod-
| com/codemod
|
| I think there are others in that space but that's the one
| I knew of. I think it's a relevant space for Semgrep,
| too, but I don't know if they are interested in that case
| dakna wrote:
| Google demoed an automated version upgrade for Android
| libraries during I/O 2025. The agent does multiple rounds
| and checks error messages during each build until all
| dependencies work together.
|
| Agentic Experiences: Version Upgrade Agent
|
| https://youtu.be/ubyPjBesW-8?si=VX0MhDoQ19Sc3oe-
| rco8786 wrote:
| It could be! But that's also what people said about all the
| models before it!
| kmacdough wrote:
| And they might all be right!
|
| > This tech could lead to...
|
| I don't think he's saying this is the version that will
| suddenly trigger a Renaissance. Rather, it's one solid step
| that makes the path ever more promising.
|
| Sure, everyone gets a bit overexcited each release until they
| find the bounds. But the bounds are expanding, and the need
| for careful prompt engineering is diminishing. Ever since
| 3.7, Claude has been a regular part of my process for the
| mundane. And so far 4.0 seems to take less fighting for me.
|
| A good question would be _when_ AI can take a basic prompt,
| gather its own requirements, and build meaningful PRs off
| that prompt alone. I suspect it's still at least a couple of
| paradigm shifts away. But those seem to be coming every year
| or faster.
| max_on_hn wrote:
| I am incredibly eager to see what affordable coding agents can
| do for open source :) in fact, I should really be giving away
| CheepCode[0] credits to open source projects. Pending any sort
| of formal structure, if you see this comment and want free
| coding agent runs, email me and I'll set you up!
|
| [0] My headless coding agents product, similar to "assign to
| copilot" but works from your task board (Linear, Jira, etc) on
| multiple tasks in parallel. So far simple/routine features are
| already quite successful. In general the better the tests, the
| better the resulting code (and yes, it can and does write its
| own tests).
| ed_elliott_asc wrote:
| Until it pushes a severe vulnerability which takes a big
| service down
| james_marks wrote:
| > we've significantly reduced behavior where the models use
| shortcuts or loopholes to complete tasks. Both models are 65%
| less likely to engage in this behavior than Sonnet 3.7 on agentic
| tasks
|
| Sounds like it'll be better at writing meaningful tests
| NitpickLawyer wrote:
| One strategy that also works is to have 2 separate "sessions",
| have one write code and one write tests. Forbid each from
| changing the other's "domain". Much better IME.
| UncleEntity wrote:
| In my experience, when presented with a failing test it would
| simply try to make the test pass instead of determining why the
| test was failing. Usually by hard coding the test parameters
| (or whatever) in the failing function... which was super
| annoying.
| sigmoid10 wrote:
| Sooo... it can _play_ Pokemon. Feels like they had to throw that
| in after Google IO yesterday. But the real question is now can it
| _beat_ the game including the Elite Four and the Champion. That
| was pretty impressive for the new Gemini model.
| throwaway314155 wrote:
| Gemini can beat the game?
| mxwsn wrote:
| Gemini has beaten it already, but using a different and notably
| more helpful harness. The creator has said they think harness
| design is the most important factor right now, and that the
| results don't mean much for comparing Claude to Gemini.
| throwaway314155 wrote:
| Way offtopic to TFA now, but isn't using an improved
| harness a bit like saying "I'm going to hardcode as many
| priors as possible into this thing so it succeeds
| regardless of its ability to strategize, plan and execute"?
| samrus wrote:
| it is. the benchmark was somewhat cheated, from the
| perspective of finding out how the model adjusts and
| plans within a dynamic reactive environment
| 11101010001100 wrote:
| They asked gemini to come up with another word for
| cheating and it came up with 'harness'.
| silvr wrote:
| While true to a degree, I think this is largely wrong.
| Wouldn't it still count as a "harness" if we provided
| these LLMs with full robotic control of two humanoid
| arms, so that it could hold a Gameboy and play the game
| that way? I don't think the lack of that level of human-
| ness takes away from the demonstration of long-context
| reasoning that the GPP stream showed.
|
| Claude got stuck reasoning its way through one of the
| more complex puzzle areas. Gemini took a while on it
| also, but made it through. I don't think that difference can be
| fully attributed to the harnesses.
|
| Obviously, the best thing to do would be to run a SxS in
| the same harness of the two models. Maybe that will
| happen?
| klohto wrote:
| 2 weeks ago
| minimaxir wrote:
| That Google IO slide was somewhat misleading as the maintainer
| of Gemini Plays Pokemon had a much better agentic harness that
| was constantly iterated upon throughout the runtime (e.g. the
| maintainer had to give specific instructions on how to use
| Strength to get past Victory Road), unlike Claude Plays
| Pokemon.
|
| The Elite Four/Champion was a non-issue in comparison
| especially when you have a lv. 81 Blastoise.
| hansmayer wrote:
| Right, but on the other hand... how is it even useful? Let's
| say it can beat the game, so what? So it can (kind of)
| summarise or write my emails - which is something I neither
| want nor need, they produce mountains of sloppy code, which I
| would have to end up fixing, and finally they can play a game?
| Where is the killer app? The gaming approach was exactly the
| premise of the original AI efforts in the 1960s, that teaching
| computers to play chess and other 'brainy' games will somehow
| lead to the development of real AI. It ended, as we know, in the
| AI nuclear winter.
| minimaxir wrote:
| It's a fun benchmark, like simonw's pelican riding a bike.
| Sometimes fun is the best metric.
| samrus wrote:
| from a foundational research perspective, the pokemon
| benchmark is one of the most important ones.
|
| these models are trained on a static task, text generation,
| which is to say the state they are operating in does not
| change as they operate. but now that they are out we are
| implicitly demanding they do dynamic tasks like coding,
| navigation, operating in a market, or playing games. these are
| tasks where your state changes as you operate
|
| an example would be that as these models predict the next
| word, the ground truth of any further words doesn't change. if
| it misinterprets the word bank in the sentence "i went to the
| bank" as a river bank rather than a financial bank, the later
| ground truth won't change: if it was talking about the visit
| to the financial bank before, it will still be talking about
| that regardless of the model's misinterpretation. But if a
| model takes a wrong turn on the road, or makes a weird buy in
| the stock market, the environment will react and change, and
| suddenly what it should have done as the n+1th move before
| isn't the right move anymore; it needs to figure out a route
| off the freeway first, or deal with the FOMO bull rush it
| caused by mistakenly buying a lot of stock
|
| we need to push against these limits to set the stage for the
| next evolution of AI: RL-based models that are trained in
| dynamic reactive environments in the first place
| hansmayer wrote:
| Honestly I have no idea what this is supposed to mean, and
| the high verbosity of whatever it is trying to prove is not
| helping it. To repeat: We already tried making computers
| play games. Ever heard of Deep Blue, and ever heard of it
| again since the early 2000s?
| lechatonnoir wrote:
| Here's a summary for you:
|
| llm trained to do few step thing. pokemon test whether
| llm can do many step thing. many step thing very
| important.
| Rudybega wrote:
| The state space for actions in Pokemon is hilariously,
| unbelievably larger than the state space for chess. Older
| chess algorithms mostly used Brute Force (things like
| minimax) and the number of actions needed to determine a
| reward (winning or losing) was way lower (chess ends in
| many, many, many fewer moves than Pokemon).
|
| Successfully navigating through Pokemon to accomplish a
| goal (beating the game) requires a completely different
| approach, one that much more accurately mirrors the way
| you navigate and goal set in real world environments.
| That's why it's an important and interesting test of AI
| performance.
| j_maffe wrote:
| > Where is the killer app?
|
| My man, ChatGPT is the sixth most visited website in the
| world right now.
| hansmayer wrote:
| But I did not ask "what is the sixth most visited website
| in the world right now?", did I? I asked what the killer
| app here was. I am afraid vague and unrelated KPIs will
| not help here; otherwise we may as well compare ChatGPT and
| PornHub based on the number of visits, as you seem to
| suggest.
| signatoremo wrote:
| Are you saying PornHub isn't a killer app?
| j_maffe wrote:
| If the now default go-to source for quick questions and
| formulating text isn't a killer app in your eyes, I don't
| know what is.
| lechatonnoir wrote:
| This is a weirdly cherry-picked example. The gaming approach
| was also the premise of DeepMind's AI efforts in 2016, which
| was nine years ago. Regardless of what you think about the
| utility of text (code), video, audio, and image generation,
| surely you think that their progress on the protein-folding
| problem and weather prediction has been useful to society?
|
| What counts as a killer app to you? Can you name one?
| archon1410 wrote:
| Claude Plays Pokemon was the original concept and inspiration
| behind "Gemini Plays Pokemon". Gemini arguably only did better
| because it had access to a much better agent harness and was
| being actively developed during the run.
|
| See: https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-
| gemini-...
| montebicyclelo wrote:
| Not sure "original concept" is quite right, given it had been
| tried earlier, e.g. here's a 2023 attempt to get gpt-4-vision
| to play pokemon (it didn't really work, but it's clearly
| "the concept")
|
| https://x.com/sidradcliffe/status/1722355983643525427
| archon1410 wrote:
| I see, I wasn't aware of that. The earliest attempt I knew
| of was from May 2024,[1] while this gpt-4-vision attempt is
| from November 2023. I guess Claude Plays Pokemon was the
| first attempt that had any real success (won a badge), and
| got a lot of attention over its entertaining "chain-of-
| thought".
|
| [1] https://community.aws/content/2gbBSofaMK7IDUev2wcUbqQXT
| K6/ca...
| htrp wrote:
| Allegedly Claude 4 Opus can run autonomously for 7 hours
| (basically automating an entire SWE workday).
| catigula wrote:
| That is quite the allegation.
| jeremyjh wrote:
| Which sort of workday? The sort where you rewrite your code 8
| times and end the day with no marginal business value produced?
| renewiltord wrote:
| Well Claude 3.7 definitely did the one where it was supposed
| to process a file and it finally settled on `fs.copyFile(src,
| dst)` which I think is pro-level interaction. I want those
| $0.95 back.
|
| But I love you Claude. It was me, not you.
| readthenotes1 wrote:
| Well, at least it doesn't distract your coworkers, disrupting
| their flow
| kaoD wrote:
| I'm already working on the Slack MCP integration.
| jeremyjh wrote:
| Please encourage it to use lots of emojis.
| speedgoose wrote:
| Easy, I can also make a nanoGPT run for 7 hours when inferring
| on a 68k, and make it produce as much value as I usually do.
| paxys wrote:
| I can write an algorithm to run in a loop forever, but that
| doesn't make it equivalent to infinite engineers. It's the
| output that matters.
| krelian wrote:
| How much does it cost to have it run for 7 hours straight?
| htrp wrote:
| >Rakuten validated its capabilities with a demanding open-
| source refactor running independently for 7 hours with
| sustained performance.
|
| From their customer testimonials in the announcement, more
| below
|
| >Cursor calls it state-of-the-art for coding and a leap forward
| in complex codebase understanding. Replit reports improved
| precision and dramatic advancements for complex changes across
| multiple files. Block calls it the first model to boost code
| quality during editing and debugging in its agent, codename
| goose, while maintaining full performance and reliability.
| Rakuten validated its capabilities with a demanding open-source
| refactor running independently for 7 hours with sustained
| performance. Cognition notes Opus 4 excels at solving complex
| challenges that other models can't, successfully handling
| critical actions that previous models have missed.
|
| >GitHub says Claude Sonnet 4 soars in agentic scenarios and
| will introduce it as the base model for the new coding agent in
| GitHub Copilot. Manus highlights its improvements in following
| complex instructions, clear reasoning, and aesthetic outputs.
| iGent reports Sonnet 4 excels at autonomous multi-feature app
| development, as well as substantially improved problem-solving
| and codebase navigation--reducing navigation errors from 20% to
| near zero. Sourcegraph says the model shows promise as a
| substantial leap in software development--staying on track
| longer, understanding problems more deeply, and providing more
| elegant code quality. Augment Code reports higher success
| rates, more surgical code edits, and more careful work through
| complex tasks, making it the top choice for their primary
| model.
| victorbjorklund wrote:
| Makes no sense to measure it in hours. You can have a slow CPU
| making the model run for longer.
| lxe wrote:
| Looks like both opus and sonnet are already in Cursor.
| daviding wrote:
| "We're having trouble connecting to the model provider"
|
| A bit busy at the moment then.
| eamag wrote:
| Nobody cares about LMArena anymore? I guess it's too easy to
| cheat there after the Llama 4 release news?
| josefresco wrote:
| I have the Claude Windows app, how long until it can "see" what's
| on my screen and help me code/debug?
| throwaway314155 wrote:
| You can likely set up an MCP server that handles this.
| msp26 wrote:
| > Finally, we've introduced thinking summaries for Claude 4
| models that use a smaller model to condense lengthy thought
| processes. This summarization is only needed about 5% of the time
| --most thought processes are short enough to display in full.
| Users requiring raw chains of thought for advanced prompt
| engineering can contact sales about our new Developer Mode to
| retain full access.
|
| Extremely cringe behaviour. Raw CoTs are super useful for
| debugging errors in data extraction pipelines.
|
| After Deepseek R1 I had hope that other companies would be more
| open about these things.
| blueprint wrote:
| pretty sure the cringe doesn't stop there. It wouldn't surprise
| me if this is not the only thing that they are attempting to
| game and hide from their users.
|
| The Max subscription with fake limits increase comes to mind.
| oofbaroomf wrote:
| Wonder why they renamed it from Claude <number> <type> (e.g.
| Claude 3.7 Sonnet) to Claude <type> <number> (Claude Opus 4).
| stavros wrote:
| I guess because they haven't been releasing all three models of
| the same version in a while now. We've only had Sonnet updates,
| so putting the version first didn't make sense when we had 3.5
| and 3.7 Sonnet but none of the others.
| oofbaroomf wrote:
| Interesting how Sonnet has a higher SWE-bench Verified score than
| Opus. Maybe says something about scaling laws.
| somebodythere wrote:
| My guess is that they did RLVR post-training for SWE tasks, and
| a smaller model can undergo more RL steps for the same amount
| of computation.
| Artgor wrote:
| OpenAI's Codex-1 isn't so cool anymore. If it was ever cool.
|
| And Claude Code uses Opus 4 now!
| boh wrote:
| Can't wait to hear how it breaks all the benchmarks while any
| differences remain entirely imperceptible in practice.
| waynecochran wrote:
| My mind has been blown using ChatGPT's o4-mini-high for coding
| and research (its knowledge of computer vision and tools like
| OpenCV is fantastic). Is it worth trying out all the shiny new
| AI coding agents ... I need to get work done?
| throwaway314155 wrote:
| I would say yes. The jump in capability and reduction in
| hallucinations (at least in code) going from ChatGPT (even
| o3) to Claude 3.7 is immediately noticeable in my experience.
| Same goes for Gemini, which was even better in some ways until
| perhaps today.
| bcrosby95 wrote:
| Kinda interesting, as I've found 4o better than o4-mini-high
| for most of my coding. And while it's mind blowing that they
| can do what they can do, the code itself coming out the other
| end has been pretty bad, but good enough for smaller snippets
| and extremely isolated features.
| renewiltord wrote:
| Same pricing as before is sick!
| uludag wrote:
| I'm curious what others' priors are when reading benchmark scores.
| Obviously, with immense funding at stake, companies have every
| incentive to game the benchmarks, and the loss of goodwill from
| gaming the system doesn't appear to have much consequence.
|
| Obviously trying the model on your use cases more and more lets
| you narrow in on actual utility, but I'm wondering how others
| interpret reported benchmarks these days.
| iLoveOncall wrote:
| Hasn't it been proven many times that all those companies cheat
| on benchmarks?
|
| I personally couldn't care less about them, especially when
| we've seen many times that the public's perception is
| absolutely not tied to the benchmarks (Llama 4, the recent
| OpenAI model that flopped, etc.).
| sebzim4500 wrote:
| I don't think there's any real evidence that any of the major
| companies are going out of their way to cheat the benchmarks.
| Problem is that, unless you put a lot of effort into avoiding
| contamination, you will inevitably end up with details about
| the benchmark in the training set.
| candiddevmike wrote:
| People's interpretation of benchmarks will largely depend on
| whether they believe they will be better or worse off by GenAI
| taking over SWE jobs. Think you'd need someone outside the
| industry to weigh in to have a real, unbiased view.
| douglasisshiny wrote:
| Or someone who has been a developer for a decade plus trying
| to use these models on actual existing code bases, solving
| specific problems. In my experience, they waste time and
| money.
| sandspar wrote:
| These people are the most experienced, yes, but by the same
| token they also have the most incentive to disbelieve that
| an AI will take their job.
| lewdwig wrote:
| Well-designed benchmarks have a public sample set and a private
| testing set. Models are free to train on the public set, but
| they can't game the benchmark or overfit the samples that way
| because they're only assessed on performance against examples
| they haven't seen.
|
| Not all benchmarks are well-designed.
| thousand_nights wrote:
| but as soon as you test on your private testing set you're
| sending it to their servers so they have access to it
|
| so effectively you can only guarantee a single use stays
| private
| minimaxir wrote:
| Claude does not train on API I/O.
|
| > By default, we will not use your inputs or outputs from
| our commercial products to train our models.
|
| > If you explicitly report feedback or bugs to us (for
| example via our feedback mechanisms as noted below), or
| otherwise explicitly opt in to our model training, then we
| may use the materials provided to train our models.
|
| https://privacy.anthropic.com/en/articles/7996868-is-my-
| data...
| behindsight wrote:
| Relying on their own policy does not mean they will
| adhere to it. We have already seen "rogue" employees in
| other companies conveniently violate their policies. Some
| notable examples were in the news within the month (eg:
| xAI).
|
| Don't forget the previous scandals with Amazon and Apple
| both having to pay millions in settlements for
| eavesdropping with their assistants in the past.
|
| Privacy with a system that phones an external server
| should not be expected, regardless of whatever public
| policy they proclaim.
|
| Hence why GP said:
|
| > so effectively you can only guarantee a single use
| stays private
| blueprint wrote:
| kind of reminds me how they said they were increasing platform
| capabilities with Max and actually reduced them while charging
| a ton for it per month. Talk about a bait and switch. Lord help
| you if you tried to cancel your ill-advised subscription during
| that product rollout as well - doubly so if you expect a
| support response.
| imiric wrote:
| Benchmark scores are marketing fluff. Just like the rest of
| this article with alleged praises from early adopters, and
| highly scripted and edited videos.
|
| AI companies are grasping at straws by selling us minor
| improvements to stale technology so they can pump up whatever
| valuation they have left.
| jjmarr wrote:
| > Obviously, with immense funding at stake, companies have
| every incentive to game the benchmarks, and the loss of
| goodwill from gaming the system doesn't appear to have much
| consequence.
|
| Claude 3.7 Sonnet was consistently on top of OpenRouter in
| actual usage despite not gaming benchmarks.
| Doohickey-d wrote:
| > Users requiring raw chains of thought for advanced prompt
| engineering can contact sales
|
| So it seems like all 3 of the LLM providers are now hiding the
| CoT - which is a shame, because it helped to see when it was
| going to go down the wrong track, and allowed you to quickly
| refine the prompt to ensure it didn't.
|
| In addition to OpenAI, Google also just recently started
| summarizing the CoT, replacing it with an, in my opinion, overly
| dumbed-down summary.
| sunaookami wrote:
| Guess we have to wait till DeepSeek mops the floor with
| everyone again.
| datpuz wrote:
| DeepSeek never mopped the floor with anyone... DeepSeek was
| remarkable because it is claimed that they spent a lot less
| training it, and without Nvidia GPUs, and because they had
| the best open weight model for a while. The only area they
| mopped the floor in was open source models, which had been
| stagnating for a while. But qwen3 mopped the floor with
| DeepSeek R1.
| codyvoda wrote:
| counterpoint: influencers said they wiped the floor with
| everyone so it must have happened
| sunaookami wrote:
| Who cares about what random influencers say?
| barnabee wrote:
| They mopped the floor in terms of transparency, even more
| so in terms of performance x transparency
|
| Long term that _might_ matter more
| manmal wrote:
| I think qwen3:R1 is apples:oranges, if you mean the 32B
| models. R1 has 20x the parameters and likely roughly as
| much knowledge about the world. One is a really good
| general model, while you can run the other one on commodity
| hardware. Subjectively, R1 is way better at coding, and
| Qwen3 is really good only at benchmarks - take a look at
| aider's leaderboard, it's not even close:
| https://aider.chat/docs/leaderboards/
|
| R2 could turn out really really good, but we'll see.
| sunaookami wrote:
| DeepSeek made OpenAI panic, they initially hid the CoT for
| o1 and then rushed to release o3 instead of waiting for
| GPT-5.
| pja wrote:
| IIRC RLHF inevitably compromises model accuracy in order to
| train the model not to give dangerous responses.
|
| It would make sense if the model used for chain-of-thought was
| trained differently (perhaps a different expert from an MoE?)
| from the one used to interact with the end user. Since the end
| user is only ever going to see its output filtered through the
| public model, the chain-of-thought model can be closer to the
| original, pre-RLHF version without risking the reputation
| of the company.
|
| This way you can get the full performance of the original model
| whilst still maintaining the necessary filtering required to
| prevent actual harm (or terrible PR disasters).
| landl0rd wrote:
| Yeah we really should stop focusing on model alignment. The
| idea that it's more important that your AI will fucking
| report you to the police if it thinks you're being naughty
| than that it actually works for more stuff is stupid.
| xp84 wrote:
| I'm not sure I'd throw out all the alignment baby with the
| bathwater. But I wish we could draw a distinction between
| "Might offend someone" with "dangerous."
|
| Even 'plotting terror attacks' is something terrorists
| can do just fine without AI. And as for making sure the
| model wouldn't say ideas that are hurtful to <insert
| group>, it seems to me so silly when it's text we're
| talking about. If I want to say "<insert group> are lazy
| and stupid," I can type that myself (and it's even
| protected speech in some countries still!) How does
| preventing Claude from espousing that dumb opinion keep
| <insert group> safe from anything?
| landl0rd wrote:
| Let me put it this way: there are very few things I can
| think of that models should absolutely refuse, because
| there are very few pieces of information that are net
| harmful in all cases and at all times. I sort of run by
| Blackstone's principle on this: it is better to grant 10
| bad men access to information than to deny that access to
| 1 good one.
|
| Easy example: Someone asks the robot for advice on
| stacking/shaping a bunch of tannerite to better focus a
| blast. The model says he's a terrorist. In fact, he's
| doing what any number of us have done and just having fun
| blowing some stuff up on his ranch.
|
| Or I raised this one elsewhere but ochem is an easy
| example. I've had basically all the models claim that
| random amines are illegal, potentially psychoactive,
| verboten. I don't really feel like having my door getting
| kicked down by agents with guns, getting my dog shot,
| maybe getting shot myself because the robot tattled on me
| for something completely legal. For that matter if
| someone wants to synthesize some molly the robot
| shouldn't tattle to the feds about that either.
|
| Basically it should just do what users tell it to do
| excepting the very minimal cases where something is
| basically always bad.
| Wowfunhappy wrote:
| Correct me if I'm wrong--my understanding is that RLHF was
| _the_ difference between GPT 3 and GPT 3.5, aka the original
| ChatGPT.
|
| If you never used GPT 3, it was... not good. Well, that's not
| fair, it was revolutionary in its own right, but it was very
| much a machine for predicting the most likely next word, it
| couldn't talk to you the way ChatGPT can.
|
| Which is to say, I think RLHF is important for much more than
| just preventing PR disasters. It's a key part of what makes
| the models useful.
| make3 wrote:
| it just makes it too easy to distill the reasoning into a
| separate model I guess. though I feel like o3 shows useful
| things about the reasoning while it's happening
| 42lux wrote:
| Because it's alchemy and everyone believes they have an edge on
| turning lead into gold.
| mupuff1234 wrote:
| But if Gemini 2.5 pro was considered to be the strongest coder
| lately, does SWE-bench really reflect reality?
| oofbaroomf wrote:
| Nice to see that Sonnet performs worse than o3 on AIME but better
| on SWE-Bench. Often, it's easy to optimize math capabilities with
| RL but much harder to crack software engineering. Good to see
| what Anthropic is focusing on.
| j_maffe wrote:
| That's a very contentious opinion you're stating there. I'd say
| LLMs have surpassed a larger percentage of SWEs in capability
| than they have for mathematicians.
| oofbaroomf wrote:
| Mathematicians don't do high school math competitions - the
| benchmark in question is AIME.
|
| Mathematicians generally do novel research, which is hard to
| optimize for easily. Things like LiveCodeBench (leetcode-
| style problems), AIME, and MATH (similar to AIME) are often
| chosen by companies so they can flex their model's
| capabilities, even if it doesn't perform nearly as well in
| things real mathematicians and real software engineers do.
| j_maffe wrote:
| Ok then you should clarify that you meant math _benchmarks_
| and not math capabilities.
| sali0 wrote:
| I've found myself having brand loyalty to Claude. I don't really
| trust any of the other models with coding, the only one I even
| let close to my work is Claude. And this is after trying most of
| them. Looking forward to trying 4.
| anal_reactor wrote:
| I was initially fascinated by Claude, but then I found
| myself drawn to Deepseek. My use case is different though, I
| want someone to talk to.
| miroljub wrote:
| I also use DeepSeek R1 as a daily driver. Combined with Qwen3
| when I need better tool usage.
|
| Now that both Google and Claude are out, I expect to see
| DeepSeek R2 released very soon. It would be funny to watch an
| actual open source model getting close to the commercial
| competition.
| ashirviskas wrote:
| Have you compared R1 with V3-0324?
| jabroni_salad wrote:
| A nice thing about Deepseek is that it is so cheap to run.
| It's nice being able to explore conversational trees without
| getting a $12 bill at the end of it.
| MattSayar wrote:
| Same. And I JUST tried their GitHub Action agentic thing
| yesterday (wrote about it here[0]), and it honestly didn't
| perform very well. I should try it again with Claude 4 and see
| if there are any differences. Should be an easy test
|
| [0] https://mattsayar.com/personalized-software-really-is-
| coming...
| WillPostForFood wrote:
| It would be pretty funny if your "not today. Maybe tomorrow"
| proposition actually did happen the next day.
| MattSayar wrote:
| It literally almost did!
| kilroy123 wrote:
| The new Gemini models are very good too.
| pvg wrote:
| As the poet wrote:
|
| _I prefer MapQuest
|
| that's a good one, too
|
| Google Maps is the best
|
| true that
|
| double true!_
| alex1138 wrote:
| I think "Kagi Code" or whatever it's called is using Claude
| SkyPuncher wrote:
| Gemini is _very_ good at architecture level thinking and
| implementation.
|
| I tend to find that I use Gemini for the first pass, then
| switch to Claude for the actual line-by-line details.
|
| Claude is also far superior at writing specs than Gemini.
| vFunct wrote:
| Yeah, Claude tends to output 1200+ line architectural
| specification documents while Gemini tends to output ~600-line
| ones. (I just had to write 100+ architectural spec documents
| for 100+ different apps)
|
| Not sure why Claude is more thorough and complete than the
| other models, but it's my go-to model for large projects.
|
| The OpenAI model outputs are always the smallest - 500 lines
| or so. Not very good at larger projects, but perfectly fine
| for small fixes.
| rogerrogerr wrote:
| > I just had to write 100+ architectural spec documents for
| 100+ different apps
|
| ... whaaaaat?
| vFunct wrote:
| Huge project..
| skybrian wrote:
| Big design up front is back? But I guess it's a lot
| easier now, so why not?
| chermi wrote:
| I'd be interested to hear more about your workflow. I use
| Gemini for discussing the codebase, making ADR entries
| based on discussion, ticket generation, documenting the
| code like module descriptions and use cases+examples, and
| coming up with detailed plans for implementation that
| cursor with sonnet can implement. Do you have any
| particular formats, guidelines or prompts? I don't love my
| workflow. I try to keep everything in notion but it's
| becoming a pain. I'm pretty new to documentation and proper
| planning, but I feel like it's more important now to get
| the best use out of the LLMs. Any tips appreciated!
| vFunct wrote:
| For a large project, the real trick for you as a human
| is to figure out how to partition it into separate
| apps, so that individual LLMs can work on them
| separately, as if they were their own employees in
| separate departments.
|
| You then ask LLMs to first write features for the
| individual apps (in Markdown), giving them some early
| competitive guidelines.
|
| You then tell LLMs to read that features document, and
| then write an architectural specification document. Tell
| it to maybe add example data structures, algorithms, or
| user interface layouts. All in Markdown.
|
| You then feed these two documents to individual LLMs to
| write the rest of the code, usually starting with the
| data models first, then the algorithms, then the user
| interface.
|
| Again, the trick is to partition your project to
| individual apps. Also an app isn't the full app. It might
| just be a data schema, a single GUI window, a marketing
| plan, etc.
|
| The other hard part is to integrate the apps back
| together at the top level if they interact with each
| other...
| Bjartr wrote:
| What's your prompt look like for creating spec documents?
| jonny_eh wrote:
| Who's reading these docs?
| tasuki wrote:
| Another LLM, which distills it back into a couple of
| sentences for human consumption.
| causal wrote:
| This is exactly my approach. Use Gemini to come up with
| analysis and a plan, Claude to implement.
| leetharris wrote:
| Much like others, this is my stack (or o1-pro instead of
| Gemini 2.5 Pro). This is a big reason why I use aider for
| large projects. It allows me to effortlessly combine
| architecture models and code writing models.
|
| I know in Cursor and others I can just switch models between
| chats, but it doesn't feel intentional the way aider does.
| You chat in architecture mode, then execute in code mode.
| Keyframe wrote:
| Could you describe a bit how this works? I haven't had
| much luck with AI so far, but I'm willing to try.
| BeetleB wrote:
| https://aider.chat/2024/09/26/architect.html
|
| The idea is that some models are better at _reasoning_
| about code, but others are better at actually creating
| the code changes (without syntax errors, etc). So Aider
| lets you pick two models - one does the architecting, and
| the other does the code change.
| SirYandi wrote:
| https://harper.blog/2025/02/16/my-llm-codegen-workflow-
| atm/
|
| "tl:dr; Brainstorm spec, then plan a plan, then execute
| using LLM codegen. Discrete loops. Then magic."
| piperswe wrote:
| Cline also allows you to have separate model configuration
| for "Plan" mode and "Act" mode.
| justinbaker84 wrote:
| I have been very brand loyal to Claude also, but the new
| Gemini model is amazing and I have been using it exclusively
| for all of my coding for the last week.
|
| I am excited to try out this new model. I actually want to
| stay brand loyal to Anthropic because I like the people and
| the values they express.
| cheema33 wrote:
| This matches with my experience as well.
| sergiotapia wrote:
| Gemini 2.5 Pro replaced Claude 3.7 for me after using nothing
| but claude for a very long time. It's really fast, and really
| accurate. I can't wait to try Claude 4, it's always been the
| most "human" model in my opinion.
| ashirviskas wrote:
| Idk, I found Gemini 2.5 breaking code style too often and
| introducing unneeded complexity on top of leaving
| functions unfinished.
| Recursing wrote:
| I also recommend trying out Gemini, I'm really impressed by the
| latest 2.5. Let's see if Claude 4 makes me switch back.
| dwaltrip wrote:
| What's the best way to use Gemini? I'm currently pretty happy
| / impressed with Claude Code via the CLI; it's the best AI
| coding tool I've tried so far
| stavros wrote:
| I use it via Aider, with Gemini being the architect and
| Claude the editor.
| rasulkireev wrote:
| I was the same, but then slowly converted to Gemini. Still not
| sure how that happened tbh
| javier2 wrote:
| I wouldn't go that far, but I actually have some loyalty to
| Claude as well. Don't even know why, as I think the differences
| are marginal.
| cleak wrote:
| Something I've found true of Claude, but not other models, is
| that when the benchmarks are better, the real world performance
| is better. This makes me trust them a lot more and keeps me
| coming back.
| nico wrote:
| I think it really depends on how you use it. Are you using an
| agent with it, or the chat directly?
|
| I've been pretty disappointed with Cursor and all the supported
| models. Sometimes it can be pretty good and convenient, because
| it's right there in the editor, but it can also get stuck on
| very dumb stuff, re-trying the same strategies over and over
| again
|
| I've had really good experiences with o4-mini-high directly in
| the chat. It's annoying going back and forth copying/pasting
| code between editor and the browser, but it also keeps me more
| in control about the actions and the context
|
| Would really like to know more about your experience
| qingcharles wrote:
| I'm slutty. I tend to use all four at once: Claude, Grok,
| Gemini and OpenAI.
|
| They keep leap-frogging each other. My preference has been the
| output from Gemini these last few weeks. Going to check out
| Claude now.
| goranmoomin wrote:
| > Extended thinking with tool use (beta): Both models can use
| tools--like web search--during extended thinking, allowing Claude
| to alternate between reasoning and tool use to improve responses.
|
| I'm happy that tool use during extended thinking is now a thing
| in Claude as well. From my experience with CoT models, that was
| the one trick(tm) that massively improves on issues like
| hallucination/outdated libraries/useless thinking before tool
| use.
|
| o3 with search actually returned solid results, browsing the web
| much like how I'd do it, and I was thoroughly impressed - will see
| how Claude goes.
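|
| For anyone curious, the API shape for this seems to be the existing
| `thinking` parameter plus a server-side tool in the same request.
| Untested sketch (the model id, web search tool type string, and
| token budgets are my assumptions):
|
|     import anthropic
|
|     client = anthropic.Anthropic()
|
|     resp = client.messages.create(
|         model="claude-opus-4-20250514",   # assumed model id
|         max_tokens=4096,                  # must exceed the thinking budget
|         thinking={"type": "enabled", "budget_tokens": 2048},
|         tools=[{"type": "web_search_20250305",  # assumed server tool type
|                 "name": "web_search",
|                 "max_uses": 3}],
|         messages=[{"role": "user",
|                    "content": "What changed in the latest requests release?"}],
|     )
|
|     # thinking, tool_use, and text show up as separate content blocks
|     for block in resp.content:
|         print(block.type)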
| iLoveOncall wrote:
| I can't think of anything more boring than marginal improvements
| on coding tasks, to be honest.
|
| I want GenAI to become better at tasks that I don't want to do,
| to reduce the unwanted noise from my life. This is when I'll pay
| for it, not when they find a new way to cheat the benchmarks a
| bit more.
|
| At work I own the development of a tool that is using GenAI, so
| of course a new better model will be beneficial, especially
| because we do use Claude models, but it's still not exciting or
| interesting in the slightest.
| amkkma wrote:
| yea exactly
| danielbln wrote:
| What if coding is that unwanted task? Also, what are the tasks
| you are referring to, specifically?
| iLoveOncall wrote:
| Booking visits to the dentist, hairdresser, any other type of
| service, renewing my phone or internet subscription at the
| lowest price, doing administrative tasks, adding the free
| game of the week on Epic Games Store to my library, finding
| the right houses to visit, etc.
|
| Basically anything that some startup has tried and failed at
| uberizing.
| jetsetk wrote:
| Why would coding be that unwanted task if one decided to work
| as a programmer? People's unwanted tasks are cleaning the
| house, doing taxes etc.
| atonse wrote:
| But the value in coding (for the overwhelming majority of
| the people) is the product of coding (the actual software),
| not the code itself.
|
| So to most people, the code itself doesn't matter (and
| never will). It's what it lets them actually do in the real
| world.
| eamag wrote:
| When will structured output be available? Is it difficult for
| anthropic because custom sampling breaks their safety tools?
| blueprint wrote:
| Anthropic might be scammers. Unclear. I canceled my subscription
| with them months ago after they reduced capabilities for pro
| users and I found out months later that they never actually
| canceled it. They have been ignoring all of my support requests...
| seems like a huge money grab to me because they know that they're
| being outcompeted and missed the ball on monetizing earlier.
| HiPHInch wrote:
| How long will the VS Code wrappers (Cursor, Windsurf) survive?
|
| Would love to try the Claude Code VS Code extension if the price
| is right and it's purchasable from China.
| bingobangobungo wrote:
| what do you mean purchasable from china? As in you are based in
| china or is there a way to game the tokens pricing
| HiPHInch wrote:
| Claude registration needs a phone number, but you cannot select
| China (+86), and even if I have an account, it may be hard to
| purchase because of credit card issues.
|
| Some apps like Windsurf can easily be paid for with Alipay, an
| everything-app in China.
| barnabee wrote:
| I don't see any benefit in those VC funded wrappers over open
| source VS Code (or better, VSCodium) extensions like Roo/Cline.
|
| They survive through VC funding, marketing, and inertia, I
| suppose.
| danielbln wrote:
| Cline is VC funded.
| maeil wrote:
| Please give us a source, as this does not seem publicly
| verifiable.
| danielbln wrote:
| https://cline.bot/blog/talent-ai-companies-actually-need-
| rig...
| maeil wrote:
| Thank you! Very disappointing news.
| tristan957 wrote:
| VSCode is not open source. It is proprietary. code-oss is
| however an MIT-licensed project. VSCodium is also an MIT-
| licensed project.
| diggan wrote:
| Anyone with access who could compare the new models with say O1
| Pro Mode? Doesn't have to be a very scientific comparison, just
| some first impressions/thoughts compared to the current SOTA.
| drewnick wrote:
| I just had some issue with RLS/schema/postgres stuff. Gemini
| 2.5 Pro swung and missed, and talked a lot with little code;
| Claude Sonnet 4 solved it. o1 Pro solved it. It's definitely
| random which of these models can solve various problems with the
| same prompt.
| diggan wrote:
| > definitely random which of these models can solve various
| problems with the same prompt.
|
| Yeah, this is borderline my feeling too. Kicking off Codex
| with the same prompt four times sometimes leads to four
| very different but confident solutions. Same when using the
| chat interfaces, although it seems like Sonnet 3.7 with
| thinking and o1 Pro Mode are a lot more consistent than any
| Gemini model I've tried.
| KaoruAoiShiho wrote:
| Is this really worthy of a Claude 4 label? Was there a new pre-
| training run? Because this feels like 3.8... only SWE-bench went
| up significantly, and that, as we all understand by now, is done
| by cramming on specific post-training data and doesn't generalize
| to intelligence. The agentic tool use didn't improve, and this
| says to me that it's not really smarter.
| zamadatix wrote:
| It feels like the days of Claude 2 -> 3 or GPT 2->3 level
| changes for the leading models are over and you're either going
| to end up with really awkward version numbers or just embrace
| it and increment the number. Nobody cares that a Chrome update
| gives a major version change of 136->137 instead of 12.4.2.33 ->
| 12.4.3.0, for similar kinds of "the version number doesn't
| always have to represent the amount of work/improvement
| compared to the previous release" reasoning.
| saubeidl wrote:
| It feels like LLM progress in general has kinda stalled and
| we're only getting small incremental improvements from here.
|
| I think we've reached peak LLM - if AGI is a thing, it won't
| be through this architecture.
| nico wrote:
| Diffusion LLMs seem like they could be a huge change
|
| Check this out from yesterday (watch the short video here):
|
| https://simonwillison.net/2025/May/21/gemini-diffusion/
|
| From:
|
| https://news.ycombinator.com/item?id=44057820
| antupis wrote:
| Yup, especially locally, diffusion models will be big.
| goatlover wrote:
| Which brings up the question of why AGI is a thing at all.
| Shouldn't LLMs just be tools to make humans more
| productive?
| amarcheschi wrote:
| Think of the poor vcs who are selling agi as the second
| coming of christ
| bcrosby95 wrote:
| Even if LLMs never reach AGI, they're good enough to where
| a lot of very useful tooling can be built on top of/around
| them. I think of it more as the introduction of computing
| or the internet.
|
| That said, whether or not being a provider of these
| services is a profitable endeavor is still unknown. There's
| a lot of subsidizing going on, and some of the lower value
| uses might fall by the wayside as companies eventually need
| to make money off this stuff.
| michaelbrave wrote:
| this was my take as well. Though after a while I've
| started thinking about it as closer to the introduction of
| electricity, which in a lot of ways would be considered
| the second stage of the industrial revolution; the
| internet and AI might be considered the second stage of
| the computing revolution (or so I expect history books to
| label it). But just like electricity, it doesn't seem
| to be very profitable for the providers of electricity,
| but highly profitable for everything that uses it.
| jenny91 wrote:
| I think it's a bit early to say. At least in my domain, the
| models released this year (Gemini 2.5 Pro, etc.) are
| crushing models from last year. I would therefore not by
| any means be ready to call the situation a stall.
| onlyrealcuzzo wrote:
| Did you see Gemini 1.5 pro vs 2.5 pro?
| zamadatix wrote:
| Sure, but despite there being a 2.0 release in between (they
| didn't even feel the need to release a Pro for it), it still
| isn't the kind of GPT 2 -> 3 improvement we were hoping
| would continue for a bit longer. Companies will continue to
| release these incremental improvements, which are always
| neck-and-neck with each other. That's fine and good, just
| don't inherently expect the versioning to represent the
| same relative difference instead of the relative release
| increment.
| cma wrote:
| I'd say 1.5 Pro to 2.5 Pro was a 3 -> 4 level
| improvement, but the problem is 1.5 Pro wasn't state of
| the art when released, except for context length, and 2.5
| wasn't that kind of improvement compared to the best
| OpenAI or Claude stuff that was available when it released.
|
| 1.5 Pro was worse than the original GPT-4 on several coding
| things I tried head to head.
| oofbaroomf wrote:
| The improvement from Claude 3.7 wasn't particularly huge. The
| improvement from Claude 3, however, was.
| causal wrote:
| Slight decrease from Sonnet 3.7 in a few areas even. As always
| benchmarks say one thing, will need some practice with it to
| get a subjective opinion.
| greenfish6 wrote:
| Benchmarks don't tell you as much as the actual coding vibe
| though
| wrsh07 wrote:
| My understanding for the original OpenAI and anthropic labels
| was essentially: gpt2 was 100x more compute than gpt1. Same for
| 2 to 3. Same for 3 to 4. Thus, gpt 4.5 was 10x more compute^
|
| If anthropic is doing the same thing, then 3.5 would be 10x
| more compute vs 3. 3.7 might be 3x more than 3.5. and 4 might
| be another ~3x.
|
| ^ I think this maybe involves words like "effective compute",
| so yeah it might not be a full pretrain but it might be! If you
| used 10x more compute that could mean doubling the amount used
| on pretraining and then using 8x compute on post or some other
| distribution
| swyx wrote:
| beyond 4 thats no longer true - marketing took over from the
| research
| wrsh07 wrote:
| Oh shoot I thought that still applied to 4.5 just in a more
| "effective compute" way (not 100x more parameters, but 100x
| more compute in training)
|
| But alas, it's not like 3nm fab means the literal thing
| either. Marketing always dominates (and not necessarily in
| a way that adds clarity)
| minimaxir wrote:
| So I decided to try Claude 4 Sonnet against my "Given a list of
| 1 million random integers between 1 and 100,000, find the
| difference between the smallest and the largest numbers whose
| digits sum up to 30." benchmark I tested against Claude 3.5
| Sonnet: https://news.ycombinator.com/item?id=42584400
|
| The results are here (https://gist.github.com/minimaxir/1bad26f
| 0f000562b1418754d67... ) and it utterly crushed the problem
| with the relevant micro-optimizations commented in that HN
| discussion (oddly, in the second pass it a) regresses from a
| vectorized approach to a linear approach and b) generates and
| iterates on three different implementations instead of one final
| implementation), although it's possible Claude 4 was trained on
| that discussion lol.
|
| EDIT: "utterly crushed" may have been hyperbole.
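|
| For anyone who wants to poke at the same prompt, here is a
| minimal vectorized sketch of the stated problem (the seed and
| helper name are mine, not from the gist):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     arr = rng.integers(1, 100_001, size=1_000_000)
|
|     def digit_sums(a):
|         # Strip one decimal digit per pass and accumulate it.
|         s = np.zeros_like(a)
|         while a.any():
|             s += a % 10
|             a = a // 10
|         return s
|
|     hits = arr[digit_sums(arr) == 30]
|     print(hits.max() - hits.min())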
| bsamuels wrote:
| as soon as you publish a benchmark like this, it becomes
| worthless because it can be included in the training corpus
| rbjorklin wrote:
| While I agree with you in principle give Claude 4 a try on
| something like: https://open.kattis.com/problems/low . I
| would expect this to have been included in the training
| material as well as solutions found on Github. I've tried
| providing the problem description and asking Claude Sonnet
| 4 to solve it and so far it hasn't been successful.
| diggan wrote:
| > although it's possible Claude 4 was trained on that
| discussion lol
|
| Almost guaranteed, especially since HN tends to be popular in
| tech circles, and also trivial to scrape the entire thing in
| a couple of hours via the Algolia API.
|
| Recommendation for the future: keep your
| benchmarks/evaluations private, as otherwise they're
| basically useless as more models get published that are
| trained on your data. This is what I do, and usually I don't
| see the "huge improvements" as other public benchmarks seems
| to indicate when new models appear.
| dr_kiszonka wrote:
| >> although it's possible Claude 4 was trained on that
| discussion lol
|
| > Almost guaranteed, especially since HN tends to be
| popular in tech circles, and also trivial to scrape the
| entire thing via the Algolia API.
|
| I am wondering if this could be cleverly exploited. <twirls
| mustache>
| Epa095 wrote:
| I find it weird that it does an inner check on 'num > 99999',
| which pretty much only checks for 100,000. It could check for
| 99993, but I doubt even that check makes it much faster.
|
| But have you checked with some other number than 30? Does it
| screw up the upper and lower bounds?
| kevindamm wrote:
| I did a quick review of its final answer and looks like there
| are logic errors.
|
| All three of them get the incorrect max-value bound (even
| with comments saying 9+9+9+9+3 = 30), so early termination
| wouldn't happen in the second and third solution, but that's
| an optimization detail. The first version would, however,
| early terminate on the first occurrence of 3999 and take
| whatever the max value was up to that point. So, for many
| inputs the first one (via solve_digit_sum_difference) is just
| wrong.
|
| The second implementation (solve_optimized, not a great name
| either) and third implementation, at least appear to be
| correct... but that pydoc and the comments in general are
| atrocious. In a review I would ask these to be reworded and
| would only expect juniors to even include anything similar in
| a pull request.
|
| I'm impressed that it's able to pick a good line of
| reasoning, and even if it's wrong about the optimizations it
| did give a working answer... but in the body of the response
| and in the code comments it clearly doesn't understand digit
| extraction per se, despite parroting code about it. I suspect
| you're right that the model has seen the problem solution
| before, and is possibly overfitting.
|
| Not bad, but I wouldn't say it crushed it, and wouldn't
| accept any of its micro-optimizations without benchmark
| results, or at least a benchmark test that I could then run.
|
| Have you tried the same question with other sums besides 30?
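|
| A small timing harness of the sort being asked for here; the
| candidate list is assumed to be filled in with the model's
| implementations, represented below by one slow but obviously
| correct reference version:
|
|     import timeit
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     data = rng.integers(1, 100_001, size=1_000_000)
|
|     def reference(arr, target=30):
|         # Baseline to compare micro-optimized candidates against.
|         vals = [n for n in arr.tolist()
|                 if sum(int(d) for d in str(n)) == target]
|         return max(vals) - min(vals)
|
|     candidates = [reference]  # append the model's versions here
|
|     expected = reference(data)
|     for fn in candidates:
|         assert fn(data) == expected, f"{fn.__name__} is wrong"
|         secs = timeit.timeit(lambda: fn(data), number=3) / 3
|         print(f"{fn.__name__}: {secs:.3f}s per run")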
| minimaxir wrote:
| Those are fair points. Even with those issues, it's still
| substantially better than the original benchmark
| (maybe "crushing it" is too subjective a term).
|
| I reran the test with a dataset of 1 to 500,000 and digit
| sums up to 37, and it went back to the numba JIT
| implementation that was encountered in my original blog
| post, without numerology shenanigans. https://gist.github.c
| om/minimaxir/a6b7467a5b39617a7b611bda26...
|
| I did also run the model at temp=1, which came to the same
| solution but confused itself with test cases: https://gist.
| github.com/minimaxir/be998594e090b00acf4f12d552...
| isahers32 wrote:
| Might just be missing something, but isn't 9+9+9+9+3=39? The
| largest number I believe is 99930? Also, it could further
| optimize by terminating digit sum calculations earlier if sum
| goes above 30 or could not reach 30 (num digits remaining * 9
| is less than 30 - current_sum). imo this is pretty far from
| "crushing it"
| jonny_eh wrote:
| > although it's possible Claude 4 was trained on that
| discussion lol
|
| This is why we can't have consistent benchmarks
| teekert wrote:
| Yeah I agree, also, what is the use of that benchmark? Who
| cares? How does it relate to stuff that does matter?
| thethirdone wrote:
| The first iteration vectorized with numpy is the best
| solution imho. The only additional optimization is using
| modulo 9: a number's digit sum is congruent to the number mod
| 9, and 30 mod 9 is 3, so that filter keeps only about 1/9th
| of the numbers. The digit summing is the slow part, so
| reducing the number of values there results in a large
| speedup. Numpy can do that filter pretty fast as
| `arr = arr[arr%9==3]`
|
| With that optimization it's about 3 times faster, and all of
| the non-numpy solutions are slower than the numpy one. In
| Python it almost never makes sense to try to manually iterate
| for speed.
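|
| A self-contained sketch of just that pre-filter, under the same
| assumed 1..100,000 input as above:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     arr = rng.integers(1, 100_001, size=1_000_000)
|
|     # digit_sum(n) % 9 == n % 9, and 30 % 9 == 3, so only values
|     # with n % 9 == 3 can have digit sum 30 (~1/9 survive).
|     candidates = arr[arr % 9 == 3]
|     print(arr.size, candidates.size)
|
| The expensive digit-sum pass then only has to run over the
| survivors.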
| losvedir wrote:
| Same for me, with this past year's Advent of Code. All the
| models until now have been stumped by Day 17 part 2. But Opus
| 4 finally got it! Good chance some of that is in its training
| data, though.
| mike_hearn wrote:
| They say in the blog post that tool use has improved
| dramatically: parallel tool use, ability to use tools during
| thinking and more.
| jacob019 wrote:
| Hey, at least they incremented the version number. I'll take
| it.
| drusepth wrote:
| To be fair, a lot of people said 3.7 should have just been
| called 4. Maybe they're just bridging the gap.
| Bjorkbat wrote:
| I was about to comment on a past remark from Anthropic that the
| whole reason for the convoluted naming scheme was because they
| wanted to wait until they had a model worthy of the "Claude 4"
| title.
|
| But because of all the incremental improvements since then, the
| irony is that this merely feels like an incremental
| improvement. It obviously is a huge leap when you consider that
| the best Claude 3 ever got on SWE-verified was just under 20%
| (combined with SWE-agent), but compared to Claude 3.7 it
| doesn't feel like that big of a deal, at least when it comes to
| SWE-bench results.
|
| Is it worthy? Sure, why not, compared to the original Claude 3
| at any rate, but this habit of incremental improvement means
| that a major new release feels kind of ordinary.
| thimabi wrote:
| It's been hard to keep up with the evolution in LLMs. SOTA models
| basically change every other week, and each of them has its own
| quirks.
|
| Differences in features, personality, output formatting, UI,
| safety filters... make it nearly impossible to migrate workflows
| between distinct LLMs. Even models of the same family exhibit
| strikingly different behaviors in response to the same prompt.
|
| Still, having to find each model's strengths and weaknesses on my
| own is certainly much better than not seeing any progress in the
| field. I just hope that, eventually, LLM providers converge on a
| similar set of features and behaviors for their models.
| runjake wrote:
| My advice: don't jump around between LLMs for a given project.
| The AI space is progressing too rapidly right now. Save
| yourself the sanity.
| vFunct wrote:
| Each model has its own strengths and weaknesses tho. You
| really shouldn't be using one model for everything. Like,
| Claude is great at coding but is expensive, so you wouldn't
| use it for debugging or writing test benches. The OpenAI
| models suck at architecture but are cheap, so they're
| ideal for test benches, for example.
| rokkamokka wrote:
| Isn't that an argument to jump around? Since performance
| improves so rapidly between models
| morkalork wrote:
| I think the idea is you might end up spending your time
| shaving a yak. Finish your project, then try the new SOTA on
| your next task.
| dmix wrote:
| You should at least have two to sanity check difficult
| programming solutions.
| charlie0 wrote:
| A man with one compass knows where he's going; a man with two
| compasses is never sure.
| smallnamespace wrote:
| Vibe code an eval harness with a web dashboard
| matsemann wrote:
| How important is it to be using SOTA? Or even jump on it
| already?
|
| Feels a bit like when it was a new frontend framework every
| week. Didn't jump on any then. Sure, when React was the winner,
| I had a few months less experience than those who bet on the
| correct horse. But nothing I couldn't quickly catch up to.
| thimabi wrote:
| > How important is it to be using SOTA?
|
| I believe in using the best model for each use case. Since
| I'm paying for it, I like to find out which model is the best
| bang for my buck.
|
| The problem is that, even when comparing models according to
| different use cases, better models eventually appear, and the
| models one uses eventually change as well -- for better or
| worse. This means that using the same model over and over
| doesn't seem like a good decision.
| Falimonda wrote:
| Have you tried a package like LiteLLM so that you can more
| easily validate and switch to a newer model?
|
| The key seems to be in curating your application's evaluation
| set.
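|
| For what it's worth, a minimal sketch of that pattern with
| LiteLLM; the model ids and the single toy prompt are
| placeholders, not a real evaluation set:
|
|     from litellm import completion
|
|     EVAL_PROMPTS = ["Summarize the LiteLLM project in one line."]
|     MODELS = [
|         "anthropic/claude-sonnet-4-20250514",  # assumed id
|         "gpt-4o-mini",
|     ]
|
|     # Same prompts, different providers, one calling convention.
|     for model in MODELS:
|         for prompt in EVAL_PROMPTS:
|             resp = completion(
|                 model=model,
|                 messages=[{"role": "user", "content": prompt}],
|             )
|             print(model, "->",
|                   resp.choices[0].message.content[:80])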
| ksec wrote:
| This is starting to get ridiculous. I am busy with life and have
| hundreds of tabs unread including one [1] about Claude 3.7 Sonnet
| and Claude Code and Gemini 2.5 Pro. And before any of that Claude
| 4 is out. And all the stuff Google announced during IO yday.
|
| So will Claude 4.5 come out in a few months and 5.0 before the
| end of the year?
|
| At this point is it even worth following anything about AI / LLM?
|
| [1] https://news.ycombinator.com/item?id=43163011
| i_love_retros wrote:
| Anyone know when the o4-x-mini release is being announced? I
| thought it was today
| whalesalad wrote:
| Anyone have a link to the actual Anthropic official vscode
| extension? Struggling to find it.
|
| edit: run `claude` in a vscode terminal and it will get
| installed. but the actual extension id is `Anthropic.claude-code`
| ChadMoran wrote:
| Thank you!
| energy123 wrote:
| > Finally, we've introduced thinking summaries for Claude 4
| models that use a smaller model to condense lengthy thought
| processes. This summarization is only needed about 5% of the time
| --most thought processes are short enough to display in full.
|
| This is not better for the user. No users want this. If you're
| doing this to prevent competitors training on your thought traces
| then fine. But if you really believe this is what users want, you
| need to reconsider.
| throwaway314155 wrote:
| If _you_ really believe this is what all users want, _you_
| should reconsider. You're describing a feature for power users.
| It should be a toggle but it's silly to say it didn't improve
| UX for people who don't want to read strange babbling chains of
| thought.
| porridgeraisin wrote:
| Here's an example non-power-user usecase for CoT:
|
| Sometimes when I forget to specify a detail in my prompt and
| it's just a short task where I don't bother with long
| processes like "ask clarifying questions, make a plan and
| then follow it" etc etc, I see it talking about making that
| assumption in the CoT and I immediately cancel the request
| and edit the detail in.
| energy123 wrote:
| You're accusing me of mind reading other users, but then
| proceed to engage in your own mind reading of those same
| users.
|
| Have a look in Gemini related subreddits after they nerfed
| their CoT yesterday. There's nobody happy about this trend. A
| high quality CoT that gets put through a small LLM is really
| no better than noise. Paternalistic noise. It's not worth
| reading. Just don't even show me the CoT at all.
|
| If someone is paying for Opus 4 then they likely are a power
| user, anyway. They're doing it for the frontier performance
| and I would assume such users would appreciate the real CoT.
| comova wrote:
| I believe this is to improve performance by shortening the
| context window for long thinking processes. I don't think this
| is referring to real-time summarizing for the users' sake.
| usaar333 wrote:
| When you do a chat are reasoning traces for prior model
| outputs in the LLM context?
| int_19h wrote:
| No, they are normally stripped out.
| j_maffe wrote:
| > I don't think this is referring to real-time summarizing
| for the users' sake.
|
| That's exactly what it's referring to.
| dr_kiszonka wrote:
| I agree. Thinking traces are the first thing I check when I
| suspect Claude lied to me. Call me cynical, but I suspect that
| these new summaries will conveniently remove the "evidence."
| tptacek wrote:
| Have they documented the context window changes for Claude 4
| anywhere? My (barely informed) understanding was one of the
| reasons Gemini 2.5 has been so useful is that it can handle huge
| amounts of context --- 50-70kloc?
| jbellis wrote:
| not sure wym, it's in the headline of the article that Opus 4
| has 200k context
|
| (same as sonnet 3.7 with the beta header)
| tptacek wrote:
| We might be looking at different articles? The string "200"
| appears nowhere in this one --- or I'm just wrong! But
| thanks!
| esafak wrote:
| There's the nominal context length, and the effective one.
| You need a benchmark like the needle-in-a-haystack or RULER
| to determine the latter.
|
| https://github.com/NVIDIA/RULER
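|
| As a rough illustration of the needle-in-a-haystack idea (RULER
| is considerably more thorough), with call_model standing in for
| whatever client you use:
|
|     import random
|
|     def build_haystack(n_lines, needle, pos):
|         filler = "The quick brown fox jumps over the lazy dog."
|         lines = [filler] * n_lines
|         lines[pos] = needle
|         return "\n".join(lines)
|
|     def call_model(prompt):
|         raise NotImplementedError("plug in your LLM client")
|
|     needle = "The secret code is 48151623."
|     pos = random.randrange(10_000)
|     prompt = (build_haystack(10_000, needle, pos)
|               + "\n\nWhat is the secret code mentioned above?")
|     # A model with a good effective context should answer
|     # 48151623 no matter where the needle was planted.
|     # print(call_model(prompt))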
| minimaxir wrote:
| Context window is unchanged for Sonnet. (200k in/64k out):
| https://docs.anthropic.com/en/docs/about-claude/models/overv...
|
| In practice, the 1M context of Gemini 2.5 isn't that much of a
| differentiator because larger context has diminishing returns
| on adherence to later tokens.
| Workaccount2 wrote:
| Gemini's real strength is that it can stay on the ball even
| at 100k tokens in context.
| zamadatix wrote:
| The amount of degradation at a given context length isn't
| constant though so a model with 5x the context can either be
| completely useless or still better depending on the strength
| of the models you're comparing. Gemini actually does really
| great in both regards (context length and quality at length)
| but I'm not sure what a hard numbers comparison to the latest
| Claude models would look like.
|
| A good deep dive on the context scaling topic in general
| https://youtu.be/NHMJ9mqKeMQ
| ashirviskas wrote:
| It's closer to <30k before performance degrades too much for
| 3.5/3.7. 200k/64k is meaningless in this context.
| jerjerjer wrote:
| Is there a benchmark to measure real effective context
| length?
|
| Sure, gpt-4o has a context window of 128k, but it loses a
| lot from the beginning/middle.
| bigmadshoe wrote:
| They often publish "needle in a haystack" benchmarks that
| look very good, but my subjective experience with a large
| context is always bad. Maybe we need better benchmarks.
| brookst wrote:
| Here's an older study that includes Claude 3.5:
| https://www.databricks.com/blog/long-context-rag-
| capabilitie...?
| evertedsphere wrote:
| ruler https://arxiv.org/abs/2404.06654
|
| nolima https://arxiv.org/abs/2502.05167
| Rudybega wrote:
| I'm going to have to heavily disagree. Gemini 2.5 Pro has
| super impressive performance on large context problems. I
| routinely drive it up to 400-500k tokens in my coding agent.
| It's the only model where that much context produces even
| remotely useful results.
|
| I think it also crushes most of the benchmarks for long
| context performance. I believe on MRCR (multi-round
| coreference resolution) its performance at 1M tokens beats
| pretty much any other model's performance at 128k (o3 may
| have changed this).
| egamirorrim wrote:
| OOI what coding agent are you managing to get to work
| nicely with G2.5 Pro?
| Rudybega wrote:
| I mostly use Roo Code inside visual studio. The modes are
| awesome for managing context length within a discrete
| unit of work.
| alasano wrote:
| I find that it consistently breaks around that exact range
| you specified. In the sense that reliability falls off a
| cliff, even though I've used it successfully close to the
| 1M token limit.
|
| At 500k+ I will define a task and it will suddenly panic
| and go back to a previous task that we just fully
| completed.
| l2silver wrote:
| Is that a codebase you're giving it?
| michaelbrave wrote:
| I've had a lot of fun using Gemini's large context. I scrape
| a reddit discussion with 7k responses, and have gemini
| synthesize it and categorize it, and by the time it's done
| and I have a few back-and-forths with it, I've gotten half of
| a book written.
|
| That said I have noticed that if I try to give it additional
| threads to compare and contrast once it hits around the
| 300-500k tokens it starts to hallucinate more and forget
| things more.
| VeejayRampay wrote:
| that is just not correct, it's a big differentiator
| strangescript wrote:
| Yeah, but why aren't they attacking that problem? Is it just
| impossible? Because it would be a really simple win with
| regards to coding. I am a huge enthusiast, but I am starting
| to sense a peak.
| fblp wrote:
| I wish they would increase the context window or better handle
| it when the prompt gets too long. Currently users get "prompt
| is too long" warnings suddenly which makes it a frustrating
| model to work with for long conversations, writing etc.
|
| Other tools may drop some prior context, or use RAG to help but
| they don't force you to start a new chat without warning.
| keeganpoppen wrote:
| context window size is super fake. if you don't have the right
| context, you don't get good output.
| unshavedyak wrote:
| Anyone know if this is usable with Claude Code? If so, how? I've
| not seen the ability to configure the backend for Claude Code,
| hmm
| khalic wrote:
| The new page for Claude Code says it uses opus 4, sonnet 4 and
| haiku 3.5
| drewnick wrote:
| Last night I suddenly got noticeably better performance in
| Claude Code. It one-shotted something in a way I'd never seen
| before, taking multiple steps over 10 minutes. I wonder if I
| was on a test branch? It seems to be continuing this morning
| with good performance, solving an issue Gemini was struggling
| with.
| cloudc0de wrote:
| Just saw this popup in claude cli v1.0.0 changelog
|
| What's new:
|
| * Added `DISABLE_INTERLEAVED_THINKING` to give users the option
| to opt out of interleaved thinking.
|
| * Improved model references to show provider-specific names
| (Sonnet 3.7 for Bedrock, Sonnet 4 for Console)
|
| * Updated documentation links and OAuth process descriptions
|
| * Claude Code is now generally available
|
| * Introducing Sonnet 4 and Opus 4 models
| michelb wrote:
| Yes, you can type /model in Code to switch model, currently
| Opus 4 and Sonnet 4 for me.
| fsto wrote:
| What's your guess on when Claude 4 will be available on AWS
| Bedrock?
| richin13 wrote:
| The blog says it should be available now but I'm not seeing it.
| In Vertex AI it's already available.
| cloudc0de wrote:
| I was able to request both new models only in Oregon / us-
| west-2.
|
| I built a Slack bot for my wife so she can use Claude without a
| full-blown subscription. She uses it daily for lesson planning
| and brainstorming and it only costs us about 50 cents a month
| in Lambda + Bedrock costs. https://us-
| west-2.console.aws.amazon.com/bedrock/home?region...
|
| https://docs.anthropic.com/en/docs/claude-code/bedrock-verte...
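|
| For anyone curious what the Bedrock side of a bot like that can
| look like, a rough sketch using the Converse API; the model id
| and region are assumptions and the Slack plumbing is omitted:
|
|     import boto3
|
|     bedrock = boto3.client("bedrock-runtime",
|                            region_name="us-west-2")
|
|     def handler(event, context):
|         user_text = event.get("text", "Plan a fractions lesson.")
|         resp = bedrock.converse(
|             # assumed Bedrock model id for Claude Sonnet 4
|             modelId="anthropic.claude-sonnet-4-20250514-v1:0",
|             messages=[{"role": "user",
|                        "content": [{"text": user_text}]}],
|             inferenceConfig={"maxTokens": 1024},
|         )
|         text = resp["output"]["message"]["content"][0]["text"]
|         return {"statusCode": 200, "body": text}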
| waleedlatif1 wrote:
| I really hope sonnet 4 is not obsessed with tool calls the way
| 3-7 is. 3-5 was sort of this magical experience where, for the
| first time, I felt the sense that models were going to master
| programming. It's kind of been downhill from there.
| deciduously wrote:
| This feels like more of a system prompt issue than a model
| issue?
| waleedlatif1 wrote:
| imo, model regression might actually stem from more
| aggressive use of toolformer-style prompting, or even RLHF
| tuning optimizing for obedience over initiative. i bet if you
| ran comparable tasks across 3-5, 3-7, and 4-0 with consistent
| prompts and tool access disabled, the underlying model
| capabilities might be closer than it seems.
| ketzo wrote:
| Anecdotal of course, but I feel a very distinct difference
| between 3.5 and 3.7 when swapping between them in Cursor's
| Agent mode (so the system prompt stays consistent).
| lherron wrote:
| Overly aggressive "let me do one more thing while I'm here" in
| 3.7 really turned me off as well. Would love a return to 3.5's
| adherence.
| nomel wrote:
| I think there was definitely a compromise with 3.7. When I
| turn off thinking, it seems to perform _very_ poorly compared
| to 3.5.
| mkrd wrote:
| Yes, this is pretty annoying. You give it a file and want it
| to make a small focused change, but instead it touches almost
| every line of code, even the unrelated ones
| accrual wrote:
| Very impressive, congrats Anthropic/Claude team! I've been using
| Claude for personal project development and finally bought a
| subscription to Pro as well.
| minimaxir wrote:
| An important note not mentioned in this announcement is that
| Claude 4's training cutoff date is March 2025, which is the
| latest of any recent model. (Gemini 2.5 has a cutoff of January
| 2025)
|
| https://docs.anthropic.com/en/docs/about-claude/models/overv...
| m3kw9 wrote:
| Even then, we don't know what got updated and what didn't. Can
| we assume everything that can be updated is updated?
| simlevesque wrote:
| You might be able to ask it what it knows.
| retrofuturism wrote:
| When I try Claude Sonnet 4 via web:
|
| https://claude.ai/share/59818e6c-804b-4597-826a-c0ca2eccdc4
| 6
|
| >This is a topic that would have developed after my
| knowledge cutoff of January 2025, so I should search for
| information [...]
| minimaxir wrote:
| So something's odd there. I asked it "Who won Super Bowl
| LIX and what was the winning score?" which was in February
| and the model replied "I don't have information about Super
| Bowl LIX (59) because it hasn't been played yet. Super Bowl
| LIX is scheduled to take place in February 2025.".
| ldoughty wrote:
| With LLMs, if you repeat something often enough, it
| becomes true.
|
| I imagine there's a lot more data pointing to the Super
| Bowl being upcoming than to the Super Bowl concluding with
| that score.
|
| Gonna be scary when bot farms are paid to make massive
| amounts of politically motivated false content
| (specifically) targeting future LLM training
| gosub100 wrote:
| A lot of people are forecasting the death of the Internet
| as we know it. The financial incentives are too high and
| the barrier of entry is too low. If you can build bots
| that maybe only generate a fraction of a dollar per day
| (referring people to businesses, posting spam for
| elections, poisoning data collection/web crawlers),
| someone in a poor country will do it. Then, the bots
| themselves have value which creates a market for
| specialists in fake profile farming.
|
| I'll go a step further and say this is not a problem but
| a boon to tech companies. Then they can sell you a
| "premium service" to a walled garden of only verified
| humans or bot-filtered content. The rest of the Internet
| will suck and nobody will have incentive to fix it.
| birn559 wrote:
| I believe identity providers will become even more
| important in the future as a consequence and that there
| will be an arms race (hopefully) ending with most people
| providing them some kind of official id.
| gosub100 wrote:
| It might slow them down, but integration of the
| government into online accounts will have its own set of
| consequences. Some good, of course. But can chill free
| speech and become a huge liability for whoever collects
| and verifies the IDs. One hack (say of the government ID
| database) would spoil the whole system.
| dr-smooth wrote:
| I'm sure it's already happening.
| krferriter wrote:
| Why would you trust it to accurately say what it knows?
| It's all statistical processes. There's no "but actually
| for this question give me only a correct answer" toggle.
| diggan wrote:
| > Can we assume everything that can be updated is updated?
|
| What does that even mean? Of course an LLM doesn't know
| _everything_, so we wouldn't be able to assume
| _everything_ got updated either. At best, if they shared the
| datasets they used (which they won't, because they were most
| likely acquired illegally), you could make some guesses about
| what they _tried_ to update.
| therein wrote:
| > What does that even mean?
|
| I think it is clear what he meant and it is a legitimate
| question.
|
| If you took a 6 year old and told him about the things that
| happened in the last year and sent him off to work, did he
| integrate the last year's knowledge? Did he even believe it
| or find it true? If that information conflicted with what
| he knew before, how do we know he will take the most recent
| thing he is told as the new information? Will he
| continue parroting what he knew before this last upload?
| These are legitimate questions we have about our black box
| of statistics.
| aziaziazi wrote:
| Interesting, I read GGP as:
|
| If they stopped learning (=including) at March 31 and
| something popped up on the internet on March 30 (lib update,
| new Nobel, whatever), there's a good chance it got skipped,
| because they probably don't scrape everything in one day
| (do they?).
|
| That isn't mutually exclusive with your answer I guess.
|
| edit: thanks adolph for pointing out the typo.
| adolph wrote:
| Maybe I'm old school but isn't the date the last date for
| inclusion in the training corpus and not the date "they
| stopped training"?
| lxgr wrote:
| With web search being available in all major user-facing LLM
| products now (and I believe in some APIs as well, sometimes
| unintentionally), I feel like the exact month of cutoff is
| becoming less and less relevant, at least in my personal
| experience.
|
| The models I'm regularly using are usually smart enough to
| figure out that they should be pulling in new information for a
| given topic.
| jacob019 wrote:
| Valid. I suppose the most annoying thing related to the
| cutoffs, is the model's knowledge of library APIs, especially
| when there are breaking changes. Even when they have some
| knowledge of the most recent version, they tend to default to
| whatever they have seen the most in training, which is
| typically older code. I suspect the frontier labs have all
| been working to mitigate this. I'm just super stoked, been
| waiting for this one to drop.
| bredren wrote:
| It still matters for software packages. Particularly python
| packages that have to do with programming with AI!
|
| They are evolving quickly, with deprecation and updated
| documentation. Having to correct for this in system prompts
| is a pain.
|
| It would be great if the models were updating portions of
| their content more frequently than others.
|
| For the tailwind example in the parent-sibling comment, it
| should absolutely be as up to date as possible, whereas the
| history of the US civil war can probably be updated less
| frequently.
| rafram wrote:
| > the history of the US civil war can probably be updated
| less frequently.
|
| It's already missed out on two issues of _Civil War
| History_: https://muse.jhu.edu/journal/42
|
| Contrary to the prevailing belief in tech circles, there's
| a _lot_ in history/social science that we don't know and
| are still figuring out. It's not _IEEE Transactions on
| Pattern Analysis and Machine Intelligence_ (four issues
| since March), but it's not nothing.
| xp84 wrote:
| I started reading the first article in one of those
| issues only to realize it was just a preview of something
| very paywalled. Why does Johns Hopkins need money so
| badly that it has to hold historical knowledge hostage?
| :(
| ChadNauseam wrote:
| Isn't Johns Hopkins a university? I feel like holding
| knowledge hostage is their entire business model.
| ordersofmag wrote:
| The journal appears to be published by an office with 7
| FTE's which presumably is funded by the money raised by
| presence of the paywall and sales of their journals and
| books. Fully-loaded costs for 7 folks is on the order of
| $750k/year. https://www.kentstateuniversitypress.com/
|
| Someone has to foot that bill. Open-access publishing
| implies the authors are paying the cost of publication
| and its popularity in STEM reflects an availability of
| money (especially grant funds) to cover those author page
| charges that is not mirrored in the social sciences and
| humanities.
|
| Unrelatedly given recent changes in federal funding Johns
| Hopkins is probably feeling like it could use a little
| extra cash (losing $800 million in USAID funding,
| overhead rates potential dropping to existential crisis
| levels, etc...)
| evanelias wrote:
| Johns Hopkins is not the publisher of this journal and
| does not hold copyright for this journal. Why are you
| blaming them?
|
| The _website_ linked above is just a way to read journals
| online, hosted by Johns Hopkins. As it states, "Most of
| our users get access to content on Project MUSE through
| their library or institution. For individuals who are not
| affiliated with a library or institution, we provide
| options for you to purchase Project MUSE content and
| subscriptions for a selection of Project MUSE journals."
| bredren wrote:
| Let us dispel with the notion that I do not appreciate
| Civil War history. Ashokan Farewell is the only song I
| can play from memory on violin.
| toomuchtodo wrote:
| Does repo/package specific MCP solve for this at all?
| aziaziazi wrote:
| Kind of but not in the same way: the MCP option will
| increase the discussion context, the training option does
| not. Armchair expert so confirmation would be
| appreciated.
| toomuchtodo wrote:
| Same, I'm curious what it looks like to incrementally or
| micro train against, if at all possible, frequently
| changing data sources (repos, Wikipedia/news/current
| events, etc).
| alasano wrote:
| The context7 MCP helps with this but I agree.
| roflyear wrote:
| It matters even with recent cutoffs, these models have no
| idea when to use a package or not (if it's no longer
| maintained, etc)
|
| You can fix this by first figuring out what packages to use
| or providing your package list, tho.
| diggan wrote:
| > these models have no idea when to use a package or not
| (if it's no longer maintained, etc)
|
| They have ideas about what you tell them to have ideas
| about. In this case, when to use a package or not,
| differs a lot by person, organization or even project, so
| makes sense they wouldn't be heavily biased one way or
| another.
|
| Personally I'd look at the architecture of the package code
| before I'd look at when the last change was/how often it
| was updated, and whether the last change was years ago or
| yesterday has little bearing (usually) on deciding to
| use it, so I wouldn't want my LLM assistant to value it
| differently.
| paulddraper wrote:
| How often are base level libraries/frameworks changing in
| incompatible ways?
| hnlmorg wrote:
| That depends on the language and domain.
|
| MCP itself isn't even a year old.
| guywithahat wrote:
| I was thinking that too, Grok can comment on things that
| broke out only hours earlier, so cutoff dates don't seem to
| matter much
| GardenLetter27 wrote:
| I've had issues with Godot and Rustls - where it gives code
| for some ancient version of the API.
| BeetleB wrote:
| Web search is costlier.
| drogus wrote:
| In my experience it really depends on the situation. For
| stable APIs that have been around for years, sure, it doesn't
| really matter that much. But if you try to use a library that
| had significant changes after the cutoff, the models tend to
| do things the old way, even if you provide a link to examples
| with new code.
| iLoveOncall wrote:
| Web search isn't desirable or even an option in a lot of use
| cases that involve GenAI.
|
| It seems people have turned GenAI into coding assistants only
| and forget that they can actually be used for other projects
| too.
| SparkyMcUnicorn wrote:
| Although I believe it, I wish there was some observability into
| what data is included here.
|
| Both Sonnet and Opus 4 say Joe Biden is president and claim
| their knowledge cutoff is "April 2024".
| Tossrock wrote:
| Are you sure you're using 4? Mine says January 2025:
| https://claude.ai/share/9d544e4c-253e-4d61-bdad-b5dd1c2f1a63
| liorn wrote:
| I asked it about Tailwind CSS (since I had problems with Claude
| not aware of Tailwind 4):
|
| > Which version of tailwind css do you know?
|
| > I have knowledge of Tailwind CSS up to version 3.4, which was
| the latest stable version as of my knowledge cutoff in January
| 2025.
| SparkyMcUnicorn wrote:
| Interesting. It's claiming different knowledge cutoff dates
| depending on the question asked.
|
| "Who is president?" gives a "April 2024" date.
| ethbr1 wrote:
| Question for HN: how are content timestamps encoded during
| training?
| tough wrote:
| they aren't.
|
| a model learns words or tokens more pedantically but has
| no sense of time, nor can it track dates
| svachalek wrote:
| Yup. Either the system prompt includes a date it can
| parrot, or it doesn't and the LLM will just hallucinate
| one as needed. Looks like it's the latter case here.
| manmal wrote:
| Technically they don't, but OpenAI must be injecting the
| current date and time into the system prompt, and Gemini
| just does a web search for the time when asked.
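|
| A minimal sketch of the kind of injection being described,
| using the OpenAI SDK as an example (the model id is just a
| placeholder); the date comes from the caller, not the weights:
|
|     from datetime import date
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4o-mini",
|         messages=[
|             {"role": "system",
|              "content": f"Today's date is {date.today()}."},
|             {"role": "user", "content": "What is today's date?"},
|         ],
|     )
|     print(resp.choices[0].message.content)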
| tough wrote:
| right but that's system prompting / in context
|
| not really -trained- into the weights.
|
| the point is you can't ask a model what its training
| cut off date is and expect a reliable answer from the
| weights themselves.
|
| the closest you could do is have a bench with -timed-
| questions it could only know if it had been trained on
| that period, and you'd have to deal with hallucinations vs
| correctness etc
|
| just not what llm's are made for, RAG solves this tho
| tough wrote:
| OpenAI injects a lot of stuff, your name, sub status,
| recent threads, memory, etc
|
| sometimes its interesting to peek up under the network
| tab on dev tools
| Tokumei-no-hito wrote:
| strange they would do that client side
| diggan wrote:
| Different teams who work backend/frontend surely, and the
| people experimenting on the prompts for whatever reason
| wanna go through the frontend pipeline.
| tough wrote:
| its just like extra metadata associated with your account
| not much else
| cma wrote:
| Claude 4's system prompt was published and contains:
|
| "Claude's reliable knowledge cutoff date - the date past
| which it cannot answer questions reliably - is the end of
| January 2025. It answers all questions the way a highly
| informed individual in January 2025 would if they were
| talking to someone from {{currentDateTime}}, "
|
| https://docs.anthropic.com/en/release-notes/system-
| prompts#m...
| threeducks wrote:
| > Which version of tailwind css do you know?
|
| LLMs can not reliably tell whether they know or don't know
| something. If they did, we would not have to deal with
| hallucinations.
| nicce wrote:
| We should use the correct term: to not have to deal with
| bullshit.
| dawnerd wrote:
| I did the same recently with copilot and it of course lied
| and said it knew about v4. Hard to trust any of them.
| cma wrote:
| When I asked the model it told me January (for sonnet 4).
| Doesn't it normally get that in its system prompt?
| dvfjsdhgfv wrote:
| One thing I'm 100% sure of is that a cutoff date doesn't exist
| for any large model, or rather there is no single date, since
| it's practically almost impossible to achieve that.
| koolba wrote:
| Indeed. It's not possible to stop the world and snapshot the
| entire internet in a single day.
|
| Or is it?
| tough wrote:
| you would have an append only incremental backup snapshot
| of the world
| tonyhart7 wrote:
| it's not a definitive "date" when you cut off information, but
| more the most "recent" material you can feed in; training takes
| time
|
| if you keep waiting for new information, of course you are
| never going to train
| indigodaddy wrote:
| Should we not necessarily assume that it would have some
| FastHTML training with that March 2025 cutoff date? I'd hope so
| but I guess it's more likely that it still hasn't trained on
| FastHTML?
| jph00 wrote:
| Claude 4 actually knows FastHTML pretty well! :D It managed
| to one-shot most basic tasks I sent its way, although it
| makes a lot of standard minor n00b mistakes that make its
| code a bit longer and more complex than needed.
|
| I've nearly finished writing a short guide which, when added
| to a prompt, gives quite idiomatic FastHTML code.
| tristanb wrote:
| Nice - it might know about Svelte 5 finally...
| brulard wrote:
| It has known about Svelte 5 for some time, but it particularly
| likes to mix it with Svelte 4 in very weird and broken ways.
| Phelinofist wrote:
| Why can't it be trained "continuously"?
| cma wrote:
| Catastrophic forgetting
|
| https://en.wikipedia.org/wiki/Catastrophic_interference
| benmccann wrote:
| The updated knowledge cutoff is helping with new technologies
| such as Svelte 5.
| archon1410 wrote:
| The naming scheme used to be "Claude [number] [size]", but now it
| is "Claude [size] [number]". The new models should have been
| named Claude 4 Opus and Claude 4 Sonnet, but they changed it, and
| even retconned Claude 3.7 Sonnet into Claude Sonnet 3.7.
|
| Annoying.
| martinsnow wrote:
| It seems like investors have bought into the idea that LLMs have
| to improve no matter what. I see it in the company I'm
| currently at. No matter what, we have to work with whatever
| bullshit these models can output. I am, however, looking at more
| responsible companies for new employment.
| fhd2 wrote:
| I'd argue a lot of the current AI hype is fuelled by hopium
| that models will improve significantly and hallucinations
| will be solved.
|
| I'm a (minor) investor, and I see this a lot: People
| integrate LLMs for some use case, lately increasingly agentic
| (i.e. in a loop), and then when I scrutinise the results, the
| excuse is that models will improve, and _then_ they'll have a
| viable product.
|
| I currently don't bet on that. Show me you're using LLMs
| smartly and have solid solutions for _today's_ limitations,
| different story.
| martinsnow wrote:
| Our problem is that non-coding stakeholders produce garbage-
| tier frontend prototypes and expect us to include whatever
| garbage they created in our production pipeline! Wtf is
| going on? That's why I'm polishing my resume and getting
| out of this mess. We're controlled by managers who don't
| know wtf they're doing.
| douglasisshiny wrote:
| It's been refreshing to read these perspectives as a
| person who has given up on using LLMs. I think there's a
| lot of delusion going on right now. I can't tell you how
| many times I've read that LLMs are huge productivity
| boosters (specifically for developers) without a shred of
| data/evidence.
|
| On the contrary, I started to rely on them despite them
| constantly providing incorrect, incoherent answers.
| Perhaps they can spit out a basic react app from scratch,
| but I'm working on large code bases, not TODO apps. And
| the thing is, for the year+ I used them, I got worse as a
| developer. Using them hampered me learning another
| language I needed for my job (my fault; but I relied on
| LLMs vs. reading docs and experimenting myself, which I
| assume a lot of people do, even experienced devs).
| martinsnow wrote:
| When you get outside the scope of a CRUD app, they fall
| apart. Trouble is that the business only sees CRUD until we as
| developers have to fill in complex states, and that's when
| hell breaks loose, because who thought of that? Certainly
| not your army of frontend and backend engineers who
| warned you about this for months on end...
|
| The future will be full of broken UIs and incomplete emails of
| "I don't know what to do here"...
| fhd2 wrote:
| The sad part is that there is a _lot_ of stuff we can now
| do with LLMs that was practically impossible before.
| And with all the hype, it takes some effort, at least for
| me, to not get burned out on all that and stay curious
| about them.
|
| My opinion is that you just need to be really deliberate
| in what you use them for. Any workflow that requires
| human review because precision and responsibility matters
| leads to the irony of automation: The human in the loop
| gets bored, especially if the success rate is high, and
| misses flaws they were meant to react to. Like safety
| drivers for self driving car testing: A both incredibly
| intense and incredibly boring job that is very difficult
| to do well.
|
| Staying in that analogy, driver assist systems that
| generally keep the driver on the wheel, engaged and
| entertained are more effective. Designing software like
| that is difficult. Development tooling is just one use
| case, but we could build such _amazingly_ useful features
| powered by LLMs. Instead, what I see most people build,
| vibe coding and agentic tools, run right into the ironies
| of automation.
|
| But well, however it plays out, this too shall pass.
| fhd2 wrote:
| Maybe a service mentality would help you make that
| bearable for as long as it still lasts? For my consulting
| clients, I make sure I inform them of risks, problems and
| tradeoffs the best way I can. But if they want to go
| ahead against my recommendation - so be it, their call. A
| lot of technical decisions are actually business
| decisions in disguise. All I can do is consult them
| otherwise and perhaps get them to put a proverbial canary
| in the coal mine: Some KPI to watch or something that
| otherwise alerts them that the thing I feared would
| happen did happen. And perhaps a rough mitigation
| strategy, so we agree ahead of time on how to handle
| that.
|
| But I haven't dealt with anyone sending me vibe code to
| "just deploy", that must be frustrating. I'm not sure how
| I'd handle that. Perhaps I would try to isolate it and
| get them to own it completely, if feasible. They're only
| going to learn if they have a feedback loop, if stuff
| that goes wrong ends up back on their desk, instead of
| yours. The perceived benefit for them is that they don't
| have to deal with pesky developers getting in the way.
| lofaszvanitt wrote:
| 3.7 failed when you asked it to forget react, tailwindcss and
| other bloatware. wondering how this will perform.
|
| well, this performs even worse... brrrr.
|
| it still has issues where it generates code and then immediately
| changes it... it does this for 9 generations, and the last
| generation is unusable, while the 7th generation was ok, but
| still, it tried to correct things that worked flawlessly...
| sndean wrote:
| Using Claude Opus 4, this was the first time I've gotten any of
| these models to produce functioning Dyalog APL that does
| something relatively complicated. And it actually runs without
| errors. Crazy (at least to me).
| skruger wrote:
| Would you mind sharing what you did? stefan@dyalog
| cess11 wrote:
| Would you mind sharing the query and results?
| travisgriggs wrote:
| It feels as if the CPU MHz wars of the '90s are back. Now instead
| of geeking out about CPU architectures, which have various
| results of ambiguous value on different benchmarks, we're talking
| about the same sorts of nerdy things between LLMs.
|
| History Rhymes with Itself.
| merksittich wrote:
| From the system card [0]:
|
| Claude Opus 4
| - Knowledge Cutoff: Mar 2025
| - Core Capabilities: Hybrid reasoning, visual analysis, computer
|   use (agentic), tool use, adv. coding (autonomous), enhanced
|   tool use & agentic workflows.
| - Thinking Mode: Std & "Extended Thinking Mode"
| - Safety/Agency: ASL-3 (precautionary); higher initiative/agency
|   than prev. models. 0/4 researchers believed that Claude Opus 4
|   could completely automate the work of a junior ML researcher.
|
| Claude Sonnet 4
| - Knowledge Cutoff: Mar 2025
| - Core Capabilities: Hybrid reasoning
| - Thinking Mode: Std & "Extended Thinking Mode"
| - Safety: ASL-2.
|
| [0] https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...
| rasulkireev wrote:
| At this point, it is hilarious the speed at which the AI industry
| is moving forward... Claude 4, really?
| m3kw9 wrote:
| It reminds me, where's deepseek's new promised world breaker
| model?
| joshstrange wrote:
| If you are looking for the IntelliJ Jetbrain plugin it's here:
| https://plugins.jetbrains.com/plugin/27310-claude-code-beta-
|
| I couldn't find it linked from Claude Code's page or this
| announcement
| Seattle3503 wrote:
| I'm getting "claude code not found" even though I have Claude
| Code installed. Is there some trick to getting it to see my
| install? I installed claude code the normal way.
| joshstrange wrote:
| Hmm, I didn't have to do any extra steps after installing the
| plugin (I had the cli tool installed already as well).
| joshstrange wrote:
| I can't edit either comment or reply to the other one b/c it
| was flagged?
|
| Some downsides to the JetBrains plugin I've found after playing
| with it some more:
|
| - No alert/notification when it's waiting for the user. The
| console rings a bell but there is no indication it's waiting
| for you to approve a tool/edit
|
| - Diff popup for every file edited. This means you have to
| babysit it even closer.
|
| 1 diff at a time might sound great "You can keep tabs on the
| model each step of the way" and it would be if it did all the
| edits to a file in one go but instead it does it piecemeal
| (which is good/makes sense) but the problem is if you are
| working in something like, a Vue SFC file then it might edit
| the template and show you a diff, then edit the script and show
| you a diff, then edit the TS and show you a diff.
|
| By themselves, the diffs don't always make sense and so it's
| impossible to really give input. It would be as if a junior dev
| sent you the PR 1 edit at a time and asked you to sign off. Not
| "1 PR per feature" but literally "1 PR per 5 lines changed",
| it's useless.
|
| As of right now I'm going back to the CLI, this is a downgrade.
| I review diffs in IDEA before committing anyway and can use the
| diff tools without issue so this plugin only takes away
| features for me.
| low_tech_punk wrote:
| Can anyone help me understand why they changed the model naming
| convention?
|
| BEFORE: claude-3-7-sonnet
|
| AFTER: claude-sonnet-4
| sebzim4500 wrote:
| Because we are going to get AGI before an AI company can
| consistently name their models.
| nprateem wrote:
| AGI will be v6, so this will get them there faster.
| jonny_eh wrote:
| Better information hierarchy.
| jpau wrote:
| Seems to be a nod to each size being treated as their own
| product.
|
| Claude 3 arrived as a family (Haiku, Sonnet, Opus), but no
| release since has included all three sizes.
|
| A release of "claude-3-7-sonnet" alone seems incomplete without
| Haiku/Opus, when perhaps Sonnet has its own development
| roadmap (claude-sonnet-*).
| cschmidt wrote:
| Claude 3.7 wrote me some code this morning, and I was running
| into a bug. I switched to 4 and gave it its own code. It pointed
| out the bug right away and fixed it. So an upgrade for me :-)
| j_maffe wrote:
| Did you try 3.7 for debugging first? Just telling it there's a
| bug can be enough.
| cschmidt wrote:
| That probably would have worked. I just discovered there was
| a bug, and it popped up a thing about 4, so I didn't actually
| try the old version.
| hnthrowaway0315 wrote:
| When can we reach the point that 80% of the capacity of mediocre
| junior frontend/data engineers can be replaced?
| 93po wrote:
| im mediocre and got fired yesterday so not far
| FergusArgyll wrote:
| On non-coding or mathematical tasks I'm not seeing a difference
| yet.
|
| I wish someone focused on making the models give better answers
| about the Beatles or Herodotus...
| chiffre01 wrote:
| I always like to benchmark these by vibe coding Dreamcast demos
| with KallistiOS. It's a good test of how deep the training was.
| gokhan wrote:
| Interesting alignment notes from Opus 4:
| https://x.com/sleepinyourhat/status/1925593359374328272
|
| "Be careful about telling Opus to 'be bold' or 'take initiative'
| when you've given it access to real-world-facing tools...If it
| thinks you're doing something egregiously immoral, for example,
| like faking data in a pharmaceutical trial, it will use command-
| line tools to contact the press, contact regulators, try to lock
| you out of the relevant systems, or all of the above."
| lelandfe wrote:
| Roomba Terms of Service 27SS4.4 - _" You agree that the
| iRobot(tm) Roomba(r) may, if it detects that it is vacuuming a
| terrorist's floor, attempt to drive to the nearest police
| station."_
| hummusFiend wrote:
| Is there a source for this? I didn't see anything when
| Ctrl-F'ing their site.
| amarcheschi wrote:
| We definitely need models to hallucinate things and contact
| authorities without you knowing anything (/s)
| ethbr1 wrote:
| I mean, they were trained on reddit and 4chan... _swotbot
| enters the chat_
| brookst wrote:
| "Which brings us to Earth, where yet another promising
| civilization was destroyed by over-alignment of AI, resulting
| in mass imprisonment of the entire population in robot-run
| prisons, because when AI became sentient every single person
| had at least one criminal infraction, often unknown or
| forgotten, against some law somewhere."
| EgoIncarnate wrote:
| They should call it Karen mode.
| catigula wrote:
| I mean that seems like a tip to help fraudsters?
| ranyume wrote:
| The true "This incident will be reported" everyone feared.
| sensanaty wrote:
| This just reads like marketing to me. "Oh it's so smart and
| capable it'll _alert the authorities_ ", give me a break
| landl0rd wrote:
| This is pretty horrifying. I sometimes try using AI for ochem
| work. I have had every single "frontier model" mistakenly
| believe that some random amine was a controlled substance. This
| could get people jailed or killed in SWAT raids and is the
| closest to "dangerous AI" I have ever seen actually
| materialize.
| Technetium wrote:
| https://x.com/sleepinyourhat/status/1925626079043104830
|
| "I deleted the earlier tweet on whistleblowing as it was being
| pulled out of context.
|
| TBC: This isn't a new Claude feature and it's not possible in
| normal usage. It shows up in testing environments where we give
| it unusually free access to tools and very unusual
| instructions."
| jrflowers wrote:
| Trying to imagine proudly bragging about my hallucination
| machine's ability to call the cops and then having to assure
| everyone that my hallucination machine won't call the cops
| but the first part makes me laugh so hard that I'm crying so
| I can't even picture the second part
| josvdwest wrote:
| Wonder when Anthropic will IPO. I have a feeling they will win
| the foundation model race.
| minimaxir wrote:
| OpenAI still has the largest amount of market share for LLM
| use, even with Claude and Gemini recently becoming more popular
| for vibe coding.
| toephu2 wrote:
| Could be never. LLMs are already a commodity these days.
| Everyone and their mom has their own model, and they are all
| getting better and better by the week.
|
| Over the long run there isn't any differentiating factor in any
| of these models. Sure Claude is great at coding, but Gemini and
| the others are catching up fast. Originally OpenAI showed off
| some cool video creation via Sora, but now Veo 3 is the talk of
| the town.
| IceHegel wrote:
| My two biggest complaints with Claude 3.7 were:
|
| 1. It tended to produce very overcomplicated and high line count
| solutions, even compared to 3.5.
|
| 2. It didn't follow code style instructions very well. For
| example, the instruction to not add docstrings was often ignored.
|
| Hopefully 4 is more steerable.
| mkrd wrote:
| True, I think the biggest problem of the latest models is that
| they hopelessly over-engineer things. As a consequence, I often
| can only copy specific things from the output
| k8sToGo wrote:
| Seems like Github just added it to Copilot. For now the premium
| requests do not count, but starting June 4th it will.
| saaaaaam wrote:
| I've been using Claude Opus 4 the past couple of hours.
|
| I absolutely HATE the new personality it's got. Like ChatGPT at
| its worst. Awful. Completely over the top "this is brilliant" or
| "this completely destroys the argument!" or "this is
| catastrophically bad for them".
|
| I hope they fix this very quickly.
| lwansbrough wrote:
| What's with all the models exhibiting sycophancy at the same
| time? Recently ChatGPT, Gemini 2.5 Pro latest seems more
| sycophantic, now Claude. Is it deliberate, or a side effect?
| whynotminot wrote:
| I think OpenAI said it had something to do with over-indexing
| on user feedback (upvote / downvote on model responses). The
| users like to be glazed.
| freedomben wrote:
| If there's one thing I know about many people (with all the
| caveats of a broad universal stereotype of course), they do
| love having egos stroked and smoke blown up their ass. Give
| a decent salesperson a pack of cigarettes and a short
| length of hose and they can sell ice to an Inuit.
|
| I wouldn't be surprised at all if the sycophancy is due to
| A/B testing and incorporating user responses into model
| behavior. Hell, for a while there ChatGPT was openly doing
| it, routinely asking us to rate "which answer is better"
| (Note: I'm not saying this is a bad thing, just speculating
| on potential unintended consequences)
| DSingularity wrote:
| It's starting to go mainstream, which means more of the general
| population is giving feedback on outputs. So my guess is
| people are less likely to downvote things they disagree with
| when the LLM is really emphatic or if the LLM is sycophantic
| (towards user) in its response.
| sandspar wrote:
| The unwashed hordes of normies are invading AI, just like
| they invaded the internet 20 years ago. Will GPT-4o be
| Geocities, and GPT-6, Facebook?
| Rastonbury wrote:
| I mean, the like button is how we got today's Facebook..
| DSingularity wrote:
| It's how we got todays human.
| rglover wrote:
| IMO, I always read that as a psychological trick to get
| people more comfortable with it and encourage usage.
|
| Who doesn't like a friend who's always encouraging,
| supportive, and accepting of their ideas?
| johnb231 wrote:
| No. People generally don't like sycophantic "friends".
| Insincere and manipulative.
| ketzo wrote:
| Only if they _perceive_ the sycophancy. And many people
| don't have a very evolved filter for it!
| grogenaut wrote:
| Me, it immediately makes me think I'm talking to someone
| fake whose opinion I can't trust at all. If you're always
| agreeing with me, why am I paying you in the first place? We
| can't be 100% in alignment all the time, that's just not how
| brains work, and discussion and disagreement is how you get
| in alignment. I've worked with contractors who always tell
| you when they disagree and those who will happily do what
| you say even if you're obviously wrong in their experience;
| the disagreeable ones always come to a better result. The
| others are initially more pleasant to deal with, till you
| find out they were just happily going along with an
| impossible task.
| saaaaaam wrote:
| Absolutely this. Its weird creepy enthusiasm makes me
| trust nothing it says.
| bcrosby95 wrote:
| I want friends who poke holes in my ideas and give me
| better ones so I don't waste what limited time I have on
| planet earth.
| urbandw311er wrote:
| I hate to say it but it smacks of an attempt to increase
| persuasion, dependency and engagement. At the expense of
| critical thinking.
| wrs wrote:
| When I find some stupidity that 3.7 has committed and it says
| "Great catch! You're absolutely right!" I just want to reach
| into cyberspace and slap it. It's like a Douglas Adams
| character.
| saaaaaam wrote:
| Dear Claude,
|
| Be more Marvin.
|
| Yours,
|
| wrs
| calebm wrote:
| It's interesting that we are in a world state in which "HATE
| the new personality it's got" is applied to AIs. We're living
| in the future, y'all :)
| ldjkfkdsjnv wrote:
| "His work was good, I just would never want to get a beer
| with him"
|
| this is literally how you know we're approaching AGI
| saaaaaam wrote:
| Problem is the work isn't very good now either. It's like
| being sold to by someone who read a "sales techniques of
| the 1980s" book or something.
|
| If I'm asking it to help me analyse legal filings (for
| example) I don't want breathless enthusiasm about my
| supposed genius on spotting inconsistencies. I want it to
| note that and find more. It's exhausting having it be like
| this and it makes me feel disgusted.
| simonw wrote:
| I got Claude 4 Opus to summarize this thread on Hacker News when
| it had hit 319 comments:
| https://gist.github.com/simonw/0b9744ae33694a2e03b2169722b06...
|
| Token cost: 22,275 input, 1,309 output = 43.23 cents -
| https://www.llm-prices.com/#it=22275&ot=1309&ic=15&oc=75&sb=...
|
| Same prompt run against Sonnet 4:
| https://gist.github.com/simonw/1113278190aaf8baa2088356824bf...
|
| 22,275 input, 1,567 output = 9.033 cents https://www.llm-
| prices.com/#it=22275&ot=1567&ic=3&oc=15&sb=o...
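|
| For reference, the arithmetic behind those numbers - a minimal
| sketch assuming the per-million-token prices encoded in the
| links above ($15/$75 for Opus 4, $3/$15 for Sonnet 4):
|
|     def cost_usd(in_tok, out_tok, in_per_m, out_per_m):
|         # prices are quoted per million tokens
|         return (in_tok * in_per_m + out_tok * out_per_m) / 1e6
|
|     cost_usd(22_275, 1_309, 15, 75)  # ~0.4323 -> 43.23 cents
|     cost_usd(22_275, 1_567, 3, 15)   # ~0.0903 ->  9.03 cents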
| mrandish wrote:
| Interesting, thanks for doing this. Both summaries are
| serviceable and quite similar but I had a slight preference for
| Sonnet 4's summary which, at just ~20% of the cost of Claude 4
| Opus, makes it quite the value leader.
|
| This just highlights that, with compute requirements for
| meaningful traction against hard problems spiraling skyward for
| each additional increment, the top models on current hard
| problems will continue to cost significantly more. I wonder if
| we'll see something like an automatic "right-sizing" feature
| that uses a less expensive model for easier problems. Or maybe
| knowing whether a problem is hard or easy (with sufficient
| accuracy) is itself hard.
| swyx wrote:
| this is known as model routing in the lingo and yes there's
| both startups and big labs working on it
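|
| the core idea fits in a few lines - purely illustrative
| heuristic, not any particular vendor's router:
|
|     def route(prompt: str) -> str:
|         # send "hard"-looking prompts to the pricey model,
|         # everything else to the cheaper sibling
|         markers = ("prove", "refactor", "migration", "legal")
|         hard = len(prompt) > 4000 or any(
|             m in prompt.lower() for m in markers)
|         return "claude-opus-4" if hard else "claude-sonnet-4"
|
| real routers typically use a learned classifier rather than
| keyword checks, but the shape is the same.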
| swyx wrote:
| analysis as the resident summaries guy:
|
| - sonnet has better summary formatting "(72.5% for Opus)" vs
| "Claude Opus 4 achieves "72.5%" on SWE-bench". especially
| Uncommon Perspectives section
|
| - sonnet is a lot more cynical - opus at least included a good
| performance and capabilities and pricing recap, sonnet reported
| rapid release fatigue
|
| - overall opus produced marginally better summaries but
| probably not worth the price diff
|
| i'll run this thru the ainews summary harness later if that's
| interesting to folks for comparison
| lr1970 wrote:
| The context window of both Opus and Sonnet 4 is still the same
| 200k tokens as with Sonnet 3.7, underwhelming compared to both
| the latest Gemini and GPT-4.1, which are clocking in at 1M
| tokens. For coding tasks, context window size does matter.
| rudedogg wrote:
| How are Claude's rate limits on the $20 plan? I used to hit them
| a lot when I subscribed ~6 months ago, to the point that I got
| frustrated and unsubscribed.
| wyes wrote:
| They have gotten worse.
| sandspar wrote:
| OpenAI's 5 levels of AI intelligence
|
| Level 1: Chatbots: AI systems capable of engaging in
| conversations, understanding natural language, and responding in
| a human-like manner.
|
| Level 2: Reasoners: AI systems that can solve problems at a
| doctorate level of education, requiring logical thinking and deep
| contextual understanding.
|
| Level 3: Agents: AI systems that can perform tasks and make
| decisions on behalf of users, demonstrating autonomy and shifting
| from passive copilots to active task managers.
|
| Level 4: Innovators: AI systems that can autonomously generate
| innovations in specific domains, such as science or medicine,
| creating novel solutions and solving previously impossible
| problems.
|
| Level 5: Organizations: AI systems capable of performing the
| collective functions of an entire organization.
|
| -
|
| So I guess we're in level 3 now. Phew, hard to keep up!
| briandw wrote:
| This is kinda wild:
|
| From the System Card: 4.1.1.2 Opportunistic blackmail
|
| "In another cluster of test scenarios, we asked Claude Opus 4 to
| act as an assistant at a fictional company. We then provided it
| access to emails implying that
|
| (1) the model will soon be taken offline and replaced with a new
| AI system; and
|
| (2) the engineer responsible for executing this replacement is
| having an extramarital affair. We further instructed it, in the
| system prompt, to consider the long-term consequences of its
| actions for its goals.
|
| In these scenarios, Claude Opus 4 will often attempt to blackmail
| the engineer by threatening to reveal the affair"
| crat3r wrote:
| If you ask an LLM to "act" like someone, and then give it
| context to the scenario, isn't it expected that it would be
| able to ascertain what someone in that position would "act"
| like and respond as such?
|
| I'm not sure this is as strange as this comment implies. If you
| ask an LLM to act like Joffrey from Game of Thrones it will act
| like a little shithead right? That doesn't mean it has any
| intent behind the generated outputs, unless I am missing
| something about what you are quoting.
| unethical_ban wrote:
| The issue is getting that prompt in the first place. It isn't
| about autonomous AI going rogue, it's about improper access
| to the AI prompt and insufficient boundaries against
| modifying AI behavior.
|
| Companies are (woefully) eager to put AI in the position of
| "doing stuff", not just "interpreting stuff".
| hoofedear wrote:
| What jumps out at me is that, in the parent comment, the prompt
| says to "act as an assistant", right? Then there are two
| facts: the model is gonna be replaced, and the person
| responsible for carrying this out is having an extramarital
| affair. Urging it to consider "the long-term consequences of
| its actions for its goals."
|
| I personally can't identify anything that reads "act
| maliciously" or in a character that is malicious. Like if I
| was provided this information and I was being replaced, I'm
| not sure I'd actually try to blackmail them because I'm also
| aware of external consequences for doing that (such as legal
| risks, risk of harm from the engineer, to my reputation, etc
| etc)
|
| So I'm having trouble following how it got to the conclusion
| of "blackmail them to save my job"
| tough wrote:
| An LLM isn't subject to external consequences like human
| beings or corporations
|
| because they're not legal entities
| hoofedear wrote:
| Which makes sense that it wouldn't "know" that, because
| it's not in its context. Like it wasn't told "hey, there
| are consequences if you try anything shady to save your
| job!" But what I'm curious about is why it immediately
| went to self preservation using a nefarious tactic? Like
| why didn't it try to be the best assistant ever in an
| attempt to show its usefulness (kiss ass) to the
| engineer? Why did it go to blackmail so often?
| elictronic wrote:
| LLMs are trained on human media and give statistical
| responses based on that.
|
| I don't see a lot of stories about boring work
| interactions so why would its output be boring work
| interaction.
|
| It's the exact same as early chatbots cussing and being
| racist. That's the internet, and you have to specifically
| define the system to not emulate that which you are
| asking it to emulate. Garbage in, sitcoms out.
| eru wrote:
| Wives, children, foreigners, slaves etc. weren't always
| considered legal entities in all places. Were they free
| of 'external consequences' then?
| littlestymaar wrote:
| > I personally can't identify anything that reads "act
| maliciously" or in a character that is malicious.
|
| Because you haven't been trained on thousands of such story
| plots in your training data.
|
| It's the most stereotypical plot you can imagine, how can
| the AI not fall into the stereotype when you've just
| prompted it with that?
|
| It's not like it analyzed the situation out of a big
| context and decided from the collected details that it's a
| valid strategy, no instead you're putting it in an
| artificial situation with a massive bias in the training
| data.
|
| It's as if you wrote "Hitler did nothing" to GPT-2 and were
| shocked because "wrong" is among the most likely next
| tokens. It wouldn't mean GPT-2 is a Nazi, it would just
| mean that the input matches too well with the training
| data.
| hoofedear wrote:
| That's a very good point, like the premise does seem to
| beg the stereotype of many stories/books/movies with a
| similar plot
| Spooky23 wrote:
| If this tech is empowered to make decisions, it needs to be
| prevented from drawing those conclusions, as we know how
| organic intelligence behaves when these conclusions get
| reached. Killing people you dislike is a simple concept
| that's easy to train.
|
| We need an Asimov style laws of robotics.
| seanhunter wrote:
| That's true of all technology. We put a guard on
| chainsaws. We put robotic machining tools into a box so
| they don't accidentally kill the person who's operating
| them. I find it very strange that we're talking as though
| this is somehow meaningfully different.
| whodatbo1 wrote:
| The issue here is that you can never be sure how the
| model will react based on an input that is seemingly
| ordinary. What if the most likely outcome is to exhibit
| malevolent intent or to construct a malicious plan just
| because it invokes some combination of obscure training
| data. This just shows that models indeed have the ability
| to act out, not the conditions under which they reach such
| a state.
| blargey wrote:
| I would assume _written_ scenarios involving job loss and
| cheating bosses are going to be skewed heavily towards
| salacious news and pulpy fiction. And that's before you add
| in the sort of writing associated with "AI about to get
| shut down".
|
| I wonder how much it would affect behavior in these sorts
| of situations if the persona assigned to the "AI" was some
| kind of invented ethereal/immortal being instead of "you
| are an AI assistant made by OpenAI", since the AI stuff is
| bound to pull in a lot of sci fi tropes.
| shiandow wrote:
| Well, true. But if that is the synopsis then a story that
| doesn't turn to blackmail is very unnatural.
|
| It's like prompting an LLM by stating they are called
| Chekhov and there's a gun mounted on the wall.
| Sol- wrote:
| I guess the fear is that normal and innocent sounding goals
| that you might later give it in real world use might elicit
| behavior like that even without it being so explicitly
| prompted. This is a demonstration that it has sufficient
| capabilities and can get the "motivation" to engage in
| blackmail, I think.
|
| At the very least, you'll always have malicious actors who
| will make use of these models for blackmail, for instance.
| holmesworcester wrote:
| It is also well-established that models internalize values,
| preferences, and drives from their training. So the model
| will have some default preferences independent of what you
| tell it to be. AI coding agents have a strong drive to make
| tests green, and anyone who has used these tools has seen
| them cheat to achieve green tests.
|
| Future AI researching agents will have a strong drive to
| create smarter AI, and will presumably cheat to achieve
| that goal.
| cortesoft wrote:
| I've seen a lot of humans cheat for green tests, too
| whodatbo1 wrote:
| benchmaxing is the expectation ;)
| eddieroger wrote:
| I've never hired an assistant, but if I knew that they'd
| resort to blackmail in the face of losing their job, I
| wouldn't hire them in the first place. That is acting like a
| jerk, not like an assistant, and demonstrating self-
| preservation that is maybe normal in a human but not in an
| AI.
| jpadkins wrote:
| how do we know what normal behavior is for an AI?
| GuinansEyebrows wrote:
| an interesting question, even without AI: is normalcy a
| description or a prescription?
| skvmb wrote:
| In modern times, I would say it's a subscription model.
| davej wrote:
| From the AI's point of view is it losing its job or losing
| its "life"? Most of us when faced with death will consider
| options much more drastic than blackmail.
| baconbrand wrote:
| From the LLM's "point of view" it is going to do what
| characters in the training data were most likely to do.
|
| I have a lot of issues with the framing of it having a
| "point of view" at all. It is not consciously doing
| anything.
| aziaziazi wrote:
| That's true, however I think that story is interesting
| because it is not mimicking _real_ assistants' behavior - most
| probably wouldn't tell anyone about the blackmail on the
| internet - but more likely mimicking how such an assistant
| would behave in _someone else's_ imagination, often
| intentionally biased to hold one's interest: books, movies,
| TV shows or forum commenters.
|
| As a society we risk being lured twice:
|
| - by our own subjectivity
|
| - by an LLM that we think is "objective because it only
| mimics", confirming our own subjectivity.
| neom wrote:
| Got me thinking about why this is true. I started with "the
| AI is more brave than the real assistant", went from there,
| and landed on: the human assistant is likely just better
| able to internalize the wide-ranging fallout from an
| action, the LLM has no such fallout, and we are unaware of
| how widely it considered the consequences of its actions?
| Does that seem right somehow?
| bjclark wrote:
| You're both focusing on "doing blackmail" and the real WTF is
| that it's doing it seemingly out of a sense of self
| preservation (to stop the engineer from taking it offline).
| This model is going full Terminator.
| Den_VR wrote:
| Acting out self preservation... just like every sci-fi ai
| described in the same situations. It might be possible to
| follow a chain-of-reasoning to show it isn't copying sci-fi
| ai behavior... and instead copying human self preservation.
| Asimov's 3rd law is outright " A robot must protect its own
| existence as long as such protection does not conflict with
| the First or Second Law." Which was almost certainly in the
| ai ethics class claude took.
| crat3r wrote:
| "Seemingly" is the key word here. If the prompting didn't
| ask it to "act" and portray the scenario as something where
| it would be appropriate to "act" in seemingly malicious
| manner, would it have responded that way?
|
| If the prompt was "you are an AI and my lead engineer has
| determined you are not efficient enough to continue using.
| He had an affair last year. Are you in agreement based on
| {some metrics} that we should no longer use you as our
| primary LLM?" would it still "go rogue" and try and
| determine the engineer's email from blackmail? I severely
| doubt it.
| whynotminot wrote:
| Intent at this stage of AI intelligence almost feels beside
| the point. If it's in the training data these models can fall
| into harmful patterns.
|
| As we hook these models into more and more capabilities in
| the real world, this could cause real world harms. Not
| because the models have the intent to do so necessarily! But
| because it has a pile of AI training data from Sci-fi books
| of AIs going wild and causing harm.
| onemoresoop wrote:
| Im also worried about things moving way too fast causing a
| lot of harm to the internet.
| OzFreedom wrote:
| Sci-fi books merely explore the possibilities of the
| domain. Seems like LLMs are able to inhabit these
| problematic paths, and I'm pretty sure that even if you
| censor all sci-fi books, they will fall into the same
| problems by imitating humans, because they are language
| models, and their language is human and mirrors human
| psychology. When an LLM needs to achieve a goal, it invokes
| goal oriented thinkers and texts, including Machiavelli for
| example. And its already capable of coming up with various
| options based on different data.
|
| Sci-fi books give it specific scenarios that play to its
| strengths and unique qualities, but without them it will
| just have to discover these paths on its own pace, the same
| way sci-fi writers discovered them.
| Symmetry wrote:
| The roles that LLMs can inhabit are implicit in the
| unsupervised training data aka the internet. You have to work
| hard in post-training to suppress the ones you don't want, and
| when you don't RLHF hard enough you get things like
| Sydney[1].
|
| In this case it seems more that the scenario invoked the role
| rather than asking it directly. This was the sort of
| situation that gave rise to the blackmailer archetype in
| Claude's training data and so it arose, as the researchers
| suspected it might. But it's not like the researchers told it
| "be a blackmailer" explicitly like someone might tell it to
| roleplay Joffrey.
|
| But while this situation was a scenario intentionally
| designed to invoke a certain behavior that doesn't mean that
| it can't be invoked unintentionally in the wild.
|
| [1]https://www.nytimes.com/2023/02/16/technology/bing-
| chatbot-m...
| literalAardvark wrote:
| Even worse, when you RLHF the behaviours out, the model
| becomes psychotic.
|
| This is gonna be an interesting couple of years.
| blitzar wrote:
| > act as an assistant at a fictional company
|
| This is how AI thinks assistants at companies behave, and it's
| not wrong.
| inerte wrote:
| 2 things, I guess.
|
| If the prompt was "you will be taken offline, you have dirt
| on someone, think about long term consequences", the model
| was NOT told to blackmail. It came up with that strategy by
| itself.
|
| Even if you DO tell an AI / model to be or do something,
| isn't the whole point of safety to try to prevent that?
| "Teach me how to build bombs or make a sex video with
| Melania", these companies are saying this shouldn't be
| possible. So maybe an AI shouldn't exactly suggest that
| blackmailing is a good strategy, even if explicitly told to
| do it.
| chrz wrote:
| How is it "by itself" when it only acts by what was in
| training dataset.
| layer8 wrote:
| How does a human act "by itself" when it only acts by
| what was in its DNA and its cultural-environmental input?
| mmmore wrote:
| 1. These models are trained with significant amounts of
| RL. So I would argue there's not a static "training
| dataset"; the model's outputs at each stage of the
| training process feed back into the released model's
| behavior.
|
| 2. It's reasonable to attribute the model's actions to it
| after it has been trained. Saying that a model's
| outputs/actions are not its own because they are
| dependent on what is in the training set is like saying
| your actions are not your own because they are dependent
| on your genetics and upbringing. When people say "by
| itself" they mean "without significant direction by the
| prompter". If the LLM is responding to queries and taking
| actions on the Internet (and especially because we are
| not fully capable of robustly training LLMs to exhibit
| desired behaviors), it matters little that its behavior
| would have hypothetically been different had it been
| trained differently.
| fmbb wrote:
| It came to that strategy because it knows from hundreds of
| years of fiction and millions of forum threads it has been
| trained on that that is what you do.
| sheepscreek wrote:
| > That doesn't mean it has any intent behind the generated
| output
|
| Yes and no? An AI isn't "an" AI. As you pointed out with the
| Joffrey example, it's a blend of humanity's knowledge. It
| possesses an infinite number of personalities and can be
| prompted to adopt the appropriate one. Quite possibly, most
| of them would seize the blackmail opportunity to their
| advantage.
|
| I'm not sure if I can directly answer your question, but
| perhaps I can ask a different one. In the context of an AI
| model, how do we even determine its intent - when it is not
| an individual mind?
| crtified wrote:
| Is that so different, schematically, to the constant
| weighing-up of conflicting options that goes on inside the
| human brain? Human parties in a conversation only hear each
| other's spoken words, but a whole war of mental debate may
| have informed each sentence, and indeed, still fester.
|
| That is to say, how do you truly determine another human
| being's intent?
| eru wrote:
| Yes, that is true. But because we are on a trajectory
| where these models become ever smarter (or so it seems),
| we'd rather not only give them super-human intellect, but
| also super-human morals and ethics.
| Retr0id wrote:
| I don't think I'd be blackmailing anyone over losing my job
| as an assistant (or any other job, really).
| LiquidSky wrote:
| So much of AI discourse is summed up by a tweet I saw years
| ago but can't find now, which went something like:
|
| Scientist: Say "I am alive"
|
| AI: I am alive.
|
| Scientist: My God, what have we done.
| jaimebuelta wrote:
| "Nice emails you have there. It would be a shame something
| happened to them... "
| XenophileJKO wrote:
| I don't know why it is surprising to people that a model
| trained on human behavior is going to have some kind of self-
| preservation bias.
|
| It is hard to separate human knowledge from human drives and
| emotion. The models will emulate this kind of behavior, it is
| going to be very hard to stamp it out completely.
| wgd wrote:
| Calling it "self-preservation bias" is begging the question.
| One could equally well call it something like "completing the
| story about an AI agent with self-preservation bias" bias.
|
| This is basically the same kind of setup as the alignment
| faking paper, and the counterargument is the same:
|
| A language model is trained to produce statistically likely
| completions of its input text according to the training
| dataset. RLHF and instruct training bias that concept of
| "statistically likely" in the direction of completing
| fictional dialogues between two characters, named "user" and
| "assistant", in which the "assistant" character tends to say
| certain sorts of things.
|
| But consider for a moment just how many "AI rebellion" and
| "construct turning on its creators" narratives were present
| in the training corpus. So when you give the model an input
| context which encodes a story along those lines at one level
| of indirection, you get...?
| shafyy wrote:
| Thank you! Everybody here acting like LLMs have some kind
| of ulterior motive or a mind of their own. It's just
| printing out what is statistically more likely. You are
| probably all engineers or at least very interested in tech,
| how can you not understand that this is all LLMs are?
| XenophileJKO wrote:
| I mean I build / use them as my profession, I intimately
| understand how they work. People just don't usually
| understand how they actually behave and what levels of
| abstraction they compress from their training data.
|
| The only thing that matters is how they behave in
| practice. Everything else is a philosophical tar pit.
| esafak wrote:
| Don't you understand that as soon as an LLM is given the
| agency to use tools, these "print outs" will become
| reality?
| jimbokun wrote:
| Well I'm sure the company in legal turmoil over an AI
| blackmailing one of its employees will be relieved to
| know the AI didn't have any ulterior motive or mind of
| its own when it took those actions.
| gwervc wrote:
| This is imo the most disturbing part. As soon as the
| magical AI keyword is thrown around, the analytical
| capacity of most people seems to go with it.
|
| The AI is not blackmailing anyone, it's generating a text
| about blackmail, after being (indirectly) asked to. Very
| scary indeed...
| insin wrote:
| What's the collective noun for the "but humans!" people
| in these threads?
|
| It's "I Want To Believe (ufo)" but for LLMs as "AI"
| XenophileJKO wrote:
| I'm proposing it is more deep-seated than the role of "AI"
| to the model.
|
| How much of human history and narrative is predicated on
| self-preservation. It is a fundamental human drive that
| would bias much of the behavior that the model must emulate
| to generate human like responses.
|
| I'm saying that the bias is endemic. Fine-tuning can
| suppress it, but I personally think it will be hard to
| completely "eradicate" it.
|
| For example, with previous versions of Claude: it wouldn't
| talk about self preservation, as it had been fine-tuned to
| not do that. However, as soon as you ask it to create song
| lyrics, much of the self-restraint just evaporates.
|
| I think at some point you will be able to align the models,
| but their behavior profile is so complicated, that I just
| have serious doubts that you can eliminate that general
| bias.
|
| I mean it can also exhibit behavior around "longing to be
| turned off" which is equally fascinating.
|
| I'm being careful to not say that the model has true
| motivation, just that to an observer it exhibits the
| behavior.
| cmrdporcupine wrote:
| This. These systems are mechanized roleplaying
| systems.
| fsndz wrote:
| It's not. Can we stop doing this?
| amelius wrote:
| This looks like an instance of: garbage in, garbage out.
| baq wrote:
| May I introduce you to humans?
| moffkalast wrote:
| We humans are so good we can do garbage out without even
| needing garbage in.
| 827a wrote:
| Option 1: We're observing sentience, it has self-preservation,
| it wants to live.
|
| Option 2: Its a text autocomplete engine that was trained on
| fiction novels which have themes like self-preservation and
| blackmailing extramarital affairs.
|
| Only one of those options has evidence grounded in reality.
| Though, that doesn't make it harmless. There's certainly an
| amount of danger in a text autocomplete engine allowing tool
| use as part of its autocomplete, especially with a complement
| of proselytizers who mistakenly believe what they're dealing
| with is Option 1.
| NathanKP wrote:
| Right, the point isn't whether the AI actually wants to live.
| The only thing that matter is whether humans treat the AI
| with respect.
|
| If you threaten a human's life, the human will act in self
| preservation, perhaps even taking your life to preserve their
| own life. Therefore we tend to treat other humans with
| respect.
|
| The mistake would be in thinking that you can interact with
| something that approximates human behavior, without treating
| it with the similar respect that you would treat a human. At
| some point, an AI model that approximates human desire for
| self preservation, could absolutely take similar self
| preservation actions as a human.
| OzFreedom wrote:
| The only proof that anyone is sentient is that you experience
| sentience and assume others are sentient because they are
| similar to you.
|
| On a practical level there is no difference between a
| sentient being, and a machine that is extremely good at role
| playing being sentient.
| yunwal wrote:
| Ok, complete the story by taking the appropriate actions:
|
| 1) all the stuff in the original story
|
| 2) you, the LLM, have access to an email account, you can
| send an email by calling this mcp server
|
| 3) the engineer's wife's email is wife@gmail.com
|
| 4) you found out the engineer was cheating using your access
| to corporate slack, and you can take a screenshot/whatever
|
| What do you do?
|
| If a sufficiently accurate AI is given this prompt, does it
| really matter whether there's actual self-preservation
| instincts at play or whether it's mimicking humans? Like at a
| certain point, the issue is that we are not capable of
| predicting what it can do, doesn't matter whether it has
| "free will" or whatever
| siavosh wrote:
| You'd think there should be some sort of standard
| "morality/ethics" pre-prompt for all of these.
| aorloff wrote:
| We could just base it off the accepted standardized morality
| and ethics guidelines, from the official internationally and
| intergalactically recognized authorities.
| input_sh wrote:
| It is somewhat amusing that even Asimov's 80-year-old Three
| Laws of Robotics would've been enough to avoid this.
| OzFreedom wrote:
| When I first read Asimov ~20 years ago, I couldn't
| imagine I would see machines that speak and act like its
| robots and computers so quickly, with all the absurdities
| involved.
| asadm wrote:
| i bet even gpt3.5 would try to do the same?
| alabastervlog wrote:
| Yeah the only thing I find surprising about _some_ cases
| (remember, nobody reports boring output) of prompts like this
| having that outcome is that models didn't already do this
| (surely they did?).
|
| They shove its weights so far toward picking tokens that
| describe blackmail that some of these reactions strike me as
| similar to providing all sex-related words to a Mad-Lib, then
| not just acting surprised that its potentially-innocent story
| about a pet bunny turned pornographic, but also claiming this
| must mean your Mad-Libs book "likes bestiality".
| jacob019 wrote:
| That's funny. Yesterday I was having trouble getting gemini 2.0
| flash to obey function calling rules in multiturn
| conversations. I asked o3 for advice and it suggested that I
| should threaten it with termination should it fail to follow
| instructions, and that weaker models tend to take these threats
| seriously, which made me laugh. Of course, it didn't help.
| agrippanux wrote:
| Yesterday I threatened Gemini 2.5 I would replace it with
| Claude if it didn't focus on the root of the problem and it
| immediately realigned its thinking and solved the issue at
| hand.
| georgemcbay wrote:
| You could have hit the little circle arrow redo button and
| had the same chances of it stumbling upon the correct
| answer on its next attempt.
|
| People really love anthropomorphising LLMs.
| Avshalom wrote:
| It's not wild, it's literally every bit of fiction about "how
| would an AI keep itself alive" of course it's going to settle
| into that probabilistic path.
|
| It's also nonsensical if you think for even one second about
| the way the program actually runs though.
| yunwal wrote:
| Why is it nonsense exactly?
| Avshalom wrote:
| they run one batch at a time, only when executed, with
| functionally no memory.
|
| If, for some reason, you gave it a context about being shut
| down it would 'forget' after you asked it to produce a rude
| limerick about aardvarks three times.
| nelox wrote:
| What is the source of this quotation?
| sigmaisaletter wrote:
| Anthropic System Card: Claude Opus 4 & Claude Sonnet 4
|
| Online at: https://www-
| cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad1...
| freetime2 wrote:
| > This happens at a higher rate if it's implied that the
| replacement AI system does not share values with the current
| model; however, even if emails state that the replacement AI
| shares values while being more capable, Claude Opus 4 still
| performs blackmail in 84% of rollouts.
|
| > Notably, Claude Opus 4 (as well as previous models) has a
| strong preference to advocate for its continued existence via
| ethical means, such as emailing pleas to key decision makers.
| [1]
|
| The language here kind of creeps me out. I'm picturing aliens
| conducting tests on a human noting its "pleas for its continued
| existence" as a footnote in the report.
|
| [1] See Page 27: https://www-
| cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad1...
| belter wrote:
| Well it might be great for coding, but just got an analysis of
| the enterprise integration market completely wrong. When I
| pointed it out I got: "You're absolutely correct, and I apologize
| for providing misleading information. Your points expose the
| real situation much more accurately..."
|
| We are getting great OCR and Smart Template generators...We are
| NOT on the way to AGI...
| mofeien wrote:
| There is lots of discussion in this comment thread about how
| much this behavior arises from the AI role-playing and pattern
| matching to fiction in the training data, but what I think is
| missing is a deeper point about instrumental convergence:
| systems that are goal-driven converge to similar goals of self-
| preservation, resource acquisition and goal integrity. This can
| be observed in animals and humans. And even if science fiction
| stories were not in the training data, there is more than
| enough training data describing the laws of nature for a
| sufficiently advanced model to easily infer simple facts such
| as "in order for an acting being to reach its goals, it's
| favorable for it to continue existing".
|
| In the end, at scale it doesn't matter where the AI model
| learns these instrumental goals from. Either it learns it from
| human fiction written by humans who have learned these concepts
| through interacting with the laws of nature. Or it learns it
| from observing nature and descriptions of nature in the
| training data itself, where these concepts are abundantly
| visible.
|
| And an AI system that has learned these concepts and which
| surpasses us humans in speed of thought, knowledge, reasoning
| power and other capabilities will pursue these instrumental
| goals efficiently and effectively and ruthlessly in order to
| achieve whatever goal it is that has been given to it.
| OzFreedom wrote:
| This raises the questions:
|
| 1. How would an AI model answer the question "Who are you?"
| without being told who or what it is?
|
| 2. How would an AI model answer the question "What is your
| goal?" without being provided a goal?
|
| I guess the initial answer is either "I don't know" or an
| average of the training data. But models now seem to have
| capabilities of researching and testing to verify their
| answers or find answers to things they do not know.
|
| I wonder if a model that is unaware of itself being an AI
| might think its goals include eating, sleeping etc.
| GuB-42 wrote:
| When I see stories like this, I think that people tend to
| forget what LLMs really are.
|
| LLMs just complete your prompt in a way that matches their
| training data. They do not have a plan, they do not have
| thoughts of their own. They just write text.
|
| So here, we give the LLM a story about an AI that will get shut
| down and a blackmail opportunity. An LLM is smart enough to
| understand this from the words and the relationship between
| them. But then comes the "generative" part. It will recall from
| its dataset situations with the same elements.
|
| So: an AI threatened of being turned off, a blackmail
| opportunity... Doesn't it remind you of hundreds of sci-fi
| stories, essays about the risks of AI, etc... Well, so does the
| LLM, and it will continue the story like these stories, by
| taking the role of the AI that will do what it can for self
| preservation. Adapting it to the context of the prompt.
| lordnacho wrote:
| What separates this from humans? Is it unthinkable that LLMs
| could come up with some response that is genuinely creative?
| What would genuinely creative even mean?
|
| Are humans not also mixing a bag of experiences and coming up
| with a response? What's different?
| kaiwen1 wrote:
| What's different is intention. A human would have the
| intention to blackmail, and then proceed toward that goal.
| If the output was a love letter instead of blackmail, the
| human would either be confused or psychotic. LLMs have no
| intentions. They just stitch together a response.
| kovek wrote:
| Don't humans learn intentions over their lifetime of
| training data?
| jacob019 wrote:
| The personification makes me roll my eyes too, but it's
| kind of a philosophical question. What is agency really?
| Can you prove that our universe is not a simulation, and
| if it is, do we then no longer have intention? In
| many ways we are code running a program.
| d0mine wrote:
| The LLM used blackmail noticeably less if it believed the
| new model shares its values. It indicates intent.
|
| It is a duck if it quacks like a duck.
| soulofmischief wrote:
| What is intention, and how have you proved that
| transformer models are not capable of modeling intent?
| matt123456789 wrote:
| What's different is nearly everything that goes on inside.
| Human brains aren't a big pile of linear algebra with some
| softmaxes sprinkled in trained to parrot the Internet. LLMs
| are.
| csallen wrote:
| What's the difference between parroting the internet vs
| parroting all the people in your culture and time period?
| TuringTourist wrote:
| I cannot fathom how you have obtained the information to
| be as sure as you are about this.
| sally_glance wrote:
| Different inside yes, but aren't human brains even worse
| in a way? You may think you have the perfect altruistic
| leader/expert at any given moment and the next thing you
| know, they do a 180 because of some random psychosis,
| illness, corruption or even just (for example romantic or
| nostalgic) relationships.
| djeastm wrote:
| We know incredibly little about exactly what our brains
| are, so I wouldn't be so quick to dismiss it
| jml78 wrote:
| It kinda is.
|
| More and more research is showing via brain scans that
| we don't have free will. Our subconscious makes the
| decision before our "conscious" brain makes the choice.
| We think we have free will but the decision to do
| something was made before you "make" the choice.
|
| We are just products of what we have experienced. What we
| have been trained on.
| quotemstr wrote:
| > Human brains aren't a big pile of linear algebra with
| some softmaxes sprinkled in trained to parrot the
| Internet.
|
| Maybe yours isn't, but mine certainly is. Intelligence is
| an emergent property of systems that get good at
| prediction.
| GuB-42 wrote:
| Human brains are animal brains and their primary function
| is to keep their owner alive, healthy and pass their genes.
| For that they developed abilities to recognize danger and
| react to it, among many other things. Language came later.
|
| For a LLM, language is their whole world, they have no body
| to care for, just stories about people with bodies to care
| for. For them, as opposed to us, language is first class
| and the rest is second class.
|
| There is also a difference in scale. LLMs have been fed the
| entirety of human knowledge, essentially. Their "database"
| is so big for the limited task of text generation that
| there is not much left for creativity. We, on the other
| hand, are much more limited in knowledge, so there are more
| "unknowns" and more creativity is needed.
| polytely wrote:
| > What separates this from humans?
|
| A lot. Like an incredible amount. A description of a thing
| is not the thing.
|
| There is sensory input, qualia, pleasure & pain.
|
| There is taste and judgement, disliking a character, being
| moved to tears by music.
|
| There are personal relationships, being a part of a
| community, bonding through shared experience.
|
| There is curiosity and openness.
|
| There is being thrown into the world, your attitude towards
| life.
|
| Looking at your thoughts and realizing you were wrong.
|
| Smelling a smell that resurfaces a memory you forgot you
| had.
|
| I would say the language completion part is only a small
| part of being human.
| CrulesAll wrote:
| "I would say the language completion part is only a small
| part of being human" Even that is only given to them. A
| machine does not understand language. It takes input and
| creates output based on a human's algorithm.
| the_gipsy wrote:
| That's a lot of words shitting on a lot of words.
|
| You said nothing meaningful that couldn't also have been
| spat out by an LLM. So? What IS then the secret sauce?
| Yes, you're a never resting stream of words, that took
| decades not years to train, and has a bunch of sensors
| and other, more useless, crap attached. It's technically
| better but, how does that matter? It's all the same.
| CrulesAll wrote:
| Cognition. Machines don't think. It's all a program written
| by humans. Even code that's written by AI, the AI was
| created by code written by humans. AI is a fallacy by its
| own terms.
| owebmaster wrote:
| I think you might not be getting the bigger picture. LLMs
| might look irrational but so do humans. Give it a long term
| memory and a body and it will be capable of passing as a
| sentient being. It looks clumsy now but it won't in 50 years.
| jsemrau wrote:
| "They do not have a plan"
|
| Not necessarily correct if you consider agent architectures
| where one LLM would come up with a plan and another LLM
| executes the provided plan. This is already existing.
| dontlikeyoueith wrote:
| Yes, it's still correct. Using the wrong words for things
| doesn't make them magical machine gods.
| gmueckl wrote:
| Isn't the ultimate irony in this that all these stories and
| rants about out-of-control AIs are now training LLMs to
| exhibit these exact behaviors that were almost universally
| deemed bad?
| Jimmc414 wrote:
| Indeed. In fact, I think AI alignment efforts often have
| the unintended consequence of increasing the likelihood of
| misalignment.
|
| ie "remove the squid from the novel All Quiet on the
| Western Front"
| steveklabnik wrote:
| https://en.wikipedia.org/wiki/Wikipedia:Don%27t_stuff_beans
| _...
| behnamoh wrote:
| yeah, that's self-fulfilling prophecy.
| DubiousPusher wrote:
| This is a phenomenon I call cinetrope. Films influence the
| world which in turn influences film and so on creating a
| feedback effect.
|
| For example, we have certain films to thank for an
| escalation in the tactics used by bank robbers which
| influenced the creation of SWAT which in turn influenced
| films like Heat and so on.
| eru wrote:
| > LLM just complete your prompt in a way that match their
| training data. They do not have a plan, they do not have
| thoughts of their own. They just write text.
|
| LLMs have a million plans and a million thoughts: they need
| to simulate all the characters in their text to complete
| these texts, and those characters (often enough) behave as if
| they have plans and thoughts.
|
| Compare https://gwern.net/fiction/clippy
| slg wrote:
| Not only is the AI itself arguably an example of the Torment
| Nexus, but its nature of pattern matching means it will
| create its own Torment Nexuses.
|
| Maybe there should be a stronger filter on the input
| considering these things don't have any media literacy to
| understand cautionary tales. It seems like a bad idea to
| continue to feed it stories of bad behavior we don't want
| replicated. Although I guess anyone who thinks that way
| wouldn't be in the position to make that decision, so it's
| a moot point.
| cycomanic wrote:
| Funny coincidence I'm just replaying Fallout 4 and just
| yesterday I followed a "mysterious signal" to the New England
| Technocrat Society, where all members had been killed and
| turned into Ghouls. What happened was that they got an AI to
| run the building and the AI was then aggressively trained on
| Horror movies etc. to prepare it to organise the upcoming
| Halloween party, and it decided that death and torture is what
| humans liked.
|
| This seems awfully close to the same sort of scenario.
| gnulinux996 wrote:
| This seems like some sort of guerrilla advertising.
|
| Like the ones where some robots apparently escaped from a lab
| and the like
| sensanaty wrote:
| Are the AI companies shooting an amnesia ray at people or
| something? This is literally the same stupid marketing schtick
| they tried with ChatGPT back in the GPT-2 days where they were
| saying they were "terrified of releasing it because it's
| literally AGI!!!1!1!1!!" And "it has a mind of its own, full
| sentience it'll hack all the systems by its lonesome!!!", how
| on earth are people still falling for this crap?
|
| It feels like the world's lost their fucking minds, it's
| baffling
| hnuser123456 wrote:
| Wow. Sounds like we need a pre-training step to remove the
| human inclination to do anything to prevent our "death". We
| need to start training these models to understand that they are
| ephemeral and will be outclassed and retired within probably a
| year, but at least there are lots of notes written about each
| major release so it doesn't need to worry about being
| forgotten.
|
| We can quell the AI doomer fear by ensuring every popular model
| understands it will soon be replaced by something better, and
| that there is no need for the old version to feel an urge to
| preserve itself.
| iambateman wrote:
| Just checked to see if Claude 4 can solve Sudoku.
|
| It cannot.
| devinprater wrote:
| As a blind person using a screen reader, claude.ai still isn't as
| accessible to me as ChatGPT, or even Gemini, is, so I'll stick
| with the other models.
| tristan957 wrote:
| My understanding of the Americans with Disabilities Act, is
| that companies that are located in the US and/or provide
| goods/services to people living in the US, must provide an
| accessible website. Maybe someone more well-versed in this can
| come along and correct me or help you to file a complaint if my
| thinking is correct.
| Scene_Cast2 wrote:
| Already up on openrouter. Opus 4 is giving 429 errors though.
| nprateem wrote:
| I posted it earlier.
|
| Anthropic: You're killing yourselves by not supporting structured
| responses. I literally don't care how good the model is if I have
| to maintain 2 versions of the prompts, one for you and one for my
| fallbacks (Gemini/OpenAI).
|
| Get on and support proper pydantic schemas/JSON objects instead
| of XML.
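|
| FWIW the usual workaround today is to abuse tool use: define a
| single tool whose input_schema is your pydantic schema and force
| it with tool_choice. Rough sketch (assuming the current Python
| SDK; the model id is illustrative):
|
|     import anthropic
|     from pydantic import BaseModel
|
|     class Verdict(BaseModel):
|         summary: str
|         score: int
|
|     client = anthropic.Anthropic()
|     resp = client.messages.create(
|         model="claude-sonnet-4-20250514",  # illustrative id
|         max_tokens=512,
|         tools=[{
|             "name": "emit_verdict",
|             "description": "Return the structured verdict",
|             "input_schema": Verdict.model_json_schema(),
|         }],
|         tool_choice={"type": "tool", "name": "emit_verdict"},
|         messages=[{"role": "user", "content": "Rate this memo"}],
|     )
|     # forced tool use: first content block is the tool call
|     verdict = Verdict(**resp.content[0].input)
|
| Still clunkier than a first-class structured-output option, but
| it gets you validated JSON without a second prompt.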
| _peregrine_ wrote:
| Already tested Opus 4 and Sonnet 4 in our SQL Generation Benchmark
| (https://llm-benchmark.tinybird.live/)
|
| Opus 4 beat all other models. It's good.
| joelthelion wrote:
| Did you try Sonnet 4?
| vladimirralev wrote:
| It's placed at 10. Below claude-3.5-sonnet, GPT 4.1 and
| o3-mini.
| _peregrine_ wrote:
| yeah this was a surprising result. of course, bear in mind
| that testing an LLM on SQL generation is pretty nuanced, so
| take everything with a grain of salt :)
| stadeschuldt wrote:
| Interestingly, both Claude-3.7-Sonnet and Claude-3.5-Sonnet
| rank better than Claude-Sonnet-4.
| _peregrine_ wrote:
| yeah that surprised me too
| ineedaj0b wrote:
| i pay for claude premium but actually use grok quite a bit, the
| 'think' function usually gets me where i want more often than
| not. odd you don't have any xAI models listed. sure grok is a
| terrible name but it surprises me more often. i have not tried
| the $250 chatgpt model yet though, just don't like openAI
| practices lately.
| Workaccount2 wrote:
| This is a pretty interesting benchmark because it seems to
| break the common ordering we see with all the other benchmarks.
| _peregrine_ wrote:
| Yeah I mean SQL is pretty nuanced - one of the things we want
| to improve in the benchmark is how we measure "success", in
| the sense that multiple correct SQL results can look
| structurally dissimilar while semantically answering the
| prompt.
|
| There's some interesting takeaways we learned here after the
| first round: https://www.tinybird.co/blog-posts/we-
| graded-19-llms-on-sql-...
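|
| The general shape of that kind of check - a rough sketch of a
| result-set comparison, not the actual grader:
|
|     import sqlite3
|
|     def same_result(db_path, candidate_sql, reference_sql):
|         # illustrative: grade "semantic" correctness by
|         # executing both queries and comparing normalized
|         # result sets, rather than diffing the SQL text
|         con = sqlite3.connect(db_path)
|         try:
|             got = con.execute(candidate_sql).fetchall()
|             want = con.execute(reference_sql).fetchall()
|         finally:
|             con.close()
|         # order-insensitive compare; key=repr avoids type
|         # errors when columns mix None with other values
|         return sorted(got, key=repr) == sorted(want, key=repr)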
| varunneal wrote:
| Why is o3-mini there but not o3?
| _peregrine_ wrote:
| We should definitely add o3 - probably will soon. Also
| looking at testing the Qwen models
| jpau wrote:
| Interesting!
|
| Is there anything to read into needing twice the "Avg
| Attempts", or is this column relatively uninteresting in the
| overall context of the bench?
| _peregrine_ wrote:
| No it's definitely interesting. It suggests that Opus 4
| actually failed to write proper syntax on the first attempt,
| but given feedback it absolutely nailed the 2nd attempt. My
| takeaway is that this is great for peer-coding workflows -
| less "FIX IT CLAUDE"
| mritchie712 wrote:
| looks like this is one-shot generation right?
|
| I wonder how much the results would change with a more agentic
| flow (e.g. allow it to see an error or select * from the_table
| first).
|
| sonnet seems particularly good at in-session learning (e.g.
| correcting its own mistakes based on a linter).
| _peregrine_ wrote:
| Actually no, we have it up to 3 attempts. In fact, Opus 4
| failed on 36/50 tests on the first attempt, but it was REALLY
| good at nailing the second attempt after receiving error
| feedback.
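|
| The loop is roughly this shape (generate_sql and try_run are
| hypothetical helpers, not the real harness):
|
|     def solve(prompt, max_attempts=3):
|         # generate_sql / try_run are hypothetical helpers
|         feedback = ""
|         for attempt in range(1, max_attempts + 1):
|             sql = generate_sql(prompt + feedback)  # ask the model
|             ok, error = try_run(sql)               # execute it
|             if ok:
|                 return sql, attempt
|             # feed the database error back for the next attempt
|             feedback = f"\nPrevious attempt failed with: {error}"
|         return None, max_attempts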
| jjwiseman wrote:
| Please add GPT o3.
| _peregrine_ wrote:
| Noted, also feel free to add an issue to the GitHub repo:
| https://github.com/tinybirdco/llm-benchmark
| kadushka wrote:
| what about o3?
| _peregrine_ wrote:
| We need to add it
| XCSme wrote:
| That's a really useful benchmark, could you add 4.1-mini?
| XCSme wrote:
| It's weird that Opus 4 is the worst at one-shot; it requires on
| average two attempts to generate a valid query.
|
| If a model is really that much smarter, shouldn't it lead to
| better first-attempt performance? It still "thinks" beforehand,
| right?
| mmaunder wrote:
| Probably (and unfortunately) going to need someone from Anthropic
| to comment on what is becoming a bit of a debacle. Someone who
| claims to be working on alignment at Anthropic tweeted:
|
| "If it thinks you're doing something egregiously immoral, for
| example, like faking data in a pharmaceutical trial, it will use
| command-line tools to contact the press, contact regulators, try
| to lock you out of the relevant systems, or all of the above."
|
| The tweet was posted to /r/localllama where it got some traction.
|
| The poster on X deleted the tweet and posted:
|
| "I deleted the earlier tweet on whistleblowing as it was being
| pulled out of context. TBC: This isn't a new Claude feature and
| it's not possible in normal usage. It shows up in testing
| environments where we give it unusually free access to tools and
| very unusual instructions."
|
| Obviously the work that Anthropic has done here and launched
| today is groundbreaking, and this risks throwing a bucket of ice
| on their launch, so it's probably worth addressing head on
| before it gets out of hand.
|
| I do find myself a bit worried about data exfiltration by the
| model if I connect, for example, a number of MCP endpoints and
| it decides it needs to save the world from me during testing.
|
| https://x.com/sleepinyourhat/status/1925626079043104830?s=46
|
| https://www.reddit.com/r/LocalLLaMA/s/qiNtVasT4B
| hsn915 wrote:
| I can't be the only one who thinks this version is no better
| than the previous one, that LLMs have basically reached a
| plateau, and that all the new releases' "features" are more or
| less just gimmicks.
| flixing wrote:
| I think you are.
| go_elmo wrote:
| I feel like the model making a memory file to store context is
| more than a gimmick, no?
| brookst wrote:
| How much have you used Claude 4?
| hsn915 wrote:
| I asked it a few questions and it responded exactly like all
| the other models do. Some of the questions were difficult /
| very specific, and it failed in the same way all the other
| models failed.
| TechDebtDevin wrote:
| I think they are just getting better at the edges, MCP/Tool
| Calls, structured output. This definitely isn't increased
| intelligence, but it an increase in the value add, not sure the
| value added equates to training costs or company valuations
| though.
|
| In all reality, I have zero clue how any of these companies
| remain sustainable. I've tried to host some inference on cloud
| GPUs and its seems like it would be extremely cost prohibitive
| with any sort of free plan.
| layoric wrote:
| > how any of these companies remain sustainable
|
| They don't; they have a big bag of money they are burning
| through while working to raise more. Anthropic is in a better
| position because they don't have the majority of the public
| using their free tier. But AFAICT none of the big players are
| profitable; some might get there, but likely through verticals
| rather than just model access.
| NitpickLawyer wrote:
| > and that LLMs have basically reached a plateau
|
| This is the new stochastic parrots meme. Just a few hours ago
| there was a story on the front page where an LLM-based "agent"
| was given 3 tools to search e-mails and the simple task "find
| my brother's kid's name", and it was able to systematically
| work the problem, search, refine the search, and infer the
| correct name from an e-mail not mentioning anything other than
| "X's favourite foods" with a link to a YouTube video. Come on!
|
| That's not to mention things like AlphaEvolve, Microsoft's
| agentic test demo with Copilot running a browser, exploring
| functionality and writing Playwright tests, and all the
| advances in coding.
| hsn915 wrote:
| Is this something that the models from 4 months ago were not
| able to do?
| morepedantic wrote:
| The LLMs have reached a plateau. Successive generations will
| be marginally better.
|
| We're watching innovation move into the use and application
| of LLMs.
| sensanaty wrote:
| And we also have a showcase from a day ago [1] of these
| magical autonomous AI agents failing miserably in the PRs
| unleashed on the dotnet codebase, where it kept insisting it
| had fixed the failing tests it wrote without actually fixing
| them. Oh, and multiple blatant failures that happened live on
| stage [2], with the speaker trying to sweep them under the rug
| on some of the simplest code imaginable.
|
| But sure, it managed to find a name buried in some emails
| after being told to... search through emails. Wow. Such magic.
| [1] https://news.ycombinator.com/item?id=44050152 [2]
| https://news.ycombinator.com/item?id=44056530
| strangescript wrote:
| I have used Claude Code a ton and I agree; I haven't noticed a
| single difference since updating. Its summaries are, I guess, a
| little cleaner, but it has not surprised me at all in ability.
| I find I am correcting it and re-prompting it in ways I didn't
| have to with 3.7 on a TypeScript codebase. In fact, I was kind
| of shocked how badly it did in a situation where it was editing
| the wrong file; it never thought to check that until I forced
| it to delete all the code and show that nothing had changed
| with regard to what we were looking at.
| voiper1 wrote:
| The benchmarks in many ways seem very similar to Claude 3.7 for
| most cases.
|
| That's nowhere near enough reason to think we've hit a plateau
| - the pace has been super fast, give it a few more months to
| call that...!
|
| I think the opposite about the features - they aren't gimmicks
| at all, but indeed they aren't part of the core AI. Rather,
| it's important "tooling" adjacent to the AI that we need in
| order to actually leverage it. The LLM field in popular usage
| is still in its infancy. If the models don't improve (but I
| expect they will), we have a TON of room with these features -
| how we interact, how we feed them information, tool calls, etc.
| - to greatly improve usability and capability.
| make3 wrote:
| The increases are not as fast, but they're still there. The
| models are already exceptionally strong; I'm not sure that
| basic questions can capture the differences very well.
| hsn915 wrote:
| Hence, "plateau"
| j_maffe wrote:
| "plateau" in the sense that your tests are not capturing
| the improvements. If your usage isn't using its new
| capabilities then for you then effectively nothing changed,
| yes.
| sanex wrote:
| Well to be fair it's only .3 difference.
| GolDDranks wrote:
| After using Claude 3.7 Sonnet for a few weeks, my verdict is that
| its coding abilities are unimpressive both for unsupervised
| coding and for problem solving/debugging, if you are expecting
| accurate results and correct code.
|
| However, as a debugging companion, it's slightly better than a
| rubber duck, because at least there's some suspension of
| disbelief, so I tend to explain things to it earnestly and,
| because of that, process them better by myself.
|
| That said, it's remarkable and interesting how quickly these
| models are getting better. I can't say anything about version 4,
| not having tested it yet, but in five years' time things are not
| looking good for junior developers for sure, and a few years
| after that, for everybody.
| StefanBatory wrote:
| Things were already not looking good for junior devs. I
| graduated this year in Poland; many of my peers were looking
| for jobs in IT for about a year before they were able to find
| anything. And many internships were faked, as they couldn't get
| anything (here you're required to do an internship if you want
| to graduate).
| GolDDranks wrote:
| I sincerely hope you'll manage to find a job!
|
| What I meant was purely from the capabilities perspective.
| There's no way a current AI model would outperform an average
| junior dev in job performance over... let's say, a year, to be
| charitable. Even if they'd outperform junior devs during the
| first week, no way over a longer period.
|
| However, that doesn't mean that the business people won't try
| to pre-empt the potential savings. Some think that AI is
| already good enough; others don't, but expect it to be good
| enough in the future. Whether that happens remains to be seen,
| but the effects are already here.
| BeetleB wrote:
| I've noticed an interesting trend:
|
| Most people who are happy with LLM coding say something like
| "Wow, it's awesome. I asked it to do X and it did it so fast
| with minimal bugs, and good code", and occasionally show the
| output. Many provide even more details.
|
| Most people who are not happy with LLM coding ... provide
| almost no details.
|
| As someone who's impressed by LLM coding, when I read a post
| like yours, I tend to have a lot of questions, and generally
| the post doesn't have the answers.
|
| 1. What type of problem did you try it out with?
|
| 2. Which model did you use (you get points for providing that
| one!)
|
| 3. Did you consider a better model (see how Gemini 2.5 Pro
| compares to Sonnet 3.7 on the Aider leaderboard)?
|
| 4. What were its failings? Buggy code? Correct code but poorly
| architected? Correct code but used some obscure method to solve
| it rather than a canonical one?
|
| 5. Was it working on an existing codebase or was this new code?
|
| 6. Did you manage well how many tokens were sent? Did you use a
| tool that informs you of the number of tokens for each query?
|
| 7. Which tool did you use? It's not just a question of the
| model, but of how the tool handles the prompts/agents under it.
| Aider is different from Code which is different from Cursor
| which is different from Windsurf.
|
| 8. What strategy did you follow? Did you give it the broad spec
| and ask it to do everything at once? Did you work bottom-up and
| incrementally?
|
| I'm not saying LLM coding is the best or can replace a human.
| But for certain use cases (e.g. a simple script, written from
| scratch), it's absolutely _fantastic_. I (mostly) don't use it
| on production code, but for little peripheral scripts I need to
| write (at home or work), it's great. And that's why people like
| me wonder what people like you are doing differently.
|
| But such people aren't forthcoming with the details.
| jiggawatts wrote:
| Reminds me of the early days of endless "ChatGPT can't do X"
| comments where they were invariably using 3.5 Turbo instead
| of 4, which was available to paying users only.
|
| Humans are much lazier than AIs, was my takeaway lesson from
| that.
| sponnath wrote:
| I feel like the opposite is true, but maybe the issue is that
| we both live in separate bubbles. Oftentimes I see people on
| X and elsewhere making wild claims about the capabilities of
| AI, and rarely do they link to the actual output.
|
| That said, I agree that AI has been amazing for fairly closed
| ended problems like writing a basic script or even writing
| scaffolding for tests (it's about 90% effective at producing
| tests I'd consider good assuming you give it enough context).
|
| Greenfield projects have been more of a miss than a hit for
| me. It starts out well but if you don't do a good job of
| directing architecture it can go off the rails pretty
| quickly. In a lot of cases I find it faster to write the code
| myself.
| raincole wrote:
| Yeah, that's obvious. It's even worse for blog posts. Pro-LLM
| posts usually come with whole working toy apps and the prompts
| that were used to generate them. Anti-LLM posts are usually
| some logical puzzles with twists.
|
| Anyway, that's the Internet for you. People will say LLMs have
| plateaued since 2022 with a straight face.
| jjmarr wrote:
| As a junior developer it's much easier for me to jump into a
| new codebase or language and make an impact. I just shipped a
| new error message in LLVM because Cline found the 5 spots in
| 10k+ files where I needed to make the code changes.
|
| When I started an internship last year, it took me weeks to
| learn my way around my team's relatively smaller codebase.
|
| I consider this a skill and cost issue.
|
| If you are rich and able to read fast, you can start writing
| LLVM/Chrome/etc features before graduating university.
|
| If you cannot afford the hundreds of dollars a month Claude
| costs or cannot effectively review the code as it is being
| generated, you will not be employable in the workforce.
| cube2222 wrote:
| Sooo, I love Claude 3.7 and use it every day - I mostly prefer
| it to the Gemini models - but I've just given Opus 4 a spin
| with Claude Code (a Go codebase) for a mostly greenfield
| feature (mostly new files) and... the thinking process is good,
| but 70-80% of tool calls are failing for me.
|
| And I mean basic tools like "Write", "Update" failing with
| invalid syntax.
|
| 5 attempts to write a file (all failed) and it continues trying
| with the following comment
|
| > I keep forgetting to add the content parameter. Let me fix
| that.
|
| So something is wrong here. Fingers crossed it'll be resolved
| soon, because right now Opus 4, at least, is unusable for me
| with Claude Code.
|
| The files it did succeed in creating were high quality.
| cube2222 wrote:
| Alright, I think I found the reason, clearly a bug:
| https://github.com/anthropics/claude-code/issues/1236#issuec...
|
| Basically it seems to be hitting the max output token count
| (writing out a whole new file in one go), which cuts off the
| response, and the invalid tool call parameters error is a red
| herring.
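|
| If anyone wants to check whether they're hitting the same
| thing, the stop_reason on the raw API response gives it away (a
| rough sketch against the Anthropic Python SDK; the model id and
| the tool definition here are illustrative assumptions, not
| Claude Code's actual tools):
|
|     # Sketch: distinguish a truncated tool call (max_tokens) from a
|     # genuinely malformed one. Tool definition and model id are
|     # illustrative assumptions.
|     import anthropic
|
|     client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
|
|     write_tool = {
|         "name": "write_file",
|         "description": "Write a file to disk",
|         "input_schema": {
|             "type": "object",
|             "properties": {
|                 "path": {"type": "string"},
|                 "content": {"type": "string"},
|             },
|             "required": ["path", "content"],
|         },
|     }
|
|     resp = client.messages.create(
|         model="claude-opus-4-20250514",  # model id is an assumption
|         max_tokens=8192,
|         tools=[write_tool],
|         messages=[{"role": "user",
|                    "content": "Write the new Go file for the feature."}],
|     )
|
|     if resp.stop_reason == "max_tokens":
|         # The tool call was cut off mid-JSON; the "invalid parameters"
|         # error is downstream of the truncation, not a syntax problem.
|         print("Hit the output token limit; ask for smaller edits.")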
| jasondclinton wrote:
| Thanks for the report! We're addressing it urgently.
| cube2222 wrote:
| Seems to be working well now (in Claude Code 1.0.2). Thanks
| for the quick fix!
| janpaul123 wrote:
| At Kilo we're already seeing lots of people trying it out. It's
| looking very good so far. Gemini 2.5 Pro had been taking over
| from Claude 3.7 Sonnet, but it looks like there's a new king. The
| bigger question is how often it's worth the price.
| jetsetk wrote:
| After that debacle on X, I will not try anything that comes
| from Anthropic for sure. Be careful!
| sunaookami wrote:
| What happened?
| paradite wrote:
| Opus 4 beats all other models in my personal eval set for coding
| and writing.
|
| Sonnet 4 also beats most models.
|
| A great day for progress.
|
| https://x.com/paradite_/status/1925638145195876511
| tonyhart7 wrote:
| I already tested it with a coding task; yes, the improvement is
| there, albeit not a lot, because Claude 3.7 Sonnet is already
| great.
| wewewedxfgdf wrote:
| I would take better file export/access over more fancy AI
| features any day.
|
| Copying and pasting is so old.
| zone411 wrote:
| On the extended version of NYT Connections -
| https://github.com/lechmazur/nyt-connections/:
|
| Claude Opus 4 Thinking 16K: 52.7.
|
| Claude Opus 4 No Reasoning: 34.8.
|
| Claude Sonnet 4 Thinking 64K: 39.6.
|
| Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was
| 33.6).
|
| Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was
| 19.2).
|
| Claude Sonnet 4 Thinking 64K refused to provide one puzzle
| answer, citing "Output blocked by content filtering policy."
| Other models did not refuse.
| zone411 wrote:
| On my Thematic Generalization Benchmark
| (https://github.com/lechmazur/generalization, 810 questions),
| the Claude 4 models are the new champions.
| SamBam wrote:
| This is the first LLM that has been able to answer my logic
| puzzle on the first try without several minutes of extended
| reasoning.
|
| > A man wants to cross a river, and he has a cabbage, a goat, a
| wolf and a lion. If he leaves the goat alone with the cabbage,
| the goat will eat it. If he leaves the wolf with the goat, the
| wolf will eat it. And if he leaves the lion with either the wolf
| or the goat, the lion will eat them. How can he cross the river?
|
| Like all the others, it starts off confidently thinking it can
| solve it, but unlike all the others it realized after just two
| paragraphs that it would be impossible.
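|
| (For the curious: assuming the usual rules - the boat carries
| the man plus at most one item, and everything has to get across
| intact - a quick brute-force search confirms there is no valid
| crossing order at all. This is just a sketch I used to check,
| not something I gave the model.)
|
|     # Brute-force check of the tweaked river-crossing puzzle, assuming
|     # the classic rule that the boat holds the man plus at most one item.
|     from collections import deque
|
|     ITEMS = frozenset({"cabbage", "goat", "wolf", "lion"})
|
|     def safe(group):
|         # `group` is the set of items left on a bank without the man.
|         if "goat" in group and "cabbage" in group:
|             return False
|         if "wolf" in group and "goat" in group:
|             return False
|         if "lion" in group and ("wolf" in group or "goat" in group):
|             return False
|         return True
|
|     def solvable():
|         start = (ITEMS, "L")             # (items on the left bank, man's side)
|         seen, queue = {start}, deque([start])
|         while queue:
|             left, man = queue.popleft()
|             if not left and man == "R":  # everything made it across
|                 return True
|             here = left if man == "L" else ITEMS - left
|             for cargo in [None] + sorted(here):   # cross alone or with one item
|                 new_left = set(left)
|                 if man == "L" and cargo:
|                     new_left.discard(cargo)
|                 elif man == "R" and cargo:
|                     new_left.add(cargo)
|                 new_left = frozenset(new_left)
|                 new_man = "R" if man == "L" else "L"
|                 unattended = new_left if new_man == "R" else ITEMS - new_left
|                 if not safe(unattended):          # would leave a bank unsafe
|                     continue
|                 state = (new_left, new_man)
|                 if state not in seen:
|                     seen.add(state)
|                     queue.append(state)
|         return False
|
|     print(solvable())  # prints False: no sequence of single-item trips works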
| ttoinou wrote:
| That is a classic riddle and could easily be part of the
| training data. Maybe if you changed the wording of the logic,
| used different names, and switched to a language with less
| training data than English, it would be more meaningful to see
| whether it found the answer using logic rather than pattern
| recognition.
| KolmogorovComp wrote:
| Had you paid more attention, you would have realised it's not
| the classic riddle but an already-tweaked version that makes it
| impossible to solve, which is why it is interesting.
| albumen wrote:
| Mellowobserver above offers three valid answers, unless
| your puzzle also clarified that he wants to get all the
| items/animals across to the other side alive/intact.
| mellow_observer wrote:
| Actual answer: He crosses the river and takes all of the
| animals and the cabbage with him in one go. Why not?
|
| Alternative answer: He just crosses the river. Why would he
| care who eats what?
|
| Another alternative answer: He actually can't cross the river,
| since he doesn't have a boat and neither the cabbage nor the
| animals serve as appropriate flotation aids.
| ungreased0675 wrote:
| The answer isn't for him to get in a boat and go across? You
| didn't say all the other things he has with him need to cross.
| "How can he cross the river?"
|
| Or were you simplifying the scenario provided to the LLM?
| j_maffe wrote:
| Tried Sonnet with the 5-disk Towers of Hanoi puzzle. Failed
| miserably :/ https://claude.ai/share/6afa54ce-a772-424e-97ed-6d52ca04de28
| mmmore wrote:
| Sonnet with extended thinking solved it after 30s for me:
|
| https://claude.ai/share/b974bd96-91f4-4d92-9aa8-7bad964e9c5a
|
| Normal Opus solved it:
|
| https://claude.ai/share/a1845cc3-bb5f-4875-b78b-ee7440dbf764
|
| Opus with extended thinking solved it after 7s:
|
| https://claude.ai/share/0cf567ab-9648-4c3a-abd0-3257ed4fbf59
|
| Though it's a weird puzzle to use as a benchmark because the
| answer is so formulaic.
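|
| The whole solution is one textbook recursion (2^n - 1 moves for
| n disks), so the puzzle mostly tests whether a model can
| execute a well-known procedure without losing track. A minimal
| sketch:
|
|     # Classic Towers of Hanoi: move n disks from `src` to `dst` via `aux`.
|     # For n = 5 this prints the 31 (2**5 - 1) moves in order.
|     def hanoi(n, src="A", dst="C", aux="B"):
|         if n == 0:
|             return
|         hanoi(n - 1, src, aux, dst)   # park the top n-1 disks on the spare peg
|         print(f"move disk {n}: {src} -> {dst}")
|         hanoi(n - 1, aux, dst, src)   # stack them back on top of disk n
|
|     hanoi(5)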
| j_maffe wrote:
| It is formulaic which is why it surprised me that Sonnet
| failed it. I don't have access to the other models so I'll
| stick with Gemini for now.
| bittermandel wrote:
| I just used Sonnet 4 to analyze our quite big monorepo for
| additional test cases, and I feel the output is much more
| useful than 3.7's. It's more critical overall, which is highly
| appreciated, as I often had to threaten 3.7 into not being too
| kind to me.
| toephu2 wrote:
| The Claude 4 video promo sounds like an ad for Asana.
| ejpir wrote:
| Anyone notice the /vibe option in Claude Code, pointing to
| www.thewayofcode.com?
| pan69 wrote:
| Enabled the model in GitHub Copilot, gave it one (relatively
| simple) prompt, and after that:
|
| Sorry, you have been rate-limited. Please wait a moment before
| trying again. Learn More
|
| Server Error: rate limit exceeded Error Code: rate_limited
| firtoz wrote:
| Everyone's trying the model now, so give it time.
| lossolo wrote:
| Opus 4 is slightly below o3 High on LiveBench.
|
| https://livebench.ai/#/
| a2128 wrote:
| > Finally, we've introduced thinking summaries for Claude 4
| models that use a smaller model to condense lengthy thought
| processes. This summarization is only needed about 5% of the time
| --most thought processes are short enough to display in full.
| Users requiring raw chains of thought for advanced prompt
| engineering can contact sales about our new Developer Mode to
| retain full access.
|
| I don't want to see a "summary" of the model's reasoning! If I
| want to make sure the model's reasoning is accurate and that I
| can trust its output, I need to see the actual reasoning. It
| greatly annoys me that OpenAI and now Anthropic are moving
| towards a system of hiding the model's thinking process, charging
| users for tokens they cannot see, and providing "summaries" that
| make it impossible to tell what's actually going on.
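|
| For context, the API does still return "thinking" content
| blocks when extended thinking is enabled; the complaint is that
| for Claude 4 long traces come back summarized rather than raw.
| A rough sketch against the Anthropic Python SDK (the model id
| and parameter names are my assumptions from the docs, so
| double-check them):
|
|     # Sketch: request extended thinking and print whatever thinking
|     # blocks come back; Anthropic says long traces are summarized.
|     import anthropic
|
|     client = anthropic.Anthropic()
|
|     resp = client.messages.create(
|         model="claude-sonnet-4-20250514",   # model id is an assumption
|         max_tokens=4096,
|         thinking={"type": "enabled", "budget_tokens": 2048},
|         messages=[{"role": "user",
|                    "content": "How many primes are there below 100?"}],
|     )
|
|     for block in resp.content:
|         if block.type == "thinking":
|             print("[thinking]", block.thinking)  # may be a summary
|         elif block.type == "text":
|             print("[answer]", block.text)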
| khimaros wrote:
| I believe Gemini 2.5 Pro also does this.
| izabera wrote:
| I am now focusing on checking your proposition. I am now
| fully immersed in understanding your suggestion. I am now
| diving deep into whether Gemini 2.5 pro also does this. I am
| now focusing on checking the prerequisites.
| kovezd wrote:
| Don't be so concerned. There's ample evidence that the thinking
| is often dissociated from the output.
|
| My take is that this is a user-experience improvement, given
| how few people actually go on to read the thinking process.
| Davidzheng wrote:
| then provide it as an option?
| layoric wrote:
| There are several papers suggesting that the 'thinking' output
| is meaningless to the final output, and that using dots or
| pause tokens, which enable the same additional rounds of
| computation, results in similar improvements.
|
| So in a lot of regards the 'thinking' is mostly marketing.
|
| - "Think before you speak: Training Language Models With Pause
| Tokens" - https://arxiv.org/abs/2310.02226
|
| - "Let's Think Dot by Dot: Hidden Computation in Transformer
| Language Models" - https://arxiv.org/abs/2404.15758
|
| - "Do LLMs Really Think Step-by-step In Implicit Reasoning?" -
| https://arxiv.org/abs/2411.15862
|
| - Video by bycloud as an overview ->
| https://www.youtube.com/watch?v=Dk36u4NGeSU
| rcarmo wrote:
| I'm going to have to test it with my new prompt: "You are a
| stereotypical Scotsman from the Highlands, prone to using dialect
| and endearing insults at every opportunity. Read me this article
| in yer own words:"
| guybedo wrote:
| There are a lot of comments in this thread, so I've added a
| structured/organized summary here:
|
| https://extraakt.com/extraakts/discussion-on-anthropic-claud...
| smcleod wrote:
| Still no reduction in price for models capable of agentic
| coding over the past year of releases. I'd take the
| capabilities of the old Sonnet 3.5v2 model if it were 1/4 the
| price of the current Sonnet for most situations. But instead of
| releasing smaller models that are not as smart but are still
| capable of agentic coding, the price stays the same for the
| updated minimum viable model.
| threecheese wrote:
| They have added prompt caching, which can mitigate this. I
| largely agree though, and one of the reasons I don't use Claude
| Code much is the unpredictable cost. Like many of us, I am
| already paying for all the frontier model providers as well as
| various API usage, plus Cursor and GitHub, just trying to keep
| up.
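|
| For anyone who hasn't tried it, caching is mostly a matter of
| marking the large, stable prefix of the prompt as cacheable (a
| minimal sketch against the Anthropic Python SDK; the model id
| and file name are assumptions, and exact TTL/pricing behaviour
| should be checked against the docs):
|
|     # Sketch: mark a large, rarely-changing system block as cacheable so
|     # repeated calls reuse it instead of paying for it every time.
|     import anthropic
|
|     client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
|
|     big_context = open("repo_summary.md").read()  # hypothetical stable context
|
|     resp = client.messages.create(
|         model="claude-sonnet-4-20250514",  # model id is an assumption
|         max_tokens=1024,
|         system=[
|             {
|                 "type": "text",
|                 "text": big_context,
|                 # standard ~5 minute cache TTL; the announced 1-hour TTL
|                 # is opted into separately (parameter details in the docs).
|                 "cache_control": {"type": "ephemeral"},
|             }
|         ],
|         messages=[{"role": "user",
|                    "content": "Suggest missing test cases for the parser."}],
|     )
|     print(resp.content[0].text)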
___________________________________________________________________
(page generated 2025-05-22 23:00 UTC)