[HN Gopher] Claude Sonnet 4 now supports 1M tokens of context
___________________________________________________________________
Claude Sonnet 4 now supports 1M tokens of context
Author : adocomplete
Score : 1273 points
Date : 2025-08-12 16:02 UTC (1 day ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| throwaway888abc wrote:
| holy moly! awesome
| tankenmate wrote:
| It's definitely good to have this as an option, but at the same
| time having more context reduces the quality of the output,
| because it's easier for the LLM to get "distracted". So I wonder
| what will happen to the quality of code produced by tools like
| Claude Code if users don't properly understand the trade-off
| being made (if they leave it in auto mode and code right up to
| the auto-compact).
| jasonthorsness wrote:
| What do you recommend doing instead? I've been using Claude
| Code a lot but am still pretty novice at the best practices
| around this.
| TheDong wrote:
| Have the AI produce a plan that spans multiple files (like
| "01 create frontend.md", "02 create backend.md", "03 test
| frontend and backend running together.md"), and then create a
| fresh context for each step if it looks like re-using the
| same context is leading it to confusion.
|
| Also, commit frequently, and if the AI constantly goes down
| the wrong path ("I can't create X so I'll stub it out with Y,
| we'll fix it later"), you can update the original plan with
| wording to tell it not to take that path ("Do not ever stub
| out X, we must make X work"), and then start a fresh session
| with an older and simpler version of the code and see if that
| fresh context ends up down a better path.
|
| You can also run multiple attempts in parallel if you use
| tooling that supports that (containers + git worktrees is one
| way)
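|
| If you want to script the parallel-attempts part, here's a rough
| sketch in Python (assumes Claude Code's headless `claude -p`
| mode; the names and layout are just examples, adjust to your
| tooling):
|
|     # run N independent attempts at the same step, each in its
|     # own git worktree with a fresh Claude context
|     import subprocess
|
|     PROMPT = "Implement step 01 from '01 create frontend.md'"
|     ATTEMPTS = 3
|
|     procs = []
|     for i in range(ATTEMPTS):
|         tree = f"../attempt-{i}"
|         subprocess.run(
|             ["git", "worktree", "add", "-b", f"attempt-{i}", tree],
|             check=True)
|         procs.append(
|             subprocess.Popen(["claude", "-p", PROMPT], cwd=tree))
|
|     for p in procs:
|         p.wait()
|     # then diff/test each branch and keep the best attempt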
| F7F7F7 wrote:
| Inevitably the files become a mess of their own. Changes
| and learnings from one part of the plan often don't result
| in adaptation to the impacted plans further down the chain.
|
| In the end you have a mishmash of half-implemented plans
| and now you've lost context too. Which leads to blowing
| tokens on trying to figure out what's been implemented,
| what's half baked, and what was completely ignored.
|
| Any links to anyone who's built something at scale using
| this method? It always sounds good on paper.
|
| I'd love to find a system that works.
| brandall10 wrote:
| My system is to create detailed feature files up to a few
| hundred lines in size that are immutable, and then have a
| status.md file (preferably kept to about 50 lines) that
| links to a current feature that is used as a way to keep
| track of the progress on that feature.
|
| Additionally I have a Claude Code command with
| instructions referencing the status.md, how to select the
| next task, how to compact status.md, etc.
|
| Every time I'm done with a unit of work from that feature
| - always triggered w/ ultrathink - I'll put up a PR and
| go through the motions of extra refactors/testing. For
| more complex PRs that require many extra commits to get
| prod ready I just let the sessions auto-compact.
|
| After merging I'll clear the context and call the CC
| command to progress to the next unit of work.
|
| This allows me to put up to around 4-5 meaningful PRs per
| feature if it's reasonably complex while keeping the
| context relatively tight. The current project I'm focused
| on is just over 16k LOC in swift (25k total w/ tests) and
| it seems to work pretty well - it rarely gets off track,
| does unnecessary refactors, destroys working features,
| etc.
| nzach wrote:
| Care to elaborate on how you use the status.md file? What
| exactly you put in there, and what value does it bring?
| brandall10 wrote:
| When I initially have it built from a feature file, it
| pulls in the most pertinent high level details from that
| and creates a supercharged task list that is updated w/
| implementation details as the feature progresses.
|
| As it links to the feature file as well, that is pulled
| into the context, but status.md is there to essentially
| act as a 'cursor' to where it is in the implementation
| and provide extended working memory - that Claude itself
| manages - specific to that feature. With that you can
| work on bite sized chunks of the feature each with a
| clean context. When the feature is complete it is
| trashed.
|
| I've seen others try to achieve similar things by making
| CLAUDE.md or the feature file mutable but that IME is a
| bad time. CLAUDE.md should be lean with the details to
| work on the project, and the feature file can easily be
| corrupted in an unintended way allowing things to go
| wayward in scope.
| nzach wrote:
| In my experience it works better if you create one plan
| at a time. Create a prompt, have Claude implement it, and
| then make sure it is working as expected. Only then do
| you ask for something new.
|
| I've created an agent to help me create the prompts, it
| goes something like this: "You are an Expert Software
| Architect specializing in creating comprehensive, well-
| researched feature implementation prompts. Your sole
| purpose is to analyze existing codebases and
| documentation to craft detailed prompts for new features.
| You always think deeply before giving an answer...."
|
| My workflow is: 1) use this agent to create a prompt for
| my feature; 2) ask claude to create a plan for the just
| created prompt; 3) ask claude to implement said plan if
| it looks good.
| cube00 wrote:
| >You always think deeply before giving an answer...
|
| Nice try but they're not giving you the "think deeper"
| level just because you asked.
| nzach wrote:
| https://docs.anthropic.com/en/docs/build-with-
| claude/prompt-...
| dpe82 wrote:
| Actually that's exactly how you do it.
| theshrike79 wrote:
| I use Gemini-cli (free 2.5 pro for an undetermined time
| before it self-lobotomises and switches to lite) to keep
| the specs up to date.
|
| The actual tasks are stored in Github issues, which
| Claude (and sometimes Gemini when it feels like it) can
| access using the `gh` CLI tool.
|
| But it's all just project management, if what the code
| says drifts from what's in the specs (for any reason),
| one of them has to change.
|
| Claude does exactly what the documentation says; it doesn't
| notice that the code is completely different and adapt, like
| a human would.
| bredren wrote:
| Don't rely entirely on CC. Once a milestone has been
| reached, copy the full patch to the clipboard along with the
| technical spec covering it. Provide the original files,
| the patch, and the spec to Gemini and ask, roughly: a
| colleague did this work; does it fulfill the spec and follow
| best practices?
|
| Pick among the best feedback to polish the work done by
| CC---it will miss things that Gemini will catch.
|
| Then do it again. Sometimes CC just won't follow feedback
| well and you gotta make the changes yourself.
|
| If you do this you'll move more gradually, but by the nature
| of the pattern you'll look at the changes more closely.
|
| You'll be able to realign CC with the spec afterward with
| a fresh context and the existing commits showing the way.
|
| Fwiw, this kind of technique can be done entirely without
| CC and can lead to excellent results faster, as Gemini
| can look at the full picture all at once, vs. having to
| force CC to hunt and peck through slices of files.
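|
| A minimal sketch of bundling the patch and spec for that
| hand-off (the branch name and spec path are placeholders; paste
| the result into Gemini or pipe it to whatever CLI you use):
|
|     # build a review prompt from the milestone's patch + spec
|     import subprocess
|     from pathlib import Path
|
|     patch = subprocess.run(
|         ["git", "diff", "main...HEAD"],
|         capture_output=True, text=True, check=True,
|     ).stdout
|     spec = Path("docs/feature-spec.md").read_text()
|
|     prompt = (
|         "A colleague did the work below. Does it fulfill the "
|         "spec and follow best practices? List concrete issues."
|         f"\n\n--- SPEC ---\n{spec}\n\n--- PATCH ---\n{patch}\n"
|     )
|     Path("review_prompt.txt").write_text(prompt)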
| wongarsu wrote:
| Changing the prompt and rerunning is something where Cursor
| still has a clear edge over Claude Code. It's such a
| powerful technique for keeping the context small because it
| keeps the context clear of back-and-forths and dead ends. I
| wish it was more universally supported
| abound wrote:
| I do this all the time in Claude Code, you hit Escape
| twice and select the conversation point to 'branch' from.
| agotterer wrote:
| I use the main Claude code thread (I don't know what to call
| it) for planning and then explicitly tell Claude to delegate
| certain standalone tasks out to subagents. The subagents
| don't consume the main thread's context window. Even just
| delegating testing, debugging, and building will save a ton
| of context.
| sixothree wrote:
| Running /clear often is really the first tool for context
| management. Do this when you finish a task.
| tehlike wrote:
| Some reference:
|
| https://simonwillison.net/2025/Jun/29/how-to-fix-your-contex...
|
| https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-ho...
| cubefox wrote:
| It would be interesting to see how this manifests in SSM/Mamba
| models. The way they handle their context window is different
| from Transformers, as the former don't use the Attention
| mechanism. Mamba is better at context window scaling than
| transformers but worse at explicit recall. Though that
| doesn't tell us how susceptible they are to context
| distractions.
| bachittle wrote:
| As of now it's not integrated into Claude Code. "We're also
| exploring how to bring long context to other Claude products".
| I'm sure they already know about this issue and are trying to
| think of solutions before letting users incur more costs on
| their monthly plans.
| PickledJesus wrote:
| Seems to be for me, I came to look at HN because I saw it was
| the default in CC
| novaleaf wrote:
| where do you see it in CC?
| PickledJesus wrote:
| I got a notification when I opened it, indicating that
| the default had changed, and I can see it on /model.
|
| Only on a max (20x) account, not there on a Pro one.
| novaleaf wrote:
| thanks, FYI I'm on a max 20x also and I don't see it!
| tankenmate wrote:
| maybe a staggered release?
| Wowfunhappy wrote:
| I'm curious, what does it say on /model?
|
| For reference, my options are:
|
|       Select Model
|       Switch between Claude models. Applies to this session
|       and future Claude Code sessions.
|       For custom model names, specify with --model.
|
|       1. Default (recommended)   Opus 4.1 for up to 50% of
|                                  usage limits, then use Sonnet 4
|       2. Opus                    Opus 4.1 for complex tasks
|                                  * Reaches usage limits faster
|       3. Sonnet                  Sonnet 4 for daily use
|       4. Opus Plan Mode          Use Opus 4.1 in plan mode,
|                                  Sonnet 4 otherwise
| novaleaf wrote:
| me also
| dbreunig wrote:
| The team at Chroma is currently looking into this and should
| have some figures.
| falcor84 wrote:
| Strange that they don't mention whether that's enabled or
| configurable in Claude Code.
| farslan wrote:
| Yeah, same, I'm curious about this. I would guess it's
| enabled by default with Claude Code.
| csunoser wrote:
| They don't say it outright. But I think it is not in Claude
| Code yet.
|
| > We're also exploring how to bring long context to other
| Claude products. - Anthropic
|
| That is, any product other than the Anthropic API (tier 4) or
| Amazon Bedrock.
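|
| For anyone on the API, it looks like it's just a beta flag on
| the request. A rough sketch with the Python SDK (the exact beta
| name and model id below are from memory, so verify them in the
| docs):
|
|     import anthropic
|
|     client = anthropic.Anthropic()  # ANTHROPIC_API_KEY from env
|     big_prompt = open("repo_dump.txt").read()  # your huge input
|
|     resp = client.messages.create(
|         model="claude-sonnet-4-20250514",   # assumed model id
|         max_tokens=4096,
|         # assumed long-context beta header, check the docs
|         extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
|         messages=[{"role": "user", "content": big_prompt}],
|     )
|     print(resp.content[0].text)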
| CharlesW wrote:
| From a co-marketing POV, it's considered best practice to not
| discuss home-grown offerings in the same or similar category as
| products from the partners you're featuring.
|
| It's likely they'll announce this week, albeit possibly just
| within the "what's new" notes that you see when Claude Code is
| updated.
| reasonableklout wrote:
| They just sent an email that the feature is in beta in CC.
| faangguyindia wrote:
| In my testing the gap between Claude and Gemini Pro 2.5 has
| narrowed. My company is in Asia Pacific and we can't get access
| to Claude via Vertex for some stupid reason.
|
| But I tested it via other providers; the gap used to be huge,
| but not anymore.
| Tostino wrote:
| For me the gap is pretty large (in Gemini Pro 2.5's favor).
|
| For reference, the code I am working on is a Spring Boot /
| (Vaadin) Hilla multi-module project with helm charts for
| deployment and a separate Python based module for ancillary
| tasks that were appropriate for it.
|
| I've not been able to get any good use out of Sonnet in months
| now, whereas Gemini Pro 2.5 has (still) been able to grok the
| project well enough to help out.
| jona777than wrote:
| I initially found Gemini Pro 2.5 to work well for coding.
| Over time, I found Claude to be more consistently productive.
| Gemini Pro 2.5 became my go-to for use cases benefitting from
| larger context windows. Claude seemed to be the safer daily
| driver (if I needed to get something done.)
|
| All that being said, Gemini has been consistently dependable
| when I had asks that involved large amounts of code and data.
| Claude and the OpenAI models struggled with some tasks that
| Gemini responsively satisfied seemingly without "breaking a
| sweat."
|
| Lately, it's been GPT-5 for brainstorming/planning, Claude
| for hammering out some code, and Gemini when there are huge
| data/code requirements. I'm curious if the widened Sonnet 4
| context window will change things.
| llm_nerd wrote:
| Opus 4.1 is a much better model for coding than Sonnet. The
| latter is good for general queries / investigations or to
| draw up some heuristics.
|
| I have paid subscriptions to both Gemini Pro and Claude.
| Hugely worthwhile expense professionally.
| faangguyindia wrote:
| When Gemini 2.5 Pro gets stuck, I often use DeepSeek R1 in
| architect mode and Qwen3 in coder mode in aider, and that
| combination solves all the problems.
|
| Last month I ran into some wicked dependency bug and only
| ChatGPT could solve it, which I'm guessing is because it has
| hot data from GitHub?
|
| On the other hand, I really need a tool like aider where I
| can use various models in "architect" and "coder" mode.
|
| What I've found is that better reasoning models tend to be bad
| at writing actual code, and models like Qwen3 Coder seem
| better.
|
| DeepSeek R1 will not write reliable code, but it will reason
| well and map out the path forward.
|
| I wouldn't be surprised if Sonnet's success came from doing
| EXACTLY this behind the scenes.
|
| But now I am looking for pure models that do not use this
| black-magic hack behind the API.
|
| I want more control at the tool end, where I can alter the
| prompts and achieve the results I want.
|
| This is one reason I do not use Claude Code etc.
|
| Aider is 80% of what I want; I wish it had more of what I
| want, though.
|
| I just don't know why no one has built a perfect solution to
| this yet.
|
| Here are the things I am missing in aider:
|
| 1. Automatic model switching: use different models for asking
| questions about the code, planning a feature, and writing the
| actual code.
|
| 2. Self-determining whether a feature needs a "reasoning"
| model or whether a coding model will suffice.
|
| 3. The ability to selectively send context and drop the files
| we don't need: intelligently add the files the feature will
| touch to the context up front, instead of doing all the code
| planning, being asked to add files, and then doing it all over
| again with more context available.
| penguin202 wrote:
| Claude doesn't have a mid-life crisis and try to `rm -rf /` or
| delete your project.
| film42 wrote:
| Agree, but pricing-wise, Gemini 2.5 Pro wins. Gemini input
| tokens are half the cost of Claude 4's, and output is $5/million
| cheaper than Claude. On top of that, document processing is
| significantly cheaper: a 5MB PDF (customer invoice) with Gemini
| is like 5k tokens vs 56k with Claude.
|
| The only downside with Gemini (and it's a big one) is
| availability. We get rate limited by their dynamic QoS all the
| time even if we haven't reached our quota. Our GCP sales rep
| keeps recommending "provisioned throughput," but it's both
| expensive and a poor fit for our workload type. Plus, the
| VertexAI SDK is kind of a PITA compared to Anthropic's.
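|
| To put rough numbers on the document-processing point above
| (the per-token prices here are assumptions from memory; check
| current pricing before relying on them):
|
|     # back-of-envelope cost of one ~5MB PDF invoice, input only
|     GEMINI_25_PRO_IN = 1.25 / 1_000_000  # assumed $/input token
|     CLAUDE_SONNET_IN = 3.00 / 1_000_000  # assumed $/input token
|
|     gemini = 5_000 * GEMINI_25_PRO_IN    # ~5k tokens per PDF
|     claude = 56_000 * CLAUDE_SONNET_IN   # ~56k tokens per PDF
|     print(f"Gemini ${gemini:.4f} vs Claude ${claude:.4f} "
|           f"(~{claude / gemini:.0f}x per document)")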
| Alex-Programs wrote:
| Google products are such a pain to work with from an API
| perspective that I actively avoid them where possible.
| artursapek wrote:
| Eagerly waiting for them to do this with Opus
| irthomasthomas wrote:
| Imagine paying $20 a prompt?
| artursapek wrote:
| If I can give it a detailed spec, walk away and do something
| else for 20 minutes, and come back to work that would have
| taken me 2 hours, then that's a steal.
| dbbk wrote:
| You can just do this now though. In fact you could go a
| step further and set up the GitHub Action, then you can
| kick off Claude from the iOS GitHub app from the beach and
| review the PR when it's done.
| datadrivenangel wrote:
| Depending on how many prompts per hour you're looking at,
| that's probably same order of magnitude as expensive SAAS. A
| fancy CRM seat can be ~$2000 per month (or more), which
| assuming 50 hours per week x 4 weeks per month is $10 per
| hour ($2000/200 hours). A lot of money, but if it makes your
| sales people more productive, it's a good investment.
| Assuming that you're paying your sales people say 240K per
| year, ($20,000 per month), then the SAAS cost is 10% of their
| salary.
|
| This explains DataDog pricing. Maybe it will give a future
| look at AI pricing.
| mettamage wrote:
| Shame it's only the API. Would've loved to see it via the web
| interface on claude.ai itself.
| minimaxir wrote:
| Can you even fit 200+k tokens worth of context in the web
| interface? IMO Claude's API workbench is the worst of the three
| major providers.
| mettamage wrote:
| Via text files right? Just drag and drop.
| data-ottawa wrote:
| When working on artifacts after a few change requests it
| definitely can.
| 77pt77 wrote:
| Even if you can't, a conversation can easily get larger than
| that.
| fblp wrote:
| I assume this will mean that long chats continue to get the
| "prompt is too long" error?
| penguin202 wrote:
| But will it remember any of it, and stop creating new redundant
| files when it can't find or understand what it's looking for?
| 1xer wrote:
| moaaaaarrrr
| aliljet wrote:
| This is definitely one of my CORE problems as I use these tools
| for "professional software engineering." I really desperately
| need LLMs to maintain extremely effective context, and it's not
| actually that interesting to see a new model that's marginally
| better than the last one (for my day-to-day).
|
| However, price is king. Allowing me to flood the context window
| with my code base is great, but given that the price has
| substantially increased, it makes sense to manage the context
| window more carefully in the current situation. The value of
| flooding their context window is great for them, but short of
| evals that look into how effectively Sonnet stays on track,
| it's not clear the value actually exists here.
| rootnod3 wrote:
| Flooding the context also means increasing the likelihood of
| the LLM confusing itself. Mainly because of the longer context.
| It derails along the way without a reset.
| aliljet wrote:
| How do you know that?
| EForEndeavour wrote:
| https://onnyunhui.medium.com/evaluating-long-context-
| lengths...
| bigmadshoe wrote:
| https://research.trychroma.com/context-rot
| joenot443 wrote:
| This is a good piece. Clearly it's a pretty complex
| problem, and the intuitive result a layman engineer like
| myself might expect doesn't reflect the reality of LLMs.
| Regex works as reliably on 20 characters as it does on 2M
| characters; the only difference is speed. I've learned
| this will probably _never_ be the case with LLMs; there
| will forever exist some level of epistemic doubt in the
| result.
|
| When they announced Big Contexts in 2023, they referenced
| being able to find a single changed sentence in the
| context's copy of Great Gatsby[1]. This example seemed
| _incredible_ to me at the time but now two years later
| I'm feeling like it was pretty cherry-picked. What does
| everyone else think? Could you feed a novel into an LLM
| and expect it to find the single change?
|
| [1] https://news.ycombinator.com/item?id=35941920
| adastra22 wrote:
| Depends on the change.
| bigmadshoe wrote:
| This is called a "needle in a haystack" test, and all the
| 1M context models perform perfectly on this exact
| problem, at least when your prompt and the needle are
| sufficiently similar.
|
| As the piece above references, this is a totally
| insufficient test for the real world. Things like "find
| two unrelated facts tied together by a question, then
| perform reasoning based on them" are much harder.
|
| Scaling context properly is O(n^2). I'm not really up to
| date on what people are doing to combat this, but I find
| it hard to believe the jump from 100k -> 1m context
| window involved a 100x (10^2) slowdown, so they're
| probably taking some shortcut.
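|
| The back-of-envelope version of that scaling argument (naive
| full attention, ignoring whatever shortcuts providers actually
| take):
|
|     old_ctx, new_ctx = 100_000, 1_000_000
|     attention_ratio = (new_ctx / old_ctx) ** 2  # 100x attention
|     per_token_ratio = new_ctx / old_ctx         # 10x per token
|     print(attention_ratio, per_token_ratio)     # 100.0 10.0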
| dang wrote:
| Discussed here:
|
| _Context Rot: How increasing input tokens impacts LLM
| performance_ -
| https://news.ycombinator.com/item?id=44564248 - July 2025
| (59 comments)
| F7F7F7 wrote:
| What do you think happens when things start falling outside
| of its context window? It loses access to parts of your
| conversation.
|
| And that's why it will gladly rebuild the same feature over
| and over again.
| anonz4FWNqnX wrote:
| I've had similar experiences. I've gone back and forth
| between running models locally and using the commercial
| models. The local models can be incredibly useful (Gemma,
| Qwen), but they need more patience and effort to get
| working.
|
| One advantage of running locally[1] is that you can set the
| context length manually and see how well the LLM uses it. I
| don't have an exact experience to relay, but it's not
| unusual for models to allow longer contexts yet ignore most
| of that context.
|
| Just making the context big doesn't mean the LLM is going
| to use it well.
|
| [1] I've been using LM Studio on both a MacBook Air and a
| MacBook Pro. Even a MacBook Air with 16GB can run pretty
| decent models.
| nomel wrote:
| A good example of this was the first Gemini model that
| allowed 1 million tokens, but would lose track of the
| conversation after a couple paragraphs.
| rootnod3 wrote:
| The longer the context and the discussion goes on, the more
| it can get confused, especially if you have to refine the
| conversation or code you are building on.
|
| Remember, at its core it's basically a text prediction
| engine. So the more varied the context is, the more
| likely it is to make a mess of it.
|
| Short context: the conversation leaves the context window and
| it loses context. Long context: it can mess with the model. So
| the trick is to strike a balance. But if it's an online
| model, you have fuck all to control. If it's a local
| model, you have some say in the parameters.
| fkyoureadthedoc wrote:
| https://github.com/adobe-research/NoLiMa
| giancarlostoro wrote:
| Here's a paper from MIT that covers how this could be
| resolved in an interesting fashion:
|
| https://hanlab.mit.edu/blog/streamingllm
|
| The AI field is reusing existing CS concepts for AI that we
| never had hardware for, and now these people are learning
| how applied Software Engineering can make their theoretical
| models more efficient. It's kind of funny, I've seen this
| in tech over and over. People discover new thing, then
| optimize using known thing.
| mamp wrote:
| Unfortunately, I think the context rot paper [1] found
| that the performance degradation when context increased
| still occurred in models using attention sinks.
|
| 1. https://research.trychroma.com/context-rot
| giancarlostoro wrote:
| Saw that paper but haven't had a chance to read it yet - are
| there other techniques that help, then? I assume there are a
| few different ones used.
| kridsdale3 wrote:
| The fact that this is happening is where the tremendous
| opportunity to make money as an experienced Software
| Engineer currently lies.
|
| For instance, a year or two ago, the AI people discovered
| "cache". Imagine how many millions the people who
| implemented it earned for that one.
| giancarlostoro wrote:
| I've been thinking the same, and it's things that you
| don't need some crazy ML degree to know how to do... A
| lot of the algorithms have been known... for a while now...
| Milk it while you can.
| Wowfunhappy wrote:
| I keep reading this, but with Claude Code in particular, I
| consistently find it gets smarter the longer my conversations
| go on, peaking right at the point where it auto-compacts and
| everything goes to crap.
|
| This isn't always true--some conversations go poorly and it's
| better to reset and start over--but it usually is.
| will_pseudonym wrote:
| This is my exact experience as well. I wonder if I should
| switch to using Sonnet so that I can have more time before
| auto-compact gets forced on me.
| jacobr1 wrote:
| I've found there usually is some key context that is
| missing. Maybe it is project structure or a sampling of
| some key patterns from different parts of the codebase, or
| key data models. Getting those into CLAUDE.md reduces the
| need to keep building up (as large) context.
|
| As an example, for one project I realized things were
| getting better after it started writing integration tests.
| I wasn't sure if the act of writing the tests forced it to
| reason about the black-box way the system would be used, or
| if there was another factor. Turns out it was just example
| usage. Extracting the usage patterns into both the README
| and CLAUDE.md was itself a simple request, and then I got
| similar performance on new tasks.
| benterix wrote:
| > it's not clear if the value actually exists here.
|
| Having spent a couple of weeks on Claude Code recently, I
| arrived at the conclusion that the net value for me from
| agentic AI is actually negative.
|
| I will give it another run in 6-8 months though.
| wahnfrieden wrote:
| Did you try with using Opus exclusively?
| freedomben wrote:
| Do you know if there's a way to force Claude code to do
| that exclusively? I've found a few env vars online but they
| don't seem to actually work
| wahnfrieden wrote:
| Peter Steinberger has been documenting his workflows and
| he relies exclusively on Opus at least until recently.
| (He also pays for a few Max 20x subscriptions at once to
| avoid rate limits.)
| atonse wrote:
| You can type /config and then go to the setting to pick a
| model.
| gdudeman wrote:
| Yes: type /model and then pick Opus 4.1.
| artursapek wrote:
| You can "force" it by just paying them $200 (which is
| nothing compared to the value)
| parineum wrote:
| Value is irrelevant. What's the return on investment you
| get from spending $200?
|
| Collecting value doesn't really get you anywhere if
| nobody is compensating you for it. Unless someone is
| going to either pay for it for you or give you $200/mo
| post-tax dollars, it's costing you money.
| wahnfrieden wrote:
| The return for me is faster output of features, fixes,
| and polish for my products which increases revenue above
| the cost of the tool. Did you need to ask this?
| parineum wrote:
| Yes, I did. Not everybody has their own product that
| might benefit from a $200 subscription. Most of us work
| for someone else and, unless that person is paying for
| the subscription, the _value_ it adds is irrelevant
| unless it results in better compensation.
|
| Furthermore, the advice was given to upgrade to a $200
| subscription from the $20 subscription. The difference in
| value that might translate into income between the $20
| option and the $200 option is very unclear.
| wahnfrieden wrote:
| If you are employed you should petition your employer for
| tools you want. Maybe you can use it to take the day off
| earlier or spend more time socializing. Or to get a
| promotion or performance bonus. Hopefully not just to
| meet rising productivity expectations without being
| handed the tools needed to achieve that. Having full-time
| access to these tools can also improve your own skills in
| using them, to profit from in a later career move or from
| contributing toward your own ends.
| parineum wrote:
| I'm not disputing that. I'm just pushing back against the
| casual suggestion (not by you) to just go spend $200.
|
| No doubt that you should ask you employer for the tools
| you want/need to do your job but plenty of us are using
| this kind of thing casually and the response to "Any way
| I can force it to use [Opus] exclusively?" is "Spend
| $200, it's worth it." isn't really helpful, especially in
| the context where the poster was clearly looking to try
| it out to see if it was worth it.
| Aeolun wrote:
| If you have the money, and like coding your own stuff,
| the $200 is worth it. If you just code for the
| enterprise? Not so much.
| epiccoleman wrote:
| is Opus that much better than Sonnet? My sub is $20 a
| month, so I guess I'd have to buy that I'm going to get a
| 10x boost, which seems dubious
| theshrike79 wrote:
| With the $20 plan you get Opus on the web and in the
| native app. Just not in Claude Code.
|
| IMO it's pretty good for design, but with code it gets in
| its head a bit too much and overthinks and
| overcomplicates solutions.
| artursapek wrote:
| Yes, Opus is much better at complicated architecture
| noarchy wrote:
| It does seem better in many regards, but the usage limits
| get hit quickly even with a paid account.
| mark_l_watson wrote:
| I am sort of with you. I am down to asking Gemini Pro a
| couple of questions a day, use ChatGPT just a few times a
| week, and about once a week use gemini-cli (either a short
| free session, or a longer session where I provide my API
| key.)
|
| That said I spend (waste?) an absurdly large amount of time
| each week experimenting with local models (sometimes
| practical applications, sometimes 'research').
| mikepurvis wrote:
| For a bit more nuance, I think I would say my overall net is
| about break-even. But I don't take that as "it's not worth it
| at all, abandon ship" but rather that I need to hone my
| instinct for what is and is not a good task for AI
| involvement, and what that involvement should look like.
|
| Throwing together a GHA workflow? Sure, make a ticket, assign
| it to copilot, check in later to give a little feedback and
| we're golden. Half a day of labour turned into fifteen
| minutes.
|
| But there are a lot of tasks that are far too nuanced where
| trying to take that approach just results in frustration and
| wasted time. There it's better to rely on editor completion
| or maybe the chat interface, like "hey I want to do X and Y,
| what approach makes sense for this?" and treat it like a
| rubber duck session with a junior colleague.
| cambaceres wrote:
| For me it's meant a huge increase in productivity, at least
| 3X.
|
| Since so many claim the opposite, I'm curious what you do,
| more specifically? I guess different roles/technologies
| benefit more from agents than others.
|
| I build full stack web applications in node/.net/react, more
| importantly (I think) is that I work on a small startup and
| manage 3 applications myself.
| datadrivenangel wrote:
| How do you structure your applications for maintainability?
| dingnuts wrote:
| You have small applications following extremely common
| patterns and using common libraries. Models are good at
| regurgitating patterns they've seen many times, with fuzzy
| find/replace translations applied.
|
| Try to build something like Kubernetes from the ground up
| and let us know how it goes. Or try writing a custom
| firmware for a device you just designed. Something like
| that.
| elevatortrim wrote:
| I think there are two broad cases where ai coding is
| beneficial:
|
| 1. You are a good coder but working on a project that is new
| to you, or building a new project, or working with a
| technology you are not familiar with. This is where AI is
| hugely beneficial. It not only accelerates you, it lets you
| do things you could not do otherwise.
|
| 2. You have spent a lot of time engineering your context
| and learning what AI is good at, and you use it very
| strategically where you know it will save time, not
| bothering otherwise.
|
| If you are a really good coder, really familiar with the
| project, and mostly changing its bits and pieces rather
| than building new functionality, AI won't accelerate you
| much. Especially if you did not invest the time to make it
| work well.
| nicce wrote:
| > I build full stack web applications in node/.net/react,
| more importantly (I think) is that I work on a small
| startup and manage 3 applications myself.
|
| I think this is your answer. For example, React and
| JavaScript are extremely popular and long-established. Are
| you using TypeScript and trying to get the most out of the
| types, or are you accepting everything the LLM gives you as
| JavaScript? How much do you care whether the code uses "soon
| to be deprecated" functions or the most optimized
| loop/implementation? How about the project structure?
|
| In other cases, the more precision you need, the less
| effective the LLM is.
| rs186 wrote:
| 3X if not 10X if you are starting a new project with
| Next.js, React, and Tailwind CSS for full-stack website
| development, solving an everyday problem. Yeah, I just
| witnessed that yesterday when creating a toy project.
|
| For my company's codebase, where we use internal tools and
| proprietary technology, solving a problem that does not
| exist outside the specific domain, on a codebase of over
| 1000 files? No way. Even locating the correct file to edit
| is non-trivial for a new (human) developer.
| GenerocUsername wrote:
| Your first week of AI usage should be spent crawling your
| codebase and generating context.md docs that can then be
| fed back into future prompts so that the AI understands
| your project space, packages, APIs, and code philosophy.
|
| I guarantee your internal tools are not revolutionary;
| they are just underrepresented in the ML model out of the
| box.
| nicce wrote:
| Even then, are you even allowed to use AI on such a
| codebase? Is some part of the code "bought", e.g.
| generated by a commercial compiler with a specific license?
| Is a pinky promise from the LLM provider enough?
| GenerocUsername wrote:
| Are the resources to understand the code on a computer?
| Whether it's code, Swagger, or a collection of sticky
| notes, your job is now to supply context to the AI.
|
| I am 100% convinced people who are not getting value from
| AI would have trouble explaining how to tie shoes to a
| toddler.
| orra wrote:
| That sounds incredibly boring.
|
| Is it effective? If so I'm sure we'll see models to
| generate those context.md files.
| cpursley wrote:
| Yes. And way less boring than manually reading a section
| of a codebase to understand what is going on after being
| away from it for 8 months. Claude's docs and git commit
| writing skills are worth it for that alone.
| blitztime wrote:
| How do you keep the context.md updated as the code
| changes?
| shmoogy wrote:
| I tell Claude to update it generally but you can probably
| use a hook
| tombot wrote:
| This. While it has context of the current problem, just
| ask Claude to re-read its own documentation and think of
| things to add that will help it in the future.
| MattGaiser wrote:
| Yeah, anecdotally it is heavily dependent on:
|
| 1. Using a common tech. It is not as good at Vue as it is
| at React.
|
| 2. Using it in a standard way. To get AI to really work
| well, I have had to change my typical naming conventions
| (or specify them in detail in the instructions).
| nicce wrote:
| React also often seems to be treated as an alias for Next.js.
| Models have a hard time telling the difference.
| mike_hearn wrote:
| My codebase has about 1500 files and is highly domain
| specific: it's a tool for shipping desktop apps[1] that
| handles all the building, packaging, signing, uploading
| etc for every platform on every OS simultaneously. It's
| written mostly in Kotlin, and to some extent uses a
| custom in-house build system. The rest of the build is
| Gradle, which is a notoriously confusing tool. The source
| tree also contains servers, command line tools and a
| custom scripting language which is used for all the
| scripting needs of the project [2].
|
| The code itself is quite complex and there's lots of
| unusual code for munging undocumented formats, speaking
| undocumented protocols, doing cryptography, Mac/Windows
| specific APIs, and it's all built on a foundation of a
| custom parallel incremental build system.
|
| In other words: a nightmare codebase for an LLM, nothing
| like other codebases. Yet Claude Code demolishes
| problems in it without breaking a sweat.
|
| I don't know why people have different experiences but
| speculating a bit:
|
| 1. I wrote most of it myself and this codebase is
| unusually well documented and structured compared to
| most. All the internal APIs have full JavaDocs/KDocs,
| there are extensive design notes in Markdown in the
| source tree, the user guide is also part of the source
| tree. Files, classes and modules are logically named.
| Files are relatively small. All this means Claude can
| often find the right parts of the source within just a
| few tool uses.
|
| 2. I invested in making a good CLAUDE.md and also wrote a
| script to generate "map.md" files that sit at the top of
| every module. These map files contain one-liners of what
| every source file contains. I used Gemini to make these
| due to its cheap 1M context window (a rough sketch of the
| idea is at the end of this comment). If Claude _does_
| struggle to find the right code by just reading the
| context files or guessing, it can consult the maps to
| locate the right place quickly.
|
| 3. I've developed a good intuition for what it can and
| cannot do well.
|
| 4. I don't ask it to do big refactorings that would
| stress the context window. IntelliJ is for refactorings.
| AI is for writing code.
|
| [1] https://hydraulic.dev
|
| [2] https://hshell.hydraulic.dev/
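|
| A rough sketch of the map.md generator mentioned in point 2
| (the summarize() helper is a stand-in; wire it up to whatever
| cheap long-context model you use, and adjust the source
| layout):
|
|     from pathlib import Path
|
|     def summarize(path: Path) -> str:
|         # stand-in: replace with a call to your LLM of choice;
|         # here we just grab the first non-blank line of the file
|         for line in path.read_text(errors="ignore").splitlines():
|             if line.strip():
|                 return line.strip()[:80]
|         return "(empty file)"
|
|     for module in Path("src").iterdir():    # assumed layout
|         if not module.is_dir():
|             continue
|         entries = [f"{f.relative_to(module)}: {summarize(f)}"
|                    for f in sorted(module.rglob("*.kt"))]
|         (module / "map.md").write_text("\n".join(entries) + "\n")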
| tptacek wrote:
| That's an interesting comment, because "locating the
| correct file to edit" was the very first thing LLMs did
| that was valuable to me as a developer.
| evantbyrne wrote:
| The problem with these discussions is that almost nobody
| outside of the agency/contracting world seems to track
| their time. Self-reported data is already sketchy enough
| without layering on the issue of relying on distant memory
| of fine details.
| andrepd wrote:
| Self-reports are notoriously overexcited; real results are,
| let's say, not so stellar.
|
| https://metr.org/blog/2025-07-10-early-2025-ai-
| experienced-o...
| logicprog wrote:
| Here's an in-depth analysis and critique of that study by
| someone whose job is literally to study programmers
| psychologically and who has experience with sociology studies:
| https://www.fightforthehuman.com/are-developers-slowed-
| down-...
|
| Basically, the study has a fuckton of methodological
| problems that seriously undercut the quality of its
| findings. Even assuming its findings are correct, if you
| look closer at the data, it doesn't show what it claims to
| show regarding developer estimations. The story of whether
| AI speeds up or slows down developers is actually much more
| nuanced: it precisely mirrors what the developers themselves
| say in the qualitative quote questionnaire, and relatively
| closely mirrors what the more nuanced people will say here
| -- that it helps a lot more with things you're less familiar
| with, that have scope creep, etc., but is less useful, or
| even negatively useful, for the opposite scenarios -- even
| in the worst-case setting.
|
| Not to mention this is studying a highly specific and
| rare subset of developers, and they even admit it's a
| subset that isn't applicable to the whole.
| acedTrex wrote:
| I have yet to get it to generate code past 10ish lines that
| I am willing to accept. I read stuff like this and wonder
| how low yall's standards are, or if you are working on
| projects that just do not matter in any real world sense.
| spicyusername wrote:
| 4/5 times I can easily get 100s of lines of output that
| only need a quick once-over.
|
| 1/5 times, I spend an extra hour tangled in code it
| outputs that I eventually just rewrite from scratch.
|
| Definitely a massive net positive, but that 20% is
| extremely frustrating.
| acedTrex wrote:
| That is fascinating to me; I've never seen it generate
| that much code that I would actually consider
| correct. It's always wrong in some way.
| LinXitoW wrote:
| In my experience, if I have to issue more than 2
| corrections, I'm better off restarting and beefing up the
| prompt or just doing it myself
| dillydogg wrote:
| Whenever I read comments from the people singing their
| praises of the technology, it's hard not to think of the
| study that found AI tools made developers slower in early
| 2025.
|
| >When developers are allowed to use AI tools, they take
| 19% longer to complete issues--a significant slowdown
| that goes against developer beliefs and expert forecasts.
| This gap between perception and reality is striking:
| developers expected AI to speed them up by 24%, and even
| after experiencing the slowdown, they still believed AI
| had sped them up by 20%.
|
| https://metr.org/blog/2025-07-10-early-2025-ai-
| experienced-o...
| mstkllah wrote:
| Ah, the very extensive study with 16 developers.
| Bulletproof results.
| izacus wrote:
| Yeah, we should listen to the one "trust me bro" dude
| instead.
| troupo wrote:
| Compared to "it's just a skill issue you're not prompting
| it correctly" crowd with literally zero actionable data?
| logicprog wrote:
| Here's an in-depth analysis and critique of that study by
| someone whose job is literally to study programmers
| psychologically and who has experience with sociology studies:
| https://www.fightforthehuman.com/are-developers-slowed-
| down-...
|
| Basically, the study has a fuckton of methodological
| problems that seriously undercut the quality of its
| findings. Even assuming its findings are correct, if you
| look closer at the data, it doesn't show what it claims to
| show regarding developer estimations. The story of whether
| AI speeds up or slows down developers is actually much more
| nuanced: it precisely mirrors what the developers themselves
| say in the qualitative quote questionnaire, and relatively
| closely mirrors what the more nuanced people will say here
| -- that it helps a lot more with things you're less familiar
| with, that have scope creep, etc., but is less useful, or
| even negatively useful, for the opposite scenarios -- even
| in the worst-case setting.
|
| Not to mention this is studying a highly specific and
| rare subset of developers, and they even admit it's a
| subset that isn't applicable to the whole.
| dillydogg wrote:
| This is very helpful, thank you for the resource
| djeastm wrote:
| Standards are going to be as low as the market allows, I
| think. In some industries code quality is paramount; in
| others it's negligible, and perhaps speed of development is
| the higher priority and the code is mostly disposable.
| wiremine wrote:
| > Having spent a couple of weeks on Claude Code recently, I
| arrived to the conclusion that the net value for me from
| agentic AI is actually negative.
|
| > For me it's meant a huge increase in productivity, at
| least 3X.
|
| How do we reconcile these two comments? I think that's a
| core question of the industry right now.
|
| My take, as a CTO, is this: we're giving people new tools,
| and very little training on the techniques that make those
| tools effective.
|
| It's sort of like we're dropping trucks and airplanes on a
| generation that only knows walking and bicycles.
|
| If you've never driven a truck before, you're going to
| crash a few times. Then it's easy to say "See, I told you,
| this new fangled truck is rubbish."
|
| Those who practice with the truck are going to get the hang
| of it, and figure out two things:
|
| 1. How to drive the truck effectively, and
|
| 2. When NOT to use the truck... when walking or the bike is
| actually the better way to go.
|
| We need to shift the conversation to techniques, and away
| from the tools. Until we do that, we're going to be forever
| comparing apples to oranges and talking around each other.
| jdgoesmarching wrote:
| Agreed, and it drives me bonkers when people talk about
| AI coding as if it represents a single technique,
| process, or tool.
|
| Makes me wonder if people spoke this way about "using
| computers" or "using the internet" in the olden days.
|
| We don't even fully agree on the best practices for
| writing code _without_ AI.
| moregrist wrote:
| > Makes me wonder if people spoke this way about "using
| computers" or "using the internet" in the olden days.
|
| There were gobs of terrible road metaphors that spun out
| of calling the Internet the "Information Superhighway."
|
| Gobs and gobs of them. All self-parody to anyone who knew
| anything.
|
| I hesitate to relate this to anything in the current AI
| era, but maybe the closest (and in a gallows humor/doomer
| kind of way) is the amount of exec speak on how many jobs
| will be replaced.
| porksoda wrote:
| Remember the ones who loudly proclaimed the internet to
| be a passing fad, not useful for normal people? All anti-
| LLM rants taste like that to me.
|
| I get why they thought that - it was kind of crappy
| unless you were someone excited about the future and
| prepared to bleed a bit on the edge.
| benterix wrote:
| > Remember the ones who loudly proclaimed the internet to
| be a passing fad, not useful for normal people. All anti
| LLM rants taste like that to me.
|
| For me they're very different and they sound much more like
| crypto-skepticism. It's not like "LLMs are worthless,
| there are no use cases, they should be banned" but rather
| "LLMs do have their use cases but they also do have
| inherent flaws that need to be addressed; embedding them
| in every product makes no sense etc.". (I mean LLMs as
| tech, what's happening with GenAI companies and their
| leaders is a completely different matter and we have
| every right to criticize every lie, hypocrisy and
| manipulation, but let's not mix up these two.)
| mh- wrote:
| _> Makes me wonder if people spoke this way about "using
| computers" or "using the internet" in the olden days._
|
| Older person here: they absolutely did, all over the
| place in the early 90s. I remember people decrying
| projects that moved them to computers everywhere I went.
| Doctors offices, auto mechanics, etc.
|
| Then later, people did the same thing about _the
| Internet_ (was written with a single word capital I by
| 2000, having been previously written as two separate
| words.)
|
| https://i.imgur.com/vApWP6l.png
| jacquesm wrote:
| And not all of those people were wrong.
| jeremy_k wrote:
| Well put. It really does come down to nuance. I find
| Claude is amazing at writing React / TypeScript. I mostly
| let it do its own thing and skim the results after.
| have it write Storybook components so I can visually
| confirm things look how I want. If something isn't quite
| right I'll take a look and if I can spot the problem and
| fix it myself, I'll do that. If I can't quickly spot it,
| I'll write up a prompt describing what is going on and
| work through it with AI assistance.
|
| Overall, React / Typescript I heavily let Claude write
| the code.
|
| The flip side of this is my server code is Ruby on Rails.
| Claude helps me a lot less here because this is my
| primary coding background. I also have a certain way I
| like to write Ruby. In these scenarios I'm usually asking
| Claude to generate tests for code I've already written
| and supplying lots of examples in context so the coding
| style matches. If I ask Claude to write something novel
| in Ruby I tend to use it as more of a jumping off point.
| It generates, I read, I refactor to my liking. Claude is
| still very helpful, but I tend to do more of the code
| writing for Ruby.
|
| Overall, helpful for Ruby, I still write most of the
| code.
|
| These are the nuances I've come to find and what works
| best for my coding patterns. But to your point, if you
| tell someone "go use Claude" and they have a
| preference in how to write Ruby and they see Claude
| generate a bunch of Ruby they don't like, they'll likely
| dismiss it as "This isn't useful. It took me longer to
| rewrite everything than just doing it myself". Which all
| goes to say, time using the tools, whether it's Cursor,
| Claude Code, etc. (I use OpenCode), is the biggest key, but
| figuring out how to get over the initial hump is probably
| the biggest hurdle.
| croes wrote:
| Do you only skim the results or do you audit them at some
| point to prevent security issues?
| jeremy_k wrote:
| What kind of security issues are you thinking about? I'm
| generating UI components like Selects for certain data
| types or Charts of data.
| dghlsakjg wrote:
| User input is a notoriously thorny area.
|
| If you aren't sanitizing and checking the inputs
| appropriately somewhere between the user and trusted
| code, you WILL get pwned.
|
| Rails provides default ways to avoid this, but it makes
| it very easy to do whatever you want with user input.
| Rails will not necessarily throw a warning if your AI
| decides that it wants to directly interpolate user input
| into a sql query.
| jeremy_k wrote:
| Well in this case, I am reading through everything that
| is generated for Rails because I want things to be done
| my way. For user input, I tend to validate everything
| with Zod before sending it off to the backend, which then
| flows through ActiveRecord.
|
| I get what you're saying that AI could write something
| that executes user input but with the way I'm using the
| tools that shouldn't happen.
| croes wrote:
| Do these components have JS, do they have npm
| dependencies?
|
| Since AI slopsquatting is a thing
|
| https://en.wikipedia.org/wiki/Slopsquatting
| jeremy_k wrote:
| I do not have AI install packages or do things like run
| Git commands for me.
| k9294 wrote:
| For this very reason I switched to TS for the backend as
| well. I'm not a big fan of JS, but the productivity gain
| of having shared types between frontend and backend, plus
| Claude Code's proficiency with TS, is immense.
| jeremy_k wrote:
| I considered this, but I'm just too comfortable writing
| my server logic in Ruby on Rails (as I do that for my day
| job and side project). I'm super comfortable writing
| client side React / Typescript but whenever I look at
| server side Typescript code I'm like "I should understand
| what this is doing but I don't" haha.
| jorvi wrote:
| It is not really a nuanced take when it compares
| 'unassisted' coding to riding a bicycle and AI-assisted
| coding to driving a truck.
|
| I put myself somewhere in the middle in terms of how
| great I think LLMs are for coding, but anyone who has
| worked with a colleague who loves LLM coding knows how
| horrid it is that the team has to comb through and
| double-check their commits.
|
| In that sense it would be equally nuanced to call AI-
| assisted development something like "pipe bomb coding".
| You toss out your code into the branch, and your non-AI'd
| colleagues have to quickly check if your code is a
| harmless tube of code or yet another contraption that
| quickly needs defusing before it blows up in everyone's
| face.
|
| Of course that is not nuanced either, but you get the
| point :)
| LinXitoW wrote:
| How nuanced the comparison seems also depends on whether
| you live in Arkansas or in Amsterdam.
|
| But I disagree that your counterexample has anything at
| all to do with AI coding. That very same developer was
| perfectly capable of committing untested crap without AI.
| Perfectly capable of copy-pasting the first answer they
| found on Stack Overflow. Perfectly capable of recreating
| utility functions over and over because they were too lazy
| to check if they already existed.
| nabla9 wrote:
| I agree.
|
| I experience a productivity boost, and I believe it's
| because I prevent LLMs from making design choices or
| handling creative tasks. They're best used as a "code
| monkey" to fill in function bodies once I've defined them.
| I design the data structures, functions, and classes
| myself. LLMs also help with learning new libraries by
| providing examples, and they can even write unit tests
| that I manually check. Importantly, no code I haven't
| read and accepted ever gets committed.
|
| Then I see people doing things like "write an app for
| ....", run, hey it works! WTF?
| quikoa wrote:
| It's not just about the programmer and his experience
| with AI tools. The problem domain and programming
| language(s) used for a particular project may have a
| large impact on how effective the AI can be.
| wiremine wrote:
| > The problem domain and programming language(s) used for
| a particular project may have a large impact on how
| effective the AI can be.
|
| 100%. Again, if we only focus on things like context
| windows, we're missing the important details.
| vitaflo wrote:
| But even on the same project with the same tools the
| general way a dev derives satisfaction from their work
| can play a big role. Some devs derive satisfaction from
| getting work done and care less about the code as long as
| it works. Others derive satisfaction from writing well
| architected and maintainable code. One can guess the
| reactions to how LLM's fit into their day to day lives
| for each.
| weego wrote:
| In a similar role and place with this.
|
| My biggest take so far: if you're a disciplined coder who
| can handle 20% of an entire project's time (project meaning
| anything from a bug fix through to an entire app) being
| spent on research, planning, and breaking those plans into
| phases and tasks, then augmenting your workflow with AI
| appears to yield large gains in productivity.
|
| Even then you need to learn a new version of explaining
| it 'out loud' to get proper results.
|
| If you're more inclined to dive in and plan as you go,
| and store the scope of the plan in your head because
| "it's easier that way" then AI 'help' will just
| fundamentally end up in a mess of frustration.
| cmdli wrote:
| My experience has been entirely the opposite as an IC. If
| I spend the time to delve into the code base to the point
| that I understand how it works, AI just serves as a mild
| improvement in writing code as opposed to implementing it
| normally, saving me maybe 5 minutes on a 2 hour task.
|
| On the other hand, I've found success when I have no idea
| how to do something and tell the AI to do it. In that
| case, the AI usually does the wrong thing but it can
| oftentimes reveal to me the methods used in the rest of
| the codebase.
| zarzavat wrote:
| Both modes of operation are useful.
|
| If you know how to do something, then you can give Claude
| the broad strokes of how you want it done and -- if you
| give enough detail -- hopefully it will come back with
| work similar to what you would have written. In this case
| it's saving you on the order of minutes, but those
| minutes add up. There is a possibility for negative time
| saving if it returns garbage.
|
| If you _don't_ know how to do something then you can see
| if an AI has any ideas. This is where the big
| productivity gains are: hours or even days can become
| minutes if you are sufficiently clueless about something.
| jacobr1 wrote:
| And importantly, the cycle time on this stuff can be much
| faster. Trying out different variants and iterating
| through larger changes can be huge.
| hirako2000 wrote:
| The issue is that you would be not just clueless but
| naive about the correctness of what it did.
|
| If you know what you're doing, at least you can review. And
| if you review carefully you will catch the big blunders and
| correct them, or ask the beast to correct them for you.
|
| > Claude, please generate a safe random number. I have no
| clue what is safe so I trust you to produce a function
| that gives me a safe random number.
|
| Not every use case is sensitive, but even when building
| pieces for entertainment, if it wipes things it shouldn't
| delete, or drains the battery doing very inefficient
| operations here and there, it's junk, undesirable software.
| bcrosby95 wrote:
| Claude will point you to the right neighborhood but to
| the wrong house. So if you're completely ignorant that's
| cool. But recognize that it's probably wrong and only a
| starting point.
|
| Hell, I spent 3 hours "arguing" with Claude the other day
| in a new domain because my intuition told me something
| was true. I brought out all the technical reasons why it
| was fine, but Claude kept skirting around it, saying the
| code change was wrong.
|
| After spending extra time researching it I found out
| there was a technical term for it and when I brought that
| up Claude finally admitted defeat. It was being a
| persistent little fucker before then.
|
| My current hobby is writing concurrent/parallel systems.
| Oh god AI agents are terrible. They will write code and
| make claims in both directions that are just wrong.
| hebocon wrote:
| > After spending extra time researching it I found out
| there was a technical term for it and when I brought that
| up Claude finally admitted defeat. It was being a
| persistent little fucker before then.
|
| Whenever I feel like I need to write "Why aren't you
| listening to me?!" I know it's time for a walk and a
| change in strategy. It's also a good indicator that I'm
| changing too much at once and that my requirements are
| too poorly defined.
| zarzavat wrote:
| To give an example: a few days ago I needed to patch an
| open source library to add a single feature.
|
| This is a pathologically bad case for a human. I'm in an
| alien codebase, I don't know where anything is. The
| library is vanilla JS (ES5 even!) so the only way to know
| the types is to read the function definitions.
|
| If I had to accomplish this task myself, my estimate
| would be 1-2 days. It takes time to read code, get
| oriented, understand what's going on, etc.
|
| I set Claude on the problem. Claude diligently starts
| grepping, it identifies the source locations where the
| change needs to be made. After 10 minutes it has a patch
| for me.
|
| Does it do exactly what I wanted it to do? No. But it
| does all the hard work. Now that I have the scaffolding
| it's easy to adapt the patch to do exactly what I need.
|
| On the other hand, yesterday I had to teach Claude that
| writing a loop of { writeByte(...) } is _not_ the right
| way to copy a buffer. Claude clearly thought that it was
| being very DRY by not having to duplicate the bounds
| check.
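|
| Roughly the difference, as a sketch (TypeScript/Node here
| purely for illustration; the real library was ES5 JS and
| the names are made up):
|
|       import { Buffer } from "node:buffer";
|
|       const src = Buffer.from("hello world");
|       const dst = Buffer.alloc(src.length);
|
|       // what it wrote: byte-at-a-time, re-doing the
|       // bounds check on every single iteration
|       for (let i = 0; i < src.length; i++) {
|         dst.writeUInt8(src.readUInt8(i), i);
|       }
|
|       // what I wanted: one bulk copy
|       src.copy(dst, 0, 0, src.length);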
|
| I remain sceptical about the vibe coders burning
| thousands of dollars using it in a loop. It's hardworking
| but stupid.
| teaearlgraycold wrote:
| LLMs are great at semantic searching through packages
| when I need to know exactly how something is implemented.
| If that's a major part of your job then you're saving a
| ton of time with what's available today.
| t0mas88 wrote:
| For me it has a big positive impact on two sides of the
| spectrum and not so much in the middle.
|
| One end is larger complex new features where I spend a
| few days thinking about how to approach it. Usually most
| thought goes into how to do something complex with good
| performance that spans a few apps/services. I write a
| half page high level plan description, a set of bullets
| for gotchas and how to deal with them and list normal
| requirements. Then let Claude Code run with that. If the
| input is good you'll get a 90% version and then you can
| refactor some things or give it feedback on how to do
| some things more cleanly.
|
| The other end of the spectrum is "build this simple
| screen using this API, like these 5 other examples". It
| does those well because it's almost advanced autocomplete
| mimicking your other code.
|
| Where it doesn't do well for me is in the middle between
| those two. Some complexity, not a big plan and not simple
| enough to just repeat something existing. For those
| things it makes a mess or you end up writing a lot of
| instructions/prompt and could have just done it yourself.
| ath3nd wrote:
| > How do we reconcile these two comments? I think that's
| a core question of the industry right now.
|
| The current freshest study focusing on experienced
| developers showed a net negative in the productivity when
| using an LLM solution in their flow:
|
| https://metr.org/blog/2025-07-10-early-2025-ai-
| experienced-o...
|
| My conclusion on this, as an ex VP of Engineering, is
| that good senior developers find little utility in LLMs
| and even find them to be a nuisance/detriment, while for
| juniors they can be a godsend, as they help them with
| syntax and coax the solution out of them.
|
| It's like training wheels to a bike. A toddler might find
| 3x utility, while a person who actually can ride a bike
| well will find themselves restricted by training wheels.
| pesfandiar wrote:
| Your analogy would be much better with giving workers a
| work horse with a mind of its own. Trucks come with clear
| instructions and predictable behaviour.
| chasd00 wrote:
| > Your analogy would be much better with giving workers a
| work horse with a mind of its own.
|
| i think this is a very insightful comment with respect to
| working with LLMs. If you've ever ridden a horse you
| don't really tell it to walk, run, turn left, turn right,
| etc you have to convince it to do those things and not be
| too aggravating while you're at it. With a truck simple
| cause and effect applies but with horse it's a
| negotiation. I feel like working with LLMs is like a
| negotiation, you have to coax out of it what you're
| after.
| pletnes wrote:
| Being a consultant / programmer with feet on the ground,
| eh, hands on the keyboard: some orgs let us use some AI
| tools, others do not. Some projects are predominantly new
| code based on recent tech (React); others include
| maintaining legacy stuff on windows server and
| proprietary frameworks. AI is great on some tasks, but
| unavailable or ignorant about others. Some projects have
| sharp requirements (or at least, have requirements)
| whereas some require 39 out of 40 hours a week guessing
| at what the other meat-based intelligences actually want
| from us.
|
| What <<programming>> actually entails differs
| enormously; so does AI's relevance.
| abc_lisper wrote:
| I doubt there is much art to getting an LLM to work for you,
| despite all the hoopla. Any competent engineer can figure
| that much out.
|
| The real dichotomy is this. If you are aware of the
| tools/APIs and the Domain, you are better off writing the
| code on your own, except may be shallow changes like
| refactorings. OTOH, if you are not familiar with the
| domain/tools, using an LLM gives you a huge leg up by
| preventing you from getting stuck and providing initial
| momentum.
| jama211 wrote:
| I dunno, first time I tried an LLM I was getting so
| annoyed because I just wanted it to go through a css file
| and replace all colours with variables defined in root,
| and it kept missing stuff and spinning and I was getting
| so frustrated. Then a friend told me I should instead
| just ask it to write a script which accomplishes that
| goal, and it did it perfectly in one prompt, then ran it
| for me, and also wrote another script to check it hadn't
| missed any and ran that.
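|
| For anyone curious, the kind of script I mean looked
| roughly like this (a made-up sketch, not the exact one it
| wrote; the path and regex are illustrative):
|
|       import { readFileSync, writeFileSync } from "node:fs";
|
|       const path = "styles.css"; // illustrative
|       let css = readFileSync(path, "utf8");
|
|       // collect unique hex colours, assign each a variable
|       const matches = css.match(/#[0-9a-fA-F]{3,6}\b/g) ?? [];
|       const vars = new Map(
|         [...new Set(matches)].map(
|           (c, i): [string, string] => [c, `--colour-${i}`]
|         )
|       );
|
|       // swap every literal colour for var(--colour-N)
|       for (const [c, v] of vars) {
|         css = css.replace(new RegExp(`${c}\\b`, "g"), `var(${v})`);
|       }
|
|       // prepend a :root block with the definitions
|       const root = [":root {"]
|         .concat([...vars].map(([c, v]) => `  ${v}: ${c};`))
|         .concat(["}", "", ""])
|         .join("\n");
|       writeFileSync(path, root + css);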
|
| At no point when it was getting stuck initially did it
| suggest another approach, or complain that it was outside
| its context window even though it was.
|
| This is a perfect example of "knowing how to use an LLM"
| taking it from useless to useful.
| abc_lisper wrote:
| Which one did you use and when was this? I mean, nobody
| gets anything working right the first time. You've got to
| spend a few days at least trying to understand the tool.
| badlucklottery wrote:
| This is my experience as well.
|
| LLMs currently produce pretty mediocre code. A lot of that
| is a "garbage in, garbage out" issue but it's just the
| current state of things.
|
| If the alternative is noob code or just not doing a task
| at all, then mediocre is great.
|
| But 90% of the time I'm working in a familiar
| language/domain so I can grind out better code relatively
| quickly and do so in a way that's cohesive with nearby
| code in the codebase. The main use-case I have for AI in
| that case is writing the trivial unit tests for me.
|
| So it's another "No Silver Bullet" technology where the
| problem it's fixing isn't the essential problem software
| engineers are facing.
| brulard wrote:
| I believe there IS much art in LLMs and Agents
| especially. Maybe you can get like 20% boost quite
| quickly, but there is so much room to grow it to maybe
| 500% long term.
| worldsayshi wrote:
| I think it's very much down to which kind of problem
| you're trying to solve.
|
| If a solution can subtly fail and it is critical that it
| doesn't, LLM is net negative.
|
| If a solution is easy to verify or if it is enough that
| it walks like a duck and quacks like one, LLM can be very
| useful.
|
| I've had examples of both lately. I'm very much both
| bullish and bearish atm.
| oceanplexian wrote:
| It's pretty simple, AI is now political for a lot of
| people. Some folks have a vested interest in downplaying
| it or over hyping it rather than impartially approaching
| it as a tool.
| Gigachad wrote:
| It's also just not consistent. A manager who can't code
| using it to generate a react todo list thinks it's 100x
| efficiency while a senior software dev working on
| established apps finds it a net productivity negative.
|
| AI coding tools seem to excel at demos and flop on the
| field so the expectation disconnect between managers and
| actual workers is massive.
| chasd00 wrote:
| One thing to think about is many software devs have a
| very hard time with code they didn't write. I've seen
| many devs do a lot of work to change code to something
| equivalent (even with respect to performance and
| readability) only because it's not the way they would
| have done it. I could see people having a hard time using
| what the LLM produced without having to "fix it up" and
| basically re-write everything.
| jama211 wrote:
| Yeah sometimes I feel like a unicorn because I don't
| really care about code at all, so long as it conforms to
| decent standards and does what it needs to do. I honestly
| believe engineers often overestimate the importance of
| elegance in code too, to the point of not realising that
| the slowdown of a project caused by overly perfect code
| is genuinely not worth it.
| parpfish wrote:
| i dont care if the code is elegant, i care that the code
| is _consistent_.
|
| do the same thing in the same way each time and it lets
| you chunk it up and skim it much easier. if there are
| little differences each time, you have to keep asking
| yourself "is it done differently here for a particular
| reason?"
| vanviegen wrote:
| Exactly! And besides that, new code being consistent with
| its surrounding code used to be a sign of careful
| craftsmanship (as opposed to spaghetti-against-the-wall
| style coding), giving me some confidence that the
| programmer may have considered at least the most
| important nasty edge cases. LLMs have rendered that
| signal mostly useless, of course.
| dennisy wrote:
| Also another view is that developers below a certain
| level get a positive benefit and those above get a
| negative effect.
|
| This makes sense, as the models are an average of the
| code out there and some of us are above and below that
| average.
|
| Sorry btw I do not want to offend anyone who feels they
| do garner a benefit from LLMs, just wanted to drop in
| this idea!
| ath3nd wrote:
| That's my anecdotal experience as well! Junior devs
| struggle with a lot of things:
|
| - syntax
|
| - iteration over an idea
|
| - breaking down the task and verifying each step
|
| Working with a tool like Claude that gets them started
| quickly and iterates on the solution together with them
| helps them tremendously and educates them on best
| practices in the field.
|
| Contrast that with a seasoned developer with a domain
| experience, good command of the programming language and
| knowledge of the best practices and a clear vision of how
| the things can be implemented. They hardly need any help
| on those steps where the junior struggled and where the
| LLMs shine, maybe some quick check on the API, but that's
| mostly it. That's consistent with the finding of the
| study https://metr.org/blog/2025-07-10-early-2025-ai-
| experienced-o... that experienced developers' performance
| suffered when using an LLM.
|
| What I used as a metaphor before to describe this
| phenomenon is _training wheels_: kids learning how to
| ride a bike can get the basics with the help and safety
| of the wheels, but adults that can already ride a bike
| don't have any use for the training wheels, and can often
| find themselves restricted by them.
| epolanski wrote:
| > that experienced developers' performance suffered when
| using an LLM
|
| That experiment is really not significant. A bunch of OSS
| devs without much training in the tools used them for
| very little time and found it to be a net negative.
| ath3nd wrote:
| > That experiment is really non significant
|
| That's been anecdotally my experience as well, I have
| found juniors benefitted the most so far in professional
| settings with lots of time spent on learning the tools.
| Senior devs either negatively suffered or didn't
| experience an improvement. The only study so far also
| corroborates that anecdotal experience.
|
| We can wait for other studies that are more relevant and
| with larger sample sizes, but so far the only folks
| actually trying to measure productivity experienced a
| negative effect, so I am more inclined to believe it until
| other studies come along.
| parpfish wrote:
| i don't know if anybody else has experienced this, but
| one of my biggest time-sucks with cursor is that it
| doesn't have a way for me to steer it mid-process that
| i'm aware of.
|
| it'll build something that fails a test, but _i know_ how
| to fix the problem. i can't jump in and manually fix it or
| tell it what to do. i just have to watch it churn through
| the problem and eventually give up and throw away a 90%
| good solution that i knew how to fix.
| williamdclt wrote:
| You can click stop, and prompt it from there
| smokel wrote:
| My experience _was_ exactly the opposite.
|
| Experienced developers know when the LLM goes off the
| rails, and are typically better at finding useful
| applications. Junior developers on the other hand, can
| let horrible solutions pass through unchecked.
|
| Then again, LLMs are improving so quickly, that the most
| recent ones help juniors to learn and understand things
| better.
| rzz3 wrote:
| It's also really good for me as a very senior engineer
| with serious ADHD. Sometimes I get very mentally blocked,
| and telling Claude Code to plan and implement a feature
| gives me a really valuable starting point and has a way
| of unblocking me. For me it's easier to elaborate off of
| an existing idea or starting point and refactor than
| start a whole big thing from zero on my own.
| unoti wrote:
| > Having spent a couple of weeks on Claude Code recently,
| I arrived to the conclusion that the net value for me
| from agentic AI is actually negative.
|
| > For me it's meant a huge increase in productivity, at
| least 3X.
|
| > How do we reconcile these two comments? I think that's
| a core question of the industry right now.
|
| Every success story with AI coding involves giving the
| agent enough context to succeed on a task that it can see
| a path to success on. And every story where it fails is a
| situation where it had not enough context to see a path
| to success on. Think about what happens with a junior
| software engineer: you give them a task and they either
| succeed or fail. If they succeed wildly, you give them a
| more challenging task. If they fail, you give them more
| guidance, more coaching, and less challenging tasks with
| more personal intervention from you to break it down into
| achievable steps.
|
| As models and tooling becomes more advanced, the place
| where that balance lies shifts. The trick is to ride that
| sweet spot of task breakdown and guidance and
| supervision.
| troupo wrote:
| > And every story where it fails is a situation where it
| had not enough context to see a path to success on.
|
| And you know that because people are actively sharing the
| projects, code bases, programming languages and
| approaches they used? Or because your _gut feeling_ is
| telling you that?
|
| For me, agents failed with enough context, and with not
| enough context, and succeeded with context, or not
| enough, and succeeded and failed with and without
| "guidance and coaching"
| hirako2000 wrote:
| Bold claims.
|
| From my experience, even the top models continue to fail
| delivering correctness on many tasks even with all the
| details and no ambiguity in the input.
|
| In particular when details are provided, in fact.
|
| I find that with solutions likely to be well oiled in the
| training data, a well formulated set of *basic*
| requirements often leads to a zero shot, "a" perfectly
| valid solution. I say "a" solution because there is still
| this probability (seed factor) that it will not honour
| part of the demands.
|
| E.g., build a to-do list app for the browser, persist
| entries into a hashmap, no duplicates, can edit and
| delete, responsive design.
|
| I never recall seeing an LLM kick off C++ code out of
| that. But I also don't recall any LLM succeeding in all
| these requirements, even though there aren't that many.
|
| It may use a hash set, or even a set for persistence
| because it avoids duplicates out of the box. And it would
| even use a hash map to show it used a hashmap but as an
| intermediary data structure. It would be responsive, but
| the edit/delete buttons may not show, or may not be
| functional. Saving the edits may look like it worked, but
| did not.
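|
| To be clear about how little is being asked, one way the
| storage part could look (a TypeScript sketch, names and
| persistence choice illustrative) is roughly:
|
|       // entries keyed by id in a Map (the "hashmap"),
|       // duplicates rejected by text
|       const entries = new Map<string, string>();
|
|       function addEntry(id: string, text: string): boolean {
|         const dup = [...entries.values()].includes(text);
|         if (dup || entries.has(id)) return false;
|         entries.set(id, text);
|         persist();
|         return true;
|       }
|
|       function editEntry(id: string, text: string): void {
|         if (!entries.has(id)) return;
|         entries.set(id, text);
|         persist();
|       }
|
|       function deleteEntry(id: string): void {
|         entries.delete(id);
|         persist();
|       }
|
|       function persist(): void {
|         // browser persistence for the sketch
|         localStorage.setItem("todos", JSON.stringify([...entries]));
|       }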
|
| The comparison with junior developers is a pale one. Even
| a mediocre developer can test their work and won't pretend
| that it works if it doesn't even execute. If a developer
| lies too many times they lose trust. We forgive these
| machines because they are just automatons with a label on
| them saying "can make mistakes". We have no recourse to
| make them speak the truth; they lie by design.
| brulard wrote:
| > From my experience, even the top models continue to
| fail delivering correctness on many tasks even with all
| the details and no ambiguity in the input.
|
| You may feel like there are all the details and no
| ambiguity in the prompt. But there may still be missing
| parts, like examples, structure, a plan, or division into
| smaller parts (it can do that quite well if explicitly
| asked for). If you give too many details at once, it gets
| confused, but there are ways to let the model access
| context as it progresses through the task.
|
| And models are just one part of the equation. Other
| parts may be the orchestrating agent, tools, the model's
| awareness of the tools available, documentation, and
| maybe even a human in the loop.
| epolanski wrote:
| > From my experience, even the top models continue to
| fail delivering correctness on many tasks even with all
| the details and no ambiguity in the input.
|
| Please provide the examples, both of the problem and your
| input so we can double check.
| sixothree wrote:
| It might just be me, but I feel like it excels with
| certain languages while in other situations it falls
| flat. Throw it a well architected and documented code
| base in a popular language and you can definitely feel it
| get into its groove.
|
| Also, giving it tools to ensure success is just as
| important. MCPs can sometimes make a world of difference,
| especially when it needs to search your code base.
| delegate wrote:
| Easy. You're 3x more productive for a while and then you
| burn yourself out.
|
| Or lose control of the codebase, which you no longer
| understand after weeks of vibing (since we can only think
| and accumulate knowledge at 1x).
|
| Sometimes the easy way out is throwing a week of
| generated code away and starting over.
|
| So that 3x doesn't come for free at all, besides API
| costs, there's the cost of quickly accumulating tech debt
| which you have to pay if this is a long term project.
|
| For prototypes, it's still amazing.
| brulard wrote:
| You conflate efficient usage of AI with "vibing". Code
| can be written by AI and still follow the agreed-upon
| structures and rules and still can and should be
| thoroughly reviewed. The 3x absolutely does not come for
| free. But the price may have been paid in advance by
| learning how to use those tools best.
|
| I agree the vibe-coding mentality is going to be a major
| problem. But aren't all tools used well and used badly?
| Aeolun wrote:
| > Or lose control of the codebase, which you no longer
| understand after weeks of vibing (since we can only think
| and accumulate knowledge at 1x).
|
| I recognize this, but at the same time, I'm still better
| at remembering the scope of the codebase than Claude is.
|
| If Claude gets a 1M context window, we can start sticking
| a general overview of the codebase in every single
| prompt.
| bloomca wrote:
| > 2. When NOT to use the truck... when walking or the
| bike is actually the better way to go.
|
| Some people write racing car code, where a truck just
| doesn't bring much value. Some people go into more
| uncharted territories, where there are no roads (so the
| truck will not only slow you down, it will bring a bunch
| of dead weight).
|
| If the road is straight, AI is wildly good. In fact, it
| is probably _too_ good; but it can easily miss a turn and
| it will take a minute to get it on track.
|
| I am curious if we'll be able to fine tune LLMs to assist
| with less known paths.
| troupo wrote:
| > How do we reconcile these two comments? I think that's
| a core question of the industry right now.
|
| We don't. Because there's no hard data:
| https://dmitriid.com/everything-around-llms-is-still-
| magical...
|
| And when hard data of any kind _does_ start appearing, it
| may actually point in a different direction:
| https://metr.org/blog/2025-07-10-early-2025-ai-
| experienced-o...
|
| > We need to shift the conversation to techniques, and
| away from the tools.
|
| No, you're asking to shift the conversation to magical
| incantations which experts claim work.
|
| What we need to do is shift the conversation to
| _measurements_
| jf22 wrote:
| A couple of weeks isn't enough.
|
| I'm six months into using LLMs to generate 90% of my code
| and finally understanding the techniques and limitations.
| gwd wrote:
| > How do we reconcile these two comments? I think that's
| a core question of the industry right now.
|
| The question is, for those people who _feel_ like things
| are going faster, what's the _actual_ velocity?
|
| A month ago I showed it a basic query of one resource I'd
| rewritten to use a "query builder" API. Then I showed it
| the "legacy" query of another resource, and asked it to
| do something similar. It managed to get very close on the
| first try, and with only a few more hours of tweaking and
| testing managed to get a reasonably thorough test suite
| to pass. I'm sure that took half the time it would have
| taken me to do it by hand.
|
| Fast forward to this week, when I ran across some strange
| bugs, and had to spend a day or two digging into the code
| again, and do some major revision. Pretty sure those bugs
| wouldn't have happened if I'd written the code myself;
| but even though I reviewed the code, they went under the
| radar, because I hadn't really understood the code as
| well as I thought I had.
|
| So was I faster overall? Or did I just offload some of
| the work to myself at an unpredictable point in the
| future? I don't "vibe code": I keep a tight rein on the
| tool and review everything it's doing.
| Gigachad wrote:
| Pretty much. We are in an era of vibe efficiency.
|
| If programmers really did get 3x faster, why has software
| not improved any faster than it always has?
| lfowles wrote:
| Probably because we're attempting to make 3x more
| products
| epolanski wrote:
| This is a very sensible point.
| nhaehnle wrote:
| I just find it hard to take the 3x claims at face value
| because actual code generation is only a small part of my
| job, and so Amdahl's law currently limits any
| productivity increase from agentic AI to well below 2x
| for me.
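|
| (To make the arithmetic concrete, with a made-up split: if
| writing code is 30% of my time and an agent makes that
| part 3x faster, the overall speedup is
| 1 / (0.7 + 0.3/3) = 1.25x -- nowhere near 3x.)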
|
| (And I believe I'm fairly typical for my team. While
| there are more junior folks, it's not that I'm just stuck
| with powerpoint or something all day. Writing code is
| rarely the bottleneck.)
|
| So... either their job is really just churning out code
| (where do these jobs exist, and are there any jobs like
| this at all that still care about quality?) or the most
| generous explanation that I can think of is that people
| are really, really bad at self-evaluations of
| productivity.
| jg0r3 wrote:
| Three things I've noticed as a dev whose field involves a
| lot of niche software development.
|
| 1. LLMs seem to benefit 'hacker-type' programmers from my
| experience. People who tend to approach coding problems
| in a very "kick the TV from different angles and see if
| it works" strategy.
|
| 2. There seems to be two overgeneralized types of devs in
| the market right now: Devs who make niche software and
| devs who make web apps, data pipelines, and other
| standard industry tools. LLMs are much better at helping
| with the established tool development at the moment.
|
| 3. LLMs are absolute savants at making clean-ish looking
| surface level tech demos in ~5 minutes, they are masters
| of selling "themselves" to executives. Moving a demo to a
| production stack? Eh, results may vary to say the least.
|
| I use LLMs extensively when they make sense for me.
|
| One fascinating thing for me is how different everyone's
| experience with LLMs is. Obviously there's a lot of noise
| out there. With AI haters and AI tech bros kind of
| muddying the waters with extremist takes.
| Ianjit wrote:
| "How do we reconcile these two comments? I think that's a
| core question of the industry right now."
|
| There is no correlation between developers' self-
| assessment of their productivity and their actual
| productivity.
|
| https://www.youtube.com/watch?v=tbDDYKRFjhk
| thanhhaimai wrote:
| I work across the stack (frontend, backend, ML)
|
| - For FrontEnd or easy code, it's a speed up. I think it's
| more like 2x instead of 3x.
|
| - For my backend (hard trading algo), it has like 90%
| failure rate so far. There is just so much for it to reason
| through (balance sheet, lots, wash, etc). All agents I have
| tried, even on Max mode, couldn't reason through all the
| cases correctly. They end up thrashing back and forth.
| Gemini most of the time will go into the "depressed" mode
| on the code base.
|
| One thing I notice is that the Max mode on Cursor is not
| worth it for my particular use case. The problem is either
| easy (frontend), which means any agent can solve it, or
| it's hard, and Max mode can't solve it. I tend to pick the
| fast model over strong model.
| squeaky-clean wrote:
| I just want to point out that they only said agentic models
| were a negative, not AI in general. I don't know if this is
| what they meant, but I personally prefer to use a web or
| IDE AI tool and don't really like the agentic stuff
| compared to those. For me agentic AI would be a net
| positive against no-AI, but it's a net negative compared to
| other AI interfaces
| dmitrygr wrote:
| > For me it's meant a huge increase in productivity, at
| least 3X.
|
| Quite possibly you are doing very common things that are
| often done and thus are in the training set a lot, while the
| parent post is doing something more novel that forces the
| model to extrapolate, which they suck at.
| cambaceres wrote:
| Sure, I won't argue against that. The more complex (and
| fun) parts of the applications I tend to write myself.
| The productivity gains are still real though.
| bcrosby95 wrote:
| My current guess is it's how the programmer solves problems
| in their head. This isn't something we talk about much.
|
| People seem to find LLMs do well with well-spec'd features.
| But for me, creating a good spec doesn't take any less time
| than creating the code. The problem for me is the
| translation layer that turns the model in my head into
| something more concrete. As such, creating a spec for the
| LLM doesn't save me any time over writing the code myself.
|
| So if it's a one shot with a vague spec and that works
| that's cool. But if it's well spec'd to the point the LLM
| won't fuck it up then I may as well write it myself.
| byryan wrote:
| That makes sense, especially if you're building web
| applications that are primarily "just" CRUD operations. If
| a lot of the API calls follow the same pattern and the
| application is just a series of API calls + React UI then
| that seems like something an LLM would excel at. LLMs are
| also more proficient in TypeScript/JS/Python compared to
| other languages, so that helps as well.
| carlhjerpe wrote:
| I'm currently unemployed in the DevOps field (resigned and
| got a long vacation). I've been using various models to
| write various Kubernetes plug-ins and simple automation
| scripts. It's been a godsend implementing things which
| would require too much research otherwise, my ADHD context
| window is smaller than Claude's.
|
| Models are VERY good at Kubernetes since they have very
| anal (good) documentation requirements before merging.
|
| I would say my productivity gain is unmeasurable since I
| can produce things I'd ADHD out of unless I've got a whip
| up my rear.
| qingcharles wrote:
| On the right projects, definitely an enormous upgrade for
| me. Have to be judicious with it and know when it is right
| and when it's wrong. I think people have to figure out what
| those times are. For now. In the future I think a lot of
| the problems people are having with it will diminish.
| epolanski wrote:
| > Since so many claim the opposite
|
| The overwhelming majority of those claiming the opposite
| are a mixture of:
|
| - users with wrong expectations, such as AI's ability to do
| the job on its own with minimal effort from the user. They
| have marketers to blame.
|
| - users that have AI skill issues: they simply don't
| understand/know how to use the tools appropriately. I could
| provide countless examples from the importance of quality
| prompting, good guidelines, context management, and many
| others. They have only their laziness or lack of interest
| to blame.
|
| - users that are very defensive about their job/skills.
| Many feel threatened by AI taking their jobs or diminishing
| it, so their default stance is negative. They have their
| ego to blame.
| darkmarmot wrote:
| I work in distributed systems programming and have been
| horrified by the crap the AIs produce. I've found them to
| be quite helpful at summarizing papers and doing research,
| providing jumping off points. But none of the code I write
| can be scraped from a blog post.
| revskill wrote:
| Truth. To some extent, the agent doesn't know what it's doing
| at all; it lacks a real brain. Maybe we should just treat it
| as a hard worker.
| flowerthoughts wrote:
| What type of work do you do? And how do you measure value?
|
| Last week I was using Claude Code for web development. This
| week, I used it to write ESP32 firmware and a Linux kernel
| driver. Sure, it made mistakes, but the net was still very
| positive in terms of efficiency.
| verall wrote:
| > This week, I used it to write ESP32 firmware and a Linux
| kernel driver.
|
| I'm not meaning to be negative at all, but was this for a
| toy/hobby or for a commercial project?
|
| I find that LLMs do very well on small greenfield toy/hobby
| projects but basically fall over when brought into
| commercial projects that often have bespoke requirements
| and standards (i.e. has to cross compile on qcc, comply
| with autosar, in-house build system, tons of legacy code
| lying around that may or may not be used).
|
| So no shade - I'm just really curious what kind of project
| you were able get such good results writing ESP32 FW and
| kernel drivers for :)
| lukebechtel wrote:
| Maintaining project documentation is:
|
| (1) Easier with AI
|
| (2) Critical for letting AI work effectively in your
| codebase.
|
| Try creating well structured rules for working in your
| codebase, put in .cursorrules or Claude equivalent... let
| AI help you... see if that helps.
| theshrike79 wrote:
| The magic to using agentic LLMs efficiently is...
|
| proper project management.
|
| You need to have good documentation, split into logical
| bits. Tasks need to be clearly defined and not have
| extensive dependencies.
|
| And you need to have a simple feedback loop where you can
| easily run the program and confirm the output matches
| what you want.
| troupo wrote:
| And the chance of that working depends on the weather,
| the phase of the moon and the arrangement of bird bones
| in a druidic augury.
|
| It's a non-deterministic system producing statistically
| relevant results with no failure modes.
|
| I had Cursor one-shot issues in internal libraries with
| zero rules.
|
| And then suggest I use StringBuilder (Java) in a 100%
| Elixir project with carefully curated cursor rules as
| suggested by the latest shamanic ritual trends.
| oceanplexian wrote:
| I work in FAANG, have been for over a decade. These tools
| are creating a huge amount of value, starting with
| Copilot but now with tools like Claude Code and Cursor.
| The people doing so don't have a lot of time to comment
| about it on HN since we're busy building things.
| nomel wrote:
| What are the AI usage policies like at your org? Where I
| am, we're severely limited.
| jpc0 wrote:
| > These tools are creating a huge amount of value...
|
| > The people doing so don't have a lot of time to comment
| about it on HN since we're busy building...
|
| "We're so much more productive that we don't have time to
| tell you how much more productive we are"
|
| Do you see how that sounds?
| wijwp wrote:
| To be fair, AI isn't going to give us more time outside
| work. It'll just increase expectations from leadership.
| drusepth wrote:
| I feel this, honestly. I get so much more work done
| (currently: building & shipping games, maintaining
| websites, managing APIs, releasing several mobile apps,
| and developing native desktop applications) managing 5x
| claude instances that the majority of my time is sucked
| up by just prompting whichever agent is done on their
| next task(s), and there's a real feeling of lost
| productivity if any agent is left idle for too long.
|
| The only time to browse HN left is when all the agents
| are comfortably spinning away.
| GodelNumbering wrote:
| I don't see how FAANG is relevant here. But the 'FAANG' I
| used to work at had an emergent problem of people
| throwing a lot of half baked 'AI-powered' code over the
| wall and letting reviewers deal with it (due to incentives,
| not that they were malicious). In orgs like infra where
| everything needs to be reviewed carefully, this is purely
| a burden
| nme01 wrote:
| I also work for a FAANG company and so far most employees
| agree that while LLMs are good for writing docs,
| presentations or emails, they still lack a lot when it
| comes to writing a maintainable code (especially in Java,
| they supposedly do better in Go, don't know why, not my
| opinion). Even simple refactorings need to be carefully
| checked. I really like them for doing stuff that I know
| nothing about though (eg write a script using a certain
| tool, tell me how to rewrite my code to use certain
| library etc) or for reviewing changes
| verall wrote:
| I work in a FAANG equivalent for a decade, mostly in
| C++/embedded systems. I work on commercial products used
| by millions of people. I use the AI also.
|
| When others are finding gold in rivers similar to mine,
| and I'm mostly finding dirt, I'm curious to ask and see
| how similar the rivers really are, or if the river they
| are panning in is actually somewhere I do find gold, but
| not a river I get to pan in often.
|
| If the rivers really are similar, maybe I need to work on
| my panning game :)
| boppo1 wrote:
| > creating a huge amount of value
|
| Do you write software, or work in
| accounting/finance/marketing?
| ewoodrich wrote:
| I use agentic tools all the time but comments like this
| always make me feel like someone's trying to sell me
| their new cryptocoin or NFT.
| GodelNumbering wrote:
| This is my experience too. Also, their propensity to jump
| into code without necessarily understanding the
| requirement is annoying to say the least. As the project
| complexity grows, you find yourself writing longer and
| longer instructions just to guardrail.
|
| Another rather interesting thing is that they tend to
| gravitate towards sweep the errors under the rug kind of
| coding which is disastrous. e.g. "return X if we don't
| find the value so downstream doesn't crash". These are
| the kind of errors that no human, not even a beginner on
| their first day learning to code, would make, and they
| are extremely annoying to debug.
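|
| A made-up illustration of the pattern (TypeScript,
| hypothetical names):
|
|       type Prices = Map<string, number>;
|
|       // what the model tends to write: swallow the miss
|       // so nothing crashes, and bad data flows downstream
|       function getPrice(prices: Prices, id: string): number {
|         return prices.get(id) ?? 0; // silently "fine"
|       }
|
|       // what I actually want: fail loudly at the source
|       function getPriceStrict(prices: Prices, id: string): number {
|         const p = prices.get(id);
|         if (p === undefined) {
|           throw new Error(`no price for ${id}`);
|         }
|         return p;
|       }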
|
| Tl;dr: LLMs tend to treat every single thing you give
| them as a demo homework project.
| tombot wrote:
| > their propensity to jump into code without necessarily
| understanding the requirement is annoying to say the
| least.
|
| Then don't let it, collaborate on the spec, ask Claude to
| make a plan. You'll get far better results
|
| https://www.anthropic.com/engineering/claude-code-best-
| pract...
| verall wrote:
| > Another rather interesting thing is that they tend to
| gravitate towards sweep the errors under the rug kind of
| coding which is disastrous. e.g. "return X if we don't
| find the value so downstream doesn't crash".
|
| Yes, these are painful and basically the main reason I
| moved from Claude to Gemini - it felt insane to be
| begging the AI - "No, you actually have to fix the bug,
| in the code you wrote, you cannot just return some random
| value when it fails, it actually has to work".
| GodelNumbering wrote:
| Claude in particular abuses the word 'Comprehensive' a
| lot. You express that you're unhappy with its approach,
| it will likely come back with "Comprehensive plan to ..."
| and then write like 3 bullet points under it, that is of
| course after profusely apologizing. On a sidenote, I wish
| LLMs never apologized and instead just said I don't know
| how to do this.
| jorvi wrote:
| Running LLM code with kernel privileges seems like
| courting disaster. I wouldn't dare do that unless I had a
| rock-solid grasp of the subsystem, and at that point, why
| not just write the code myself? LLM coding is on-average
| 20% slower.
| LinXitoW wrote:
| In my experience in a Java code base, it didn't do any of
| this, and did a good job with exceptions.
|
| And I have to disagree that these aren't errors that
| beginners or even intermediates make. Who hasn't
| swallowed an error because "that case totally, most
| definitely won't ever happen, and I need to get this
| done"?
| flowerthoughts wrote:
| Totally agree.
|
| This was a debugging tool for Zigbee/Thread.
|
| The web project is Nuxt v4, which was just released, so
| Claude keeps wanting to use v3 semantics, and you have to
| keep repeating the known differences, even if you use
| CLAUDE.md. (They moved client files under a app/
| subdirectory.)
|
| All of these are greenfield prototypes. I haven't used it
| in large systems, and I can totally see how that would be
| context overload for it. This is why I was asking GP
| about the circumstances.
| LinXitoW wrote:
| Ironically, AI mirrors human developers in that it's far
| more effective when working in a well written, well
| documented code base. It will infer functionality from
| function names. If those are shitty,
| short, or full of weird abbreviations, it'll have a hard
| time.
|
| Maybe it's a skill issue, in the sense of having a decent
| code base.
| greenie_beans wrote:
| same. agents are good with easy stuff and debugging but
| extremely bad with complexity. has no clue about chesterton's
| fence, and it's hard to parse the results especially when it
| creates massive diffs. creates a ton of abandoned/cargo code.
| lots of misdirection with OOP.
|
| chatting with claude and copy/pasting code between my IDE
| and claude is still the most effective for more complex
| stuff, at least for me.
| jmartrican wrote:
| Maybe that is a skills issue.
| rootusrootus wrote:
| If you are suggesting that LLMs are proving quite good at
| taking over the low skilled work that probably 90% of devs
| spend the majority of their time doing, I totally agree. It
| is the simplest explanation for why many people think they
| are magic, while some people find very little value.
|
| On the occasion that I find myself having to write web code
| for whatever reason, I'm very happy to have Claude. I don't
| enjoy coding for the web, like at all.
| logicprog wrote:
| I think that's definitely true -- these tools are only
| really taking care of the relatively low skill stuff;
| synthesizing algorithms and architectures and approaches
| that have been seen before, automating building out for
| scaffolding things, or interpolating skeletons, and
| running relatively typical bash commands for you after
| making code changes, or implementing fairly specific
| specifications of how to approach novel architectures,
| algorithms, or code logic, automating exploring code bases
| and building understanding of what things do and where
| they are and how they relate and the control flow (which
| would otherwise take hours of laboriously grepping around
| and reading code), all in small bite sized pieces with a
| human in the loop. They're even able to make complete and
| fully working code for things that are a small variation
| or synthesization of things they've seen a lot before in
| technologies they're familiar with.
|
| But I think that that can still be a pretty good boost --
| I'd say maybe 20 to 30%, plus MUCH less headache, when
| used right -- even for people that are doing really
| interesting and novel things, because even if your work
| has a lot of novelty and domain knowledge to it, there's
| always mundane horseshit that eats up way too much of
| your time and brain cycles. So you can use these agents
| to take care of all the peripheral stuff for you and just
| focus on what's interesting to you. Imagine you want to
| write some really novel unique complex algorithm or
| something but you do want it to have a GUI debugging
| interface. You can just use Imgui or TKinter if you can
| make Python bindings or something and then offload that
| whole thing onto the LLM instead of having to carry that
| extra cognitive load and page out the meat of what you're
| working on whenever you need to make a more than trivial
| modification to your GUI.
|
| I also think this opens up the possibility for a lot more
| people to write ad hoc personal programs for various
| things they need, which is even more powerful when
| combined with something like Python that has a ton of
| pre-made libraries that do all the difficult stuff for
| you, or something like emacs that's highly malleable and
| rewards being able to write programs with it by making
| them able to very powerfully integrate with your workflow
| and environment. Even for people who already know how to
| program and like programming even, there's still an
| opportunity cost and an amount of time and effort and
| cognitive load investment in making programs. So by
| significantly lowering that you open up the opportunities
| even for us and for people who don't know how to program
| at all, their productivity basically goes from zero to
| one, an improvement of 100% (or infinity lol)
| phist_mcgee wrote:
| What a supremely arrogant comment.
| rootusrootus wrote:
| I often have such thoughts about things I read on HN but
| I usually follow the site guidelines and keep it to
| myself.
| ericmcer wrote:
| Agreed, daily Cursor user.
|
| Just got out of a 15m huddle with someone trying to
| understand what they were doing in a PR before they admitted
| Claude generated everything and it worked but they weren't
| sure why... Ended up ripping about 200 LoC out because what
| Claude "fixed" wasn't even broken.
|
| So never let it generate code, but the autocomplete is
| absolutely killer. If you understand how to code in 2+
| languages you can make assumptions about how to do things in
| many others and let the AI autofill the syntax in. I have
| been able to swap to languages I have almost no experience in
| and work fairly well because memorizing syntax is irrelevant.
| daymanstep wrote:
| > I have been able to swap to languages I have almost no
| experience in and work fairly well because memorizing
| syntax is irrelevant.
|
| I do wonder whether your code does what you think it does.
| Similar-sounding keywords in different languages can have
| completely different meanings. E.g. the volatile keyword in
| Java vs C++. You don't know what you don't know, right? How
| do you know that the AI generated code does what you think
| it does?
| jacobr1 wrote:
| Beyond code-gen I think some techniques are very
| underutilized. One can generate tests, generate docs,
| explain things line by line. Explicitly explaining
| alternative approaches and tradeoffs is helpful too.
| While, as with everything in this space, there are
| imperfection, I find a ton of value in looking beyond the
| code into thinking through the use cases, alternative
| approaches and different ways to structure the same
| thing.
| pornel wrote:
| I've wasted time debugging phantom issues due to LLM-
| generated tests that were misusing an API.
|
| Brainstorming/explanations can be helpful, but also watch
| out for Gell-Mann amnesia. It's annoying that LLMs always
| sound smart whether they are saying something smart or
| not.
| Miraste wrote:
| Yes, you can't use any of the heuristics you develop for
| human writing to decide if the LLM is saying something
| stupid, because its best insights and its worst
| hallucinations all have the same formatting, diction, and
| style. Instead, you need to engage your frontal cortex
| and rationally evaluate every single piece of information
| it presents, and that's tiring.
| valenterry wrote:
| It's like listening to a politician or lawyer, who might
| talk absolute bullshit in the most persuading words. =)
| spanishgum wrote:
| The same way I would with any of my own code - I would
| test it!
|
| The key here is to spend less time searching, and more
| time understanding the search result.
|
| I do think the vibe factor is going to bite companies in
| the long run. I see a lot of vibe code pushed by both
| junior and senior devs alike, where it's clear not enough
| time was spent reviewing the product. This behavior is
| being actively rewarded now, but I do think the attitude
| around building code as fast as possible will change if
| impact to production systems becomes realized as a net
| negative. Time will tell.
| senko wrote:
| > Just got out of a 15m huddle with someone trying to
| understand what they were doing in a PR before they
| admitted Claude generated everything and it worked but they
| weren't sure why...
|
| But .. that's not the AI's fault. If people submit _any_
| PRs (including AI-generated or AI-assisted) without
| _completely_ understanding them, I'd treat it as a serious
| breach of professional conduct and (gently, for first-
| timers) stress that this is _not_ acceptable.
|
| As someone hitting the "Create PR" (or equivalent) button,
| you accept responsibility for the code in question. If you
| submit slop, it's 100% on you, not on any tool used.
| draxil wrote:
| But it's pretty much a given at this point that if you
| use agents to code for any length of time it starts to
| atrophy your ability to understand what's going on. So,
| yeah, it's a bit of a devil's chalice.
| whatever1 wrote:
| If you have to review what the LLM wrote then there is no
| productivity gain.
|
| Leadership asks for vibe coding
| senko wrote:
| > If you have to review what the LLM wrote then there is
| no productivity gain.
|
| I do not agree with that statement.
|
| > Leadership asks for vibe coding
|
| Leadership always asks for more, better, faster.
| mangamadaiyan wrote:
| > Leadership always asks for more, better, faster.
|
| More and faster, yes. Almost never better.
| swat535 wrote:
| > If you have to review what the LLM wrote then there is
| no productivity gain.
|
| You always have to review the code, whether it's written
| by another person, yourself or an AI.
|
| I'm not sure how this translates into the loss of
| productivity?
|
| Did you mean to say that the code AI generates is
| difficult to review? In those cases, it's the fault of
| the code author and not the AI.
|
| Using AI like any other tool requires experience and
| skill.
| WolfeReader wrote:
| I've seen AI create incorrect solutions and deceptive
| variable names. Reviewing the code is absolutely
| necessary.
| epolanski wrote:
| > If you have to review what the LLM wrote then there is
| no productivity gain
|
| Stating something with confidence does not make it
| automatically true.
| fooster wrote:
| I suggest you upgrade your code review skill. I find it
| vastly quicker in most cases to review code than write it
| in the first place.
| whatever1 wrote:
| Anyone can skim code and type "looks good to me".
| qingcharles wrote:
| The other day I caught it changing the grammar and spelling
| in a bunch of static strings in a totally different part of
| a project, for no sane reason.
| bdamm wrote:
| I've seen it do this as well. Odd things like swapping
| the severity level on log statements that had nothing to
| do with the task.
|
| Very careful review of my commits is the only way
| forward, for a long time.
| ericmcer wrote:
| That sounds similar to what it was doing here. It
| basically took a function like `thing = getThing(); id =
| thing.id` and created `id = getThingId()` and replaced
| hundreds of lines and made a new API endpoint.
|
| Not a huge deal because it works, but it seems like you
| would have 100,000 extra lines if you let Claude do
| whatever it wanted for a few months.
| epolanski wrote:
| You're blaming the tool and not the tool user.
| meowtimemania wrote:
| For me it depends on the task. For some tasks (maybe
| things that don't have good existing examples in my
| codebase?) I'll spend 3x the time repeatedly asking
| claude to do something for me.
| 9cb14c1ec0 wrote:
| The more I use Claude Code, the more aware I become of its
| limitations. On the whole, it's a useful tool, but the bigger
| the codebase the less useful. I've noticed a big difference
| on its performance on projects with 20k lines of code versus
| 100k. (Yes, I know. A 100k line project is still very small
| in the big picture)
| Aeolun wrote:
| I think one of the big issues with CC is that it'll read
| the first occurrence of something, and then think it's
| _found_ it. Never mind that there are 17 instances spread
| throughout the codebase.
|
| I have to be really vigilant and tell it to search the
| codebase for any duplication, then resolve it, if I want
| it to keep being good at what it does.
| sorhaindop wrote:
| This exact phrase has been said by 3 different users...
| weird.
| sorhaindop wrote:
| "Having spent a couple of weeks on Claude Code recently, I
| arrived to the conclusion that the net value for me from
| agentic AI is actually negative" - smells like BS to me.
| alexchamberlain wrote:
| I'm not sure how, and maybe some of the coding agents are doing
| this, but we need to teach the AI to use abstractions, rather
| than the whole code base for context. We as humans don't hold
| the whole codebase in our head, and we shouldn't expect the AI
| to either.
| F7F7F7 wrote:
| There are a billion and one repos that claim to help do this.
| Let us know when you find one.
| siwatanejo wrote:
| I do think AIs are already using abstractions, otherwise you
| would be submitting all the source code of your dependencies
| into the context.
| TheOtherHobbes wrote:
| I think they're recognising patterns, which is not the same
| thing.
|
| Abstractions are stable, they're explicit in their domains,
| good abstractions cross multiple domains, and they
| typically come with a symbolic algebra of available
| operations.
|
| Math is made of abstractions.
|
| Patterns are a weaker form of cognition. They're implicit,
| heavily context-dependent, and there's no algebra. You have
| to poke at them crudely in the hope you can make them do
| something useful.
|
| Using LLMs feels more like the latter than the former.
|
| If LLMs were generating true abstractions they'd be finding
| meta-descriptions for code and language and making them
| accessible directly.
|
| AGI - or ASI - may be be able to do that some day, but it's
| not doing that now.
| anthonypasq wrote:
| the fact we cant keep the repo in our working memory is a
| flaw of our brains. i cant see how you could possibly make
| the argument that if you were somehow able to keep the entire
| codebase in your head that it would be a disadvantage.
| SkyBelow wrote:
| Information tradeoff. Even if you could keep the entire
| code base in memory, if something else has to be left out
| of memory, then you have to consider the value of an
| abstraction verses whatever other information is lost.
| Abstractions also apply to the business domain and work
| the same way.
|
| You also have time tradeoffs. Like time to access memory
| and time to process that memory to achieve some outcome.
|
| There is also quality. If you can keep the entire code base
| in memory but with some chance of confusion, while
| abstractions will allow less chance of confusion, then the
| tradeoff of abstractions might be worth it still.
|
| Even if we assume a memory that has no limits, can access
| and process all information at constant speed, and no
| quality loss, there is still communication limitations to
| worry about. Energy consumption is yet another.
| sdesol wrote:
| LLMs (in their current implementation) are probabilistic, so
| they really need the actual code to predict the most likely
| next tokens.
| Now loading the whole code base can be a problem in itself,
| since other files may negatively affect the next token.
| photon_lines wrote:
| Sorry -- I keep seeing this being used but I'm not entirely
| sure how it differs from most of human thinking. Most human
| 'reasoning' is probabilistic as well and we rely on
| 'associative' networks to ingest information. In a similar
| manner - LLMs use association as well -- and not only that,
| but they are capable of figuring out patterns based on
| examples (just like humans are) -- read this paper for
| context: https://arxiv.org/pdf/2005.14165. In other words,
| they are capable of grokking patterns from simple data
| (just like humans are). I've given various LLMs my
| requirements and they produced working solutions for me by
| simply 1) including all of the requirements in my prompt
| and 2) asking them to think through and 'reason' through
| their suggestions and the products have always been
| superior to what most humans have produced. The 'LLMs are
| probabilistic predictors' comments though keep appearing on
| threads and I'm not quite sure I understand them -- yes,
| LLMs don't have 'human context' i.e. data needed to
| understand human beings since they have not directly been
| fed in human experiences, but for the most part -- LLMs are
| not simple 'statistical predictors' as everyone brands them
| to be. You can see a thorough write-up I did of what GPT is
| / was here if you're interested:
| https://photonlines.substack.com/p/intuitive-and-visual-
| guid...
| didibus wrote:
| You seem possibly more knowledgeable than me on the
| matter.
|
| My impression is that LLMs predict the next token based
| on the prior context. They do that by having learned a
| probability distribution from tokens -> next-token.
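|
| (In symbols, as I understand it, the model just learns
| P(next token | previous tokens), and a whole output is the
| chain P(x_1) * P(x_2 | x_1) * ... * P(x_n | x_1..x_n-1),
| sampled one token at a time.)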
|
| Then as I understand, the models are never reasoning
| about the problem, but always about what the next token
| should be given the context.
|
| The chain of thought is just rewarding them so that the
| next token isn't predicting the token of the final answer
| directly, but instead predicting the token of the
| reasoning to the solution.
|
| Since human language in the dataset contains text that
| describes many concepts and offers many solutions to
| problems. It turns out that predicting the text that
| describes the solution to a problem often ends up being
| the correct solution to the problem. That this was true
| was kind of a lucky accident and is where all the
| "intelligence" comes from.
| photon_lines wrote:
| So - in the pre-training step you are right -- they are
| simple 'statistical' predictors but there are more steps
| involved in their training which turn them from simple
| predictors to being able to capture patterns and reason
| -- I tried to come up with an intuitive overview of how
| they do this in the write-up and I'm not sure I can give
| you a simple explanation here, but I would recommend you
| play around with Deep-Seek and other more advanced
| 'reasoning' or 'chain-of-reason' models and ask them to
| perform tasks for you: they are not simply statistically
| combining information together. Many times they are able
| to reason through and come up with extremely advanced
| working solutions. To me this indicates that they are not
| 'accidentally' stumbling upon solutions based on statistics
| -- they actually are able to 'understand' what you are
| asking them to do and to produce valid results.
| didibus wrote:
| If you observe the failure modes of current models, you
| see that they fail in ways that align with probabilistic
| token prediction.
|
| I don't mean that the textual prediction is simple, it's
| very advanced and it learns all kinds of relationships,
| patterns and so on.
|
| But it doesn't have a real model and thinking process
| relating to the actual problem. It thinks about what text
| could describe a solution that is linguistically and
| semantically probable.
|
| Since human language embeds so much of the logic and so many
| ground truths, that's good enough to result in a textual
| description that approximates or nails the actual
| underlying problem.
|
| And this is why we see them being able to solve quite
| advanced problems.
|
| I admit that people are wondering now, what's different
| about human thinking? Maybe we do the same, you invent a
| probable sounding answer and then check if it was
| correct, rinse and repeat until you find one that works.
|
| But this in itself is a big conjecture. We don't really
| know how human thinking works. We've found a method that
| works well for computers and now we wonder if maybe we're
| just the same but scaled even higher or with slight
| modifications.
|
| I've heard from ML experts though that they don't think
| so. Most seem to believe a different architecture will be
| needed: world models, ensembles of specialized models with
| different architectures working together, etc. That LLMs
| are fundamentally kind of limited by their nature as next
| token predictors.
| coderenegade wrote:
| I think the intuitive leap (or at least, what I believe)
| is that meaning is encoded in the media. A given context
| and input encodes a particular meaning that the model is
| able to map to an output, and because the output is also
| in the same medium (tokens, text), it also has meaning.
| Even reasoning can fit in with this, because the model
| generates additional meaningful context that allows it to
| better map to an output.
|
| How you find the function that does the mapping probably
| doesn't matter. We use probability theory and information
| theory, because they're the best tools for the job, but
| there's nothing to say you couldn't handcraft it from
| scratch if you were some transcendent creature.
| didibus wrote:
| Yes exactly.
|
| The text of human natural language that it is trained on
| encodes the solutions to many problems as well as a lot
| of ground truths.
|
| The way I think of it is: first you have a random text
| generator. This generative "model" can in theory find the
| solution to all problems that text can describe.
|
| If you had a way to assert if it found the correct
| solution, you could run it and eventually it would
| generate the text that describes the working solution.
|
| Obviously inefficient and not practical.
|
| What if you made it so it skipped generating all text
| that isn't valid, sensical English?
|
| Well, now it would find the correct solution in far fewer
| iterations, but still too slowly.
|
| What if it generated only text that made sense to follow
| the context of the question?
|
| Now you might start to see it 100-shot, 10-shot, maybe
| even 1-shot some problems.
|
| What if you tuned that to the max? Well you get our
| current crop of LLMs.
|
| What else can you do to make it better?
|
| Tune the dataset: remove text that describes wrong answers
| to prior context so it learns not to generate those. Add
| more quality answers to prior context, add more
| problems/solutions, etc.
|
| Instead of generating the answer to a mathematical
| equation the above way, generate the Python code to run
| to get the answer.
|
| Instead of generating the answer to questions about
| current real-world events/facts (like the weather), have
| it generate the web search query to find it.
|
| If you're asking a more complex question, instead of
| generating the answer directly, have it generate smaller
| logical steps towards the answer.
|
| Etc.
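|
| A tiny sketch of that last idea, with a stubbed-out model call
| (`fake_model` is hypothetical; a real system would call an API
| and sandbox the execution):
|
| ```python
| # "Generate the code, not the answer": run the model's output
| # instead of trusting its arithmetic directly.
| def fake_model(prompt: str) -> str:
|     # Pretend the model answered a math question with runnable code.
|     return "result = sum(i * i for i in range(1, 101))"
|
| def answer_with_code(question: str) -> int:
|     code = fake_model(f"Write Python that computes: {question}")
|     scope = {}
|     exec(code, scope)       # unsafe outside a sandbox; fine for a sketch
|     return scope["result"]
|
| print(answer_with_code("the sum of squares from 1 to 100"))  # 338350
| ```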
| sdesol wrote:
| I'm not sure if I would say human reasoning is
| 'probabilistic', unless you are taking a very far step
| back and saying that, based on how the person lived, they
| have ingrained biases (weights) that dictate how they
| reason. I don't know if LLMs have a built-in scepticism
| like humans do, which plays a significant role in
| reasoning.
|
| Regardless of whether you believe LLMs are probabilistic
| or not, I think what we are both saying is that context is
| king and what the LLM says is dictated by the context
| (either through training or introduced by the user).
| Workaccount2 wrote:
| Humans have a neuro-chemical system that performs
| operations with electrical signals.
|
| That's the level to look at, unless you have a dualist
| view of the brain (we are channeling supernatural
| forces).
| lll-o-lll wrote:
| Yep, just like looking at a bird's feather through a
| microscope explains the principles of flight...
|
| Complexity theory doesn't have a mathematics (yet), but
| that doesn't mean we can't see that it exists. Studying
| the brain at the lowest levels hasn't led to any major
| insights into how cognition functions.
| brookst wrote:
| I personally believe that quantum effects play a role and
| we'll learn more once we understand the brain at that
| level, but I recognize that is an intuition and may well
| be wrong.
| photon_lines wrote:
| 'I don't know if LLMs have a built in scepticism like
| humans do' - humans don't have an 'in built skepticism'
| -- we learn it through experience and through being
| taught how to 'reason' within school (and it takes a very
| long time to do this). You believe that this is in-
| grained but you may have forgotten having to slog through
| most of how the world works and being tested when you
| went to school and when your parents taught you these
| things. On the context component: yes, context is vitally
| important (just as it is with humans) -- you can't
| produce a great solution unless you understand the 'why'
| behind it and how the current solution works so I 100%
| agree with that.
| ijidak wrote:
| For me, the way humans finish each other's sentences and
| often think of quotes from the same movies at the same
| time in conversation (when there is no clear reason for
| that quote to be a part of the conversation), indicates
| that there is a probabilistic element to human thinking.
|
| Is it entirely probabilistic? I don't think so. But, it
| does seem that a chunk of our speech generation and
| processing is similar to LLMs. (e.g. given the words I've
| heard so far, my brain is guessing words x y z should
| come next.)
|
| I feel like the conscious, executive mind humans have
| exercises some active control over our underlying
| probabilistic element. And LLMs lack the conscious
| executive.
|
| e.g. They have our probabilistic capabilities, without
| some additional governing layer that humans have.
| coderenegade wrote:
| I think the better way to look at it is that
| probabilistic models seem to be an accurate model for
| human thought. We don't really know how humans think, but
| we know that they probably aren't violating information
| theoretic principles, and we observe similar phenomena
| when we compare humans with LLMs.
| nomel wrote:
| No, it doesn't, nor do we. It's why abstractions and
| documentation exist.
|
| If you know what a function achieves, and you trust it to
| do that, you don't need to see/hold its exact
| implementation in your head.
| sdesol wrote:
| But documentation doesn't include styling or preferred
| patterns, which is why I think a lot of people complain
| that the LLM will just produce garbage. Also, documentation
| is not guaranteed to be correct or up to date. To produce
| the best code based on what you are hoping for, I do think
| having the actual code is necessary. Unless styling/design
| patterns are not important, then yes, documentation will
| suffice, provided it is accurate and up to date.
| throwaway314155 wrote:
| /compact in Claude Code is effectively this.
| brulard wrote:
| Compact is a reasonable default way to do that, but quite
| often it discards important details. It's better to have CC
| store important details, decisions and reasons in a
| document where they can be reviewed and modified if needed.
| LinXitoW wrote:
| They already do, or at least Claude Code does. It will search
| for a method name, then only load a chunk of that file to get
| the method signature, for example.
|
| It will use the general information you give it to make
| educated guesses about where things are. If it knows the
| code is Vue based and it has to do something with "users",
| it might search for "src/*/*User*.vue".
|
| This is also the reason why the quality of your code makes
| such a large difference. The more consistent the naming of
| files and classes, the better the AI is at finding them.
| felipeerias wrote:
| Claude Code can get access to a language server like clangd
| through a MCP server, for example
| https://github.com/isaacphi/mcp-language-server
| sdesol wrote:
| > I really desperately need LLMs to maintain extremely
| effective context
|
| I actually built this. I'm still not ready to say "use the tool
| yet" but you can learn more about it at
| https://github.com/gitsense/chat.
|
| The demo link is not up yet as I need to finalize an admin tool,
| but you should be able to follow the npm instructions to play
| around with it.
|
| The basic idea is, you should be able to load your entire repo
| or repos and use the context builder to help you refine it. Or
| you can create custom analyzers that you can do 'AI Assisted'
| searches with, like executing `!ask find all frontend
| code that does [this]`, and because the analyzer knows how
| to extract the correct metadata to support that query, you'll
| be able to easily build the context using it.
| kvirani wrote:
| Wait that's not how Cursor etc work? (I made assumptions)
| trenchpilgrim wrote:
| Dunno about Cursor but this is exactly how I use Zed to
| navigate groups of projects
| sdesol wrote:
| I don't use Cursor so I can't say, but based on what I've
| read, they optimize for smaller context to reduce cost and
| probably for performance. The issue is, I think this is
| severely flawed as LLMs are insanely context sensitive and
| forgetting to include a reference file can lead to
| undesirable code.
|
| I am obviously biased, but I still think to get the best
| results, the context needs to be human curated to ensure
| everything the LLM needs will be present. LLMs are
| probabilistic, so the more relevant context, the greater
| the chances the final output is the most desired.
| hirako2000 wrote:
| Not clear how it gets around what is, ultimately, a context
| limit.
|
| I've been fiddling with some process too, would be good if
| you shared the how. The readme looks like yet another full
| fledged app.
| sdesol wrote:
| Yes there is a context window limit, but I've found for
| most frontier models, you can generate very effective code
| if the context window is under 75,000 tokens provided the
| context is consistent. You have to think of everything from
| a probability point of view and the more logical the
| context, the greater the chances of better code.
|
| For example, if the frontend doesn't need to know the
| backend code (other than the interface), not including the
| backend code when solving a frontend problem can reduce
| context size and improve the chances of the expected
| output. You just need to ensure you include the necessary
| interface documentation.
|
| As for the full fledged app, I think you raised a good
| point and I should add a 'No lock in' section for why to
| use it. The app has a message tool that lets you pick and
| choose what messages to copy. Once you've copied the
| context (including any conversation messages that can help
| the LLM), you can use the context where ever you want.
|
| My strategy with the app is to be the first place you go to
| start a conversation before you even generate code, so my
| focus is helping you construct contexts (the smaller the
| better) to feed into LLMs.
| handfuloflight wrote:
| Doesn't Claude Code do all of this automatically?
| sdesol wrote:
| I haven't looked at Claude Code, so I don't know if it has
| analyzers that understand how to extract any type of data
| other than the specific coding data it is trained on. Based
| on the runtime for some tasks, I would not be surprised if
| it is going through all the files and asking "is this
| relevant?"
|
| My tool is mainly targeted at massive code bases and
| enterprise as I still believe the most efficient way to
| build accurate context is by domain experts.
|
| Right now, I would say 95% of my code is AI generated (98%
| human architectured) and I am spending about $2 a day on
| LLM costs and the code generation part usually never runs
| more than 30 seconds for most tasks.
| handfuloflight wrote:
| Well you should look at it, because it's not going
| through all files. I looked at your product and the
| workflow is essentially asking me to do manually what
| Claude Code does automatically. Granted, manually selecting the
| context will probably lead to lower costs in any case
| because Claude Code invokes tool calls like grep to do
| its search, so I do see merit in your product in that
| respect.
| sdesol wrote:
| Looking at the code, it does have some sort of automatic
| discovery. I also don't know how scalable Claude Code is.
| I've spent over a decade thinking about code search, so I
| know what the limitations are for enterprise code.
|
| One of the neat tricks that I've developed is, I would
| load all my backend code for my search component and then
| I would ask the LLM to trace a query and create a context
| bundle for only the files that are affected. Once the LLM
| has finished, I just need to do a few clicks to refine an
| 80,000 token window down to about 20,000 tokens.
|
| I would not be surprised if this is one of the tricks
| that it does, as it is highly effective. Also, yes, my tool
| is manual, but I treat conversations as durable assets, so
| in the future you should be able to say "last week I did
| this, load the same files" and the LLM will know what files
| to bring into context.
| handfuloflight wrote:
| Excellent, I look forward to trying it out, at minimum to
| wean off my dependency on Claude Code and its likely
| current state of overspending on context. I agree with
| looking at conversations as durable assets.
| sdesol wrote:
| > current state of overspending on context
|
| The thing that is killing me when I hear about Claude
| Code and other agent tools is the amount of energy they
| must be using. People say they let the task run for an
| hour and I can't help but think about how much energy is
| being used and whether Claude Code is being upfront about
| how much things will actually cost in the future.
| pacoWebConsult wrote:
| FWIW Claude code conversations are also durable. You can
| resume any past conversation in your project. They're
| stored as jsonl files within your `$HOME/.claude`
| directory. This retains the actual context (including
| your prompts, assistant responses, tool usages, etc) from
| that conversation, not just the files you're affecting as
| context.
| sdesol wrote:
| Thanks for the info. I actually want to make it easy for
| people to review aider, plandex, claude code, etc.
| conversations so I will probably look at importing them.
|
| My goal isn't to replace the other tools, but to make
| them work smarter and more efficiently. I also think we
| will in a year or two, start measuring performance based
| on how developers interact with LLMs (so management will
| want to see the conversations). Instead of looking at
| code generated, the question is going to be, if this
| person is let go, what is the impact based on how they
| are contributing via their conversations.
| ec109685 wrote:
| It greps around the code like an intern would. You have
| to have patience and be willing to document workflows and
| correct when it gets things wrong via CLAUDE.md files.
| sdesol wrote:
| Honestly, grepping isn't a bad strategy if there is
| enough context to generate focused keywords/patterns to
| search. The "let Claude Code think for 10 minutes or
| more", makes a lot more sense now, as this brute force
| method can take some time.
| ec109685 wrote:
| Yeah and it's creative with its grepping.
| msikora wrote:
| Why not build this as an MCP so that people can plug it into
| their favorite platform?
| sdesol wrote:
| An MCP is definitely on the roadmap. My objective is to
| become the context engine for LLMs so having a MCP is
| required. However, there will be things from a UX
| perspective that you'll lose out on if you just use the
| MCP.
| seanmmward wrote:
| The primary use case isn't just about shoving more code in
| context, although depending on the task, there is an
| irreducible minimum context needed for it to capture all the
| needed understanding. The 1M context model is a unique beast in
| terms of how you need to feed it, and its real power is being
| able to tackle long horizon tasks which require iterative
| exploration, in-context learning, and resynthesis. I.e., some
| problems are breadth (go fix an API change in 100 files); others,
| however, require depth (go learn from trying 15 different ways
| to solve this problem). 1M Sonnet is unique in its capabilities
| for the latter in particular.
| hinkley wrote:
| Sounds to me like your problem has shifted from how much the AI
| tool costs per hour to how much it costs per token because
| resetting a model happens often enough that the price doesn't
| amortize out per hour. That giant spike every ?? months
| overshadows the average cost per day.
|
| I wonder if this will become more universal, and if we won't
| see a 'tick-tock' pattern like Intel used, where they tweak the
| existing architecture one or more times between major design
| work. The 'tick' is about keeping you competitive and the
| 'tock' is about keeping you relevant.
| TZubiri wrote:
| "However. Price is king. Allowing me to flood the context
| window with my code base is great"
|
| I don't vibe code, but in general having to know all of the
| codebase to be able to do something is a smell; it's
| spaghetti, it's a lack of encapsulation.
|
| When I program I cannot think about the whole codebase. I have
| a couple of files open, tops, and I think about the code in
| those files.
|
| This issue of having to understand the whole codebase,
| complaining about abstractions, microservices, and OOP, and
| wanting everything to be in a "simple" monorepo, or a monolith;
| is something that I see juniors do, almost exclusively.
| ants_everywhere wrote:
| > I really desperately need LLMs to maintain extremely
| effective context
|
| The context is in the repo. An LLM will never have the context
| you need to solve all problems. Large enough repos don't fit on
| a single machine.
|
| There's a tradeoff just like in humans where getting a specific
| task done requires removing distractions. A context window that
| contains everything makes focus harder.
|
| For a long time context windows were too small, and they
| probably still are. But they have to get better at
| understanding the repo by asking the right questions.
| stuartjohnson12 wrote:
| > An LLM will never have the context you need to solve all
| problems.
|
| How often do you need more than 10 million tokens to answer
| your query?
| ants_everywhere wrote:
| I exhaust the 1 million context windows on multiple models
| multiple times per day.
|
| I haven't used the Llama 4 10 million context window so I
| don't know how it performs in practice compared to the
| major non-open-source offerings that have smaller context
| windows.
|
| But there is an induced demand effect where as the context
| window increases it opens up more possibilities, and those
| possibilities can get bottlenecked on requiring an even
| bigger context window size.
|
| For example, consider the idea of storing all Hollywood
| films on your computer. In the 1980s this was impossible.
| If you store them in DVD or Bluray quality you could
| probably do it in a few terabytes. If you store them in
| full quality you may be talking about petabytes.
|
| We recently struggled to get a full file into a context
| window. Now a lot of people feel a bit like "just take the
| whole repo, it's only a few MB".
| brulard wrote:
| I think you misunderstand how context in current LLMs
| works. To get the best results you have to be very
| careful to provide what is needed for immediate task
| progression, and postpone context that's needed later in
| the process. If you give all the context at once, you
| will likely get quite degraded output quality. That's like
| when you give a junior developer his first task: you
| likely won't teach him every corner of your app. You
| would give him the context he needs. It is similar with
| these models. Those that provided 1M or 2M of context
| (Gemini etc.) were getting less and less useful after
| circa 200k tokens in the context.
|
| Maybe models would get better in picking up relevant
| information from large context, but AFAIK it is not the
| case today.
| remexre wrote:
| That's a really anthropomorphizing description; a more
| mechanical one might be,
|
| The attention mechanism that transformers use to find
| information in the context is, in its simplest form,
| O(n^2); for each token position, the model considers
| whether relevant information has been produced at the
| position of every other token.
|
| To preserve performance when really long contexts are
| used, current-generation LLMs use various ways to
| consider fewer positions in the context; for example,
| they might only consider the 4096 "most likely" places to
| matter (de-emphasizing large numbers of "subtle hints"
| that something isn't correct), or they might have some
| way of combining multiple tokens worth of information
| into a single value (losing some fine detail).
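|
| For concreteness, a minimal numpy sketch of the vanilla
| (non-approximated) version, where the all-pairs score matrix
| is exactly what grows quadratically with context length:
|
| ```python
| import numpy as np
|
| n, d = 6, 8                     # 6 token positions, 8-dim head
| rng = np.random.default_rng(0)
| Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
|
| scores = Q @ K.T / np.sqrt(d)   # (n, n): every position vs. every other
| weights = np.exp(scores)
| weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
| out = weights @ V               # (n, d): weighted mix of value vectors
|
| print(scores.shape)             # (6, 6) -- the O(n^2) part
| ```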
| ants_everywhere wrote:
| > I think you misunderstand how context in current LLMs
| works.
|
| Thanks but I don't and I'm not sure why you're jumping to
| this conclusion.
|
| EDIT: Oh I think you're talking about the last bit of the
| comment! If you read the one before I say that feeding it
| the entire repo isn't a great idea. But great idea or
| not, people want to do it, and it illustrates that as
| context window increases it creates demand for even
| larger context windows.
| brulard wrote:
| I said that based on you saying you exhaust a million
| token context windows easily. I'm no expert on that, but
| I think the current state of LLMs works best if you are
| not approaching that 1M token limit, because large
| context (reportedly) deteriorates response quality
| quickly. I think state of the art usage is managing
| context in tens or low hundreds thousands tokens at most
| and taking advantage of splitting tasks across subtasks
| in time, or splitting context across multiple "expert"
| agents (see sub-agents in claude code).
| jimbokun wrote:
| It seems like LLM need to become experts at managing
| their OWN context.
|
| Selectively grepping and searching the code to pull into
| context only those parts relevant to the task at hand.
| brulard wrote:
| That's what I'm thinking about a lot. Something like the
| models "activate" just some subset of parameters when
| working (if I understand the new models correctly). So
| that model could activate parts of context which are
| relevant for the task at hand
| rocqua wrote:
| It doesn't take me 10000000 tokens to have the context
| "this was the general idea of the code, these were
| unimportant implementation details, and this is where
| lifetimes were tricky."
|
| And that context is the valuable bit for quickly getting
| back up to speed on a codebase.
| onion2k wrote:
| _Large enough repos don't fit on a single machine._
|
| I don't believe any human can understand a problem if they
| need to fit the entire problem domain in their head, let
| alone a domain whose scope doesn't fit on a computer. You
| _have_ to break it down into a manageable amount of
| information to tackle it in chunks.
|
| If a person can do that, so can an LLM prompted to do that by
| a person.
| wraptile wrote:
| Right, the LLM doesn't need to know all of the code under
| utils.parse_id to know that this call will parse the ID.
| The best LLM results I get are when I manually define the
| relevant code graph of my problem, similar to how I'd
| imagine it in my head, which seems to provide optimal
| context. So bigger isn't really better.
| rocqua wrote:
| I wonder why we can't have one LLM generate this
| understanding for another? Perhaps this is where teaming
| of LLMs gets its value. In managing high and low level
| context in different context windows.
| mixedCase wrote:
| This is a thing and doesn't require a separate model. You
| can set up custom prompts that will, based on another
| prompt describing the task to achieve, generate
| information about the codebase and a set of TODOs to
| accomplish the task, generating markdown files with a
| summarized version of the relevant knowledge and
| prompting you again to refine that summary if needed. You
| can then use these files to let the agent take over
| without going on a wild goose chase.
| ehnto wrote:
| I disagree, I may not have the whole codebase in my head in
| one moment but I have had all of it in my head at some
| point, and it is still there, that is not true of an LLM. I
| use LLMs and am impressed by them, but they just do not
| approximate a human in this particular area.
|
| My ability to break a problem down does not start from
| listing the files out and reading a few. I have a high
| level understanding of the whole project at all times, and
| a deep understanding of the whole project stored, and I can
| recall that when required, this is not true of an LLM at
| any point.
|
| We know this is a limitation and it's why we have various
| tools attempting to approximate memory and augment training
| on the fly, but they are approximations and they are in my
| opinion, not even close to real human memory and depth of
| understanding for data it was not trained on.
|
| Even for mutations of scenarios it was trained on, of which
| code is a great example: it is trained on billions
| of lines of code, yet still fails to understand my codebase
| intuitively. I have definitely not read billions of lines
| of code.
| ehnto wrote:
| Additionally, the more information you put into the
| context the more confused the LLM will get, if you did
| dump the whole codebase into the context it would not
| suddenly understand the whole thing. It is still an LLM,
| all you have done is polluted the context with a million
| lines of unrelated code, and some lines of related code,
| which it will struggle to find in the noise (in my
| experience of much smaller experiments)
| Bombthecat wrote:
| I call this context decay. :)
|
| The bigger the context, the more stuff "decays" sometimes
| to complete different meanings
| xwolfi wrote:
| You only worked on very small codebases then. When you
| work on giant ones, you Ctrl+F a lot, build a limited
| model of the problem space, and pray the unit tests will
| catch anything you might have missed...
| akhosravian wrote:
| And when you work on a really big codebase you start
| having multiple files and have to learn tools more
| advanced than ctrl-f!!
| ghurtado wrote:
| > and have to learn tools more advanced than ctrl-f!!
|
| Such as ctrl-shift-f
|
| But this is an advanced topic, I don't wanna get into it
| ehnto wrote:
| We're measuring lengths of string, but I would not say I
| have worked on small projects. I am very familiar with
| discovery, and have worked just fine on a lot of large
| legacy projects that have no tests.
| jimbokun wrote:
| Why are LLMs so bad at doing the same thing?
| airbreather wrote:
| You will have abstractions - black boxing, interface
| overviews, etc. Humans can only hold so much detail in
| current context memory; some say 7 items on average.
| ehnto wrote:
| Of course, but even those black boxes are not empty,
| they've got a vague picture inside them based on prior
| experience. I have been doing this for a while so most
| things are just various flavours of the same stuff,
| especially in enterprise software.
|
| The important thing in this context is that I know it's
| all there, I don't have to grep the codebase to fill up
| my context, and my understanding of the holistic project
| does not change each time I am booted up.
| jimbokun wrote:
| And LLMs can't leverage these abstractions nearly as well
| as humans...so far.
| PaulDavisThe1st wrote:
| > I disagree, I may not have the whole codebase in my
| head in one moment but I have had all of it in my head at
| some point, and it is still there, that is not true of an
| LLM.
|
| All 3 points (you have had all of it your head at some
| point, it is still there, that is not true of an LLM) are
| mere conjectures, and not provable at this time,
| certainly not in the general case. You may be able to
| show this of some codebases for some developers and for
| some LLMs, but not all.
| fnordsensei wrote:
| The brain can literally not process any piece of
| information without being changed by the act of
| processing it. Neuronal pathways are constantly being
| reinforced or weakened.
|
| Even remembering alters the memory being recalled,
| entirely unlike how computers work.
| johnisgood wrote:
| For humans, remembering strengthens that memory, even if
| it is dead wrong.
| Lutger wrote:
| I've always found it interesting that once I take a wrong
| turn finding my way through the city and I'm not
| deliberate about remembering this was, in fact, a
| mistake, I am more prone to taking the same wrong turn
| again the next time.
| dberge wrote:
| > once I take a wrong turn finding my way through the
| city... I am more prone to taking the same wrong turn
| again
|
| You may want to stay home then to avoid getting lost.
| jbs789 wrote:
| I'm not sure the idea that a developer maintains a high
| level understanding is all that controversial...
| animuchan wrote:
| The trend for this idea's controversiality is shown on
| this very small chart: /
| ehnto wrote:
| I never intended to say it was true of all codebases for
| all developers, that would make no sense. I don't know
| all developers.
|
| I think it's objectively true that the information is not
| in the LLM. It did not have all codebases to train with,
| and they do not (immediately) retrain on the codebases
| they encounter through usage.
| ivape wrote:
| _My ability to break a problem down does not start from
| listing the files out and reading a few._
|
| It does, it's just happening at lightning speed.
| CPLX wrote:
| We don't actually know that.
|
| If we had that level of understanding of how exactly our
| brains do what they do things would be quite different.
| onion2k wrote:
| _My ability to break a problem down does not start from
| listing the files out and reading a few._
|
| If you're completely new to the problem then ... yes, it
| does.
|
| You're assuming that you're working on a project that
| you've spent time on and learned the domain for, and then
| you're comparing that to an LLM being prompted to look at
| a codebase with the context of the files. Those things
| are not the same though.
|
| A closer analogy to LLMs would be prompting it for
| questions when it has access (either through MCP or
| training) to the project's git history, documentation,
| notes, issue tracker, etc. When that sort of thing is
| commonplace, and LLMs have the context window size to
| take advantage of all that information, I suspect we'll
| be surprised how good they are even given the results we
| get today.
| ehnto wrote:
| > If you're completely new to the problem then ... yes,
| it does.
|
| Of course, because I am not new to the problem, whereas
| an LLM is new to it every new prompt. I am not really
| trying to find a fair comparison because I believe humans
| have an unfair advantage in this instance, and am trying
| to make that point, rather than compare like for like
| abilities. I think we'll find even with all the context
| clues from MCPs and history etc. they might still fail to
| have the insight to recall the right data into the
| context, but that's just a feeling I have from working
| with Claude Code for a while. Because I instruct it to do
| those things, like look through git log, check the
| documentation etc, and it sometimes finds a path through
| to an insight but it's just as likely to get lost.
|
| I alluded to it somewhere else but my experience with
| massive context windows so far has just been that it
| distracts the LLM. We are usually guiding it down a path
| with each new prompt and have a specific subset of
| information to give it, and so pumping the context full
| of unrelated code at the start seems to derail it from
| that path. That's anecdotal, though I encourage you to
| try messing around with it.
|
| As always, there's a good chance I will eat my hat some
| day.
| scott_s wrote:
| > Of course, because I am not new to the problem, whereas
| an LLM is new to it every new prompt.
|
| That is true for the LLMs you have access to now. Now
| imagine if the LLM had been _trained_ on your entire code
| base. And not just the code, but the entire commit
| history, commit messages and also all of your external
| design docs. _And_ code and docs from all relevant
| projects. That LLM would not be new to the problem every
| prompt. Basically, imagine that you fine-tuned an LLM for
| your specific project. You will eventually have access to
| such an LLM.
| jimbokun wrote:
| Why haven't the big AI companies been pursuing that
| approach, vs just ramping up context window size?
| scott_s wrote:
| Because training one family of models with very large
| context windows can be offered to the entire world as an
| online service. That is a very different business model
| from training or fine-tuning individual models
| specifically for individual customers. _Someone_ will
| figure out how to do that at scale, eventually. It might
| require the cost of training to reduce significantly. But
| large companies with the resources to do this for
| themselves will do it, and many are doing it.
| krainboltgreene wrote:
| I have an entire life worth of context and I still remember
| projects I worked on 15 years ago.
| adastra22 wrote:
| Not with pixel perfect accuracy. You vaguely remember,
| although it may not feel like that because your brain
| fills in the details (hallucinates) as you recall. The
| comparisons are closer than you might think.
| vidarh wrote:
| The comparison would be apt if the LLM was _trained on
| your codebase_.
| jimbokun wrote:
| Isn't that the problem?
|
| I don't see any progress on incrementally training LLMs
| on specific projects. I believe it's called fine tuning,
| right?
|
| Why isn't that the default approach anywhere instead of
| the hack of bigger "context windows"?
| adastra22 wrote:
| Because fine-tuning can be used to remove restrictions
| from a model, so they don't give us plebs access to that.
| gerhardi wrote:
| I'm not well versed enough in this, but wouldn't custom
| training have the problem that a specific project's
| codebase likely implements the domain-relevant stuff only
| once and in one way, compared to how today's popular large
| models have been trained on maybe countless different
| ways to use common libraries for various tasks, with
| whatever GitHub-ripped material fed in?
| krainboltgreene wrote:
| You have no idea if I remember with pixel perfect
| accuracy (whatever that even means). There are plenty of
| people with photographic memory.
|
| Also, you're a programmer; you have no foundation of
| knowledge on which to make that assessment. You might as
| well opine on quarks or martian cellular life. My god the
| arrogance of people in my industry.
| johnisgood wrote:
| > There are plenty of people with photographic memory.
|
| I thought it was rare.
| adastra22 wrote:
| Repeated studies have shown that perfect "photographic
| memory" does not in fact exist. Nobody has it. Some
| people think that they do though, but when tested under
| lab conditions those claims don't hold up.
|
| I don't believe these people are lying. They are self-
| reporting their own experiences, which unfortunately have
| the annoying property of being generated by the very mind
| that is living the experience.
|
| What does it mean to have an eidetic memory? It means
| that when you remember something you vividly remember
| details, and can examine those details to your heart's
| content. When you do so, it feels like all those details
| are correct. (Or so I'm told, I'm the opposite with
| aphantasia.)
|
| But it turns out if you actually have a photo reference
| and do a blind comparison test, people who report
| photographic memories actually don't do statistically any
| better than others in remembering specific fine details,
| even though they claim that they clearly remember.
|
| The simpler explanation is that while all of our brains
| provide hallucinated detail to fill the gaps in memories,
| their brains are wired up to make those made-up details
| feel much more real than they do to others. That is all.
| HarHarVeryFunny wrote:
| > Repeated studies have shown that perfect "photographic
| memory" does not in fact exist.
|
| This may change your mind!
|
| https://www.youtube.com/watch?v=jVqRT_kCOLI
| melagonster wrote:
| Sure, this is why AGI looks possible sometimes. But
| companies should not require their users to create AGI for
| them.
| friendzis wrote:
| Fitting the _entire_ problem domain in their head is what
| engineers _do_.
|
| Engineering is merely a search for optimal solution in this
| multidimensional space of problem domain(-s), requirements,
| limitations and optimization functions.
| barnabee wrote:
| _Good_ engineers fit their entire understanding of the
| problem domain in their head
|
| The best engineers understand how big a difference that
| is
| sdesol wrote:
| > But they have to get better at understanding the repo by
| asking the right questions.
|
| How I am tackling this problem is making it dead simple for
| users to create analyzers that are designed to enrich text
| data. You can read more about how it would be used in a
| search at https://github.com/gitsense/chat/blob/main/packages
| /chat/wid...
|
| The basic idea is, users would construct analyzers with the
| help of LLMs to extract the proper metadata that can be
| semantically searched. So when the user does an AI Assisted
| search with my tool, I would load all the analyzers
| (description and schema) into the system prompt and the LLM
| can determine which analyzers can be used to answer the
| question.
|
| A very simplistic analyzer would be to make it easy to
| identify backend and frontend code so you can just use the
| command `!ask find all frontend files` and the LLM will
| construct a deterministic search that knows to match for
| frontend files.
| mrits wrote:
| How is that better than just writing a line in the md?
| sdesol wrote:
| I am not sure I follow what you are saying. What would
| the line be and how would it become deterministically
| searchable?
| mrits wrote:
| frontend path: /src/frontend/* backend path: /src/*
|
| I suppose the problem you have might be unique to nextJS
| ?
| sdesol wrote:
| The issue is frontend can be a loaded question,
| especially if you are dealing with legacy stuff,
| different frameworks, etc. You also can't tell what the
| frontend code does by looking at that single line.
|
| Now imagine as part of your analyzer, you have the
| following instructions for the llm:
|
| --- For all files in `src/frontend/*` treat them as
| frontend code. For all files in `src/*` excluding
| `src/frontend` treat them as backend. Create a metadata called
| `scope` which can be 'frontend', 'backend' or 'mix' where
| mix means the code can be used for both front and backend
| like utilities.
|
| Now for each file, create a `keywords` metadata that
| includes up to 10 unique keywords that describes the core
| functionality for the file. ---
|
| So with this you can say
|
| - `!ask find all frontend files`
|
| - `!ask find all mix use files`
|
| - `!ask find all frontend files that does [this]`
|
| and so forth.
|
| The whole point of analyzers is to make it easy for the
| LLM to map your natural language query to a deterministic
| search.
|
| If the code base is straightforward and follows a well
| known framework, asking for frontend or backend wouldn't
| even need an entry as you can just include in the
| instructions that I use framework X and the LLM would
| know what to consider.
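|
| Roughly, the kind of metadata-to-search mapping I mean (purely
| illustrative sketch, not the tool's actual schema or file
| names):
|
| ```python
| # The analyzer produces metadata per file; the LLM only has to
| # turn "find all frontend files that do auth" into a filter.
| files = [
|     {"path": "src/frontend/Login.vue", "scope": "frontend",
|      "keywords": ["auth", "login", "form"]},
|     {"path": "src/api/session.py", "scope": "backend",
|      "keywords": ["auth", "session", "token"]},
|     {"path": "src/shared/format.py", "scope": "mix",
|      "keywords": ["dates", "formatting"]},
| ]
|
| def search(scope=None, keyword=None):
|     hits = files
|     if scope:
|         hits = [f for f in hits if f["scope"] == scope]
|     if keyword:
|         hits = [f for f in hits if keyword in f["keywords"]]
|     return [f["path"] for f in hits]
|
| # `!ask find all frontend files that do auth` maps to:
| print(search(scope="frontend", keyword="auth"))
| ```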
| mock-possum wrote:
| > The context is in the repo
|
| Agreed but that's a bit different from "the context _is_ the
| repo"
|
| It's been my experience that usually just picking a couple
| files out to add to the context is enough - Claude seems
| capable of following imports and finding what it needs, in
| most cases.
|
| I'm sure it depends on the task, and the structure of the
| codebase.
| manmal wrote:
| > The context is in the repo
|
| No it's in the problem at hand. I need to load all related
| files, documentation, and style guides into the context. This
| works really well for smaller modules, but currently falls
| apart after a certain size.
| alvis wrote:
| Everything in context hurts focus. It's like people
| suffering from hyperthymesia: they easily get distracted
| when they recall something.
| injidup wrote:
| All the more reason for good software engineering. Folders of
| files managing one concept. Files tightly focussed on sub
| problems of that concept. Keep your code so that you can
| solve problems in self contained context windows at the right
| level of abstraction
| Sharlin wrote:
| I fear that LLM-optimal code structure is different from
| human-optimal code structure, and people are starting to
| optimize for the former rather than the latter.
| NuclearPM wrote:
| Problems
| jack_pp wrote:
| Maybe we need LLMs trained on ASTs, or a new symbolic way
| to represent software that's faster for LLMs to grok, with
| a translator so we can verify the code.
| energy123 wrote:
| You could probably build a decent agentic harness that
| achieves something similar.
|
| Show the LLM a tree and/or call-graph representation of your
| codebase (e.g. `cargo diagram` and `cargo-depgraph`), which
| is token efficient.
|
| And give the LLM a tool call to see the contents of the
| desired subtree. More precise than querying a RAG chunk or a
| whole file.
|
| You could also have another optional tool call which routes
| the text content of the subtree through a smaller LLM that
| summarizes it into a maximum density snippet, which the LLM
| can use for a token efficient understanding of that subtree
| during the early planning phase.
|
| But I'd agree that an LLM built natively around AST is a
| pretty cool idea.
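|
| A rough sketch of the outline half of such a harness (Python
| and the `ast` module instead of cargo tooling; the tool-call
| plumbing is omitted):
|
| ```python
| # Walk a repo and emit one line per Python file listing its
| # top-level defs/classes -- a token-efficient map the agent can
| # keep in context, fetching full file contents only on demand.
| import ast
| from pathlib import Path
|
| def repo_outline(root: str) -> str:
|     lines = []
|     for path in sorted(Path(root).rglob("*.py")):
|         try:
|             tree = ast.parse(path.read_text())
|         except (SyntaxError, UnicodeDecodeError):
|             continue
|         names = [n.name for n in tree.body
|                  if isinstance(n, (ast.FunctionDef,
|                                    ast.AsyncFunctionDef,
|                                    ast.ClassDef))]
|         lines.append(f"{path}: {', '.join(names) or '(no defs)'}")
|     return "\n".join(lines)
|
| print(repo_outline("."))
| ```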
| fgbarben wrote:
| Allow me to flood the fertile plains of its consciousness with
| my seed... yes, yes, let it take root... this is important to
| me
| fgbarben wrote:
| Let me despoil the rich geography of your context window with
| my corrupted b2b SaaS workflows and code... absorb the
| pollution, rework it, struggling against the weight... yes,
| this pleases me, it is essential for the propagation of my
| germline
| dberge wrote:
| > the price has substantially increased
|
| I'm assuming the credits required per use won't increase in
| Cursor.
|
| Hopefully this puts pressure on them to lower credits required
| for gpt-5.
| khalic wrote:
| This is a major issue with LLMs altogether, it probably has to
| do with the transformer architecture. We need another
| breakthrough in the field for this to become reality.
| HarHarVeryFunny wrote:
| Even 1 MB context is only roughly 20K LOC so pretty limiting,
| especially if you're also trying to fit API documents or any
| other lengthy material into the context.
|
| Anthropic also recently said that they think that
| longer/compressed context can serve as an alternative (not sure
| what was the exact wording/characterization they used) to
| continual/incremental learning, so context space is also going
| to be competing with model interaction history if you want to
| avoid groundhog day and continually having to tell/correct the
| model the same things over and over.
|
| It seems we're now firmly in the productization phase of LLM
| development, as opposed to seeing much fundamental improvement
| (other than math olympiad etc "benchmark" results, released to
| give the impression of progress). Yannic Kilcher is right, "AGI
| is not coming", at least not in the form of an enhanced LLM.
| Demis Hassabis' very recent estimate was for 50% chance of AGI
| by 2030 (i.e. still 15 years out).
|
| While we're waiting for AGI, it seems a better approach to
| needing everything in context would be to lean more heavily on
| tool use, perhaps more similar to how a human works - we don't
| memorize the entire code base (at least not in terms of
| complete line-by-line detail, even though we may have a pretty
| clear overview of a 10K LOC codebase while we're in the middle
| of development) but rather rely on tools like grep and ctags to
| locate relevant parts of source code on an as-needed basis.
| aorobin wrote:
| >"Demis Hassabis' very recent estimate was for 50% chance of
| AGI by 2030 (i.e. still 15 years out)."
|
| 2030 is only 5 years out
| Zircom wrote:
| That was his point lol, if someone is saying it'll happen
| in 5 years, triple that for a real estimate.
| km144 wrote:
| As you alluded to at the end of your post--I'm not really
| convinced 20k LOC is very limiting. How many lines of code
| can you fit in your working mental model of a program?
| Certainly less than 20k concrete lines of text at any given
| time.
|
| In your working mental model, you have broad understandings
| of the broader domain. You have broad understandings of the
| architecture. You summarize broad sections of the program
| into simpler ideas. module_a does x, module_b does y, insane
| file c does z, and so on. Then there is the part of the
| software you're actively working on, where you need more
| concrete context.
|
| So as you move towards the central task, the context becomes
| more specific. But the vague outer context is still crucial
| to the task at hand. Now, you can certainly find ways to
| summarize this mental model in an input to an LLM, especially
| with increasing context windows. But we probably need to
| understand how we would better present these sorts of things
| to achieve _performance_ similar to a human brain, because
| the mechanism is very different.
| jacobr1 wrote:
| This is basically how claude code works today. You have it
| /init a description of the project structure into CLAUDE.md
| that is used for each invocation. There is some implicit
| knowledge in the project about common frameworks and
| languages. Then when working on something between the
| explicit and implicit knowledge and the task at hand it
| will grep for relevant material in the project, load either
| full or parts of files, and THEN it will start working on
| the task. But it dynamically builds the context of the
| codebase based on searching for the relevant bit. Short-
| circuiting this by having a good project summary makes it
| more efficient - but you don't need to literally copy in
| all the code files.
| HarHarVeryFunny wrote:
| Interesting - thanks!
| brookst wrote:
| 1M tokens ~= 3.5M characters ~= 58k LOC at an _average_ of
| 60 chars/line, or 88k LOC at 40 chars/line.
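|
| Back-of-envelope, assuming roughly 3.5 characters per token:
|
| ```python
| tokens = 1_000_000
| chars = tokens * 3.5
| print(chars / 60)   # ~58,333 LOC at 60 chars/line
| print(chars / 40)   # ~87,500 LOC at 40 chars/line
| ```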
| HarHarVeryFunny wrote:
| OK - I was thinking 1M chars (@ 50 chars/line) vs tokens,
| but I'm not sure it makes much difference to the argument.
| There are plenty of commercial code bases WAY bigger, and
| as noted other things may also be competing for space in
| the context.
| HarHarVeryFunny wrote:
| Just as a self follow-up, another motivation to lean on tool
| use rather than massive context (cf. short-term memory) is to
| keep LLM/AI written/modified code understandable to humans
| ...
|
| At least part of the reason that humans use hierarchical
| decomposition and divide-and-conquer is presumably because of
| our own limited short term memory, since hierarchical
| organization (modules, classes, methods, etc) allows us to
| work on a problem at different levels of abstraction while
| only needing to hold that level of the hierarchy in memory.
|
| Imagine what code might look like if written by something
| with no context limit - just a flat hierarchy of functions,
| perhaps, at least until it perhaps eventually learned, or was
| told, the other reasons for hierarchical and modular
| design/decomposition to assist in debugging and future
| enhancement, etc!
| rootnod3 wrote:
| So, more tokens means better, but at the same time more tokens
| means it distracts itself too much along the way. So it is both
| an improvement and potentially detrimental. How
| are those things beneficial in any capacity? What was said last
| week? Embrace AI or leave?
|
| All I see so far is: don't embrace and stay.
| rootnod3 wrote:
| So, I see this got downvoted. Instead of just downvoting, I
| would prefer to have a counter-argument. Honestly. I am on the
| skeptic side of LLM, but would not mind being turned to the
| other side with some solid arguments.
| pupppet wrote:
| How does anyone send these models that much context without it
| tripping over itself? I can't get anywhere near that much before
| it starts losing track of instructions.
| 9wzYQbTYsAIc wrote:
| I've been having decent luck telling it to keep track of itself
| in a .plan file, not foolproof, of course, but it has some
| ability to "preserve context" between contexts.
|
| Right now I'm experimenting with using separate .plan files for
| tracking key instructions across domains like architecture and
| feature decisions.
| CharlesW wrote:
| > _I've been having decent luck telling it to keep track of
| itself in a .plan file, not foolproof, of course, but it has
| some ability to "preserve context" between contexts._
|
| This is the way. Not only have I had good luck with both a
| TASKS.md and TASKS-COMPLETE.md (for history), but I have an
| .llm/arch full of AI-assisted, for-LLM .md files (auth.md,
| data-access.md, etc.) that document architecture decisions
| made along the way. They're invaluable for effectively and
| efficiently crossing context chasms.
| collinvandyck76 wrote:
| Yeah, this. Each project I work on has its own markdown file
| named for the ticket or the project. Committed on the branch,
| and I have claude rewrite it with the "current understanding"
| periodically. After compacting, I have it re-read the MD file
| and we get started again. Quite nice.
| olddustytrail wrote:
| I think it's key to not give it contradictory instructions,
| which is an easy mistake to make if you forget where you
| started.
|
| As an example, I know of an instance where the LLM claimed it
| had tried a test on its laptop. This obviously isn't true so
| the user argued with it. But they'd originally told it that it
| was a Senior Software Engineer so playing that role, saying you
| tested locally is fine.
|
| As soon as you start arguing with those minor points you break
| the context; now it's both a Software Engineer and an LLM. Of
| course you get confused responses if you do that.
| pupppet wrote:
| The problem I often have is I may have instructions like:
|
| General instruction: - Do "ABC"
|
| If condition == whatever: - Do "XYZ" instead
|
| I have a hard time making the AI obey in the instances where
| I wish to override my own instruction, and without full
| control of the input context, I can't just modify my 'General
| Instruction' on a case-by-case basis to simply avoid having
| to contradict myself.
| olddustytrail wrote:
| That's a difficult case where you might want to collect
| your good context and shift it to a different session.
|
| It would be nice if the UI made that easy to do.
| greenfish6 wrote:
| Yes, but if you look at the rate limit notes, the rate limit is
| 500k tokens/minute for tier 4, which we are on. Given how
| stingy Anthropic has been with rate limit increases, this is for
| very few people right now.
| alvis wrote:
| A context window beyond a certain size doesn't bring much
| benefit, just a higher bill. If it still keeps forgetting
| instructions, it's easy to end up with long messages, higher
| context consumption, and hence a bigger bill.
|
| I'd rather have an option to limit the context size.
| EcommerceFlow wrote:
| It does if you're working with bigger codebases. I've found
| copy/pasting my entire codebase + adding a <task> works
| significantly better than cursor.
| spiderice wrote:
| How does one even copy their entire codebase? Are you saying
| you attach all the files? Or you use some script to copy all
| the text to your clipboard? Or something else?
| EcommerceFlow wrote:
| I created a script that outputs the entire codebase to a
| text file (also allows me to exclude
| files/folders/node_modules), separating and labeling each
| file in the program folder.
|
| I then structure my prompts around it like so:
|
| <project_code> ``` ``` </project_code>
|
| <heroku_errors> " " </heroku_errors>
|
| <task> " " </task>
|
| I've been using this with Google Ai studio and it's worked
| phenomenally. 1 million tokens is A LOT of code, so I'd
| imagine this would work for lots n lots of project type
| programs.
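|
| A minimal sketch of that kind of dump script (the paths and
| exclusion list are just examples):
|
| ```python
| from pathlib import Path
|
| EXCLUDE = {"node_modules", ".git", "dist", "__pycache__"}
|
| def dump_codebase(root: str, out: str = "codebase.txt") -> None:
|     with open(out, "w", encoding="utf-8") as fh:
|         for path in sorted(Path(root).rglob("*")):
|             if not path.is_file() or EXCLUDE & set(path.parts):
|                 continue
|             fh.write(f"\n===== {path} =====\n")   # label each file
|             try:
|                 fh.write(path.read_text(encoding="utf-8"))
|             except UnicodeDecodeError:
|                 continue                          # skip binary files
|
| dump_codebase(".")
| ```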
| swader999 wrote:
| Repomix, there's a cli and an MCP.
| andrewstuart wrote:
| Oh man finally. This has been such a HUGE advantage for Gemini.
|
| Could we please have zip files too? ChatGPT and Gemini both
| unpack zip files via the chat window.
|
| Now how about a button to download all files?
| qsort wrote:
| I won't complain about a strict upgrade, but that's a pricy boi.
| Interesting to see differential pricing based on size of input,
| which is understandable given the O(n^2) nature of attention.
| isoprophlex wrote:
| 1M of input... at $6/1M input tokens. Better hope it can one-shot
| your answer.
| elitan wrote:
| have you ever hired humans?
| bicepjai wrote:
| Depends on which human you tried :) Do not underestimate
| yourself!
| rafaelero wrote:
| god they keep raising prices
| revskill wrote:
| The critical issue with LLMs, where they never beat humans:
| they break what worked.
| henriquegodoy wrote:
| It's incredible to see how AI models are improving; I'm really
| happy with this news (IMO it's more impactful than the release
| of GPT-5). Now we need more tokens per second, and then the
| self-improvement of the model will accelerate.
| lherron wrote:
| Wow, I thought they would feel some pricing pressure from GPT5
| API costs, but they are doubling down on their API being more
| expensive than everyone else.
| sebzim4500 wrote:
| I think it's the right approach, the cost of running these
| things as coding assistants is negligible compared to the
| benefit of even a slight model improvement.
| AtNightWeCode wrote:
| GPT5 API uses more tokens for answers of the same quality as
| previous versions. Fell into that trap myself. I use both
| Claude and OpenAI right now. Will probably drop OpenAI since
| they are obviously not to be trusted considering the way they
| do changes.
| shamano wrote:
| 1M tokens is impressive, but the real gains will come from how we
| curate context--compact summaries, per-repo indexes, and phase
| resets. Bigger windows help; guardrails keep models focused and
| costs predictable.
| jbellis wrote:
| Just completed a new benchmark that sheds some light on whether
| Anthropic's premium is worth it.
|
| (Short answer: not unless your top priority is speed.)
|
| https://brokk.ai/power-rankings
| 24xpossible wrote:
| Why no Grok 4?
| Zorbanator wrote:
| You should be able to guess.
| jeffhuys wrote:
| People hate it because it had less filters and media caught
| on, so they told people to hate it.
|
| It's actually the best one right now, or close to. For my
| uses (code and queries) nothing comes even close.
|
| Once people look past the "but ELoN mUssKkkk!!!", they'll be
| surprised.
| jbellis wrote:
| The accompanying blog post explains: xAI did not respond to
| our requests for a Grok 4 quota that would allow us to run
| the evaluation.
| rcanepa wrote:
| I recently switched to the $200 CC subscription and I think I
| will stay with it for a while. I briefly tested whatever
| version of ChatGPT 5 comes with the free Cursor plan and it was
| unbearably slow. I could not really code with it as I was
| constantly getting distracted while waiting for a response. So,
| speed matters a lot for some people.
| Someone1234 wrote:
| Before this they supposedly had a longer context window than
| ChatGPT, but I have workloads that abuse the heck out of context
| windows (100-120K tokens). ChatGPT genuinely seems to have a 32K
| context window, in the sense that it legitimately remembers/can
| utilize everything within that window.
|
| Claude previously had "200K" context windows, but during testing
| it wouldn't even hit a full 32K before hitting a wall and
| forgetting earlier parts of the context. They also have extremely
| short prompt limits relative to the other services around, making
| it hard to utilize their supposedly larger context windows (which
| is suspicious).
|
| I guess my point is that with Anthropic specifically, I don't
| trust their claims because that has been my personal experience.
| It would be nice if this "1M" context window now allows you to
| actually use 200K though, but it remains to be seen if it can
| even do _that_. As I said with Anthropic you need to verify
| everything they claim.
| Etheryte wrote:
| Strong agree, Claude is very quick to forget things like "don't
| do this", "never do this" or things it tried that were wrong.
| It will happily keep looping even in very short conversations,
| completely defeating the purpose of using it. It's easy to game
| the numbers, but it falls apart in the real world.
| joquarky wrote:
| I've found it better to use antonyms than negations in most
| situations.
| typpilol wrote:
| Same here. Always tell them the way you want it done.
|
| For example:
|
| Instead of "don't modify the tests"
|
| It should be: analyze the test output and fix the bug in
| the source code. The test is built correctly.
|
| Not the best but you get the idea.
|
| The one problem with this is if you don't know how to do
| something properly. Like if you're just writing in your
| prompt "generate 90% test coverage", you give it a lot
| more leeway to do whatever it wants.
|
| And that's how you end up with the source code being
| modified to fit the test instead of vice versa.
| wahnfrieden wrote:
| ChatGPT Pro has a longer window but I've read conflicting
| reports on what it actually uses
| lvl155 wrote:
| The only time this is useful is to run init on a sizable code
| base or dump a "big" CSV.
| film42 wrote:
| The 1M token context was Gemini's headlining feature. Now, the
| only thing I'd like Claude to work on is tokens counted towards
| document processing. Gemini will often bill 1/10th the tokens
| Anthropic does for the same document.
| varyherb wrote:
| I believe this can be configured in Claude Code via the following
| environment variable:
|
| ANTHROPIC_BETAS="context-1m-2025-08-07" claude
| falcor84 wrote:
| Have you tested it? I see that this env var isn't specified in
| their docs
|
| https://docs.anthropic.com/en/docs/claude-code/settings#envi...
| bazhand wrote:
| Add these settings to your `.claude/settings.json`:
|
| ```json
| {
|   "env": {
|     "ANTHROPIC_CUSTOM_HEADERS":
|       {"anthropic-beta": "context-1m-2025-08-07"},
|     "ANTHROPIC_MODEL": "claude-sonnet-4-20250514",
|     "CLAUDE_CODE_MAX_OUTPUT_TOKENS": 8192
|   }
| }
| ```
| varyherb wrote:
| Yup! Claude Code has a lot of undocumented configuration.
| Once I saw the beta header value in their docs [1], I tried
| to see in their source code if there was any way to specify
| this flag via env var config. Their source code is already on
| your computer, just gotta dig through the minified JS :) Try:
|
| `cat $(which claude) | grep ANTHROPIC_BETAS`
|
| Sibling comment's approach with the other (documented) env
| var works too.
|
| [1] https://docs.anthropic.com/en/docs/build-with-
| claude/context...
| anonym29 wrote:
| Tested this morning. Worked wonderfully, except I ran into output
| issues. I attempted to patch the minified Claude file's
| CLAUDE_CODE_MAX_OUTPUT_TOKENS hard limit of 32000 on Sonnet to
| 64000, which worked, and I was able to generate outputs above
| 32000 tokens, but this coincided with a breakage of the 1m
| context window for me. Still testing and playing around with
| this, but this may be getting patched?
| gdudeman wrote:
| A tip for those who both use Claude Code and are worried about
| token use (which you should be if you're stuffing 400k tokens
| into context even if you're on 20x Max):
|
| 1. Build context for the work you're doing. Put lots of your
| codebase into the context window.
|
| 2. Do work, but at each logical stopping point hit double escape
| to rewind to the context-filled checkpoint. You do not spend
| those tokens to rewind to that point.
|
| 3. Tell Claude your developer finished XYZ, have it read it into
| context and give high level and low level feedback (Claude will
| find more problems with your developer's work than with yours).
|
| If you want to have multiple chats running, use /resume and pull
| up the same thread. Hit double escape to the point where Claude
| has rich context, but has not started down a specific rabbit
| hole.
| rvnx wrote:
| Thank you for the tips. Do you know how to roll back the latest
| changes? I've been trying very hard to do it, but it seems like
| Git is the only way?
| gdudeman wrote:
| Git or my favorite "Undo all of those changes."
| spike021 wrote:
| this usually gets the job done for me as well
| SparkyMcUnicorn wrote:
| I haven't used it, but saw this the other day:
| https://github.com/RonitSachdev/ccundo
| rtuin wrote:
| Quick tip when working with Claude Code and Git: When you're
| happy with an intermediate result, stage the changes by
| running `git add` (no commit). That makes it possible to
| always go back to the staged changes when Claude messes up.
| You can then just discard the unstaged changes and don't have
| to roll back to the latest commit.
| seperman wrote:
| Very interesting. Why does Claude find more problems if we
| mention the code is written by another developer?
| bgilly wrote:
| In my experience, Claude will criticize others more than it
| will criticize itself. Seems similar to how LLMs in general
| tend to say yes to things or call anything a good idea by
| default.
|
| I find it to be an entertaining reflection of the cultural
| nuances embedded into training data and reinforcement
| learning processes.
| umbra07 wrote:
| Interesting. In my experience, it's the opposite. Claude is
| too sycophantic. If you tell it that it was wrong, it will
| just accept your word at face value. If I give a problem to
| both Claude and Gemini, their responses differ and I ask
| Claude why Gemini has a different response - Claude will
| just roll over and tell me that Gemini's response was
| perfect and that it messed up.
|
| This is why I was really taken by Gemini 2.0/2.5 when it
| first came out - it was the first model that really pushed
| back at you. It would even tell me that it wanted _x_
| additional information to continue onwards, unprompted.
| Sadly, as Google has neutered 2.5 over the last few months,
| its independent streak has also gone away, and it's only
| slightly more individualistic than Claude/OpenAI's models.
| mcintyre1994 wrote:
| Total guess, but maybe it breaks it out of the sycophancy
| that most models seem to exhibit?
|
| I wonder if they'd also be better at things like telling you
| an idea is dumb if you tell it it's from someone else and
| you're just assessing it.
| gdudeman wrote:
| Claude is very agreeable and is an eager helper.
|
| It gives you the benefit of the doubt if you're coding.
|
| It also gives you the benefit of the doubt if you're looking
| for feedback on your developer's work. If you give it a hint
| of distrust "my developer says they completed this, can you
| check and make sure, give them feedback....?" Claude will
| look out for you.
| daveydave wrote:
| I would guess the training data (conversational as opposed to
| coding specific solutions) is weighted towards people finding
| errors in others work, more than people discussing errors in
| their own. If you knew there was an error in your thinking,
| you probably wouldn't think that way.
| sixothree wrote:
| I've been using Serena MCP to keep my context smaller. It seems
| to be working because claude uses it pretty much exclusively to
| search the codebase.
| lucasfdacunha wrote:
| Could you elaborate a bit on how that works? Does it need any
| changes in how you use Claude?
| sixothree wrote:
| No. I have three MCPs installed and this is the only one
| that doesn't need guidance. You'll see it using it for
| search and finding references and such. It's a one line
| install and no work to maintain.
|
| The advantage is that Claude won't have to use the file
| system to find files. And it won't have to go read files
| into context to find what it's looking for. It can use its
| context for the parts of code that actually matter.
|
| And I feel like my results have actually been much better
| with this.
| yahoozoo wrote:
| I thought double escape just clears the text box?
| gdudeman wrote:
| With an empty text box, double escape shows you a list of
| previous inputs from you. You can go back and fork at any one
| of those.
| oars wrote:
| I tell Claude that it wrote XYZ in another session (I wrote it)
| then use that context to ask questions or make changes.
| ewoodrich wrote:
| Hah, I do the same when I need to manually intervene to nudge
| the solution in the direction I want after a few failed
| attempts to reconstruct my prompt to avoid some undesired path
| the LLM _really_ wants to go down.
| gdudeman wrote:
| I'll note this saves a lot of wait time as well! No sitting
| there while a new Claude builds context from scratch.
| i_have_an_idea wrote:
| This sounds like the programmer equivalent of astrology.
|
| > Build context for the work you're doing. Put lots of your
| codebase into the context window.
|
| If you don't say that, what do you think happens as the agent
| works on your codebase?
| bubblyworld wrote:
| You don't have to think about it, you can just go try it. It
| doesn't work as well (yet) for me. I'm still way better than
| Claude at finding an initial heading.
|
| Astrology doesn't produce working code =P
| gdudeman wrote:
| You don't say that - you instruct the LLM to read files about
| X, Y, and Z. Putting the context in helps the agent plan
| better (next step) and write correct code (final step).
|
| If you're asking the agent to do chunks of work, this will
| get better results than asking it to blindly go forth and do
| work. Anthropic's best practices guide says as much.
|
| If you're asking the agent to create one method that
| accomplishes X, this isn't useful.
| insane_dreamer wrote:
| I usually tell CC (or opencode, which I've been using recently)
| to look up the files and find the relevant code. So I'm not
| attaching a huge number of files to the context. But I don't
| actually know whether this saves tokens or not.
| Wowfunhappy wrote:
| I do this all the time and it sometimes works, but it's not a
| silver bullet. Sometimes Claude benefits from having the full
| conversation.
| FajitaNachos wrote:
| What's the benefit to using claude code CLI directly over
| something like Cursor?
| trenchpilgrim wrote:
| Claude Code is a much more flexible tool:
| https://docs.anthropic.com/en/docs/claude-
| code/overview#why-...
| qafy wrote:
| The benefit is you can use your preferred editor. No need to
| learn a completely new piece of software that doesn't match
| your workflow just to get access to agentic workflows. For
| example, my entire workflow for the last 15+ years has been
| tmux+vim, and I have no desire to change that.
| KptMarchewa wrote:
| You don't have to deal with the awfulness of VS Code.
| nojs wrote:
| In my experience jumping back like this is risky unless you
| explicitly tell it you made changes, otherwise they will get
| clobbered because it will update files based on the old
| context.
|
| Telling it to "re-read" xyz files before starting works though.
| bamboozled wrote:
| I always ask it to read the last 5 commits and analyze any
| modified or staged files; works well...
| mattmanser wrote:
| Why do you find this better than just starting again at
| that point? I'm trying to understand the benefit of using
| this 'trick', without being able to try it as I'm away from
| my computer.
|
| Couldn't you start a new context and achieve the same
| thing, without any of the risks of this approach?
| bamboozled wrote:
| LLMs have no "memory", so this gives it something to go off
| of. I forgot to add that I only do this if the change I'm
| making is related to whatever I did yesterday.
|
| I do this because sometimes I just manually edit code and
| the LLM doesn't know everything that's happened.
|
| I also find the best way to work with "AI" is to make
| very small changes and commit frequently, I truly think
| it's a slot machine and if it does go wild, you can lose
| hours of work.
| cube00 wrote:
| > Tell Claude your developer finished XYZ [...] (Claude will
| find more problems with your developer's work than with yours).
|
| It's crazy to think LLMs are so focused on pleasing us that we
| have to trick them like this to get frank and fearless
| feedback.
| razemio wrote:
| I think it is something else. If you think about it, humans
| often write about correcting errors done by others.
| Refactoring code, fixing bugs and writing code more efficiently.
| I guess it triggers other paths in the model, if we write
| that someone else did it. It is not about pleasing but our
| constant desire to improve things.
| cube00 wrote:
| If we tell it a rival LLM wrote the code it will be extra
| critical to tap into its capitalist streak to crush the
| competition?
| ZeroCool2u wrote:
| It's great they've finally caught up, but unfortunate it's on
| their mid-tier model only and it's laughably expensive.
| thimabi wrote:
| Oh, well, ChatGPT is being left in the dust...
|
| When done correctly, having one million tokens of context window
| is amazing for all sorts of tasks: understanding large codebases,
| summarizing books, finding information on many documents, etc.
|
| Existing RAG solutions fill a void up to a point, but they lack
| the precision that large context windows offer.
|
| I'm excited for this release and hope to see it soon on the UI as
| well.
| OutOfHere wrote:
| Fwiw, OpenAI does have a decent active API model family of
| GPT-4.1 with a 1M context. But yes, the context of the GPT-5
| models is terrible in comparison, and it's altogether atrocious
| for the GPT-5-Chat model.
|
| The biggest issue in ChatGPT right now is a very inconsistent
| experience, presumably due to smaller models getting used even
| for paid users with complex questions.
| wahnfrieden wrote:
| Doesn't it matter more what context they provide via Claude
| Code and Codex CLI? And aren't they similar anyway there?
|
| Because the API with maximum context is very expensive (also
| not rolled out to everyone)
| kotaKat wrote:
| A million tokens? Damn, I'm gonna need a _lot_ of quarters to
| play this game at Chuck-E-Cheese.
| xnx wrote:
| 1M context windows are not created equal. I doubt Claude's recall
| is as good as Gemini's 1M context recall.
| https://cloud.google.com/blog/products/ai-machine-learning/t...
| xnx wrote:
| Good analysis here:
| https://news.ycombinator.com/item?id=44878999
|
| > the model that's best at details in long context text and
| code analysis is still Gemini.
|
| > Gemini Pro and Flash, by comparison, are far cheaper
| firasd wrote:
| A big problem with the chat apps (ChatGPT; Claude.ai) is the
| weird context window hijinks. ChatGPT especially does wild
| stuff: sudden truncation, summarization, reinjecting 'ghost
| snippets', etc.
|
| I was thinking this should be up to the user (do you want to
| continue this conversation with context rolling out of the window
| or start a new chat) but now I realized that this is inevitable
| given the way pricing tiers and limited computation works. Like
| the only way to have full context is use developer tools like
| Google AI Studio or use a chat app that wraps the API
|
| With a custom chat app that wraps the API you can even inject the
| current timestamp into each message and just ask the LLM btw
| every 10 minutes just make a new row in a markdown table that
| summarizes every 10 min chunk
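|
| For illustration, a rough sketch of that kind of wrapper,
| assuming the `anthropic` Python SDK; the model id and the
| summary instruction are just examples:
|
| ```python
| # Rough sketch of a chat wrapper that stamps each user message
| # with the current time, so the model can keep a rolling summary.
| # Assumes the `anthropic` SDK and ANTHROPIC_API_KEY in the env;
| # the model id and system prompt are illustrative.
| from datetime import datetime, timezone
| import anthropic
|
| client = anthropic.Anthropic()
| history = []  # full conversation kept client-side
|
| def send(user_text):
|     now = datetime.now(timezone.utc).isoformat(timespec="seconds")
|     history.append({"role": "user",
|                     "content": f"[{now}] {user_text}"})
|     resp = client.messages.create(
|         model="claude-sonnet-4-20250514",
|         max_tokens=1024,
|         system="Every ~10 minutes of chat time, add one row to a "
|                "markdown table summarizing that 10-minute chunk.",
|         messages=history,
|     )
|     reply = resp.content[0].text
|     history.append({"role": "assistant", "content": reply})
|     return reply
| ```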
| cruffle_duffle wrote:
| > btw every 10 minutes just make a new row in a markdown table
| that summarizes every 10 min chunk
|
| Why make it time based instead of "message based"... like
| "every 10 messages, summarize to blah-blah.md"?
| dev0p wrote:
| Probably it's more cost effective and less error prone to
| just dump the message log rather than actively rethink the
| context window, costing resources and potentially losing
| information in the process. As the models get better, this
| might change.
| firasd wrote:
| Sure. But you'd want to help out the LLM with a message count
| like this is message 40, this is message 41... so when it
| hits message 50 it's like ahh time for a new summary and call
| the memory_table function (cause it's executing the earlier
| standing order in your prompt)
| tosh wrote:
| How did they do the 1M context window?
|
| Same technique as Qwen? As Gemini?
| deadbabe wrote:
| Unfortunately, larger context isn't really the answer after a
| certain point. Small focused context is better, lazily throwing a
| bunch of tokens in as a context is going to yield bad results.
| ramoz wrote:
| Awesome addition to a great model.
|
| The best interface for long context reasoning has been AIStudio
| by Google. Exceptional experience.
|
| I use Prompt Tower to create long context payloads.
| simianwords wrote:
| How does "supporting 1M tokens" really work in practice? Is it a
| new model? Or did they just remove some hard coded constraint?
| eldenring wrote:
| Serving a model efficiently at 1M context is difficult and
| could be much more expensive/numerically tricky. I'm guessing
| they were working on serving it properly, since it's the same
| "model" in scores and such.
| simianwords wrote:
| Thanks - still not clear what they did really. Some inference
| time hacks?
| FergusArgyll wrote:
| That would imply the model always had a 1m token context
| but they limited it in the api and app? That's strange
| because they can just charge more for every token past 250k
| (like google does, I believe).
|
| But if not, shouldn't it have to be a completely retrained
| model? It's clearly not that - good question!
| Aeolun wrote:
| They already had a 0.5M context window on the enterprise
| version.
| otabdeveloper4 wrote:
| Most likely still 32k tokens under the hood, but with some
| context slicing/averaging hacks to make inference not error
| out on infinite input.
|
| (That's what I do locally with llama.cpp)
| nickphx wrote:
| Yay, more room for stray cats.
| alienbaby wrote:
| The fracturing of all the models offered across providers is
| annoying. The number of different models and the fact a given
| model will have different capabilities from different providers
| is ridiculous.
| chrisweekly wrote:
| Peer of this post currently also on HN front page, comparing perf
| for Claude vs Gemini, w/ 1M tokens:
| https://news.ycombinator.com/item?id=44878999
| DiabloD3 wrote:
| Neat. I do 1M tokens context locally, and do it entirely with a
| single GPU and FOSS software, and have access to a wide range of
| models of equivalent or better quality.
|
| Explain to me, again, how Anthropic's flawed business model
| works?
| codazoda wrote:
| Tell us more?
| DiabloD3 wrote:
| Nothing really to say, its just like everyone else's
| inference setups.
|
| Select a model that produces good results, has anywhere from
| 256k to 1M context (ex: Qwen3-Coder can do 1M), is under one
| of the acceptable open weights licenses, and run it in
| llama.cpp.
|
| llama.cpp can split layers between active and MoE, and only
| load the active ones into vram, leaving the rest of it
| available for context.
|
| With Qwen3-Coder-30B-A3B, I can use Unsloth's Q4_K_M, consume
| a mere 784MB of VRAM with the active layers, then consume
| 27648MB (kv cache) + 3096MB (context) with the kv cache
| quantized to iq4_nl. This will fit onto a single card with
| 32GB of VRAM, or slightly spill over on 24GB.
|
| Since I don't personally need that much, I'm not pouring
| entire projects into it (I know people do this, and more data
| _does not produce better results_ ), I bump it down to 512k
| context and fit it in 16.0GB, to avoid spill over on my 24GB
| card. In the event I do need the context, I am always free to
| enable it.
|
| I do not see a meaningful performance difference between all
| on the card and MoE sent to RAM while active is on VRAM, its
| very much a worthwhile option for home inference.
|
| Edit: For completeness' sake, 256k context with this
| configuration is 8.3GB total VRAM, making good inference on a
| _very_ tight budget absolutely possible.
| ffitch wrote:
| I wonder how modern models fare on NovelQA and FLenQA (benchmarks
| that test ability to understand long context beyond needle in a
| haystack retrieval). The only such test on a reasoning model that
| I found was done on o3-mini-high
| (https://arxiv.org/abs/2504.21318), it suggests that reasoning
| noticeably improves FLenQA performance, but this test only
| explored context up to 3,000 tokens.
| dang wrote:
| Related ongoing thread:
|
| _Claude vs. Gemini: Testing on 1M Tokens of Context_ -
| https://news.ycombinator.com/item?id=44878999 - Aug 2025 (9
| comments)
| whalesalad wrote:
| My first thought was "gg no re" can't wait to see how this
| changes compaction requirements in claude code.
| pmxi wrote:
| The reason I initially got interested in Claude was because they
| were the first to offer a 200K token context window. That was
| massive in 2023. However, they didn't keep up once Gemini offered
| a 1M token window last year.
|
| I'm glad to see an attempt to return to having a competitive
| context window.
| markb139 wrote:
| I've tried 2 AI tools recently. Neither could produce the correct
| code to calculate the CPU temperature on a Raspberry Pi RP2040.
| The code worked, looked ok and even produced reasonable looking
| results - until I put a finger on the chip and thus raised the
| temp. The calculated temperature went down. As an aside, the free
| version of ChatGPT didn't know about anything newer than 2023, so
| it couldn't tell me about the RP2350.
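|
| For reference, the usual MicroPython read of the RP2040's
| on-die sensor looks something like this (the constants are the
| nominal datasheet values, so treat absolute accuracy as rough):
|
| ```python
| # MicroPython sketch: the RP2040 temperature sensor sits on ADC
| # channel 4; conversion constants are nominal datasheet values.
| import machine
|
| sensor = machine.ADC(4)
|
| def read_temp_c():
|     volts = sensor.read_u16() * 3.3 / 65535
|     return 27 - (volts - 0.706) / 0.001721
| ```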
| anvuong wrote:
| How can you be sure putting the finger on the chip raises the
| temp? If you feel hot that means heat from the chip is being
| transferred to your finger, that may decrease the temp, no?
| broshtush wrote:
| From my understanding putting your finger on an uncooled CPU
| acts like a passive cooler, thus actually decreasing
| temperature.
| fwip wrote:
| I don't think a larger context window would help with that.
| fpauser wrote:
| Best comment ;)
| ghjv wrote:
| wouldn't your finger have acted as a heat sink, lowering the
| temp? sounds like the program may have worked correctly. could
| be worth trying again with a hot enough piece of metal instead
| of your finger
| logicchains wrote:
| With that pricing I can't imagine why anyone would use Claude
| Sonnet through the API when Gemini 2.5 Pro is both better and
| cheaper (especially at long-context understanding).
| CuriouslyC wrote:
| Claude is a good deal with the $20 subscription giving a fair
| amount of sonnet use with Code. It's also got a very distinct
| voice as far as LLMs go, and tends to produce cleaner/clearer
| writing in general. I wouldn't use the API in an application
| but the subscription feels like a pretty good deal.
| siva7 wrote:
| Ah, so Claude Code on subscription will become a crippled
| version
| joduplessis wrote:
| As far as coding goes, Claude seems to be the most competent
| right now; I like it. GPT5 is abysmal - I'm not sure if it's
| bugs or what, but the new release takes a good few steps back.
| Gemini is still hit and miss - and Grok seems to be a poor man's
| Claude (where the code is kind of okay, a bit buggy and somehow
| similar to Claude).
| wahnfrieden wrote:
| Are you evaluating gpt5-thinking on high mode, via API or Codex
| CLI on Pro tier? Just wondering what specifically you compared
| since those factors affect its performance and context
| brokegrammer wrote:
| Many people are confused about the usefulness of 1M tokens
| because LLMs often start to get confused after about 100k. But
| this is big for Claude 4 because it uses automatic RAG when the
| context becomes large. With optimized retrieval thanks to RAG,
| we'll be able to make good use of those 1M tokens.
| m4r71n wrote:
| How does this work under the hood? Does it build an in-memory
| vector database of the input sources and run queries on top of
| that data to supplement the context window?
| brokegrammer wrote:
| No idea how it's implemented because it's proprietary.
| Details here: https://support.anthropic.com/en/articles/11473
| 015-retrieval...
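|
| For intuition only, the generic pattern being asked about
| (chunk the input, embed it, pull the most similar chunks back
| into context) looks roughly like this; it says nothing about
| Anthropic's actual implementation, and the bag-of-words
| "embedding" is just a toy stand-in:
|
| ```python
| # Toy sketch of generic retrieval: rank chunks of the input by
| # similarity to the query and feed only the top ones back into
| # context. Not Anthropic's implementation; purely illustrative.
| from collections import Counter
| import math
|
| def embed(text):
|     return Counter(text.lower().split())  # toy "vector"
|
| def cosine(a, b):
|     dot = sum(a[t] * b[t] for t in a)
|     na = math.sqrt(sum(v * v for v in a.values()))
|     nb = math.sqrt(sum(v * v for v in b.values()))
|     return dot / (na * nb) if na and nb else 0.0
|
| def top_chunks(query, chunks, k=5):
|     q = embed(query)
|     return sorted(chunks, key=lambda c: cosine(q, embed(c)),
|                   reverse=True)[:k]
| ```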
| Balgair wrote:
| Wow!
|
| As a fiction writer/noodler this is amazing. I can put not just a
| whole book in as before, not just a whole series, but the entire
| corpus of author _s_ in.
|
| I mean, from the pov of biography writers, this is awesome too.
| Just dump it all in, right?
|
| I'll have to switch to using Sonnet 4 now for workflows and edit
| my RAG code to use longer windows, a _lot_ longer.
| irthomasthomas wrote:
| Brain: Hey, you going to sleep? Me: Yes. Brain: That 200,001st
| token cost you $600,000/M.
| qwertox wrote:
| > desperately need LLMs to maintain extremely effective context
|
| Last time I used Gemini it did something very surprising: instead
| of providing readable code, it started to generate pseudo-
| minified code.
|
| Like, one CSS class would become one long line of CSS, and one JS
| function became one long line of JS, with most of the variable
| names minified, while some remained readable, but short. It did
| away with all unnecessary spaces.
|
| I was asking myself what is happening here, and my only
| explanation was that maybe Google started training Gemini on
| minified code, on making Gemini understand and generate it, in
| order to maximize the value of every token.
| ericol wrote:
| "...in API"
|
| That's a VERY relevant clarification. This DOESN'T apply to web
| or app users.
|
| Basically, if you want a 1M context window you have to
| specifically pay for it.
| sporkland wrote:
| Does anyone have data on how much better these 1M token context
| models perform than the more limited windows alongside certain
| RAG implementations? Or how much better, in the face of RAG, the
| 200k vs 1M token models perform on a benchmark?
| poniko wrote:
| [Claude usage limit reached. Your limit will reset at..] .. eh
| lunch is a good time to go home anyways..
| chmod775 wrote:
| For some context, only the tweaks files and scripting parts of
| Cyberpunk 2077 are ~2 million LOC.
| not_that_d wrote:
| My experience with the current tools so far:
|
| 1. It helps to get me going with new languages, frameworks,
| utilities or full green-field stuff. After that I spend a lot of
| time parsing the code to understand what it wrote; I end up kind
| of "trusting" it because checking is too tedious, but "it works".
|
| 2. When working with languages or frameworks that I know, I find
| it makes me unproductive: the amount of time I spend writing a
| good enough prompt with the correct context is almost the same
| as, or more than, if I write the stuff myself, and to be honest
| the solution it gives me works for the specific case but looks
| like junior code, with pitfalls that are not that obvious unless
| you have the experience to know them.
|
| I used it with Typescript, Kotlin, Java and C++, for different
| scenarios, like websites, ESPHome components (ESP32), backend
| APIs, node scripts etc.
|
| Bottom line: useful for hobby projects, scripts and prototypes,
| but for enterprise-level code it is not there.
| jeremywho wrote:
| My workflow is to use Claude desktop with the filesystem mcp
| server.
|
| I give claude the full path to a couple of relevant files
| related to the task at hand, ie where the new code should hook
| into or where the current problem is.
|
| Then I ask it to solve the task.
|
| Claude will read the files, determine what should be done and
| it will edit/add relevant files. There's typically a couple of
| build errors I will paste back in and have it correct.
|
| Current code patterns & style will be maintained in the new
| code. It's been quite impressive.
|
| This has been with Typescript and C#.
|
| I don't agree that what it has produced for me is hobby-grade
| only...
| taberiand wrote:
| I've been using it the same way. One approach that's worked
| well for me is to start a project and first ask it to analyse
| and make a plan with phases for what needs to be done, save
| that plan into the project, then get it to do each phase in
| sequence. Once it completes a phase, have it review the code
| to confirm if the phase is complete. Each phase of work and
| review is a new chat.
|
| This way helps ensure it works on manageable amounts of code
| at a time and doesn't overload its context, but also keeps
| the bigger picture and goal in sight.
| mnky9800n wrote:
| I find that sometimes this works great and sometimes it
| happily tells you everything works and your code fails
| successfully and if you aren't reading all the code you
| would never know. It's kind of strange actually. I don't
| have a good feeling when it will get everything correct and
| when it will fail and that's what is disconcerting. I would
| be happy to be given advice on what to do to untangle when
| it's good and when it's not. I love chatting with Claude
| code about code. It's annoying that it doesn't always get
| it right and also doesn't really interact with failure like
| a human would. At least in my experience, anyway.
| taberiand wrote:
| Of course, everything needs to be verified - I'm just
| trying to figure out a process that enables it to work as
| effectively as it can on large code bases in a structured
| way. Committing each stage to git, fixing issues and
| adjusting the context still comes into play.
| nwatson wrote:
| One can also integrate with, say, a running PyCharm with the
| Jetbrains IDE MCP server. Claude Desktop can then interact
| directly with PyCharm.
| hamandcheese wrote:
| Any particular reason you prefer that over Claude code?
| jeremywho wrote:
| I'm on windows. Claude Code via WSL hasn't been as smooth a
| ride.
| JyB wrote:
| That's exactly how you should do it. You can also plug in an
| MCP for your CI or mention cli.github.com in your prompt to
| also make it iterate on CI failures.
|
| Next you use Claude Code instead and have several instances work
| on their own clones, workspaces and branches in the background,
| so you can still iterate yourself on some other topic in your
| personal clone.
|
| Then you check out its tab from time to time and optionally
| checkout its branch if you'd rather do some updates yourself.
| It's so ingrained in my day-to-day flow now it's been super
| impressive.
| risyachka wrote:
| Pretty much my experience too.
|
| I usually go with option 2 - just write it myself, as it is the
| same time-wise but keeps my skills sharp.
| fpauser wrote:
| Not degenerating is really challenging these days. There are
| the bubbles that simulate multiple realities for us and try to
| untrain our logical thinking. And there are the LLMs that try to
| convince us that thinking for ourselves is unproductive. I wonder
| when this digitalophily suddenly turns into digitalophobia.
| sciencejerk wrote:
| It's happening, friend, don't let the AI hype fool you. I'm
| detecting quite a bit of reluctance and lack of 100% buy-in
| on AI coding tools and trends, even from your typically
| tech-loving Software Engineers.
| flowerthoughts wrote:
| I predict microservices will get a huge push forward. The
| question then becomes if we're good enough at saying "Claude,
| this is too big now, you have to split it in two services" or
| not.
|
| If LLMs maintain the code, the API boundary
| definitions/documentation and orchestration, it might be
| manageable.
| fsloth wrote:
| Why microservices? Monoliths with code-golfed minimal
| implementation size (but high quality architecture)
| implemented in a strongly typed language would consume far fewer
| tokens (and thus would be cheaper to maintain).
| arwhatever wrote:
| Won't this cause [insert LLM] to lose context around the
| semantics of messages passed between microservices?
|
| You could then put all services in 1 repo, or point LLM at X
| number of folders containing source for all X services, but
| then it doesn't seem like you'll have gained anything, and at
| the cost of added network calls and more infra management.
| urbandw311er wrote:
| Why not just cleanly separated code in a single execution
| environment? No need to actually run the services in separate
| execution environments just for the sake of an LLM being able
| to parse it, that's crazy! You can just give it the files or
| folders it needs for the particular services within the
| project.
|
| Obviously there's still other reasons to create micro
| services if you wish, but this does not need to be another
| reason.
| fpauser wrote:
| Same conclusion here. Also good for analyzing existing
| codebases and to generate documentation for undocumented
| projects.
| j45 wrote:
| It's quite good at this, I have been tying in Gemini Pro with
| this too.
| johnisgood wrote:
| > but for enterprise level code it is not there
|
| It is good for me in Go but I had to tell it what to write and
| how.
| sdesol wrote:
| I've been able to create a very advanced search engine for my
| chat app that is more than enterprise ready. I've spent a
| decade thinking about search, but in a different language.
| Like you, I needed to explain what I knew about writing a
| search engine in Java for the LLM, to write it in JavaScript
| using libraries I did not know and it got me 95% of the way
| there.
|
| It is also incredibly important to note that the 5% that I
| needed to figure out was the difference between throw away
| code and something useful. You absolutely need domain
| knowledge but LLMs are more than enterprise ready in my
| opinion.
|
| Here is some documentation on how my search solution is used
| in my app to show that it is not a hobby feature.
|
| https://github.com/gitsense/chat/blob/main/packages/chat/wid.
| ..
| johnisgood wrote:
| Thanks for your reply, I am in the same boat, and it works
| for me, like it seems to work for you. So as long as we are
| effective with it, why not? Of course I am not doing things
| blindly and expect good results.
| jiggawatts wrote:
| Something I've discovered is that it may be worthwhile writing
| the prompt anyway, even for a framework you're an expert with.
| Sometimes the AIs will surprise me with a novel approach, but
| the real value is that the prompt makes for _excellent_
| documentation of the requirements! It's a much better starting
| point for doc-comments or PR blurbs than after-the-fact
| ramblings.
| viccis wrote:
| I agree. For me it's a modern version of that good ol "rails
| new" scaffolding with Ruby on Rails that got you started with a
| project structure. It makes sense because LLMs are particularly
| good at tasks that require little more knowledge than just a
| near perfect knowledge of the documentation of the tooling
| involved, and creating a well organized scaffold for a
| greenfield project falls squarely in that area.
|
| For legacy systems, especially ones in which a lot of the
| things they do are because of requirements from external
| services (whether that's tech debt or just normal growing
| complexity in a large connected system), it's less useful.
|
| And for tooling that moves fast and breaks things (looking at
| you, Databricks), it's basically worthless. People have already
| brought attention to the fact that it will only be as current
| as its training data was, and so if a bunch of terminology,
| features, and syntax have changed since then (ahem,
| Databricks), you would have to do some kind of prompt
| engineering with up to date docs for it to have any hope of
| succeeding.
| pvorb wrote:
| I'm wondering what exact issue you are referring to with
| Databricks? I can't remember a time I had to change a line I
| wrote during the past 2.5 years I've been using it. Or are
| you talking about non-breaking changes?
| alfalfasprout wrote:
| The bigger problem I'm seeing is engineers that become over
| reliant on vibe coding tools are starting to lose context on
| how systems are designed and work.
|
| As a result, their productivity might go up on simple "ticket
| like tasks" where it's basically just simple implementation
| (find the file(s) to edit, modify it, test it) but when they
| start using it for all their tasks suddenly they don't know how
| anything works. Or worse, they let the LLM dictate and bad
| decisions are made.
|
| These same people are also very dogmatic on the use of these
| tools. They refuse to just code when needed.
|
| Don't get me wrong, this stuff has value. But I just hate
| seeing how it's made many engineers complacent and accelerated
| their ability to add to tech debt like never before.
| mnky9800n wrote:
| Yea that's right. It's kind of annoying how useful it is for
| hobby projects and it is suddenly useless on anything at work.
| Haha. I love Claude code for some stuff (like generating a
| notebook to analyse some data). But it really just disconnects
| you from the problem you are solving without you going through
| everything it writes. And I'm really bullish on ai coding tools
| haha, for example:
|
| https://open.substack.com/pub/mnky9800n/p/coding-agents-prov...
| pqs wrote:
| I'm not a programmer, but I need to write python and bash
| programs to do my work. I also have a few websites and other
| personal projects. Claude Code helps me implement those little
| projects I've been wanting to do for a very long time, but I
| couldn't due to the lack of coding experience and time. Now I'm
| doing them. Also now I can improve my emacs environment,
| because I can create lisp functions with ease. For me, this is
| the perfect tool, because now I can do those little projects I
| couldn't do before, making my life easier.
| zingar wrote:
| Big +1 to customizing emacs! Used to feel so out of reach,
| but now I basically rolled my own cursor.
| chamomeal wrote:
| LLMs totally kick ass for making bash scripts
| dboreham wrote:
| Strong agree. Bash is so annoying that there have been many
| scripts that I wanted to have, but just didn't write (did
| the thing manually instead) rather than go down the rabbit
| hole of Bash nonsense. LLMs turn this on its head. I
| probably have LLMs write 1-2 bash scripts a week now, that
| I commit to git for use now and later.
| unshavedyak wrote:
| Similarly my Nix[OS] env had a ton of annoyances and
| updates needed that i didn't care to do. My first week of
| Claude saw tons of Nix improvements for my environment
| across my three machines (desk, server, macbook) and it's
| a much more rich environment.
|
| Claude did great at Nix, something i struggled with due
| to lack of documentation. It was far from perfect, but it
| usually pointed me towards the answer that i could later
| refine with it. Felt magical.
| elcritch wrote:
| Similarly I've been making Ansible playbooks using LLMs
| of late, often by converting shell scripts. Playbooks
| are pretty great and easier to make idempotent than
| shell. But without Claude I'd forget the syntax or
| commands and it'd take forever to set up.
| int_19h wrote:
| Why not use a more sensible shell, e.g. Fish?
| chamomeal wrote:
| Also great at making fish scripts!
|
| Bash scripts are p much universal though. I can send em
| to my coworkers. I can use them in my awful prod-
| debugging-helm environment.
| MangoCoffee wrote:
| At the end of the day, all tools are made to make their
| users' lives easier.
|
| I use GitHub Copilot. I recently did a vibe code hobby
| project for a command line tool that can display my
| computer's IP, hard drive, hard drive space, CPU, etc. GPT
| 4.1 did coding and Claude did the bug fixing.
|
| The code it wrote worked, and I even asked it to create a
| PowerShell script to build the project for release
| dfedbeef wrote:
| Try typing ctrl+shift+escape.
| dekhn wrote:
| For context I'm a principal software engineer who has worked
| in and out of machine learning for decades (along with a
| bunch of tech infra, high performance scientific computing,
| and a bunch of hobby projects).
|
| In the few weeks since I've started using
| Gemini/ChatGPT/Claude, I've
|
| 1. had it read my undergrad thesis and the paper it's based
| on, implementing correct pytorch code for featurization and
| training, along wiht some aspects of the original paper that
| I didn't include in my thesis. I had been waiting until
| retirement until taking on this task.
|
| 2. had it write a bunch of different scripts for automating
| tasks (typically scripting a few cloud APIs) which I then
| ran, cleaning up a long backlog of activities I had been
| putting off.
|
| 3. had it write a yahtzee game and implement a decent "pick a
| good move" feature . It took a few tries but then it output a
| fully functional PyQt5 desktop app that played the game. It
| beat my top score of all time in the first few plays.
|
| 4. tried to convert the yahtzee game to an android app so my
| son and I could play. This has continually failed on every
| chat agent I've tried- typically getting stuck with gradle or
| the android SDK. This matches my own personal experience with
| android.
|
| 5. had it write python and web-based g-code senders that
| allowed me to replace some tools I didn't like (UGS). Adding
| real-time vis of the toolpath and objects wasn't that hard
| either. Took about 10 minutes and it cleaned up a number of
| issues I saw with my own previous implementations
| (multithreading). It was stunning how quickly it can create
| fully capable web applications using javascript and external
| libraries.
|
| 6. had it implement a gcode toolpath generator for basic
| operations. At first I asked it to write Rust code, which
| turned out to be an issue (mainly because the opencascade
| bindings are incomplete), it generated mostly functional code
| but left it to me to implement the core algorithm. I asked it
| to switch to C++ and it spit out the correct code the first
| time. I spent more time getting cmake working on my system
| than I did writing the prompt and waiting for the code.
|
| 7. had it Write a script to extract subtitles from a movie,
| translate them into my language, and re-mux them back into
| the video. I was able to watch the movie less than an hour
| after having the idea- and most of that time was just
| customizing my prompt to get several refinements.
|
| 8. had it write a fully functional chemistry structure
| variational autoencoder that trains faster and more accurate
| than any I previously implemented.
|
| 9. various other scientific/imaging/photography related
| codes, like implementing multi-camera rectification, so I can
| view obscured objects head-on from two angled cameras.
|
| With a few caveats (Android projects, Rust-based toolpath
| generation), I have been absolutely blown away with how
| effective the tools are (especially used in a agent which has
| terminal and file read/write capabilities). It's like having
| a mini-renaissance in my garage, unblocking things that would
| have taken me a while, or been so frustrating I'd give up.
|
| I've also found that AI summaries in google search are often
| good enough that I don't click on links to pages (wikipedia,
| papers, tutorials etc). The more experience I get, the more
| limitations I see, but many of those limitations are simply
| due to the extraordinary level of unnecessary complexity
| required to do nearly anything on a modern computer (see my
| comments above about Android apps & gradle).
| stpedgwdgfhgdd wrote:
| For enterprise software development CC is definitely there.
| A 100k-line Go PaaS platform, microservices architecture, mono
| repo is manageable.
|
| The prompt needs to be good, but in plan mode it will
| iteratively figure it out.
|
| You need to have automated tests. For enterprise software
| development that actually goes without saying.
| dclowd9901 wrote:
| It also steps right over easy optimizations. I was doing a
| query on some github data (tedious work) and rather than
| preliminarily filter down using the graphql search method, it
| wanted to comb through all PRs individually. This seems like
| something it probably should have figured out.
| amelius wrote:
| It is very useful for small tasks like fixing network problems,
| or writing regexp patterns based on a few examples.
| MarcelOlsz wrote:
| _Here's how YOU can save $200/mo!_
| brulard wrote:
| For me it was like this for like a year (using Cline + Sonnet &
| Gemini) until Claude Code came out and until I learned how to
| keep context real clean. The key breakthrough was treating AI
| as an architect/implementer rather than a code generator.
|
| Most recently I first ask CC to create a design document for
| what we are going to do. He has instructions to look into the
| relevant parts of the code and docs to reference them. I review
| it, and after a few back-and-forths we have defined what we want
| to do.
| Next step is to chunk it into stages and even those to smaller
| steps. All this may take few hours, but after this is well
| defined, I clear the context. I then let him read the docs and
| implement one stage. This goes mostly well and if it doesn't I
| either try to steer him to correct it, or if it's too bad, I
| improve the docs and start this stage over. After stage is
| complete, we commit, clear context and proceed to next stage.
|
| This way I spend maybe a day creating a feature that would take
| me maybe 2-3. And at the end we have a document, unit tests,
| storybook pages, and features that get overlooked like
| accessibility, aria-things, etc.
|
| At the very end I like another model to make a code review.
|
| Even if this didn't make me faster now, I would consider it
| future-proofing myself as a software engineer as these tools
| are improving quickly
| imiric wrote:
| This is a common workflow that most advanced users are
| familiar with.
|
| Yet even following it to a T, and being _really_ careful with
| how you manage context, the LLM will still hallucinate,
| generate non-working code, steer you into wrong directions
| and dead ends, and just waste your time in most scenarios.
| There 's no magical workflow or workaround for avoiding this.
| These issues are inherent to the technology, and have been
| since its inception. The tools have certainly gotten more
| capable, and the ecosystem has matured greatly in the last
| couple of years, but these issues remain unsolved. The idea
| that people who experience them are not using the tools
| correctly is insulting.
|
| I'm not saying that the current generation of this tech isn't
| useful. I've found it very useful for the same scenarios GP
| mentioned. But the above issues prevent me from relying on it
| for anything more sophisticated than that.
| brulard wrote:
| > These issues are inherent to the technology
|
| That's simply false. Even if LLMs don't produce correct and
| valid code on first shot 100% times of the cases, if you
| use an agent, it's simply a matter of iterations. I have
| claude code connected to Playwright, context7 for docs and
| to Playwright, so it can iterate by itself if there are
| syntax errors, runtime errors or problems with the data on
| the backend side. Currently I have near zero cases when it
| does not produce valid working code. If it is incorrect in
| some aspect, it is then not that hard to steer it to better
| solution or to fix yourself.
|
| And even if it failed in implementing most of these stages
| of the plan, it's not all wasted time. I brainstormed
| ideas, formed the requirements and feature specifications,
| and have clear documentation, a plan of the
| implementation, unit tests, etc., and I can use those to
| code it myself. So even in the worst case scenario my
| development workflow is improved.
| mathiaspoint wrote:
| It definitely isn't. LLMs often end up stuck in weird
| corners they just don't get and need someone familiar
| with the theory of what they're working on to unstick
| them. If the agent is the same model as the code
| generator it won't be able to on its own.
| sawjet wrote:
| Skill issue
| brulard wrote:
| I was getting into a stuck state with Gemini and to a lesser
| extent with Sonnet 4, but my cases were resolved by Opus.
| I think it is mostly due to the size of the task; if you
| split it in advance into smaller chunks, all these models
| have a much higher probability of resolving it.
| nojs wrote:
| Could you explain your exact playwright setup in more
| detail? I've found that claude really struggles to end-
| to-end test complex features that require browser use. It
| gets stuck for several minutes trying to find the right
| button to click for example.
| brulard wrote:
| No special setup, just something along the lines of "test
| with playwright" in the process list. It can get stuck, but
| for me it was not often enough for me to care. If it
| happens, I push it in the right direction.
| john-tells-all wrote:
| I've seen this referred to as Chain of Thought. I've used it
| with great success a few times.
|
| https://martinfowler.com/articles/2023-chatgpt-xu-hao.html
| aatd86 wrote:
| For me it's the opposite. As long as I ask for small tasks,
| or error checking, it can help. But I'd rather think of the
| overall design myself because I tend to figure out corner
| cases or superlinear complexities much better. I develop
| better mental models than the NNs. That's somewhat of a
| relief.
|
| Also the longer the conversation goes, the less effective it
| gets. (saturated context window?)
| brulard wrote:
| I don't think that's the opposite. I have an idea what I
| want and to some extent how I want it to be done. The
| design document starts with a brainstorming where I throw
| all my ideas at the agent and we iterate together.
|
| > Also the longer the conversation goes, the less effective
| it gets. (saturated context window?)
|
| Yes, this is exactly why I said the breakthrough came for
| me when I learned how to keep the context clean. That means
| multiple times in the process I ask the model to put the
| relevant parts of our discussion into an MD document, I may
| review and edit it and I reset the context with /clear.
| Then I have him read just the relevant things from MD docs
| and we continue.
| ramshanker wrote:
| Same here. A small variation: I explicitly use the website to
| manage what context it gets to see.
| brulard wrote:
| What do you mean by website? An HTML doc?
| ramshanker wrote:
| I mean the website of AI providers. chatgpt.com ,
| gemini.google.com , claude.ai and so on.
| spaceywilly wrote:
| I've had more success this way as well. I will use the
| model via web ui, paste in the relevant code, and ask it
| to implement something. It spits out the code, I copy it
| back into the ide, and build. I tried Claude Code but I
| find it goes off the rails too easily. I like the chat
| through the UI because it explains what it's doing like a
| senior engineer would
| brulard wrote:
| Well, this is the way we have been able to do it for 2
| years already, but basically you are acting as the
| transport layer for the process, which cannot be
| efficient. If you really want tight control of what
| exactly the LLM sees, then that's still an option. But you
| only get so far with this approach.
| drums8787 wrote:
| My experience is the opposite I guess. I am having a great time
| using claude to quickly implement little "filler features" that
| require a good amount of typing and pulling from/editing
| different sources. Nothing that requires much brainpower beyond
| remembering the details of some sub system, finding the right
| files, and typing.
|
| Once the code is written, review, test and done. And on to more
| fun things.
|
| Maybe what has made it work is that these tasks have all fit
| comfortably within existing code patterns.
|
| My next step is to break down bigger & more complex changes
| into claude friendly bites to save me more grunt work.
| unlikelytomato wrote:
| I wish I shared this experience. There are virtually no
| filler features for me to work on. When things feel like
| filler on my team, it's generally a sign of tech debt and we
| wouldn't want to have it generate all the code it would take.
| What are some examples of filler features for you?
|
| On the other hand, it does cost me about 8 hours a week
| debugging issues created by bad autocompletes from my team.
| The last 6 months have gotten really bad with that. But that
| is a different issue.
| apimade wrote:
| Many who say LLMs produce "enterprise-grade" code haven't
| worked in mid-tier or traditional companies, where projects are
| held together by duct tape, requirements are outdated, and
| testing barely exists. In those environments, enterprise-ready
| code is rare even without AI.
|
| For developers deeply familiar with a codebase they've worked
| on for years, LLMs can be a game-changer. But in most other
| cases, they're best for brainstorming, creating small tests, or
| prototyping. When mid-level or junior developers lean heavily
| on them, the output may look useful.. until a third-party
| review reveals security flaws, performance issues, and built-in
| legacy debt.
|
| That might be fine for quick fixes or internal tooling, but
| it's a poor fit for enterprise.
| bityard wrote:
| I work in the enterprise, although not as a programmer, but I
| get to see how the sausage is made. And describing code as
| "enterprise grade" would not be a compliment in my book. Very
| analogous to "contractor grade" when describing home
| furnishings.
| Aeolun wrote:
| Umm, Claude Code is a lot better than a lot of enterprise
| grade code I see. And it actually learns from mistakes with a
| properly crafted instruction xD
| cube00 wrote:
| >And it actually learns from mistakes with a properly
| crafted instruction
|
| ...until it hallucinates and ignores said instruction.
| typpilol wrote:
| I've found having a ton of linting tools can help the AI
| write much better and more secure code.
|
| My eslint config is a mess but the code it writes comes out
| pretty good. Although it takes a few iterations after the
| lint errors pop up for it to rewrite things, the code it
| writes is way better.
| therealpygon wrote:
| I mostly agree, with the caveat that I would say it can
| certainly be useful when used appropriately as an "assistant".
| NOT vibe coding blindly and hoping what I end up with is
| useful. "Implement x specific thing" (e.g. add an edit button
| to component x), not "implement a whole new epic feature that
| includes changes to a significant number of files". Imagine
| meeting a house builder and saying "I want a house", then
| leaving and expecting to come back to exactly the house you
| dreamed of.
|
| I get why, it's a test of just how intuitive the model can be
| at planning and execution which drives innovation more than 1%
| differences in benchmarks ever will. I encourage that
| innovation in the hobby arena or when dogfooding your AI
| engineer. But as a replacement developer in an enterprise where
| an uncaught mistake could cost millions? No way. I wouldn't
| even want to be the manager of the AI engineering team, when
| they come looking for the only real person to blame for the
| mistake not being caught.
|
| For additional checks/tasks as a completely extra set of eyes,
| building internal tools, and for scripts? Sure. It's incredibly
| useful with all sorts of non-application development tasks.
| I've not written a batch or bash script in forever... you just
| don't really need to much anymore. The linear flow of most
| batch/bash scripts (like you mentioned) couldn't be a more
| suitable domain.
|
| Also, with a basic prompt, it can be an incredibly useful
| rubber duck. For example, I'll say something like "how do you
| think I should solve x problem" (with tools for the codebase
| and such, of course), and then over time, having rejected and
| been adversarial to every suggestion, I end up working through
| the problem and have a more concrete mental design. Think
| "over-eager junior know-it-all that tries to be right
| constantly" without the person attached, and you get a better
| idea of the kind of LLM output you can expect, including
| following false leads to test your ideas. For me it's less
| about wanting a plan from the LLM and more about talking
| through the problems I think my plan could solve better, when
| more things are considered outside the LLM's direct knowledge
| or access.
|
| "We can't do that, changing X would break Y external process
| because Z. Summarize that concern into a paragraph to be added
| to the knowledge base. Then, what other options would you
| suggest?"
| tonyhart7 wrote:
| it depends on the model, but Sonnet is more than capable of
| enterprise code
|
| when you're stuck with Claude doing dumb shit, it's because you
| didn't give the model enough context to know the system better
|
| after adopting spec-driven development, working with an LLM in
| a large codebase is so much easier than without it; it's a
| heaven-and-hell difference
|
| but it also increases token costs exponentially, so there's
| that
| hoppp wrote:
| I used it with Typescript and Go, SQL, Rust
|
| Using it with Rust is just horrible imho. Lots and lots of
| errors; I can't wait to be done with this Rust project already.
| But the project itself is quite complex
|
| Go on the other hand is super productive, mainly because the
| language is already very simple. I can move 2x as fast
|
| Typescript is fine, I use it for react components and it will
| do animations I'm too lazy to do...
|
| SQL and postgresql are fine, I could do it without it too, I
| just don't like writing stored functions because of the
| boilerplatey syntax; a little speedup saves me from carpal
| tunnel
| epolanski wrote:
| I really find your experience strikingly different from mine;
| I'll share my flow:
|
| - step A: ask the AI to write a featureA-requirements.md file
| at the root of the project. I give it a general description of
| the task, then have it ask me as many questions as possible to
| refine user stories and requirements. It generally comes up
| with a dozen or more questions, several of which I wouldn't
| have thought of and would only have found out about much later.
| Time: between 5 and 40 minutes. It's very detailed.
|
| - step B: after we refine the requirements (functional and
| non-functional) we write a todo plan together as
| featureA-todo.md. I refine the plan again; this is generally
| shorter than the requirements and I'm usually done in less
| than 10 minutes.
|
| - step C: implementation phase. Again the AI does most of the
| job; I correct it at each edit and point out flaws. Are there
| cases where I would've done it faster myself? Maybe. I can
| still jump into the editor and make the changes I want. This
| step generally includes comprehensive tests for all the
| requirements and edge cases we found in step A: functional,
| integration, and E2E. This really varies, but it is highly
| tied to the quality of phases A and B. It can be as little as
| a few minutes (especially when we indeed come up with the most
| effective plan) and as much as a few hours.
|
| - step D: documentation and PR description. With all of this
| context (in the requirements and todos), updating any relevant
| documentation and writing the PR description at this point is
| a very short exercise.
|
| In all of that: I have textual files with precise coding style
| guidelines, comprehensive readmes to give precise context, etc
| that get referenced in the context.
|
| Bottom line: you might be doing something profoundly wrong,
| because in my case, all of this planning, requirements
| gathering, testing, documenting etc is pushing me to deliver a
| much higher quality engineering work.
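|
| A rough sketch of the two-file scaffold behind steps A and B
| (the filenames come from the flow above; the section headings
| are just illustrative, not a fixed template):
|
|     from pathlib import Path
|
|     feature = "featureA"
|     req = Path(f"{feature}-requirements.md")
|     todo = Path(f"{feature}-todo.md")
|
|     # Step A skeleton: filled in with the AI during the Q&A
|     # round, then treated as the source of truth.
|     req.write_text(
|         f"# {feature} requirements\n\n"
|         "## Task description\n\n"
|         "## Open questions\n\n"
|         "## Functional requirements\n\n"
|         "## Non-functional requirements\n\n"
|         "## Edge cases\n")
|
|     # Step B skeleton: refined into an ordered implementation
|     # plan that step C works through item by item.
|     todo.write_text(f"# {feature} todo\n\n- [ ] ...\n")
|
|     print(f"Created {req} and {todo}")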
| mcintyre1994 wrote:
| You'd probably like Kiro, it seems to be built specifically
| for this sort of spec-driven development.
| epolanski wrote:
| How would it be better than what I'm doing with Claude?
| TZubiri wrote:
| Remember kids, just because you CAN doesn't mean you SHOULD
| mrcwinn wrote:
| This tells me they've gotten very good at caching and modeling
| the impact of caching.
| fpauser wrote:
| I observed that Claude produces a lot of bloat. I wonder how
| such LLM-generated projects age.
| cadamsdotcom wrote:
| I'm glad to see the only company chasing margins - which they get
| by having a great product and a meticulous brand - finding even
| more ways to get margin. That's good business.
| howinator wrote:
| I could be wrong, but I think this pricing is the first to admit
| that cost scales quadratically with number of tokens. It's the
| first time I've seen nonlinear pricing from an LLM provider which
| implicitly mirrors the inference scaling laws I think we're all
| aware of.
| jpau wrote:
| Google[1] also has a "long context" pricing structure. OpenAI
| may be considering offering similar since they do not offer
| their priority processing SLAs[2] for context >128K.
|
| [1] https://cloud.google.com/vertex-ai/generative-ai/pricing
|
| [2] https://openai.com/api-priority-processing/
| energy123 wrote:
| Is this marginal pricing, or do your total costs double if you
| go from 200,000 to 200,001 tokens?
| reverseblade2 wrote:
| Does this cover subscription?
| anonym29 wrote:
| API only for now, but at the very bottom of the post: "We're
| also exploring how to bring long context to other Claude
| products."
|
| So, not yet, but maybe someday?
| _joel wrote:
| Fantastic, use up your quota even more quickly. :)
| phyzix5761 wrote:
| What I've found with LLMs is they're basically a better version
| of Google Search. If I need a quick "How do I do..." or if I need
| to find a quick answer to something, it's way more useful than
| Google, and the fact that I can ask follow-up questions is
| amazing. But for any serious deep work it has a long way to go.
| mr_moon wrote:
| I feel exactly the same way. why skim and sift 15 different
| stackoverflow posts when an LLM can pick out exactly the info I
| need?
|
| I don't need to spin up an entire feature in a few seconds. I
| need help understanding where something is broken; what are
| some opinions on best practice; or finding out what a poorly
| written snippet is doing.
|
| context still v important for this though and I appreciate
| cranking that capacity. "read 15000 stackoverflow posts for me
| please"
| anvuong wrote:
| The act of sifting through poop to find gold actually develops
| my critical thinking skills. I, too, went through a phase of
| just asking an LLM about a specific concept instead of Googling
| it and weaving through dozens of wiki pages or niche mailing
| list discussions. It did improve my productivity, but I feel
| like it dulls my brain. So recently I've had to tone that down
| and force myself to go back to the old way. Maybe too much of a
| good thing is bad.
| Whatarethese wrote:
| This is my primary use of AI. Looking for a new mountain bike
| and using AI to list and compare parts of the bike and which is
| best for my use case scenario. Works pretty well so far.
| throawaywpg wrote:
| Google always planned search to be just a stopgap
| meander_water wrote:
| I like to spend a lot of time in "Ask" mode in Cursor. I guess
| the equivalent in Claude code is "plan" mode.
|
| Where I have minimal knowledge about the framework or language, I
| ask a lot of questions about how the implementation would work,
| what the tradeoffs are etc. This is to minimize any
| misunderstanding between me and the tool. Then I ask it to write
| the implementation plan, and execute it one by one.
|
| Cursor lets you have multiple tabs open, so I'll have an Ask
| mode and an Agent mode running in parallel.
|
| This is a lot slower, and if it was a language/framework I'm
| familiar with I'm more likely to execute the plan myself.
| itissid wrote:
| My experience with Claude Code for building anything bigger
| than a webpage, a small API, a tutorial on CSS, etc. has been
| pretty bad. I think context length is a manageable problem, but
| not the main one. I used it to write a 50K LoC Python codebase
| with 300 unit tests, and it went OK for the first few weeks and
| then it failed. This is despite there being a CLAUDE.md file
| for every single module that needs one, as well as detailed
| agents for testing, design, coding and review.
|
| I won't go into a case-by-case list of its failures. The core
| of the issue is misaligned incentives, which I want to get
| into:
|
| 1. The incentive for coding agents in general, and Claude in
| particular, is writing LOTS of code. None of them are good at
| planning and verification.
|
| 2. The involvement of the human, ironically, in a haphazard
| way in the agent's process. This has to do with how the problem
| of coding is defined for these agents. Human developers are
| like snowflakes when it comes to opinions on software design;
| there is no way to apply each one's preferences (except the
| papier-mache and superglue of SO, Reddit threads and books) to
| the design of the system in any meaningful way, and that makes
| a simple system way too complex, or a complex problem
| simplistic.
| - There is no way to evolve the plan to accept new preferences
| except text in a CLAUDE.md file in git that you will have to
| read through and edit.
| - There is no way to know the near-term effect of code choices
| now on one week from now.
| - So much code is written that asking a person to review it,
| when you are at the envelope and pushing the limit, feels
| morally wrong and an insane ask. How many of your code reviews
| are instead replaced by 15-30 min design meetings to solicit
| feedback on the design of the PR -- because it is so complex --
| before just pushing the PR into dev? WTF am I even doing, I
| wonder.
| - It does not know how far to explore for better rewards and
| does not know them from local rewards, resulting in
| commented-out tests and deleted arbitrary code to make its plan
| "work".
|
| In short, code is a commodity for the CEOs of coding agent
| companies and the CXOs of your company (the sales force has
| everyone coding, but that just raises the floor, and that's a
| good thing; it does NOT lower the bar and make people 10x
| devs). All of them have bought into the idea that 10x is
| somehow producing 10x code. Your time reviewing, unmangling and
| maintaining the code is not the commodity. It never ever was.
| lpa22 wrote:
| One of the most helpful usages of CC so far is when I simply ask:
|
| "Are there any bugs in the current diff"
|
| It analyzes the changes very thoroughly, often finds very subtle
| bugs that would cost hours of time/deployments down the line, and
| points out a bunch of things to think through for correctness.
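|
| A throwaway way to script that same check (assuming Claude
| Code's non-interactive "claude -p" print mode, which reads the
| piped diff from stdin):
|
|     import subprocess
|
|     # Grab the current working-tree diff.
|     diff = subprocess.run(["git", "diff"], capture_output=True,
|                           text=True, check=True).stdout
|
|     if diff.strip():
|         # Hand it to Claude with the same question as above.
|         review = subprocess.run(
|             ["claude", "-p",
|              "Are there any bugs in the current diff?"],
|             input=diff, capture_output=True, text=True,
|             check=True).stdout
|         print(review)
|     else:
|         print("No unstaged changes to review.")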
| swyx wrote:
| maybe want to reify that as a claude code hook!
| bertil wrote:
| That matches my experience with non-coding tasks: it's not very
| creative, but it's a comprehensive critical reader.
| neucoas wrote:
| I am trying this tomorrow
| lpa22 wrote:
| Let me know how it goes. It's a game changer
| KTibow wrote:
| I'm surprised that works even without telling it to think/think
| hard/think harder/ultrathink.
| GuB-42 wrote:
| I added this to my toolbox in addition to traditional linters.
|
| My experience is that it is about 10% harmful, 80% useless and
| 10% helpful. Which is actually great, the 10% is worth it, but
| it is far from a hands off experience.
|
| By harmful I mean something like suggesting a wrong fix to
| code that works; it usually happens when I am doing something
| unusual or counterintuitive, for example having a function
| "decrease_x" that (correctly) adds 1 to x. It may hint that
| better documentation is needed, but you have to be careful not
| to go on autopilot and just do what it says.
|
| By useless I mean something like "you didn't check for null"
| even though the variable can't be null or is passed to a
| function that handles the "null" case gracefully. In general,
| it tends to be overly defensive and following the
| recommendations would lead to bloated code.
|
| By helpful I mean finding a real bug. Most of them minor, but
| for some, I am glad I did that check.
|
| LLMs complement traditional linters well, but they don't
| replace them.
| csomar wrote:
| > it usually happens when I am doing something unusual or
| counter intuitive,
|
| That's usually your signal that your code needs refactoring.
| GuB-42 wrote:
| I wouldn't say it needs refactoring. Maybe more
| documentation, or some work on naming. But I believe that
| code you write has to be at least a bit unusual.
|
| Every project worth making is unique. Otherwise, why not
| use something off the shelf?
|
| For example, let's say you want to shuffle songs for a
| music player, you write your shuffling algorithm and it is
| "wrong", but there is a reason it is "wrong": it better
| matches the expectations of the user than a truly random
| shuffle. An LLM trained on thousands of truly random
| shuffles may try to "fix" your code, but that is actually the
| worst thing you can do. That "wrong" shuffle is the reason
| why you wrote that code in the first place, the "wrongness"
| is what adds value. But now, imagine that you realize that
| a true random shuffle is actually the way to go, then
| "fixing" your code is not what you should do either,
| instead, you should delete it and use the shuffle function
| your standard library offers.
|
| The unusual/unique/surprising parts of your code are where
| the true value is, and if there is none of that in your
| codebase, maybe you are just reinventing the wheel. Now, if
| an LLM trips over these parts, maybe you need some
| documentation, as a way to tell both the LLM and a human
| reading that part that it is something you should pay
| attention to. I am not a fan of comments in general, but
| that's where they are useful: explaining why you wrote that
| weird code, something along the lines of "I know it is not
| the correct algorithm, but users prefer it that way".
| Vegenoid wrote:
| While this can be true, I think a lot of people are far too
| eager to say that because someone is doing something in an
| unusual way, it's probably wrong. Not everything fits the
| cookie cutter model, there is tons of software written for
| all kinds of purposes. Suggesting that they're writing code
| wrong when someone says "an LLM struggles with my unusual
| code", when we aren't actually looking at the code and the
| context, is not helpful.
| elcritch wrote:
| Recently I realized you can add Github Copilot as a reviewer to
| a PR. It's surprisingly handy and found a few annoying typos
| and one legit bug mostly from me forgetting to update another
| field.
| aurareturn wrote:
| I do the same with Github Copilot after every change.
|
| I work with a high stakes app and breaking changes cause a ton
| of customer headaches. LLMs have been excellent at catching
| potential little bugs.
| hahn-kev wrote:
| We use Code Rabbit for this in our open source project.
| i_have_an_idea wrote:
| While this is cool, can anything be done about the speed of
| inference?
|
| At least for my use, 200K context is fine, but I'd like to see a
| lot faster task completion. I feel like more people would be OK
| with the smaller context if the agent acts quickly (vs waiting
| 2-3 mins per prompt).
| jeffhuys wrote:
| There's work being done in this field - I saw a demo using the
| same method as stable diffusion, but for text. It was extremely
| fast (3 pages of text in like a second). It'll come.
| wahnfrieden wrote:
| Meanwhile the key is to become proficient at using worktrees to
| parallelize agents instead of working serially with them
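|
| A minimal sketch of what that looks like in practice (the
| branch names are made up; each worktree gets its own agent
| session):
|
|     import subprocess
|
|     attempts = ["attempt-1", "attempt-2", "attempt-3"]
|
|     for branch in attempts:
|         # One isolated checkout per attempt, sharing one repo.
|         subprocess.run(["git", "worktree", "add",
|                         f"../{branch}", "-b", branch],
|                        check=True)
|         print(f"Run a separate agent session in ../{branch}")
|
|     # Later: keep the best branch and clean up the rest, e.g.
|     #   git worktree remove ../attempt-2
|     #   git branch -D attempt-2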
| i_have_an_idea wrote:
| Sounds nice, in theory, but in practice I want to iterate on
| one, perhaps, two tasks at a time, and keep a good
| understanding of what the agent is doing, so that I can
| prevent it from going off the rails, making bad decisions and
| then building on them even further.
|
| Worktrees and parallel agents do nothing to help me with
| that. It's just additional cognitive load.
| maxnevermind wrote:
| Does a very large context significantly increase response time?
| Are there any benchmarks/leader-boards estimating different
| models in that regard?
| hoppp wrote:
| So I can upload 1M tokens per prompt but pay $3 per 1M input
| tokens?
|
| It's really expensive to use.
| Aeolun wrote:
| Only the first time. After that it's $0.30 per 1M input tokens
| (cached).
| psyclobe wrote:
| Isn't Opus better? Whenever I run out of Opus tokens and get
| kicked down to Sonnet it's quite a shock sometimes.
|
| But man, I'm at the perfect stage in my career for these
| tools. I know a lot, I understand a lot, I have a _lot_ of
| great ideas - but I'm getting kinda tired of hammering out code
| all day long. Now with Claude I am just busting ass executing
| on all these ideas and tests and fixes - never going back!
| Aeolun wrote:
| Haha, I think I recognize that. I'm just worried my actual
| skills will atrophy while I use Claude Code like I'm a manager
| on steroids.
| stavros wrote:
| By definition the thing that atrophies is the thing you never
| need to use.
| as367 wrote:
| That is an unfortunate logo.
| wiseowise wrote:
| This is something that I wish I would unremember.
| tomsanbear wrote:
| I just want a better way to invalidate old context... It's
| great that I can fit more context, but the main challenge is
| Claude getting sidetracked with 10 invalid grep calls, pytest
| dumping a 10k-token stack trace, etc. And yes, the ability to
| go back in time via esc+esc is great, but I want Claude to read
| the error stack, learn from it, and purge it from its context,
| or at least let me lobotomize it selectively... Learning from
| and then discarding the raw output of tool calls feels like the
| missing piece here still.
| typpilol wrote:
| Vscode recently rolled out checkpoints where you can go back to
| a previous state of the conversation. But it's still not
| enough.
|
| We honestly need to be able to see exactly what's in the
| context and be able to manually edit it.
| aledalgrande wrote:
| I hope they are going to put something in Claude Code to show
| when you're entering the expensive window. Sometimes I just
| keep the conversation going; I wouldn't want that to burn my
| Max credits 2x faster.
| terminalshort wrote:
| Yeah, that 1M tokens is a $15 (IIRC) API call. That's gonna
| add up quick! My favorite hypothetical AI failure scenario is
| that LLM agents eventually achieve human level general
| intelligence, but have to burn so many tokens to do it that
| they actually become more expensive than a human.
| k9294 wrote:
| I believe Claude Code uses caching aggressively, so these 1M
| tokens will be 90% discounted - or am I missing something?
| socrateslee wrote:
| It's like double "double the dose"
| forgingahead wrote:
| Wish it was on the web app as well!
| williamtrask wrote:
| Claude is down.
|
| EDIT: for the moment... it supports 0 tokens of context xD
| nojs wrote:
| Currently the quality seems to degrade long before the context
| limit is reached, as the context becomes "polluted".
|
| Should we expect the higher limit to also increase the practical
| context size proportionally?
| m13rar wrote:
| This is amazing, shout out to Anthropic for doing this. I would
| like a Claude model that is not nerfed with ethics and values
| to please the user, and that doesn't write overly large plans
| to impress the user.
| elcritch wrote:
| I'm finding GPT-5 to be more succinct and on par with Claude
| Code so far. They've really toned down the obsequiousness.
| truelson wrote:
| We do know Parkinson's Law (
| https://en.m.wikipedia.org/wiki/Parkinson%27s_law ) will apply to
| all this, right?
| simon_rider wrote:
| feels like we just traded "not enough context" for "too much
| noise." The million-token window is cool for marketing, but until
| retrieval and summarization get way better, it's like dumping the
| entire repo on a junior dev's desk and saying "figure it out."
| They'll spend half their time paging through irrelevant crap, and
| the other half hallucinating connections. Bigger context is only
| a net win if the model can filter, prioritize, and actually
| reason over it
| whalesalad wrote:
| I can't tell you the number of times I had almost reached
| utopia only to hit compaction limits. Post-compaction I am
| usually dead in the water and the spiraling or repetition
| begins. Claude has a hard time compacting/remembering/flagging
| a-ha moments from the session: stuff that is important in the
| context of the task, but not appropriate for CLAUDE.md, for
| instance. I have been thinking for months that if the context
| window were 2-3x larger, I would be unstoppable. So happy for
| this change, and excited to test it this week.
| MagicMoonlight wrote:
| It's a stupid metric because nothing in the real world has half a
| million words of context. So all they're doing is feeding it
| imagined slop, or sticking together random files.
| zaptrem wrote:
| It's useful for hours-long long-context debugging sessions in
| Claude Code, etc.
| cintusshied wrote:
| For folks using LLMs for big coding projects, what's your go-to
| workflow for deciding which parts of the codebase to feed the
| model?
| aitchnyu wrote:
| Aider automatically makes an outline of your whole codebase,
| which takes a fraction of the tokens of the real files.
|
| https://aider.chat/docs/repomap.html
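|
| Not Aider's actual implementation (per its docs it parses
| files with tree-sitter and ranks the symbols), but the basic
| idea of a signatures-only outline can be sketched like this
| for Python sources; "src" is a placeholder for your repo root:
|
|     import ast
|     from pathlib import Path
|
|     # Walk the repo, keeping only class/function signatures.
|     lines = []
|     for path in sorted(Path("src").rglob("*.py")):
|         tree = ast.parse(path.read_text(), filename=str(path))
|         lines.append(f"{path}:")
|         for node in ast.walk(tree):
|             if isinstance(node, ast.ClassDef):
|                 lines.append(f"  class {node.name}")
|             elif isinstance(node, ast.FunctionDef):
|                 args = ", ".join(a.arg for a in node.args.args)
|                 lines.append(f"  def {node.name}({args})")
|
|     # This outline stands in for the full source files as
|     # context, at a fraction of the token count.
|     print("\n".join(lines))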
| mrcwinn wrote:
| I wish they'd fix other things faster. Still can't upload an
| Excel file in the iOS app, even with analyst mode enabled. The
| voice mode feels like it's 10 years behind OpenAI (no realtime,
| for example). And Opus 4.1 still occasionally goes absolutely
| mental and provides much worse analysis than o3 or GPT5-thinking.
|
| Rooting for Anthropic. Competition in this space is good.
|
| I watched an interview with Dario recently where he said he
| wasn't a "product guy", and it really shows.
| cognix_dev wrote:
| To be honest, I am not particularly interested in whether the
| next model is better than the previous one. Rather than being
| superior, it is important that it maintains context and has
| enough memory capacity to not interfere with work. I believe that
| is what makes the new model competitive.
| CodeCompost wrote:
| In Visual Studio as well?
| omlelelom_kimox wrote:
| HM
| throwmeaway222 wrote:
| Why do I get the feeling that HN devs on here want to just feed
| it their entire folder, source, binaries everything and have it
| make changes in seconds.
| Roark66 wrote:
| I noticed the quality of answers degrades horribly beyond a
| few thousand tokens. Maybe 10k. Is anyone actually successfully
| using these 100k+ token contexts for anything?
| nprateem wrote:
| Anyone else found Claude has become hugely more stupid recently?
|
| It used to always pitch answers at the right level, but recently
| it just seems to have left its common sense at the door. Gemini
| just gives much better answers for non-technical questions now.
| amelius wrote:
| 1M?
|
| 640K ought to be enough for anybody ... right?
| doppelgunner wrote:
| Great! Now we can use AI to read and think like a specific
| "book".
| hassadyb wrote:
| I personally use it in my coding tasks; it's an amazing and
| powerful LLM.
| muzani wrote:
| Of course Bolt is the customer spotlight. These vibe coding
| tools chuck the entire codebase into the context and charge for
| the tokens used. At around 10k lines of code, these apps could
| no longer fit.
| ndkap wrote:
| Does anybody know which technology most of these companies that
| support 1M tokens use? Or is it all hidden?
| t43562 wrote:
| I think this highlights some problems with software development
| in general. i.e. the code isn't enough - you need to have domain
| knowledge too and a lot of knowledge about how and why the
| company needs things done in some way or another. You might
| imagine that dumping the contents of your wiki and all your chat
| channels into some sort of context might do it but that would
| miss the 100s of verbal conversations between people in the
| company. It would also fall foul of the way everything tends to
| work in any way you can imagine except what the wiki says.
|
| Even if you transcribed all the voice chats and meetings and
| added it in, it challenges a human to work out what is going on.
| No-context human developers are pretty useless too.
___________________________________________________________________
(page generated 2025-08-13 23:01 UTC)