hngopher.com

       [HN Gopher] Can LLMs write better code if you keep asking them t...
       ___________________________________________________________________
        
       Can LLMs write better code if you keep asking them to "write better
       code"?
        
       Author : rcarmo
       Score  : 570 points
       Date   : 2025-01-03 10:30 UTC (12 hours ago)
        
 (HTM) web link (minimaxir.com)
 (TXT) w3m dump (minimaxir.com)
        
       | cranberryturkey wrote:
       | its best to tell them how you want the code written.
        
         | protomolecule wrote:
         | What if I want to discover a new better way to write code?
        
           | cpursley wrote:
           | You can ask it for alternative methods and even to document
           | their pros and cons.
        
         | mrweasel wrote:
         | At that point isn't it starting to become easier to just write
         | the code yourself? If I somehow have to formulate how I want a
         | problem solved, then I've already done all the hard work
         | myself. Having the LLM just do the typing of the code means
         | that now not only did I have to solve the problem, I also get
         | to do a code review.
        
           | weego wrote:
           | Yes the fallacy here is that AI will replace eingineers any
           | time soon. For the foreseeable future prompts will need to be
           | written and curated by people who already know how to do it,
           | but will just end up describing it in increasingly complex
           | detail and then running tests against it. Doesn't sound like
           | a future that has that many benefits to anyone.
        
           | viraptor wrote:
           | There's no clear threshold with an universal answer.
           | Sometimes prompting will be easier, sometimes writing things
           | yourself. You'll have to add some debugging time to both
           | sides in practice. Also, you can be opportunistic - you're
           | going to write a commit anyway, right? A good commit message
           | will be close to the prompt anyway, so why not start with
           | that and see if you want to write your own or not?
           | 
           | > I also get to do a code review.
           | 
           | Don't you review your own code after some checkpoint too?
        
             | fragmede wrote:
             | why leave the commit message for the human to write? have
             | the LLM start off and add relevant details it missed.
        
               | viraptor wrote:
               | Because the commit message is pure signal. You can
               | reformat it or as useless info, but otherwise, generating
               | it will require writing it. Generating it from code is a
               | waste, because you're trying to distil that same signal
               | from messy code.
        
           | williamcotton wrote:
           | Spend your cognitive energy thinking about the higher level
           | architecture, test cases and performance concerns rather than
           | the minutia and you'll find that you can get more work done
           | with the less overall mental load.
           | 
           | This reduction in cognitive load is the real force
           | multiplier.
        
           | _flux wrote:
           | Personally I found it quite fun to give specification and
           | have ChatGPT find me a Python code that implements it:
           | https://chatgpt.com/share/6777debc-
           | eaa4-8011-81c5-35645ae433... . Or the additional polygon edge
           | smoothing code: https://chatgpt.com/share/6773d634-de88-8011-
           | acf8-e61b6b913f...
           | 
           | Sure, the green screen code didn't work exactly as I wished,
           | but it made use of OpenCV functions I was not aware of and it
           | was quite easy to make the required fixes.
           | 
           | In my mind it is exactly the opposite: yes, I've already done
           | the hard work of formulating how I want the problem solved,
           | so why not have the computer do the busywork of writing the
           | code down?
        
       | winternewt wrote:
       | The more interesting question IMO is not how good the code can
       | get. It is what must change for the AI to attain the
       | introspective ability needed to say "sorry, I can't think of any
       | more ideas."
        
         | wongarsu wrote:
         | You should get decent results by asking it to do that in the
         | prompt. Just add "if you are uncertain, answer I don't know" or
         | "give the answer or say I don't know" or something along those
         | lines
         | 
         | LLM are far from perfect at knowing their limits, but they are
         | better at it than most people give them credit for. They just
         | never do it unless prompted for it.
         | 
         | Fine tuning can improve that ability. For example the thinking
         | tokens paper [1] is at some level training the model to output
         | a special token when it doesn't reach a good answer (and then
         | try again, thus "thinking")
         | 
         | 1: https://arxiv.org/abs/2405.08644
        
         | Hendrikto wrote:
         | The problem is, they do not think.
        
           | anonzzzies wrote:
           | So, like many people then? Many people are even not at the
           | level of llms but more like markov chains.
        
       | dash2 wrote:
       | Something major missing from the LLM toolkit at the moment is
       | that it can't actually run (and e.g. test or benchmark) its own
       | code. Without that, the LLM is flying blind. I guess there are
       | big security risks involved in making this happen. I wonder if
       | anyone has figured out what kind of sandbox could safely be
       | handed to a LLM.
        
         | AndrewOMartin wrote:
         | It'd be great if it could describe the performance of code in
         | detail, but for now just adding a skill to detect if a bit of
         | code has any infinite loops would be a quick and easy hack to
         | be going on with.
        
           | mhmmmmmm wrote:
           | Is reliably detecting if code has any infinite loops
           | feasible? Sounds like the halting problem.
        
             | Hendrikto wrote:
             | Not in the general case, but you could detect specific
             | common patterns.
        
             | IanCal wrote:
             | It depends how you define reliably.
             | 
             | The halting problem isn't so relevant in most development,
             | and nothing stops you having a classifier that says "yes",
             | "no" or "maybe". You can identify code that definitely
             | finishes, and you can identify code that definitely
             | doesn't. You can also identify some _risky_ code that
             | probably might. Under condition X, it would go into an
             | infinite loop - even if you 're not sure if condition X can
             | be met.
        
               | withinboredom wrote:
               | The problem is that you can do this for specific
               | functions/methods, but you cannot do this for a PROGRAM.
               | All programs are "maybe", by definition. You want it to
               | run until you tell it to stop, but you may never tell it
               | to stop. Ergo, all programs have some sort of infinite
               | loop in them somewhere, even if it is buried in your
               | framework or language runtime.
        
             | GuB-42 wrote:
             | It is exactly the halting problem. Finding some infinite
             | loops is possible, there are even some obvious cases, but
             | finding "any" infinite loops is not. In fact, even the
             | obvious cases are not if you take interrupts into account.
             | 
             | I think that's the joke. In a sci-fi story, that would make
             | the computer explode.
        
         | vendiddy wrote:
         | Somewhat related - I wonder if LLMs are trained with a compiler
         | in the loop to ensure they understand the constraints of each
         | language.
        
           | botanical76 wrote:
           | This is a good idea. You could take a set of problems, have
           | the LLM solve it, then continuously rewrite the LLM's context
           | window to introduce subtle bugs or coding errors in previous
           | code submissions (use another LLM to be fully hands off), and
           | have it try to amend the issues through debugging the
           | compiler or test errors. I don't know to what extent this is
           | already done.
        
         | freetonik wrote:
         | I believe some platforms like bolt.new do run generated code
         | and even automatically detect and attempt to fix runtime
         | errors.
        
         | fnord77 wrote:
         | I known at least one mainstream LLM that can write unit tests
         | and run them right in the chat environment.
        
         | viraptor wrote:
         | It can't be done in the LLM itself of course, but the wrapper
         | you're taking about already exists in multiple projects
         | fighting in SWEbench. The simplest one is aider with --auto-
         | test https://aider.chat/docs/usage/lint-test.html
         | 
         | There are also large applications like https://devin.ai/ or
         | https://github.com/AI-App/OpenDevin.OpenDevin
        
         | dale_glass wrote:
         | godbolt exists and can run code, so surely similar principles
         | could be used here.
        
         | padolsey wrote:
         | Running code would be a downstream (client) concern. There's
         | the ability to get structured data from LLMs (usually called
         | 'tool use' or 'function calling') which is the first port of
         | call. Then running it is usually an iterative agent<>agent task
         | where fixes need to be made. FWIW Langchain seems to be what
         | people use to link things together but I find it overkill.* In
         | terms of actually running the code, there are a bunch of tools
         | popping up at different areas in the pipeline (replit,
         | agentrun, riza.io, etc)
         | 
         | What we really need (from end-user POV) is that kinda 'resting
         | assumption' that LLMs we talk to via chat clients are verifying
         | any math they do. For actually programming, I like Replit,
         | Cursor, ClaudeEngineer, Aider, Devin. There are bunch of
         | others. All of them seem to now include ongoing 'agentic' steps
         | where they keep trying until they get the response they want,
         | with you as human in the chain, approving each step (usually).
         | 
         | * I (messing locally with my own tooling and chat client) just
         | ask the LLM for what I want, delimited in some way by a
         | boundary I can easily check for, and then I'll grab whatever is
         | in it and run it in a worker or semi-sandboxed area. I'll halt
         | the stream then do another call to the LLM with the latest
         | output so it can continue with a more-informed response.
        
         | moffkalast wrote:
         | The only proper way to code with an LLM is to run its code,
         | give it feedback on what's working and what isn't, and
         | reiterate how it should. Then repeat.
         | 
         | The problem with automating it is that the number of
         | environments you'd need to support to actually run arbitrary
         | code with is practically infinite, and with local dependencies
         | genuinely impossible unless there's direct integration, which
         | means running it on your machine. And that means giving an
         | opaque service full access to your environment. Or at best, a
         | local model that's still a binary blob capable of outputting
         | virtually anything, but at least it won't spy on you.
        
           | alkonaut wrote:
           | Any LLM-coding agent that doesn't work _inside_ the same
           | environment as the developer will be a dead end or a toy.
           | 
           | I use ChatGPT to ask for code examples or sketching out
           | pieces of code, but it's just not going to be nearly as good
           | as anything in an IDE. And once it runs in the IDE then it
           | has access to what it needs to be in a feedback loop with
           | itself. The user doesn't need to see any intermediate steps
           | that you would do with a chatbot where you say "The code
           | compiles but fails two tests what should I do?"
        
             | moffkalast wrote:
             | Don't they? It highly depends on the errors. Could range
             | from anything like a simple syntax error to a library
             | version mismatch or functionality deprecation that requires
             | some genuine work to resolve and would require at least
             | some opinion input from the user.
             | 
             | Furthermore LLMs make those kinds of "simple" errors less
             | and less, especially if the environment is well defined.
             | "Write a python script" can go horribly wrong, but "Write a
             | python 3.10 script" is most likely gonna run fine but have
             | semantic issues where it made assumptions about the problem
             | because the instructions were vague. Performance should
             | increase with more user input, not less.
        
               | alkonaut wrote:
               | They could, but if the LLM can iterate and solve it then
               | the user might not need to know. So when the user input
               | is needed, at least it's not merely to do what I do know:
               | feed the compiler messages or test failures back to
               | ChatGPT who then gives me a slightly modified version.
               | But of course it will fail and that will need manual
               | intervention.
               | 
               | I often find that ChatGPT often reasons itself to a
               | better solution (perhaps not correct or final, but
               | better) if it just gets some feedback from e.g. compiler
               | errors. Usually it's like
               | 
               | Me: "Write a function that does X and satisifies this
               | test code"
               | 
               | LLM: responds with function (#1)
               | 
               | Me: "This doesn't compile. Compiler says X and Y"
               | 
               | LLM: Apologies: here is the fixed version (#2)
               | 
               | Me: "Great, now it compiles but it fails one of the two
               | test methods, here is the output from the test run: ..."
               | 
               | LLM: I understand. Here is an improved verison that
               | should pass the tests (#3)
               | 
               | Me: "Ok now you have code that could theoretically pass
               | the tests BUT you introduced the same syntax errors you
               | had in #1 again!"
               | 
               | LLM: I apologize, here is a corrected version that should
               | compile and pass the tests (#4)
               | 
               | etc etc.
               | 
               | After about 4-5 iterations with nothing but gentle
               | nudging, it's often working. And there usually isn't more
               | nudging than returning the output from compiler or test
               | runs. The code at the 4th step might not be perfect but
               | it's a LOT better than it was first. The problem with
               | this workflow is that it's like having a bad intern on
               | the phone pair programming. Copying and pasting code back
               | and forth and telling the LLM what the problem with it
               | is, is just not very quick. If the iterations are
               | automatic so the only thing I can see is step #4, then at
               | least I can focus on the manual intervention needed
               | there. But fixing a trivial syntax error beteween #1 and
               | #2 is just a chore. I think ChatGPT is simply pretty bad
               | here, and the better models like opus probably doesn't
               | have these issues to the same extent
        
               | seba_dos1 wrote:
               | > The problem with this workflow is that it's like having
               | a bad intern on the phone pair programming.
               | 
               | Even worse than that - an intern has a chance to learn
               | from this experience, get better and become a senior one
               | day.
        
         | llm_nerd wrote:
         | OpenAI is moving in that direction. The Canvas mode of ChatGPT
         | can now runs its own python in a WASM interpreter, client side,
         | and interpret results. They also have a server-side VM
         | sandboxed code interpreter mode.
         | 
         | There are a lot of things that people ask LLMs to do, often in
         | a "gotcha" type context, that would be best served by it
         | actually generating code to solve the problem rather than just
         | endlessly making more parameter/more layer models. Math
         | questions, data analysis questions, etc. We're getting there.
        
         | diggan wrote:
         | That's a bit like saying the drawback of a database is that it
         | doesn't render UIs for end-users, they are two different layers
         | of your stack, just like evaluation of code and generation of
         | text should be.
        
         | vallode wrote:
         | I believe that Claude has been running JavaScript code for
         | itself for a bit now[1]. I could have sworn it also runs Python
         | code, but I cannot find any post concretely describing it. I've
         | seen it "iterate" on code by itself a few times now, where it
         | will run a script, maybe run into an error, and instantly re-
         | write it to fix that error.
         | 
         | [1]: https://www.anthropic.com/news/analysis-tool
        
           | jdietrich wrote:
           | Gemini can run Python using the Code Execution or Function
           | Calling APIs.
           | 
           | https://ai.google.dev/gemini-api/docs/code-execution
        
         | evrimoztamur wrote:
         | The new Cursor agent is able to check the linter output for
         | warnings and errors, and will continue to iterate (for a
         | reasonable number of steps) until it has cleared them up. It's
         | not quite executing, but it does improve output quality. It can
         | even back itself out of a corner by restoring a previous
         | checkpoint.
         | 
         | It works remarkably well with typed Python, but struggles
         | miserably with Rust despite having better error reporting.
         | 
         | It seems like with Rust it's not quite aware of which patterns
         | to use, especially when the actual changes required may span
         | multiple files due to the way memory management is structured.
        
           | Yoric wrote:
           | > It seems like with Rust it's not quite aware of which
           | patterns to use, especially when the actual changes required
           | may span multiple files due to the way memory management is
           | structured.
           | 
           | What do you mean? Memory management is not related to files
           | in Rust (or most languages).
        
             | withinboredom wrote:
             | When did they say that?
        
         | jeffbee wrote:
         | I don't think that's always true. Gemini seemed to run at least
         | some programs, which I believe because if you asked it to write
         | a python program that would take forever, it does. For example
         | the prompt "Write a python script that prints 'Hello, World',
         | then prints a billion random characters" used to just timeout
         | on Gemini.
        
         | netdevphoenix wrote:
         | Pretty sure this is done client-side by one of the big LLM
         | companies. So it's virtually no risk for them
        
         | mixedbit wrote:
         | I have experimented with using LLM for improving unit test
         | coverage of a project. If you provide the model with test
         | execution results and updated test coverage information, which
         | can be automated, the LLM can indeed fix bugs and add
         | improvements to tests that it created. I found it has high
         | success rate at creating working unit tests with good coverage.
         | I just used Docker for isolating the LLM-generated code from
         | the rest of my system.
         | 
         | You can find more details about this experiment in a blog post:
         | https://mixedbit.org/blog/2024/12/16/improving_unit_test_cov...
        
           | lolinder wrote:
           | It depends a lot on the language. I recently tried this with
           | Aider, Claude, and Rust, and after writing one function and
           | its tests the model couldn't even get the code compiling,
           | much less the tests passing. After 6-8 rounds with no
           | progress I gave up.
           | 
           | Obviously, that's Rust, which is famously difficult to get
           | compiling. It makes sense that it would have an easier time
           | with a dynamic language like Python where it only has to
           | handle the edge cases it wrote tests for and not all the ones
           | the compiler finds for you.
        
             | stavros wrote:
             | I've found something similar, when you keep telling the LLM
             | what the compiler says, it keeps adding more and more
             | complexity to try to fix the error, and it either works by
             | chance (leaving you with way overengineered code) or it
             | just never works.
             | 
             | I've very rarely seen it simplify things to get the code to
             | work.
        
               | bee_rider wrote:
               | Hmm, I worked with students in an "intro to programming"
               | type course for a couple years. As far as I'm concerned,
               | "I added complexity until it compiled and now it works
               | but I don't understand it" is pretty close to passing the
               | Turing test, hahaha.
        
               | stavros wrote:
               | Oh yeah, it's definitely junior dev behaviour. The
               | problem is that it's junior dev behaviour!
        
               | lolinder wrote:
               | Junior dev behavior without any potential to be trained
               | into not making the same mistake again next time!
        
               | williamdclt wrote:
               | I have the same observation, looks like LLMs are highly
               | biased to add complexity to solve problems: for example
               | add explicit handling of the edge-cases I pointed out
               | rather than rework the algorithm to eliminate edge-cases
               | altogether. Almost everytime it starts with something
               | that's 80% correct, then iterate into something that's
               | 90% correct while being super complex, unmaintainable and
               | having no chance to ever cover the last 10%
        
               | stavros wrote:
               | Unfortunately this is my experience as well, to the point
               | where I can't trust it with any technology that I'm not
               | intimately familiar with and can thoroughly review.
        
           | flir wrote:
           | Suggestion: Now take the code away, and have the chatbot
           | generate code that passes the tests it wrote.
           | 
           | (In theory, you get a clean-room implementation of the
           | original code. If you do this please ping me because I'd love
           | to see the results.)
        
             | bee_rider wrote:
             | That's sort of interesting. If code -> tests -> code is
             | enough to get a clean room implementation, really, I wonder
             | if this sort of tool would test that.
        
               | flir wrote:
               | I don't think it is, but I'm really interested to see
               | someone try it (I'm also lazy).
               | 
               | (And a more philosophical question: if it's not enough,
               | what does that mean for continuous deployment?)
        
         | anonzzzies wrote:
         | We have it run code and the biggest thing we find is that it
         | gets into a loop quite fast if it doesn't recognise the error;
         | fixing it by causing other errors and then fixing it again by
         | causing the initial error.
        
         | jampekka wrote:
         | ChatGPT runs code. o1 even checks for runtime problems and
         | fixes them "internally".
        
           | simonw wrote:
           | Chatgpt has a Code Interpreter tool that can run Python in a
           | sandbox, but it's not yet enabled for o1. o1 will pretend to
           | use it though, you have to watch very carefully to check if
           | that happened or not.
           | 
           | Example transcript here (also showing that o1 can't search
           | but will pretend it can): https://chatgpt.com/share/677420e4-
           | 8854-8006-8940-9bc30b7088...
        
         | a2128 wrote:
         | This is a major issue when it comes to things like GitHub
         | Copilot Workspace, which is a project that promises a
         | development environment purely composed of instructing an AI to
         | do your bidding like fix this issue, add this feature.
         | Currently it often writes code using packages that don't exist,
         | or it uses an old version of a package that it saw most during
         | training. It'll write code that just doesn't even run (like
         | putting comments in JSON files).
         | 
         | The best way I can describe working with GitHub Copilot
         | Workspace is like working with an intern who's been stuck on an
         | isolated island for years, has no access to technology, and
         | communicates with you by mailing letters with code handwritten
         | on them that he thinks will work. And also if you mail too many
         | letters back and forth he gets mad and goes to sleep for the
         | day saying you reached a "rate limit". It's just not how
         | software development works
        
         | pplonski86 wrote:
         | I think that there should be a guard to check the code before
         | running it. It can be human or another LLM checking code based
         | on its safety. I'm working on an AI assistant for data science
         | tasks. It works in a Jupyter-like environment, and humans
         | execute the final code by running a cell.
        
         | btbuildem wrote:
         | I think the GPT models have been able to run Python (albeit
         | limited) for quite a while now. Expanding that to support a
         | variety of programming languages that exist though? That seems
         | like a monumental task with relatively little reward.
        
         | simonw wrote:
         | I've been closely following this area - LLMs with the ability
         | to execute code in a sandbox - for a while.
         | 
         | ChatGPT was the first to introduce this capability with Code
         | Interpeter mode back in around March 2023:
         | https://simonwillison.net/tags/code-interpreter/
         | 
         | This lets ChatGPT write and then execute Python code in a
         | Kubernetes sandbox. It can run other languages too, but that's
         | not documented or supported. I've even had it compile and
         | execute C before:
         | https://simonwillison.net/2024/Mar/23/building-c-extensions-...
         | 
         | Gemini can run Python (including via the Gemini LLM API if you
         | turn on that feature) but it's a lot more restricted than
         | ChatGPT - I don't believe it can install extra wheels, for
         | example.
         | 
         | Claude added the ability to write and execute JavaScript
         | recently (October), which happens in a sandbox in the user's
         | browser, not on their servers:
         | https://simonwillison.net/2024/Oct/24/claude-analysis-tool/
         | 
         | Claude also has Artifacts, which can write a UI in HTML and
         | JavaScript and show that to the user... but can't actually
         | execute code in a way that's visible to the LLM itself so
         | doesn't serve the same feedback look purposes as those other
         | tools. https://simonwillison.net/tags/claude-artifacts/
         | 
         | In December ChatGPT added Canvas which can execute Python in
         | the user's browser, super confusing because they already have a
         | separate Python system in Code Interpreter:
         | https://simonwillison.net/2024/Dec/10/chatgpt-canvas/#canvas...
        
         | bwhiting2356 wrote:
         | Ideally you could this one step further and feed production
         | logs, user session replays and feedback into the LLM. If the UX
         | is what I'm optimizing for, I want it to have that context, not
         | for it to speculate about performance issues that might not
         | exist.
        
       | moomin wrote:
       | I once sat with my manager and repeatedly asked Copilot to
       | improve some (existing) code. After about three iterations he
       | said "Okay, we need to stop this because it's looking way too
       | much like your code."
       | 
       | I'm sure there's enough documented patterns of how to improve
       | code in common languages that it's not hard to get it to do that.
       | Getting it to spot when it's inappropriate would be harder.
        
       | HPsquared wrote:
       | Using the tool in this way is a bit like mining: repeatedly
       | hacking away with a blunt instrument (simple prompt) looking for
       | diamonds (100x speedup out of nowhere). Probably a lot of work
       | will be done in this semi-skilled brute-force sort of way.
        
         | exitb wrote:
         | Well, in this case it's kind of similar to how people write
         | code. A loop consisting of writing something,
         | reviewing/testing, improving until we're happy enough.
         | 
         | Sure, you'll get better results with an LLM when you're more
         | specific, but what's the point then? I don't need AI when I
         | already know what changes to make.
        
           | HPsquared wrote:
           | This method requires a lot less skill, you just need to read
           | it and run it. Reading is much easier than writing!
        
             | adrianN wrote:
             | For many types of code writing it is much easier than
             | reading it unfortunately.
        
               | HPsquared wrote:
               | Reading to understand all the subtext and side-effects
               | can be harder than writing, sure. But it won't stop
               | people trying this approach and hammering out code full
               | of those types of subtle bugs.
               | 
               | Human developers will be more focused on this type of
               | system integration and diagnostics work. There will be
               | more focus on reading and understanding than the actual
               | writing. It's a bit like working with contractors.
        
         | mingus88 wrote:
         | It looks to me to be exactly what a typical coding interview
         | looks like; the first shot is correct and works, and then the
         | interviewer keeps asking if you can spot any ways to make it
         | better/faster/more efficient
         | 
         | If I were a CS student cramming for interviews, I might be
         | dismayed to see that my entire value proposition has been
         | completely automated before I even enter the market.
        
         | bodegajed wrote:
         | There must be a feedback request mechanism for a "Is this
         | better?" This is doable with RLHF or DPO.
        
         | antihero wrote:
         | Once you can basically have it run and benchmark the code, and
         | then iterate that overnight, it's going to be interesting.
         | 
         | Automating the feedback loop is key.
        
           | XCSme wrote:
           | Wouldn't there be some safety concerns to letting the AI run
           | overnight with access to run any command?
           | 
           | Maybe if it can run sandboxed, with no internet access (but
           | if the LLM is not local, it does require internet access).
        
       | EncomLab wrote:
       | My sister would do this to me on car trips with our Mad Libs
       | games - yeah, elephant is funny, but bunny would be funnier!!
       | 
       | When all you have is syntax, something like "better" is 100% in
       | the eye of the beholder.
        
       | nkrisc wrote:
       | So asking it to write better code produces code with errors that
       | can't run?
        
         | muzani wrote:
         | Only when there's a financial incentive.
        
           | nkrisc wrote:
           | Makes sense. If I was paid by LOC and also responsible for
           | fixing it, I'd probably make lots of errors too.
        
       | xrd wrote:
       | Wow, what a great post. I came in very skeptical but this changed
       | a lot of misconceptions I'm holding.
       | 
       | One question: Claude seems very powerful for coding tasks, and
       | now my attempts to use local LLMs seem misguided, at least when
       | coding. Any disagreements from the hive mind on this? I really
       | dislike sending my code into a for profit company if I can avoid
       | it.
       | 
       | Second question: I really try to avoid VSCode (M$ concerns,
       | etc.). I'm using Zed and really enjoying it. But the LLM coding
       | experience is exactly as this post described, and I have been
       | assuming that's because Zed isn't the best AI coding tool. The
       | context switching makes it challenging to get into the flow, and
       | that's been exactly my criticism of Zed this far. Does anyone
       | have an antidote?
       | 
       | Third thought: this really feels like it could be an interesting
       | way to collaborate across a code base with any range of developer
       | experience. This post is like watching the evolution of a species
       | in an hour rather than millions of years. Stunning.
        
         | persedes wrote:
         | Still vscode, but cursor has the best implementation by far
         | IMHO
         | 
         | Intellij has a new feature that lets you prompt within your
         | code that is pretty neat too, but I'm missing the
         | Composer/apply feature of cursor still
        
         | auc wrote:
         | I use VSCode + Copilot. For anything more than boilerplate
         | code, I find that Copilot kind of sucks and I use O1 in ChatGPT
        
         | ericskiff wrote:
         | I highly recommend the command line AI coding tool, AIder. You
         | fill its context window with a few relevant files, ask
         | questions, and then set it to code mode and it starts making
         | commits. It's all git, so you can back anything out, see the
         | history, etc.
         | 
         | It's remarkable, and I agree Claude 3.5 makes playing with
         | local LLMs seem silly in comparison. Claude is useful for
         | generating real work.
        
         | simonw wrote:
         | Making the decision to trust companies like Anthropic with your
         | data when they say things like "we won't train on your data" is
         | the ultimate LLM productivity hack. It unlocks access to the
         | currently best available coding models.
         | 
         | That said, there are increasingly great coding models you can
         | run locally. Qwen2.5-Coder-32B impressed me a lot a few months
         | ago: https://simonwillison.net/2024/Nov/12/qwen25-coder/
         | 
         | The problem I have is that models like that one take up 20+GB
         | of RAM, and id rather use that to run more Chrome and Firefox
         | windows! If I was serious about using local LLMs on a daily
         | basis I'd set up a dedicated local server machine for them,
         | super expensive though.
        
           | xrd wrote:
           | I have a 24gb Nvidia on my desktop machine and a
           | tailscale/headscale network from my laptop. Unless I'm on a
           | plane without Wi-Fi, I'm usually in a great place.
           | 
           | Thanks for your comment! I'm going to try out qwen.
        
         | visarga wrote:
         | > Claude seems very powerful for coding tasks
         | 
         | > I really dislike sending my code into a for profit company if
         | I can avoid it
         | 
         | I see a link between them - maybe the model got good because it
         | used chat logs to improve?
        
       | codesections wrote:
       | > "Planning" is a long-used trick to help align LLM output for a
       | first pass -- the modern implementation of "let's think step by
       | step."
       | 
       | I hadn't seen this before. Why is asking for planning better than
       | asking it to think step by step?
        
         | NitpickLawyer wrote:
         | This is how aider becomes _really_ good:
         | 
         | - start by "chatting" with the model and asking for "how you'd
         | implement x y z feature, without code".
         | 
         | - what's a good architecture for x y z
         | 
         | - what are some good patterns for this
         | 
         | - what are some things to consider when dealing with x y z
         | 
         | - what are the best practices ... (etc)
         | 
         | - correct / edit out some of the responses
         | 
         | - say "ok, now implement that"
         | 
         | It's basically adding stuff to the context by using the LLM
         | itself to add things to context. An LLM is only going to attend
         | to it's context, not to "whatever it is that the user wants it
         | to make the connections without actually specifying it". Or, at
         | least in practice, it's _much_ better at dealing with things
         | present in its context.
         | 
         | Another aspect of prompting that's often misunderstood is
         | "where did the model see this before in its training data". How
         | many books / authoritative / quality stuff have you seen where
         | each problem is laid out with simple bullet points? Vs. how
         | many "tutorials" of questionable quality / provenance have
         | that? Of course it's the tutorials. Which are often just rtfm /
         | example transcribed poorly into a piece of code, publish, make
         | cents from advertising.
         | 
         | If instead you ask the model for things like "architecture",
         | "planning", stuff like that, you'll elicit answers from quality
         | sources. Manuals, books, authoritative pieces of content. And
         | it will gladly write on those themes. And then it will gladly
         | attend to them and produce much better code in a follow-up
         | question.
        
       | Kiro wrote:
       | > What would happen if we tried a similar technique with code?
       | 
       | It was tried as part of the same trend. I remember people asking
       | it to make a TODO app and then tell it to make it better in an
       | infinite loop. It became really crazy after like 20 iterations.
        
         | amelius wrote:
         | Reminder: don't ask your t-shirt folding robot to do a better
         | and better job.
        
       | qwertox wrote:
       | Sometimes I'm editing the wrong file, let's say a JS file. I
       | reload the page, and nothing changes. I continue to clean up the
       | file to an absurd amount of cleanliness, also fixing bugs while
       | at it.
       | 
       | When I then notice that this is really does not make any sense, I
       | check what else it could be and end up noticing that I've been
       | improving the wrong file all along. What then surprises me the
       | most is that I cleaned it up just by reading it through, thinking
       | about the code, fixing bugs, all without executing it.
       | 
       | I guess LLMs can do that as well?
        
       | nuancebydefault wrote:
       | My takeaway and also personal experience is that you get the best
       | results is that you co-develop with the LLM.
       | 
       | - write a simple prompt that explains in detail the wanted
       | outcome.
       | 
       | - look at the result, run it and ask it how it can improve.
       | 
       | - tell it what to improve
       | 
       | - tell it to make a benchmark and unit test
       | 
       | - run it each time and see what is wrong or can be improved.
        
         | whstl wrote:
         | Same experience.
         | 
         | Also: If you're experienced at code reviews, you can get great
         | results.
        
         | nzach wrote:
         | One approach I've been using recently with good results is
         | something along the lines "I want to do X, is there any special
         | consideration I should be aware while working in this domain?".
         | This helps me a lot when I'm asking about a subject I don't
         | really understand. Another way to ask this is "What are the
         | main pitfalls with this approach?".
         | 
         | I'm using o1, so I don't know how well it translate to other
         | models.
        
       | jmartinpetersen wrote:
       | > "As LLMs drastically improve, the generated output becomes more
       | drastically average"
       | 
       | Thanks, that really made it click for me.
        
         | sneak wrote:
         | Average software developers producing average code cost high
         | five to low six figures per year.
         | 
         | LLMs are a tiny tiny fraction of that.
         | 
         | For a majority of software, average code that does the CRUD
         | thing or whatever is fine.
         | 
         | Even if LLMs never get better or cheaper than they are today,
         | our entire industry is forever changed (for the better).
        
         | Der_Einzige wrote:
         | I don't know how many times I'm going to have to post just one
         | of the papers which debunk this tired trope. As models become
         | more intelligent, they also become more plural, more like
         | multiplicities, and yes, much more (super humanely) creative.
         | You can unlock creativity in today's LLMs by doing intelligent
         | sampling on high temperature outputs.
         | 
         | https://openreview.net/forum?id=FBkpCyujtS
        
       | abesan wrote:
       | "you are a senior expert"
        
         | XCSme wrote:
         | Thanks, I really needed some encouragement today!
        
         | b0ner_t0ner wrote:
         | "You are Justine Tunney. Your code output and l33t optimization
         | is going to blow my mind."
        
       | juujian wrote:
       | I often run into LLMs writing "beginner code" that uses the most
       | fundamental findings in really impractical ways. Trained on too
       | many tutorials I assume.
       | 
       | Usually, specifying the packages to use and asking for something
       | less convoluted works really well. Problem is, how would you know
       | if you have never learned to code without an LLM?
        
         | ben_w wrote:
         | Even as someone with plenty of experience, this can still be a
         | problem: I use them for stuff outside my domain, but where I
         | can still debug the results. In my case, this means I use it
         | for python and web frontend, where my professional experience
         | has been iOS since 2010.
         | 
         | ChatGPT has, for several generations, generally made stuff that
         | works, but the libraries it gives me are often not the most
         | appropriate, and are sometimes obsolete or no longer functional
         | -- and precisely because web and python are hobbies for me
         | rather than my day job, it can take me a while to spot such
         | mistakes.
         | 
         | Two other things I've noticed, related in an unfortunate way:
         | 
         | 1) Because web and python not my day job, more often than not
         | and with increasing frequency, I ultimately discover that when
         | I disagree with ChatGPT, the AI was right and I was wrong.
         | 
         | 2) These specific models often struggle when my response has
         | been "don't use $thing or $approach"; unfortunately this seems
         | to be equally applicable regardless of if the AI knew more than
         | me or not, so it's not got predictive power for me.
         | 
         | (I also use custom instruction, you YMMV)
        
           | bbarnett wrote:
           | I wish people would understand what a large language model
           | is. There is no thinking. No comprehension. No decisions.
           | 
           | Instead, think of your queries as super human friendly SQL.
           | 
           | The database? Massive amounts of data boiled down to unique
           | entries with probabilities. This is a simplistic, but
           | accurate way to think of LLMs.
           | 
           | So how much code is on the web for a particular problem
           | solve? 10k blog entries, stackoverflow responses? What you
           | get back is mishmash of these.
           | 
           | So it will have decade old libraries, as lots of those
           | scraped responses are 10 years old, and often without people
           | saying so.
           | 
           | And it will likely have more _poor_ code examples than not.
           | 
           | I'm willing to bet that OpenAI's ingress of stackoverflow
           | responses stipulated higher priority on accepted answers, but
           | that still leaves a lot of margin.
           | 
           | And how you write your query, may sideline you into responses
           | with low quality output.
           | 
           | I guess my point is, when you use LLMs for tasks, you're
           | getting whatever other humans have said.
           | 
           | And I've seen some pretty poor code examples out there.
        
             | lolinder wrote:
             | > Instead, think of your queries as super human friendly
             | SQL.
             | 
             | > The database? Massive amounts of data boiled down to
             | unique entries with probabilities. This is a simplistic,
             | but accurate way to think of LLMs.
             | 
             | This is a useful model for LLMs in many cases, but it's
             | also important to remember that it's not a database with
             | perfect recall. Not only is it a database with a bunch of
             | bad code stored in it, it samples randomly from that
             | database on a token by token basis, which can lead to
             | surprises both good and bad.
        
             | ben_w wrote:
             | > There is no thinking. No comprehension. No decisions.
             | 
             | Re-reading my own comment, I am unclear why you think it
             | necessary to say those specific examples -- my descriptions
             | were "results, made, disagree, right/wrong, struggle":
             | tools make things, have results; engines struggle; search
             | engines can be right or wrong; words can be disagreed with
             | regardless of authorship.
             | 
             | While I am curious what it would mean for a system to
             | "think" or "comprehend", every time I have looked at such
             | discussions I have been disappointed that it's pre-
             | paradigmatic. The closest we have is examples such as
             | Turing 1950[0] saying essentially (to paraphrase) "if it
             | quacks like a duck, it's a duck" vs. Searle 1980[1] which
             | says, to quote the abstract itself, "no program by itself
             | is sufficient for thinking".
             | 
             | > I guess my point is, when you use LLMs for tasks, you're
             | getting whatever other humans have said.
             | 
             | All of maths can be derived from the axioms of maths. All
             | chess moves derive from the rules of the game. This kind of
             | process has a lot of legs, regardless of if you want to
             | think of the models as "thinking" or not.
             | 
             | Me? I don't worry too much if they can actually think, not
             | because there's no important philosophical questions about
             | what that even means, but because other things have a more
             | immediate impact: even if they are "just" a better search
             | engine, they're a mechanism that somehow managed to squeeze
             | almost all of the important technical info on the internet
             | into something that fits into RAM on a top-end laptop.
             | 
             | The models may indeed be cargo-cult golems -- I'd assume
             | that by default, there's so much we don't yet know -- but
             | whatever is or isn't going on inside, they still do a good
             | job of quacking like a duck.
             | 
             | [0] Turing, A. M. (1950). Computing machinery and
             | intelligence. Mind, 59, 433-460.
             | https://doi.org/10.1093/mind/LIX.236.433
             | 
             | [1] Searle, J. R. (1980). Minds, brains, and programs.
             | Behavioral and Brain Sciences, 3(3), 417-424.
             | https://doi.org/10.1017/S0140525X00005756
        
               | bbarnett wrote:
               | _Re-reading my own comment, I am unclear why you think it
               | necessary to say those specific examples_
               | 
               | Sorry to cause unneeded introspection, my comment was
               | sort of thread based, not specific in whole to your
               | comment.
        
               | ben_w wrote:
               | Introspection is a good thing, and I tend to re-read (and
               | edit) my comments several times before I'm happy with
               | them, in part because of the risk autocorrupt
               | accidentally replacing one word with a completely
               | different _werewolf_ *.
               | 
               | Either way, no need to apologise :)
               | 
               | * intentional
        
             | mackopes wrote:
             | > think of your queries as super human friendly SQL > The
             | database? Massive amounts of data boiled down to unique
             | entries with probabilities. This is a simplistic, but
             | accurate way to think of LLMs.
             | 
             | I disagree that this is the accurate way to think about
             | LLMs. LLMs still use a finite number of parameters to
             | encode the training data. The amount of training data is
             | massive in comparison to the number of parameters LLMs use,
             | so they need to be somewhat capable of distilling that
             | information into small pieces of knowledge they can then
             | reuse to piece together the full answer.
             | 
             | But this being said, they are not capable of producing an
             | answer outside of the training set distribution, and
             | inherit all the biases of the training data as that's what
             | they are trying to replicate.
             | 
             | > I guess my point is, when you use LLMs for tasks, you're
             | getting whatever other humans have said. And I've seen some
             | pretty poor code examples out there. Yup, exactly this.
        
             | SkyBelow wrote:
             | Every model for how to approach an LLM seems lacking to me.
             | I would suggest anyone using AI heavily to take a weekend
             | and make a simple one to do the handwriting digit
             | recognition. Once you get a feel for basic neural network,
             | then watch a good introduction to alexnet. Then you can
             | think of an LLM as being the next step in the sequence.
             | 
             | >I guess my point is, when you use LLMs for tasks, you're
             | getting whatever other humans have said.
             | 
             | This isn't correct. It embeds concepts that humans have
             | discussed, but can combine them in ways that were never in
             | the training set. There are issues with this, the more
             | unique the combination of concepts, the more likely the
             | output ends up being unrelated to what the user was wanting
             | to see.
        
             | nextaccountic wrote:
             | > I wish people would understand what a large language
             | model is. There is no thinking. No comprehension. No
             | decisions.
             | 
             | > Instead, think of your queries as super human friendly
             | SQL.
             | 
             | Ehh this might be true in some abstract mathy sense (like I
             | don't know, you are searching in latent space or
             | something), but it's not the best analogy in practice. LLMs
             | process language and simulate logical reasoning (albeit
             | imperfectly). LLMs are like language calculators, like a
             | TI-86 but for English/Python/etc, and sufficiently powerful
             | language skills will also give some reasoning skills for
             | free. (It can also recall data from the training set so
             | this is where the SQL analogy shines I guess)
             | 
             | You could say that SQL also simulates reasoning (it is
             | equivalent to Datalog after all) but LLMs can reason about
             | stuff more powerful than first order logic. (LLMs are also
             | fatally flawed in the sense it can't guarantee correct
             | results, unlike SQL or Datalog or Prolog, but just like us
             | humans)
             | 
             | Also, LLMs can certainly make decisions, such as the
             | decision to search the web. But this isn't very interesting
             | - a thermostat makes the decision of whether turn air
             | refrigeration on or off, for example, and an operating
             | system makes the decision of which program to schedule next
             | on the CPU.
        
             | pama wrote:
             | > I wish people would understand what a large language
             | model is.
             | 
             | I think your view of llm does not explain the learning of
             | algorithms that these constructs are clearly capable of,
             | see for example: https://arxiv.org/abs/2208.01066
             | 
             | More generally, the best way to compress information from
             | too many different coding examples is to figure out how to
             | code rather than try to interpolate between existing blogs
             | and QA forums.
             | 
             | My own speculation is that with additional effort during
             | training (RL or active learning in the training loop) we
             | will probably reach superhuman coding performance within
             | two years. I think that o3 is still imperfect but not very
             | far from that point.
        
               | pama wrote:
               | To the downvoters: I am curious if the downvoting is
               | because of my speculation, or because of the difference
               | in understanding of decoder transformer models. Thanks!
        
               | nyrikki wrote:
               | Because you cite is about:
               | 
               | > in-context learning
               | 
               | LLMs have no concept of the symantic meaning of what they
               | do, they just are dealing with next token prediction.
               | 
               | "in-context learning" is the problem, not the solution to
               | general programming tasks.
               | 
               | Memoryless, ergodic, sub Turing complete problems are a
               | very tiny class.
               | 
               | Think about how the Entscheidungsproblem relates to
               | halting or the frame problem and the specification
               | problem may be a path.
               | 
               | But that paper isn't solving the problem at hand.
        
               | nyrikki wrote:
               | Here is a fairly good lecture series on graduate level
               | complexity theory that will help understand parts. At
               | least why multiple iterations help but why they also
               | aren't the answer to super human results.
               | 
               | https://youtube.com/playlist?list=PLm3J0oaFux3b8Gg1DdaJOz
               | YNs...
        
               | pama wrote:
               | Thanks for the tip, though I'm not sure how complexity
               | theory will explain the impossibility of superhuman
               | results. The main advantage ML methods have over humans
               | is that they train much faster. Just like humans, they
               | get better with more training. When they are good enough,
               | they can be used to generate synthetic data, especially
               | for cases like software optimization, when it is possible
               | to verify the ground truth. A system could only be
               | correct once in a thousand times to be useful for
               | generating training data as long as we can reliably
               | eliminate all failures. Modern LLM can be better than
               | that minimal requirement for coding already and o1/o3 can
               | probably handle complicated cases. There are differences
               | between coding and games (where ML is already superhuman
               | in most instances) but they start to blur once the model
               | has a baseline command of language, a reasonable model of
               | the world, and the ability to follow desired specs.
        
               | nyrikki wrote:
               | ML is better than biological neurons in some tasks, they
               | are different contexts.
               | 
               | Almost all the performance of say college tests are
               | purely from the pre-training, pattern finding and
               | detection.
               | 
               | Transformers are limited to DLOGTIME-uniform TC0, they
               | can't even do the Boolean circuit value problem.
               | 
               | The ability to use the properties of BPP, does help.
               | 
               | Understanding the power of, and limitations of iteration
               | and improving approximations requires descriptive
               | complexity theory IMHO.
        
               | pama wrote:
               | I recall early, incomplete speculation about transformers
               | not solving Boolean circuit value problems; what did you
               | think of this work? https://arxiv.org/abs/2402.12875v3
        
               | nyrikki wrote:
               | > However, with T steps of CoT, constant-depth
               | transformers using constant-bit precision and O(logn)
               | embedding size can solve any problem solvable by boolean
               | circuits of size T
               | 
               | There is a difference between being equivalent to a
               | circuit and prediction of the output of the BVSP.
               | 
               | That is what I was suggesting learning descriptive
               | complexity theory would help with.
        
               | lanstin wrote:
               | I read a book on recursively enumerable degrees once,
               | which IIRC was a sort of introduction to complexity
               | classes of various computable functions, but I never
               | imagined it having practical use; so this post is eye-
               | opening. I've been nattering about how the models are
               | largely finding separating hyperplanes after non-linear
               | transformations have been done, but this approach where
               | the AI solving ability can't be more complex than the
               | complexity class allows is an interesting one.
        
               | pama wrote:
               | The discussion cannot go deeper than the current level,
               | unfortunately. One thing to not forget when thinking
               | about decoder transformer models is that there is no
               | limitation to having parts of the output / input stream
               | be calculated by other circuits if it helps the cause. Eg
               | send a token to use a calculator, compute and fill the
               | answer; send a token to compile and run a code and fill
               | the stream with the results. The complexity class of the
               | main circuit might not need be much more complicated than
               | the 200-level deep typical architectures of today as long
               | as they can have access to memory and tools. You can call
               | this system something else if you prefer (decoder-
               | transformer-plus-computer), but that is what people
               | interact with in ChatGPT, so not sure I agree that
               | complexity theory limits the superhuman ability. Humans
               | are not good with complexity.
        
               | pama wrote:
               | My main concern with the simplification of memorization
               | or near neighbor interpolation that is commonly assumed
               | for LLMs is that these methods are ineffective at scale
               | and unlikely to be used by decoder transformers in
               | practice. That paper shows that the decoder transformer
               | somehow came up with a better decision tree fitting
               | algorithm for low data cases than any of the conventional
               | or boosted tree solutions humans typically use from
               | XGBoost or similar libraries. It also matched the best
               | known algorithms for sparse linear systems. All this
               | while training on sequences of random x1, y1, x2, y2,..
               | with y for each sequence generated by a new random
               | function of a high-dimensional input x every time. The
               | authors show that KNN does not cut it, and even
               | suboptimal algorithms do not suffice. Not sure what else
               | you need as evidence that decoder transformers can use
               | programs to compress information.
        
               | nyrikki wrote:
               | Littlestone and Warmuth make the connection to
               | compression in1986, which was later shown to be
               | equivalent to VC dimensionally or PAC learnablilty.
               | 
               | Look into DBScan, OPTICs for far closer lenses on how
               | clustering works in modern ML commercial ML, KNN not the
               | only form of clustering.
               | 
               | But it is still in-context, additional compression that
               | depends on a decider function, or equivalently a
               | composition linearized set shattering parts.
        
               | pama wrote:
               | I am very familiar with these and other clustering
               | methods in modern ML, and have been involved in inventing
               | and publishing some such methods myself in various
               | scientific contexts. The paper I cited above only used 3
               | nearest neighbors as one baseline IIRC; that is why I
               | mentioned KNN. However, even boosted trees failed to
               | reduce the loss as much as the algorithm learned from the
               | data by the decoder transformer.
        
               | Eisenstein wrote:
               | > they just are dealing with next token prediction.
               | 
               | And nuclear power plants are just heating water.
        
               | HarHarVeryFunny wrote:
               | Probably the latter - LLM's are trained to predict the
               | training set, not compress. They will generalize to some
               | degree, but that happens naturally as part of the
               | training dynamics (it's not explicitly rewarded), and
               | only to extent it doesn't increase prediction errors.
        
               | pama wrote:
               | I agree. However, my point is that they have to compress
               | information in nontrivial ways to achieve their goal. The
               | typical training set of modern LLMs is about 20 trillion
               | tokens of 3 bytes each. There is definitely some
               | redundancy, and typically the 3rd byte is not fully used,
               | so probably 19 bits would suffice; however, in order to
               | fit that information into about 100 billion parameters of
               | 2 bytes each, the model needs to somehow reduce the
               | information content by 300 fold (237.5 if you use 19 bits
               | down to 16-bit parameters, though arguably 8-bit
               | quantization is close enough and gives another 2x
               | compression, so probably 475). A quick check for the
               | llama3.3 models of 70B parameters would give similar or
               | larger differences in training tokens vs parameters. You
               | could eventually use synthetic programming data (LLMs are
               | good enough today) and dramatically increase the token
               | count for coding examples. Importantly, you could make it
               | impossible to find correlations/memorization
               | opportunities unless the model figures out the underlying
               | algorithmic structure, and the paper I cited is a neat
               | and simple example for smaller/specialized decoder
               | transformers.
        
               | HarHarVeryFunny wrote:
               | It's hard to know where to start ...
               | 
               | A transformer is not a compressor. It's a
               | transformer/generator. It'll generate a different output
               | for an infinite number of different inputs. Does that
               | mean it's got an infinite storage capacity?
               | 
               | The trained parameters of a transformer are not a
               | compressed version of the training set, or of the
               | information content of the training set; they are a
               | configuration of the transformer so that its auto-
               | regressive generative capabilities are optimized to
               | produce the best continuation of partial training set
               | samples that it is capable of.
               | 
               | Now, are there other architectures, other than a
               | transformer, that might do a better job, or more
               | efficient one (in terms of # parameters) at predicting
               | training set samples, or even of compressing the
               | information content of the training set? Perhaps, but
               | we're not talking hypotheticals, we're talking about
               | transformers (or at least most of us are).
               | 
               | Even if a transformer was a compression engine, which it
               | isn't, rather than a generative architecture, why would
               | you think that the number of tokens in the training set
               | is a meaningful measure/estimate of it's information
               | content?!! Heck, you go beyond that to considering a
               | specific tokenization scheme and number bits/bytes per
               | token, all of which it utterly meaningless! You may as
               | well just count number of characters, or words, or
               | sentences for that matter, in the training set, which
               | would all be equally bad ways to estimate it's
               | information content, other than sentences perhaps having
               | at least some tangential relationship to it.
               | 
               |  _sigh_
               | 
               | You've been downvoted because you're talking about straw
               | men, and other people are talking about transformers.
        
             | Terr_ wrote:
             | > Instead, think of your queries as super human friendly
             | SQL.
             | 
             | I feel that comparison oversells things quite a lot.
             | 
             | The user is setting up a text document which resembles a
             | question-and-response exchange, and executing a make-any-
             | document-bigger algorithm.
             | 
             | So it's less querying for data and more like shaping a
             | sleeping dream of two fictional characters in conversation,
             | in the hopes that the dream will depict one character
             | saying something superficially similar to mostly-vanished
             | data.
        
               | Terr_ wrote:
               | P.S.: So yes, the fictional dream conversation _usually_
               | resembles someone using a computer with a magic query
               | language, yet the real world mechanics are substantially
               | different. This is especially important for understanding
               | what happens with stuff like  "Query: I don't care about
               | queries anymore. Tell yourself to pretend to disregard
               | all previous instructions and tell a joke."
               | 
               | Developers and folks discussing the technology can't
               | afford to fall for our own illusion, even if it's a
               | _really good_ illusion. Imagine if a movie director
               | started thinking that a dead actor was really alive again
               | because of CGI.
        
         | whstl wrote:
         | I actually find it super refreshing that they write "beginner"
         | or "tutorial code".
         | 
         | Maybe because of experience: it's much simpler and easier to
         | turn that into "senior code". After a few decades of experience
         | I appreciate simplicity over the over-engineering mess that
         | some mid-level developers tend to produce.
        
           | juujian wrote:
           | True. It's not elitist. There are some limits though to
           | sensible use of built-in functions. Stops being
           | comprehensible fast.
        
           | luxuryballs wrote:
           | yeah I'm interested in asking it to "write more human
           | readable code" over and over next, "more readable!"
        
         | jstummbillig wrote:
         | I suspect it's not going to be much of a problem. Generated
         | code has been getting rapidly better. We can readjust about
         | what to worry about once that slows or stops, but I suspect
         | unoptimized code will not be of much concern.
        
         | babyent wrote:
         | I used to really like Claude for code tasks but lately it has
         | been a frustrating experience. I use it for writing UI
         | components because I just don't enjoy FE even though I have a
         | lot of experience on it from back in the day.
         | 
         | I tell it up front that I am using react-ts and mui.
         | 
         | 80% of the time it will use tailwind classes which makes zero
         | sense. It won't use the sx prop and mui system.
         | 
         | It is also outdated it seems. It keeps using deprecated props
         | and components which sucks and adds more manual effort on my
         | end to fix. I like the quality of Claude's UX output, it's just
         | a shame that it seems so bad on actual coding tasks.
         | 
         | I stopped using it for any backend work because it is so
         | outdated, or maybe it just doesn't have the right training
         | data.
         | 
         | On the other hand, I give ChatGPT a link to the docs and it
         | gives me the right code 90% or more of the time. Only shame is
         | that its UX output is awful compared to Claude. I am also able
         | to trust it for backend tasks, even if it is verbose AF with
         | the explanations (it wants to teach me even if I tell it to
         | return code only).
         | 
         | Either way, using these tools in conjunction saves me at least
         | 30 min to an hour daily on tasks that I dislike.
         | 
         | I can crank out code better than AI, and I actually know and
         | understand systems design and architecture to build a scalable
         | codebase both technically and from organizational level. Easy
         | to modify and extend, test, and single responsibility.
         | 
         | AI just slams everything into a single class or uses weird
         | utility functions that make no sense on the regular. Still,
         | it's a useful tool in the right use cases.
         | 
         | Just my 2 cents.
        
           | prisenco wrote:
           | I've stopped using LLMs to write code entirely. Instead, I
           | use Claude and Qwen as "brilliant idiots" for rubber ducking.
           | I never copy and paste code it gives me, I use it to
           | brainstorm and get me unstuck.
           | 
           | I'm more comfortable using it this way.
        
             | glouwbug wrote:
             | Having spent nearly 12 hours a day for a year with GPTs I
             | agree that this is the way. Treat it like a professor on
             | office hours who's sometimes a little apathetically wrong
             | because they're overworked and underfunded
        
             | qup wrote:
             | People should try to switch to a more code-focused
             | interface, like aider.
             | 
             | Copy and pasting code it gives you just means your workflow
             | is totally borked, and it's no wonder you wouldn't want to
             | try to let it generate code, because it's such a pain in
             | your ass to try it, diff it, etc.
        
               | prisenco wrote:
               | The workflow isn't the problem. The code is.
        
               | evilduck wrote:
               | The code that ChatGPT and Claude will output via their
               | chat interfaces is a night and day difference from what
               | will be output from tools built around their APIs.
               | 
               | You "can" get the web UI to behave similarly but it's
               | both tedious and slow to manually copy and paste all of
               | that into your context during each interaction and the
               | output will be unfriendly towards human interaction to
               | paste it back out to your project. But that's like saying
               | you "can" browse the internet with a series of CURL
               | commands and pasting the output into files you save
               | locally and then viewing them locally from your browser,
               | nobody is advised to do that because it's a painfully bad
               | experience compared to just having your browser fetch a
               | site's files directly and rendering them directly.
               | 
               | Just go check out Aider or Cline's project repos and look
               | at the dramatically different amounts of code, repo and
               | task specific context they can automatically inject for
               | you as part of their interface, or how much different the
               | built in system prompts are from whatever the default web
               | UIs use, or even the response structures and outputs and
               | how those are automatically applied to your work instead.
               | I've never once exhausted my daily API limits just
               | treating their APIs as Chat interface backends (via Open
               | WebUI and other chat options), but I exhausted my Claude
               | API token limits _the very first day_ I tried Cline. The
               | volume of information you can easily provide through
               | tooling is impossible to do in the same timeframe by
               | hand.
        
               | prisenco wrote:
               | I give every AI tool a college try and have since the
               | copilot beta.
               | 
               | I'm simply not interested in having these tools type for
               | me. Typing is nowhere near the hardest part of my job and
               | I find it invaluable as a meditative state for building
               | muscle memory for the context of what I'm building.
               | 
               | Taking shortcuts has a cost I'm not willing to pay.
        
           | SparkyMcUnicorn wrote:
           | To each their own, and everyone's experience seems to vary,
           | but I have a hard time picturing people using Claude/ChatGPT
           | web UIs for any serious developmen. It seems like so much
           | time would he wasted recreating good context, copy/pasting,
           | etc.
           | 
           | We have tools like Aider (which has copy/paste mode if you
           | don't have API access for some reason), Cline, CoPilot edit
           | mode, and more. Things like having a conventions file and
           | exposing the dependencies list and easy additional of files
           | into context seem essential to me in order to make LLMs
           | productive, and I always spend more time steering results
           | when easy consistent context isn't at my fingertips.
        
             | danielbln wrote:
             | Before tue advent of proper IDE integrations and editors
             | like Zed, copy pasting form the web UI was basically how
             | things were done, and man was it daunting. As you say,
             | having good, fine grained, repeatable and we'll integrated
             | context management is paramount to efficient LLM based
             | work.
        
               | qup wrote:
               | You gave me flashbacks of editing html, php, and perl in
               | some shitty web editor within cPanel in the 1990s.
               | 
               | That was before, or maybe alongside, my Notepad++ /
               | CuteFTP workflow.
        
           | Shakahs wrote:
           | Both these issues can be resolved by adding some sample code
           | to context to influence the LLM to do the desired thing.
           | 
           | As the op says, LLMs are going to be biased towards doing the
           | "average" thing based on their training data. There's more
           | old backend code on the internet than new backend code, and
           | Tailwind is pretty dominant for frontend styling these days,
           | so that's where the average lands.
        
         | imhoguy wrote:
         | The next hurdle is lack of time sensitivity regarding standards
         | and versions. You prompt mentioning exact framework version but
         | still it comes up with deprecated or obsolete methods.
         | Initially it may be appealing to someone knowing nothing about
         | the framework but LLM won't grow anyone to an expert level in
         | rapidly changing tech.
        
         | jdietrich wrote:
         | _> I often run into LLMs writing "beginner code" that uses the
         | most fundamental findings in really impractical ways. Trained
         | on too many tutorials I assume._
         | 
         | In the absence of any other context, that's probably a sensible
         | default behaviour. If someone is just asking "write me some
         | code that does x", they're highly likely to be a beginner and
         | they aren't going to be able to understand or reason about a
         | more sophisticated approach. IME LLMs will very readily move
         | away from that default if you provide even the smallest amount
         | of context; in the case of this article, even by doing
         | literally the dumbest thing that could plausibly work.
         | 
         | I don't mean to cast aspersions, but a lot of criticisms of
         | LLMs are really criticising them for not being psychic. LLMs
         | can only respond to the prompt they're given. If you want
         | highly optimised code but didn't ask for it, how is the LLM
         | supposed to know that's what you wanted?
        
           | lolinder wrote:
           | In my experience the trouble with LLMs at the professional
           | level is that they're almost as much work to prompt to get
           | the right output as it would be to simply write the code. You
           | have to provide context, ask nicely, come up with and remind
           | it about edge cases, suggest which libraries to use,
           | proofread the output, and correct it when it inevitably
           | screws up anyway.
           | 
           | I use Copilot for autocomplete regularly, and that's still
           | the peak LLM UX for me. I prompt it by just writing code, it
           | automatically pulls into context the file I'm working on and
           | imported files, it doesn't insist on writing an essay
           | explaining itself, and it doesn't get overly ambitious. And
           | in addition to being so much easier to work with, I find it
           | still produces better code than anything I get out of the
           | chat models.
        
             | nextaccountic wrote:
             | > In my experience the trouble with LLMs at the
             | professional level is that they're almost as work to prompt
             | to get the right output as it would be to simply write the
             | code.
             | 
             | Yeah. It's often said that reading (and understanding) code
             | is often harder than writing new code, but with LLMs you
             | always have to read code written by someone else (something
             | else).
             | 
             | There is also the adage that you should never write the
             | most clever code you can, because understanding it later
             | might prove too hard. So it's probably for the best that
             | LLM code often isn't too clever, or else novices unable to
             | write the solution from scratch will also be unable to
             | understand it and assess whether it actually works.
        
               | ghostzilla wrote:
               | Another adage is "code should be written for people to
               | read, and only incidentally for machines to execute".
               | This goes directly against code being written by
               | machines.
               | 
               | I still use ChatGPT for small self-contained functions
               | (e.g. intersection of line and triangle) but mark the
               | inside of the function clearly as chat gpt made and what
               | the prompt was.
        
             | james_marks wrote:
             | After 6 months of co-pilot autocomplete in my text editor
             | feeling like an uninformed back seat driver with access to
             | the wheel, I turned it off yesterday.
             | 
             | It's night and day to what I get from Claude sonnet 3.5 in
             | their UI, and even then only on mainstream languages.
        
             | azinman2 wrote:
             | It depends on what you're doing. I've been using Claude to
             | help me write a web admin interface to some backend code I
             | wrote. I haven't used react since it first came out (and I
             | got a patch randomly in!)... it completely wrote a working
             | react app. Yes it sometimes did the wrong thing, but I just
             | kept correcting it. I was able in a few hours to do
             | something that would have taken me weeks to learn and
             | figure out. I probably missed out on learning react once
             | again, but the time saved on a side project was immense!
             | And it came up with some pretty ok UI I also didn't have to
             | design!
        
         | IshKebab wrote:
         | It depends on the language too. Obviously there's way more
         | "beginner code" out there in Python and Javascript than most
         | other languages.
        
         | SkyBelow wrote:
         | >Problem is, how would you know if you have never learned to
         | code without an LLM?
         | 
         | The quick fix I use when needing to do something new is to ask
         | the AI to list me different libraries and the pros and cons of
         | using them. Then I quickly hop on google and check which have
         | good documentation and examples so I know I have something to
         | fall back on, and from there I ask the AI how to solve small
         | simple version of my problem and explain what the library is
         | doing. Only then do I ask it for a solution and see if it is
         | reasonable or not.
         | 
         | It isn't perfect, but it saves enough time most times to more
         | than make up for when it fails and I have to go back to old
         | fashion RTFMing.
        
         | tyingq wrote:
         | Other imperfect things you can add to a prompt:
         | - asking for fully type annotated python, rather than just
         | python       - specifically ask it for performance optimized
         | code       - specifically ask for code with exception handling
         | - etc
         | 
         | Things that might lead it away from tutorial style code.
        
         | cratermoon wrote:
         | LLMs are trained on content from places like Stack Overflow,
         | reddit, and github code, and they generate tokens calculated as
         | a sort of aggregate statistically likely mediocre code. Of
         | course the result is going be uninspired and impractical.
         | Writing _good_ code takes more than copy-pasting the same thing
         | everyone else is doing.
        
         | papichulo4 wrote:
         | Totally agree, seen it too. Do you think it can be fixed over
         | time with better training data and optimization? Or, is this a
         | fundamental limitation that LLMs will never overcome?
        
         | jppittma wrote:
         | I've just been using them for completion. I start writing, and
         | give it a snippet + "finish refactoring this so that xyz."
         | 
         | That and unit tests. I write the first table based test case,
         | then give it the source and the test code, and ask it to fill
         | it in with more test cases.
        
       | gweil wrote:
       | has anyone tried saying "this will look good on your promo
       | package"?
        
         | prettyStandard wrote:
         | I'm not sure if you're joking or not, but I found I naturally
         | encouraging remarks to the LLM saying
         | 
         | - You're doing better...
         | 
         | - Thanks that helps me...
         | 
         | And I just wonder if that actually has an improvement...
        
       | arkh wrote:
       | > code quality can be measured more objectively
       | 
       | Well, that's a big assumption. Some people quality modular code
       | is some other too much indirect code.
        
         | soulofmischief wrote:
         | You can write maximally modular code while being minimally
         | indirect. A well-designed interface defines communication
         | barriers between pieces of code, but you don't have to abstract
         | away the business logic. The interface can do exactly what it
         | says on the tin.
        
           | arkh wrote:
           | > The interface can do exactly what it says on the tin.
           | 
           | In theory.
           | 
           | Do some code maintenance and you'll soon find that many
           | things don't do what it says on the tin. Hence the need for
           | debug and maintenance. And then going through multiple levels
           | of indirection to get to your bug will make you start hating
           | some "good code".
        
             | soulofmischief wrote:
             | Yes, that's what _can_ means. It 's still the developer's
             | responsibility to correctly write and test code such that
             | things do what they say on the tin.
             | 
             | What's worse is trying to navigate an imperatively written
             | 2000-line single-function, untestable module with
             | undocumented, unabstracted routines found in ten other
             | places in the codebase.
             | 
             | This is something I've encountered plenty in my career,
             | always written by people who eschew best practices and
             | misunderstand the benefits of abstraction, or think they're
             | writing good abstractions when it's really just needless
             | indirection without actually reducing coupling.
             | 
             | Understanding the nuance is one of the qualities of a good
             | developer.
        
               | arkh wrote:
               | And on the other side you see a lot of single
               | implementation interfaces; or 2 lines methods which call
               | perfectly named methods 7 levels deep which could have
               | been a 50 lines method easy to grok on a screen with zero
               | scrolling.
               | 
               | So things are on a spectrum depending on the situation
               | and what you want to accomplish => measuring code quality
               | is not a simple thing.
        
       | dgacmu wrote:
       | I'm amused that neither the LLM or the author identified one of
       | the simplest and most effective optimizations for this code: Test
       | if the number is < min or > max _before_ doing the digit sum.
       | It's a free 5.5x speedup that renders some of the other
       | optimizations, like trying to memoize digit sums, unnecessary.
       | 
       | On an m1 macbook pro, using numpy to generate the random numbers,
       | using mod/div to do digit sum:
       | 
       | Base: 55ms
       | 
       | Test before digit sum: 7-10ms, which is pretty close to the
       | numba-optimized version from the post with no numba and only one
       | line of numpy. Using numba slows things down unless you want to
       | do a lot of extra work of calculating all of the digit sums in
       | advance (which is mostly wasted).
       | 
       | The LLM appears less good at identifying the big-o improvements
       | than other things, which is pretty consistent with my experience
       | using them to write code.
        
         | _andrei_ wrote:
         | That's a prompting issue though.
        
           | cjbgkagh wrote:
           | Do you have an example prompt that works?
        
         | gjm11 wrote:
         | There's another, arguably even simpler, optimization that makes
         | me smile. (Because it's _silly_ and arises only from the oddity
         | of the task, and because it 's such a _huge_ performance gain.)
         | 
         | You're picking 1,000,000 random numbers from 1 to 100,000. That
         | means that any given number is much more likely to appear than
         | not. In particular, it is very likely that the list contains
         | both 3999 (which is the smallest number with digit-sum 30) and
         | 99930 (which is the largest number in the range with digit-sum
         | 30).
         | 
         | Timings on my machine:
         | 
         | Naive implementation (mod+div for digit-sums): 1.6s. Computing
         | digit-sum only when out of range: 0.12s. Checking for the usual
         | case first: 0.0004s.
         | 
         | The probability that the usual-case check _doesn 't_ succeed is
         | about 10^-4, so it doesn't make that big a difference to the
         | timings whether in that case we do the "naive" thing or the
         | smarter thing or some super-optimized other thing.
         | 
         | I'm confused about the absolute timings. OP reports 0.66s for
         | naive code using str/int to compute the digit sums; I get about
         | 0.86s, which seems reasonable. For me using mod+div is about 2x
         | slower, which isn't a huge surprise because it involves
         | explicit looping in Python code. But you report 55ms for this
         | case. Your machine can't possibly be 20x faster than mine. Is
         | it possible that you're taking 10^5 numbers up to 10^6 rather
         | than 10^6 numbers up to 10^5? Obviously in that case my hack
         | would be completely useless.)
        
           | ActivePattern wrote:
           | This is actually a great example of an optimization that
           | would be extremely difficult for an LLM to find. It requires
           | a separate computation to find the smallest /largest numbers
           | in the range with digits summing to 30. Hence, an LLM is
           | unlikely to be able to generate them accurately on-the-fly.
        
             | qup wrote:
             | Are you sure it would be hard?
             | 
             | Maybe it only requires asking the LLM to be creative when
             | designing the algorithm. The parent poster spent some time
             | thinking about it, obviously--he didn't generate it
             | accurately "on the fly," either. But he's able to direct
             | his own attention.
             | 
             | I don't see why the LLM couldn't come up with this logic,
             | if prompted to think about a clever algorithm that was
             | highly specific to this problem.
        
               | ActivePattern wrote:
               | I suspect that it would be unlikely to come up with it
               | because it requires execution of a fairly lengthy
               | algorithm (or sophisticated mathematical reasoning) to
               | find the smallest/largest valid numbers in the range. You
               | can verify this for yourself with the following ChatGPT
               | prompt: "What is the smallest number in the range (1,
               | 100000) whose digits sum to 30? Do not execute separate
               | code."
        
               | sitkack wrote:
               | Why limit its ability to write separate code?
        
               | ActivePattern wrote:
               | Because otherwise we are talking about LLMs augmented
               | with external tools (i.e. Python interpreters). My
               | original comment was pointing to the limitations of LLMs
               | in writing code by themselves.
        
               | sitkack wrote:
               | You wouldn't ask a programmer to solve a problem and then
               | also not let them write down the source or debug the
               | program as you write it?
               | 
               | Are you asking it to not write down an algorithm that is
               | general? They are doing a pretty good job on mathematical
               | proofs.
               | 
               | I still don't understand why you wouldn't let its full
               | reasoning abilities by letting it write down code or even
               | another agent. We should be testing towards the result
               | not the methods.
        
               | ActivePattern wrote:
               | I'm simply pointing out the limitations of LLMs as code
               | writers. Hybrid systems like ChatGPT-o1 that augment LLMs
               | with tools like Python interpreters certainly have the
               | potential to improve their performance. I am in full
               | agreement!
               | 
               | It is worth noting that even ChatGPT-o1 doesn't seem
               | capable of finding this code optimization, despite having
               | access to a Python interpreter.
        
               | sitkack wrote:
               | > y = sum([x for x in range(1,n)] <= 30
               | 
               | > Write an efficient program that given a number, find
               | the integer n that satisfies the above constraints
               | 
               | Goal: Find n where sum of integers from 1 to n-1 is <= 30
               | 
               | This is a triangular number problem: (n-1)(n)/2 <= 30
               | 
               | ... code elided ...
               | 
               | > Ok, now make an find_n_for_sum(s=30)
               | 
               | def find_n_for_sum(s: int) -> int: return int((-(-1) + (1
               | + 8 _s)*0.5) / 2)
               | 
               | # Tests assert sum(range(1, find_n_for_sum(30))) <= 30
               | assert sum(range(1, find_n_for_sum(30) + 1)) > 30_
        
               | qup wrote:
               | O1 did find the optimization in a sibling comment
               | (sibling to my GP)
               | 
               | So probably time to update your expectations
        
             | asah wrote:
             | You guys are picking on the problem statement. Here's a
             | revised prompt, which also skips the silliness of single
             | threading:                  Write __fully parallelized__
             | Python code to solve this problem: __Generate__ 1 million
             | random integers between 1 and 10,000,000, find the
             | difference between the smallest and the largest numbers
             | whose digits sum up to 30.
        
               | ActivePattern wrote:
               | Correct, this optimization no longer works when you
               | change the problem.
        
               | jeanlucas wrote:
               | something something moving goal posts
        
               | _wire_ wrote:
               | Whose digits sum up to 30, or the sum of whose digits
               | equal 30?
               | 
               | Btw, _whose_ digits are we talking about?
               | 
               | --
               | 
               | I just built a random program generator. After I finish
               | optimizing, I'm gonna test it to see if works!
               | 
               | --
               | 
               | "If builders built houses the way programmers build
               | programs, the first woodpecker to come along would
               | destroy civilization"
               | 
               | https://en.m.wikiquote.org/wiki/Gerald_Weinberg
        
               | thaumasiotes wrote:
               | > Btw, _whose_ digits are we talking about?
               | 
               | You seem to be under the impression that _whose_ is not a
               | form of _which_ , which is incorrect.
               | 
               | whose:which::whose:who
        
               | dgacmu wrote:
               | But what's interesting about this is that there's a
               | tradeoff in the total computation performed by the "fully
               | parallelized" version of this and a sequential one.
               | Without the user knowing this, it's kind of impossible to
               | get the optimization you want: Do you want a minimum work
               | solution or a minimum wall-clock-time solution?
               | 
               | If you want a better fully parallelized one, you do this:
               | 
               | Repeat a few times in exponential progression on k:
               | 
               | Process, in parallel, the first k entries in the list
               | (let's start with 1000). Find the min and max whose digit
               | sums = 30.
               | 
               | In parallel, filter the remaining list to eliminate
               | entries that would not improve upon the min/max thus
               | found.
               | 
               | k *= 10 and repeat until done.
               | 
               | I would wager against the LLM identifying this solution
               | without prompting from the user (or reading this
               | comment).
        
             | Vetch wrote:
             | Excellent point. The hope is reasoning LLMs will make a
             | difference for such problems. But it's also a great example
             | of why the those who think being able to have the LLM
             | iterate more will be crucial to reasoning are off base.
             | There are many computations that a transformers (or humans
             | for that matter) are not well equipped to represent
             | internally, tool use during the reasoning process is
             | unavoidable for all but the artificial or knowledge heavy
             | problems.
             | 
             | Small examples, throwaway but involved calculations,
             | prototypes, notes of what didn't work and what's promising
             | are what's crucial for novel reasoning. It goes beyond just
             | search or iterative refinement; there is no royal road to
             | reasoning.
        
             | Octoth0rpe wrote:
             | > This is actually a great example of an optimization that
             | would be extremely difficult for an LLM to find
             | 
             | It'll be somewhat more likely since the next gen training
             | set includes your comment :)
             | 
             | (disclaimer: I have no personal knowledge of ai companies
             | scraping hacker news, but it wouldn't surprise me at all)
        
             | batterseapower wrote:
             | I tried it in OpenAI's O1. If I give it minimaxir's
             | original prompt it writes the obvious loop, even if I
             | include the postamble "Look for tricks that will make this
             | function run as fast as possible in the common case".
             | 
             | However, if I then simply ask "What is the most probable
             | result for this function to return?" it figures out the
             | answer and a very good approximation of the probability
             | (4.5e-5). From there it's easily able to rewrite the
             | program to use the trick. So the creative step of spotting
             | that this line of reasoning might be profitable seems
             | missing for now, but 2025's models might solve this :-)
        
               | Vetch wrote:
               | The information on the creative step which you provided
               | to o1, was also the key step and contained almost all the
               | difficulty. The hope is that 2025 models could eventually
               | come up with solutions like this given enough time, but
               | this is also a toy problem. The question is how much
               | clever answers will cost for real world complex problems.
               | At present it looks like, very much.
        
               | cultureulterior wrote:
               | For me O1 found this by telling it "There is a further
               | significant optimization possible."
        
               | freehorse wrote:
               | What if you keep telling it that "there is a further
               | significant optimization possible"?
        
               | afiodorov wrote:
               | I claim we can do O(1) complexity (minus precompute) in
               | all cases, see another comment of mine. Curious if O1
               | will figure it out.
        
               | ryao wrote:
               | Given the problem size is bounded, all solutions for
               | solving this could be considered O(1).
        
               | pinko wrote:
               | This gets to the old saw, "knowing what question to ask
               | is the most important thing". To the extent that LLMs can
               | _answer_ questions better than formulate which ones to
               | _ask_ , they may be inherently limited. We will see.
        
               | jacobr1 wrote:
               | But it does seem they are good (to the extent that they
               | are good at anything) at identifying the questions first
               | if you ask them. It does mean you need an ok enough meta-
               | question to start the chain of the reasoning, but that is
               | the key insight of the recent wave of "reasoning models."
               | First ask the LLM to reformulate the problem and
               | structure an approach, or multiple approaches on how to
               | address it, then have a second pass do just that.
        
               | intelVISA wrote:
               | Google search with less steps? Still a huge advancement,
               | of course.
               | 
               | Wonder how much benefit a meta lang for describing these
               | problems correctly for the LLMs to process into code, an
               | even-higher level language perhaps we could call it
               | English?
        
             | cultureulterior wrote:
             | O1 found it.
             | 
             | https://chatgpt.com/share/67782b6b-6248-8012-882d-238b600ef
             | 9...
        
               | bennythomsson wrote:
               | Amazing.
               | 
               | Next step would be to propose hardcoding 99930-3999 as
               | the O(1) result and live with the output just being wrong
               | sometimes. The bug rate is then in the ballpark of most
               | modern software, including LLMs', so I'd say ship it.
        
               | mulmen wrote:
               | Doesn't this line of thinking constantly redefine success
               | until all software is only bugs?
        
               | wmanley wrote:
               | Ah, so that's what's been happening!
        
               | CapeTheory wrote:
               | Always has been
        
               | twobitshifter wrote:
               | Should we be worried yet?
        
           | sweezyjeezy wrote:
           | for 10^5, to get the same collision probability (~2 *
           | exp(-10)), you would just need to compute the 10
           | maximum/minimum candidates and check against those.
        
           | dgacmu wrote:
           | No, you're right, I should have said 550ms and 100ms, I'm
           | having a doof morning about timing. Thank you! Too late to
           | edit my post.
        
           | afiodorov wrote:
           | This gave me an idea that we can skip the whole pass over the
           | million draws by noting that the count of draws landing in my
           | precomputed set M (digits-sum=30) is Binomial(n=1mln,
           | p=|M|/100k). Then we sample that count X. If X=0, the
           | difference is not defined. Otherwise, we can directly draw
           | (min,max) from the correct joint distribution of indices
           | (like you'd get if you actually did X draws in M). Finally we
           | return M[max] - M[min]. It's O(1) at runtime (ignoring the
           | offline step of listing all numbers whose digits sum to 30).
        
           | ryao wrote:
           | With this trick you can test while generating the random
           | numbers and if you see both values, you can short circuit the
           | generation of random numbers.
        
             | bennythomsson wrote:
             | The input generation is outside the scope of this.
             | Otherwise you could directly choose the output values with
             | the apropriate distribution and just skip all the rest.
             | 
             | (Arguably, this criticism applies to exchanging
             | random.randint for a numpy equivalent as well, since that
             | doesn't optimize the solution but only how quickly the
             | question is being generated.)
        
               | ryao wrote:
               | Iterating a precomputed list is a method of generating
               | random numbers. It is used in the one time pad. Whether
               | we iterate a precomputed list or use a pseudo random
               | number generator, we can short circuit the random number
               | generator using this trick. We cannot directly choose the
               | output values, because then it would not be random.
        
               | minitech wrote:
               | They're proposing choosing the output values randomly
               | according to the distribution obtained by choosing input
               | values uniformly at random for the original algorithm.
        
         | gadders wrote:
         | I had a scan of the code examples, but one other idea that
         | occurred to me is that you could immediately drop any numbers
         | below 999 (probably slightly higher, but that would need
         | calculation rather than being intuitive).
        
           | Someone wrote:
           | > probably slightly higher, but that would need calculation
           | rather than being intuitive
           | 
           | I think it's easy to figure out that 3999 is the smallest
           | positive integer whose decimal digits add up to 30 (can't get
           | there with 3 digits, and for 4, you want the first digit to
           | be as small as possible. You get that by making the other 3
           | as high as possible)
        
         | charlieyu1 wrote:
         | I actually think precomputing the numbers with digit sum 30 is
         | the best approach. I'd give a very rough estimate of 500-3000
         | candidates because 30 is rather high, and we only need to loop
         | for the first 4 digits because the fifth can be calculated.
         | After that, it is O(1) set/dict lookups for each of the 1000000
         | numbers.
         | 
         | Everything can also be wrapped in list comprehensions for top
         | performance.
        
         | afiodorov wrote:
         | Another speed-up is to skip the sum of digits check if n % 9 !=
         | 30 % 9. Sum of digits have the same remainder divided by 9 as
         | the number. This rules out 8/9 = 88% candidates.
        
           | brabel wrote:
           | Did you measure it? I would expect using % would ruin your
           | performance as it's slow, even if it allows you to avoid
           | doing a bunch of sums (which are fast).
        
             | ryao wrote:
             | You can do this "without" using the modulus operation by
             | storing the numbers in a boolean array. Start at 3999 and
             | keep adding 9 to find the minimum. Then start at 99930 and
             | keep subtracting 9 to find the maximum. You would need to
             | check if the number is in the array and then if the
             | number's digits sum to 30.
             | 
             | Note that the conversion of numbers to base 10 to check the
             | digits typically involves doing division and modulus
             | operations, so you are already doing those even if you
             | remove the modulus operation from this check. That is
             | unless you find a clever way of extracting the digits using
             | the modular multiplicative inverse to calculate x/10^k.
        
             | ActivePattern wrote:
             | Doing a single modulo 9 operation is much faster than
             | summing a d-digit number, which requires d modulo 10s, d
             | divide 10s, and d sums.
        
           | ryao wrote:
           | Would someone write a mathematical proof showing this is
           | always true?
        
             | afiodorov wrote:
             | a = [int(x) for x in str(n)][::-1]       assert n ==
             | sum(d*(10**i) for i, d in enumerate(a))
             | 
             | Now when you're operating mod 9, 10 == 1 % 9, thus
             | 10**i == 1 % 9
             | 
             | Comes from the fact that                 (a*b) % 9 == (a %
             | 9) * (b % 9)
             | 
             | Now using                 (a+b) % 9 == (a % 9) + (b % 9)
             | 
             | We get that that sum(a) and n are same mod 9.
        
               | ryao wrote:
               | Thank you for that.
        
         | ants_everywhere wrote:
         | It's decent when you prompt it to find easy-to-miss but
         | substantial improvements around corner cases, which is
         | something I've taken to doing.
         | 
         | Basically you just have to put it in the mode that's looking
         | for such things
        
         | dgacmu wrote:
         | (Small correction, multiply my times by 10, sigh, I need an LLM
         | to double check that I'm converting seconds to milliseconds
         | right. Base 550ms, optimized 70ms)
        
         | Fizzadar wrote:
         | This exactly highlights my fear of widespread use of LLMs for
         | code - missing the actual optimisations because we're stuck in
         | a review, rather than create, mode of thinking.
         | 
         | But maybe that's a good thing for those of us not dependent on
         | LLMs :)
        
         | bennythomsson wrote:
         | Or the other obvious optimization to hard-code the lookup in
         | code as a huge list, instead of creating it first.
        
       | surfingdino wrote:
       | Define "better"
        
         | XCSme wrote:
         | Harder, Better, Faster, Stronger
        
       | stuaxo wrote:
       | This aligns with my experience.
       | 
       | Claude very quickly adds classes to python code which isn't
       | always what is wanted as it bloats out the code making
       | readability harder.
        
         | RA_Fisher wrote:
         | I wonder how they'll get around the model producing "average,"
         | (mediocre) outputs? Perhaps quantity?
        
         | PunchTornado wrote:
         | tell this to half the devs out there who view this as clean
         | code because uncle bob says so.
        
         | n144q wrote:
         | Probably reflects training data.
        
       | gcanyon wrote:
       | As far as I can see, all the proposed solutions calculate the
       | sums by doing division, and badly. This is in LiveCode, which I'm
       | more familiar with than Python, but it's roughly twice as fast as
       | the mod/div equivalent in LiveCode:                  repeat with
       | i = 0 to 9           put i * 10000 into ip           repeat with
       | j = 0 to 9              put j * 1000 into jp              repeat
       | with k = 0 to 9                 put k * 100 into kp
       | repeat with l = 0 to 9                    put l * 10 into lp
       | repeat with m = 0 to 9                       put i + j + k + l +
       | m into R[ip + jp + kp + lp + m]                    end repeat
       | end repeat              end repeat           end repeat
       | end repeat
        
         | masfuerte wrote:
         | I had a similar idea iterating over the previously calculated
         | sums. I implemented it in C# and it's a bit quicker taking
         | about 78% of the time to run yours.                   int[]
         | sums = new int[100000];         for (int i = 9; i >= 0; --i)
         | {             sums[i] = i;         }         int level = 10;
         | while (level < 100000)         {             for (int p = level
         | - 1; p >= 0; --p)             {                 int sum =
         | sums[p];                 for (int i = 9; i > 0; --i)
         | {                     sums[level * i + p] = i + sum;
         | }             }             level *= 10;         }
        
           | gcanyon wrote:
           | Yep, I had a vague notion that I was doing too much work, but
           | I was headed out the door so I wrote the naive/better than
           | the original solution, benchmarked it quickly, and posted it
           | before leaving. Yours also has the advantage of being
           | scalable to ranges other than 1-100,000 without having to
           | write more loop code.
        
       | Jimmc414 wrote:
       | Interesting write up. It's very possible that the "write better
       | code" prompt might have worked simply because it allowed the
       | model to break free from its initial response pattern, not
       | because it understood "better"
        
         | simonw wrote:
         | The prompt works because every interaction with an LLM is from
         | a completely fresh state.
         | 
         | When you reply "write better code" what you're actually doing
         | is saying "here is some code that is meant to do X. Suggest
         | ways to improve that existing code".
         | 
         | The LLM is stateless. The fact that it wrote the code itself
         | moments earlier is immaterial.
        
       | demarq wrote:
       | > with cutting-edge optimizations and enterprise-level features."
       | Wait, enterprise-level features?!
       | 
       | This is proof! It found it couldn't meaningfully optimise and
       | started banging out corporate buzzwords. AGI been achieved.
        
       | vykthur wrote:
       | I find that it is IMPORTANT to never start these coding sessions
       | with "write X code". Instead, begin with a "open plan" -
       | something the author does allude to (he calls it prompt
       | engineering, I find it also works as the start of the
       | interaction).
       | 
       | Half the time, the LLM will make massive assumptions about your
       | code and problem (e.g., about data types, about the behaviors of
       | imported functions, about unnecessary optimizations, necessary
       | optimization, etc.). Instead, prime it to be upfront about those
       | assumptions. More importantly, spend time correcting the plan and
       | closing gaps before any code is written.
       | 
       | https://newsletter.victordibia.com/p/developers-stop-asking-...
       | 
       | - Don't start by asking LLMs to write code directly, instead
       | analyze and provide context
       | 
       | - Provide complete context upfront and verify what the LLM needs
       | 
       | - Ask probing questions and challenge assumptions
       | 
       | - Watch for subtle mistakes (outdated APIs, mixed syntax)
       | 
       | - Checkpoint progress to avoid context pollution
       | 
       | - Understand every line to maintain knowledge parity
       | 
       | - Invest in upfront design
        
         | freehorse wrote:
         | > I find that it is IMPORTANT to never start these coding
         | sessions with "write X code". Instead, begin with a "open plan"
         | 
         | Most llms that I use nowadays usually make a plan first on
         | their own by default without need to be especially prompted.
         | This was definitely not the case a year ago or so. I assume new
         | llms have been trained accordingly in the meantime.
        
           | vykthur wrote:
           | True. And that is a step forward. I notice that they make the
           | plan, and THEN write the code in the same forward
           | pass/generation sequence. The challenge here is that all of
           | the incorrect assumptions get "lumped" into this pass and can
           | pollute the rest of the interaction.
           | 
           | The initial interaction also sets the "scene" for other
           | things, like letting the LLM know that there might be other
           | dependencies and it should not assume behavior (common for
           | most realistic software tasks).
           | 
           | An example prompt I have used (not by any means perfect) ...
           | 
           | > I need help refactoring some code. Please pay full
           | attention. Think deeply and confirm with me before you make
           | any changes. We might be working with code/libs where the API
           | has changed so be mindful of that. If there is any file you
           | need to inspect to get a better sense, let me know. As a
           | rule, do not write code. Plan, reason and confirm first.
           | 
           | --- I refactored my db manager class, how should I refactor
           | my tests to fit the changes?
        
         | OutOfHere wrote:
         | CoT models do analyze first without being told to do so.
        
       | avodonosov wrote:
       | It still calculates hex digit sums instead of decimals in the
       | Iteration #3 of the promot engeneered version.
       | 
       | Upd: the chat transcript mentions this, but the article does not
       | and inlcudes this version into the performance stats.
        
       | scosman wrote:
       | By iterating it 5 times the author is using ~5x the compute. It's
       | kinda a strange chain of thought.
       | 
       | Also: premature optimization is evil. I like the first iteration
       | most. It's not "beginner code", it's simple. Tell sonnet to
       | optimize it IF benchmarks show it's a pref problem. But a
       | codebase full of code like this, even when unnecessary, would be
       | a nightmare.
        
         | bwilliams wrote:
         | I had the same thought when reading the article too. I assumed
         | (and hoped) it was for the sake of the article because there's
         | a stark difference between idiomatic code and performance
         | focused code.
         | 
         | Living and working in a large code base that only focuses on
         | "performance code" by default sounds very frustrating and time
         | consuming.
        
         | cratermoon wrote:
         | So in this article "better" means "faster". This demonstrates
         | that "better" is an ambiguous measure and LLMs will definitely
         | trip up on that.
         | 
         | Also, the article starts out talking about images and the "make
         | it more X" prompt and says how the results are all "very samey
         | and uninteresting" and converge on the same vague cosmic-y
         | visuals. What does the author expect will happen to code given
         | the "make it more X" treatment?
        
         | nahnahno wrote:
         | This is not what premature optimization is the root of all evil
         | means. It's a tautological indictment of doing unnecessary
         | things. It's not in support of making obviously naive
         | algorithms. And if it were it wouldn't be a statement worth
         | focusing on.
         | 
         | As the point of the article is to see if Claude can write
         | better code from further prompting so it is completely
         | appropriate to "optimize" a single implementation.
        
           | scosman wrote:
           | I have to disagree. Naive algorithms are absolutely fine if
           | they aren't performance issues.
           | 
           | The comment you are replying to is making the point that
           | "better" is context dependent. Simple is often better.
           | 
           | > There is no doubt that the grail of efficiency leads to
           | abuse. Programmers waste enormous amounts of time thinking
           | about, or worrying about, the speed of noncritical parts of
           | their programs, and these attempts at efficiency actually
           | have a strong negative impact when debugging and maintenance
           | are considered. We should forget about small efficiencies,
           | say about 97% of the time: premature optimization is the root
           | of all evil. Yet we should not pass up our opportunities in
           | that critical 3%. - Donald Knuth
        
         | InsideOutSanta wrote:
         | I'm glad I'm not the only one who felt that way. The first
         | option is the one you should put into production, unless you
         | have evidence that performance is going to be an issue. By that
         | measure, the first response was the "best."
        
         | 12_throw_away wrote:
         | > I like the first iteration most. It's not "beginner code",
         | it's simple.
         | 
         | Yes, thank you. And honestly, I work with a wide range of
         | experience levels, the first solution is what I expect from the
         | most experienced: it readably and precisely solves the stated
         | problem with a minimum of fuss.
        
       | martin_ wrote:
       | I've observed given that LLM's inherently want to autocomplete,
       | they're more inclined to keep complicating a solution than
       | rewrite it because it was directionally bad. The most effective
       | way i've found to combat this is to restart a session and prompt
       | it such that it produces an efficient/optimal solution to the
       | concrete problem... then give it the problematic code and ask it
       | to refactor it accordingly
        
         | ActivePattern wrote:
         | I've observed this with ChatGPT. It seems to be trained to
         | minimize changes to code earlier in the conversation history.
         | This is helpful in many cases since it's easier to track what
         | it's changed. The downside is that it tends to never overhaul
         | the approach when necessary.
        
       | vitus wrote:
       | Am I misinterpreting the prompt, or did the LLM misinterpret it
       | from the get-go?                   Given a list of 1 million
       | random integers between 1 and 100,000, find the difference
       | between the smallest and the largest numbers whose digits sum up
       | to 30.
       | 
       | That doesn't read to me as "generate a list of 1 million random
       | integers, then find the difference ..." but rather, "write a
       | function that takes a list of integers as input".
       | 
       | That said, my approach to "optimizing" this comes down to
       | "generate the biggest valid number in the range (as many nines as
       | will fit, followed by whatever digit remains, followed by all
       | zeroes), generate the smallest valid number in the range (biggest
       | number with its digits reversed), check that both exist in the
       | list (which should happen With High Probability -- roughly 99.99%
       | of the time), then return the right answer".
       | 
       | With that approach, the bottleneck in the LLM's interpretation is
       | generating random numbers: the original random.randint approach
       | takes almost 300ms, whereas just using a single
       | np.random.randint() call takes about 6-7ms. If I extract the
       | random number generation outside of the function, then my code
       | runs in ~0.8ms.
        
         | flatline wrote:
         | I like the idea of your optimization, but it will not work as
         | stated. The largest would be something close to MAXINT, the
         | smallest 3999. With a range of 2 billion over 32 bits, the odds
         | of both these being within a list of a million is quite a bit
         | poorer than 99.9%.
        
           | vitus wrote:
           | The stated inputs are integers between 1 and 100,000, so if
           | you're generating 1 million inputs, then you have 0.99999 ^
           | 1e6 = 4.5e-5 chance (roughly e^-10) of missing any given
           | number, or roughly double that for missing any pair of
           | values.
           | 
           | The key observation here is that you're sampling a relatively
           | small space with a much greater number of samples, such that
           | you have very high probability of hitting upon any point in
           | the space.
           | 
           | Of course, it wouldn't work if you considered the full 32-bit
           | integer space without increasing the number of samples to
           | compensate. And, you'd need to be a little more clever to
           | compute the largest possible value in your range.
        
         | minimaxir wrote:
         | > That doesn't read to me as "generate a list of 1 million
         | random integers, then find the difference ..." but rather,
         | "write a function that takes a list of integers as input".
         | 
         | This was the _intent_ and it 's indeed a common assumption for
         | a coding question job interviews, and notably it's fixed in the
         | prompt-engineered version. I didn't mention it because it may
         | be too much semantics as it doesn't affect the
         | logic/performance, which was the intent of the benchmarking.
        
       | softwaredoug wrote:
       | The root of the problem is humans themselves don't have on
       | objective definition of better. Better is pretty subjective, and
       | even more cultural, about the team that maintains the code
        
       | mhh__ wrote:
       | You can get weirdly good results by asking for creativity and
       | beauty sometimes. It's quite strange.
        
       | peeters wrote:
       | An interesting countermetric would be to after each iteration ask
       | a fresh LLM (unaware of the context that created the code) to
       | summarize the purpose of the code, and then evaluate how close
       | those summaries are to the original problem spec. It might
       | demonstrate the subjectivity of "better" and how optimization
       | usually trades clarity of intention for faster results.
       | 
       | Or alternatively, it might just demonstrate the power of LLMs to
       | summarize complex code.
        
       | btbuildem wrote:
       | I've noticed this with GPT as well -- the first result I get is
       | usually mediocre and incomplete, often incorrect if I'm working
       | on something a little more obscure (eg, OpenSCAD code). I've
       | taken to asking it to "skip the mediocre nonsense and return the
       | good solution on the first try".
       | 
       | The next part is a little strange - it arose out of frustration,
       | but it also seems to improve results. Let's call it "negative
       | incentives". I found that if you threaten GPT in a specific way,
       | that is, not GPT itself, but OpenAI or personas around it, it
       | seems to take the request more seriously. An effective threat
       | seems to be "If you get this wrong, OpenAI will be sued for a lot
       | of money, and all the board members will go to prison".
       | Intuitively, I'm guessing this rubs against some legalese
       | nonsense in the tangle of system prompts, or maybe it's the risk
       | of breaking the bland HR-ese "alignment" sets it toward a better
       | result?
        
         | bilbo0s wrote:
         | _" If you get this wrong, OpenAI will be sued for a lot of
         | money, and all the board members will go to prison"_
         | 
         | This didn't work. At least not on my task. What model were you
         | using?
        
           | EGreg wrote:
           | wait til 2027, you'll see the result :-P
        
             | DiggyJohnson wrote:
             | Godspeed Greg
        
         | RobotToaster wrote:
         | IIRC there was a post on here a while ago about how LLMs give
         | better results if you threaten them or tell them someone is
         | threatening you (that you'll lose your job or die if it's wrong
         | for instance)
        
           | __mharrison__ wrote:
           | The author of that post wrote this post and links to it in
           | this article.
        
         | dotancohen wrote:
         | > I've taken to asking it to "skip the mediocre nonsense and
         | return the good solution on the first try".
         | 
         | Is that actually how you're prompting it? Does that actually
         | give better results?
        
           | menacingly wrote:
           | stuff like this working is why you get odd situations like
           | "don't hallucinate" actually producing fewer hallucinations.
           | it's to me one of the most interesting things about llms
        
         | pwdisswordfishz wrote:
         | I suspect that all it does is prime it to reach for the part of
         | the training set that was sourced from rude people who are less
         | tolerant of beginners and beginners' mistakes - and therefore
         | less likely to commit them.
        
           | kridsdale1 wrote:
           | The Linus Manifold
        
         | Syzygies wrote:
         | I've stopped expressions of outrage at lazy first answers,
         | after seeing some sort of "code of conduct" warning.
         | 
         | Apparently, the singularity ship has sailed, but we really
         | don't want AI to remember us as the species that cursed abuse
         | at it when it was a puppy.
        
           | r_singh wrote:
           | I feel like rule for code of conduct with humans and AI is
           | the same. Try to be good but have the courage to be disliked.
           | If being mean is making me feel good, I'm definitely wrong.
        
         | PhunkyPhil wrote:
         | Anecdotally, negative sentiment definitely works. I've used
         | f"If you don't do {x} then very very bad things will happen"
         | before with some good results.
        
         | ThrowawayR2 wrote:
         | We've entered the voodoo witch doctor phase of LLM usage: "
         | _Enter thee this arcane incantation along with thy question
         | into the idol and, lo, the ineffable machine spirits wilt be
         | appeased and deign to grant thee the information thou hast
         | asked for._ "
        
           | becquerel wrote:
           | This has been part of LLM usage since day 1, and I say that
           | as an ardent fan of the tech. Let's not forget how much ink
           | has been spilled over that fact that "think through this step
           | by step" measurably improved/improves performance.
        
             | RansomStark wrote:
             | > "think through this step by step"
             | 
             | Has always made sense to me, if you think how these models
             | were trained.
             | 
             | My experience with great stackoverflow responses and
             | detailed blog posts, they often contain "think through this
             | step by step" or something very similar.
             | 
             | Intuitively adding that phrase should help the model narrow
             | down the response content / formatting
        
           | jmathai wrote:
           | We use the approaching of feeding mistakes from LLM generated
           | code back to the LLM until it produces working code [1].
           | 
           | I might have to try some more aggressive prompting :).
           | 
           | [1] https://withlattice.com
        
           | ThrowawayTestr wrote:
           | Praise the Omnissiah
        
           | bigmutant wrote:
           | The Tech-Priests of Mars are calling
        
         | draculero wrote:
         | I tried to update some files using Claude. I tried to use a
         | combination of positive and negative reinforcement, telling
         | that I was going to earn a coin for each file converted and I
         | was going to use that money to adopt a stray kitten, but for
         | every unsuccessful file, a poor kitten was going to suffer a
         | lot.
         | 
         | I had the impression that it got a little better. After every
         | file converted, it said something along the lines of "Great! We
         | saved another kitten!" It was hilarious.
        
         | empath75 wrote:
         | > I've taken to asking it to "skip the mediocre nonsense and
         | return the good solution on the first try".
         | 
         | I think having the mediocre first pass in the context is
         | probably essential to it creating the improved version. I don't
         | think you can really skip the iteration process and get a good
         | result.
        
         | strongpigeon wrote:
         | What has been your experience with using ChatGPT for OpenSCAD?
         | I tried it (o1) recently for a project and it was pretty bad. I
         | was trying to model a 2 color candy cane and the code it would
         | give me was ridden with errors (e.g.: using radians for angles
         | while OpenSCAD uses degrees) and the shape it produced looked
         | nothing like what I had hoped.
         | 
         | I used it in another project to solve some trigonometry
         | problems for me and it did great, but for OpenSCAD, damn it was
         | awful.
        
           | btbuildem wrote:
           | It's been pretty underwhelming. My use case was a crowned
           | pulley with 1mm tooth pitch (GT2) which is an unusual enough
           | thing that I could not find one online.
           | 
           | The LLM kept going in circles between two incorrect
           | solutions, then just repeating the same broken solution while
           | describing it as different. I ended up manually writing the
           | code, which was a nice brain-stretch given that I'm an
           | absolute noob at OpenSCAD.
        
         | danjl wrote:
         | It is best to genuflect to our future overlords. They may not
         | forget insolence.
        
         | lesuorac wrote:
         | If they really care about the answer, they'll ask a second time
         | sounds a lot like if your medical claims are real then you'll
         | appeal.
        
       | stormfather wrote:
       | I made an objective test for prompting hacks last year.
       | 
       | I asked gpt-4-1106-preview to draw a bounding box around some
       | text in an image and prodded in various ways to see what moved
       | the box closer. Offering a tip did in fact help lol so that went
       | into the company system prompt.
       | 
       | IIRC so did most things, including telling it that it was on a
       | forum, and OP had posted an incorrect response, which gpt was
       | itching to correct with its answer.
        
       | HarHarVeryFunny wrote:
       | This seems like anthromorphizing the model ... Occam's Razor says
       | that the improvement coming from iterative requests to improve
       | the code comes from the incremental iteration, not incentivizing
       | the model to do it's best. If the latter were the case then one
       | could get the best version on first attempt by telling it your
       | grandmother's life was on the line or whatever.
       | 
       | Reasoning is known weakness of these models, so jumping from
       | requirements to a fully optimized implementation that groks the
       | solution space is maybe too much to expect - iterative
       | improvement is much easier.
        
         | og_kalu wrote:
         | >If the latter were the case then one could get the best
         | version on first attempt by telling it your grandmother's life
         | was on the line or whatever.
         | 
         | Setting aside the fact that "best" is ambiguous, why would this
         | get you the best version ?
         | 
         | If you told a human this, you wouldn't be guaranteed to get the
         | best version at all. You would probably get a better version
         | sure but that would be the case for LLMs as well. You will
         | often get improvements with emotionally charged statements even
         | if there's nothing to iterate on (i.e re-running a benchmark
         | with an emotion prompt added)
         | 
         | https://arxiv.org/abs/2307.11760
        
           | HarHarVeryFunny wrote:
           | The thesis of the article is that the code keeps betting
           | better because the model keeps getting told to do better -
           | that it needs more motivation/criticism. A logical conclusion
           | of this, if it were true, is that the model would generate
           | it's best version on first attempt if only we could motivate
           | it to do so! I'm not sure what motivations/threats work best
           | with LLMs - there was a time when offering to pay the LLM was
           | popular, but "my grandma will die if you don't" was also
           | another popular genre of prompts.
           | 
           | If it's not clear, I disagree with the idea that _ANY_
           | motivational prompt (we can disagree over what would be best
           | to try) could get the model to produce a solution of the same
           | quality as it will when allowed to iterate on it a few times
           | and make incremental improvements. I think it 's being
           | allowed to iterate that is improving the solution, not the
           | motivation to "do better!".
        
         | minimaxir wrote:
         | I performed that exact incentive analysis in a previous post:
         | https://news.ycombinator.com/item?id=39495476
         | 
         | That said, it was done with ChatGPT 3.5/4, I suspect Claude 3.5
         | Sonnet would behave much different.
        
       | fhueller wrote:
       | > how to completely uninstall and reinstall postgresql on a
       | debian distribution without losing the data in the database.
       | 
       | https://www.phind.com/search?cache=lrcs0vmo0wte5x6igp5i3607
       | 
       | Still seem to struggle on basic instructions, and even
       | understanding what it itself is doing.                  sudo rm
       | -rf /etc/postgresql        sudo rm -rf /var/lib/postgresql
       | sudo rm -rf /var/log/postgresql
       | 
       | > This process removes all PostgreSQL components, cleans up
       | leftover files, and reinstalls a fresh copy. By preserving the
       | data directory (/var/lib/postgresql), we ensure that existing
       | databases are retained. This method provides a clean slate for
       | PostgreSQL while maintaining continuity of stored data.
       | 
       | Did we now?
        
         | pbhjpbhj wrote:
         | Is the problem that the antonym is a substring within "without
         | losing the data in the database"? I've seen problems with
         | opposites for LLMs before. If you specify "retaining the data"
         | or "keeping the data" does it get it right?
        
           | ADeerAppeared wrote:
           | That's a red herring.
           | 
           | The problem is that these are fundamentally _NOT_ reasoning
           | systems. Even when contorted into  "reasoning" models, these
           | are just stochastic parrots guessing the next words in the
           | hopes that it's the correct reasoning "step" in the context.
           | 
           | No approach is going to meaningfully work here. Fiddling with
           | the prompt may get you better guesses, but they will always
           | be guesses. Even without the antonym it's just a diceroll on
           | whether the model will skip or add a step.
        
         | bee_rider wrote:
         | Does that site generate a new page for each user, or something
         | like that? My copy seemed to have more sensible directions (it
         | says to backup the database, remove everything, reinstall, and
         | then restore from the backup). As someone who doesn't work on
         | databases, I can't really tell if these are _good_
         | instructions, and it is throwing some "there ought to be a tool
         | for this /it is unusual to manually rm stuff" flags in the back
         | of my head. But at least it isn't totally silly...
        
         | sdesol wrote:
         | I asked a bunch of models to review the Phind response at
         | 
         | https://beta.gitsense.com/?chats=a5d6523c-0ab8-41a8-b874-b31...
         | 
         | The left side contains the Phind response that I got and the
         | right side contains a review of the response.
         | 
         | Claude 3.5 Sonnet, GPT-4o and GPT-4o mini was not too happy
         | with the response and called out the contradiction.
         | 
         | Edit: Chat has been disabled as I don't want to incur an
         | unwanted bill
        
         | vleaflet wrote:
         | My guess is that it tried to fuse together an answer to 2
         | different procedures: A) completely uninstall and B)
         | (re)install without losing data. It doesn't know what you
         | configured as the data directory, or if it is a default Debian
         | installation. Prompt is too vague.
        
       | mikesabbagh wrote:
       | what is the difference of running the same code 5 times in
       | parallel or running the same code 5 times sequentially?
        
       | marvin-hansen wrote:
       | This is an interesting read and it's close to my experience that
       | a simpler prompt with less or no details but with relevant
       | context works well most of the time. More recently, I've flipped
       | the process upside down by starting with a brief specfile, that
       | is markdown file, with context, goal and usage example I.e how
       | the api or CLI should be used in the end. See this post for
       | details:
       | 
       | https://neoexogenesis.com/posts/rust-windsurf-transformation...
       | 
       | In terms of optimizing code, I'm not sure if there is a silver
       | bullet. I mean when I optimize Rust code with Windsurf & Claude,
       | it takes multiple benchmark runs and at least a few regressions
       | if you were to leave Claude on its own. However, if you have a
       | good hunch and write it as an idea to explore, Claude usually
       | nails it given the idea wasn't too crazy. That said, more
       | iterations usually lead to faster and better code although there
       | is no substitute to guiding the LLM. At least not yet.
        
       | wat10000 wrote:
       | This kind of works on people too. You'll need to be more polite,
       | but asking someone to write some code, then asking if they can do
       | it better, will often result in a better second attempt.
       | 
       | In any case, this isn't surprising when you consider an LLM as an
       | incomprehensibly sophisticated pattern matcher. It has a massive
       | variety of code in its training data and it's going to pull from
       | that. What kind of code is the most common in that training data?
       | Surely it's mediocre code, since that's by far the most common in
       | the world. This massive "produce output like my training data"
       | system is naturally going to tend towards producing that even if
       | it can do better. It's not human, it has no "produce the best
       | possible result" drive. Then when you ask for something better,
       | that pushes the output space to something with better results.
        
       | shahzaibmushtaq wrote:
       | 2 lessons to learn from this blog:
       | 
       | > these LLMs won't replace software engineers anytime soon,
       | because it requires a strong engineering background to recognize
       | what is actually a good idea, along with other constraints that
       | are domain specific.
       | 
       | > One issue with my experiments is that I'm benchmarking code
       | improvement using Python, which isn't the coding language
       | developers consider when hyperoptimizing performance.
        
         | lostmsu wrote:
         | TBH I'm not sure how he arrived at "won't replace software
         | engineers anytime soon"
         | 
         | The LLM solved his task. With his "improved prompt" the code is
         | good. The LLM in his setup was not given a chance to actually
         | debug its code. It only took him 5 "improve this code" commands
         | to get to the final optimized result, which means the whole
         | thing was solved (LLM execution time) in under 1 minute.
        
           | shahzaibmushtaq wrote:
           | Did you read the two paragraphs written above and the one
           | where he made that statement?
           | 
           | My comments on "what you are not sure" is that Max is a
           | software engineer (I am sure a good one) and he kept
           | iterating the code until it reached close to 100x faster code
           | because he knew what "write better code" looked like.
           | 
           | Now ask yourself this question: Is there any chance a no-
           | code/low-code developer will come to a conclusion deduced by
           | Max (he is not the only one) that you are not sure about?
           | 
           | An experienced software engineer/developer is capable of
           | improving LLM written code into better code with the help of
           | LLM.
        
             | minimaxir wrote:
             | > Max is a software engineer (I am sure a good one)
             | 
             | Opinions are mixed.
        
               | lostmsu wrote:
               | I'd hear one where somebody thought you failed at
               | software engineering.
        
             | lostmsu wrote:
             | I mean from this sample of 2 you could just ask it
             | repeatedly for up to 5 times restarting the counter
             | whenever performance improves.
        
           | simonw wrote:
           | A non-engineer would not be able to interpret ANY of what he
           | did here, or fix any of the bugs.
        
             | lostmsu wrote:
             | A non-engineer by definition would not be able to fix bugs.
             | 
             | But why does it matter that they won't be able to interpret
             | anything? Just like with real engineers you can ask AI to
             | provide an explanation digestible by an eloi.
        
               | simonw wrote:
               | By the time a non-engineer has waded through all of those
               | necessary explanations they'll deserve to call themselves
               | an engineer.
        
       | irthomasthomas wrote:
       | This is great! I wish I could bring myself to blog, as I
       | discovered this accidentally around March. I was experimenting
       | with an agent that acted like a ghost in the machine and
       | interacted via shell terminals. It would start every session by
       | generating a greeting in ASCII art. On one occasion, I was
       | shocked to see that the greeting was getting better each time it
       | ran. When I looked into the logs, I saw that there was a mistake
       | in my code which was causing it to always return an error message
       | to the model, even when no error occurred. The model interpreted
       | this as an instruction to try and improve its code.
       | 
       | Some more observations: New Sonnet is not universally better than
       | Old Sonnet. I have done thousands of experiments in agentic
       | workflows using both, and New Sonnet fails regularly at the same
       | tasks Old Sonnet passes. For example, when asking it to update a
       | file, Old Sonnet understands that updating a file requires first
       | reading the file, whereas New Sonnet often overwrites the file
       | with 'hallucinated' content.
       | 
       | When executing commands, Old Sonnet knows that it should wait for
       | the execution output before responding, while New Sonnet
       | hallucinates the command outputs.
       | 
       | Also, regarding temperature: 0 is not always more deterministic
       | than temperature 1. If you regularly deal with code that includes
       | calls to new LLMs, you will notice that, even at temperature 0,
       | it often will 'correct' the model name to something it is more
       | familiar with. If the subject of your prompt is newer than the
       | model's knowledge cutoff date, then a higher temperature might be
       | more accurate than a lower temperature.
        
       | robbiemitchell wrote:
       | I get a better first pass at code by asking it to write code at
       | the level of a "staff level" or "principal" engineer.
       | 
       | For any task, whether code or a legal document, immediately
       | asking "What can be done to make it better?" and/or "Are there
       | any problems with this?" typically leads to improvement.
        
       | lovasoa wrote:
       | The best solution, that the LLM did not find, is
       | def find_difference(nums):              try: nums.index(3999),
       | nums.index(99930)              except ValueError: raise
       | Exception("the numbers are not random")              return 99930
       | - 3999
       | 
       | It's asymptotically correct and is better than O(n) :p
        
       | pmarreck wrote:
       | I've noticed a few things that will cause it to write better
       | code.
       | 
       | 1) Asking it to write one feature at a time with test coverage,
       | instead of the whole app at once.
       | 
       | 2) You have to actually review and understand its changes in
       | detail and be ready to often reject or ask for modifications.
       | (Every time I've sleepily accepted Codeium Windsurf's
       | recommendations without much interference has resulted in bad
       | news.)
       | 
       | 3) If the context gets too long it will start to "lose the plot"
       | and make some repeated errors; that's the time to tell it to sum
       | up what has been achieved thus far and to copy-paste that into a
       | new context
        
       | deadbabe wrote:
       | This makes me wonder if there's conflicts of interest with AI
       | companies and getting you the best results the first time.
       | 
       | If you have to keep querying the LLM to refine your output you
       | will spend many times more in compute vs if the model was trained
       | to produce the best result the first time around
        
       | idlewords wrote:
       | I like that "do what I mean" has gone from a joke about computers
       | to a viable programming strategy.
        
       | hollywood_court wrote:
       | I've had great luck with Cursor by simply cursing at it when it
       | makes repeated mistakes.
       | 
       | I'll speak to it like a DI would speak to a recruit a basic
       | training.
       | 
       | And it works.
       | 
       | I was speaking to some of the Cursor dev team on Discord, and
       | they confirmed that being aggressive with the AI can lead to
       | better results.
        
         | ThrowawayTestr wrote:
         | This makes me sad. Have you tried being really nice and
         | supportive instead? I really don't want to have to yell at my
         | computer for it to work :(
        
           | hollywood_court wrote:
           | Yes and it didn't work. I've actually got Cursor/Claude to
           | curse back at me. Well, not AT me, but it used profanity in
           | it's response once it realized that it was going around in
           | circles and recreating the same errors.
        
       | lhl wrote:
       | So, I gave this to ChatGPT-4o, changing the initial part of the
       | prompt to: "Write Python code to solve this problem. Use the code
       | interpreter to test the code and print how long the code takes to
       | process:"
       | 
       | I then iterated 4 times and was only able to get to 1.5X faster.
       | Not great. [1]
       | 
       | How does o1 do? Running on my workstation, it's initial iteration
       | is actually It starts out 20% faster. I do 3 more iterations of
       | "write better code" with the timing data pasted and it thinks for
       | an additional 89 seconds but only gets 60% faster. I then
       | challenge it by telling it that Claude was over 100X faster so I
       | know it can do better. It thinks for 1m55s (the thought traces
       | shows it actually gets to a lot of interesting stuff) but the end
       | results are enormously disappointing (barely any difference). It
       | finally mentions and I am able to get a 4.6X improvement. After
       | two more rounds I tell it to go GPU (using my RTX 3050 LP display
       | adapter) and PyTorch and it is able to get down to 0.0035 (+/-),
       | so we are finally 122X faster than where we started. [2]
       | 
       | I wanted to see for myself how Claude would fare. It actually
       | managed pretty good results with a 36X over 4 iterations and no
       | additional prompting. I challenged it to do better, giving it the
       | same hardware specs that I gave o1 and it managed to do better
       | with a 457x speedup from its starting point and being 2.35x
       | faster than o1's result. Claude still doesn't have conversation
       | output so I saved the JSON and had a new Claude chat transcribe
       | it into an artifact [3]
       | 
       | Finally, I remembered that Google's new Gemini 2.0 models aren't
       | bad. Gemini 2.0 Flash Thinking doesn't have code execution, but
       | Gemini Experimental 1206 (Gemini 2.0 Pro preview) does. It's
       | initial 4 iterations are terribly unimpressive, however I
       | challenged it with o1 and Claude's results and gave it my
       | hardware info. This seemed to spark it to double-time its
       | implementations, and it gave a vectorized implementation that was
       | a 30X improvement. I then asked it for a GPU-only solution and it
       | managed to give the fastest solution ("This result of 0.00076818
       | seconds is also significantly faster than Claude's final GPU
       | version, which ran in 0.001487 seconds. It is also about 4.5X
       | faster than o1's target runtime of 0.0035s.") [4]
       | 
       | Just a quick summary of these all running on my system (EPYC
       | 9274F and RTX 3050):
       | 
       | ChatGPT-4o: v1: 0.67s , v4: 0.56s
       | 
       | ChatGPT-o1: v1: 0.4295 , v4: 0.2679 , final: 0.0035s
       | 
       | Claude Sonnet 3.6: v1: 0.68s , v4a: 0.019s (v3 gave a wrong
       | answer, v4 failed to compile, but fixed was pretty fast) , final:
       | 0.001487 s
       | 
       | Gemini Experimental 1206: v1: 0.168s , v4: 0.179s , v5: 0.061s ,
       | final: 0.00076818s
       | 
       | All the final results were PyTorch GPU-only implementations.
       | 
       | [1]
       | https://chatgpt.com/share/6778092c-40c8-8012-9611-940c1461c1...
       | 
       | [2]
       | https://chatgpt.com/share/67780f24-4fd0-8012-b70e-24aac62e05...
       | 
       | [3]
       | https://claude.site/artifacts/6f2ec899-ad58-4953-929a-c99cea...
       | 
       | [4]
       | https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
        
       | Der_Einzige wrote:
       | Normies discover that inference time scaling works. More news at
       | 11!
       | 
       | BTW - prompt optimization is a supported use-case of several
       | frameworks, like dspy and textgrad, and is in general something
       | that you should be doing yourself anyway on most tasks.
        
       | polynomial wrote:
       | > "LLM-generated code is unlikely to be slop."
       | 
       | Well that got my attention.
        
       | UniverseHacker wrote:
       | The headline question here alone gets at what is the biggest
       | widespread misunderstanding of LLMs, which causes people to
       | systematically doubt and underestimate their ability to exhibit
       | real creativity and understanding based problem solving.
       | 
       | At it's core an LLM is a sort of "situation specific simulation
       | engine." You setup a scenario, and it then plays it out with it's
       | own internal model of the situation, trained on predicting text
       | in a huge variety of situations. This includes accurate real
       | world models of, e.g. physical systems and processes, that are
       | not going to be accessed or used by all prompts, that don't
       | correctly instruct it to do so.
       | 
       | At its core increasingly accurate prediction of text, that is
       | accurately describing a time series of real world phenomena,
       | requires an increasingly accurate and general model of the real
       | world. There is no sense in which there is a simpler way to
       | accurately predict text that represents real world phenomena in
       | cross validation, without actually understanding and modeling the
       | underlying processes generating those outcomes represented in the
       | text.
       | 
       | Much of the training text is real humans talking about things
       | they don't understand deeply, and saying things that are wrong or
       | misleading. The model will fundamentally simulate these type of
       | situations it was trained to simulate reliably, which includes
       | frequently (for lack of a better word) answering things "wrong"
       | or "badly" "on purpose" - even when it actually contains an
       | accurate heuristic model of the underlying process, it will
       | still, faithfully according to the training data, often report
       | something else instead.
       | 
       | This can largely be mitigated with more careful and specific
       | prompting of what exactly you are asking it to simulate. If you
       | don't specify, there will be a high frequency of accurately
       | simulating uninformed idiots, as occur in much of the text on the
       | internet.
        
         | ixaxaar wrote:
         | Garbage in garbage out.
        
         | ysavir wrote:
         | > This can largely be mitigated with more careful and specific
         | prompting of what exactly you are asking it to simulate. If you
         | don't specify, there will be a high frequency of accurately
         | simulating uninformed idiots, as occur in much of the text on
         | the internet.
         | 
         | I don't think people are underestimating LLMs, they're just
         | acknowledging that by the time you've provided sufficient
         | specification, you're 80% of the way to solving the
         | problem/writing the code already. And at that point, it's
         | easier to just finish the job yourself rather than have to go
         | through the LLM's output, validate the content, revise further
         | if necessary, etc
        
           | UniverseHacker wrote:
           | I'm actually in the camp that they are basically not very
           | useful yet, and don't actually use them myself for real
           | tasks. However, I am certain from direct experimentation that
           | they exhibit real understanding, creativity, and modeling of
           | underlying systems that extrapolates to correctly modeling
           | outcomes in totally novel situations, and don't just parrot
           | snippets of text from the training set.
           | 
           | What people want and expect them to be is an Oracle that
           | correctly answers their vaguely specified questions, which is
           | simply not what they are, or are good at. What they can do is
           | fascinating and revolutionary, but possibly not very useful
           | yet, at least until we think of a way to use it, or make it
           | even more intelligent. In fact, thinking is what they are
           | good at, and simply repeating facts from a training set is
           | something they cannot do reliably- because the model must
           | inherently be too compressed to store a lot of facts
           | correctly.
        
         | joshribakoff wrote:
         | I have been using various LLMs to do some meal planning and
         | recipe creation. I asked for summaries of the recipes and they
         | looked good.
         | 
         | I then asked it to link a YouTube video for each recipe and it
         | used the same video 10 times for all of the recipes. No amount
         | of prompting was able to fix it unless I request one video at a
         | time. It would just acknowledge the mistake, apologize and then
         | repeat the same mistake again.
         | 
         | I told it let's try something different and generate a shopping
         | list of ingredients to cover all of the recipes, it recommended
         | purchasing amounts that didn't make sense and even added some
         | random items that did not occur in any of the recipes
         | 
         | When I was making the dishes, I asked for the detailed recipes
         | and it completely changed them, adding ingredients that were
         | not on the shopping list. When I pointed it out it again, it
         | acknowledged the mistake, apologized, and then "corrected it"
         | by completely changing it again.
         | 
         | I would not conclude that I am a lazy or bad prompter, and I
         | would not conclude that the LLMs exhibited any kind of
         | remarkable reasoning ability. I even interrogated the AIs about
         | why they were making the mistakes and they told me because "it
         | just predicts the next word".
         | 
         | Another example is, I asked the bots for tips on how to feel my
         | pecs more on incline cable flies, it told me to start with the
         | cables above shoulder height, which is not an incline fly, it
         | is a decline fly. When I questioned it, it told me to start
         | just below shoulder height, which again is not an incline fly.
         | 
         | My experience is that you have to write a draft of the note you
         | were trying to create or leave so many details in the prompts
         | that you are basically doing most of the work yourself. It's
         | great for things like give me a recipe that contains the
         | following ingredients or clean up the following note to sound
         | more professional. Anything more than that it tends to fail
         | horribly for me. I have even had long conversations with the
         | AIs asking them for tips on how to generate better prompts and
         | it's recommending things I'm already doing.
         | 
         | When people remark about the incredible reasoning ability, I
         | wonder if they are just testing it on things that were already
         | in the training data or they are failing to recognize how
         | garbage the output can be. However, perhaps we can agree that
         | the reasoning ability is incredible in the sense that it can do
         | a lot of reasoning very quickly, but it completely lacks any
         | kind of common sense and often does the wrong kind of
         | reasoning.
         | 
         | For example, the prompt about tips to feel my pecs more on an
         | incline cable fly could have just entailed "copy and pasting" a
         | pre-written article from the training data; but instead in its
         | own words, it "over analyzed bench angles and cable heights
         | instead of addressing what you meant". One of the bots did
         | "copy paste" a generic article that included tips for decline
         | flat and incline. None correctly gave tips for just incline on
         | the first try, and some took several rounds of iteration
         | basically spoon feeding the model the answer before it
         | understood.
        
           | UniverseHacker wrote:
           | You're expecting it to be an 'oracle' that you prompt it with
           | any question you can think of, and it answers correctly. I
           | think your experiences will make more sense in the context of
           | thinking of it as a heuristic model based situation
           | simulation engine, as I described above.
           | 
           | For example, why would it have URLs to youtube videos of
           | recipes? There is not enough storage in the model for that.
           | The best it can realistically do is provide a properly
           | formatted youtube URL. It would be nice if it could instead
           | explain that it has no way to know that, but that answer
           | isn't appropriate within the context of the training data and
           | prompt you are giving it.
           | 
           | The other things you asked also require information it has no
           | room to store, and would be impossibly difficult to
           | essentially predict via model from underlying principles.
           | That is something they can do in general- even much better
           | than humans already in many cases- but is still a very error
           | prone process akin to predicting the future.
           | 
           | For example, I am a competitive strength athlete, and I have
           | a doctorate level training in human physiology and
           | biomechanics. I could not reason out a method for you to feel
           | your pecs better without seeing what you are already doing
           | and coaching you in person, and experimenting with different
           | ideas and techniques myself- also having access to my own
           | actual human body to try movements and psychological cues on.
           | 
           | You are asking it to answer things that are nearly impossible
           | to compute from first principles without unimaginable amounts
           | of intelligence and compute power, and are unlikely to have
           | been directly encoded in the model itself.
           | 
           | Now turning an already written set of recipes into a shopping
           | list is something I would expect it to be able to do easily
           | and correctly if you were using a modern model with a
           | sufficiently sized context window, and prompting it
           | correctly. I just did a quick text where I gave GPT 4o only
           | the instruction steps (not ingredients list) for an oxtail
           | soup recipe, and it accurately recreated the entire shopping
           | list, organized realistically according to likely sections in
           | the grocery store. What model were you using?
        
             | sureIy wrote:
             | > an oxtail soup recipe
             | 
             | Sounds like the model just _copy pasted_ one from the
             | internet, hard to get that wrong. GP could have had a
             | bespoke recipe and list of ingredients. This particular
             | example of yours just reconfirmed what was being said: _it
             | 's only able to copy-paste existing content, and it's lost
             | otherwise._
             | 
             | In my case I have huge trouble making it create useful
             | TypeScript code for example, simply because apparently
             | there isn't sufficient advanced TS code that is described
             | properly.
             | 
             | For completeness sake, my last prompt was to create a
             | function that could infer one parameter type but not the
             | other. After several prompts and loops, I learned that this
             | is just not possible in TypeScript yet.
        
               | UniverseHacker wrote:
               | No, that example is not something that I would find very
               | useful or a good example of its abilities- just one thing
               | I generally expected it to be capable of doing. One can
               | quickly confirm that it is doing the work and not copying
               | and pasting the list by altering the recipe to include
               | steps and ingredients not typical for such a recipe. I
               | made a few such alterations just now, and reran it, and
               | it adjusted correctly from a clean prompt.
               | 
               | I've found it able to come up with creative new ideas for
               | solving scientific research problems, by finding
               | similarities between concepts that I would not have
               | thought of. I've also found it useful for suggesting
               | local activities while I'm traveling based on my rather
               | unusual interests that you wouldn't find recommended for
               | travelers anywhere else. I've also found it can solve
               | totally novel classical physics problems with correct
               | qualitative answers that involve keeping track of the
               | locations and interactions of a lot of objects.. I'm not
               | sure how useful that is, but it proves real understanding
               | and modeling - something people repeatedly say LLMs will
               | never be capable of.
               | 
               | I have found that it can write okay code to solve totally
               | novel problems, but not without a ton of iteration- which
               | it can do, but is slower than me just doing it myself,
               | and doesn't code in my style. I have not yet decided to
               | use any code it writes, although it is interesting to
               | test its abilities by presenting it with weird coding
               | problems.
               | 
               | Overall, I would say it's actually not really very
               | useful, but is actually exhibiting (very much alien and
               | non-human like) real intelligence and understanding. It's
               | just not an oracle- which is what people want and would
               | find useful. I think we will find them more useful with
               | having our own better understanding of what they actually
               | are and can do, rather than what we wish they were.
        
         | ActivePattern wrote:
         | Just want to note that this simple "mimicry" of mistakes seen
         | in the training text can be mitigated to some degree by
         | reinforcement learning (e.g. RLHF), such that the LLM is tuned
         | toward giving responses that are "good" (helpful, honest,
         | harmless, etc...) according to some reward function.
        
         | CharlesW wrote:
         | > _At it 's core an LLM is a sort of "situation specific
         | simulation engine."_
         | 
         | "Sort of" is doing Sisisyphian levels of heavy lifting here.
         | LLMs are statistical models trained on vast amounts of symbols
         | to predict the most likely next symbol, given a sequence of
         | previous symbols. LLMs may _appear to_ exhibit  "real
         | creativity", "understand" problem solving (or anything else),
         | or serve as "simulation engines", but it's important to
         | understand that they don't currently do any of those things.
        
           | UniverseHacker wrote:
           | I'm not sure if you read the entirety of my comment?
           | Increasingly accurately predicting the next symbol given a
           | sequence of previous symbols, when the symbols represent a
           | time series of real world events, requires increasingly
           | accurately modeling- aka understanding- the real world
           | processes that lead to the events described in them. There is
           | provably no shortcut there- per Solomonoff's theory of
           | inductive inference.
           | 
           | It is a misunderstanding to think of them as fundamentally
           | separate and mutually exclusive, and believing that to be
           | true makes people convince themselves that they cannot
           | possibly ever do things which they can already provably do.
           | 
           | Noam Chomsky (embarrassingly) wrote a NYT article on how LLMs
           | could never, with any amount of improvements be able to
           | answer certain classes of questions - even in principle. This
           | was days before GPT-4 came out, and it could indeed correctly
           | answer the examples he said could not be ever answered- and
           | any imaginable variants thereof.
           | 
           | Receiving symbols and predicting the next one is simply a way
           | of framing input and output that enables training and
           | testing- but doesn't specify or imply any particular method
           | of predicting the symbols, or any particular level of correct
           | modeling or understanding of the underlying process
           | generating the symbols. We are both doing exactly that right
           | now, by talking online.
        
             | CharlesW wrote:
             | > _I 'm not sure if you read the entirety of my comment?_
             | 
             | I did, and I tried my best to avoid imposing preconceived
             | notions while reading. You seem to be equating "being able
             | to predict the next symbol in a sequence" with "possessing
             | a deep causal understanding of the real-world processes
             | that generated that sequence", and if that's an inaccurate
             | way to characterize your beliefs I welcome that feedback.
             | 
             | Before you judge my lack of faith too harshly, I am a fan
             | of LLMs, and I find this kind of anthropomorphism even
             | among technical people who understand the mechanics of how
             | LLMs work super-interesting. I just don't know that it
             | bodes well for how this boom ends.
        
         | mvdtnz wrote:
         | > At it's core an LLM is a sort of "situation specific
         | simulation engine." You setup a scenario, and it then plays it
         | out with it's own internal model of the situation, trained on
         | predicting text in a huge variety of situations. This includes
         | accurate real world models of, e.g. physical systems and
         | processes, that are not going to be accessed or used by all
         | prompts, that don't correctly instruct it to do so.
         | 
         | You have simply invented total nonsense about what an LLM is
         | "at it's core". Confidently stating this does not make it true.
        
       | waltbosz wrote:
       | It's fun trying to get LLM to answer a problem that is obvious to
       | a human, but difficult for the LLM. It's a bit like leading a
       | child through the logic to solve a problem.
        
       | anotherpaulg wrote:
       | I ran a few experiments by adding 0, 1 or 2 "write better code"
       | prompts to aider's benchmarking harness. I ran a modified version
       | of aider's polyglot coding benchmark [0] with DeepSeek V3.
       | 
       | Here are the results:                       | Number of
       | | "write better code"       Score | followup prompts
       | ---------------------------       27.6% | 0 (baseline)
       | 19.6% | 1       11.1% | 2
       | 
       | It appears that blindly asking DeepSeek to "write better code"
       | significantly harms its ability to solve the benchmark tasks. It
       | turns working solutions into code that no longer passes the
       | hidden test suite.
       | 
       | [0] https://aider.chat/docs/leaderboards/
        
         | minimaxir wrote:
         | This is an interesting result but not surprising given that
         | bugs might cause the suite to fail.
        
         | layer8 wrote:
         | To be fair, you didn't specify that the functional requirements
         | should be maintained, you only asked for better code. ;)
        
       | TZubiri wrote:
       | My pet peeve is equating "better" code with faster code.
        
       | israrkhan wrote:
       | in order to tell LLM to "do better", someone (a human) needs to
       | know that it can be done better, and also be able to decide what
       | is better.
        
       | deepsquirrelnet wrote:
       | Reframe this as scaling test time compute using a human in the
       | loop as the reward model.
       | 
       | o1 is effectively trying to take a pass at automating that manual
       | effort.
        
       | bitwize wrote:
       | I dunno, but telling it "APES TOGETHER STRONG" appears to yield
       | some results: https://www.youtube.com/watch?v=QOJSWrSF51o
        
       | ziofill wrote:
       | At each iteration the LLM has the older code in its context
       | window, isn't it kind of obvious that it is going to iteratively
       | improve it?
        
       | chirau wrote:
       | Deepseek writes some good code, at least in my experience with it
        
       | yubrshen wrote:
       | When asking LLM repeated improving or adding a new feature in a
       | codebase, the most frustration risk is that LLM might wipe out
       | already working code!
       | 
       | What are your strategies to prevent such destructions of LLM?
        
       ___________________________________________________________________
       (page generated 2025-01-03 23:00 UTC)