[HN Gopher] A messy experiment that changed how I think about AI...
       ___________________________________________________________________
        
       A messy experiment that changed how I think about AI code analysis
        
       Author : namanyayg
       Score  : 349 points
       Date   : 2025-01-05 14:15 UTC (8 hours ago)
        
 (HTM) web link (nmn.gl)
 (TXT) w3m dump (nmn.gl)
        
       | JoeAltmaier wrote:
        | Pretty impressive. Except for the part about nitpicking on style
        | and uniformity (at the end), the results seem useful.
       | 
       | Btw I thought, from the title, this would be about an AI taught
       | to dismiss anyone's work but their own, blithely hold forth on
       | code they had no experience with, and misinterpret goals and
       | results to fit their preconceived notions. You know, to read code
       | like a Senior Developer.
        
         | svilen_dobrev wrote:
         | > to read code like a Senior Developer.
         | 
         | you mean, as in "code written by someone else == bad code" ?
        
           | dearing wrote:
            | It's being cute, but it's talking about the politicking at
            | code review.
        
       | jerpint wrote:
       | I think there will be lessons learned here as well for better
       | agentic systems writing code more generally; instead of
       | "committing" to code as of the first token generated, first
       | generate overall structure of code base, with abstractions, and
       | only then start writing code.
       | 
        | I usually instruct Claude/ChatGPT/etc. not to generate any code
        | until I tell it to, as they are eager to do so and often box
        | themselves into a corner early on.
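        | 
        | As a rough sketch of what I mean, with callLLM standing in for
        | whatever client you actually use (a hypothetical helper, not a
        | real API):
        | 
        |     // Hypothetical stand-in for your actual client call.
        |     declare function callLLM(prompt: string): Promise<string>;
        |     
        |     async function planThenCode(task: string): Promise<string> {
        |       // Phase 1: structure only, explicitly no code yet.
        |       const plan = await callLLM(
        |         "Describe the modules, interfaces, and data flow for: " +
        |         task + ". Do not write any code yet.");
        |       // Phase 2: implement against the agreed structure.
        |       return callLLM("Implement this plan, file by file:\n" + plan);
        |     }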
        
         | rmbyrro wrote:
         | aider.chat has an /architect mode where you can discuss the
         | architecture first and later ask it to execute the
         | architectural decisions
         | 
         | works pretty well, especially because you can use a more
         | capable model for architecting and a cheaper one to code
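          | 
          | For example (flags from memory - check the aider docs for the
          | exact names):
          | 
          |     aider --architect --model o1 --editor-model sonnet
          | 
          | You can also use the /architect command in the chat itself.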
        
           | namanyayg wrote:
           | I didn't know about this, thanks for sharing
        
         | thegeomaster wrote:
         | This is literally chain-of-thought! Even better than generic
         | chain-of-thought prompting ("Think step by step and write down
          | your thought process."), you're doing a domain-specific CoT,
          | where you use some of your human intuition about how to
          | approach a problem and impart it to the LLM.
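          | 
          | Concretely, instead of the generic line above, you'd write
          | something like (my own wording - adapt it to your domain):
          | 
          |     First list the entry points and the core data structures.
          |     Then trace how a request flows between them. Only after
          |     that, comment on the change itself.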
        
         | j_bum wrote:
         | Yes I frequently do this too.
         | 
         | In fact I often ask whatever model I'm interacting with to _not
         | do anything_ until we've devised a plan. This goes for search,
         | code, commands, analysis, etc.
         | 
         | It often leads to better results for me across the board. But
         | often I need to repeat those instructions as the chat gets
         | longer. These models are so hyped to generate _something_ even
         | if it's not requested.
        
         | Kinrany wrote:
         | We already have languages for expressing abstractions, they're
         | called programming languages. Working software is always built
         | interactively, with a combination of top-down and bottom-up
         | reasoning and experimentation. The problem is not in starting
         | with real code, the problem is in being unable to keep editing
         | the draft.
        
           | qup wrote:
           | Not a problem with the correct tooling.
        
         | namanyayg wrote:
          | That's exactly what I've understood, and this becomes even more
          | important as the size of the codebase scales.
         | 
         | Ultimately, LLMs (like humans) can keep a limited context in
         | their "brains". To use them effectively, we have to provide the
         | right context.
        
       | jalopy wrote:
        | This looks very interesting; however, it seems to me like the
        | critical piece of this technique is missing from the post: the
        | implementations of getFileContext() and shouldStartNewGroup().
       | 
       | Am I the one missing something here?
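        | 
        | For what it's worth, my naive guess at their shape would be
        | something like the following - pure speculation, nothing below
        | is confirmed by the post:
        | 
        |     // Hypothetical sketch only; the post never shows these.
        |     function getFileContext(file: { path: string; size: number }) {
        |       return {
        |         directory: file.path.split("/").slice(0, -1).join("/"),
        |         size: file.size,
        |       };
        |     }
        |     
        |     function shouldStartNewGroup(
        |       group: { files: string[]; totalSize: number },
        |       file: { path: string; size: number }
        |     ): boolean {
        |       const MAX_GROUP_SIZE = 50_000; // invented threshold
        |       return group.totalSize + file.size > MAX_GROUP_SIZE;
        |     }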
        
         | thegeomaster wrote:
         | Reading between the lines, it sounds like they are creating an
         | AI product for more than just their own codebase. If this is
         | the case, they'd probably be keeping a lot of the secret sauce
         | hidden.
         | 
          | More broadly, nowadays it's almost impossible to find what
          | worked for other people in terms of prompting and using LLMs
          | for various tasks within an AI product. Everyone guards this
          | information religiously as a moat. A few open-source projects
          | are all you have if you want to get a jumpstart on how an
          | LLM-based system is productized.
        
         | iamleppert wrote:
         | No, the code he posted sorts files by size, groups them, and
         | then...jazz hands?
        
         | layer8 wrote:
         | Yeah, and in the code bases I'm familiar with, you'd need a lot
         | of contextual knowledge that can't be derived from the code
         | base itself.
        
       | whinvik wrote:
       | Sounds interesting. Do you have documentation on how you built
       | the whole system?
        
         | namanyayg wrote:
         | I'll write something up, what are you curious about exactly?
        
           | JTyQZSnP3cQGa8B wrote:
           | > Do you have documentation on how you built the whole system
           | 
           | Or any actual "proof" (i.e. source code) that your method is
           | useful? I have seen a hundred articles like this one and,
           | surprise!, no one ever posts source code that would confirm
           | the results.
        
             | namanyayg wrote:
             | I have been trying to figure out how to publish evals or
             | benchmarks for this.
             | 
             | But where can I get high quality data of codebases,
             | prompts, and expected results? How do I benchmark one
             | codebase output vs another?
             | 
             | Would love any tips from the HN community
        
               | JTyQZSnP3cQGa8B wrote:
                | That's the problem with people who use AI. You think too
                | much and fail to deliver. I'm not asking for benchmarks
                | or complicated stuff; I want source code, actual proof
                | that I can diff myself. That's also why SWE is doomed
                | because of AI, but that's another story.
        
           | techn00 wrote:
           | the implementations of getFileContext() and
           | shouldStartNewGroup().
        
       | theginger wrote:
        | What is with the font joining the characters c and t on this
        | site? (In headings)
        
         | escape_goat wrote:
         | It's not joining it in a kerning sense, that's just the
         | remarkably serif nature of EB Garamond, which has a little
         | teardrop terminal on the tip of the 'c'. It's possible that you
         | have font smoothing that is tainting the gap, otherwise it's
         | your eyes.
        
           | teraflop wrote:
           | No, the heading font is Lato, not Garamond, and it's
           | definitely some kind of digraph that only shows up with the
           | combination "ct". Compare the letter "c" in these two
           | headings: https://i.imgur.com/Zq53gTd.png
        
             | escape_goat wrote:
             | This should be upvoted. Thank you, I hadn't realized that
             | OP was referring to the heading font or scrolled down to
             | see what is yes, quite a remarkable ligature. It appears to
             | be Lato delivered from
             | <https://brick.freetls.fastly.net/fonts/lato/700.woff> The
             | ligature appears due to discretionary ligatures being
             | turned on.                    h1, h2, h3 {          font-
             | feature-settings: "kern" 1, "liga" 1, "pnum" 1, "tnum" 0,
             | "onum" 1, "lnum" 0, "dlig" 1;           font-variant-
             | ligatures: discretionary-ligatures;          }
        
           | eichin wrote:
           | Actually, EB Garamond has c_t and s_t ligatures.
        
             | codesnik wrote:
             | and a very subtle f_f. I don't find those nice though.
        
             | jfk13 wrote:
             | It does, but those would only be applied if the `font-
             | variant-ligatures: historical-ligatures` property were
             | specified, so they don't appear on this site.
        
               | escape_goat wrote:
               | I inspected for a ligature and any evidence of CSS
               | kerning being turned on before commenting, but I didn't
               | test it to see what the page looked like with it turned
               | on, so I didn't have active knowledge of the possibility
               | of a ligature. If I'd know, it would have been better to
               | give wider scope to the possibility that somehow kerning
               | was being activated by OP's browser. I should have known
               | better than to make a remark about a font without
               | absolutely scrupulous precision! I actually appreciate
               | the comments and corrections.
        
           | wymerica wrote:
            | I was curious about this as well; it looks as though he's
            | using a specific font feature that creates a ligature between
            | those letters. I think it's deliberate because it's only on
            | the "ct" and it appears on other pages of his site. I
            | investigated further and it's a little-used print style:
            | https://english.stackexchange.com/questions/591499/what-is-t...
        
         | csallen wrote:
         | I was wondering the same thing. That doesn't seem to happen in
         | the Lato font on Google Fonts:
         | 
         | https://fonts.google.com/specimen/Lato?preview.text=Reaction...
         | 
          | EDIT: It's called ligatures:
          | https://developer.mozilla.org/en-US/docs/Web/CSS/font-varian...
          | The CSS for headings on this site turns on some extra
          | ligatures.
        
           | jfk13 wrote:
           | Specifically, `font-variant-ligatures: discretionary-
           | ligatures` enables this.
           | 
           | (So does `font-feature-settings: "dlig" 1`, which is the low-
           | level equivalent; the site includes both.)
        
         | namanyayg wrote:
         | In a previous lifetime I was a huge typography nerd (I could
         | name 95% of common fonts in just a glance ~10 years ago).
         | 
         | These are ligatures. I got the code to enable them from
         | Kenneth's excellent Normalize-Opentype.css [0]
         | 
         | [0]: https://kennethormandy.com/journal/normalize-opentype-css/
        
         | skykooler wrote:
         | I was wondering that too - I don't think that's a ligature I've
         | ever seen before.
        
       | advael wrote:
       | I read to like the first line under the first bold heading and
       | immediately this person seemed like an alien. I'll go back and
       | read the rest because it's silly to be put off a whole article by
       | this kind of thing, but what in the actual fuck?
       | 
       | I was probably not alive the last time anyone would have learned
       | that you should read existing code in some kind of linear order,
       | let alone programming. Is that seriously what the author did as a
       | junior, or is it a weirdly stilted way to make an analogy to
       | sequential information being passed into an LLM... which also
       | seems to misunderstand the mechanism of attention if I'm honest
       | 
       | I swear like 90% of people who write about "junior developers"
       | have a mental model of them that just makes zero sense that
       | they've constructed out of a need to dunk on a made up guy to
       | make their point
        
         | dnadler wrote:
         | While that wasn't my experience as a junior developer, this is
         | something that I used to do with academic papers.
         | 
         | I would read it start to finish. Later on, I learned to read
         | the abstract, then jump to either the conclusion or some
         | specific part of the motivation or results that was
         | interesting. To be fair, I'm still not great at reading these
         | kinds of things, but from what I understand, reading it start
         | to finish is usually not the best approach.
         | 
         | So, I think I agree that this is not really common with code,
         | but maybe this can be generalized a bit.
        
           | disgruntledphd2 wrote:
           | > reading it start to finish is usually not the best
           | approach.
           | 
           | It really, really depends on who you are and what your goal
           | is. If it's your area, then you can probably skim the
           | introduction and then forensically study methods and results,
           | mostly ignore conclusion.
           | 
           | However, if you're just starting in an area, the opposite
           | parts are often more helpful, as they'll provide useful
           | context about related work.
        
           | Aurornis wrote:
           | > this is something that I used to do with academic papers
           | 
           | Academic papers are designed to be read from start to finish.
           | They have an abstract to set the stage, an introduction, a
           | more detailed setup of the problem, some results, and a
           | conclusion in order.
           | 
           | A structured, single-document academic paper is not analogous
           | to a multi-file codebase.
        
             | rorytbyrne wrote:
             | No, they are designed to elucidate the author's thought
             | process - not the reader's learning process. There's a
             | subtle, but important difference.
             | 
                | Also: https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...
        
               | p1esk wrote:
                | _they are designed to elucidate the author's thought
                | process - not the reader's learning process_
               | 
               | No, it's exactly the opposite: when I write papers I
               | follow a rigid template of what a reader (reviewer)
               | expects to see. Abstract, intro, prior/related work, main
               | claim or result, experiments supporting the claim,
               | conclusion, citations. There's no room or expectation to
               | explain any of the thought process that led to the claim
               | or discovery.
               | 
               | Vast majority of papers follow this template.
        
           | the_af wrote:
           | You're supposed to read academic papers from start to finish.
        
             | jghn wrote:
             | I was taught to read the abstract, then the conclusion,
             | then look at the figures, and maybe dig into other sections
             | if there's something that drew my interest.
             | 
             | Given the variety of responses here, I wonder if some of
             | this is domain specific.
        
             | fsmv wrote:
             | I learned very quickly reading math papers that you should
             | not get stuck staring at the formulas, read the rest first
             | and let them explain the formulas.
             | 
              | I would not say it should be read start to finish; I often
              | had to read parts over multiple times to understand them.
        
             | baq wrote:
             | You're supposed to read the abstract, preferably the bottom
             | half first to see if there are conclusions there, then
             | proceed to the conclusions if the abstract is insufficient.
             | Once you're through with that, you can skim the
             | introduction and decide if the paper is worth your
             | attention.
             | 
             | Reading start to finish is only worth it if you're
             | interested in the gory details, I'm usually not.
        
         | Klonoar wrote:
         | _> I was probably not alive the last time anyone would have
         | learned that you should read existing code in some kind of
         | linear order_
         | 
         | I think you're jumping ahead and missing a point that the
         | article itself made: there are indeed bootcamp developers who
         | were taught this way. I have spent quite a number of hours of
         | my life trying to walk some prospective developers back from
         | this mindset.
         | 
            | That said, I think you could write this entire article
            | without dunking on junior developers, and I don't consider it
            | particularly well written - but that's a separate issue, I
            | guess.
        
           | advael wrote:
           | I suppose such a bootcamp may exist but wow, that's crazy to
           | me
           | 
           | But yea, having now read the whole thing I'm mostly taking
           | issue with the writing style I guess. I find the method they
           | tried interesting but it's worth noting that it's ultimately
           | just another datapoint for the value of multi-scale analytic
           | techniques when processing most complex data (Which is a
           | great thing to have applied here, don't get me wrong)
        
         | namanyayg wrote:
         | I was a junior so long ago that I've forgotten how I first read
         | code, but I do remember I was very confused.
         | 
         | Edited the post to improve clarity. Thanks for the writing tip!
        
           | advael wrote:
           | Yea sorry if I came off caustic there, dealing with really
           | dismissive attitudes toward juniors I'm actively trying to
           | foster has perhaps left a bad taste in my mouth
        
             | namanyayg wrote:
             | No worries. I took the metaphor too far and you rightfully
             | called me out. I'm still learning how to write well, I
             | promise you'll see better from me in the future.
        
               | advael wrote:
               | Love to see someone genuinely trying to improve at
               | something and I'm glad to have played a tiny part in it
        
         | nfRfqX5n wrote:
         | Didn't seem like dunking on juniors to me
        
         | coldtea wrote:
          | I don't know. Your comment feels alien. The first line
         | under the first bold heading is:
         | 
         | "Remember your first day reading production code? Without any
         | experience with handling mature codebases, you probably quickly
         | get lost in the details".
         | 
         | Which looks pretty much accurate. And yes, this includes the
         | (later) implied idea that many juniors would read a PR in some
         | kind of linear order, or at least, not read it in order of
         | importance, or don't know how to properly order their PR code
         | reading. And yes, some just click in the order Github shows the
         | changed files.
         | 
          | Note that for 99% of the industry, "junior dev" is not the same
         | as something like:
         | 
         | "just out of uni person with 12+ years of experience
         | programming since age 10, who built a couple of toy compilers
         | before they were 16, graduated Stanford, and was recently hired
         | at my FAANG team"
         | 
          | It's usually something between that and the DailyWTF fare,
         | often closer to the latter.
        
           | lolinder wrote:
           | The article was updated, probably in response to the parent
           | comment. It used to read this:
           | 
           | > Remember your first day reading production code? You
           | probably did what I did - start at line 1, read every file
           | top to bottom, get lost in the details.
           | 
           | I copied before refreshing, and sure enough that line was
           | modified.
        
         | schaefer wrote:
         | > I was probably not alive the last time anyone would have
         | learned that you should read existing code in some kind of
         | linear order, let alone programming.
         | 
         | If you want to dive all the way down that rabbit hole, can I
          | recommend you check out the Wikipedia article on the book
          | Literate Programming [1] by Donald Knuth [2]?
         | 
         | [1]: https://en.wikipedia.org/wiki/Literate_programming [2]:
         | https://en.wikipedia.org/wiki/Donald_Knuth
        
         | bobnamob wrote:
         | I think this article is indicative of the "vibe" I've been
         | getting when reading any discussion around genAI programming.
         | 
         | The range of (areas of) competence is just so damn vast in our
         | industry that any discussion about the quality of generated
         | code (or code reviews in this case) is doomed. There just isn't
         | a stable, shared baseline for what quality looks like.
         | 
         | I mean really - how on earth can Jonny Startup, who spends his
         | days slinging JS/TS to get his business launched in < a
         | month[1], and Terrence PhD the database engineer, who writes
         | simulation tested C++ for FoundationDB, possibly have a
         | grounded discussion about code quality? Rarely do I see people
         | declaring their priors.
         | 
         | Furthermore, the article is so bereft of detail and gushes so
         | profusely about the success and virtues of their newly minted
         | "senior level" AI that I can't help but wonder if they're
         | selling something...
         | 
         | /rant
         | 
         | [1] Please don't read this as slight against Jonny Startup, his
         | priorities are different
        
           | 9rx wrote:
           | Is there a difference in quality? Johnny Startup is
           | presumably trading quality in order to release sooner, but
           | the lower quality accepted in that trade is recognizable.
        
             | bobnamob wrote:
              | If Jonny Startup has been building release-prioritised
              | systems all his life/career, there's a decent chance he
              | doesn't even know what more goes into systems with higher
              | release & maintenance standards.
             | 
             | Conversely, if Terrence has only ever worked in high rigour
             | environments, he's unlikely to understand Jonny's
             | perspective when Jonny says that code generation tools are
             | doing amazing "reliable" things.
             | 
             | Again, this isn't meant to be a value judgement against
             | either Jonny or Terrence, more that they don't have shared
             | context & understanding on what and how the other is
             | building, and therefore are going to struggle to have a
             | productive conversation about a magic blackbox that one
             | thinks will take their job in 6 months.
        
           | zkry wrote:
           | > Furthermore, the article is so bereft of detail and gushes
           | so profusely about the success and virtues of their newly
           | minted "senior level" AI that I can't help but wonder if
           | they're selling something...
           | 
           | With all the money in the AI space these days, my prior
           | probability for an article extolling the virtues of AI
           | actually trying to sell something is rather high.
           | 
            | I just want a few good unbiased academic studies on the
            | effects of various AI systems on things like delivery time.
            | (Like: are AI systems preventing IT projects from going
            | overtime on a fat-tailed distribution? Is it possible with AI
            | to put an end to the chapter of software engineering projects
            | going disastrously overtime/overbudget?)
        
         | lolinder wrote:
         | To anyone who gets confused by the parent comment, note that
         | the line they're referring to has been updated. It used to
         | read:
         | 
         | > Remember your first day reading production code? You probably
         | did what I did - start at line 1, read every file top to
         | bottom, get lost in the details.
         | 
         | Now it reads:
         | 
         | > Remember your first day reading production code? Without any
         | experience with handling mature codebases, you probably quickly
         | get lost in the details.
        
           | namanyayg wrote:
           | Oops, I should have marked my edit clearly. Added a footnote
           | now.
        
             | lolinder wrote:
             | Thanks! No worries, we all live and learn. :)
        
           | thimabi wrote:
           | The change makes me question the authenticity of the text. I
           | mean, did the author actually read files from top to bottom,
           | or did he just write that because it suited his narrative?
           | 
           | That's a trivial change to make for a line that did not
           | receive the feedback that the author wanted. If that's the
           | case, maybe the text was more about saying what people wanted
           | to hear than honestly portraying how to make AI read code
           | better.
        
             | namanyayg wrote:
             | I forced an analogy and took the metaphor too far. I
             | promise you'll see better from me in the future!
        
               | autobodie wrote:
               | Metaphor? What metaphor? What analogy?
        
               | tmpz22 wrote:
               | > Remember your first day reading production code? You
               | probably did what I did - start at line 1, read every
               | file top to bottom, get lost in the details.
               | 
               | Top to bottom left to right is how we read text (unless
               | you are using Arabic or Hebrew!), the analogy was fine
                | IMO. Don't let one HN comment shake your confidence;
                | while people here may be well-intentioned, they are not
                | always right.
        
               | namanyayg wrote:
               | Haha thank you for the kind words!
               | 
               | I've been a lurker on HN ever since I was a kid. I've
               | seen over and over how HN is the most brusque & brutal
               | online community.
               | 
               | But that's also why I love it. Taking every piece of
               | feedback here to learn and improve in the future, and
               | feeling grateful for the thousands of views my article is
               | receiving!
        
               | edanm wrote:
               | Hebrew speakers also read top to bottom and left to
                | right when they're reading code, because coding is
                | (almost always) done in English. :)
        
               | lolinder wrote:
               | Don't take this feedback too personally--remember that
               | most HN users read and don't vote or comment, a subset of
               | them read and vote, and only a tiny loud fraction of us
               | actually comment.
               | 
               | Your article has been very well received, and it wasn't
               | because that one line deceived people into paying
                | attention; it's because the content is good.
        
               | iaseiadit wrote:
               | When I started out, I did read code top-to-bottom. I was
               | mostly self-taught and didn't have a mental model yet of
               | how code was structured, so I relied on this "brute
               | force" method to familiarize myself.
               | 
               | I suppose it's not safe to assume that everyone started
               | out like this. But advael is guilty of assuming that
               | _nobody_ started out like this. And on top of that,
                | conveying it in a very negative and critical way. Don't
                | get discouraged.
        
               | Jtsummers wrote:
               | This discussion is about junior professionals, not zero
               | experience programmers. If a junior professional
               | programmer is still starting at the top of files instead
               | of at the entry points to the program or the points of
               | interest, then they had a very poor education.
        
             | brundolf wrote:
             | Wow, people are being very uncharitable in this comment
             | section
        
               | apstls wrote:
               | Welcome to LLM-related threads on HN.
        
           | soneca wrote:
           | Oh, I was confused, thanks a lot.
           | 
           | And, indeed, reading every file from top to bottom is very
           | alien to me as a junior.
           | 
            | I would just try to get to the file where I thought the
            | change I needed was to be made and start with trial and
            | error. Definitely not checking the core files, much less
            | creating a mental model of the architecture (the very concept
            | of architecture would have been alien to me then).
            | 
            | I would get lost in irrelevant details (because I thought
            | they were relevant), while completely missing the details
            | that did matter.
        
         | olivierduval wrote:
         | I think that you missed the point and should have read until
         | "That's exactly how we feed codebases to AI"... ;-)
         | 
          | Actually, the article shows that feeding an AI "structured"
          | source code files instead of just a "flat full set" of files
          | allows the LLM to give better insights.
        
         | loeg wrote:
         | I have actually just printed out codebases and read them cover
         | to cover before (sometimes referencing ahead for context), as a
         | senior engineer. If you need to quickly understand what every
         | line is doing on a small to medium sized body of code, it's a
         | pretty good way to avoid distraction and ramp up quickly. I
         | find that just reading every line goes pretty quickly and gives
         | me a relatively good memory of what's going on.
        
           | ninetyninenine wrote:
           | Doing this requires higher IQ. Believe it or not a ton of
           | people literally don't do this because they can't. This
           | ability doesn't exist for them. Thousands of pages of code is
           | impossible to understand line by line for them. This
           | separation of ability is very very real.
        
           | pdhborges wrote:
            | I don't read all the lines of code, but I open and scan a ton
            | of files from the code base to get a feel for which concepts,
            | abstractions, and tricks are used.
        
         | myvoiceismypass wrote:
         | > I was probably not alive the last time anyone would have
         | learned that you should read existing code in some kind of
         | linear order, let alone programming.
         | 
         | Some of us have been around since before the concept of a "Pull
         | Request" even existed.
         | 
         | Early in my career we used to print out code (on paper, not
         | diffs) and read / have round table reviews in person! This was
         | only like 2 decades ago, too!
        
         | overgard wrote:
          | Yeah, to me his description of how programmers think didn't
          | really jibe with either senior or junior. I think when senior
          | developers look at a code review, they're busy, so they're
          | looking for really obvious smells. If there are no obvious
          | smells and it's easy to understand what the code is intending
          | to do, they usually let it pass. Most of the time if one of my
          | PRs gets rejected it's something along the lines of "I don't
          | know why, but doing X seems sketch" or "I need more comments to
          | understand the intended flow" or "The variable/function names
          | aren't great".
        
       | OzzyB wrote:
       | So it turns out that AI is just like another function, inputs and
       | outputs, and the better you design your input (prompt) the better
       | the output (intelligence), got it.
        
         | jprete wrote:
         | The Bitter Lesson claimed that the best approach was to go with
         | more and more data to make the model more and more generally
         | capable, rather than adding human-comprehensible structure to
         | the model. But a lot of LLM applications seem to add missing
         | domain structure until the LLM does what is wanted.
        
           | mbaytas wrote:
           | Improving model capability with more and more data is what
           | model developers do, over months. Structure and prompting
           | improvements can be done by the end user, today.
        
           | do_not_redeem wrote:
           | The Bitter Lesson pertains to the long term. Even if it
           | holds, it may take decades to be proven correct in this case.
           | Short-term, imparting some human intuition is letting us get
           | more useful results faster than waiting around for "enough"
           | computation/data.
        
           | Philpax wrote:
           | The Bitter Lesson states that you can overcome the weakness
           | of your current model by baking priors in (i.e. specific
           | traits about the problem, as is done here), but you will get
           | better long-term results by having the model learn the priors
           | itself.
           | 
           | That seems to have been the case: compare the tricks people
           | had to do with GPT-3 to how Claude Sonnet 3.6 performs today.
        
         | shahzaibmushtaq wrote:
         | You got that 100% right. The title should be "The day I told
         | (not taught) AI to read code like a Senior Developer".
        
         | syndicatedjelly wrote:
          | Not trying to nitpick, but the phrase "AI is just like another
          | function" is too charitable in my opinion. A function, in
          | mathematics as well as programming, transforms a given input
          | into a specific output in the codomain space. Per the Wikipedia
          | definition:
          | 
          |     In mathematics, a function from a set X to a set Y assigns
          |     to each element of X exactly one element of Y.[1] The set X
          |     is called the domain of the function[2] and the set Y is
          |     called the codomain of the function.[3]
          | 
          | Not to call you out specifically, but a lot of people seem to
          | misunderstand AI as being just like any other piece of code.
          | The problem is, unlike most of the code and functions we write,
          | it's not simply another function, and even worse, it's usually
          | not deterministic. If we both give a function the same input,
          | we should expect the same output. But this isn't the case when
          | we paste text into ChatGPT or something similar.
        
           | int_19h wrote:
           | LLMs are _literally_ a deterministic function of a bunch of
           | numbers to a bunch of numbers. The non-deterministic part
           | only comes when you apply the random pick to select a token
           | based on the weights (deterministically) computed by the
           | model.
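            | 
            | A toy version of that last step, to make the split explicit:
            | 
            |     // Probabilities are a pure function of the logits;
            |     // randomness only enters at the final pick.
            |     function softmax(logits: number[]): number[] {
            |       const m = Math.max(...logits);
            |       const exps = logits.map(x => Math.exp(x - m));
            |       const sum = exps.reduce((a, b) => a + b, 0);
            |       return exps.map(e => e / sum);
            |     }
            |     
            |     function pickToken(logits: number[], greedy: boolean): number {
            |       const probs = softmax(logits);
            |       if (greedy) return probs.indexOf(Math.max(...probs));
            |       let r = Math.random(); // the only non-deterministic part
            |       for (let i = 0; i < probs.length; i++) {
            |         r -= probs[i];
            |         if (r <= 0) return i;
            |       }
            |       return probs.length - 1;
            |     }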
        
       | risyachka wrote:
       | >> The AI went from: >> "This file contains authentication logic
       | using JWT tokens"
       | 
       | So what was the initial prompt? "What's in this file?"
       | 
       | And then you added context and it became context-aware. A bit of
        | an overstatement to call this a "Holy Shit moment".
        | 
        | Also, why "we"? What is "our AI"? And what is "our benchmark
        | script"?
       | 
       | And how big is your codebase? 50k files? 20 files?
       | 
       | This post has very very little value without a ton of details,
       | looks like nowadays everything "ai" labeled gets to the front
       | page.
        
         | dijksterhuis wrote:
         | > looks like nowadays everything "ai" labeled gets to the front
         | page.
         | 
         | it's been this way for like a year or more. hype machine gotta
         | hype.
        
       | sgt101 wrote:
        | As John McCarthy used to complain: "Look Mom, no hands!"
       | 
       | This isn't an experiment, or proof of anything. It's a story.
       | 
       | I wish people in the community would engage their critical skills
       | and training. The folks that taught the author should hang their
       | head in shame that their student is producing such rubbish.
        
         | namanyayg wrote:
         | It's my personal experience for now. Have some experiments and
         | further study planned.
         | 
         | It's difficult to set up evals, especially with production code
         | situations. Any tips?
        
           | sgt101 wrote:
           | Yeah, it is hard.
           | 
            | The principle you need to work to is that you need to create
           | evidence that other people will find compelling and then show
           | that you have interrogated your results to show that you have
           | checked that it's really working better than chance and not
           | the result of some fluke or other. Finally you need to find a
           | way to explain what's happening - like an actual mechanism.
           | 
            | 1. Find or make a data set - I've been using code_search_net
            | to try and study the ability of LLMs to document code,
            | specifically the impact of optimising n-shot learning on
            | them.* This may not be close enough to your application, but
            | you need many examples to draw conclusions. It's likely that
            | you will have to do some statistics to demonstrate the
            | effects of your innovations, so you probably need around 100
            | examples.
           | 
           | 2. Results from one model may not be informative enough, it
           | might be useful/necessary to compare several different models
           | to see if the effect you are finding is consistent or whether
           | some special feature of a model is what is required. For
           | example, does this effect work with only the largest and most
           | sophisticated modern models, or is this something that can be
           | seen to a greater or lesser effect with a variety of models?
           | 
            | 3. You need to ablate - what is it in the setup that is most
            | impactful? What happens if we change a word or add a word to
            | the prompt? Does this work on long code snippets? Does it
            | work on code with many functions? Does it work on code from
            | particular languages?
           | 
            | 4. You need a quantitative measure of performance. I am a
            | liberal sort, but I will not be convinced by an assertion
            | that "it worked better than before" or "this review is like
            | a senior's, I think". There needs to be a number that someone
            | like me can't argue with - or at least, can argue with but
            | can't dismiss.
           | 
            | *I couldn't make it work, I think because the search space
            | for finding good prompting shots (sample functions) is vast
            | and the output space (possible documents) is vast. Many
           | died in order to bring you this very very very (in hindsight
           | with about $200 of OpenAI spending) obvious result. Having
           | said that I am not confident that it couldn't be made to work
           | at this point so I haven't written it up and won't make any
           | sort of definitive claim. Mainly I wonder if there is a
           | heuristic that I could use to choose examples a-priori
           | instead of trying them at random. I did try shorter examples
           | and I did try more typical (size) examples. The other issue
           | is that I am using sentence similarity as a measure of
           | quality, but that isn't something I am confident of.
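            | 
            | To make point 4 concrete, the skeleton I have in mind is
            | just a scored loop, with the model call and the metric left
            | as stubs you would have to choose:
            | 
            |     type Example = { input: string; reference: string };
            |     
            |     async function evaluate(
            |       examples: Example[],
            |       runModel: (input: string) => Promise<string>,   // stub
            |       score: (output: string, ref: string) => number  // stub
            |     ) {
            |       const scores: number[] = [];
            |       for (const ex of examples) {
            |         scores.push(score(await runModel(ex.input), ex.reference));
            |       }
            |       const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
            |       return { mean, n: scores.length };
            |     }
            | 
            | Run that per model and per prompt variant and you have
            | numbers you can compare and ablate.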
        
         | og_kalu wrote:
          | Yeah, it's a story. So what? If you'd like evals or
          | implementation details then you can just say so.
          | 
          | People can and are free to tell stories if they want. It's not
          | some failing. You don't have to engage with it any more than
          | that.
        
         | efuquen wrote:
         | > The folks that taught the author should hang their head in
         | shame that their student is producing such rubbish.
         | 
         | This is unnecessary and rude, you should hang your head in
         | shame for that. I wish some people in this community weren't so
         | reactionary and would engage with empathy instead of trying to
         | personally roast people as soon as they don't agree with
         | something.
         | 
         | Someone can tell a story on the internet, it doesn't have to be
         | some rigorous experiment or proof.
        
           | sgt101 wrote:
           | I completely disagree.
           | 
           | Computer Science has a massive ethics crisis. Uncritical
           | adulation and a total lack of accountability or consequence
           | is part of that.
           | 
           | There is a massive misallocation of capital which is burning
           | opportunity for our society. Users are getting terrible
           | experiences and systems because people read this sort of
           | thing and believe it. Trust in our technology is eroded and
           | this has consequences for actual people. We have abandoned
           | the standards that protected people and you have the view
           | that these standards, or a shadow of them even, are
           | unnecessary?
           | 
           | Someone has taught a generation that this is all ok, it
           | isn't.
        
             | efuquen wrote:
              | I don't disagree with some of the points you are making
              | here, but the main point of my own comment is that your
              | last sentence above was mean-spirited and unnecessary.
              | 
              | If you want to have a conversation about quality and ethics
              | in computing, and how this post may be pushing a narrative
              | that is not in line with your views, I think that is
              | worthwhile to have. But personal denigration of someone
              | else isn't necessary in doing that.
        
       | cloudking wrote:
       | Sounds like OP hasn't tried the AI IDEs mentioned in the article.
       | 
       | For example, Cursor Agent mode does this out of the box. It
       | literally looks for context before applying features, changes,
       | fixes etc. It will even build, test and deploy your code - fixing
       | any issues it finds along the way.
        
         | batata_frita wrote:
          | Have you tried Cline, to compare it with Cursor?
          | 
          | I haven't tried Cursor yet, but for me Cline does an excellent
          | job. It uses internal mechanisms to understand the code base
          | before making any changes.
        
           | cloudking wrote:
           | I'll give it a go, I've also heard Windsurf is quite good.
           | 
           | Personally I've been very impressed with Cursor Agent mode,
           | I'm using it almost exclusively. It understands the entire
           | codebase, makes changes across files, generates new files,
           | and interacts with terminal input/output. Using it, I've been
           | able to build, test & deploy fullstack React web apps and
           | three.js games from scratch.
        
       | yapyap wrote:
       | haha man, some of yall really talk about AI like it's some baby
       | with all the knowledge in the world, waiting to be taught common
       | sense
        
       | _0ffh wrote:
       | Very sceptical of "Context First: We front-load system
       | understanding before diving into code". The LLM sees the whole
       | input at once, it's a transformer, not a recurrent model. Order
       | shouldn't matter in that sense.
       | 
       | Ed. I see some people are disagreeing. I wish they explained how
       | they imagine that would work.
        
       | Workaccount2 wrote:
        | Just like with training data, the more context and the higher
        | quality the context you give the model, the better the outputs
        | become.
        
       | scinadier wrote:
       | A bit of a disappointing read. The author never elaborates on the
       | details of the particular day in which they taught AI to read
       | code like a Senior Developer.
       | 
       | What did they have for lunch? We'll never know.
        
         | namanyayg wrote:
         | Credit goes to "You Suck at Cooking" for their genius smash
         | burger recipe [0] for my lunch that day
         | 
         | [0] https://www.youtube.com/watch?v=nq9WnmCGoFQ
        
       | quantadev wrote:
        | In my Coding Agent, I ended up realizing my prompts need to be
        | able to mention very specific areas in the code, for which no
        | good syntax exists, so I invented something I call "Named
        | Blocks".
        | 
        | My coding agent allows you to put any number of named blocks in
        | your code and then mention those in your prompts by name, and the
        | AI understands what code you mean. Here's an example:
        | 
        | In my code:
        | 
        |     -- block_begin SQL_Scripts
        |     ...some sql scripts...
        |     -- block_end
        | 
        | Example prompt:
        | 
        |     Do you see any bugs in block(SQL_Scripts)?
        
         | mdaniel wrote:
         | > for which no real good syntax exists for doing that
         | 
          | Up to you, but several editors have established syntax of
          | which any code-trained model will likely have seen plenty of
          | examples:
         | 
         | vim (set foldmethod=marker and then {{{ begin\n }}} end\n )
          | https://vimdoc.sourceforge.net/htmldoc/usr_28.html#28.6
         | 
         | JetBrains <editor-fold desc=""></editor-fold>
         | https://www.jetbrains.com/help/idea/working-with-source-code...
         | 
          | Visual Studio (#pragma region)
          | https://learn.microsoft.com/en-us/cpp/preprocessor/region-en...
          | (et al, each language has its own)
        
           | quantadev wrote:
           | The great thing about my agent, which I left out, is that it
           | extracts out all the named blocks using just pure Python, so
           | that the prompt itself has them embedded directly in it.
           | That's faster and more efficient than even having a "tool
           | call" that extracts blocks by name. So I needed a solution
           | where my own code can get named block content out of any kind
           | of file. Having one syntax that works on ALL types of files
           | was ideal.
           | 
           | UPDATE: In other words it's always "block_begin" "block_end"
           | regardless of what the comment characters are which will be
           | different for different files of course.
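            | 
            | Roughly the idea, sketched here in TypeScript (my actual
            | implementation is Python and the details differ):
            | 
            |     // The markers are identical in every file type, so one
            |     // pattern works regardless of the comment syntax.
            |     function extractNamedBlocks(source: string): Map<string, string> {
            |       const blocks = new Map<string, string>();
            |       const re = /block_begin\s+(\w+)\r?\n([\s\S]*?)^.*block_end/gm;
            |       let m: RegExpExecArray | null;
            |       while ((m = re.exec(source)) !== null) {
            |         blocks.set(m[1], m[2]);
            |       }
            |       return blocks;
            |     }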
        
       | qianli_cs wrote:
       | I thought it was about LLM training but it's actually prompt
       | engineering.
        
         | namanyayg wrote:
          | I'm thinking about training next! But DeepSeek is so good
          | already.
        
       | voidhorse wrote:
       | To me, this post really just highlights how important the human
       | element will remain. Without achieving the same level of
       | contextual understanding of the code base, I have no clue as to
       | whether or not the AI warning makes any sense.
       | 
       | At a superficial level, I have no idea what "shared patterns"
       | means or why it logically follows that sharing them would cause a
       | race condition. It also starts out talking about authentication
       | changes, but then cites a PR that modified "retry logic"--without
       | that shared context, it's not clear to me that an auth change has
       | anything to do with retry logic unless the retry is related to
       | retries on authentication failures.
        
       | dimtion wrote:
        | Without knowing exactly how createNewGroup and addFileToGroup are
        | implemented it is hard to tell, but it looks like the code
        | snippet has a bug where the last group created is never pushed to
        | the groups variable.
        | 
        | I'm surprised this "senior developer AI reviewer" did not catch
        | this bug...
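        | 
        | If the snippet really is the usual accumulate-and-flush loop,
        | the fix is a final push after the loop (I'm guessing at the
        | surrounding shape, since the post doesn't show it):
        | 
        |     const groups = [];
        |     let current = createNewGroup();
        |     for (const file of sortedFiles) {
        |       if (shouldStartNewGroup(current, file)) {
        |         groups.push(current);
        |         current = createNewGroup();
        |       }
        |       addFileToGroup(current, file);
        |     }
        |     groups.push(current); // the flush the snippet seems to miss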
        
       | crazygringo wrote:
        | I'm fascinated by stories like these, because I think they show
        | that LLMs have only shown a small amount of their potential so
        | far.
       | 
       | In a way, we've solved the raw "intelligence" part -- the next
       | token prediction. (At least in certain domains like text.)
       | 
       | But now we have to figure out how to structure that raw
       | intelligence into actual useful thinking patterns. How to take a
       | problem, analyze it, figure out ways of breaking it down, try
       | those ways until you run into roadblocks, then start figuring out
       | some solution ideas, thinking about them more to see if they
       | stand up to scrutiny, etc.
       | 
       | I think there's going to be a lot of really interesting work
       | around that in the next few years. A kind of "engineering of
       | practical thinking". This blog post is a great example of one
       | first step.
        
         | Terr_ wrote:
         | > But now we have to figure out how to structure that raw
         | intelligence into actual useful thinking patterns.
         | 
         | My go-to framing is:
         | 
          | 1. We've developed an amazing tool that extends a document. Any
          | "intelligence" is in there.
         | 
          | 2. Many uses begin with a document that resembles a movie-
          | script conversation between a computer and a human, alternately
          | adding new lines (from a real human) and _performing_ the
          | extended lines that parse out as "Computer says."
          | 
          | 3. This illusion is effective against _homo sapiens_, who are
          | biologically and subconsciously primed to make and experience
          | stories. We confuse the actor with the character with the
          | scriptwriter.
          | 
          | Unfortunately, the illusion is so good that a lot of
          | _developers_ are having problems pulling themselves back to the
          | real world too. It's as if we're trying to teach fashion sense
          | and embarrassment and empathy to a cloud which _looks like_ a
          | person, rather than changing how the cloudmaker machine works.
          | (The latter also being more difficult and more expensive.)
        
       | afro88 wrote:
       | Another cherry-picked example of an LLM doing something amazing,
       | written about with a heavy dose of anthropomorphism.
       | 
       | It's easy to get LLMs to do seemingly amazing things. It's
       | incredibly hard to build something where it does this amazing
       | thing consistently and accurately for all reasonable inputs.
       | 
       | > Analyzing authentication system files:
       | 
       | > - Core token validation logic
       | 
       | > - Session management
       | 
       | > - Related middleware
       | 
       | This hard coded string is doing some very heavy lifting. This
       | isn't anything special until this string is also generated
       | accurately and consistently for any reasonable PR.
       | 
        | OP, if you are reading: the first thing you should do is get a
       | variety of codebases with a variety of real world PRs and set up
       | some evals. This isn't special until evals show it producing
       | consistent results.
        
         | namanyayg wrote:
         | That's exactly what I want to do next.
         | 
         | Any tips on how should I get codebases and real world PRs? Are
         | the ones on popular open source repos on GitHub sufficient? I
         | worry that they don't really accurately reflect real world
         | closed source experience because of the inherent selection
         | bias.
         | 
         | Secondly, after getting all this, how do I evaluate which
         | method gave better results? Should it be done by a human, or
         | should I just plug an LLM to check?
        
           | jarebear6expepj wrote:
           | "...should it be done by a human?"
           | 
           | Sigh.
        
             | namanyayg wrote:
             | I'll do it personally in the beginning but was thinking
             | about doing it on scale
        
               | JTyQZSnP3cQGa8B wrote:
               | > doing it on scale
               | 
               | Like cloud-scale, no-code scale, or NoSQL scale? You are
               | confused, which shows that, maybe, you should not be
               | using such tools with the experience that you don't have.
        
               | zBard wrote:
               | 'Like cloud-scale, no-code scale or NoSQL scale'.
               | 
                | That is the dumbest statement I have heard this week. You
                | should perhaps refrain from commenting, at least until
                | you gain the modicum of intelligence that you currently
                | don't have.
        
           | smusamashah wrote:
            | The very first thing you can tell us (or try, if you haven't)
            | is whether you get the same answer when you re-prompt.
            | Second, can you get it to generate (consistently and
            | repeatedly) the text that GP pointed out?
            | 
            | No need to switch to a different repo for a quick test; just
            | make it reproable on your current repo.
        
             | namanyayg wrote:
             | Not the exact text, but still decent quality. I'll play
             | around with temperature and prompts a bit.
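              | 
              | For reproducibility I'll start by pinning temperature;
              | with the OpenAI Node SDK it's something like this, though
              | outputs still aren't guaranteed to be bit-identical:
              | 
              |     import OpenAI from "openai";
              |     
              |     const client = new OpenAI(); // reads OPENAI_API_KEY
              |     const res = await client.chat.completions.create({
              |       model: "gpt-4o", // whichever model is under test
              |       temperature: 0,  // most deterministic setting
              |       messages: [{ role: "user", content: "Review this PR." }],
              |     });
              |     console.log(res.choices[0].message.content);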
        
           | QuadmasterXLII wrote:
           | If you could get an LLM to check, you could just spam
           | solutions with any assortment of models and then use your
           | checker to pick the best.
        
           | throwup238 wrote:
           | _> I worry that they don 't really accurately reflect real
           | world closed source experience because of the inherent
           | selection bias._
           | 
           | As opposed to what, yet another beginner React app? That's
           | what everyone seems to be testing with but none of the
           | projects I've seen are reflective of a production codebase
           | that's years old and has been touched by a dozen developers.
           | 
           | Throw it at a complicated non-frontend mixed language repo
           | like cxx-qt [1] or something, preferably where the training
           | data doesn't include the latest API.
           | 
           | [1] https://github.com/KDAB/cxx-qt
        
             | lukan wrote:
             | "preferably where the training data doesn't include the
             | latest API"
             | 
             | That is the reason LLMs in their current shape are
             | pretty useless to me for most tasks.
             | 
             | They happily mix different versions of popular
             | frameworks, so I have to do so much manual work to fix
             | the result that I'd rather do it all myself.
             | 
             | Pure (common) math problems, or other domains where the
             | tech didn't change so much, like bash scripts or
             | regexes, are where I can use them. But my actual code?
             | Not really. The LLM would need to be trained only on the
             | API version I use, and that is not a thing yet, as far
             | as I am aware.
        
           | fuzzythinker wrote:
           | https://www.kaggle.com/competitions/konwinski-prize
        
           | rHAdG12327 wrote:
           | Given how eager Microsoft is to steal other people's code,
           | perhaps the leaked Windows source code would be an option. Or
           | perhaps Microsoft will let you train on their internal issue
           | tracker.
        
         | InkCanon wrote:
         | I also think there's some exaggeration. Annotating files with
         | a feature tag system is both manual and not scalable. Custom
         | prompting for each commit or feature is even more so. You're
         | doing a decent bit of specialized work here.
         | 
         | And I think he left out the most important part: was the
         | answer actually right? The real value of any good dev is that
         | he can provide reasonably accurate analysis with logic and
         | examples. "Could have an error" is more like a compiler
         | warning than the output of a good engineer.
         | 
         | Side note: "broke the benchmark script"? If you have an
         | automated way to qualitatively evaluate the output of an LLM
         | in a reasonably broad context like code reading, that's a far
         | bigger story.
        
           | imoreno wrote:
           | >Annotating files with a feature tag system is both manual
           | and not scalable.
           | 
           | Wouldn't you have the AI annotate it?
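           | Something like this, say; the tag list, model name, and
           | trust-the-JSON parsing are all made up for illustration:
           | 
           |   import json, pathlib
           |   from openai import OpenAI
           | 
           |   client = OpenAI()
           |   TAGS = ["auth", "billing", "notifications"]  # made up
           | 
           |   def tag_file(path):
           |       # Ask the model to label one file; a real version
           |       # would validate the JSON instead of trusting it.
           |       src = path.read_text()[:8000]  # crude context cap
           |       r = client.chat.completions.create(
           |           model="gpt-4o-mini",
           |           messages=[{"role": "user", "content":
           |                      f"Pick tags from {TAGS} for this "
           |                      f"file. Reply with a JSON list "
           |                      f"only.\n\n{src}"}])
           |       return json.loads(r.choices[0].message.content)
           | 
           |   index = {str(p): tag_file(p)
           |            for p in pathlib.Path("src").rglob("*.py")}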
        
         | ninetyninenine wrote:
         | This post talks as if the results are a worthless pile of trash
         | while obeying the HN rules of not directly insulting the
         | results. I agree with everything under the first paragraph.
         | 
         | Let me spell it out for you. These results. Are. Not.
         | Worthless.
         | 
         | Certainly what you said about what he "should" do to get
         | additional data is correct, but your tone, implying that the
         | results are utter trash and that he's falsely
         | anthropomorphizing something, is wrong.
         | 
         | Why is it wrong? Imagine Einstein had gotten most things
         | wrong in his life. Most things, but he did discover special
         | and general relativity; everything else was wrong. Relativity
         | would still be worth something. The results are still
         | worthwhile.
         | 
         | We have an example of an LLM hallucinating. Then we have
         | another example of additional contextual data causing the LLM
         | to stop hallucinating. This is a data point leaving a clue
         | about hallucinations and stopping hallucinations. It's
         | imperfect but a valuable clue.
         | 
         | My guess is that there are a million causal factors that
         | cause an LLM to hallucinate, and he's found one.
         | 
         | If he does what he did a multitude of times, for different
         | topics and different problems where contextual data stops a
         | hallucination, then with enough data and categorization of
         | said data we may be able to produce statistics and gain
         | insight into what's going on from a statistical perspective.
         | This is just like how we analyze other things that produce
         | fuzzy data, like humans.
         | 
         | Oh no! Am I anthropomorphizing again?? Does that action make
         | everything I said wrong? No, it doesn't. Humans produce
         | correct data when given context. It is reasonable to assume
         | that in many cases LLMs will do the same. I wrote this post
         | because I agree with everything you said, but not with your
         | tone, which implies that what OP did is utterly trivial.
        
           | godelski wrote:
           | They didn't say worthless, they said amazing.
           | 
           | Their comment is "do it consistently, then I'll buy your
           | explanation"
        
             | ninetyninenine wrote:
             | lol "seemingly amazing" means not amazing at all.
             | 
             | He didn't literally say it, but the comment implies it
             | is worthless, as does yours.
             | 
             | Humans don't "buy it" when they think something is
             | worthless. The tone is bent that way.
             | 
             | He could have said "this is amazingly useful data but we
             | need more", but of course it doesn't read like that at
             | all, thanks to the first paragraph. Let's not hallucinate
             | it into something it's not with wordplay. The comment is
             | highly negative.
        
               | wholinator2 wrote:
               | You seem very emotionally involved in this. It says "an
               | LLM doing something amazing". That's the sentence.
               | Later the term "seemingly amazing" is used, implying
               | that it _seems amazing_. Anything beyond that is your
               | personal interpretation. Do you disagree that there is
               | an excess of cherry-picked LLM examples getting
               | anthropomorphized? Yeah, it did a cool thing. Yes, LLMs
               | doing single cool things are everywhere. Yes, I will be
               | more convinced of its impact when I see it tested more
               | widely.
        
               | ninetyninenine wrote:
               | I am emotionally involved in the sense that I disagree
               | and dislike the tone. The core of my post addresses
               | tonality, and thus emotion is the topic of my post. I'm
               | emotionally involved in the same way it's normal for
               | humans to have emotions. If you can't see this, then
               | you don't have the rationality or emotional capacity to
               | understand what I am saying.
               | 
               | > Anything beyond that is your personal interpretation.
               | 
               | Not true. In the context of that post, when something
               | is implied to only seem amazing, it is also implied
               | that it is likely not amazing. That is a common human
               | interpretation. I find that if you're not interpreting
               | it that way, your emotions are influencing your
               | interpretation.
               | 
               | > Do you disagree that there is an excess of cherrypicked
               | LLM examples getting anthropomorphized?
               | 
               | I disagree. I think everybody is doing the opposite:
               | cherry-picking the instances where the LLM gets stuff
               | wrong, saying LLMs are garbage because of that,
               | ignoring the instances where it gets things right, and
               | classifying those instances as regurgitation.
               | 
               | You're not being rational here by bringing up
               | anthropomorphization. We are mainly talking about
               | correctness. If an aspect of LLM intelligence is
               | similar to humans' in generating correct data, then
               | anthropomorphizing it is the way to go. Whether
               | something is being anthropomorphized is completely
               | orthogonal to the problem. We are talking about correct
               | results. If anthropomorphizing something gets us
               | there... who cares? Human intelligence works, and it's
               | even reasonable to believe LLMs think like humans,
               | because they are literally trained on human data.
               | 
               | The act of even mentioning anthropomorphization without
               | mentioning correctness is irrational and illogical.
        
         | nuancebydefault wrote:
         | Still, the findings in the article are very valuable. The
         | fact that directing the "thought" process of the LLM with
         | this kind of prompting yields much better results is useful.
         | 
         | The comparison to how a senior dev would approach the
         | assignment, as a metaphor explaining the mechanism, makes
         | perfect sense to me.
        
         | mensetmanusman wrote:
         | No need for the pessimism; these are new tools that humans
         | have invented. We are grokking how to utilize them.
        
           | ramblerman wrote:
           | OP brought a rational argument, you didn't. It sounds like
           | you are defending your optimism with emotion.
           | 
           | > We are grokking how to utilize them.
           | 
           | Indeed.
        
         | asah wrote:
         | It's incredibly hard to get _HUMANS_ to do this amazing
         | thing consistently and accurately for all reasonable inputs.
        
           | iaseiadit wrote:
           | Some humans can do it consistently, other humans can't.
           | 
           | Versus how no publicly-available AI can do it consistently
           | (yet). Although it seems like a matter of time at this point,
           | and then work as we know it changes dramatically.
        
           | overgard wrote:
           | To be fair, most senior developers don't have any incentive
           | to put this amount of analysis into a working codebase.
           | When the system is working, nobody really wants to spend
           | time they could be spending on something interesting trying
           | to find bugs in old code. Plus there's the social
           | consideration that your colleagues might not like you much
           | if you spend all your time picking their (working) code
           | apart while not doing any of your tasks. Usually this kind
           | of analysis comes from someone specifically brought in to
           | find issues, like an auditor or a pen-tester.
        
             | kotlip wrote:
             | The right incentives would motivate bug hunting; it
             | entirely depends on company management. Most competent
             | senior devs I've worked with spend a great deal of time
             | carefully reading through PRs that involve critical
             | changes. In either case, the question is not whether humans
             | tend to act a certain way, but whether they are capable of
             | skillfully performing a task.
        
             | smrtinsert wrote:
             | Unfortunately, some senior devs like myself do care. Too
             | bad no one else does. Code reviews become quick after a
             | while; your brain adapts to reviewing code deeply and
             | quickly.
        
           | talldayo wrote:
           | Humans are fully capable of protracted, triple-checked
           | scrutiny if the incentives align just right. Given the same
           | conditions, you cannot ever compel an AI to stop being wrong
           | or consistently communicate what it doesn't understand.
        
         | ryanackley wrote:
         | What I want to know is: how accurate was the comment? I've
         | found AI frequently suggests _plausible_ changes. They use
         | enough info and context to look like excellent suggestions on
         | the surface, but with some digging you realize they're
         | completely wrong.
        
         | Der_Einzige wrote:
         | The people who claim it's that hard to do these things have
         | never heard of or used constrained/structured generation, and
         | it shows big time!
         | 
         | Most other related issues with today's models are due to the
         | tokenizer or a poor choice of sampler settings, which makes
         | them a cheap shot against the models.
        
         | j45 wrote:
         | What I'm learning is that just because something might be
         | hard for you or me doesn't mean it's not possible or not
         | working.
         | 
         | LLMs can generally only do what they have data on, either in
         | training or via instructions in prompting, it seems.
         | 
         | Keeping instructions reliable takes increasing amounts of
         | testing, and appears to benefit from LLMOps tools like
         | Agenta, etc.
         | 
         | It seems to me like LLMs are reasonably well suited to things
         | that code can't easily do. You can find models on Hugging
         | Face that are great at categorizing and applying labels,
         | instead of trying to get a generalized assistant model to do
         | it.
         | 
         | I'm more and more looking at tools like OpenRouter to allow
         | doing each step with the model that does it best, almost
         | functionally where needed, to increase stability.
         | 
         | For now, it seems to be one way to improve reliability
         | dramatically; happy to learn about what others are finding
         | too.
         | 
         | It seems like a pretty nascent area still, where tooling
         | that's established in other parts of tech is still figuring
         | itself out in the LLM space.
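         | As a rough sketch of the routing I mean (OpenRouter speaks
         | the OpenAI-compatible API; the step-to-model mapping and
         | model names are just illustrative):
         | 
         |   from openai import OpenAI
         | 
         |   client = OpenAI(
         |       base_url="https://openrouter.ai/api/v1",
         |       api_key="sk-or-...",  # your OpenRouter key
         |   )
         | 
         |   ROUTES = {  # hypothetical mapping, tune per task
         |       "architect": "anthropic/claude-3.5-sonnet",
         |       "label": "meta-llama/llama-3.1-8b-instruct",
         |       "code": "openai/gpt-4o-mini",
         |   }
         | 
         |   def run(step, prompt):
         |       # Send each step to whichever model handles it best.
         |       r = client.chat.completions.create(
         |           model=ROUTES[step],
         |           messages=[{"role": "user", "content": prompt}])
         |       return r.choices[0].message.content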
        
       | revskill wrote:
       | The seniors master more than 2 or 3 languages.
        
       | SunlitCat wrote:
       | Oh my. That title alone inspired me to ask ChatGPT to read a
       | simple hello world cpp program like a drunken sailor.
       | 
       | The end result was quite hilarious, I have to say.
       | 
       | Its final verdict was:
       | 
       | End result? It's a program yellin', "HELLO WORLD!" Like me at the
       | pub after 3 rum shots. Cheers, matey! hiccup
       | 
       | :D
        
         | namanyayg wrote:
         | Recently I've started appending "in the style of Edgar Allan
         | Poe" to my prompts when I'm discussing code architecture.
         | 
         | It's really quite interesting how the LLM comes up with ways
         | to discuss code :)
        
       | dartos wrote:
       | I think the content is interesting, but anthropomorphizing AI
       | always rubs me the wrong way and ends up sounding like marketing.
       | 
       | Are you trying to market a product?
        
       | zbyforgotp wrote:
       | Personally I would not hardcode the discovery process in code,
       | but just give the LLM tools to browse the code and find what it
       | needs itself.
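       | Roughly: expose read-only browsing functions and declare them
       | in your provider's tool-calling format. A sketch (the tool
       | names are my own; the schema below follows the OpenAI
       | function-calling shape):
       | 
       |   import pathlib, subprocess
       | 
       |   def list_files(root="."):
       |       return [str(p) for p in pathlib.Path(root).rglob("*.py")]
       | 
       |   def read_file(path):
       |       return pathlib.Path(path).read_text()
       | 
       |   def grep(pattern, root="."):
       |       out = subprocess.run(["grep", "-rn", pattern, root],
       |                            capture_output=True, text=True)
       |       return out.stdout
       | 
       |   TOOLS = [{
       |       "type": "function",
       |       "function": {
       |           "name": "read_file",
       |           "description": "Return one source file's contents.",
       |           "parameters": {
       |               "type": "object",
       |               "properties": {"path": {"type": "string"}},
       |               "required": ["path"],
       |           },
       |       },
       |   }]  # list_files and grep are declared the same way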
        
       | atemerev wrote:
       | This is what Aider does out of the box
        
       | mbrumlow wrote:
       | > Context First: We front-load system understanding before
       | diving into code
       | 
       | > Pattern Matching: Group similar files to spot repeated
       | approaches
       | 
       | > Impact Analysis: Consider changes in relation to the whole
       | system
       | 
       | Wait. You fixed your AI by doing traditional programming!?!?!
        
         | _0ffh wrote:
         | The context first bit doesn't even make sense.
         | 
         | Transformers process their whole context window in parallel,
         | unlike people who process it serially. There simply _is_ no
         | place that gets processed  "first".
         | 
         | I'd love to see anyone who disagrees explain to me how that is
         | supposed to work.
        
       | danjl wrote:
       | > Identifying tech debt before it happens
       | 
       | Tech debt is a management problem, not a coding problem. A
       | statement like this undermines my confidence in the story being
       | told, because it indicates the lack of experience of the author.
        
         | noirbot wrote:
         | I don't think that's totally accurate though. It can definitely
         | be a coding problem - corners cut for expediency and such.
         | Sometimes that's because management doesn't offer enough time
         | to not do it that way, but it can also just be because the dev
         | doesn't bother or does the bare minimum.
         | 
         | I'd argue the creation of tech debt is often a coding
         | problem. The longevity and persistence of tech debt are a
         | management problem.
        
           | danjl wrote:
           | Not taking enough time is a management problem. It doesn't
           | matter whether it is the manager or the developer who takes
           | shortcuts. The problem is planning, not coding.
        
           | dijksterhuis wrote:
           | > it can also just be because the dev doesn't bother or does
           | the bare minimum
           | 
           | sounds like a people problem -- which is management problem.
           | 
           | > I'd argue the creation of tech debt is often coding
           | problem. The longevity and persistence of tech debt is a
           | management problem.
           | 
           | i'd argue the creation of tech debt is more often due to
           | those doing the coding operating under the limitations placed
           | on them. The longevity and persistence of tech debt is just
           | an extension of that.
           | 
           | given an infinite amount of time and money, i can write an
           | ideal solution to a solvable problem (or at least close to
           | ideal, i'm not that good of a dev).
           | 
           | the limitations create tech debt, and they're always there
           | because infinite resources (time and money) don't exist.
           | 
           | so tech debt always exists because there's always
           | limitations. most of the time those resource limitations are
           | decided by management (time/money/people)
           | 
           | but there are language/framework/library limitations which
           | create tech debt too though, which i think is what you might
           | be referring to?
           | 
           | usually those are less common though
        
       | shahzaibmushtaq wrote:
       | Fresh bootcamp grads aren't going to understand what the author
       | is talking about, and many senior developers (even mid-seniors)
       | are looking for the prompts the author wrote to teach AI how to
       | become a senior developer.
        
       | highcountess wrote:
       | Dev palms just got that much more sweaty.
        
       | brundolf wrote:
       | People are being very uncharitable in the comments for some
       | reason
       | 
       | This is a short and sweet article about a very cool real-world
       | result in a very new area of tooling possibilities, with some
       | honest and reasonable thoughts
       | 
       | Maybe the "Senior vs Junior Developer" narrative is a little
       | stretched, but the substance of the article is great
       | 
       | Can't help but wonder if people are getting mad because they feel
       | threatened
        
         | namanyayg wrote:
         | Felt a bit more cynical than usual haha.
        
         | jryan49 wrote:
         | It seems LLMs are very useful for some people and not so much
         | for others. Both sides believe it's all or nothing: if it's
         | garbage for me, it must be garbage; if it's doing my work, it
         | must be able to do everyone's work... Everyone is very
         | emotional about it too, because of the hype around it. Almost
         | all conversations about LLMs, especially on HN, are full of
         | this useless bickering.
        
         | epolanski wrote:
         | I am more and more convinced that many engineers are very
         | defensive about AI and would rather point out any flaw than
         | think how to leverage the tools to get any benefit out of them.
         | 
         | Just the other day I used Cursor and iteratively implemented
         | stories for 70 .vue files in a few hours, while also writing
         | documentation for the components and pages, with the
         | documentation being further fed to Cursor to write many E2Es,
         | something that would've taken me at least a few days if not a
         | week.
         | 
         | When I shared that with some coworkers, they went on a hunt
         | to find all the shortcomings (often petty duplication of
         | mocks, sometimes a missing story scenario, nothing major).
         | 
         | I found it striking as we really needed it and it provides
         | tangible benefits:
         | 
         | - domain and UI stakeholders can navigate stories and think
         | of more cases/feedback with ease from a UX/UI point of view,
         | without having to replicate the scenarios manually through
         | multiple time-consuming repetitive operations in the actual
         | applications
         | 
         | - documentation proved to be very valuable to a junior who
         | joined us this very January
         | 
         | - E2Es caught multiple bugs in their own PRs in the weeks after
         | 
         | And yet, instead of appreciating the cost/benefit ratio of
         | the solution (something that should characterise a good
         | engineer; after all, that's our job), I was scolded because
         | they (or I) would've done a more careful job, missing that
         | they had never done it in the first place.
         | 
         | I have many such examples, such as automatically providing
         | all the translation keys and translations for a new locale,
         | just to find cherry-picked criticism that this or that
         | could've been spelled differently. Of course it could; what's
         | your job if not being responsible for the localisation? That
         | shouldn't diminish the fact that 95% of the content was
         | correct and provided in a few seconds rather than days.
         | 
         | Why do they do that? I genuinely feel some feel threatened;
         | most of them reek of insecurity.
         | 
         | I can understand some criticism towards those who build and
         | sell hype with cherry-picked results, but I can't help
         | finding some of the worst critics suffering from Luddism.
        
           | krainboltgreene wrote:
           | Given how much damage it's done to our industry without any
           | appreciable impact on the actual systems' efficacy, it
           | makes sense to me that _experts in a mechanism_ are
           | critical of people telling them how effective this "tool"
           | is for the mechanism.
           | 
           | I suppose it's simply easier to think of them as scared and
           | afraid of losing their jobs to robots, but the reality is
           | most programmers already know someone who lost their job to
           | a robot that doesn't even exist yet.
        
             | chchchangelog wrote:
             | Oh no, being a working serf is going away.
             | 
             | In the US, anyway, we can all take advantage of the 2A
             | and sit around strapped, keeping the politicians in line.
             | 
             | https://aeon.co/essays/game-theory-s-cure-for-corruption-
             | mak...
             | 
             | No need to live by an economic ledger that's politically
             | convenient to the elite.
             | 
             | Key axioms, techniques, and technology were invented
             | before the contemporary leadership used media to take
             | credit. There's little reason to venerate normal people
             | who happened to come along at an opportune time.
             | 
             | Lucky to exist; survivorship bias is a terrible reason to
             | deify anyone from prior generations. Maybe all the
             | actually smart people of their day died? The survivors
             | being risk-averse weenies.
             | 
             | 1900s American culture needs to be tossed in the bin
             | along with the old "winners".
        
               | jazzyjackson wrote:
               | You sound like a markov chain.
        
               | cutnpaste wrote:
               | Is that a pejorative?
               | 
               | Let me try...
               | 
               | Your comment reads like:
               | 
               | > cat textbook.txt | echo
               | 
               | Zing!
        
               | __loam wrote:
               | Least insane hackernews commenter
        
           | __loam wrote:
           | To me it sounds like you're rushing the work and making
           | mistakes then dismissing people who point it out.
        
         | imoreno wrote:
         | To me, articles like this are not so interesting for the
         | results. I'm not reading them to find out exactly what the
         | performance of AIs is. Obviously they're not useful for that:
         | they're not systematic, they're anecdotal, unscientific...
         | 
         | I think LLMs today, for all their goods and bads, can do some
         | useful work. The problem is that there is still mystery
         | around how to use them effectively. I'm not talking about
         | some pie-in-the-sky singularity stuff, but just coming up
         | with prompts to do basic, simple tasks effectively.
         | 
         | Articles like this are great for learning new prompting
         | tricks, and I'm glad the authors are choosing to share their
         | knowledge. Yes, OP isn't saying the last word on prompting,
         | and there's a million ways it could be better. But the
         | article is still useful to an average person trying to learn
         | how to use LLMs more productively.
         | 
         | >the "Senior vs Junior Developer" narrative
         | 
         | It sounds to me like just another case of "telling the AI to
         | explicitly reason through its answer improves the quality of
         | results". The "senior developer" here is better able to
         | triage aspects of the codebase to identify the important ones
         | (to the "junior", everything seems equally important) and, I
         | would say, has better reasoning ability.
         | 
         | Maybe it works because when you ask the LLM to code something,
         | it's not really trying to "do a good job", besides whatever
         | nebulous bias is instilled from alignment. It's just trying to
         | act the part of a human who is solving the problem. If you tell
         | it to act a more competent part, it does better - but it has to
         | have some knowledge (aka training data) of what the more
         | competent part looks like.
        
         | dang wrote:
         | No doubt the overstated title is part of the reason, so we've
         | adopted the subtitle above, which is presumably more accurate.
        
       | ianbutler wrote:
       | Code context and understanding are very important for improving
       | the quality of LLM-generated code; it's why the core of our
       | coding agent product Bismuth (which I won't link, but if you're
       | so inclined, check my profile) is built around a custom code
       | search engine that we've also built.
       | We segment the project into logical areas based on what the user
       | is asking, then find interesting symbol information and use it to
       | search call chain information which we've constructed at project
       | import.
       | 
       | This gives the LLM way better starting context and we then
       | provide it tools to move around the codebase through normal
       | methods you or I would use like go_to_def.
       | 
       | We've analyzed a lot of competitor products, and very few do
       | anything more than a rudimentary project skeleton, like Aider,
       | or directly feed opened code as context, which breaks down very
       | quickly on large code projects.
       | 
       | We're very happy with the level of quality we see from our
       | implementation and it's something that really feels overlooked
       | sometimes by various products in this space.
       | 
       | Realistically, the only other product I know of approaching
       | this correctly with any degree of search sophistication is Cody
       | from SourceGraph, which, yeah, makes sense.
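       | For a rough flavor of the indexing side (our engine isn't
       | public; this toy sketch uses Python's ast module to record
       | where functions are defined and what they call, the kind of
       | data a go_to_def tool needs):
       | 
       |   import ast, pathlib
       |   from collections import defaultdict
       | 
       |   defs = {}                 # name -> (file, line)
       |   calls = defaultdict(set)  # caller -> callee names
       | 
       |   for path in pathlib.Path("src").rglob("*.py"):
       |       tree = ast.parse(path.read_text())
       |       for node in ast.walk(tree):
       |           if isinstance(node, ast.FunctionDef):
       |               defs[node.name] = (str(path), node.lineno)
       |               for sub in ast.walk(node):
       |                   if (isinstance(sub, ast.Call) and
       |                           isinstance(sub.func, ast.Name)):
       |                       calls[node.name].add(sub.func.id)
       | 
       |   def go_to_def(name):
       |       # The kind of primitive handed to the model as a tool.
       |       return defs.get(name)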
        
       | kmoser wrote:
       | I wonder how the results would compare to simply prompting it to
       | "analyze this as if you were a senior engineer"?
        
         | jappgar wrote:
         | I do this all the time. Actually, I tell it that _I_ am a
         | senior engineer.
         | 
         | A lot of people tinkering with AI think it's more complex
         | than it is. If you ask it to ELI5, it will do that.
         | 
         | Often I will say "I already know all that, I'm an experienced
         | engineer and need you to think outside the box and
         | troubleshoot with me."
         | 
         | It works great.
        
       | Arch-TK wrote:
       | I am struggling to teach AI to stop dreaming up APIs which don't
       | exist and failing to solve relatively simple but not often
       | written about problems.
       | 
       | It's good when it works and crap when it doesn't; for me, it
       | mostly doesn't work. I think AI working is a good indicator
       | that you're writing code which has been written by lots of
       | other people before.
        
         | Terr_ wrote:
         | > I think AI working is a good indicator of when you're writing
         | code which has been written by lots of other people before.
         | 
         | This is arguably good news for the programming profession,
         | because that has a big overlap with cases that could be
         | improved by a library or framework, the traditional way we've
         | been trying to automate ourselves out of a job for several
         | decades now.
        
       | charles_f wrote:
       | I wondered if there was a reason behind the ligature between c
       | and t across the article (e.g. is it easier to read for people
       | with dyslexia).
       | 
       | If like me you didn't know, apparently this is mostly stylistic,
       | and comes from a historical practice that predates printing.
       | There are other common ligatures such as CT, st, sp and th.
       | https://rwt.io/typography-tips/opentype-part-2-leg-ligatures
        
       | patrickhogan1 wrote:
       | This is great. More context is better. The only question is:
       | after you have the AI read your code, why would you have to
       | tell it basic things, like that this is session middleware?
        
       | deadbabe wrote:
       | This strikes me as basically doing the understanding _for_ the
       | LLM and then having it summarize it.
        
       | redleggedfrog wrote:
       | It's funny that those are considered senior dev attributes. I
       | would think you'd better be doing that basic kind of stuff from
       | the minute you're writing code for production and future
       | maintenance. Otherwise you're making a mess someone else is
       | going to have to clean up.
        
       | guerrilla wrote:
       | Today I learned I have "senior dev level awareness". This seems
       | pretty basic to me, but impressive that the LLM was able to do
       | it. On the other hand, this borderline reads like those people
       | with their "AI" girlfriends.
        
       | riazrizvi wrote:
       | Nice article. The comments are weird as fuck.
        
       | ptx wrote:
       | Well, did you check if the AI's claims were correct?
       | 
       | Does PR 1234 actually exist? Did it actually modify the retry
       | logic? Does the token refresh logic actually share patterns with
       | the notification service? Was the notification service added last
       | month? Does it use websockets?
        
       | stevenhuang wrote:
       | Related article on how LLMs are force-fed information line by
       | line:
       | 
       | https://amistrongeryet.substack.com/p/unhobbling-llms-with-k...
       | 
       | > Our entire world - the way we present information in scientific
       | papers, the way we organize the workplace, website layouts,
       | software design - is optimized to support human cognition. There
       | will be some movement in the direction of making the world more
       | accessible to AI. But the big leverage will be in making AI more
       | able to interact with the world as it exists.
       | 
       | > We need to interpret LLM accomplishments to date in light of
       | the fact that they have been laboring under a handicap. This
       | helps explain the famously "jagged" nature of AI capabilities:
       | it's not surprising that LLMs struggle with tasks, such as ARC
       | puzzles, that don't fit well with a linear thought process. In
       | any case, we will probably find ways of removing this handicap
        
       | disambiguation wrote:
       | OP, you only took this halfway. We already know LLMs can say
       | smart-sounding things while also being wrong and irrelevant.
       | You need to manually validate how many out of 100 LLM outputs
       | are both correct and significant - and how much it missed!
       | Otherwise you might fall into a trap of dealing with too much
       | noise for only a little bit of signal. The next step from there
       | is comparing it with a human-level signal-to-noise ratio.
        
       | Jimmc414 wrote:
       | @namanyayg Thanks for posting this, OP. I created a prompt series
       | based on this and so far I like the results. Here is the repo if
       | you are interested.
       | 
       | https://github.com/jimmc414/better_code_analysis_prompts_for...
       | 
       | I used this tool to flatten the example repo and PRs into text:
       | 
       | https://github.com/jimmc414/1filellm
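       | (For the gist of the flattening step, a toy version is only a
       | few lines; see the repo above for the real tool:)
       | 
       |   import pathlib
       | 
       |   parts = []
       |   for p in sorted(pathlib.Path(".").rglob("*")):
       |       # Keep only plain-text sources; skip binaries etc.
       |       if p.is_file() and p.suffix in {".py", ".md", ".txt"}:
       |           parts.append(f"--- {p} ---\n"
       |                        f"{p.read_text(errors='ignore')}")
       |   pathlib.Path("flat.txt").write_text("\n\n".join(parts))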
        
       ___________________________________________________________________
       (page generated 2025-01-05 23:01 UTC)