[HN Gopher] A messy experiment that changed how I think about AI...
___________________________________________________________________
A messy experiment that changed how I think about AI code analysis
Author : namanyayg
Score : 349 points
Date : 2025-01-05 14:15 UTC (8 hours ago)
(HTM) web link (nmn.gl)
(TXT) w3m dump (nmn.gl)
| JoeAltmaier wrote:
| Pretty impressive. But for the part about nitpicking on style and
| uniformity (at the end) the results seem useful.
|
| Btw I thought, from the title, this would be about an AI taught
| to dismiss anyone's work but their own, blithely hold forth on
| code they had no experience with, and misinterpret goals and
| results to fit their preconceived notions. You know, to read code
| like a Senior Developer.
| svilen_dobrev wrote:
| > to read code like a Senior Developer.
|
| you mean, as in "code written by someone else == bad code" ?
| dearing wrote:
| It's being cute, but it's speaking about the politicking in code review.
| jerpint wrote:
| I think there will be lessons learned here as well for better
| agentic systems writing code more generally; instead of
| "committing" to code from the first token generated, first
| generate the overall structure of the codebase, with
| abstractions, and only then start writing code.
|
| I usually instruct Claude/ChatGPT/etc. not to generate any code
| until I tell it to, as they are eager to do so and often box
| themselves into a corner early on.
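|
| As a rough sketch of that two-phase flow (the `complete`
| function here is just a placeholder for whatever chat API you
| use, not a real library call):
|
|     // Hypothetical two-phase flow: plan first, then code.
|     declare function complete(prompt: string): Promise<string>;
|
|     async function planThenCode(task: string): Promise<string> {
|       // Phase 1: ask only for structure, explicitly forbidding code.
|       const plan = await complete(
|         `Task: ${task}\nOutline the modules, abstractions and ` +
|         `data flow. Do NOT write any code yet.`
|       );
|       // Phase 2: implement against the agreed structure.
|       return complete(
|         `Here is the agreed plan:\n${plan}\nNow implement it, ` +
|         `following the plan.`
|       );
|     }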
| rmbyrro wrote:
| aider.chat has an /architect mode where you can discuss the
| architecture first and later ask it to execute the
| architectural decisions
|
| It works pretty well, especially because you can use a more
| capable model for architecting and a cheaper one to code.
| namanyayg wrote:
| I didn't know about this, thanks for sharing
| thegeomaster wrote:
| This is literally chain-of-thought! Even better than generic
| chain-of-thought prompting ("Think step by step and write down
| your thought process."), you're doing a domain-specific CoT,
| where you use some of your human intuition about how to approach
| a problem and impart it to the LLM.
| j_bum wrote:
| Yes I frequently do this too.
|
| In fact I often ask whatever model I'm interacting with to _not
| do anything_ until we've devised a plan. This goes for search,
| code, commands, analysis, etc.
|
| It often leads to better results for me across the board. But
| often I need to repeat those instructions as the chat gets
| longer. These models are so hyped to generate _something_ even
| if it's not requested.
| Kinrany wrote:
| We already have languages for expressing abstractions, they're
| called programming languages. Working software is always built
| interactively, with a combination of top-down and bottom-up
| reasoning and experimentation. The problem is not in starting
| with real code, the problem is in being unable to keep editing
| the draft.
| qup wrote:
| Not a problem with the correct tooling.
| namanyayg wrote:
| That's exactly what I've understood, and this becomes even more
| important as the size of the codebase scales.
|
| Ultimately, LLMs (like humans) can keep a limited context in
| their "brains". To use them effectively, we have to provide the
| right context.
| jalopy wrote:
| This looks very interesting; however, it seems to me that the
| critical piece of this technique is missing from the post: the
| implementations of getFileContext() and shouldStartNewGroup().
|
| Am I the one missing something here?
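|
| For what it's worth, my naive guess at what those two helpers
| might look like (purely hypothetical, not code from the post):
|
|     // Hypothetical sketches of the two missing helpers.
|     interface FileInfo { path: string; size: number; content: string; }
|
|     // Pull cheap, high-signal context: imports, exports, comments.
|     function getFileContext(file: FileInfo): string {
|       const lines = file.content.split("\n");
|       const signals = lines.filter(l =>
|         /^\s*(import|export|\/\/|\/\*|\*)/.test(l)
|       );
|       return `${file.path}\n${signals.slice(0, 40).join("\n")}`;
|     }
|
|     // Start a new group when the current one would blow a token
|     // budget or when the directory (rough subsystem proxy) changes.
|     function shouldStartNewGroup(
|       group: FileInfo[], next: FileInfo, budget = 50_000
|     ): boolean {
|       const dir = (p: string) => p.split("/").slice(0, -1).join("/");
|       const size = group.reduce((sum, f) => sum + f.size, 0);
|       const sameDir =
|         group.length === 0 || dir(group[0].path) === dir(next.path);
|       return size + next.size > budget || !sameDir;
|     }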
| thegeomaster wrote:
| Reading between the lines, it sounds like they are creating an
| AI product for more than just their own codebase. If this is
| the case, they'd probably be keeping a lot of the secret sauce
| hidden.
|
| More broadly, it's nowadays almost impossible to find what
| worked for other people in terms of prompting and using LLMs
| for various tasks within an AI product. Everyone guards this
| information religiously as a moat. A few open source projects
| are all you have if you want to get a jumpstart on how
| an LLM-based system is productized.
| iamleppert wrote:
| No, the code he posted sorts files by size, groups them, and
| then...jazz hands?
| layer8 wrote:
| Yeah, and in the code bases I'm familiar with, you'd need a lot
| of contextual knowledge that can't be derived from the code
| base itself.
| whinvik wrote:
| Sounds interesting. Do you have documentation on how you built
| the whole system?
| namanyayg wrote:
| I'll write something up, what are you curious about exactly?
| JTyQZSnP3cQGa8B wrote:
| > Do you have documentation on how you built the whole system
|
| Or any actual "proof" (i.e. source code) that your method is
| useful? I have seen a hundred articles like this one and,
| surprise!, no one ever posts source code that would confirm
| the results.
| namanyayg wrote:
| I have been trying to figure out how to publish evals or
| benchmarks for this.
|
| But where can I get high quality data of codebases,
| prompts, and expected results? How do I benchmark one
| codebase output vs another?
|
| Would love any tips from the HN community
| JTyQZSnP3cQGa8B wrote:
| That's the problem with people who use AI. You think too
| much and fail to deliver. I'm not asking for benchmarks
| or complicated stuff, I want source code, actual proof
| that I can diff myself. Also that's why the SWE is doomed
| because of AI, but that's another story.
| techn00 wrote:
| the implementations of getFileContext() and
| shouldStartNewGroup().
| theginger wrote:
| What is with the font joining the characters c and t on this
| site? (In headings.)
| escape_goat wrote:
| It's not joining it in a kerning sense, that's just the
| remarkably serif nature of EB Garamond, which has a little
| teardrop terminal on the tip of the 'c'. It's possible that you
| have font smoothing that is tainting the gap, otherwise it's
| your eyes.
| teraflop wrote:
| No, the heading font is Lato, not Garamond, and it's
| definitely some kind of digraph that only shows up with the
| combination "ct". Compare the letter "c" in these two
| headings: https://i.imgur.com/Zq53gTd.png
| escape_goat wrote:
| This should be upvoted. Thank you, I hadn't realized that
| OP was referring to the heading font, or scrolled down to
| see what is, yes, quite a remarkable ligature. It appears to
| be Lato delivered from
| <https://brick.freetls.fastly.net/fonts/lato/700.woff>. The
| ligature appears due to discretionary ligatures being
| turned on:
|
|     h1, h2, h3 {
|       font-feature-settings: "kern" 1, "liga" 1, "pnum" 1,
|         "tnum" 0, "onum" 1, "lnum" 0, "dlig" 1;
|       font-variant-ligatures: discretionary-ligatures;
|     }
| eichin wrote:
| Actually, EB Garamond has c_t and s_t ligatures.
| codesnik wrote:
| and a very subtle f_f. I don't find those nice though.
| jfk13 wrote:
| It does, but those would only be applied if the `font-
| variant-ligatures: historical-ligatures` property were
| specified, so they don't appear on this site.
| escape_goat wrote:
| I inspected for a ligature and any evidence of CSS
| kerning being turned on before commenting, but I didn't
| test it to see what the page looked like with it turned
| on, so I didn't have active knowledge of the possibility
| of a ligature. If I'd known, it would have been better to
| give wider scope to the possibility that somehow kerning
| was being activated by OP's browser. I should have known
| better than to make a remark about a font without
| absolutely scrupulous precision! I actually appreciate
| the comments and corrections.
| wymerica wrote:
| I was curious about this as well, it looks as though he's
| using a specific font which creates a ligature between those
| letters. I think it's specific because it's only on the CT
| and it's on other pages in his site. I went further to
| investigate what this might be and it's a little used print
| style:
| https://english.stackexchange.com/questions/591499/what-
| is-t...
| csallen wrote:
| I was wondering the same thing. That doesn't seem to happen in
| the Lato font on Google Fonts:
|
| https://fonts.google.com/specimen/Lato?preview.text=Reaction...
|
| EDIT: It's called ligatures: https://developer.mozilla.org/en-
| US/docs/Web/CSS/font-varian.... The CSS for headings on this
| site turns on some extra ligatures.
| jfk13 wrote:
| Specifically, `font-variant-ligatures: discretionary-
| ligatures` enables this.
|
| (So does `font-feature-settings: "dlig" 1`, which is the low-
| level equivalent; the site includes both.)
| namanyayg wrote:
| In a previous lifetime I was a huge typography nerd (I could
| name 95% of common fonts at a glance ~10 years ago).
|
| These are ligatures. I got the code to enable them from
| Kenneth's excellent Normalize-Opentype.css [0]
|
| [0]: https://kennethormandy.com/journal/normalize-opentype-css/
| skykooler wrote:
| I was wondering that too - I don't think that's a ligature I've
| ever seen before.
| advael wrote:
| I read to like the first line under the first bold heading and
| immediately this person seemed like an alien. I'll go back and
| read the rest because it's silly to be put off a whole article by
| this kind of thing, but what in the actual fuck?
|
| I was probably not alive the last time anyone would have learned
| that you should read existing code in some kind of linear order,
| let alone programming. Is that seriously what the author did as a
| junior, or is it a weirdly stilted way to make an analogy to
| sequential information being passed into an LLM... which also
| seems to misunderstand the mechanism of attention if I'm honest
|
| I swear like 90% of people who write about "junior developers"
| have a mental model of them that just makes zero sense that
| they've constructed out of a need to dunk on a made up guy to
| make their point
| dnadler wrote:
| While that wasn't my experience as a junior developer, this is
| something that I used to do with academic papers.
|
| I would read it start to finish. Later on, I learned to read
| the abstract, then jump to either the conclusion or some
| specific part of the motivation or results that was
| interesting. To be fair, I'm still not great at reading these
| kinds of things, but from what I understand, reading it start
| to finish is usually not the best approach.
|
| So, I think I agree that this is not really common with code,
| but maybe this can be generalized a bit.
| disgruntledphd2 wrote:
| > reading it start to finish is usually not the best
| approach.
|
| It really, really depends on who you are and what your goal
| is. If it's your area, then you can probably skim the
| introduction, forensically study the methods and results, and
| mostly ignore the conclusion.
|
| However, if you're just starting in an area, the opposite
| parts are often more helpful, as they'll provide useful
| context about related work.
| Aurornis wrote:
| > this is something that I used to do with academic papers
|
| Academic papers are designed to be read from start to finish.
| They have an abstract to set the stage, an introduction, a
| more detailed setup of the problem, some results, and a
| conclusion in order.
|
| A structured, single-document academic paper is not analogous
| to a multi-file codebase.
| rorytbyrne wrote:
| No, they are designed to elucidate the author's thought
| process - not the reader's learning process. There's a
| subtle, but important difference.
|
| Also: https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...
| p1esk wrote:
| _they are designed to elucidate the author's thought
| process - not the reader's learning process_
|
| No, it's exactly the opposite: when I write papers I
| follow a rigid template of what a reader (reviewer)
| expects to see. Abstract, intro, prior/related work, main
| claim or result, experiments supporting the claim,
| conclusion, citations. There's no room or expectation to
| explain any of the thought process that led to the claim
| or discovery.
|
| The vast majority of papers follow this template.
| the_af wrote:
| You're supposed to read academic papers from start to finish.
| jghn wrote:
| I was taught to read the abstract, then the conclusion,
| then look at the figures, and maybe dig into other sections
| if there's something that drew my interest.
|
| Given the variety of responses here, I wonder if some of
| this is domain specific.
| fsmv wrote:
| I learned very quickly reading math papers that you should
| not get stuck staring at the formulas; read the rest first
| and let it explain the formulas.
|
| I would not say it should be read start to finish; I often
| had to read over parts multiple times to understand it.
| baq wrote:
| You're supposed to read the abstract, preferably the bottom
| half first to see if there are conclusions there, then
| proceed to the conclusions if the abstract is insufficient.
| Once you're through with that, you can skim the
| introduction and decide if the paper is worth your
| attention.
|
| Reading start to finish is only worth it if you're
| interested in the gory details, I'm usually not.
| Klonoar wrote:
| _> I was probably not alive the last time anyone would have
| learned that you should read existing code in some kind of
| linear order_
|
| I think you're jumping ahead and missing a point that the
| article itself made: there are indeed bootcamp developers who
| were taught this way. I have spent quite a number of hours of
| my life trying to walk some prospective developers back from
| this mindset.
|
| That said I think that you could write this entire article
| without dunking on junior developers and I don't consider it
| particularly well written, but that's a separate issue I guess.
| advael wrote:
| I suppose such a bootcamp may exist but wow, that's crazy to
| me
|
| But yea, having now read the whole thing I'm mostly taking
| issue with the writing style I guess. I find the method they
| tried interesting but it's worth noting that it's ultimately
| just another datapoint for the value of multi-scale analytic
| techniques when processing most complex data (Which is a
| great thing to have applied here, don't get me wrong)
| namanyayg wrote:
| I was a junior so long ago that I've forgotten how I first read
| code, but I do remember I was very confused.
|
| Edited the post to improve clarity. Thanks for the writing tip!
| advael wrote:
| Yea, sorry if I came off as caustic there; dealing with really
| dismissive attitudes toward juniors I'm actively trying to
| foster has perhaps left a bad taste in my mouth.
| namanyayg wrote:
| No worries. I took the metaphor too far and you rightfully
| called me out. I'm still learning how to write well, I
| promise you'll see better from me in the future.
| advael wrote:
| Love to see someone genuinely trying to improve at
| something and I'm glad to have played a tiny part in it
| nfRfqX5n wrote:
| Didn't seem like dunking on juniors to me
| coldtea wrote:
| I don't know. Your comment feels alien. The first line
| under the first bold heading is:
|
| "Remember your first day reading production code? Without any
| experience with handling mature codebases, you probably quickly
| get lost in the details".
|
| Which looks pretty much accurate. And yes, this includes the
| (later) implied idea that many juniors would read a PR in some
| kind of linear order, or at least, not read it in order of
| importance, or not know how to properly order their PR code
| reading. And yes, some just click in the order Github shows the
| changed files.
|
| Note that for 99% of the industry, "junior dev" is not the same
| as something like:
|
| "just out of uni person with 12+ years of experience
| programming since age 10, who built a couple of toy compilers
| before they were 16, graduated Stanford, and was recently hired
| at my FAANG team"
|
| It's usually something between that and the DailyWTF fare,
| often closer to the latter.
| lolinder wrote:
| The article was updated, probably in response to the parent
| comment. It used to read this:
|
| > Remember your first day reading production code? You
| probably did what I did - start at line 1, read every file
| top to bottom, get lost in the details.
|
| I copied before refreshing, and sure enough that line was
| modified.
| schaefer wrote:
| > I was probably not alive the last time anyone would have
| learned that you should read existing code in some kind of
| linear order, let alone programming.
|
| If you want to dive all the way down that rabbit hole, can I
| recommend you check out the wikipedia article for the book
| Literate Programming [1] by Donald Knuth [2].
|
| [1]: https://en.wikipedia.org/wiki/Literate_programming [2]:
| https://en.wikipedia.org/wiki/Donald_Knuth
| bobnamob wrote:
| I think this article is indicative of the "vibe" I've been
| getting when reading any discussion around genAI programming.
|
| The range of (areas of) competence is just so damn vast in our
| industry that any discussion about the quality of generated
| code (or code reviews in this case) is doomed. There just isn't
| a stable, shared baseline for what quality looks like.
|
| I mean really - how on earth can Jonny Startup, who spends his
| days slinging JS/TS to get his business launched in < a
| month[1], and Terrence PhD the database engineer, who writes
| simulation tested C++ for FoundationDB, possibly have a
| grounded discussion about code quality? Rarely do I see people
| declaring their priors.
|
| Furthermore, the article is so bereft of detail and gushes so
| profusely about the success and virtues of their newly minted
| "senior level" AI that I can't help but wonder if they're
| selling something...
|
| /rant
|
| [1] Please don't read this as a slight against Jonny Startup;
| his priorities are different.
| 9rx wrote:
| Is there a difference in quality? Johnny Startup is
| presumably trading quality in order to release sooner, but
| the lower quality accepted in that trade is recognizable.
| bobnamob wrote:
| If Jonny startup has been building release prioritised
| systems all his life/career, there's a decent chance he
| doesn't even know what more goes into systems with higher
| release & maintenance standards.
|
| Conversely, if Terrence has only ever worked in high rigour
| environments, he's unlikely to understand Jonny's
| perspective when Jonny says that code generation tools are
| doing amazing "reliable" things.
|
| Again, this isn't meant to be a value judgement against
| either Jonny or Terrence, more that they don't have shared
| context & understanding on what and how the other is
| building, and therefore are going to struggle to have a
| productive conversation about a magic blackbox that one
| thinks will take their job in 6 months.
| zkry wrote:
| > Furthermore, the article is so bereft of detail and gushes
| so profusely about the success and virtues of their newly
| minted "senior level" AI that I can't help but wonder if
| they're selling something...
|
| With all the money in the AI space these days, my prior
| probability for an article extolling the virtues of AI
| actually trying to sell something is rather high.
|
| I just want a few good unbiased academic studies on the
| effects of various AI systems on things like delivery time
| (like are AI systems preventing IT projects from going
| overtime on a fat-tailed distribution? is it possible with AI
| to put an end to the chapter of software engineering projects
| going disastrously overtime/overbudget?)
| lolinder wrote:
| To anyone who gets confused by the parent comment, note that
| the line they're referring to has been updated. It used to
| read:
|
| > Remember your first day reading production code? You probably
| did what I did - start at line 1, read every file top to
| bottom, get lost in the details.
|
| Now it reads:
|
| > Remember your first day reading production code? Without any
| experience with handling mature codebases, you probably quickly
| get lost in the details.
| namanyayg wrote:
| Oops, I should have marked my edit clearly. Added a footnote
| now.
| lolinder wrote:
| Thanks! No worries, we all live and learn. :)
| thimabi wrote:
| The change makes me question the authenticity of the text. I
| mean, did the author actually read files from top to bottom,
| or did he just write that because it suited his narrative?
|
| That's a trivial change to make for a line that did not
| receive the feedback that the author wanted. If that's the
| case, maybe the text was more about saying what people wanted
| to hear than honestly portraying how to make AI read code
| better.
| namanyayg wrote:
| I forced an analogy and took the metaphor too far. I
| promise you'll see better from me in the future!
| autobodie wrote:
| Metaphor? What metaphor? What analogy?
| tmpz22 wrote:
| > Remember your first day reading production code? You
| probably did what I did - start at line 1, read every
| file top to bottom, get lost in the details.
|
| Top to bottom left to right is how we read text (unless
| you are using Arabic or Hebrew!), the analogy was fine
| IMO. Don't let one HN comment shake your confidence; while
| people here may be well intentioned, they are not always
| right.
| namanyayg wrote:
| Haha thank you for the kind words!
|
| I've been a lurker on HN ever since I was a kid. I've
| seen over and over how HN is the most brusque & brutal
| online community.
|
| But that's also why I love it. Taking every piece of
| feedback here to learn and improve in the future, and
| feeling grateful for the thousands of views my article is
| receiving!
| edanm wrote:
| Hebrew speakers also read top to bottom and left to
| right, when they're reading code, because coding is
| (almost always) in English. :)
| lolinder wrote:
| Don't take this feedback too personally--remember that
| most HN users read and don't vote or comment, a subset of
| them read and vote, and only a tiny loud fraction of us
| actually comment.
|
| Your article has been very well received, and it wasn't
| because that one line deceived people into paying
| attention, it's because the content is good.
| iaseiadit wrote:
| When I started out, I did read code top-to-bottom. I was
| mostly self-taught and didn't have a mental model yet of
| how code was structured, so I relied on this "brute
| force" method to familiarize myself.
|
| I suppose it's not safe to assume that everyone started
| out like this. But advael is guilty of assuming that
| _nobody_ started out like this. And on top of that,
| conveying it in a very negative and critical way. Don't
| get discouraged.
| Jtsummers wrote:
| This discussion is about junior professionals, not zero
| experience programmers. If a junior professional
| programmer is still starting at the top of files instead
| of at the entry points to the program or the points of
| interest, then they had a very poor education.
| brundolf wrote:
| Wow, people are being very uncharitable in this comment
| section
| apstls wrote:
| Welcome to LLM-related threads on HN.
| soneca wrote:
| Oh, I was confused, thanks a lot.
|
| And, indeed, reading every file from top to bottom would have
| been very alien to me as a junior.
|
| I would just try to get to the file where I thought the change
| I needed to make was, and start trial and error. Definitely not
| checking the core files, much less creating a mental model of
| the architecture (the very concept of architecture would have
| been alien to me then).
|
| I would get lost in irrelevant details (because I thought
| they were relevant), while completely missing the details
| that did matter.
| olivierduval wrote:
| I think that you missed the point and should have read until
| "That's exactly how we feed codebases to AI"... ;-)
|
| Actually, the article shows that feeding an AI "structured"
| source code files instead of just a "flat full set" of files
| allows the LLM to give better insights.
| loeg wrote:
| I have actually just printed out codebases and read them cover
| to cover before (sometimes referencing ahead for context), as a
| senior engineer. If you need to quickly understand what every
| line is doing on a small to medium sized body of code, it's a
| pretty good way to avoid distraction and ramp up quickly. I
| find that just reading every line goes pretty quickly and gives
| me a relatively good memory of what's going on.
| ninetyninenine wrote:
| Doing this requires higher IQ. Believe it or not a ton of
| people literally don't do this because they can't. This
| ability doesn't exist for them. Thousands of pages of code is
| impossible to understand line by line for them. This
| separation of ability is very very real.
| pdhborges wrote:
| I don't read all the lines of code, but I open and scan a ton
| of files from the code base to get a feel for which concepts,
| abstractions, and tricks are used.
| myvoiceismypass wrote:
| > I was probably not alive the last time anyone would have
| learned that you should read existing code in some kind of
| linear order, let alone programming.
|
| Some of us have been around since before the concept of a "Pull
| Request" even existed.
|
| Early in my career we used to print out code (on paper, not
| diffs) and read / have round table reviews in person! This was
| only like 2 decades ago, too!
| overgard wrote:
| Yeah, to me his description of how programmers think didn't
| really jibe with either senior or junior. I think with senior
| developers when they look at a code review, they're busy, so
| they're looking for really obvious smells. If there's no
| obvious smells and it's easy to understand what the code is
| intending to do, they usually let it pass. Most of the time if
| one of my PRs gets rejected it's something along the lines of
| "I don't know why, but doing X seems sketch" or "I need more
| comments to understand the intended flow" or "The
| variable/function names aren't great"
| OzzyB wrote:
| So it turns out that AI is just like another function, inputs and
| outputs, and the better you design your input (prompt) the better
| the output (intelligence), got it.
| jprete wrote:
| The Bitter Lesson claimed that the best approach was to go with
| more and more data to make the model more and more generally
| capable, rather than adding human-comprehensible structure to
| the model. But a lot of LLM applications seem to add missing
| domain structure until the LLM does what is wanted.
| mbaytas wrote:
| Improving model capability with more and more data is what
| model developers do, over months. Structure and prompting
| improvements can be done by the end user, today.
| do_not_redeem wrote:
| The Bitter Lesson pertains to the long term. Even if it
| holds, it may take decades to be proven correct in this case.
| Short-term, imparting some human intuition is letting us get
| more useful results faster than waiting around for "enough"
| computation/data.
| Philpax wrote:
| The Bitter Lesson states that you can overcome the weakness
| of your current model by baking priors in (i.e. specific
| traits about the problem, as is done here), but you will get
| better long-term results by having the model learn the priors
| itself.
|
| That seems to have been the case: compare the tricks people
| had to do with GPT-3 to how Claude Sonnet 3.6 performs today.
| shahzaibmushtaq wrote:
| You got that 100% right. The title should be "The day I told
| (not taught) AI to read code like a Senior Developer".
| syndicatedjelly wrote:
| Not trying to nitpick, but the phrase "AI is just like another
| function" is too charitable in my opinion. A function, in
| mathematics as well as programming, transforms a given input
| into a specific output in the codomain space. Per the Wikipedia
| definition:
|
|     In mathematics, a function from a set X to a set Y assigns
|     to each element of X exactly one element of Y.[1] The set X
|     is called the domain of the function[2] and the set Y is
|     called the codomain of the function.[3]
|
| Not to call you out specifically, but a lot of people seem to
| misunderstand AI as being just like any other piece of code.
| The problem is, unlike most of the code and functions we write,
| it's not simply another function, and even worse, it's usually
| not deterministic. If we both give a function the same input,
| we should expect the same output. But this isn't the case when
| we paste text into ChatGPT or something similar.
| int_19h wrote:
| LLMs are _literally_ a deterministic function of a bunch of
| numbers to a bunch of numbers. The non-deterministic part
| only comes when you apply the random pick to select a token
| based on the weights (deterministically) computed by the
| model.
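|
| Roughly, in code (a toy sketch; `probs` stands for the per-token
| probabilities the model has already computed deterministically):
|
|     // Greedy decoding: same context in, same token out, every time.
|     function greedyPick(probs: number[]): number {
|       return probs.indexOf(Math.max(...probs));
|     }
|
|     // Sampling: the only non-deterministic step is Math.random().
|     // Raising probs to 1/T and renormalizing is equivalent to
|     // dividing the logits by the temperature T.
|     function samplePick(probs: number[], temperature = 1.0): number {
|       const scaled = probs.map(p => Math.pow(p, 1 / temperature));
|       const total = scaled.reduce((a, b) => a + b, 0);
|       let r = Math.random() * total;
|       for (let i = 0; i < scaled.length; i++) {
|         r -= scaled[i];
|         if (r <= 0) return i;
|       }
|       return scaled.length - 1;
|     }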
| risyachka wrote:
| >> The AI went from: >> "This file contains authentication logic
| using JWT tokens"
|
| So what was the initial prompt? "What's in this file?"
|
| And then you added context and it became context-aware. A bit of
| an overstatement to call this "Holy Shit moment"
|
| Also, why is it "we"? What is "our AI"? And what is "our
| benchmark script"?
|
| And how big is your codebase? 50k files? 20 files?
|
| This post has very very little value without a ton of details,
| looks like nowadays everything "ai" labeled gets to the front
| page.
| dijksterhuis wrote:
| > looks like nowadays everything "ai" labeled gets to the front
| page.
|
| it's been this way for like a year or more. hype machine gotta
| hype.
| sgt101 wrote:
| As John McCarthy used to complain: "Look Mom, no hands!"
|
| This isn't an experiment, or proof of anything. It's a story.
|
| I wish people in the community would engage their critical skills
| and training. The folks that taught the author should hang their
| head in shame that their student is producing such rubbish.
| namanyayg wrote:
| It's my personal experience for now. Have some experiments and
| further study planned.
|
| It's difficult to set up evals, especially with production code
| situations. Any tips?
| sgt101 wrote:
| Yeah, it is hard.
|
| The principle you need to work to is that you need to create
| evidence that other people will find compelling, and then show
| that you have interrogated your results and checked that it's
| really working better than chance and not the result of some
| fluke or other. Finally, you need to find a way to explain
| what's happening - like an actual mechanism.
|
| 1. Find or make a data set - I've been using code_search_net
| to try and study the ability of LLMs to document code,
| specifically the impact of optimising n-shot learning on
| them*. This may not be close enough to your application, but
| you need many examples to draw conclusions. It's likely that
| you will have to do some statistics to demonstrate the effects
| of your innovations, so you probably need around 100 examples.
|
| 2. Results from one model may not be informative enough; it
| might be useful/necessary to compare several different models
| to see if the effect you are finding is consistent or whether
| some special feature of a model is what is required. For
| example, does this effect work with only the largest and most
| sophisticated modern models, or is this something that can be
| seen to a greater or lesser effect with a variety of models?
|
| 3. You need to ablate - what is it in the setup that is most
| impactful? What happens if we change a word or add a word to
| the prompt? Does this work on long code snippets? Does it
| work on code with many functions? Does it work on code from
| particular languages?
|
| 4. You need a quantitative measure of performance. I am a
| liberal sort, but I will not be convinced by an assertion
| that "it worked better than before" or "this review is like
| a senior's, I think". There needs to be a number that someone
| like me can't argue with - or at least, can argue with but
| can't dismiss.
|
| *I couldn't make it work, I think because the search space
| for finding good prompting shots (sample functions) is vast and
| the output space (possible documents) is vast. Many bothans
| died in order to bring you this very very very (in hindsight
| with about $200 of OpenAI spending) obvious result. Having
| said that I am not confident that it couldn't be made to work
| at this point so I haven't written it up and won't make any
| sort of definitive claim. Mainly I wonder if there is a
| heuristic that I could use to choose examples a-priori
| instead of trying them at random. I did try shorter examples
| and I did try more typical (size) examples. The other issue
| is that I am using sentence similarity as a measure of
| quality, but that isn't something I am confident of.
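|
| To make points 1, 2 and 4 concrete, a rough skeleton of such an
| eval loop might look like this (every name here is a
| placeholder, not a real library):
|
|     interface Example { code: string; expected: string; }
|     type Variant = (code: string, model: string) => Promise<string>;
|
|     // Run each prompting variant against each model over the whole
|     // dataset and report one mean score per (model, variant) pair.
|     async function evaluate(
|       dataset: Example[],                 // e.g. ~100 code_search_net items
|       models: string[],                   // several models, not just one
|       variants: Record<string, Variant>,  // baseline vs. context-first
|       score: (out: string, expected: string) => number // e.g. similarity
|     ): Promise<Record<string, number>> {
|       const results: Record<string, number> = {};
|       for (const model of models) {
|         for (const [name, run] of Object.entries(variants)) {
|           let total = 0;
|           for (const ex of dataset) {
|             total += score(await run(ex.code, model), ex.expected);
|           }
|           results[`${model}/${name}`] = total / dataset.length;
|         }
|       }
|       return results;
|     }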
| og_kalu wrote:
| Yeah, it's a story. So what? If you'd like evals or
| implementation details, then you can just say so.
|
| People can and are free to tell stories if they want. It's not
| some failing. You don't have to engage with it any more than
| that.
| efuquen wrote:
| > The folks that taught the author should hang their head in
| shame that their student is producing such rubbish.
|
| This is unnecessary and rude, you should hang your head in
| shame for that. I wish some people in this community weren't so
| reactionary and would engage with empathy instead of trying to
| personally roast people as soon as they don't agree with
| something.
|
| Someone can tell a story on the internet, it doesn't have to be
| some rigorous experiment or proof.
| sgt101 wrote:
| I completely disagree.
|
| Computer Science has a massive ethics crisis. Uncritical
| adulation and a total lack of accountability or consequence
| is part of that.
|
| There is a massive misallocation of capital which is burning
| opportunity for our society. Users are getting terrible
| experiences and systems because people read this sort of
| thing and believe it. Trust in our technology is eroded and
| this has consequences for actual people. We have abandoned
| the standards that protected people and you have the view
| that these standards, or a shadow of them even, are
| unnecessary?
|
| Someone has taught a generation that this is all OK; it
| isn't.
| efuquen wrote:
| I don't disagree with some of the points you are making here,
| but the main point of my own comment is that your last sentence
| above was mean-spirited and unnecessary.
|
| You want to have a conversation about quality and ethics in
| computing and how this post can be pushing a narrative that
| is not in line with your views on this, I think that is
| worthwhile to have. But personal denigration of someone
| else isn't necessary in doing that.
| cloudking wrote:
| Sounds like OP hasn't tried the AI IDEs mentioned in the article.
|
| For example, Cursor Agent mode does this out of the box. It
| literally looks for context before applying features, changes,
| fixes etc. It will even build, test and deploy your code - fixing
| any issues it finds along the way.
| batata_frita wrote:
| Have you tried cline to compare with cursor?
|
| I haven't tried Cursor yet, but for me Cline does an excellent
| job. It uses internal mechanisms to understand the code base
| before making any changes.
| cloudking wrote:
| I'll give it a go, I've also heard Windsurf is quite good.
|
| Personally I've been very impressed with Cursor Agent mode,
| I'm using it almost exclusively. It understands the entire
| codebase, makes changes across files, generates new files,
| and interacts with terminal input/output. Using it, I've been
| able to build, test & deploy fullstack React web apps and
| three.js games from scratch.
| yapyap wrote:
| haha man, some of yall really talk about AI like it's some baby
| with all the knowledge in the world, waiting to be taught common
| sense
| _0ffh wrote:
| Very sceptical of "Context First: We front-load system
| understanding before diving into code". The LLM sees the whole
| input at once, it's a transformer, not a recurrent model. Order
| shouldn't matter in that sense.
|
| Ed. I see some people are disagreeing. I wish they explained how
| they imagine that would work.
| Workaccount2 wrote:
| Just like training data, the more context and the higher quality
| the context you give the model, the better the outputs become.
| scinadier wrote:
| A bit of a disappointing read. The author never elaborates on the
| details of the particular day in which they taught AI to read
| code like a Senior Developer.
|
| What did they have for lunch? We'll never know.
| namanyayg wrote:
| Credit goes to "You Suck at Cooking" for their genius smash
| burger recipe [0] for my lunch that day
|
| [0] https://www.youtube.com/watch?v=nq9WnmCGoFQ
| quantadev wrote:
| In my Coding Agent, I ended up realizing my prompts need to be
| able to specifically mention very specific areas in the code, for
| which no real good syntax exists, so I invented
| something I call "Named Blocks".
|
| My coding agent allows you to put any number of named blocks in
| your code and then mention those in your prompts by name, and the
| AI understands what code you mean. Here's an example:
|
| In my code:
|
|     -- block_begin SQL_Scripts
|     ...some sql scripts...
|     -- block_end
|
| Example prompt:
|
|     Do you see any bugs in block(SQL_Scripts)?
| mdaniel wrote:
| > for which no real good syntax exists for doing that
|
| Up to you, but several editors have established syntax of which
| any code-trained model will likely have seen plenty of examples:
|
| vim (set foldmethod=marker and then {{{ begin\n }}} end\n )
| https://vimdoc.sourceforge.net/htmldoc/usr_28.html#28.6
|
| JetBrains <editor-fold desc=""></editor-fold>
| https://www.jetbrains.com/help/idea/working-with-source-code...
|
| Visual Studio (#pragma region)
| https://learn.microsoft.com/en-us/cpp/preprocessor/region-en...
| (et al, each language has its
| own)
| quantadev wrote:
| The great thing about my agent, which I left out, is that it
| extracts out all the named blocks using just pure Python, so
| that the prompt itself has them embedded directly in it.
| That's faster and more efficient than even having a "tool
| call" that extracts blocks by name. So I needed a solution
| where my own code can get named block content out of any kind
| of file. Having one syntax that works on ALL types of files
| was ideal.
|
| UPDATE: In other words, it's always "block_begin" / "block_end"
| regardless of what the comment characters are, which will be
| different for different files, of course.
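|
| The extraction itself can be a one-regex affair. A rough sketch
| of the idea (in TypeScript here for illustration; the agent
| described above does it in plain Python):
|
|     // Pull named blocks out of any file, regardless of comment
|     // characters, by matching the block_begin/block_end markers.
|     function extractNamedBlocks(source: string): Map<string, string> {
|       const blocks = new Map<string, string>();
|       const re = /block_begin\s+(\w+)\s*\n([\s\S]*?)\n[^\n]*block_end/g;
|       let m: RegExpExecArray | null;
|       while ((m = re.exec(source)) !== null) {
|         blocks.set(m[1], m[2]); // name -> lines between the markers
|       }
|       return blocks;
|     }
|
|     // extractNamedBlocks("-- block_begin SQL_Scripts\nSELECT 1;\n-- block_end")
|     //   => Map { "SQL_Scripts" => "SELECT 1;" }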
| qianli_cs wrote:
| I thought it was about LLM training but it's actually prompt
| engineering.
| namanyayg wrote:
| I'm thinking about training next! But deepseek is so good
| already
| voidhorse wrote:
| To me, this post really just highlights how important the human
| element will remain. Without achieving the same level of
| contextual understanding of the code base, I have no clue as to
| whether or not the AI warning makes any sense.
|
| At a superficial level, I have no idea what "shared patterns"
| means or why it logically follows that sharing them would cause a
| race condition. It also starts out talking about authentication
| changes, but then cites a PR that modified "retry logic"--without
| that shared context, it's not clear to me that an auth change has
| anything to do with retry logic unless the retry is related to
| retries on authentication failures.
| dimtion wrote:
| Without knowing exactly how createNewGroup and addFileToGroup are
| implemented it is hard to tell, but it looks like the code
| snippet has a bug where the last group created is never pushed
| to the groups variable.
|
| I'm surprised this "senior developer AI reviewer" did not catch
| this bug...
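|
| If the snippet follows the usual accumulate-and-flush pattern,
| the fix is a final push after the loop. A rough reconstruction
| of the pattern being described (not the post's actual code):
|
|     // Without the final push, the last group silently disappears.
|     function groupFiles<T>(
|       files: T[],
|       shouldStartNewGroup: (group: T[], next: T) => boolean
|     ): T[][] {
|       const groups: T[][] = [];
|       let current: T[] = [];
|       for (const file of files) {
|         if (current.length > 0 && shouldStartNewGroup(current, file)) {
|           groups.push(current); // flush the finished group
|           current = [];
|         }
|         current.push(file);
|       }
|       if (current.length > 0) {
|         groups.push(current);   // the push that seems to be missing
|       }
|       return groups;
|     }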
| crazygringo wrote:
| I'm fascinated by stories like these, because I think it shows
| that LLM's have only shown a small amount of their potential so
| far.
|
| In a way, we've solved the raw "intelligence" part -- the next
| token prediction. (At least in certain domains like text.)
|
| But now we have to figure out how to structure that raw
| intelligence into actual useful thinking patterns. How to take a
| problem, analyze it, figure out ways of breaking it down, try
| those ways until you run into roadblocks, then start figuring out
| some solution ideas, thinking about them more to see if they
| stand up to scrutiny, etc.
|
| I think there's going to be a lot of really interesting work
| around that in the next few years. A kind of "engineering of
| practical thinking". This blog post is a great example of one
| first step.
| Terr_ wrote:
| > But now we have to figure out how to structure that raw
| intelligence into actual useful thinking patterns.
|
| My go-to framing is:
|
| 1. We've developed an amazing tool that extends a document. Any
| "intelligence" is in there.
|
| 2. Many uses begin with a document that resembles a movie-
| script conversation between a computer and a human,
| alternately adding new lines (from a real human) and
| _performing_ the extended lines that parse out as "Computer
| says."
|
| 3. This illusion is effective against _homo sapiens_, who are
| biologically and subconsciously primed to make and experience
| stories. We confuse the actor with the character with the
| scriptwriter.
|
| Unfortunately, the illusion is so good that a lot of
| _developers_ are having problems pulling themselves back to the
| real world too. It's as if we're trying to teach fashion-sense
| and embarrassment and empathy to a cloud which _looks like_ a
| person, rather than changing how the cloudmaker machine works.
| (The latter also being more difficult and more expensive.)
| afro88 wrote:
| Another cherry-picked example of an LLM doing something amazing,
| written about with a heavy dose of anthropomorphism.
|
| It's easy to get LLMs to do seemingly amazing things. It's
| incredibly hard to build something where it does this amazing
| thing consistently and accurately for all reasonable inputs.
|
| > Analyzing authentication system files:
|
| > - Core token validation logic
|
| > - Session management
|
| > - Related middleware
|
| This hard coded string is doing some very heavy lifting. This
| isn't anything special until this string is also generated
| accurately and consistently for any reasonable PR.
|
| OP if you are reading, the first thing you should do is get a
| variety of codebases with a variety of real world PRs and set up
| some evals. This isn't special until evals show it producing
| consistent results.
| namanyayg wrote:
| That's exactly what I want to do next.
|
| Any tips on how should I get codebases and real world PRs? Are
| the ones on popular open source repos on GitHub sufficient? I
| worry that they don't really accurately reflect real world
| closed source experience because of the inherent selection
| bias.
|
| Secondly, after getting all this, how do I evaluate which
| method gave better results? Should it be done by a human, or
| should I just plug an LLM to check?
| jarebear6expepj wrote:
| "...should it be done by a human?"
|
| Sigh.
| namanyayg wrote:
| I'll do it personally in the beginning but was thinking
| about doing it on scale
| JTyQZSnP3cQGa8B wrote:
| > doing it on scale
|
| Like cloud-scale, no-code scale, or NoSQL scale? You are
| confused, which shows that, maybe, you should not be
| using such tools with the experience that you don't have.
| zBard wrote:
| 'Like cloud-scale, no-code scale or NoSQL scale'.
|
| That is the dumbest statement I have heard this week. You
| should perhaps refrain from commenting, at least until
| you gain the modicum of intelligence that you currently
| don't have.
| smusamashah wrote:
| The very first thing you can tell us (or try if you haven't) is
| whether you get the same answer if you re-prompt. Second, can
| you get it to generate (consistently and repeatedly) the text
| that gp pointed out?
|
| No need to switch to a different repo for a quick test; just
| make it reproducible on your current repo.
| namanyayg wrote:
| Not the exact text, but still decent quality. I'll play
| around with temperature and prompts a bit.
| QuadmasterXLII wrote:
| If you could get an LLM to check, you could just spam
| solutions with any assortment of models and then use your
| checker to pick the best.
| throwup238 wrote:
| _> I worry that they don 't really accurately reflect real
| world closed source experience because of the inherent
| selection bias._
|
| As opposed to what, yet another beginner React app? That's
| what everyone seems to be testing with but none of the
| projects I've seen are reflective of a production codebase
| that's years old and has been touched by a dozen developers.
|
| Throw it at a complicated non-frontend mixed language repo
| like cxx-qt [1] or something, preferably where the training
| data doesn't include the latest API.
|
| [1] https://github.com/KDAB/cxx-qt
| lukan wrote:
| "preferably where the training data doesn't include the
| latest API"
|
| That is the reason LLM's in their current shape are pretty
| useless to me for most tasks.
|
| They happily mix different versions of popular frameworks, so I
| have to do so much manual work to fix it that I'd rather do it
| all by myself.
|
| Pure (common) math problems, or other domains where the
| tech did not change so much, like bash scripts or regex are
| where I can use them. But my actual code? Not really. The
| LLM would need to be trained only on the API version I use
| and that is not a thing yet, as far as I am aware.
| fuzzythinker wrote:
| https://www.kaggle.com/competitions/konwinski-prize
| rHAdG12327 wrote:
| Given how eager Microsoft is to steal other people's code,
| perhaps the leaked Windows source code would be an option. Or
| perhaps Microsoft will let you train on their internal issue
| tracker.
| InkCanon wrote:
| I also think there's some exaggeration. Annotating files with a
| feature tag system is both manual and not scalable. Custom
| prompting for each commit or feature is even more so. You do a
| decent bit of specialized work here.
|
| And I think he left out the most important part: was the answer
| actually right? The real value of any good dev at all is that
| he can provide reasonably accurate analysis with logic and
| examples. "Could have an error" is more like a compiler warning
| than the output of a good engineer.
|
| Side note: "broke the benchmark script?" If you have an
| automated way to qualitatively evaluate the output of an LLM in
| a reasonably broad context like code reading, that's far bigger
| a story.
| imoreno wrote:
| >Annotating files with a feature tag system is both manual
| and not scabale.
|
| Wouldn't you have the AI annotate it?
| ninetyninenine wrote:
| This post talks as if the results are a worthless pile of trash
| while obeying the HN rules of not directly insulting the
| results. I agree with everything under the first paragraph.
|
| Let me spell it out for you. These results. Are. Not.
| Worthless.
|
| Certainly what you said is correct on what he "should" do to
| get additional data, but your tonality of implying that the
| results are utter trash and falsely anthropomorphizing
| something is wrong.
|
| Why is it wrong? Imagine Einstein got most things wrong in his
| life. Most things but he did discover special and general
| relativity. It's just everything else was wrong. Relativity is
| still worth something. The results are still worthwhile.
|
| We have an example of an LLM hallucinating. Then we have
| another example of additional contextual data causing the LLM
| to stop hallucinating. This is a data point leaving a clue
| about hallucinations and stopping hallucinations. It's
| imperfect but a valuable clue.
|
| My guess is that there's a million causal factors that cause an
| LLM to hallucinate and he's found one.
|
| If he does what he did a multitude of times for different
| topics and different problems where contextual data stops an
| hallucination, with enough data and categorization of said data
| we may be able to output statistical data and have insight into
| what's going on from a statistical perspective. This is just
| like how we analyze other things that produce fuzzy data like
| humans.
|
| Oh no! Am I anthropomorphizing again?? Does that action make
| everything I said wrong? No, it doesn't. Humans produce correct
| data when given context. It is reasonable to assume in many
| cases LLMs will do the same. I wrote this post because I agree
| with everything you said but not your tone which implies that
| what OP did is utterly trivial.
| godelski wrote:
| They didn't say worthless, they said amazing.
|
| Their comment is "do it consistently, then I'll buy your
| explanation"
| ninetyninenine wrote:
| lol "seemingly amazing" means not amazing at all.
|
| He didn't literally say it but the comment implies it is
| worthless as does yours.
|
| Humans dont "buy it" when they think something is
| worthless. The tonality is bent this way.
|
| He could have said, "this is amazingly useful data but we
| need more" but of course it doesn't read like this at all
| thanks to the first paragraph. Let's not hallucinate it
| into something it's not with wordplay. The comment is
| highly negative.
| wholinator2 wrote:
| You seem very emotionally involved in this. It says "an
| LLM doing something amazing". That's the sentence. Later
| the term "seemingly amazing" is used. Implying that it
| _seems amazing_. Anything beyond that is your personal
| interpretation. Do you disagree that there is an excess
| of cherrypicked LLM examples getting anthropomorphized?
| Yeah, it did a cool thing. Yes, llms doing single cool
| things are everywhere. Yes, I will be more convinced of
| its impact when I see it tested more widely.
| ninetyninenine wrote:
| I am emotionally involved in the sense that I disagree
| and dislike the tone. The core of my post is addressing
| tonality and thus emotions is the topic of my post. I'm
| emotionally involved in the same way it's normal for
| humans to have emotions. If you can't see this then you
| don't have the rationality or emotional capacity to
| understand what I am saying.
|
| > Anything beyond that is your personal interpretation.
|
| Not true. In the context of that post when something is
| implied that it only seems amazing it also implies that
| it is likely not amazing. That is a common human
| interpretation. I find if you're not interpreting it that
| way your emotions are influencing your interpretation.
|
| > Do you disagree that there is an excess of cherrypicked
| LLM examples getting anthropomorphized?
|
| I disagree. I think everybody is doing the opposite and
| cherry-picking the instances where the LLM gets stuff
| wrong and saying LLMs are garbage because of that and
| ignoring the instances where it gets shit right and
| classifying that instance as regurgitation.
|
| You're not being rational here by bringing up
| anthropomorphization. We are mainly talking about
| correctness. If an aspect of LLM intelligence is similar
| to humans in generating correct data then
| anthropomorphizing it is the way to go. Whether it's
| anthropomorphizing something is completely orthogonal to
| the problem. We are talking about correct results. If
| anthropomorphizing something gets us there... who cares?
| Human intelligence works and it's even reasonable to
| believe LLMs think like humans because they are literally
| trained on human data.
|
| The act of even mentioning anthropomorphization without
| mentioning correctness is irrational and illogical.
| nuancebydefault wrote:
| Still, the findings in the article are very valuable. The fact
| that directing the "thought" process of the LLM by this kind of
| prompting, yields much better results, is useful.
|
| The comparison to how a senior dev would approach the
| assignment, as a metaphor explaining the mechanism, makes
| perfect sense to me.
| mensetmanusman wrote:
| No need for the pessimism, these are new tools that humans have
| invented. We are grokking how to utilize them.
| ramblerman wrote:
| OP brought a rational argument, you didn't. It sounds like
| you are defending your optimism with emotion.
|
| > We are grokking how to utilize them.
|
| Indeed.
| asah wrote:
| It's incredibly hard to get _HUMANS_ to do this amazing
| thing consistently and accurately for all reasonable inputs.
| iaseiadit wrote:
| Some humans can do it consistently, other humans can't.
|
| Versus how no publicly-available AI can do it consistently
| (yet). Although it seems like a matter of time at this point,
| and then work as we know it changes dramatically.
| overgard wrote:
| To be fair, most senior developers don't have any incentive
| to put this amount of analysis into a working codebase. When
| the system is working, nobody really wants to spend time they
| could be working on something interesting trying to find bugs
| in old code. Plus there's the social consideration that your
| colleagues might not like you a lot if you spend all your
| time picking their (working) code apart while not doing any
| of your tasks. Usually this kind of analysis would come from
| someone specifically brought in to find issues, like an
| auditor or a pen-tester.
| kotlip wrote:
| The right incentives would motivate bug hunting, it
| entirely depends on company management. Most competent
| senior devs I've worked with spend a great deal of time
| carefully reading through PRs that involve critical
| changes. In either case, the question is not whether humans
| tend to act a certain way, but whether they are capable of
| skillfully performing a task.
| smrtinsert wrote:
| Unfortunately some senior devs like myself do care. Too bad
| no one else does. Code reviews become quick after a while;
| your brain adapts to being able to review code deeply and quickly.
| talldayo wrote:
| Humans are fully capable of protracted, triple-checked
| scrutiny if the incentives align just right. Given the same
| conditions, you cannot ever compel an AI to stop being wrong
| or consistently communicate what it doesn't understand.
| ryanackley wrote:
| What I want to know is how accurate was the comment? I've found
| AI to frequently suggest _plausible_ changes. Like they use
| enough info and context to look like excellent suggestions on
| the surface, but with some digging you realize they were
| completely wrong.
| Der_Einzige wrote:
| The people who claim it's that hard to do these things have
| never heard of or used constrained/structured generation, and
| it shows big time!
|
| Most other related issues of models these days are due to the
| tokenizer or poor choice of sampler settings, which is a cheap
| shot at the models.
| j45 wrote:
| What I'm learning is that just because something might be hard
| for you or me doesn't mean it's not possible or not working.
|
| LLMs can generally only do what they have data on, either in
| training or in instructions via prompting, it seems.
|
| Keeping instructions reliable as they increase, and testing
| them, appears to benefit from LLMOps tools like Agenta, etc.
|
| It seems to me like LLMs are reasonably well suited for things
| that code can't do easily as well. You can find models on
| Hugging Face that are great at categorizing and applying
| labels, instead of trying to get a generalized
| assistant model to do it.
|
| I'm more and more looking at tools like OpenRouter to allow
| doing each step with the model that does it best, almost
| functionally where needed to increase stability.
|
| For now, it seems to be one way to improve reliability
| dramatically, happy to learn about what others are finding too.
|
| It seems like a pretty nascent area still where existing
| tooling in other areas of tech is still figuring itself out in
| the LLM space.
| revskill wrote:
| The seniors master more than 2 or 3 languages.
| SunlitCat wrote:
| Oh my. That title alone inspired me to ask ChatGPT to read a
| simple hello world cpp program like a drunken sailor.
|
| The end result was quite hilarious I have to say.
|
| Its final verdict was:
|
| End result? It's a program yellin', "HELLO WORLD!" Like me at the
| pub after 3 rum shots. Cheers, matey! hiccup
|
| :D
| namanyayg wrote:
| Recently I've started appending "in the style of Edgar Allan
| Poe" to my prompts when I'm discussing code architecture.
|
| It's really quite interesting how the LLM comes up with ways to
| discuss code :)
| dartos wrote:
| I think the content is interesting, but anthropomorphizing AI
| always rubs me the wrong way and ends up sounding like marketing.
|
| Are you trying to market a product?
| zbyforgotp wrote:
| Personally, I would not hardcode the discovery process but just
| give the LLM tools to browse the code and find what it needs
| itself.
| atemerev wrote:
| This is what Aider does out of the box.
| mbrumlow wrote:
| > Context First: We front-load system understanding before diving
| > into code
| > Pattern Matching: Group similar files to spot repeated approaches
| > Impact Analysis: Consider changes in relation to the whole system
|
| Wait. You fixed your AI by doing traditional programming!?!?!
| _0ffh wrote:
| The context first bit doesn't even make sense.
|
| Transformers process their whole context window in parallel,
| unlike people who process it serially. There simply _is_ no
| place that gets processed "first".
|
| I'd love to see anyone who disagrees explain to me how that is
| supposed to work.
| danjl wrote:
| > Identifying tech debt before it happens
|
| Tech debt is a management problem, not a coding problem. A
| statement like this undermines my confidence in the story being
| told, because it indicates the lack of experience of the author.
| noirbot wrote:
| I don't think that's totally accurate though. It can definitely
| be a coding problem - corners cut for expediency and such.
| Sometimes that's because management doesn't offer enough time
| to not do it that way, but it can also just be because the dev
| doesn't bother or does the bare minimum.
|
| I'd argue the creation of tech debt is often a coding problem.
| The longevity and persistence of tech debt is a management
| problem.
| danjl wrote:
| Not taking enough time is a management problem. It doesn't
| matter whether it is the manager or the developer who takes
| shortcuts. The problem is planning, not coding.
| dijksterhuis wrote:
| > it can also just be because the dev doesn't bother or does
| the bare minimum
|
| sounds like a people problem -- which is a management problem.
|
| > I'd argue the creation of tech debt is often coding
| problem. The longevity and persistence of tech debt is a
| management problem.
|
| i'd argue the creation of tech debt is more often due to
| those doing the coding operating under the limitations placed
| on them. The longevity and persistence of tech debt is just
| an extension of that.
|
| given an infinite amount of time and money, i can write an
| ideal solution to a solvable problem (or at least close to
| ideal, i'm not that good of a dev).
|
| the limitations create tech debt, and they're always there
| because infinite resources (time and money) don't exist.
|
| so tech debt always exists because there's always
| limitations. most of the time those resource limitations are
| decided by management (time/money/people)
|
| but there are language/framework/library limitations which
| create tech debt too though, which i think is what you might
| be referring to?
|
| usually those are less common though
| shahzaibmushtaq wrote:
| Fresh bootcamp grads aren't going to understand what the
| author is talking about, and many senior developers (even mid-
| seniors) are looking for the prompts the author wrote to teach
| AI how to become a senior developer.
| highcountess wrote:
| Dev palms just got that much more sweaty.
| brundolf wrote:
| People are being very uncharitable in the comments for some
| reason
|
| This is a short and sweet article about a very cool real-world
| result in a very new area of tooling possibilities, with some
| honest and reasonable thoughts
|
| Maybe the "Senior vs Junior Developer" narrative is a little
| stretched, but the substance of the article is great
|
| Can't help but wonder if people are getting mad because they feel
| threatened
| namanyayg wrote:
| Felt a bit more cynical than usual haha.
| jryan49 wrote:
| It seems LLMs are very useful for some people and not so much
| for others. Both sides believe it's all or nothing: if it's
| garbage for me it must be garbage; if it's doing my work it
| must be able to do everyone's work... Everyone is very
| emotional about it too because of the hype around it. Almost
| all conversations about LLMs, especially on HN, are full of this
| useless bickering.
| epolanski wrote:
| I am more and more convinced that many engineers are very
| defensive about AI and would rather point out any flaw than
| think how to leverage the tools to get any benefit out of them.
|
| Just the other day I used Cursor to iteratively implement
| stories for 70 .vue files in a few hours, while also writing
| documentation for the components and pages, then fed that
| documentation back to Cursor to write many E2Es, something that
| would've taken me at least a few days if not a week.
|
| When I shared that with some coworkers they went on a hunt to
| find all the shortcomings (often petty duplication of mocks,
| sometimes a missing story scenario, nothing major).
|
| I found it striking as we really needed it and it provides
| tangible benefits:
|
| - domain and UI stakeholders can navigate stories and think of
| more cases/feedback with ease from a UX/UI point of view,
| without having to replicate the scenarios manually through
| multiple time-consuming repetitive operations in the actual
| applications
|
| - documentation proved to be very valuable to a junior who
| joined us this very January
|
| - E2Es caught multiple bugs in their own PRs in the weeks after
|
| And yet, instead of appreciating the cost/benefit ratio of the
| solution (something that should characterise a good engineer;
| after all, that's our job), I was scolded because they (or I)
| would've done a more careful job, missing that they had never
| done it in the first place.
|
| I have many such examples, such as automatically providing all
| the translation keys and translations for a new locale, only to
| get cherry-picked criticism that this or that could've been
| worded differently. Of course it could; what's your job if not
| being responsible for the localisation? That shouldn't diminish
| the fact that 95% of the content was correct and provided in a
| few seconds rather than days.
|
| Why do they do that? I genuinely feel some feel threatened;
| most of them reek of insecurity.
|
| I can understand some criticism towards those who build and
| sell hype with cherry-picked results, but I can't help finding
| that some of the worst critics suffer from Luddism.
| krainboltgreene wrote:
| Given how much damage it's done to our industry without any
| appreciable impact on the actual system's efficacy, it makes
| sense to me that _experts in a mechanism_ are critical of
| people telling them how effective this "tool" is for the
| mechanism.
|
| I suppose it's simply easier to think of them as scared and
| afraid of losing their jobs to robots, but the reality is
| most programmers already know someone who lost their job to a
| robot that doesn't even exist yet.
| chchchangelog wrote:
| Oh no being a working serf is going away
|
| In the US anyway, we can all take advantage of the 2A, sit
| around strapped keeping the politicians in-line
|
| https://aeon.co/essays/game-theory-s-cure-for-corruption-
| mak...
|
| No need to live by an economic ledger that is politically
| convenient to the elite.
|
| Key axioms, techniques, and technology were invented before
| the contemporary leadership used media to take credit.
| Little reason to actually venerate normal people who
| happened to come along at an opportune time.
|
| Lucky to exist, survivorship bias is a terrible reason to
| deify anyone from prior generations. Maybe all the actual
| smart people from their day died? The survivors being risk
| averse weenies.
|
| 1900s American culture needs to be tossed in the bin along
| with the old "winners".
| jazzyjackson wrote:
| You sound like a markov chain.
| cutnpaste wrote:
| Is that a pejorative?
|
| Let me try...
|
| Your comment reads like:
|
| > cat textbook.txt | echo
|
| Zing!
| __loam wrote:
| Least insane hackernews commenter
| __loam wrote:
| To me it sounds like you're rushing the work and making
| mistakes then dismissing people who point it out.
| imoreno wrote:
| To me, articles like this are not so interesting for the
| results. I'm not reading them to find out exactly what the
| performance of AIs is. Obviously they're not useful for that:
| they're not systematic, they're anecdotal, unscientific...
|
| I think LLMs today, for all their goods and bads, can do some
| useful work. The problem is that there is still a lot of mystery
| about how to use them effectively. I'm not talking about some
| pie-in-the-sky singularity stuff, just coming up with prompts to
| do basic, simple tasks effectively.
|
| Articles like that are great for learning new prompting tricks
| and I'm glad the authors are choosing to share their knowledge.
| Yes, OP isn't saying the last word on prompting, and there's a
| million ways it could be better. But the article is still
| useful to an average person trying to learn how to use LLMs
| more productively.
|
| >the "Senior vs Junior Developer" narrative
|
| It sounds to me like just another case of "telling the AI to
| explicitly reason through its answer improves the quality of
| results". The "senior developer" here is better able to triage
| aspects of the codebase to identify the important ones (and to
| the "junior" everything seems equally important) and I would
| say has better reasoning ability.
|
| Maybe it works because when you ask the LLM to code something,
| it's not really trying to "do a good job", besides whatever
| nebulous bias is instilled from alignment. It's just trying to
| act the part of a human who is solving the problem. If you tell
| it to act a more competent part, it does better - but it has to
| have some knowledge (aka training data) of what the more
| competent part looks like.
| dang wrote:
| No doubt the overstated title is part of the reason, so we've
| adopted the subtitle above, which is presumably more accurate.
| ianbutler wrote:
| Code context and understanding are very important for improving
| the quality of LLM-generated code, which is why the core of our
| coding agent product Bismuth (which I won't link, but if you're
| so inclined check my profile) is built around a custom code
| search engine we built ourselves.
|
| We segment the project into logical areas based on what the user
| is asking, then find interesting symbol information and use it to
| search call chain information which we've constructed at project
| import.
|
| This gives the LLM way better starting context and we then
| provide it tools to move around the codebase through normal
| methods you or I would use like go_to_def.
|
| We've analyzed a lot of competitor products, and very few do
| anything more than a rudimentary project skeleton like Aider's
| or just directly feeding opened code as context, which breaks
| down very quickly on large code projects.
|
| We're very happy with the level of quality we see from our
| implementation and it's something that really feels overlooked
| sometimes by various products in this space.
|
| Realistically, the only other product I know of approaching this
| correctly with any degree of search sophistication is Cody from
| SourceGraph which yeah, makes sense.
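|
| To make the idea concrete -- this is a toy illustration in
| Python, nothing like the real engine, and the paths are
| placeholders -- a crude symbol table and call chain can be built
| at project import and exposed to the model as tools:
|
|       import ast, pathlib
|       from collections import defaultdict
|
|       defs, calls = {}, defaultdict(set)
|
|       for path in pathlib.Path("src").rglob("*.py"):  # example root
|           tree = ast.parse(path.read_text())
|           for node in ast.walk(tree):
|               if isinstance(node, ast.FunctionDef):
|                   # remember where each function is defined
|                   defs[node.name] = (str(path), node.lineno)
|                   for sub in ast.walk(node):
|                       if (isinstance(sub, ast.Call)
|                               and isinstance(sub.func, ast.Name)):
|                           calls[node.name].add(sub.func.id)
|
|       def go_to_def(name):
|           # tool the model can call to jump to a definition
|           return defs.get(name)
|
|       def callers_of(name):
|           # walk the call chain backwards to pull in context
|           return [f for f, cs in calls.items() if name in cs]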
| kmoser wrote:
| I wonder how the results would compare to simply prompting it to
| "analyze this as if you were a senior engineer"?
| jappgar wrote:
| I do this all the time. Actually, I tell it that _I_ am a
| senior engineer.
|
| A lot of people tinkering with AI think it's more complex than
| it is. If you ask it ELI5 it will do that.
|
| Often I will say "I already know all that, I'm an experienced
| engineer and need you to think outside the box and troubleshoot
| with me. "
|
| It works great.
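|
| In API terms the whole trick is just a framing line in the
| system message (the wording and example question below are only
| placeholders):
|
|       SYSTEM = ("You are assisting an experienced engineer. Skip "
|                 "beginner explanations; focus on architecture, "
|                 "failure modes, and non-obvious tradeoffs.")
|       question = "Why does this retry loop still drop jobs?"
|       messages = [{"role": "system", "content": SYSTEM},
|                   {"role": "user", "content": question}]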
| Arch-TK wrote:
| I am struggling to teach AI to stop dreaming up APIs which don't
| exist and failing to solve relatively simple but not often
| written about problems.
|
| It's good when it works and crap when it doesn't; for me it
| mostly doesn't work. I think AI working is a good indicator of
| when you're writing code which has been written by lots of other
| people before.
| Terr_ wrote:
| > I think AI working is a good indicator of when you're writing
| code which has been written by lots of other people before.
|
| This is arguably good news for the programming profession,
| because that has a big overlap with cases that could be
| improved by a library or framework, the traditional way we've
| been trying to automate ourselves out of a job for several
| decades now.
| charles_f wrote:
| I wondered if there was a reason behind the ligature between c
| and t across the article (e.g. is it easier to read for people
| with dyslexia).
|
| If like me you didn't know, apparently this is mostly stylistic,
| and comes from a historical practice that predates printing.
| There are other common ligatures such as CT, st, sp and th.
| https://rwt.io/typography-tips/opentype-part-2-leg-ligatures
| patrickhogan1 wrote:
| This is great. More context is better. The only question is:
| after the AI has read your code, why would you have to tell it
| basic things like "this is session middleware"?
| deadbabe wrote:
| This strikes me as basically doing the understanding _for_ the
| LLM and then having it summarize it.
| redleggedfrog wrote:
| It's funny that those are considered Senior Dev attributes. I
| would think you'd better be doing that basic kind of stuff from
| the minute you're writing code for production and future
| maintenance. Otherwise you're making a mess someone else is
| going to have to clean up.
| guerrilla wrote:
| Today I learned I have "senior dev level awareness". This seems
| pretty basic to me, but impressive that the LLM was able to do
| it. On the other hand, this borderline reads like those people
| with their "AI" girlfriends.
| riazrizvi wrote:
| Nice article. The comments are weird as fuck.
| ptx wrote:
| Well, did you check if the AI's claims were correct?
|
| Does PR 1234 actually exist? Did it actually modify the retry
| logic? Does the token refresh logic actually share patterns with
| the notification service? Was the notification service added last
| month? Does it use websockets?
| stevenhuang wrote:
| Related article on how LLMs are force-fed information line by
| line:
|
| https://amistrongeryet.substack.com/p/unhobbling-llms-with-k...
|
| > Our entire world - the way we present information in scientific
| papers, the way we organize the workplace, website layouts,
| software design - is optimized to support human cognition. There
| will be some movement in the direction of making the world more
| accessible to AI. But the big leverage will be in making AI more
| able to interact with the world as it exists.
|
| > We need to interpret LLM accomplishments to date in light of
| the fact that they have been laboring under a handicap. This
| helps explain the famously "jagged" nature of AI capabilities:
| it's not surprising that LLMs struggle with tasks, such as ARC
| puzzles, that don't fit well with a linear thought process. In
| any case, we will probably find ways of removing this handicap
| disambiguation wrote:
| OP, you only took this halfway. We already know LLMs can say
| smart-sounding things while also being wrong and irrelevant. You
| need to manually validate how many out of 100 LLM outputs are
| both correct and significant - and how much it missed! Otherwise
| you might fall into the trap of dealing with too much noise for
| only a little bit of signal. The next step from there is
| comparing it with a human-level signal-to-noise ratio.
| Jimmc414 wrote:
| @namanyayg Thanks for posting this, OP. I created a prompt series
| based on this and so far I like the results. Here is the repo if
| you are interested.
|
| https://github.com/jimmc414/better_code_analysis_prompts_for...
|
| I used this tool to flatten the example repo and PRs into text:
|
| https://github.com/jimmc414/1filellm
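|
| If you'd rather not pull in a tool, the general shape of that
| flattening step (a generic sketch, not how 1filellm actually
| works) is just concatenating files with path headers:
|
|       import pathlib
|
|       def flatten_repo(root, out="repo.txt", exts=(".py", ".md")):
|           # one prompt-sized text blob, each file preceded by its path
|           with open(out, "w") as f:
|               for p in sorted(pathlib.Path(root).rglob("*")):
|                   if p.is_file() and p.suffix in exts:
|                       f.write(f"\n===== {p} =====\n")
|                       f.write(p.read_text(errors="replace"))
|
|       flatten_repo(".")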
___________________________________________________________________
(page generated 2025-01-05 23:01 UTC)