[HN Gopher] The coming knowledge-work supply-chain crisis
       ___________________________________________________________________
        
       The coming knowledge-work supply-chain crisis
        
       Author : Stwerner
       Score  : 81 points
       Date   : 2025-04-27 15:10 UTC (7 hours ago)
        
 (HTM) web link (worksonmymachine.substack.com)
 (TXT) w3m dump (worksonmymachine.substack.com)
        
       | roughly wrote:
        | TFA is right to point out the bottleneck problem for reviewing
        | content - there are a couple of things that compound to make
        | this worse than it should be:
       | 
       | The first is that the LLM outputs are not consistently good or
       | bad - the LLM can put out 9 good MRs before the 10th one has some
       | critical bug or architecture mistake. This means you need to be
       | hypervigilant of everything the LLM produces, and you need to
       | review everything with the kind of care with which you review
       | intern contributions.
       | 
       | The second is that the LLMs don't learn once they're done
       | training, which means I could spend the rest of my life tutoring
       | Claude and it'll still make the exact same mistakes, which means
       | I'll never get a return for that time and hypervigilance like I
       | would with an actual junior engineer.
       | 
       | That problem leads to the final problem, which is that you need a
       | senior engineer to vet the LLM's code, but you don't get to be a
       | senior engineer without being the kind of junior engineer that
       | the LLMs are replacing - there's no way up that ladder except to
       | climb it yourself.
       | 
       | All of this may change in the next few years or the next
       | iteration, but the systems as they are today are a tantalizing
       | glimpse at an interesting future, not the actual present you can
       | build on.
        
         | ryandrake wrote:
         | > The first is that the LLM outputs are not consistently good
         | or bad - the LLM can put out 9 good MRs before the 10th one has
         | some critical bug or architecture mistake. This means you need
         | to be hypervigilant of everything the LLM produces
         | 
         | This, to me, is the critical and fatal flaw that prevents me
         | from using or even being excited about LLMs: That they can be
         | randomly, nondeterministically and confidently wrong, and there
         | is no way to know without manually reviewing every output.
         | 
         | Traditional computer systems whose outputs relied on
         | probability solved this by including a confidence value next to
         | any output. Do any LLMs do this? If not, why can't they? If
         | they could, then the user would just need to pick a threshold
         | that suits their peace of mind and review any outputs that came
         | back below that threshold.
        
           | exe34 wrote:
           | > Do any LLMs do this? If not, why can't they? If they could,
           | then the user would just need to pick a threshold that suits
           | their peace of mind and review any outputs that came back
           | below that threshold.
           | 
           | That's not how they work - they don't have internal models
           | where they are sort of confident that this is a good answer.
           | They have internal models where they are sort of confident
           | that these tokens look like they were human generated in that
            | order. So they can be very confident and still wrong. Knowing
            | that confidence level (log p) would not help you assess
            | correctness.
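            | 
            | To make that concrete, here's a toy sketch (the per-token
            | log-probabilities are invented for illustration): averaging
            | them gives you a fluency score, not a truth score.
            | 
            |     import math
            | 
            |     # Invented per-token log-probs for a decoded answer.
            |     tokens = [("The", -0.1), ("Eiffel", -0.3),
            |               ("Tower", -0.05), ("is", -0.2),
            |               ("in", -0.1), ("Berlin", -0.4)]
            | 
            |     # The model's "confidence": how plausible the token
            |     # string looks, not whether the claim is true.
            |     avg_lp = sum(lp for _, lp in tokens) / len(tokens)
            |     print(f"avg log p = {avg_lp:.3f}, "
            |           f"perplexity = {math.exp(-avg_lp):.2f}")
            | 
            |     # A fluent-but-false sentence can score better than an
            |     # awkward-but-true one, so thresholding on this value
            |     # filters for fluency, not correctness.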
           | 
           | There are probabilistic models where they try to model a
           | posterior distribution for the output - but that has to be
            | trained in, with labelled samples. It's not clear how to do
            | that for LLMs at the scale they require, or affordably.
           | 
           | You could consider letting it run code or try out things in
           | simulations and use those as samples for further tuning, but
           | at the moment, this might still lead them to forget something
           | else or just make some other arbitrary and dumb mistake that
           | they didn't make before the fine tuning.
        
           | bee_rider wrote:
           | What would those probabilities mean in the context of these
           | modern LLMs? They are basically "try to continue the phrase
           | like a human would" bots. I imagine the question of "how good
           | of an approximation is this to something a human might write"
           | could possibly be answerable. But humans often write things
           | which are false.
           | 
           | The entire universe of information consists of human writing,
           | as far as the training process is concerned. Fictional
           | stories and historical documents are equally "true" in that
           | sense, right?
           | 
           | Hmm, maybe somehow one could score outputs based on whether
           | another contradictory output could be written? But it will
            | have to be a little clever. Maybe somehow rank them by how
            | specific they are? Like, a pair of reasonable contradictory
            | sentences about the history-book setting indicates some
            | controversy. But if the two contradictory sentences are one
            | about the history book and one about Narnia, each equally
            | real to the training set, then the fact that they contradict
            | one another is not so interesting.
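            | 
            | Something like that could look roughly like the sketch below
            | (self-consistency as a crude proxy; nli_contradiction_score
            | is a hypothetical helper you'd have to supply, e.g. an NLI
            | classifier):
            | 
            |     from itertools import combinations
            | 
            |     def self_consistency(answers, nli_contradiction_score):
            |         # Sample several answers to the same prompt, then
            |         # measure how often they contradict each other.
            |         pairs = list(combinations(answers, 2))
            |         if not pairs:
            |             return 1.0
            |         conflicts = sum(
            |             nli_contradiction_score(a, b) > 0.5
            |             for a, b in pairs)
            |         return 1.0 - conflicts / len(pairs)
            | 
            |     # A low score flags the prompt for review, but high
            |     # agreement still doesn't mean truth: Narnia answers
            |     # agree with each other too.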
        
             | sepositus wrote:
             | > But humans often write things which are false.
             | 
             | Not to mention, humans say things that make sense for
             | humans to say and not a machine. For example, one recent
             | case I saw was where the LLM hallucinated having a Macbook
             | available that it was using to answer a question. In the
             | context of a human, it was a totally viable response, but
             | was total nonsense coming from an LLM.
        
           | ToucanLoucan wrote:
           | > Do any LLMs do this? If not, why can't they?
           | 
           | Because they aren't knowledgeable. The marketing and at-
           | first-blush impressions that LLMs leave as some kind of
           | actual being, no matter how limited, mask this fact and it's
           | the most frustrating thing about trying to evaluate this tech
           | as useful or not.
           | 
            | To make an incredibly complex topic somewhat simple, an LLM
            | trains on a series of materials - in this case we'll talk
            | about words. It learns that "it turns out," "in the case of",
            | and "however, there is" are all phrases that naturally follow
            | one another in writing, but it has no clue _why_ one would
            | choose one over the other beyond the other words which form
            | the contexts in which those word series appear. This process
            | is repeated billions of times as it analyzes the structure of
            | billions of written words until it arrives at a massive
            | statistical model of how likely it is that every word will be
            | followed by every other word or punctuation mark.
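            | 
            | A toy sketch of that idea (a tiny bigram counter, nowhere
            | near a real transformer, just to show the shape of "most
            | likely next word"):
            | 
            |     import random
            |     from collections import Counter, defaultdict
            | 
            |     corpus = ("measure twice cut once . "
            |               "measure twice check once").split()
            | 
            |     # Count which word follows which.
            |     following = defaultdict(Counter)
            |     for prev, nxt in zip(corpus, corpus[1:]):
            |         following[prev][nxt] += 1
            | 
            |     def next_word(prev):
            |         # Pick the next word in proportion to how often it
            |         # followed 'prev' in the corpus - no notion of "why".
            |         counts = following[prev]
            |         return random.choices(list(counts),
            |                               list(counts.values()))[0]
            | 
            |     print(next_word("measure"))  # "twice", per the data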
           | 
           | Having all that data available does mean an LLM can
           | generate... words. Words that are pretty consistently spelled
           | and arranged correctly in a way that reflects the language
           | they belong to. And, thanks to the documents it trained on,
           | it gains what you could, if you're feeling generous, call a
           | "base of knowledge" on a variety of subjects, in that by the
           | same statistical model, it has "learned" that "measure twice,
           | cut once" is said often enough that it's likely good advice,
           | but again, it doesn't know _why that is,_ which would be: it
           | optimizes your cuts and avoids wasting materials when
           | building something to measure it, mark it, then measure it a
           | second or even third time to make sure it was done correctly
           | before you do the cut, which an operation that cannot be
           | reversed.
           | 
            | However that knowledge has a HARD limit in terms of what was
            | understood within its training data. For example, way back,
            | a GPT model recommended using Elmer's glue to keep pizza
           | toppings attached when making a pizza. No sane person would
           | suggest this, because glue... isn't food. But the LLM doesn't
           | understand that, it takes the question: how do I keep
           | toppings on pizza, and it says, well a ton of things I read
           | said you should use glue to stick things together, and ships
           | that answer out.
           | 
           | This is why I firmly believe LLMs and true AI are just... not
           | the same thing, at all, and I'm annoyed that we now call LLMs
           | AI and AI AGI, because in my mind, LLMs do not demonstrate
           | any intelligence at all.
        
             | smokel wrote:
             | This explanation is only superficially correct, and there
             | is more to it than simply predicting the next word.
             | 
             | It is the _way_ in which the prediction works, that leads
             | to some form of intelligence.
        
             | ryoshu wrote:
             | The glue on pizza thing was a bit more pernicious because
             | of how the model came to that conclusion: SERPs. Google's
             | LLM pulled the top result for that query from Reddit and
             | didn't understand that the Reddit post was a joke. It took
             | it as the most relevant thing and hilarity ensued.
             | 
             | In that case the error was obvious, but these things become
             | "dangerous" for that sort of use case when end users trust
             | the "AI result" as the "truth".
        
               | ryandrake wrote:
               | Treating "highest ranked," "most upvoted," "most
               | popular," and "frequently cited" as a signal of quality
               | or authoritativeness has proven to be a persistent
               | problem for decades.
        
           | giantrobot wrote:
           | > That they can be randomly, nondeterministically and
           | confidently wrong, and there is no way to know without
           | manually reviewing every output.
           | 
            | This is exactly my issue with LLMs and it's routinely
            | ignored by LLM evangelists/hypesters. It's not necessarily
            | about being _wrong_; it's the non-deterministic nature of the
            | errors. They're not only non-deterministic but unevenly
            | distributed. So you can't predict errors and need expertise
            | to review all the generated content looking for errors.
           | 
           | There's also not necessarily an obvious mapping between input
           | tokens and an output since the output depends on the whole
           | context window. An LLM might never tell _you_ to put glue on
           | pizza because your context window has some set of tokens that
           | will exclude that output while it will tell me to do so
            | because my context window doesn't. So there's not even
           | necessarily determinism or consistency between
           | sessions/users.
           | 
            | I understand the existence of Gell-Mann amnesia, so when I
            | see an LLM give confident but subtly wrong answers about a
            | Python library, I don't then assume I won't also get
            | confident yet subtly wrong answers about the Parisian Metro
            | or elephants.
        
             | gopher_space wrote:
             | The prompts we're using seem like they'd generate the same
             | forced confidence from a junior. If everything's a top-down
             | order, and your personal identity is on the line if I'm not
             | "happy" with the results, then you're going to tell me what
             | I want to hear.
        
               | giantrobot wrote:
                | There are some important differences between junior
                | developers and LLMs. For one, a human developer can
                | likely learn from a mistake and internalize a correction.
                | They might make the mistake once or twice but the
                | occurrences will decrease as they get experience and
                | feedback.
               | 
               | LLMs as currently deployed don't do the same. They'll
               | happily make the same mistake consistently if a mistake
               | is popular in the training corpus. You need to waste
               | context space telling them to avoid the error
               | until/unless the model is updated.
               | 
               | It's entirely possible for good mentors to make junior
               | developers (or any junior position) feel comfortable
               | being realistic in their confidence levels for an answer.
               | It's ok for a junior person to admit they don't know an
               | answer. A mentor requiring a mentee to know everything
               | and never admit fault or ignorance is a bad mentor.
                | That's encouraging thought-terminating behavior and helps
               | neither person.
               | 
               | It's much more difficult to alter system prompts or get
               | LLMs to even admit when they're stumped. They don't have
               | meaningful ways to even gauge their own confidence in
               | their output. Their weights are based on occurrences in
               | training data rather than correctness of the training
                | data. Even with RL, the weight adjustments are only as
                | good as the determinism of the output for a given input,
                | which is not great for several reasons.
        
             | furyofantares wrote:
             | This is a nitpick because I think your complaints are all
             | totally valid, except that I think blaming non-determinism
             | isn't quite right. The models are in fact deterministic.
              | But that's just a technicality; in a practical sense they are
             | non-deterministic in that a human can't determine what
             | it'll produce without running it, and even then it can be
             | sensitive to changes in context window like you said, so
             | even after running it once you don't know you'll get a
             | similar output from similar inputs.
             | 
             | I only post this because I find it kind of interesting; I
             | balked at blaming non-determinism because it technically
             | isn't, but came to conclude that practically speaking
             | that's the right thing to blame, although maybe there's a
             | better word that I don't know.
        
               | ryandrake wrote:
               | > from a practical sense they are non-deterministic in
               | that a human can't determine what it'll produce without
               | running it
               | 
               | But this is also true for programs that are deliberately
               | random. If you program a computer to output a list of
               | random (not pseudo-random) numbers between 0 and 100,
               | then you cannot determine ahead of time what the output
               | will be.
               | 
               | The difference is, you at least know the range of values
               | that it will give you and the distribution, and if
               | programmed correctly, the random number generator will
                | consistently give you numbers in that range with the
               | expected probability distribution.
               | 
               | In contrast, an LLM's answer to "List random numbers
               | between 0 and 100" usually will result in what you
               | expect, or (with a nonzero probability) it might just up
               | and decide to include numbers outside of that range, or
               | (with a nonzero probability) it might decide to list
               | animals instead of numbers. There's no way to know for
                | sure, and you can't prove from the code that it _won't_
               | happen.
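                | 
                | A minimal illustration of that contract:
                | 
                |     import random
                | 
                |     # Unpredictable values, but a provable contract:
                |     # the range (and uniform distribution) is
                |     # guaranteed by the API.
                |     samples = [random.randint(0, 100)
                |                for _ in range(10_000)]
                |     assert all(0 <= n <= 100 for n in samples)
                | 
                |     # No analogous assertion exists for "ask an LLM
                |     # for random numbers between 0 and 100": its
                |     # output space is every possible token sequence,
                |     # so violations can only be caught by inspecting
                |     # each response after the fact.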
        
               | giantrobot wrote:
                | At the base level, LLMs aren't actually deterministic
               | because the model weights are typically floats of limited
               | precision. At a large enough scale (enough parameters,
               | model size, etc) you _will_ run into rounding issues that
               | effectively behave randomly and alter output.
               | 
                | Even with a temperature of zero, floating point rounding,
               | probability ties, MoE routing, and other factors make
               | outputs not fully deterministic even between multiple
               | runs with identical contexts/prompts.
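                | 
                | A tiny example of the floating-point side of this
                | (pure Python, no model needed): addition isn't
                | associative, so the order a parallel reduction sums
                | scores in can flip a near-tie.
                | 
                |     a, b, c = 0.1, 0.2, 0.3
                |     print((a + b) + c == a + (b + c))  # False
                |     print((a + b) + c)  # 0.6000000000000001
                |     print(a + (b + c))  # 0.6
                | 
                |     # With greedy (temperature 0) decoding, two tokens
                |     # whose scores differ by less than this rounding
                |     # noise can swap winners between runs, and every
                |     # later token then conditions on a different
                |     # prefix.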
               | 
               | In theory you could construct a fully deterministic LLM
               | but I don't think any are deployed in practice. Because
                | there are so many places where behavior is _effectively_
                | non-deterministic, the system itself can't be thought of
               | as deterministic.
               | 
               | Errors might be completely innocuous like one token
               | substituted for another with the same semantic meaning.
               | An error might also completely change the semantic
               | meaning of the output with only a single token change
               | like an "un-" prefix added to a word.
               | 
               | The non-determinism is both technically and practically
               | true in practice.
        
           | ikiris wrote:
           | I don't understand these arguments at all. Do you currently
           | not do code reviews at all, and just commit everything
           | directly to repo? do your coworkers?
           | 
           | If this is the case, I can't take your company at all
           | seriously. And if it isn't, then why is reviewing the output
            | of an LLM somehow more burdensome than having to write things
           | yourself?
        
           | Aurornis wrote:
           | > This, to me, is the critical and fatal flaw that prevents
           | me from using or even being excited about LLMs: That they can
           | be randomly, nondeterministically and confidently wrong, and
           | there is no way to know without manually reviewing every
           | output.
           | 
           | Sounds a lot like most engineers I've ever worked with.
           | 
           | There are a lot of people utilizing LLMs wisely because they
           | know and embrace this. Reviewing and understanding their
           | output has always been the game. The whole "vibe coding"
           | trend where you send the LLM off to do something and hope for
           | the best will teach anyone this lesson very quickly if they
           | try it.
        
             | agentultra wrote:
             | Most engineers you worked with probably cared about getting
             | it right and improving their skills.
        
           | nkrisc wrote:
           | How would a meaningful confidence value be calculated with
           | respect to the output of an LLM? What is "correct" LLM
           | output?
        
             | Kinrany wrote:
             | It can be the probability of the response being accepted by
             | the prompter
        
               | nkrisc wrote:
               | So unique to each prompter, refined over time?
        
           | rustcleaner wrote:
           | >That they can be randomly, nondeterministically and
           | confidently wrong, and there is no way to know without
           | manually reviewing every output.
           | 
           | I think I can confidently assert that this applies to you and
            | me as well.
        
             | ryandrake wrote:
             | I choose a computer to do a task because I expect it to be
             | much more accurate, precise, and deterministic than a
             | human.
        
         | FeepingCreature wrote:
         | > The second is that the LLMs don't learn once they're done
         | training, which means I could spend the rest of my life
         | tutoring Claude and it'll still make the exact same mistakes,
         | which means I'll never get a return for that time and
         | hypervigilance like I would with an actual junior engineer.
         | 
         | However, this creates a significant return on investment for
          | open-sourcing your LLM projects. In fact, you should commit
          | your LLM dialogs along with your code. The LLM won't learn
          | _immediately_, but it will learn in a few months when the next
         | refresh comes out.
        
           | samjewell wrote:
           | > In fact, you should commit your LLM dialogs along with your
           | code.
           | 
           | Wholeheartedly agree with this.
           | 
           | I think code review will evolve from "Review this code" to
           | "Review this prompt that was used to generate some code"
        
         | devnull3 wrote:
         | > hypervigilant
         | 
         | If a tech works 80% of the time, then I know that I need to be
         | vigilant and I will review the output. The entire team
         | structure is aware of this. There will be processes to offset
         | this 20%.
         | 
         | The problem is that when the AI becomes > 95% accurate (if at
         | all) then humans will become complacent and the checks and
         | balances will be ineffective.
        
       | kaycebasques wrote:
       | This section heading from the post captures the key insight, is
       | more focused, and is less hyperbolic:
       | 
       | > Redesigning for Decision Velocity
        
       | nthingtohide wrote:
       | > He argues this type of value judgement is something AI
       | fundamentally cannot do, as it can only pattern match against
       | existing decisions, not create new frameworks for assigning
       | worth.
       | 
        | Counterpoint: That decision has to be made only once (probably
        | by some expert). AI can incorporate that training data into its
       | reasoning and voila, it becomes available to everyone. A software
       | framework is already a collection of good decisions, practices
       | and tastes made by experts.
       | 
       | > An MIT study found materials scientists experienced a 44% drop
       | in job satisfaction when AI automated 57% of their "idea-
       | generation" tasks
       | 
        | Counterpoint: Now consider making materials science decisions
        | which require materials to have not just 3 properties but 10 or
       | 15.
       | 
       | > Redesigning for Decision Velocity
       | 
        | Suggestion: I think this section implies we must ask our experts
       | to externalize all their tastes, preferences, top-down thinking
       | so that other juniors can internalize those. So experts will be
       | teaching details (based on their internal model) to LLMs while
       | teaching the model itself to humans.
        
       | Animats wrote:
       | > This pile of tasks is how I understand what Vaughn Tan refers
       | to as Meaningmaking: the uniquely human ability to make
       | subjective decisions about the relative value of things.
       | 
       | Why is that a "uniquely human ability"? Machine learning systems
       | are good at scoring things against some criterion. That's mostly
       | how they work.
        
         | atomicnumber3 wrote:
          | How are the criteria chosen, though?
         | 
          | Something I learned from working alongside data scientists and
          | financial analysts doing algo trading is that you can almost
          | always find great fits for your criteria; nobody ever worries
          | about that. It's coming up with the criteria that everyone
          | frets over, and even more than that, you need to _beat other
          | people_ at doing so - just being good or even great isn't
          | enough. Your profit is the delta between where you are and
          | where all the other sharks in your pool are. So LLMs are
          | useless there: getting token-predicted answers is just going to
          | get you the same as everyone else, which means zero alpha.
         | 
         | So - I dunno about uniquely human? But there's definitely
         | something here where, short of AGI, there's always going to
         | need to be someone sitting down and actually beating the market
         | (whatever that metaphor means for your industry or use case).
        
           | fwip wrote:
           | Finance is sort of a unique beast in that the field is
           | inherently negative-sum. The profits you take home are always
           | going to be profits somebody else isn't getting.
           | 
           | If you're doing like, real work, solving problems in your
           | domain actually adds value, and so the profits you get are
           | from the value you provide.
        
             | kaashif wrote:
             | If you're algo trading then yes, which is what the person
             | you're replying to is talking about.
             | 
             | But "finance" is very broad and covers very real and
             | valuable work like making loans and insurance - be careful
             | not to be too broad in your condemnation.
        
         | rukuu001 wrote:
         | I think this is challenging because there's a lot of tacit
         | knowledge involved, and feedback loops are long and measurement
         | of success ambiguous.
         | 
          | It's a very rubbery, human-oriented activity.
         | 
         | I'm sure this will be solved, but it won't be solved by
         | noodling with prompts and automation tools - the humans will
         | have to organise themselves to externalise expert knowledge and
         | develop an objective framework for making 'subjective decisions
         | about the relative value of things'.
        
       | jasonthorsness wrote:
       | The method of producing the work can be more important (and
       | easier to review) than the work output itself. Like at the
       | simplest level of a global search-replace of a function name that
       | alters 5000 lines. At a complex level, you can trust a team of
       | humans to do something without micro-managing every aspect of
        | their work. My hope is that the current crisis of reviewing too
        | much AI-generated output will subside into the kind of trust you
        | extend to a team, because the LLM will have reached a high level
        | of "judgement" and competence. But we're definitely not there
        | yet.
       | 
       | And contrary to the article, idea-generation with LLM support can
       | be fun! They must have tested full replacement or something.
        
         | wffurr wrote:
         | >> At a complex level, you can trust a team of humans to do
         | something without micro-managing every aspect of their work
         | 
         | I see you have never managed an outsourced project run by a
         | body shop consultancy. They check the boxes you give them with
         | zero thought or regard to the overall project and require
         | significant micro managing to produce usable code.
        
           | jdlshore wrote:
           | I find this sort of whataboutism in LLM discussions tiring.
           | Yes, _of course,_ there are teams of humans that perform
            | worse than an LLM. But it is obvious to all but the most hype-
           | blinded booster that it is possible for teams of humans to
           | work autonomously to produce good results, because that is
           | how all software has been produced to the present day, and
           | some of it is good.
        
       | timewizard wrote:
       | > Remember the first time an autocomplete suggestion nailed
       | exactly what you meant to type?
       | 
       | No.
       | 
       | > Multiply that by a thousand and aim it at every task you once
       | called "work."
       | 
       | If you mean "menial labor" then sure. The "work" I do is not at
       | all aided by LLMs.
       | 
       | > but our decision-making tools and rituals remain stuck in the
       | past.
       | 
       | That's because LLMs haven't eliminated or even significantly
       | reduced risk. In fact they've created an entirely new category of
       | risk in "hallucinations."
       | 
       | > we need to rethink the entire production-to-judgment pipeline.
       | 
       | Attempting to do this without accounting for risk or how capital
       | is allocated into processes will lead you into folly.
       | 
       | > We must reimagine knowledge work as a high-velocity decision-
       | making operation rather than a creative production process.
       | 
       | Then you will invent nothing new or novel and will be relegated
       | to scraping by on the overpriced annotated databases of your
       | direct competitors. The walled garden just raised the stakes. I
       | can't believe people see a future in it.
        
       | shawn-butler wrote:
       | This really isn't true in principle. The current LLM ecosystems
       | can't do "meaning tasks" but there are all kinds of "legacy" AI
       | expert systems that do exactly what is required.
       | 
        | My experience is that middle-manager gatekeepers are the most
        | reluctant to participate in building knowledge systems that
        | obsolete them, though.
        
       | bendigedig wrote:
       | Validating the outputs of a stochastic parrot sounds like a very
       | alienating job.
        
         | FeepingCreature wrote:
         | It's actually very fun, ime.
        
           | bendigedig wrote:
           | I have plenty of experience doing code reviews and to do a
           | good job is pretty hard and thankless work. If I had to do
           | that all day every day I'd be very unhappy.
        
         | darth_avocado wrote:
         | As a staff engineer, it upsets me if my Review to Code ratio
          | goes above 1. On days when I am not able to focus and code
          | because I was reviewing other people's work all day, I usually
          | end up pretty drained but also unsatisfied. If the only job
         | available to engineers becomes "review 50 PRs a day, everyday"
         | I'll probably quit software engineering altogether.
        
           | kmijyiyxfbklao wrote:
           | > As a staff engineer, it upsets me if my Review to Code
           | ratio goes above 1.
           | 
           | How does this work? Do you allow merging without reviews? Or
           | are other engineers reviewing code way more than you?
        
         | PaulRobinson wrote:
         | Most knowledge work - perhaps all of it - is already validating
          | the output of stochastic parrots; we just call those stochastic
          | parrots "management".
        
       | xg15 wrote:
       | The intro sentence to this is quite funny.
       | 
       | > _Remember the first time an autocomplete suggestion nailed
       | exactly what you meant to type?_
       | 
       | I actually don't, because so far this only happened with trivial
       | phrases or text I had already typed in the past. I do remember
       | however dozens of times where autocorrect wrongly "corrected" the
       | last word I typed, changing an easy to spot typo into a much more
       | subtle semantic error.
        
         | thechao wrote:
         | I see these sorts of statements from coders who, you know,
         | aren't good programmers in the first place. Here's the secret
         | that I that I think LLM's are uncovering: I think there's a
         | _lot_ of really shoddy coders out there; coders who could could
         | /would never become good programmers and _they_ are absolutely
         | going to be replaced with LLMs.
         | 
         | I don't know how I feel about that. I suspect it's not going to
          | be great for society. Replacing blue-collar workers with robots
         | hasn't been super duper great.
        
       | delusional wrote:
        | If the AIs learned from us, they'll only be able to produce Coca
        | Cola and ads, so the entirety of the actually valuable economy is
       | safe.
        
       | eezurr wrote:
       | And once the Orient and Decide part is augmented, then we'll be
       | limited by social networks (IRL ones). Every solo founder/small
       | biz will have to compete more and more for marketing eyeballs,
        | and the ones with access to bigger engines (companies) will get
        | the juice they need, and we come back to humans being
       | the bottlenecks again.
       | 
       | That is, until we mutually decide on removing our agency from the
        | loop entirely. And then what?
        
       | zkmon wrote:
       | > Ultimately, I don't see AI completely replacing knowledge
       | workers any time soon.
       | 
       | How was that conclusion reached? And what is meant by knowledge
       | workers? Any work with knowledge is exactly the domain of LLMs.
       | So, LLMs are indeed knowledge workers.
        
       ___________________________________________________________________
       (page generated 2025-04-27 23:00 UTC)