[HN Gopher] The coming knowledge-work supply-chain crisis
___________________________________________________________________
The coming knowledge-work supply-chain crisis
Author : Stwerner
Score : 81 points
Date : 2025-04-27 15:10 UTC (7 hours ago)
(HTM) web link (worksonmymachine.substack.com)
(TXT) w3m dump (worksonmymachine.substack.com)
| roughly wrote:
| TFA is right to point out the bottleneck problem for reviewing
| content - there are a couple of things that compound to make
| this worse than it should be:
|
| The first is that the LLM outputs are not consistently good or
| bad - the LLM can put out 9 good MRs before the 10th one has some
| critical bug or architecture mistake. This means you need to be
| hypervigilant of everything the LLM produces, and you need to
| review everything with the kind of care with which you review
| intern contributions.
|
| The second is that the LLMs don't learn once they're done
| training, which means I could spend the rest of my life tutoring
| Claude and it'll still make the exact same mistakes, which means
| I'll never get a return for that time and hypervigilance like I
| would with an actual junior engineer.
|
| That problem leads to the final problem, which is that you need a
| senior engineer to vet the LLM's code, but you don't get to be a
| senior engineer without being the kind of junior engineer that
| the LLMs are replacing - there's no way up that ladder except to
| climb it yourself.
|
| All of this may change in the next few years or the next
| iteration, but the systems as they are today are a tantalizing
| glimpse at an interesting future, not the actual present you can
| build on.
| ryandrake wrote:
| > The first is that the LLM outputs are not consistently good
| or bad - the LLM can put out 9 good MRs before the 10th one has
| some critical bug or architecture mistake. This means you need
| to be hypervigilant of everything the LLM produces
|
| This, to me, is the critical and fatal flaw that prevents me
| from using or even being excited about LLMs: That they can be
| randomly, nondeterministically and confidently wrong, and there
| is no way to know without manually reviewing every output.
|
| Traditional computer systems whose outputs relied on
| probability solved this by including a confidence value next to
| any output. Do any LLMs do this? If not, why can't they? If
| they could, then the user would just need to pick a threshold
| that suits their peace of mind and review any outputs that came
| back below that threshold.
| exe34 wrote:
| > Do any LLMs do this? If not, why can't they? If they could,
| then the user would just need to pick a threshold that suits
| their peace of mind and review any outputs that came back
| below that threshold.
|
| That's not how they work - they don't have internal models
| where they are sort of confident that this is a good answer.
| They have internal models where they are sort of confident
| that these tokens look like they were human generated in that
| order. So they can be very confident and still wrong. Knowing
| that confidence level (log p) would not help you assess
| correctness.
|
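| As a rough sketch of what that confidence actually is (assuming
| the Hugging Face transformers library and GPT-2 purely for
| illustration; any causal LM works the same way): you can pull
| per-token probabilities out of the model, but they only score
| how plausible the word sequence is, not whether the claim is
| true.
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     tok = AutoTokenizer.from_pretrained("gpt2")
|     model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|     # Fluent, confidently wrong.
|     text = "The capital of Australia is Sydney."
|     ids = tok(text, return_tensors="pt").input_ids
|
|     with torch.no_grad():
|         logits = model(ids).logits          # [1, seq_len, vocab]
|
|     # log P(token_i | earlier tokens): shift logits against targets.
|     logp = torch.log_softmax(logits[:, :-1], dim=-1)
|     token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
|
|     # A high average log-probability means "reads like typical
|     # text", not "is correct".
|     print("avg log p per token:", token_logp.mean().item())
|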
| There are probabilistic models where they try to model a
| posterior distribution for the output - but that has to be
| trained in, with labelled samples. It's not clear how to do
| that for LLMs at the scale they require, or how to do it
| affordably.
|
| You could consider letting it run code or try out things in
| simulations and use those as samples for further tuning, but
| at the moment, this might still lead them to forget something
| else or just make some other arbitrary and dumb mistake that
| they didn't make before the fine tuning.
| bee_rider wrote:
| What would those probabilities mean in the context of these
| modern LLMs? They are basically "try to continue the phrase
| like a human would" bots. I imagine the question of "how good
| of an approximation is this to something a human might write"
| could possibly be answerable. But humans often write things
| which are false.
|
| The entire universe of information consists of human writing,
| as far as the training process is concerned. Fictional
| stories and historical documents are equally "true" in that
| sense, right?
|
| Hmm, maybe somehow one could score outputs based on whether
| another contradictory output could be written? But it would
| have to be a little clever. Maybe somehow rank them by how
| specific they are? A pair of reasonable contradictory sentences
| about the history-book setting indicates some controversy. A
| pair of contradictory sentences, one about the history book and
| one about Narnia, are each equally real to the training set,
| but the fact that they contradict one another is not so
| interesting.
| sepositus wrote:
| > But humans often write things which are false.
|
| Not to mention, humans say things that make sense for
| humans to say and not for a machine. For example, one recent
| case I saw was where the LLM hallucinated having a MacBook
| available that it was using to answer a question. In the
| context of a human, it was a totally viable response, but it
| was total nonsense coming from an LLM.
| ToucanLoucan wrote:
| > Do any LLMs do this? If not, why can't they?
|
| Because they aren't knowledgeable. The marketing, and the at-
| first-blush impression that LLMs are some kind of actual
| being, however limited, mask this fact, and that's the most
| frustrating thing about trying to evaluate whether this tech
| is useful or not.
|
| To make an incredibly complex topic somewhat simple: LLMs train
| on a series of materials, in this case words. A model learns
| that "it turns out," "in the case of," and "however, there is"
| are all phrases that naturally follow one another in writing,
| but it has no clue _why_ one would be chosen over another
| beyond the other words that form the contexts in which those
| word series appear. This process is repeated billions of times
| as it analyzes the structure of billions of written words,
| until it arrives at a massive statistical model of how likely
| it is that every word will be followed by every other word or
| punctuation mark.
|
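| For a deliberately tiny sketch of that idea, here's a bigram
| counter (real models learn a neural approximation over subword
| tokens rather than a literal count table, but the point stands:
| it captures which words tend to follow which, not why):
|
|     import random
|     from collections import Counter, defaultdict
|
|     corpus = "measure twice cut once . measure twice check twice".split()
|
|     # Count which word follows which.
|     following = defaultdict(Counter)
|     for prev, nxt in zip(corpus, corpus[1:]):
|         following[prev][nxt] += 1
|
|     def next_word(prev):
|         words, weights = zip(*following[prev].items())
|         return random.choices(words, weights=weights)[0]
|
|     # Almost always prints "twice" -- with no idea why the
|     # advice is good.
|     print(next_word("measure"))
|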
| Having all that data available does mean an LLM can
| generate... words. Words that are pretty consistently spelled
| and arranged correctly in a way that reflects the language they
| belong to. And, thanks to the documents it trained on, it gains
| what you could, if you're feeling generous, call a "base of
| knowledge" on a variety of subjects: by the same statistical
| process, it has "learned" that "measure twice, cut once" is
| said often enough that it's likely good advice. But again, it
| doesn't know _why that is_, which would be: measuring, marking,
| and re-measuring before you cut optimizes your cuts and avoids
| wasting material, because cutting is an operation that cannot
| be reversed.
|
| However, that knowledge has a HARD limit: what was represented
| in its training data. For example, way back, a model
| recommended using Elmer's glue to keep pizza toppings attached
| when making a pizza. No sane person would suggest this, because
| glue... isn't food. But the LLM doesn't understand that. It
| takes the question "how do I keep toppings on pizza," says
| "well, a ton of things I read said you should use glue to
| stick things together," and ships that answer out.
|
| This is why I firmly believe LLMs and true AI are just... not
| the same thing, at all, and I'm annoyed that we now call LLMs
| AI and AI AGI, because in my mind, LLMs do not demonstrate
| any intelligence at all.
| smokel wrote:
| This explanation is only superficially correct, and there
| is more to it than simply predicting the next word.
|
| It is the _way_ in which the prediction works that leads to
| some form of intelligence.
| ryoshu wrote:
| The glue on pizza thing was a bit more pernicious because
| of how the model came to that conclusion: SERPs. Google's
| LLM pulled the top result for that query from Reddit and
| didn't understand that the Reddit post was a joke. It took
| it as the most relevant thing and hilarity ensued.
|
| In that case the error was obvious, but these things become
| "dangerous" for that sort of use case when end users trust
| the "AI result" as the "truth".
| ryandrake wrote:
| Treating "highest ranked," "most upvoted," "most
| popular," and "frequently cited" as a signal of quality
| or authoritativeness has proven to be a persistent
| problem for decades.
| giantrobot wrote:
| > That they can be randomly, nondeterministically and
| confidently wrong, and there is no way to know without
| manually reviewing every output.
|
| This is my exact same issue with LLMs and it's routinely
| ignored by LLM evangelists/hypesters. It's not necessarily
| about being _wrong_; it's the non-deterministic nature of the
| errors. They're not only non-deterministic but unevenly
| distributed. So you can't predict errors and need expertise
| to review all the generated content looking for errors.
|
| There's also not necessarily an obvious mapping between input
| tokens and an output since the output depends on the whole
| context window. An LLM might never tell _you_ to put glue on
| pizza because your context window has some set of tokens that
| will exclude that output, while it will tell me to do so
| because my context window doesn't. So there's not even
| necessarily determinism or consistency between
| sessions/users.
|
| I understand the existence of Gell-Mann amnesia so when I see
| an LLM give confident but subtly wrong answers about a Python
| library I don't then assume I won't also get confident yet
| subtly wrong answers about the Parisian Metro or elephants.
| gopher_space wrote:
| The prompts we're using seem like they'd generate the same
| forced confidence from a junior. If everything's a top-down
| order, and your personal identity is on the line if I'm not
| "happy" with the results, then you're going to tell me what
| I want to hear.
| giantrobot wrote:
| There are some differences between junior developers and
| LLMs that are important. For one, a human developer can
| likely learn from a mistake and internalize a correction.
| They might make the mistake once or twice but the
| occurrences will decrease as they get experience and
| feedback.
|
| LLMs as currently deployed don't do the same. They'll
| happily make the same mistake consistently if a mistake
| is popular in the training corpus. You need to waste
| context space telling them to avoid the error
| until/unless the model is updated.
|
| It's entirely possible for good mentors to make junior
| developers (or any junior position) feel comfortable
| being realistic in their confidence levels for an answer.
| It's ok for a junior person to admit they don't know an
| answer. A mentor requiring a mentee to know everything
| and never admit fault or ignorance is a bad mentor.
| That's encouraging thought-terminating behavior and helps
| neither person.
|
| It's much more difficult to alter system prompts or get
| LLMs to even admit when they're stumped. They don't have
| meaningful ways to even gauge their own confidence in
| their output. Their weights are based on occurrences in
| training data rather than correctness of the training
| data. Even with RL the weight adjustments are only as
| good as the determinism of the output for the input which
| is not great for several reasons.
| furyofantares wrote:
| This is a nitpick because I think your complaints are all
| totally valid, except that I think blaming non-determinism
| isn't quite right. The models are in fact deterministic.
| But that's just technical; from a practical sense they are
| non-deterministic in that a human can't determine what
| it'll produce without running it, and even then it can be
| sensitive to changes in context window like you said, so
| even after running it once you don't know you'll get a
| similar output from similar inputs.
|
| I only post this because I find it kind of interesting; I
| balked at blaming non-determinism because it technically
| isn't, but came to conclude that practically speaking
| that's the right thing to blame, although maybe there's a
| better word that I don't know.
| ryandrake wrote:
| > from a practical sense they are non-deterministic in
| that a human can't determine what it'll produce without
| running it
|
| But this is also true for programs that are deliberately
| random. If you program a computer to output a list of
| random (not pseudo-random) numbers between 0 and 100,
| then you cannot determine ahead of time what the output
| will be.
|
| The difference is, you at least know the range of values
| that it will give you and the distribution, and if
| programmed correctly, the random number generator will
| consistently give you numbers in that range with the
| expected probability distribution.
|
| In contrast, an LLM's answer to "List random numbers
| between 0 and 100" usually will result in what you
| expect, or (with a nonzero probability) it might just up
| and decide to include numbers outside of that range, or
| (with a nonzero probability) it might decide to list
| animals instead of numbers. There's no way to know for
| sure, and you can't prove from the code that it _won't_
| happen.
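|
| A minimal illustration of that contrast (the model side lives
| only in a comment, because there is nothing you can assert up
| front about it):
|
|     import random
|
|     # A conventional RNG gives guarantees you can state and test.
|     samples = [random.randint(0, 100) for _ in range(10_000)]
|     assert all(0 <= n <= 100 for n in samples)  # holds by construction
|
|     # There is no equivalent assertion for the text that comes
|     # back from prompting a model with "List random numbers
|     # between 0 and 100" -- the output space is every possible
|     # token sequence, so the "range" can only be checked after
|     # the fact.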
| giantrobot wrote:
| At the base level, LLMs aren't actually deterministic because
| the model weights are typically floats of limited precision.
| At a large enough scale (enough parameters, model size, etc.)
| you _will_ run into rounding issues that effectively behave
| randomly and alter output.
|
| Even with a temperature of zero, floating point rounding,
| probability ties, MoE routing, and other factors make outputs
| not fully deterministic, even between multiple runs with
| identical contexts/prompts.
|
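| A tiny illustration of the rounding point, in plain Python
| floats (the same IEEE-754 behavior applies to the lower-
| precision math inside a model):
|
|     # Floating point addition is not associative, so the order
|     # in which a sum is reduced (which can vary across parallel
|     # GPU kernels) changes the last bits of the result:
|     a = [0.1] * 10
|     print(sum(a))                                   # 0.9999999999999999
|     print(sum(a) == 1.0)                            # False
|     print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
|
|     # If two candidate tokens have near-identical probabilities,
|     # a last-bit difference like this can flip which one wins,
|     # even at temperature zero.
|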
| In theory you could construct a fully deterministic LLM
| but I don't think any are deployed in practice. Because there
| are so many places where behavior is _effectively_
| non-deterministic, the system itself can't be thought of as
| deterministic.
|
| Errors might be completely innocuous like one token
| substituted for another with the same semantic meaning.
| An error might also completely change the semantic
| meaning of the output with only a single token change
| like an "un-" prefix added to a word.
|
| The non-determinism is both technically and practically
| true in practice.
| ikiris wrote:
| I don't understand these arguments at all. Do you currently
| not do code reviews at all, and just commit everything
| directly to repo? do your coworkers?
|
| If this is the case, I can't take your company at all
| seriously. And if it isn't, then why is reviewing the output
| of an LLM somehow more burdensome than having to write things
| yourself?
| Aurornis wrote:
| > This, to me, is the critical and fatal flaw that prevents
| me from using or even being excited about LLMs: That they can
| be randomly, nondeterministically and confidently wrong, and
| there is no way to know without manually reviewing every
| output.
|
| Sounds a lot like most engineers I've ever worked with.
|
| There are a lot of people utilizing LLMs wisely because they
| know and embrace this. Reviewing and understanding their
| output has always been the game. The whole "vibe coding"
| trend where you send the LLM off to do something and hope for
| the best will teach anyone this lesson very quickly if they
| try it.
| agentultra wrote:
| Most engineers you worked with probably cared about getting
| it right and improving their skills.
| nkrisc wrote:
| How would a meaningful confidence value be calculated with
| respect to the output of an LLM? What is "correct" LLM
| output?
| Kinrany wrote:
| It could be the probability of the response being accepted by
| the prompter.
| nkrisc wrote:
| So unique to each prompter, refined over time?
| rustcleaner wrote:
| >That they can be randomly, nondeterministically and
| confidently wrong, and there is no way to know without
| manually reviewing every output.
|
| I think I can confidently assert that this applies to you and
| me as well.
| ryandrake wrote:
| I choose a computer to do a task because I expect it to be
| much more accurate, precise, and deterministic than a
| human.
| FeepingCreature wrote:
| > The second is that the LLMs don't learn once they're done
| training, which means I could spend the rest of my life
| tutoring Claude and it'll still make the exact same mistakes,
| which means I'll never get a return for that time and
| hypervigilance like I would with an actual junior engineer.
|
| However, this creates a significant return on investment for
| open-sourcing your LLM projects. In fact, you should commit
| your LLM dialogs along with your code. The LLM won't learn
| _immediately_, but it will learn in a few months when the next
| refresh comes out.
| samjewell wrote:
| > In fact, you should commit your LLM dialogs along with your
| code.
|
| Wholeheartedly agree with this.
|
| I think code review will evolve from "Review this code" to
| "Review this prompt that was used to generate some code"
| devnull3 wrote:
| > hypervigilant
|
| If a tech works 80% of the time, then I know that I need to be
| vigilant and I will review the output. The entire team
| structure is aware of this. There will be processes to offset
| this 20%.
|
| The problem is that when the AI becomes > 95% accurate (if at
| all) then humans will become complacent and the checks and
| balances will be ineffective.
| kaycebasques wrote:
| This section heading from the post captures the key insight, is
| more focused, and is less hyperbolic:
|
| > Redesigning for Decision Velocity
| nthingtohide wrote:
| > He argues this type of value judgement is something AI
| fundamentally cannot do, as it can only pattern match against
| existing decisions, not create new frameworks for assigning
| worth.
|
| Counterpoint: That decision has to be made only once (probably
| by some expert). AI can incorporate that training data into
| its reasoning and, voila, it becomes available to everyone. A
| software framework is already a collection of good decisions,
| practices and tastes made by experts.
|
| > An MIT study found materials scientists experienced a 44% drop
| in job satisfaction when AI automated 57% of their "idea-
| generation" tasks
|
| Counterpoint: Now consider making materials science decisions
| which require materials to have not just 3 properties but 10
| or 15.
|
| > Redesigning for Decision Velocity
|
| Suggestion: I think this section implies we must ask our
| experts to externalize all their tastes, preferences, and
| top-down thinking so that juniors can internalize them. So
| experts will be teaching details (based on their internal
| model) to LLMs while teaching the model itself to humans.
| Animats wrote:
| > This pile of tasks is how I understand what Vaughn Tan refers
| to as Meaningmaking: the uniquely human ability to make
| subjective decisions about the relative value of things.
|
| Why is that a "uniquely human ability"? Machine learning systems
| are good at scoring things against some criterion. That's mostly
| how they work.
| atomicnumber3 wrote:
| How are the criteria chosen, though?
|
| Something I learned from working alongside data scientists and
| financial analysts doing algo trading is that you can almost
| always find great fits for your criteria; nobody ever worries
| about that. It's coming up with the criteria that everyone
| frets over, and even more than that, you need to _beat other
| people_ at doing so - just being good or even great isn't
| enough. Your profit is the delta between where you are
| compared to all the other sharks in your pool. So LLMs are
| useless there; getting token-predicted answers is just going
| to get you the same as everyone else, which means zero alpha.
|
| So - I dunno about uniquely human? But there's definitely
| something here where, short of AGI, there's always going to
| need to be someone sitting down and actually beating the market
| (whatever that metaphor means for your industry or use case).
| fwip wrote:
| Finance is sort of a unique beast in that the field is
| inherently negative-sum. The profits you take home are always
| going to be profits somebody else isn't getting.
|
| If you're doing like, real work, solving problems in your
| domain actually adds value, and so the profits you get are
| from the value you provide.
| kaashif wrote:
| If you're algo trading then yes, which is what the person
| you're replying to is talking about.
|
| But "finance" is very broad and covers very real and
| valuable work like making loans and insurance - be careful
| not to be too broad in your condemnation.
| rukuu001 wrote:
| I think this is challenging because there's a lot of tacit
| knowledge involved, feedback loops are long, and measurement
| of success is ambiguous.
|
| It's a very rubbery, human-oriented activity.
|
| I'm sure this will be solved, but it won't be solved by
| noodling with prompts and automation tools - the humans will
| have to organise themselves to externalise expert knowledge and
| develop an objective framework for making 'subjective decisions
| about the relative value of things'.
| jasonthorsness wrote:
| The method of producing the work can be more important (and
| easier to review) than the work output itself - at the
| simplest level, a global search-replace of a function name that
| alters 5000 lines. At a complex level, you can trust a team of
| humans to do something without micro-managing every aspect of
| their work. My hope is that the current crisis of reviewing
| too much AI-generated output will subside once the LLM can be
| trusted the way the team can, because it has reached a high
| level of "judgement" and competence. But we're definitely not
| there yet.
|
| And contrary to the article, idea-generation with LLM support can
| be fun! They must have tested full replacement or something.
| wffurr wrote:
| >> At a complex level, you can trust a team of humans to do
| something without micro-managing every aspect of their work
|
| I see you have never managed an outsourced project run by a
| body shop consultancy. They check the boxes you give them with
| zero thought or regard to the overall project and require
| significant micro-managing to produce usable code.
| jdlshore wrote:
| I find this sort of whataboutism in LLM discussions tiring.
| Yes, _of course,_ there are teams of humans that perform
| worse than an LLM. But it is obvious to all but the most
| hype-blinded booster that it is possible for teams of humans to
| work autonomously to produce good results, because that is
| how all software has been produced to the present day, and
| some of it is good.
| timewizard wrote:
| > Remember the first time an autocomplete suggestion nailed
| exactly what you meant to type?
|
| No.
|
| > Multiply that by a thousand and aim it at every task you once
| called "work."
|
| If you mean "menial labor" then sure. The "work" I do is not at
| all aided by LLMs.
|
| > but our decision-making tools and rituals remain stuck in the
| past.
|
| That's because LLMs haven't eliminated or even significantly
| reduced risk. In fact they've created an entirely new category of
| risk in "hallucinations."
|
| > we need to rethink the entire production-to-judgment pipeline.
|
| Attempting to do this without accounting for risk or how capital
| is allocated into processes will lead you into folly.
|
| > We must reimagine knowledge work as a high-velocity decision-
| making operation rather than a creative production process.
|
| Then you will invent nothing new or novel and will be relegated
| to scraping by on the overpriced annotated databases of your
| direct competitors. The walled garden just raised the stakes. I
| can't believe people see a future in it.
| shawn-butler wrote:
| This really isn't true in principle. The current LLM ecosystems
| can't do "meaning tasks" but there are all kinds of "legacy" AI
| expert systems that do exactly what is required.
|
| My experience is that middle manager gatekeepers are the most
| reluctant to participate in building knowledge systems that
| obsolete them though.
| bendigedig wrote:
| Validating the outputs of a stochastic parrot sounds like a very
| alienating job.
| FeepingCreature wrote:
| It's actually very fun, ime.
| bendigedig wrote:
| I have plenty of experience doing code reviews, and doing a
| good job is pretty hard and thankless work. If I had to do
| that all day, every day, I'd be very unhappy.
| darth_avocado wrote:
| As a staff engineer, it upsets me if my Review to Code ratio
| goes above 1. On days when I am not able to focus and code
| because I was reviewing other people's work all day, I usually
| end up pretty drained but also unsatisfied. If the only job
| available to engineers becomes "review 50 PRs a day, every
| day," I'll probably quit software engineering altogether.
| kmijyiyxfbklao wrote:
| > As a staff engineer, it upsets me if my Review to Code
| ratio goes above 1.
|
| How does this work? Do you allow merging without reviews? Or
| are other engineers reviewing code way more than you?
| PaulRobinson wrote:
| Most knowledge work - perhaps all of it - is already validating
| the output of stochastic parrots; we just call those stochastic
| parrots "management".
| xg15 wrote:
| The intro sentence to this is quite funny.
|
| > _Remember the first time an autocomplete suggestion nailed
| exactly what you meant to type?_
|
| I actually don't, because so far this has only happened with
| trivial phrases or text I had already typed in the past. I do
| remember, however, dozens of times when autocorrect wrongly
| "corrected" the last word I typed, changing an easy-to-spot
| typo into a much more subtle semantic error.
| thechao wrote:
| I see these sorts of statements from coders who, you know,
| aren't good programmers in the first place. Here's the secret
| that I think LLMs are uncovering: there are a _lot_ of really
| shoddy coders out there; coders who could/would never become
| good programmers, and _they_ are absolutely going to be
| replaced with LLMs.
|
| I don't know how I feel about that. I suspect it's not going to
| be great for society. Replacing blue-collar workers with
| robots hasn't been super duper great.
| delusional wrote:
| If the AIs learned from us, they'll only be able to produce
| Coca-Cola and ads, so the entirety of the actually valuable
| economy is safe.
| eezurr wrote:
| And once the Orient and Decide part is augmented, we'll be
| limited by social networks (IRL ones). Every solo founder/small
| biz will have to compete more and more for marketing eyeballs;
| the ones with access to bigger engines (companies) will get the
| juice they need, and we come back to humans being the
| bottleneck again.
|
| That is, until we mutually decide on removing our agency from the
| loop entirely . And then what?
| zkmon wrote:
| > Ultimately, I don't see AI completely replacing knowledge
| workers any time soon.
|
| How was that conclusion reached? And what is meant by knowledge
| workers? Any work with knowledge is exactly the domain of LLMs.
| So, LLMs are indeed knowledge workers.
___________________________________________________________________
(page generated 2025-04-27 23:00 UTC)