[HN Gopher] Against LLM Maximalism
___________________________________________________________________
Against LLM Maximalism
Author : pmoriarty
Score : 105 points
Date : 2023-09-13 12:32 UTC (10 hours ago)
(HTM) web link (explosion.ai)
(TXT) w3m dump (explosion.ai)
| Grimburger wrote:
| [flagged]
| EGreg wrote:
| I predicted that AI would be the next Web3 -- hugely promising
| but increasingly ignored by HN.
|
| There will be waves of innovation in the coming years. Web3
| solutions will mostly enrich people or at worst be zero-sum,
| while AI solutions will redistribute wealth from the working
| class to the top 1% and corporations, as well as give people
| ways to take advantage of vulnerable people and systems at a
| scale never seen before.
| xnx wrote:
| Web3 never got past the idea stage and was never useful.
| Generative AI is already useful and actively used by millions
| of people in their daily work.
| EGreg wrote:
| Same predictable comment every time, and by the same exact
| people too. The first part is not even close to being true.
| And you never mention that the downsides of AI are
| astronomically larger than Web3. The downside of Web3 is
| bugs in immutable smart contracts where people lose only
| what they voluntarily put in. The downside of AI is human
| extinction and people losing in all kinds of ways
| regardless of whether they want to or not.
| phillipcarter wrote:
| How can you hold the opinion that AI is both not useful and
| that it's going to bring about human extinction?
|
| Anyways, incoherence of your argument aside, I'll gladly
| raise my hand as having a use case that LLMs immediately
| solved for and it's now a core product feature that
| performs well.
| EGreg wrote:
| Many things are not useful and can bring about human
| extinction. Viruses, volcanoes, or asteroid impacts, for
| instance. So it's not incoherent on its face.
|
| But I am not even saying that AI is not useful. I am
| saying that every single time someone pops up on HN to
| defend AI vs Web3, they _only_ focus on possible upside
| right now for some people. Even if AI brings 10000x
| downside at the same time, they would _never ever_
| consider mentioning that. But when society adopts a new
| technology, downsides matter even more than upsides. Loss
| of income stability and livelihood for many professions,
| attacks at scale (eg on reputation or truth) by
| botswarms, etc etc. And that is just what's possible with
| current technology.
|
| But most of all, for all its upsides, Web3's harm is
| limited to those who voluntarily commit some of their
| money to a smart contract. AI's harm, on the other hand, is
| far greater and falls primarily on those who DIDN'T
| VOLUNTARILY CHOOSE IT or even oppose it. That is
| not very moral as a society. It may enrich the tech bros
| further, but just like other tech projects, it will
| probably come at the expense of many others, especially
| the working class of society. They will have a rude
| awakening and will riot. But they aren't rioting about
| Web3, because losing money you put at risk in a
| controlled environment is just not in the same
| stratosphere.
|
| Expect the government to use AI to control the population
| more and more as this civil unrest happens. Look to China
| to see what it would look like. Or Palantir for precrime
| etc.
| phillipcarter wrote:
| I guess I'll just say that...I don't believe much of what
| you're saying is going to happen? I don't think I'll
| convince you and I don't think you'll convince me either.
| j16sdiz wrote:
| Three posts from you in this thread. I downvoted two and
| upvoted one.
|
| Sometimes an unpopular opinion needs more explanation. The
| other two comments are not helpful; this comment is
| helpful.
| EGreg wrote:
| Thanks. Well -- prepare to be downvoted by the anti Web3
| brigade heh
| IshKebab wrote:
| Since when was Web3 hugely promising? Web3 is rightfully
| being ignored because it is useless.
|
| AI is _already_ extremely useful. There's zero chance that
| it's a fad that will fizzle out. I'm not sure how anyone
| could come to that conclusion.
| EGreg wrote:
| Web3 being hugely promising doesn't mean AI will fizzle
| out. That's a strawman. Try to reply to what's been said.
| AI has far bigger downsides than Web3: Web3 at worst is
| zero-sum, and people voluntarily choose to engage with it.
| AI can harm many vulnerable people and systems that never
| chose to engage with any of it. That's what you call
| _useful_?
|
| Also, the idea that just because you _say_ Web3 has no use
| cases, it becomes true, regardless of evidence, is silly.
| IshKebab wrote:
| > Try to reply to what's been said.
|
| Try to _read_ what's been said. When did I imply that
| the two are linked?
|
| > AI has far bigger downsides than Web3, Web3 at worst is
| zero-sum and people voluntarily choose to engage with it.
|
| Sure. Web3 is a nothing. At worst it will change nothing.
| And it _is_ at its worst: it changes nothing.
|
| > That's what you call useful?
|
| AI can be abused, but that obviously doesn't mean that it
| isn't useful. I did not call the abuse of AI useful. Who
| is arguing against straw men now?
|
| > Also, this idea that just because you say Web3 has no
| use cases, makes it true, regardless of evidence, is
| silly.
|
| Please tell me one practical use of Web3. I did actually
| google it and it returned this list:
|
| https://www.techtarget.com/searchcio/tip/8-top-
| Web-30-use-ca...
|
| 1. Flexible and secure smart contracts - nobody really
| wants this; they don't want to lose all their money due
| to a bug with no recourse.
|
| 2. Trusted data privacy - this isn't anything concrete.
|
| 3. Highly targeted [advertising] content - erm I thought
| you said web3 has no downsides?
|
| 4. Automated content creation and curation - another hand
| wave.
|
| 5. Unique loyalty programs - ha, come on, really?
|
| 6. Increased community building - ... this list is
| exactly what I expected ...
|
| 7. Better omnichannel experiences - ??!?
|
| 8. Wider use of augmented reality - what has this even
| got to do with web3?
|
| Please point me to a realistic use case for web3.
| EGreg wrote:
| See the list here
|
| https://intercoin.org/applications
|
| Would love to see the same kind of numbered, point-by-point
| reaction as you gave above.
| IshKebab wrote:
| Web 5? Lol. As far as I can see all of those things are
| already totally possible with web 2.0. Except maybe NFTs?
| Hard to argue that they are useful though except for
| money laundering.
|
| Could you perhaps pick one or two from that list that you
| think are the best and explain why they can only be
| implemented with smart contracts?
|
| I mean, take voting for example. You can do voting with a
| web 1.0 website. The challenge is always going to be
| preventing vote stuffing, and the only real way to
| prevent that is to associate votes with real world IDs.
| How would web3 help with that? The proper solution is
| _government-issued_ key pairs, but that doesn't sound
| very web3 to me.
| EGreg wrote:
| You were fine making a list of 8 and here you punked out?
| Please give your reaction to each one, why they aren't
| necessary or aren't real applications and why Web3 is
| useless for them. Each one goes into depth for why Web3
| matters if you click it.
|
| Voting can be done with Web 1.0 and in fact is done with
| StackExchange sites. But how do you know someone didn't go
| into the database and change the votes and results? What
| good are elections if you can't trust them?
| atomicnumber3 wrote:
| How is Web3 doing these days, I must ask?
|
| The only thing I've heard of it recently is that 4chan is
| still doing good business selling ads for NFT and coin scams.
| EGreg wrote:
| Growing at a CAGR of 44%
|
| https://www.globenewswire.com/en/news-
| release/2023/03/22/263...
|
| Expected to hit $50 billion by 2030
|
| https://www.emergenresearch.com/amp/industry-
| report/web-3-ma...
|
| And for example $1.1 Billion in India
|
| https://m.economictimes.com/tech/technology/indian-
| web3-indu...
| hk__2 wrote:
| > Expected to hit $50 billion by 2030
|
| The definition of "web3" is too vague to allow a reliable
| estimate: it will be $50B according to your second
| link; $44B by 2031 according to your first link; $33B
| according to [1]; $45B according to [2]; $16B according to
| [3].
|
| [1]: https://www.grandviewresearch.com/press-
| release/global-web-3...
|
| [2]: https://www.vantagemarketresearch.com/industry-
| report/web-30...
|
| [3]: https://www.skyquestt.com/report/web-3-0-blockchain-
| market
| naillo wrote:
| The conditional probability that the article is AI-written
| is also so much larger when you encounter .ai TLDs.
| davepeck wrote:
| Explosion is an old school machine learning company by the
| people who built the spaCy natural language library. They're
| serious practitioners whose work predates the "hype-train"
| you're concerned about.
|
| The blog post might be worth a gander.
| sudb wrote:
| I've had a fair amount of success at work recently with treating
| LLMs - specifically OpenAI's GPT-4 with function calling - as
| modules in a larger system, helped along powerfully by the
| ability to output structured data.
|
| > Most systems need to be much faster than LLMs are today, and on
| current trends of efficiency and hardware improvements, will be
| for the next several years.
|
| I think I disagree with the author here though, and am happy
| to be a technological optimist - if LLMs are used modularly,
| what's to stop us in a few years (presumably still hardware
| requirement costs, on reflection) from eventually having
| small, fast, specialised LLMs for the things that we find
| them truly useful/irreplaceable for?
| syllogism wrote:
| Nothing's to stop us, and in fact we can do that now! This is
| basically what the post advocates for: replacing the LLM calls
| for task-specific things with smaller models. They just don't
| need to be LLMs.
| og_kalu wrote:
| I'll just say there's no guarantee that training or fine-
| tuning a smaller bespoke model will be more accurate
| (certainly, though, it may be accurate enough). Minerva and
| Med-PaLM are worse than GPT-4, for instance.
| syllogism wrote:
| This is where the terminology being used to discuss LLMs today
| is a touch awkward and imprecise.
|
| There's a key distinction between smaller models trained with
| transfer-learning, and just fine-tuning a smaller LLM and still
| using in-context learning.
|
| Transfer learning means you're training an output network
| specifically for the task you're doing. So like, if you're
| doing classification, you output a vector with one element per
| class, apply a softmax transformation, and train on a negative
| log likelihood objective. This is direct and effective.
|
| Fine-tuning a smaller LLM so that it's still learning to do
| text generation, but it's better at the kinds of tasks you want
| to do, is a much more mixed experience. The text generation is
| still really difficult, and it's really difficult to learn to
| follow instructions. So all of this still really favours size.
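|
| For concreteness, a minimal sketch of the transfer-learning
| setup in PyTorch (my own toy example, assuming you already
| have fixed-size document vectors from some shared encoder):
|
|     import torch
|     import torch.nn as nn
|
|     # toy data: 256-dim "encoded" docs, 3 classes; in practice
|     # the vectors come from a pretrained, shared encoder
|     X = torch.randn(64, 256)
|     y = torch.randint(0, 3, (64,))
|
|     head = nn.Linear(256, 3)         # one output per class
|     optim = torch.optim.Adam(head.parameters(), lr=1e-3)
|     loss_fn = nn.CrossEntropyLoss()  # softmax + negative log
|                                      # likelihood in one step
|
|     for _ in range(100):
|         optim.zero_grad()
|         loss = loss_fn(head(X), y)   # direct task objective
|         loss.backward()
|         optim.step()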
| og_kalu wrote:
| Right, that is a good distinction. Fair enough. I still
| stand by the point that you could train a worse model
| depending on the task. Translation and nuanced
| classification are both areas where I've not seen bespoke
| models outright better than GPT-4, although, like I said,
| it could still be good enough given the speed and compute
| requirements.
| skybrian wrote:
| I don't understand this heuristic and I think it might be a bit
| garbled. Any idea what the author meant? How do you get 1000?
|
| > A good rule of thumb is that you'll want ten data points per
| significant digit of your evaluation metric. So if you want to
| distinguish 91% accuracy from 90% accuracy, you'll want to have
| at least 1000 data points annotated. You don't want to be running
| experiments where your accuracy figure says a 1% improvement, but
| actually you went from 94/103 to 96/103.
| akprasad wrote:
| My guess is that this should be something like "If you have n
| significant digits in your evaluation metric, you should have
| at least 10^(n+1) data points."
| wrs wrote:
| Avoiding the term "significant digits" completely:
| Distinguishing 91 vs 90 is a difference of 1 on a 0-100
| scale. 100x10=1000. If you wanted to distinguish 91.0 vs
| 90.9, that's 1 on a 0-1000 scale, so you'd want 10,000
| points.
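|
| One way to sanity-check the rule of thumb: the standard
| error of an accuracy estimate around 90% is roughly
| sqrt(0.9 * 0.1 / n), so the noise only shrinks to about the
| 1% difference you care about once n is in the thousands.
|
|     import math
|
|     def accuracy_std_error(p, n):
|         # standard error of accuracy p measured on n points
|         return math.sqrt(p * (1 - p) / n)
|
|     for n in (100, 1000, 10000):
|         print(n, round(accuracy_std_error(0.90, n), 4))
|     # 100   -> 0.03    (can't tell 90% from 91%)
|     # 1000  -> 0.0095  (about 1%: borderline)
|     # 10000 -> 0.003   (comfortable)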
| forward-slashed wrote:
| All of this is quite difficult without a DSL to explore and
| construct pipelines for LLMs. Current approaches are very
| slow in terms of iteration.
| alexvitkov wrote:
| Sorry if this is a bit ignorant, I don't work in the space, but
| if a single LLM invocation is considered too slow, how could
| splitting it up into a pipeline of LLM invocations which need to
| happen in sequence help?
|
| Same with reliability - you don't trust the results of one
| prompt, but you trust multiple piped one into another? Even if
| you test the individual components, which is what this approach
| enables and this article heavily advocates for, I still can't
| imagine that 10 unreliable systems, which have to interact
| with each other, are more reliable than one.
|
| 80% accuracy of one system is 80% accuracy.
|
| 95% accuracy on 10 systems is 59% accuracy in total if you need
| all of them to work and they fail independently.
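|
| The arithmetic behind that last figure, for what it's worth:
|
|     # chance all 10 stages succeed, assuming independence
|     print(round(0.95 ** 10, 3))   # 0.599, i.e. roughly 59%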
| peter_l_downs wrote:
| I think the idea behind breaking down the task into a
| composable pipeline is that you then replace the LLM steps in a
| pipeline with supervised models that are much faster. So you
| end up with a pipeline of non-LLM models, which are faster and
| more explainable.
| syllogism wrote:
| (Author here)
|
| About the speed, the idea is that if you break down the task,
| you can very often use much smaller models for the component
| tasks. LLMs are approaching prediction tasks under an extremely
| difficult constraint: they don't get to see many labelled
| examples. If you relax that constraint and just use transfer-
| learning, you can get better accuracy with much smaller models.
| The transfer-learning pipeline can also be arranged so that you
| encode the text into vectors once, and you apply multiple
| little task networks over the shared representation. spaCy
| supports this for instance, and it's easy to do when working
| directly with the networks in PyTorch etc.
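|
| Roughly what that arrangement looks like in PyTorch (a toy
| sketch with made-up dimensions, not spaCy's actual API):
| encode each document once, then run several small task
| networks over the same vectors.
|
|     import torch
|     import torch.nn as nn
|
|     class SharedEncoderModel(nn.Module):
|         def __init__(self, vocab=5000, dim=128):
|             super().__init__()
|             # shared representation, computed once per doc
|             self.encoder = nn.EmbeddingBag(vocab, dim)
|             # small task-specific networks on top of it
|             self.sentiment = nn.Linear(dim, 2)
|             self.topic = nn.Linear(dim, 8)
|
|         def forward(self, token_ids):
|             doc = self.encoder(token_ids)   # encode once
|             return self.sentiment(doc), self.topic(doc)
|
|     model = SharedEncoderModel()
|     tokens = torch.randint(0, 5000, (4, 32))  # 4 docs
|     sentiment_logits, topic_logits = model(tokens)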
| cmcaleer wrote:
| > you don't trust the results of one prompt, but you trust
| multiple piped one into another?
|
| This is really not at all unusual. Take aircraft for instance.
| One system is not reliable, for a multitude of reasons. A
| faulty sensor could be misleading, a few bits could get flipped
| by cosmic rays causing ECC to fail, the system itself could be
| poorly calibrated, there are far too many unacceptable risks.
| But add TMR[0][1] and suddenly you are able to trust things a
| lot more. This isn't to say that TMR is bullet proof e.g.
| incidents like [2], but redundancy does make it possible to
| increase trust in a system, and assign blame to what part of a
| system is faulty (e.g. if 3 systems exist, and 1 appears to be
| disagreeing wildly with 2 and 3, you know to start
| investigating system 1 first).
|
| Would it work here? I don't know! But it doesn't seem like an
| inherently terrible or flawed idea if we look at past
| applications. Ensembling different models is a pretty common
| technique to get better results in ML, and maybe this approach
| would make it easier to find weak links and assign blame.
|
| [0]: https://en.wikipedia.org/wiki/Triple_modular_redundancy
|
| [1]:
| https://en.wikipedia.org/wiki/Air_data_inertial_reference_un...
|
| [2]: https://www.atsb.gov.au/media/news-items/2022/pitot-probe-
| co... causing total confusion among the TMR
| chongli wrote:
| _This isn't to say that TMR is bullet proof e.g. incidents
| like [2], but redundancy does make it possible to increase
| trust in a system, and assign blame to what part of a system
| is faulty (e.g. if 3 systems exist, and 1 appears to be
| disagreeing wildly with 2 and 3, you know to start
| investigating system 1 first)._
|
| You can only gain trust in this system if you understand the
| error sources for all three systems. If there are common-
| mode errors then you can see errors showing up in multiple
| systems simultaneously. For example, if your aircraft is
| using pitot tubes [1] to measure airspeed then you need to
| worry about multiple tubes icing up at the same time (which
| is likely since they're in the same environment).
|
| So it would not add very much trust to implement TMR with
| three different pitot tubes. It would be better to combine
| the pitot tubes with completely different systems, such as
| radar and GPS, to handle the (likely) scenario of two or more
| pitot tubes icing up and failing completely.
|
| [1] https://en.wikipedia.org/wiki/Pitot_tube?wprov=sfti1
| vjerancrnjak wrote:
| It's not ignorant. It is a known problem. Before LLMs,
| approaches to machine translation or any high level language
| tasks did start with a pipeline (part of speech tagging,
| dependency tree parsing, named entity recognition etc.) but
| quickly these attempts were discarded.
|
| All of the models in the pipeline are not optimized with the
| joint loss (the final machine translation model that maps lang
| A to lang B does not propagate its error to the low level
| models in the pipeline).
|
| A pipeline of LLMs will accumulate the error in the same way,
| eventually the same underlying problem of pipeline not being
| trained with the joint loss will result in low accuracy.
|
| LLMs, or DNNs in general, do more compute, so they are
| extremely powerful even when sequenced. Making a sequence of
| decisions with a regular ML model has a similar problem to
| pipelining: if you train it on a single-decision loss rather
| than a loss over the sequence of decisions, there's a
| question of whether it can recover and make the right next
| step after making a wrong one (your training data never
| included such a recovery example). But convolutional NNs
| were so powerful for language tasks that this recovery from
| error was successful, even though you never trained the CNNs
| on the joint loss over the sequence of decisions.
| visarga wrote:
| It's not a given that the performance would suffer. For
| instance, you could use self-checking methods like cycle
| consistency or back translation in a sequence of prompts.
| Another option is to generate multiple answers and then use a
| voting system to pick the best one. This could actually boost
| the LLM's accuracy, although it would require more
| computation. In various tasks, there might be simpler methods
| for verifying the answer than initially generating it.
|
| Then you have techniques like the Tree of Thoughts, which are
| particularly useful for tasks that require strategic planning
| and exploration. You just can't solve these in one single
| round of LLM interaction.
|
| In real-world applications, developers often choose a series
| of prompts that enable either self-checking or error
| minimization. Alternatively, they can involve a human in the
| loop to guide the system's actions. The point is to design
| with the system's limitations in mind.
|
| On a side note, if you're using vLLM, you can send up to 20
| requests in parallel without incurring additional costs. The
| server batches these requests and uses key-value caching, so
| you get high token/s throughput. This allows you to resend
| previous outputs for free or run multiple queries on a large
| text segment. So, running many tasks doesn't necessarily slow
| things down if you manage it correctly.
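|
| A toy version of the voting idea (my sketch; `ask_llm` is a
| stand-in, not a real API): sample several answers and keep
| the most common one.
|
|     import random
|     from collections import Counter
|
|     def ask_llm(prompt):
|         # stand-in so the sketch runs; swap in a real,
|         # non-deterministic model call here
|         return random.choice(["42", "42", "41"])
|
|     def vote(prompt, n=5):
|         # generate n answers, return the most frequent one
|         answers = [ask_llm(prompt).strip() for _ in range(n)]
|         return Counter(answers).most_common(1)[0][0]
|
|     print(vote("What is 6 * 7?"))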
| vjerancrnjak wrote:
| It is a simple problem, and in the literature it is called
| "label bias".
|
| Let's say you maximize performance of a single piece of
| pipeline (training on a dataset or something else), and you
| do it the same way for all pieces. The labels that were
| correct as inputs in training are your limitation. Why?
| Because when a mistake happens, you've never learned to
| recover from it, because you always gave the correct labels
| in your training.
|
| What LLM pipelines do is probably something like this:
|
| * a complex task is solved by a pipeline of prompts
|
| * we tweak a single prompt
|
| * we observe the output at the end of the whole pipeline
| and determine if the tweak was right
|
| In this way, the joint loss of the pipeline is observed and
| that is ok.
|
| But, the moment your pipeline is: POS Tagger -> Dependency
| Tree Parser -> Named Entity Recognition -> ... -> Machine
| Translation
|
| and you have separate training sets that maximize
| performance of each particular piece, you are introducing
| label bias and are relying on some luck to recover from
| errors early in the pipeline because during training, the
| later parts never got errors as input and recovered to the
| correct output.
| phillipcarter wrote:
| So I think this is an excellent post. Indeed, LLM maximalism is
| pretty dumb. They're awesome at specific things and mediocre at
| others. In particular, I get the most frustrated when I see
| people try to use them for tasks that need deterministic outputs
| _and the thing you need to create is already known statically_.
| My hope is that it's just people being super excited by the
| tech.
|
| I wanted to call this out, though, as it makes the case that to
| improve any component (and really make it production-worthy), you
| need an evaluation system:
|
| > Intrinsic evaluation is like a unit test, while extrinsic
| evaluation is like an integration test. You do need both. It's
| very common to start building an evaluation set, and find that
| your ideas about how you expect the component to behave are much
| vaguer than you realized. You need a clear specification of the
| component to improve it, and to improve the system as a whole.
| Otherwise, you'll end up in a local maximum: changes to one
| component will seem to make sense in themselves, but you'll see
| worse results overall, because the previous behavior was
| compensating for problems elsewhere. Systems like that are very
| difficult to improve.
|
| I think this makes sense from the perspective of a team with
| deeper ML expertise.
|
| What it doesn't mention is that this is an enormous effort, made
| even larger when you don't have existing ML expertise. I've been
| finding this one out the hard way.
|
| I've found that if you have "hard criteria" to evaluate (i.e.,
| getting the LLM to produce a given structure rather than an open-
| ended output for a chat app) you can quantify improvements using
| Observability tools (SLOs!) and iterating in production. Ship
| changes daily, track versions of what you're doing, and keep on
| top of behavior over a period of time. It's arguably a lot less
| "clean" but it's way faster, and because it's working on the
| real-world usage data, it's really effective. An ML engineer
| might call that some form of "online test" but I don't think it
| really applies.
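|
| A sketch of the kind of "hard criteria" check I mean (field
| names below are made up): it can be as blunt as whether the
| output parses into the structure you asked for, and that
| pass rate is what feeds the SLO.
|
|     import json
|
|     REQUIRED = {"title", "severity", "summary"}  # made up
|
|     def meets_hard_criteria(raw_output):
|         # True if the output parses and has the fields we
|         # asked the model for
|         try:
|             data = json.loads(raw_output)
|         except json.JSONDecodeError:
|             return False
|         if not isinstance(data, dict):
|             return False
|         return REQUIRED.issubset(data)
|
|     raw = '{"title": "x", "severity": "low", "summary": "y"}'
|     print(meets_hard_criteria(raw))  # True -> counts as a pass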
|
| At any rate, there are other use cases where you really do need
| evaluations, though. The more important correct output is, the
| more it's worth investing in evals. I would argue that if bad
| outputs have high consequences, then maybe LLMs also aren't the
| right tech for the job, but that'll probably change in a few
| years. And hopefully making evaluations will be easier too.
| syllogism wrote:
| (Author here)
|
| It's true that getting something going end-to-end is more
| important than being perfectionist about individual steps --
| that's a good practical perspective. We hope good evaluation
| won't be such an enormous effort. Most of what we're trying to
| do at Explosion can be summarised as trying to make the right
| thing easy. Our annotation tool Prodigy is designed to scale
| down to smaller use-cases for instance ( https://prodigy.ai ).
| I admit it's still effort though, and depending on the task,
| may indeed still take expertise.
| axiom92 wrote:
| > tasks that need deterministic outputs and the thing you need
| to create is already known statically
|
| Wow, interesting. Do you have any example for this?
|
| I've realized that LLMs are fairly good at string processing
| tasks that a really complex regex might also do, so I can see
| the point in those.
| intended wrote:
| Classification tasks come to mind
| og_kalu wrote:
| LLMs are better at that though. Sure you may not require
| them but it certainly wouldn't be for a lack of accuracy.
|
| https://www.artisana.ai/articles/gpt-4-outperforms-elite-
| cro...
|
| https://arxiv.org/abs/2303.15056
| phillipcarter wrote:
| Yeah, there's a little bit of flex there for sure. An example
| that recently came up for me at work was being able to take
| request:response pairs from networking events and turn them
| into a distributed trace. You can absolutely get an LLM to do
| that, but it's very slow and can mess up sometimes. But you
| can also do this 100% programmatically! The LLM route feels a
| little easier at first but it's arguably a bad application of
| the tech to the problem. I tried it out just for fun, but
| it's not something I'd ever want to do for real.
|
| (separately, synthesizing a trace from this kind of data is
| impossible to get 100% correct for other reasons, but hey,
| it's a fun thing to try)
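|
| To give a flavour of the programmatic route (the event
| fields below are made up): pair each response to its request
| by ID and emit a span per pair, no model involved.
|
|     # toy events; real ones carry far more fields
|     events = [
|         {"id": "a", "parent": None, "kind": "req",  "ts": 0.00},
|         {"id": "b", "parent": "a",  "kind": "req",  "ts": 0.01},
|         {"id": "b", "parent": "a",  "kind": "resp", "ts": 0.05},
|         {"id": "a", "parent": None, "kind": "resp", "ts": 0.09},
|     ]
|
|     spans = {}
|     for e in events:
|         span = spans.setdefault(
|             e["id"], {"id": e["id"], "parent": e["parent"]})
|         span["start" if e["kind"] == "req" else "end"] = e["ts"]
|
|     for s in spans.values():
|         print(s["id"], s["parent"], s["end"] - s["start"])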
| mark_l_watson wrote:
| I agree with much of the article. You do need to take great care
| to make code with embedded LLM use modular and easily
| maintainable, and otherwise keep code bases tidy.
|
| I am a fan of tools like LangChain that bring some software order
| to using LLMs.
|
| BTW, this article is a blog hosted by the company who writes and
| maintains the excellent spaCy library.
| passion__desire wrote:
| Is anyone working on an OS-level LLM layer? e.g. consider a
| program like GIMP. It would feed its documentation and
| workflow details into an LLM and get embeddings, which would
| be installed with the program just like man pages. Users
| could then express what they want to do in natural language
| and GIMP would query the LLM and create a workflow that
| might achieve the task.
| mark_l_watson wrote:
| Apple's CoreML is a large collection of regular models, deep
| learning models, etc. that are easy to use in
| macOS/iOS/iPadOS apps.
| peter_l_downs wrote:
| spaCy [0] is a state-of-the-art / easy-to-use NLP library
| from the pre-LLM era. This post is the spaCy founder's
| thoughts on how to
| integrate LLMs with the kind of problems that "traditional" NLP
| is used for right now. It's an advertisement for Prodigy [1],
| their paid tool for using LLMs to assist data labeling. That
| said, I think I largely agree with the premise, and it's worth
| reading the entire post.
|
| The steps described in "LLM pragmatism" are basically what I see
| my data science friends doing -- it's hard to justify the cost
| (money and latency) in using LLMs directly for all tasks, and
| even if you want to you'll need a baseline model to compare
| against, so why not use LLMs for dataset creation or augmentation
| in order to train a classic supervised model?
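|
| Sketching that pattern with scikit-learn and a stand-in for
| the LLM labelling step (the texts and labels are invented):
|
|     from sklearn.feature_extraction.text import TfidfVectorizer
|     from sklearn.linear_model import LogisticRegression
|     from sklearn.pipeline import make_pipeline
|
|     # step 1: label raw texts with an LLM (stubbed here),
|     # then spot-check a sample of the labels by hand
|     texts = ["great acting, weak plot", "the lead was wooden",
|              "a career-best performance", "the script was dull"]
|     llm_labels = ["positive", "negative", "positive", "negative"]
|
|     # step 2: train a small, fast supervised model on them
|     clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
|     clf.fit(texts, llm_labels)
|
|     # step 3: serve the cheap model; keep the LLM for labelling
|     print(clf.predict(["the acting felt flat"]))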
|
| [0] https://spacy.io/
|
| [1] https://prodi.gy/
| og_kalu wrote:
| >what I see my data science friends doing -- it's hard to
| justify the cost (money and latency) in using LLMs directly for
| all tasks, and even if you want to you'll need a baseline model
| to compare against, so why not use LLMs for dataset creation or
| augmentation in order to train a classic supervised model?
|
| The NLP infrastructure and pipelines we have today aren't there
| because they are necessarily the best way to handle the tasks
| you want. They're in place because computers simply could not
| understand text the way we would like, so shortcuts and
| approximations were necessary.
|
| Borrowing from the blog: since you could not simply ask the
| computer, "How many paragraphs in this review say something bad
| about the acting? Which actors do they frequently mention?",
| separate processes of something like tagging names, linking
| them to a knowledge base, paragraph-level actor sentiment,
| etc. were needed.
|
| The approximations are cool and they do work rather well for
| some use cases but they fall apart in many others.
|
| This is why automated resume filtering, moderation, etc. are
| still awful with the old techniques. You simply can't do
| what is suggested above and get the same utility.
___________________________________________________________________
(page generated 2023-09-13 23:00 UTC)