[HN Gopher] Overcoming the limits of current LLMs
___________________________________________________________________
Overcoming the limits of current LLMs
Author : sean_pedersen
Score : 77 points
Date : 2024-07-18 00:35 UTC (22 hours ago)
(HTM) web link (seanpedersen.github.io)
(TXT) w3m dump (seanpedersen.github.io)
| Carrok wrote:
| I wish he went into how to improve confidence scores, though I
| guess training on better data to begin with should improve
| results and thus confidence.
| RodgerTheGreat wrote:
| One of the main factors that makes LLMs popular today is that
| scaling up the models is a simple and (relatively) inexpensive
| matter of buying compute capacity and scraping together more raw
| text to train them. Without large and highly diverse training
| datasets to construct base models, LLMs cannot produce even the
| superficial appearance of good results.
|
| Manually curating "tidy", properly-licensed and verified datasets
| is immensely more difficult, expensive, and time-consuming than
| stealing whatever you can find on the open internet. Wolfram
| Alpha is one of the more successful attempts in that curation-
| based direction (using good-old-fashioned heuristic techniques
| instead of opaque ML models), and while it is very useful and
| contains a great deal of factual information, it does not conjure
| appealing fantasies of magical capabilities springing up from
| thin air and hands-off exponential improvement.
| totetsu wrote:
| It's not unethical if people in positions of privilege and
| power do it to maintain their rightful position of privilege
| and power.
| dang wrote:
| Please don't post in the flamewar style to HN. It degrades
| discussion and we're trying to go in the opposite direction
| here, to the extent that is possible on the internet.
|
| https://news.ycombinator.com/newsguidelines.html
| threeseed wrote:
| > properly-licensed and verified datasets is immensely more
| difficult, expensive
|
| Arguably the bigger problem is that many of those datasets e.g.
| WSJ articles are proprietary and can be exclusively licensed
| like we've seen recently with OpenAI.
|
| So we end up in a situation where competition is simply
| not possible.
| piva00 wrote:
| > Arguably the bigger problem is that many of those datasets
| e.g. WSJ articles are proprietary and can be exclusively
| licensed like we've seen recently with OpenAI.
|
| > So we end up in a situation where competition is
| simply not possible.
|
| Exactly, and Technofeudalism advances a little more into a
| new feud.
|
| OpenAI is trying to create its moat by shoring up training
| data, probably attempting to not allow competitors to train
| on the same datasets they've been licensing, at least for a
| while. Training data is the only possible moat for LLMs,
| models seem to be advancing quite well between different
| companies but as mentioned here a tidy training dataset is
| the actual gold.
| gessha wrote:
| It's landmines no matter how you approach the problem.
|
| If you treat the web as a free-for-all and you scrape
| freely, you get sued by the content platforms for copyright
| or terms of service violation.
|
| If you license the content, you let the highest bidder get
| the content.
|
| No matter what happens, capital wins.
| darby_nine wrote:
| Man it seems like the ship has sailed on "hallucination" but it's
| such a terrible name for the phenomenon we see. It is a major
| mistake to imply the issue is with perception rather than
| structural incompetence. Why not just say "incoherent output"?
| It's actually descriptive and doesn't require bastardizing a word
| we already find meaningful to mean something completely
| different.
| jwuphysics wrote:
| > Why not just say "incoherent output"?
|
| Because the biggest problem with hallucinations is that the
| output is usually coherent but factually incorrect. I agree that
| "hallucination" isn't the best word for it... perhaps something
| like "confabulation" is better.
| linguistbreaker wrote:
| I appreciated a post on here recently that likened AI
| hallucination to 'bullshitting'. It's coherent, even
| plausible output without any regard for the truth.
| wkat4242 wrote:
| While I have absolutely no issues with the word "shit" in
| popular terms, I'd normally like to reserve it for
| situations where there's actually intended malice like in
| "enshittification".
|
| Rather than just an imperfect technology as we have here.
|
| Many people object to the term enshittification for foul-
| mouthing reasons but I think it covers it very well because
| the principle it covers is itself so very nasty. But that's
| not at all the case here.
| pessimizer wrote:
| "Bullshitting" isn't a new piece of jargon, it's a common
| English word of many decades vintage, and is being used
| in its dictionary sense here.
| MattPalmer1086 wrote:
| More true to say that _all_ output is bullshitting, not
| just the ones we call hallucinations. Some of it is true,
| some isn't. The model doesn't know or care.
| nl wrote:
| And we use "hallucination" because in the ancient times when
| generative AI meant image generation, models would
| "hallucinate" extra fingers etc.
|
| The behavior of text models is similar enough that the
| wording stuck, and it's not all that bad.
| exmadscientist wrote:
| "Hallucinations" implies that someone isn't of sound mental
| state. We can argue forever about what that means for a LLM and
| whether that's appropriate, but I think it's absolutely the
| right attitude and approach to be taking toward these things.
|
| They simply do not behave like humans of sound minds, and
| "hallucinations" conveys that in a way that "confabulations" or
| even "bullshit" does not. (Though "bullshit" isn't bad either.)
| kimixa wrote:
| I don't really immediately link "Hallucinations" with
| "Unsound mind" - most people I know have experienced auditory
| hallucinations - often things like not sure if the doorbell
| went off, or if someone said their name.
|
| And I couldn't find a single one of my friends who hadn't
| experienced "phantom vibration syndrome".
|
| Both I'd say are "Hallucinations", without any real negative
| connotation.
| devjab wrote:
| I disagree with this take because LLMs are, always,
| hallucinating. When they get things right it's because they
| are lucky. Yes, yes, it's more complicated than that, but the
| essence of LLMs is that they are very good at being lucky. So
| good that they will often give you better results than random
| search engine clicks, but not good enough to be useful for
| anything important.
|
| I think calling the times they get things wrong
| hallucinations is largely an advertising trick. So that they
| can sort of fit the LLMs into how all IT is sometimes "wonky"
| and sell their fundamentally flawed technology more easily. I
| also think it works extremely well.
| Terr_ wrote:
| To offer a satirical analogy: "Lastly, I want to reassure
| investors and members of the press that we take these
| concerns very seriously: Hindenburg 2 will only contain
| only _normal and unreactive_ hydrogen gas, and not the
| _rare and unusual_ explosive kind, which is merely a
| temporary hurdle in this highly dynamic and growing field."
|
| Edit: In retrospect, perhaps a better analogy would involve
| gasoline, as its explosive nature is what's actively being
| exploited in normal use.
| marcosdumay wrote:
| Yes (to the edit), an analogy with making planes safer by
| only using non-flammable fuels is perfect.
| Terr_ wrote:
| I expect most people have already filled in the blanks,
| but for completeness:
|
| "Lastly, I want to reassure investors and members of the
| press that we take these concerns very seriously: The
| Ford Pinto-II will only contain only normal and stable
| gasoline, and not the rare and unusual burning kind,
| which is merely a temporary hurdle in this highly dynamic
| and explos--er-- _fast growing_ field."
| mewpmewp2 wrote:
| But the point is, isn't hallucinating about having
| malformed, altered or out of touch input rather than
| producing inaccurate output yourself?
|
| It is the memory pathways leading them astray. It could be
| thought of as a memory system that, at a certain point, can no
| longer be fully sure whether the connections it has come from
| actual training or were created accidentally.
| Terr_ wrote:
| > isn't hallucinating about having malformed, altered or
| out of touch input rather than producing inaccurate
| output yourself?
|
| I suppose so, in the sense that someone could simply be
| _lying_ about pink elephants instead of seeing them.
| However it's hard to argue that the machine _knows_ the
| "right" answer and is (intelligently?) deceiving us.
|
| > It is the memory pathways leading them astray.
|
| I don't think it's a "memory" issue as much as a "they
| don't operate the way we like to think they do" issue.
|
| Suppose a human is asked to describe different paintings
| on the wall of an art gallery. Sometimes their statements
| appear valid and you nod along, and sometimes the
| statements are so wrong that it alarms you, because "this
| person is hallucinating."
|
| Now consider how the entire situation is flipped by
| finding out one additional fact... _They're actually
| totally blind._
|
| Is it a lie? Is it a hallucination? Does it matter?
| Either way you must dramatically re-evaluate what their
| "good" outputs really mean and whether they can be used.
| mewpmewp2 wrote:
| To me it's more like, imagine that you have read a lot of
| books throughout your life, but then someone comes in and
| asks a question from you and you try to answer from
| memory, but you get beaten when you say something like "I
| don't know", and you get rewarded if you answer
| accurately. You do get beaten if you answer inaccurately,
| but eventually you learn that if you just say something,
| you might just be accurate and you will not get beaten.
| So you just always learn to answer to the best of your
| knowledge, while never saying that you specifically don't
| know, because it decreases chances of getting beat up.
| You are not intentionally lying, you are just hoping that
| whatever you say is accurate to the best you can do
| according to the neural connections you've built up in
| your brain.
|
| Like you ask me for a birthdate of some obscure political
| figure from history? I'm going to try to feel out what
| period in history the name might feel like to me and just
| make my best guess based on that, then say some random
| year and a birthdate. It just has the lowest odds of
| being beaten. Was I hallucinating? No, I was just trying
| to not get beaten.
| linguistbreaker wrote:
| How about "dream-reality confusion (DRC)"?
| TeaBrain wrote:
| There is no dream-reality separation in an LLM, or really
| any conception of dreams or reality, so I don't think the
| term makes sense. Hallucination works fine to describe the
| phenomenon. LLMs work by coalescing textual information.
| LLM hallucinations occur due to faulty or inappropriate
| coalescence of information, which is similar to what occurs
| with actual hallucinations.
| marcosdumay wrote:
| Bullshit is the most descriptive one.
|
| LLMs don't do it because they are out of their right mind.
| They do it because every single answer they say is invented
| caring only about form, and not correctness.
|
| But yeah, that ship has already sailed.
| mistermann wrote:
| "Sound" minds for humans is graded on a curve, and this trick
| is not acknowledged, _or popular_.
| wkat4242 wrote:
| Hallucination is one single word. Even if it's not perfect it's
| great as a term. It's easy to remember and people new to the
| term already have an idea of what it entails. And the term will
| bend to cover what we take it to mean anyway. Language is
| flexible. Hallucination in an LLM context doesn't have to be
| the exact same as in a human context. All that matters is that
| we're aligned on what we're talking about. It's already
| achieved this purpose.
| freilanzer wrote:
| Hallucination perfectly describes the phenomenon.
| dgs_sgd wrote:
| I think we call it hallucination because of our tendency to
| anthropomorphize things.
|
| Humans hallucinate. Programs have bugs.
| threeseed wrote:
| The point is that this isn't a bug.
|
| It's inherent to how LLMs work and is expected although
| undesired behaviour.
| TeaBrain wrote:
| The problem with "incoherent output" is that it isn't
| describing the phenomenon at all. There have been cases where
| LLM output has been incoherent, but modern LLM hallucinations
| are usually coherent and well-constructed, just completely
| fabricated.
| breatheoften wrote:
| I think it's a pretty good name for the phenomenon -- maybe the
| only problem with the term is that what models are doing is
| 100% hallucination all the time -- it's just that when the
| hallucinations are useful we don't call them hallucinations --
| so maybe that is a problem with the term (not sure if that's
| what you are getting at).
|
| But there's nothing at all different about what the model is
| doing between these cases -- the models are hallucinating all
| the time and have no ability to assess when they are
| hallucinating "right" or "wrong" or useful/non-useful output in
| any meaningful way.
| Slow_Hand wrote:
| I prefer "confabulate" to describe this phenomena.
|
| : to fill in gaps in memory by fabrication
|
| > In psychology, confabulation is a memory error consisting of
| the production of fabricated, distorted, or misinterpreted
| memories about oneself or the world.
|
| It's more about coming up with a plausible explanation in the
| absence of a readily-available one.
| adrianmonk wrote:
| On a literal level, hallucinations are perceptual.
|
| But "hallucination" was already (before LLMs) being used in a
| figurative sense, i.e. for abstract ideas that are made up out
| of nothing. The same is also true of other words that were
| originally visual, like "illusion" and "mirage".
| ainoobler wrote:
| The article suggests a useful line of research. Train an LLM to
| detect logical fallacies and then see if that can be bootstrapped
| into something useful because it's pretty clear that all the
| issues with LLMs stem from a lack of logical capabilities. If an LLM
| was capable of logical reasoning then it would be obvious when it
| was generating made-up nonsense instead of referencing existing
| sources of consistent information.
| dtx1 wrote:
| I think we should start smaller and make them able to count
| first.
| Terr_ wrote:
| Yeah, you can train an LLM to recognize the vocabulary and
| grammatical features of logical fallacies... Except the
| nature of fallacies is that they _look real_ on that same
| linguistic level, so those features aren't distinctive for
| that purpose.
|
| Heck, I think detecting sarcasm would be an easier goal, and
| still tricky.
| idle_zealot wrote:
| > Except the nature of fallacies is that they look real on
| that same linguistic level, so those features aren't
| distinctive for that purpose
|
| Well that's actually good news. With a large enough
| labelled dataset of actually-sound and fallacious text with
| similar grammatical features you should be able to train a
| discriminator to distinguish between them using some other
| metric. Good luck with getting that data set though.
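| A minimal sketch of that setup, with made-up toy data and a
| bag-of-words classifier standing in for whatever discriminator
| you'd actually train (the setup, not a claim that it works):
      # Toy sound-vs-fallacious discriminator. Data and labels are
      # invented; whether surface features suffice is exactly the
      # open question being debated here.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      texts = [
          "All birds fly. Penguins are birds, so penguins fly.",
          "The street is wet, therefore it must have rained.",
          "All squares are rectangles; this is a square, so it is one.",
          "All men are mortal. Socrates is a man. Socrates is mortal.",
      ]
      labels = ["fallacious", "fallacious", "sound", "sound"]

      clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
      clf.fit(texts, labels)
      print(clf.predict(["The ground is wet, so it rained."]))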
| Terr_ wrote:
| > you should be able to train a discriminator to
| distinguish between them using some other metric
|
| Not when the better metrics are likely alien/incompatible
| with the discriminator's core algorithm!
|
| Then it's rather inconvenient news, because it means you
| have to develop something separate and novel.
|
| As the other poster already mentioned, if we can't even
| get them to reliably count how many objects are being
| referred to, how do you expect them to also handle
| logical syllogisms?
| nyrikki wrote:
| The Entscheidungsproblem tends to rear its ugly head here.
|
| Remember NP is equivalent to second-order logic with existential
| quantification, e.g. for any X there exists a Y.
|
| And that only gets you to truthy Trues, co-NP is another
| problem.
|
| ATP is hard, and while we get lucky with some constrained
| problems like type inference, which is pathological in
| its runtime but decidable, Presburger arithmetic is the
| highest form we know is decidable.
|
| It is a large reason CS uses science and falsification vs
| proofs.
|
| Godel and the difference between semantic and syntactic
| completeness is another rat hole.
| 343rwerfd wrote:
| > If an LLM was capable of logical reasoning
|
| the prompt interfaces + smartphone apps were, from the
| beginning, and still are, ongoing training for the next
| iteration; they provide massive RLHF for further improvements
| in already quite RLHFed advanced models.
|
| Whatever tokens they're extracting from all the interactions,
| the most valuable are those from metadata, like "correct answer
| in one shot", or "correct answer in three shots".
|
| The inputs and potentially the outputs can be gibberish, but
| the metadata can be mostly accurate given some
| implicit/explicit (the thumbs up, the "thanks" answers from
| users, maybe), human feedback.
|
| The RLHF refinement extracted from having the models face the
| entire human population, continuously prompted 24x7x365, in
| all languages, about all the topics of interest to human
| society, must be incredible. If you can extract just
| a single percentage of definitely "correct answers" from the
| total prompts answered, it should be massive compared to just a
| few thousands of QA dedicated RLHF people working on the models
| in the initial iterations of training.
|
| That was GPT-2, 3, 4: the initial iterations of training. With
| the models having evolved into more powerful (mathematical)
| entities, you can use them to train the next models. Which is
| almost certainly happening.
|
| My bet is that it's one of two:
|
| - The scaling thing is working spectacularly, they've seen
| linear improvement in blue/green deployments across the world +
| realtime RLHF, and maybe it is going a bit slow, but the
| improvements justify just a bit more waiting to get trained a
| more powerful, refined model, with incredibly better answers
| from even the previous datasets used (now more deeply inquired
| by the new models and the new massive RLHF data), if in a year
| they have a 20x GPT4, Claude, Gemini, whatever, they could be
| "jumping" to the next 40x GPT4, Claude, Gemini, a lot faster,
| if they have the most popular, prompted model in the market (in
| the world).
|
| - The scaling stuff has already sunk, they have seen the numbers
| and it doesn't add up by now, or they've seen diminishing returns
| coming. This is being firmly denied by anyone on the record or
| off the record.
| fsndz wrote:
| The thing is we probably can't build AGI:
| https://www.lycee.ai/blog/why-no-agi-openai
| knowaveragejoe wrote:
| This is almost a year old, thoughts on it today?
| fsndz wrote:
| LLMs still do not reason or plan. And nothing in their
| architecture, training, post-training points toward real
| reasoning as scaling continues. Thinking does not happen one
| token at a time.
| isaacfung wrote:
| I don't get why some people seem to think the only way to
| use an LLM is for next token prediction or that AGI has to be
| built using LLMs alone.
|
| You want planning, you can do monte carlo tree search and
| use LLM to evaluate which node to explore next. You want
| verifiable reasoning, you can ask it to generate code(an
| approach used by recent AI olympiad winner and many
| previous papers).
|
| What is even "planning", finding desirable/optimal
| solutions to some constraint satisfaction problems? Is the
| llm based minecraft bot voyager not doing some kind of
| planning?
|
| LLMs have their limitations. Then augment them with
| external data sources, code interpreters, give it ways to
| interact with real world/simulation environment.
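| To make the planning point concrete, here is a rough sketch of
| the "LLM scores nodes inside a search loop" idea -- plain
| best-first search rather than full MCTS, with placeholder
| functions standing in for the actual model calls:
      # Best-first search where a (placeholder) model call scores how
      # promising each partial plan looks; the loop itself is plain code.
      import heapq

      def llm_score(state: str) -> float:
          # Placeholder: in practice, prompt a model to rate `state`.
          return -0.01 * len(state)

      def expand(state: str) -> list[str]:
          # Placeholder: candidate next steps, e.g. sampled from a model.
          return [f"{state} -> step{i}" for i in range(3)]

      def best_first_plan(start: str, max_nodes: int = 20) -> str:
          frontier = [(-llm_score(start), start)]
          best = start
          for _ in range(max_nodes):
              if not frontier:
                  break
              _, best = heapq.heappop(frontier)
              for child in expand(best):
                  heapq.heappush(frontier, (-llm_score(child), child))
          return best

      print(best_first_plan("goal: craft a wooden pickaxe"))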
| threeseed wrote:
| The problem is that every time you ask the LLM to
| evaluate what to do next it will return a wrong answer X%
| of the time. Multiply that X across the number of steps
| and you have a system that is effectively useless. X
| today is ~5%.
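| Rough numbers, assuming a flat 5% per-step error rate and
| independent steps:
      # Chance that every step of a multi-step chain is correct,
      # assuming ~95% per-step accuracy and independent errors.
      per_step = 0.95
      for steps in (1, 5, 10, 20):
          print(steps, round(per_step ** steps, 3))
      # -> 1 0.95, 5 0.774, 10 0.599, 20 0.358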
|
| I do think LLMs could be used to assist in building a
| world model that could be a foundation for an AGI/agent
| system. But it won't be the major part.
| jhanschoo wrote:
| As the other reply has said, the article points to
| limitations of LLMs, but that doesn't preclude synthesizing
| a system of multiple components that uses LLMs. To the
| extent that I'm bearish on AI capabilities, I'll note that
| program synthesis / compression / general inductive
| reasoning which we expect intelligent agents to do is a
| computationally very hard problem.
| mitthrowaway2 wrote:
| Despite its title, this article merely seems to argue that LLMs
| will not themselves scale into AGI.
| FrameworkFred wrote:
| I'm playing around with LangChain and LangGraph
| (https://www.langchain.com/) and it seems like these enable just
| the sort of mechanisms mentioned.
| trte9343r4 wrote:
| > One could spin this idea even further and train several models
| with radically different world views by curating different
| training corpi that represent different sets of beliefs / world
| views.
|
| You can get good results by combining different models in chat,
| or even the same model with different parameters. The model
| usually gives up on hallucinations when challenged. Sometimes it
| pushes back and provides an explanation with sources.
|
| I have a script that puts models into dialog, moderates
| discussion and takes notes. I run this stuff overnight, so
| getting multiple choices speeds up iteration.
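| For anyone curious, the shape of such a script is roughly this,
| with a hypothetical `ask_model` wrapper in place of whatever
| chat API you actually use:
      # Sketch of a "models in dialog" loop with a moderator that
      # challenges each answer. `ask_model` is a placeholder wrapper.
      def ask_model(model: str, transcript: list[str]) -> str:
          # Placeholder for a real chat-completion call.
          return f"[{model}] reply given {len(transcript)} prior turns"

      def debate(question: str, models: list[str], rounds: int = 2):
          transcript = [f"Moderator: {question}"]
          for _ in range(rounds):
              for m in models:
                  transcript.append(ask_model(m, transcript))
                  # Push back so hallucinations get challenged.
                  transcript.append(f"Moderator: {m}, cite your sources.")
          return transcript

      for line in debate("Who proved the four colour theorem?",
                         ["model-a", "model-b"]):
          print(line)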
| wokwokwok wrote:
| Does anyone really believe that having a good corpus will remove
| hallucinations?
|
| Is this article even written by a person? Hard to know; they have
| a real blog with real articles, but stuff like this reads
| strangely. Maybe they're just not a native English speaker?
|
| > Hallucinations are certainly the toughest nut to crack and
| their negative impact is basically only slightly lessened by good
| confidence estimates and reliable citations (sources).
|
| > The impact of contradictions in the training data.
|
| (was this a prompt header you forgot to remove?)
|
| > LLM are incapable of "self-inspection" on their training data
| to find logical inconsistencies in it but in the input context
| window they should be able to find logical inconsistencies.
|
| Annnnyway...
|
| Hallucinations cannot be fixed by a good corpus in a non-
| deterministic (ie. temp > 0) LLM system where you've introduced a
| random factor.
|
| Period. QED. If you think it can, do more reading.
|
| The idea that a good corpus can _significantly improve_ the error
| rate is an open question, but the research I've seen _tends_
| fall on the side of "to some degree, but curating a 'perfect'
| dataset like that, of a sufficiently large size, is basically
| impossible".
|
| So, it's a pipe dream.
|
| Yes, if you could have a perfect corpus, absolutely, you would
| get a better model.
|
| ...but _how_ do you plan to _get_ that perfect corpus of training
| data?
|
| If it was that easy, the people spending _millions and millions
| of dollars_ making LLMs would have, I guess, probably come up
| with a solution for it. They're not stupid. If you could easily
| do it, it would already have been done.
|
| my $0.02:
|
| This is a dead end of research, because it's impossible.
|
| Using LLMs which are finetuned to evaluate the output of _other_
| LLMs and using multi-sample / voting to reduce the incidence of
| hallucinations that make it past the API barrier is both actively
| used and far, far more effective.
|
| (ie. it doesn't matter if your LLM hallucinates 1 time in 10; if
| you can reliably _detect_ that 1 instance, sample again, and
| return a non-hallucination).
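| The control flow is roughly this, with placeholder `generate`
| and `judge` functions standing in for the generator and the
| fine-tuned evaluator:
      # Sketch of sample-then-verify: draw several completions, keep
      # only those a separate judge accepts, then majority-vote.
      import collections

      def generate(prompt: str, n: int) -> list[str]:
          # Placeholder: n samples at temperature > 0.
          return ["Paris", "Paris", "Lyon"][:n]

      def judge(prompt: str, answer: str) -> bool:
          # Placeholder: evaluator LLM checks the answer is supported.
          return answer == "Paris"

      def answer_with_voting(prompt: str, n: int = 3):
          accepted = [a for a in generate(prompt, n) if judge(prompt, a)]
          if not accepted:
              return None  # refuse rather than return a suspect answer
          return collections.Counter(accepted).most_common(1)[0][0]

      print(answer_with_voting("What is the capital of France?"))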
|
| Other solutions... I'm skeptical; most of the ones I've seen
| haven't worked when you actually try to use them.
| thntk wrote:
| I've seen such articles more and more recently. In the past,
| when people had a vague idea, they had to do research before
| writing. During this process, they often realized some flaws
| and thoroughly revised the idea or gave up writing. Nowadays,
| research can be bypassed with the help of eloquent LLMs,
| allowing any vague idea to turn into a write-up.
| comcuoglu wrote:
| Thank you. It seems largely ignored that LLMs still sample from
| a set of tokens based on estimated probability and the given
| temperature - but not on factuality or the described
| "confidence estimate" in the article. RAG etc. only move the
| estimated probabilities into a more factually based direction,
| but do not change the sampling itself.
| JohnVideogames wrote:
| It's obvious that you can't solve hallucinations by curating
| the dataset when you think about arithmetic.
|
| It's trivial to create a corpus of True Maths Facts and verify
| that they're correct. But an LLM (as they're currently
| structured) will never generalise to new mathematical problems
| with 100% success rate because they do not fundamentally work
| like that.
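| (The "corpus of True Maths Facts" half really is trivial, which
| is the point -- a toy sketch:)
      # Generating a machine-verified corpus of arithmetic facts is
      # easy; the argument above is that this alone doesn't make the
      # sampling process generalise correctly.
      import random

      corpus = []
      for _ in range(5):
          a, b = random.randint(0, 999), random.randint(0, 999)
          fact = f"{a} + {b} = {a + b}"
          assert eval(fact.replace("=", "==")), "every entry is checked"
          corpus.append(fact)
      print(corpus)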
| tidenly wrote:
| I wonder to what extent hallucination is a result of a "must
| answer" bias?
|
| When sampling data all over the internet, your data set only
| represents people who _did_ write, _did_ respond to questions -
| with no representation of what they didn't. Add into that
| confidently wrong people - people who respond to questions on,
| say, StackOverflow, even if they're wrong, and suddenly you
| have a data set that prefers replying bullshit, because there's
| no data for the people who _didn't_ know the answer and wrote
| nothing.
|
| Inherently there's no representation in the datasets of "I
| don't know" null values.
|
| LLMs are _forced_ to reply, in contrast, so they "bullshit" a
| response that sounds right even though not answering or saying
| you don't know would be more appropriate - because no-one does
| that on the internet.
|
| I always assumed this was a big factor, but am I completely off
| the mark?
| sean_pedersen wrote:
| I wrote up this blog post in 30 mins, that's why it reads a
| little rough. I could not find explicit research on the impact
| of contradicting training data, only on the general need for
| high-quality training data.
|
| Maybe it is a pipe dream to drastically improve on
| hallucinations by curating a self-consistent data set but I am
| still interested in how much it actually impacts the quality of
| the final model.
|
| I described one possible way to create such a self-consistent
| data set in this very blog post.
| fatbird wrote:
| In my mind LLMs are already fatally compromised. Proximity
| matching via vector embeddings that offer no guarantees of
| completeness or correctness has already surrendered the
| essential advantage of technological advances.
|
| Imagine a dictionary where the words are only mostly in
| alphabetical order. If you look up a word and don't find it, you
| can't be certain it's not in there. It's as useful as asking
| someone else, or several other people, but its value _as a
| reference_ is zero, and there's no shortage of other people on
| the planet.
| TeMPOraL wrote:
| > _Proximity matching via vector embeddings that offer no
| guarantees of completeness or correctness has already
| surrendered the essential advantage of technological advances._
|
| On the contrary, it's arguably _the_ breakthrough that allowed
| us to model _concepts_ and meaning in computers. A sufficiently
| high-dimensional embedding space can model arbitrary
| relationships between embedded entities, which allows each of
| them to be defined in terms of its associations to all the
| others. This is more or less how we define concepts too, if you
| dig down into it.
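| In miniature, that is just "related things sit near each other"
| (hand-made toy vectors below; real ones come from a trained
| model):
      # Cosine similarity as a stand-in for "defined by associations":
      # nearby vectors ~ related concepts. Toy, made-up embeddings.
      import math

      emb = {
          "king":   [0.90, 0.80, 0.10],
          "queen":  [0.88, 0.82, 0.15],
          "banana": [0.10, 0.05, 0.90],
      }

      def cosine(u, v):
          dot = sum(a * b for a, b in zip(u, v))
          nu = math.sqrt(sum(a * a for a in u))
          nv = math.sqrt(sum(b * b for b in v))
          return dot / (nu * nv)

      print(cosine(emb["king"], emb["queen"]))   # high: related
      print(cosine(emb["king"], emb["banana"]))  # low: unrelated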
|
| > _Imagine a dictionary where the words are only mostly in
| alphabetical order. If you look up a word and don't find it,
| you can't be certain it's not in there._
|
| It's already the case with dictionaries. Dictionaries have
| mistakes, words out of order; they get outdated, and most
| importantly, they're _descriptive_. If a word isn't in it, or
| isn't defined in a particular way, you cannot be certain it
| doesn't exist or doesn't mean anything other than what the
| dictionary says it does.
|
| > _It's as useful as asking someone else, or several other
| people_
|
| Which is _very_ useful, because it _saves you the hassle of
| dealing with other people_. Especially when it's as useful as
| asking _an expert_, which saves you the effort of finding one.
| Now scale that up to being able to ask about whole topics of
| interest, instead of single words.
|
| > _its value as a reference is zero_
|
| Obviously. So is the value of asking even an expert for an
| immediate, snap answer, and going with that.
|
| > _and there's no shortage of other people on the planet_
|
| Again, dealing with people is stupidly expensive in time,
| energy and effort, starting with having to find _the right
| people_. LLM is just a function call away.
| fatbird wrote:
| Technology advances by supplanting human mechanisms, not by
| amplifying or cheapening them. A loom isn't a more nimble
| hand, it's a different mechanical approach to weaving. Wheels
| and roads aren't better legs, they're different conveyances.
| LLMs as a replacement for dealing with people but offering
| only the same certainty aren't an advance.
|
| LLMs do math by trying to match an answer to a prompt.
| Mathematica does better than that.
| TeMPOraL wrote:
| Wheels and roads do the same thing as legs in several major
| use cases, only they do it better. Same with jet engines
| and flapping wings. Same with loom vs. hand, and same with
| LLMs vs. people.
|
| > _LLMs do math by trying to match an answer to a prompt.
| Mathematica does better than that._
|
| Category error. Pencil and paper or theorem prover are
| better at doing complex math than snap judgment of an
| expert, but an expert using those tools according to their
| judgement is the best. LLMs compete with snap judgement,
| not heavily algorithmic tasks.
|
| Still, it's a somewhat pointless discussion, because the
| premise behind your argument is that LLMs aren't a big
| breakthrough, which is in disagreement with facts obvious
| to anyone who hasn't been living under a rock for the past
| year.
| thntk wrote:
| We know high-quality data can help, as evidenced by the Phi
| models. However, this alone can never eliminate hallucination
| because data can never be both consistent and complete. Moreover,
| hallucination is an inherent flaw of intelligence in general if
| we think of intelligence as (lossy) compression.
| jillesvangurp wrote:
| There has been steady improvement since the release of chat gpt
| into the wild, which is still only less than two years ago (easy
| to forget). I've been getting a lot of value out of chat gpt 4o,
| like lots of other people. I find with each model generation my
| dependence on this stuff for day to day work goes up as the
| soundness of its answers and reasoning improve.
|
| There are still lots of issues and limitations but it's a very
| different experience than with gpt 3 early on. A lot of the
| smaller OSS models are a bit of a mixed bag in terms of
| hallucinations and utility. But they can be useful if you apply
| some skills. Half the success is actually learning to prompt
| these things and learning to spot when it starts to hallucinate.
|
| One thing I find useful is to run ideas by it in kind of a
| socratic mode where I try to get it to flesh out brain farts I
| have for algorithms or other kinds of things. This can be coding
| related topics but also non technical kinds of things. It will
| get some things wrong and when you spot it, you can often get a
| better answer simply by pointing it out and maybe nudging it in a
| different direction. A useful trick with code is to also let it
| generate tests for its own code. When the tests fail to run, you
| can ask it to fix it. Or you can ask it for some alternative
| implementation of the same thing. Often you get something that is
| 95% close to what you asked for and then you can just do the
| remaining few percent yourself.
|
| Doing TDD with an LLM is a power move. Good tests are easy enough
| to understand and once they pass, it's hard to argue with the
| results. And you can just ask it to identify edge cases and add
| more tests for those. LLMs take a lot of the tediousness out of
| writing tests. I'm a big picture kind of guy and my weakness is
| skipping unit tests to fast forward to having working code.
| Spelling out all the stupid little assertions is mindnumbingly
| stupid work that I don't have to bother with anymore. I just let
| AI generate good test cases. LLMs make TDD a lot less tedious.
| It's like having a really diligent junior pair programmer doing
| all the easy bits.
|
| And if you apply SOLID principles to your own code (which is a
| good thing in any case), a lot of code is self contained enough
| that you can easily fit it in a small file that is small enough
| to fit into the context window of chat gpt (which is quite large
| these days). So, a thing I often do is just gather relevant code,
| copy-paste it and then tell it to make some reasonable assumptions
| about missing things and make some modifications to the code. Add
| a function that does X; how would I need to modify this code to
| address Y; etc. I also get it to iterate on its own code. And a
| neat trick is to ask it to compare its solution to other
| solutions out there and then get it to apply some of the same
| principles and optimizations.
|
| One thing with RAG is that we're still underutilizing LLMs for
| this. It's a lot easier to get an LLM to ask good questions than
| it is to get them to provide the right answers. With RAG, you can
| use good old information retrieval to answer the questions. IMHO
| limiting RAG to just vector search is a big mistake. It actually
| doesn't work that well for structured data and you could just ask
| it to query some API based on a specification or use some SQL,
| XPath, or whatever query language. And why just ask 1 question?
| Maybe engage in a dialog where it zooms in on the solution via
| querying and iteratively coming up with better questions until
| the context has all the data needed to come up with the answer.
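| Sketched out, the loop I mean looks something like this, with
| placeholder `llm` and `search` functions in place of a real
| model and retrieval backend:
      # Iterative retrieve-and-refine: the model asks for what it is
      # missing, retrieval fills the context, repeat, then answer.
      def llm(prompt: str) -> str:
          # Placeholder for a chat-completion call.
          if prompt.startswith("Answer"):
              return "It was founded in 1652."
          return "DONE" if "evidence:" in prompt else "library founding year"

      def search(query: str) -> str:
          # Placeholder for SQL / XPath / vector / API retrieval.
          return "evidence: the town library was founded in 1652"

      def answer_iteratively(question: str, max_rounds: int = 3) -> str:
          context = ""
          for _ in range(max_rounds):
              ask = llm(f"Question: {question}\nContext: {context}\n"
                        "Ask one retrieval query, or say DONE.")
              if ask.strip() == "DONE":
                  break
              context += search(ask) + "\n"
          return llm(f"Answer from this context only:\n{context}\n"
                     f"Question: {question}")

      print(answer_iteratively("When was the town library founded?"))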
|
| If you think about it, this is how most knowledge workers address
| problems themselves. They are not oracles of wisdom that know
| everything but merely aggregators and filters of external
| knowledge. A good knowledge worker / researcher / engineer is one
| that knows how to ask the right questions in order to come up
| with an iterative process that converges on a solution.
|
| Once you stop using LLMs as one shot oracles that give you an
| answer given a question, they become a lot more useful.
|
| As for AGI, a human AI enhanced by AGI is a powerful combination.
| I kind of like the vision behind neuralink where the core idea is
| basically improving the bandwidth between our brains and external
| tools and intelligence. Using a chat bot is a low bandwidth kind
| of thing. I actually find it tedious.
| stephc_int13 wrote:
| This is very close to my use case with Claude 3.5, and I used
| to only write tests when I was forced to, now it is part of the
| routine to double check everything while improving the
| codebase. I also really enjoy the socratic discussions when
| thinking about new ideas. What it says is mostly generic
| Wikipedia quality but this is useful when I am exploring
| domains where I have knowledge gaps.
| luke-stanley wrote:
| As I understand it: the Phi models are trained on much more
| selective training data; the Tiny Stories research was one of the
| starts of that. They used GPT-4 to make stories, encyclopedia-
| like training data, and code for Phi to learn from, which probably
| helps with logical structuring too. I think they did add in some
| real web data too, though it was fairly selective.
|
| Maybe something between Cyc and Google's math and geometry LLM's
| could help.
| lsy wrote:
| We can't develop a universally coherent data set because what we
| understand as "truth" is so intensely contextual that we can't
| hope to cover the amount of context needed to make the things
| work how we want, not to mention the numerous social situations
| where writing factual statements would be awkward or disastrous.
|
| Here are a few examples of statements that are not "factual" in
| the sense of being derivable from a universally coherent data
| set, and that nevertheless we would expect a useful intelligence
| to be able to generate:
|
| "There is a region called Hobbiton where someone named Frodo
| Baggins lives."
|
| "We'd like to announce that Mr. Ousted is transitioning from his
| role as CEO to an advisory position while he looks for a new
| challenge. We are grateful to Mr. Ousted for his contributions
| and will be sad to see him go."
|
| "The earth is round."
|
| "Nebraska is flat."
| smokel wrote:
| _> We can't develop a universally coherent data set because_
|
| Yet every child seems to manage, when raised by a small
| village, over a period of about 18 years. I guess we just need
| to give these LLMs a little more love and attention.
| bugglebeetle wrote:
| Or maybe hundreds of millions of years of evolutionary
| pressure to build unbelievably efficient function
| approximation.
| antisthenes wrote:
| And then you go out into the real world, talk to real adults,
| and discover that the majority of people don't have a
| coherent mental model of the world, and have completely
| ridiculous ideas that aren't anywhere near an approximation
| of the real physical world.
| mistermann wrote:
| > and discover that the majority of people don't have a
| coherent mental model of the world
|
| "Coherent" is doing a lot of lifting here. All humans have
| highly flawed models, and we've been culturally conditioned
| to grade on a curve to hide the problem from ourselves.
| js8 wrote:
| You're right. We don't really know how to handle uncertainty
| and fuzziness in logic properly (to avoid logical
| contradictions). There have been many mathematical attempts to
| model uncertainty (just to name a few - probability, Dempster-
| Shafer theory, fuzzy logic, non-monotone logics, etc.), but
| they all suffer from some kind of paradox.
|
| At the end of the day, none of these theoretical techniques
| prevailed in the field of AI, and we ended up with, empirically
| successful, neural networks (and LLMs specifically). We know
| they model uncertainty but we have no clue how they do it
| conceptually, or whether they even have a coherent conception
| of uncertainty.
|
| So I would pose that the problem isn't that we don't have the
| technology, but it's rather we don't understand what we want
| from it. I am yet to see a coherent theory of how humans
| manipulate the human language to express uncertainty that would
| encompass broad (if not all) range of how people use language.
| Without having that, you can't define what is a hallucination
| of an LLM. Maybe it's making a joke (some believe that point of
| the joke is to highlight a subtle logical error of some sort),
| because, you know, it read a lot of them and it concluded
| that's what humans do.
|
| So AI eventually prevailed (over humans) in fields where we
| were able to precisely define the goal. But what is our goal
| vis-a-vis human language? What do we want from AI to answer to
| our prompts? I think we are stuck at the lack of definition of
| that.
| RamblingCTO wrote:
| My biggest problem with them is that I can't quite get it to
| behave like I want it to. I built myself a "therapy/coaching"
| telegram bot (I'm healthy, but like to reflect a lot, no
| worries). I even built a self-reflecting memory component that
| generates insights (sometimes spot on, sometimes random af). But
| the more I use it, the more I notice that neither the memory nor
| the prompt matters much. I just can't get it to behave like a
| therapist would. So in other words: I can't find the inputs to
| achieve a desirable prediction from the SOTA LLMs. And I think
| that's a bigger problem for them not to be a shallow hype.
| coldtea wrote:
| > _I just can 't get it to behave like a therapist would_
| import time import random SESSION_DURATION =
| 50 * 60 start_time = time.time() while
| True: current_time = time.time() elapsed_time =
| current_time - start_time if elapsed_time >=
| SESSION_DURATION: print("Our time is up. That will
| be $150. See you next week!") break
| _ = input("") print(random.choice(["Mmm hmm", "Tell me
| more", "How does that make you feel?"]))
| time.sleep(1)
|
| Thank me later!
| RamblingCTO wrote:
| haha, good one! although I'm German and it was free for me
| when I did it. I just had the best therapist. $150 a session
| is insane!
| DolphinAsa wrote:
| I'm surprised he didn't mention the way that we are solving the
| issue at Amazon. It's not a secret at this point: giving the
| LLMs hands, or agentic systems to run code or do things that get
| feedback in a loop, DRAMATICALLY REDUCES hallucinations.
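| The general shape of that loop, as a sketch (not any specific
| internal system; `llm_write_code` is a placeholder for the model
| call):
      # Run the generated code, feed the error back, retry. Execution
      # grounds the output: hallucinated APIs tend to fail loudly.
      import subprocess
      import sys
      import tempfile

      def llm_write_code(task: str, feedback: str = "") -> str:
          # Placeholder: ask a model for code, attaching the last error.
          return "print(sum(range(1, 101)))"

      def solve_with_feedback(task: str, attempts: int = 3) -> str:
          feedback = ""
          for _ in range(attempts):
              code = llm_write_code(task, feedback)
              with tempfile.NamedTemporaryFile(
                      "w", suffix=".py", delete=False) as f:
                  f.write(code)
              run = subprocess.run([sys.executable, f.name],
                                   capture_output=True, text=True)
              if run.returncode == 0:
                  return run.stdout
              feedback = run.stderr
          return "gave up"

      print(solve_with_feedback("sum of the first 100 integers"))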
| nyrikki wrote:
| > ...manually curate a high-quality (consistent) text corpus
| based on undisputed, well curated wikipedia articles and battle
| tested scientific literature.
|
| This rests on the mistaken assumption that science is about
| objective truth.
|
| It is confusing the map for the territory. Scientific models are
| intended to be useful, not perfect.
|
| Statistical learning, vs symbolic learning is about existential
| quantification vs universal quantification respectively.
|
| All models are wrong, some are useful; this applies to even the
| most unreasonably accurate versions like QFT and GR.
|
| Spherical cows, no matter how useful, are hotly debated outside of
| the didactic half truths of low level courses.
|
| The corpus that the above seeks doesn't exist in academic
| circles, only in popular science where people don't see that
| practical, useful models are far more important than 'correct'
| ones.
| mitthrowaway2 wrote:
| LLMs don't only hallucinate because of mistaken statements in
| their training data. It just comes hand-in-hand with the model's
| ability to remix, interpolate, and extrapolate answers to other
| questions that aren't directly answered in the dataset. For
| example if I ask ChatGPT a legal question, it might cite as
| precedent a case that doesn't exist at all (but which seems
| plausible, being interpolated from cases that do exist). It's not
| necessarily because it drew that case from a TV episode. It works
| the same way that GPT-3 wrote news releases that sounded
| convincing, matching the structure and flow of real articles.
|
| Training only on factual data won't solve this.
|
| Anyway, I can't help but feel saddened sometimes to see our
| talented people and investment resources being drawn in to
| developing these AI chatbots. These problems are solvable, but
| are we really making a better world by solving them?
| CamperBob2 wrote:
| _These problems are solvable, but are we really making a better
| world by solving them?_
|
| When you ask yourself that question -- and you do ask yourself
| that, right? -- what's _your_ answer?
| mitthrowaway2 wrote:
| I do, all the time! My answer is "most likely not". (I
| assumed that answer was implied by my expressing sadness
| about all the work being invested in them.) This is why,
| although I try to keep up-to-date with and understand these
| technologies, I am not being paid to develop them.
| dweinus wrote:
| 100% I think the author is really misunderstanding the issue
| here. "Hallucination" is a fundamental aspect of the design of
| Large Language Models. Narrowing the distribution of the
| training data will reduce the LLM's ability to generalize, but
| it won't stop hallucinations.
| sean_pedersen wrote:
| I agree in that a perfectly consistent dataset won't
| completely stop _statistical_ language models from
| hallucinating but it will reduce it. I think it is
| established that data quality is more important than
| quantity. Bullshit in -> bullshit out, so a focus on data
| quality is good and needed IMO.
|
| I am also saying LM output should cite sources and come with
| confidence scores (which reflect how much the output is in or
| out of the training distribution).
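| One cheap proxy for such a confidence score is the average token
| log-probability of the answer (made-up numbers below; real ones
| would come from the model where per-token logprobs are exposed):
      # Geometric-mean token probability as a rough confidence proxy.
      import math

      token_logprobs = [-0.05, -0.10, -0.02, -2.30, -0.08]  # made up
      avg = sum(token_logprobs) / len(token_logprobs)
      confidence = math.exp(avg)
      print(round(confidence, 3))  # ~0.6; well-supported answers near 1.0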
| rtkwe wrote:
| I think the problem is you need an extremely large quantity
| of data just to get the machine to work in the first place.
| So much so that there may not be enough to get it working
| on just "quality" data.
| Animats wrote:
| Plausible idea which needs a big training budget. Was it funded?
___________________________________________________________________
(page generated 2024-07-18 23:09 UTC)