[HN Gopher] Overcoming the limits of current LLMs
       ___________________________________________________________________
        
       Overcoming the limits of current LLMs
        
       Author : sean_pedersen
       Score  : 77 points
       Date   : 2024-07-18 00:35 UTC (22 hours ago)
        
 (HTM) web link (seanpedersen.github.io)
 (TXT) w3m dump (seanpedersen.github.io)
        
       | Carrok wrote:
       | I wish he went into how to improve confidence scores, though I
       | guess training on better data to begin with should improve
       | results and thus confidence.
        
       | RodgerTheGreat wrote:
       | One of the main factors that makes LLMs popular today is that
       | scaling up the models is a simple and (relatively) inexpensive
       | matter of buying compute capacity and scraping together more raw
       | text to train them. Without large and highly diverse training
       | datasets to construct base models, LLMs cannot produce even the
       | superficial appearance of good results.
       | 
       | Manually curating "tidy", properly-licensed and verified datasets
       | is immensely more difficult, expensive, and time-consuming than
       | stealing whatever you can find on the open internet. Wolfram
       | Alpha is one of the more successful attempts in that curation-
       | based direction (using good-old-fashioned heuristic techniques
       | instead of opaque ML models), and while it is very useful and
       | contains a great deal of factual information, it does not conjure
       | appealing fantasies of magical capabilities springing up from
       | thin air and hands-off exponential improvement.
        
         | totetsu wrote:
         | It's not unethical if people in positions of privilege and
         | power do it to maintain their rightful position of privilege
         | and power.
        
           | dang wrote:
           | Please don't post in the flamewar style to HN. It degrades
           | discussion and we're trying to go in the opposite direction
           | here, to the extent that is possible on the internet.
           | 
           | https://news.ycombinator.com/newsguidelines.html
        
         | threeseed wrote:
         | > properly-licensed and verified datasets is immensely more
         | difficult, expensive
         | 
         | Arguably the bigger problem is that many of those datasets e.g.
         | WSJ articles are proprietary and can be exclusively licensed
         | like we've seen recently with OpenAI.
         | 
          | So we end up in a situation where competition is simply not
          | possible.
        
           | piva00 wrote:
           | > Arguably the bigger problem is that many of those datasets
           | e.g. WSJ articles are proprietary and can be exclusively
           | licensed like we've seen recently with OpenAI.
           | 
            | > So we end up in a situation where competition is simply
            | not possible.
           | 
            | Exactly, and technofeudalism advances a little further into a
            | new fiefdom.
            | 
            | OpenAI is trying to create its moat by locking up training
            | data, probably attempting to keep competitors from training
            | on the same datasets it has been licensing, at least for a
            | while. Training data is the only possible moat for LLMs;
            | models seem to be advancing quite well across different
            | companies, but as mentioned here, a tidy training dataset is
            | the actual gold.
        
             | gessha wrote:
              | It's a minefield no matter how you approach the problem.
              | 
              | If you treat the web as a free-for-all and scrape freely,
              | you get sued by the content platforms for copyright or
              | terms-of-service violations.
             | 
             | If you license the content, you let the highest bidder get
             | the content.
             | 
             | No matter what happens, capital wins.
        
       | darby_nine wrote:
       | Man it seems like the ship has sailed on "hallucination" but it's
       | such a terrible name for the phenomenon we see. It is a major
       | mistake to imply the issue is with perception rather than
       | structural incompetence. Why not just say "incoherent output"?
       | It's actually descriptive and doesn't require bastardizing a word
       | we already find meaningful to mean something completely
       | different.
        
         | jwuphysics wrote:
          | > Why not just say "incoherent output"?
          | 
          | Because the biggest problem with hallucinations is that the
          | output is usually coherent but factually incorrect. I agree
          | that "hallucination" isn't the best word for it... perhaps
          | something like "confabulation" is better.
        
           | linguistbreaker wrote:
           | I appreciated a post on here recently that likened AI
           | hallucination to 'bullshitting'. It's coherent, even
           | plausible output without any regard for the truth.
        
             | wkat4242 wrote:
              | While I have absolutely no issues with the word "shit" in
              | popular terms, I'd normally like to reserve it for
              | situations where there's actually intended malice, like in
              | "enshittification", rather than for a merely imperfect
              | technology as we have here.
              | 
              | Many people object to the term enshittification because of
              | the foul language, but I think it fits very well because
              | the principle it describes is itself so very nasty. But
              | that's not at all the case here.
        
               | pessimizer wrote:
               | "Bullshitting" isn't a new piece of jargon, it's a common
               | English word of many decades vintage, and is being used
               | in its dictionary sense here.
        
             | MattPalmer1086 wrote:
              | It's more accurate to say that _all_ output is
              | bullshitting, not just the parts we call hallucinations.
              | Some of it is true, some isn't. The model doesn't know or
              | care.
        
           | nl wrote:
            | And we use "hallucination" because in the ancient times, when
            | generative AI meant image generation, models would
            | "hallucinate" extra fingers etc.
           | 
           | The behavior of text models is similar enough that the
           | wording stuck, and it's not all that bad.
        
         | exmadscientist wrote:
         | "Hallucinations" implies that someone isn't of sound mental
          | state. We can argue forever about what that means for an LLM and
         | whether that's appropriate, but I think it's absolutely the
         | right attitude and approach to be taking toward these things.
         | 
         | They simply do not behave like humans of sound minds, and
         | "hallucinations" conveys that in a way that "confabulations" or
         | even "bullshit" does not. (Though "bullshit" isn't bad either.)
        
           | kimixa wrote:
            | I don't really immediately link "hallucinations" with
            | "unsound mind" - most people I know have experienced auditory
            | hallucinations, often things like not being sure whether the
            | doorbell went off, or whether someone said their name.
           | 
           | And I couldn't find a single one of my friends who hadn't
           | experienced "phantom vibration syndrome".
           | 
            | Both, I'd say, are "hallucinations", without any real
            | negative connotation.
        
           | devjab wrote:
            | I disagree with this take because LLMs are always
            | hallucinating. When they get things right, it's because they
            | are lucky. Yes, yes, it's more complicated than that, but the
            | essence of LLMs is that they are very good at being lucky. So
            | good that they will often give you better results than random
            | search engine clicks, but not good enough to be useful for
            | anything important.
            | 
            | I think calling the times they get things wrong
            | "hallucinations" is largely an advertising trick, so that
            | they can sort of fit LLMs into how all IT is sometimes
            | "wonky" and sell their fundamentally flawed technology more
            | easily. I also think it works extremely well.
        
             | Terr_ wrote:
              | To offer a satirical analogy: "Lastly, I want to reassure
              | investors and members of the press that we take these
              | concerns very seriously: Hindenburg 2 will contain only
              | _normal and unreactive_ hydrogen gas, and not the _rare and
              | unusual_ explosive kind, which is merely a temporary hurdle
              | in this highly dynamic and growing field."
              | 
              | Edit: In retrospect, perhaps a better analogy would involve
              | gasoline, as its explosive nature is what's actively being
              | exploited in normal use.
        
               | marcosdumay wrote:
               | Yes (to the edit), an analogy with making planes safer by
               | only using non-flammable fuels is perfect.
        
               | Terr_ wrote:
               | I expect most people have already filled in the blanks,
               | but for completeness:
               | 
               | "Lastly, I want to reassure investors and members of the
               | press that we take these concerns very seriously: The
                | Ford Pinto-II will contain only normal and stable
                | gasoline, and not the rare and unusual burning kind,
                | which is merely a temporary hurdle in this highly dynamic
                | and explos--er-- _fast-growing_ field."
        
             | mewpmewp2 wrote:
              | But the point is, isn't hallucinating about having
              | malformed, altered or out-of-touch input, rather than about
              | producing inaccurate output yourself?
              | 
              | It is the memory pathways leading them astray. It could be
              | thought of as a memory system that, at a certain point, can
              | no longer be fully sure whether the connections it has come
              | from actual training or were created accidentally.
        
               | Terr_ wrote:
               | > isn't hallucinating about having malformed, altered or
               | out of touch input rather than producing inaccurate
               | output yourself?
               | 
               | I suppose so, in the sense that someone could simply be
               | _lying_ about pink elephants instead of seeing them.
                | However it's hard to argue that the machine _knows_ the
               | "right" answer and is (intelligently?) deceiving us.
               | 
               | > It is the memory pathways leading them astray.
               | 
               | I don't think it's a "memory" issue as much as a "they
               | don't operate the way we like to think they do" issue.
               | 
               | Suppose a human is asked to describe different paintings
               | on the wall of an art gallery. Sometimes their statements
               | appear valid and you nod along, and sometimes the
               | statements are so wrong that it alarms you, because "this
               | person is hallucinating."
               | 
               | Now consider how the entire situation is flipped by
                | finding out one additional fact... _They're actually
                | totally blind._
               | 
               | Is it a lie? Is it a hallucination? Does it matter?
               | Either way you must dramatically re-evaluate what their
               | "good" outputs really mean and whether they can be used.
        
               | mewpmewp2 wrote:
                | To me it's more like: imagine that you have read a lot of
                | books throughout your life, and then someone comes in and
                | asks you a question and you try to answer from memory.
                | You get beaten when you say something like "I don't
                | know", and you get rewarded if you answer accurately. You
                | also get beaten if you answer inaccurately, but
                | eventually you learn that if you just say something, you
                | might happen to be accurate and you will not get beaten.
                | So you learn to always answer to the best of your
                | knowledge, while never saying that you don't know,
                | because that decreases the chances of getting beaten up.
                | You are not intentionally lying, you are just hoping that
                | whatever you say is accurate, to the best you can do with
                | the neural connections you've built up in your brain.
                | 
                | Say you ask me for the birthdate of some obscure
                | political figure from history? I'm going to try to feel
                | out what period in history the name seems to belong to,
                | make my best guess based on that, and then say some
                | random year and date. That just has the lowest odds of
                | getting me beaten. Was I hallucinating? No, I was just
                | trying not to get beaten.
        
           | linguistbreaker wrote:
           | How about "dream-reality confusion (DRC)" ?
        
             | TeaBrain wrote:
             | There is no dream-reality separation in an LLM, or really
             | any conception of dreams or reality, so I don't think the
             | term makes sense. Hallucination works fine to describe the
             | phenomenon. LLMs work by coalescing textual information.
             | LLM hallucinations occur due to faulty or inappropriate
             | coalescence of information, which is similar to what occurs
             | with actual hallucinations.
        
           | marcosdumay wrote:
           | Bullshit is the most descriptive one.
           | 
            | LLMs don't do it because they are out of their right mind.
            | They do it because every single answer they give is invented
            | with regard only to form, not correctness.
           | 
           | But yeah, that ship has already sailed.
        
           | mistermann wrote:
           | "Sound" minds for humans is graded on a curve, and this trick
           | is not acknowledged, _or popular_.
        
         | wkat4242 wrote:
         | Hallucination is one single word. Even if it's not perfect it's
         | great as a term. It's easy to remember and people new to the
         | term already have an idea of what it entails. And the term will
         | bend to cover what we take it to mean anyway. Language is
         | flexible. Hallucination in an LLM context doesn't have to be
          | the exact same as in a human context. All that matters is that
         | we're aligned on what we're talking about. It's already
         | achieved this purpose.
        
         | freilanzer wrote:
         | Hallucination perfectly describes the phenomenon.
        
         | dgs_sgd wrote:
         | I think calling it hallucination is because of our tendency to
         | anthropomorphize things.
         | 
         | Humans hallucinate. Programs have bugs.
        
           | threeseed wrote:
           | The point is that this isn't a bug.
           | 
            | It's inherent to how LLMs work and is expected, although
            | undesired, behaviour.
        
         | TeaBrain wrote:
         | The problem with "incoherent output" is that it isn't
         | describing the phenomenon at all. There have been cases where
         | LLM output has been incoherent, but modern LLM hallucinations
          | are usually coherent and well-constructed, just completely
         | fabricated.
        
         | breatheoften wrote:
         | I think it's a pretty good name for the phenomenon -- maybe the
         | only problem with the term is that what models are doing is
         | 100% hallucination all the time -- it's just that when the
         | hallucinations are useful we don't call them hallucinations --
         | so maybe that is a problem with the term (not sure if that's
         | what you are getting at).
         | 
          | But there's nothing at all different about what the model is
          | doing between these cases -- the models are hallucinating all
          | the time and have no ability to assess, in any meaningful way,
          | whether their output is "right" or "wrong", useful or
          | non-useful.
        
         | Slow_Hand wrote:
          | I prefer "confabulate" to describe this phenomenon.
         | 
         | : to fill in gaps in memory by fabrication
         | 
         | > In psychology, confabulation is a memory error consisting of
         | the production of fabricated, distorted, or misinterpreted
         | memories about oneself or the world.
         | 
         | It's more about coming up with a plausible explanation in the
         | absence of a readily-available one.
        
         | adrianmonk wrote:
         | On a literal level, hallucinations are perceptual.
         | 
         | But "hallucination" was already (before LLMs) being used in a
         | figurative sense, i.e. for abstract ideas that are made up out
         | of nothing. The same is also true of other words that were
         | originally visual, like "illusion" and "mirage".
        
       | ainoobler wrote:
        | The article suggests a useful line of research. Train an LLM to
        | detect logical fallacies and then see if that can be bootstrapped
        | into something useful, because it's pretty clear that the issues
        | with LLMs stem from their lack of logical capabilities. If an LLM
        | were capable of logical reasoning, then it would be obvious when
        | it was generating made-up nonsense instead of referencing existing
        | sources of consistent information.
        
         | dtx1 wrote:
         | I think we should start smaller and make them able to count
         | first.
        
           | Terr_ wrote:
           | Yeah, you can train an LLM to recognize the vocabulary and
           | grammatical features of logical fallacies... Except the
           | nature of fallacies is that they _look real_ on that same
            | linguistic level, so those features aren't distinctive for
           | that purpose.
           | 
           | Heck, I think detecting sarcasm would be an easier goal, and
           | still tricky.
        
             | idle_zealot wrote:
             | > Except the nature of fallacies is that they look real on
             | that same linguistic level, so those features aren't
             | distinctive for that purpose
             | 
             | Well that's actually good news. With a large enough
             | labelled dataset of actually-sound and fallacious text with
             | similar grammatical features you should be able to train a
             | discriminator to distinguish between them using some other
             | metric. Good luck with getting that data set though.
        
               | Terr_ wrote:
               | > you should be able to train a discriminator to
               | distinguish between them using some other metric
               | 
               | Not when the better metrics are likely alien/incompatible
               | to the discriminator's core algorithm!
               | 
               | Then it's rather inconvenient news, because it means you
               | have to develop something separate and novel.
               | 
               | As the other poster already mentioned, if we can't even
               | get them to reliably count how many objects are being
               | referred to, how do you expect them to also handle
               | logical syllogisms?
        
               | nyrikki wrote:
                | The Entscheidungsproblem tends to rear its ugly head.
                | 
                | Remember NP is equivalent to second-order logic with
                | existential quantification, i.e. there exists a relation
                | X such that some first-order property holds.
                | 
                | And that only gets you to truthy Trues; co-NP is another
                | problem.
                | 
                | ATP is hard, and while we get lucky with some constrained
                | problems like type inference, which is pathological in
                | its runtime but decidable, Presburger arithmetic is about
                | the highest form we know is decidable.
                | 
                | It is a large reason CS uses science and falsification vs
                | proofs.
                | 
                | Gödel and the difference between semantic and syntactic
                | completeness is another rat hole.
        
         | 343rwerfd wrote:
         | > If an LLM was capable of logical reasoning
         | 
          | The prompt interfaces and smartphone apps were, from the
          | beginning, and still are, ongoing training for the next
          | iteration; they provide massive RLHF for further improvements
          | to already quite RLHFed advanced models.
          | 
          | Whatever tokens they're extracting from all the interactions,
          | the most valuable are those from metadata, like "correct answer
          | in one shot" or "correct answer in three shots".
          | 
          | The inputs and potentially the outputs can be gibberish, but
          | the metadata can be mostly accurate given some
          | implicit/explicit human feedback (the thumbs up, or maybe the
          | "thanks" answers from users).
         | 
          | The RLHF refinement extracted from having the models face the
          | entire human population, continuously prompted 24x7x365, in all
          | languages, about all the topics interesting to human society,
          | must be incredible. If you can extract even a small percentage
          | of definitely "correct answers" from the total prompts
          | answered, it should be massive compared to the few thousand
          | dedicated RLHF QA people working on the models in the initial
          | iterations of training.
          | 
          | That was GPT-2, 3 and 4, the initial iterations of the
          | training. Now that the models have evolved into more powerful
          | (mathematical) entities, you can use them to train the next
          | models. As is almost certainly happening.
         | 
          | My bet is on one of two things:
          | 
          | - The scaling thing is working spectacularly: they've seen
          | linear improvement in blue/green deployments across the world
          | plus realtime RLHF, and maybe it is going a bit slowly, but the
          | improvements justify waiting a bit longer to train a more
          | powerful, refined model, with incredibly better answers even
          | from the previously used datasets (now more deeply mined by the
          | new models and the new massive RLHF data). If in a year they
          | have a 20x GPT-4, Claude, Gemini, whatever, they could be
          | "jumping" to the next 40x a lot faster, if they have the most
          | popular, most prompted model in the market (in the world).
          | 
          | - The scaling stuff has already sunk: they have seen the
          | numbers and it doesn't add up by now, or they've seen
          | diminishing returns coming. This is being firmly denied by
          | everyone, on the record and off the record.
        
       | fsndz wrote:
       | The thing is we probably can't build AGI:
       | https://www.lycee.ai/blog/why-no-agi-openai
        
         | knowaveragejoe wrote:
         | This is almost a year old, thoughts on it today?
        
           | fsndz wrote:
           | LLMs still do not reason or plan. And nothing in their
            | architecture, training, or post-training points toward real
           | reasoning as scaling continues. Thinking does not happen one
           | token at a time.
        
             | isaacfung wrote:
              | I don't get why some people seem to think the only way to
              | use an LLM is next-token prediction, or that AGI has to be
              | built using LLMs alone.
              | 
              | You want planning? You can do Monte Carlo tree search and
              | use an LLM to evaluate which node to explore next. You want
              | verifiable reasoning? You can ask it to generate code (an
              | approach used by a recent AI olympiad winner and many
              | previous papers).
              | 
              | What even is "planning", finding desirable/optimal
              | solutions to some constraint satisfaction problems? Is the
              | LLM-based Minecraft bot Voyager not doing some kind of
              | planning?
              | 
              | LLMs have their limitations. Then augment them with
              | external data sources and code interpreters, and give them
              | ways to interact with a real-world or simulation
              | environment.
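              | 
              | To make that concrete, here is a minimal sketch (nobody's
              | production code) of MCTS where the usual random rollout is
              | replaced by an LLM scoring how promising a node looks;
              | llm_score() and expand() are placeholders, not a real API:
              | 
              # Minimal MCTS sketch with an LLM standing in as the
              # evaluation function. llm_score() and expand() are stubs.
              import math
              import random

              def llm_score(state):
                  # Placeholder for "ask the LLM how promising this
                  # partial solution looks"; returns a number in [0, 1].
                  return random.random()

              def expand(state):
                  # Placeholder: enumerate candidate next steps for a state.
                  return [state + [i] for i in range(3)]

              class Node:
                  def __init__(self, state, parent=None):
                      self.state, self.parent = state, parent
                      self.children, self.visits, self.value = [], 0, 0.0

                  def ucb(self, c=1.4):
                      if self.visits == 0:
                          return float("inf")
                      return (self.value / self.visits
                              + c * math.sqrt(math.log(self.parent.visits)
                                              / self.visits))

              def mcts(root_state, iterations=100):
                  root = Node(root_state)
                  for _ in range(iterations):
                      node = root
                      # Selection: walk down by UCB until we reach a leaf.
                      while node.children:
                          node = max(node.children, key=Node.ucb)
                      # Expansion: add the candidate next steps.
                      node.children = [Node(s, parent=node)
                                       for s in expand(node.state)]
                      leaf = random.choice(node.children)
                      # Evaluation: the LLM replaces a random rollout here.
                      reward = llm_score(leaf.state)
                      # Backpropagation.
                      while leaf is not None:
                          leaf.visits += 1
                          leaf.value += reward
                          leaf = leaf.parent
                  return max(root.children, key=lambda n: n.visits).state

              print(mcts([]))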
        
               | threeseed wrote:
                | The problem is that every time you ask the LLM to
                | evaluate what to do next, it will return a wrong answer
                | X% of the time. Multiply that X across the number of
                | steps and you have a system that is effectively useless.
                | X today is ~5%.
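                | 
                | Back of the envelope: at a 5% per-step error rate, the
                | chance that a 20-step chain is entirely correct is
                | 0.95^20 ~= 0.36, so roughly two out of three such chains
                | contain at least one wrong step.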
               | 
               | I do think LLMs could be used to assist in building a
               | world model that could be a foundation for an AGI/agent
               | system. But it won't be the major part.
        
             | jhanschoo wrote:
             | As the other reply has said, the article points to
             | limitations of LLMs, but that doesn't preclude synthesizing
             | a system of multiple components that uses LLMs. To the
             | extent that I'm bearish on AI capabilities, I'll note that
             | program synthesis / compression / general inductive
             | reasoning which we expect intelligent agents to do is a
             | computationally very hard problem.
        
         | mitthrowaway2 wrote:
         | Despite its title, this article merely seems to argue that LLMs
         | will not themselves scale into AGI.
        
       | FrameworkFred wrote:
       | I'm playing around with LangChain and LangGraph
       | (https://www.langchain.com/) and it seems like these enable just
       | the sort of mechanisms mentioned.
        
       | trte9343r4 wrote:
       | > One could spin this idea even further and train several models
       | with radically different world views by curating different
       | training corpi that represent different sets of beliefs / world
       | views.
       | 
        | You can get good results by combining different models in a
        | chat, or even the same model with different parameters. The model
        | usually gives up on a hallucination when challenged. Sometimes it
        | pushes back and provides an explanation with sources.
        | 
        | I have a script that puts models into dialog, moderates the
        | discussion and takes notes. I run this stuff overnight, so
        | getting multiple choices speeds up iteration.
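        | 
        | For anyone curious, the script is roughly this shape (a sketch of
        | that kind of setup, not my actual code); chat() is a stand-in for
        | whatever client you call, not a real API:
        | 
        # Sketch of a two-model dialog loop with a moderator taking notes.
        # chat() is a placeholder; swap in a real client for the models
        # you actually run overnight.
        def chat(model, messages):
            return f"[{model} reply to: {messages[-1]['content'][:40]}]"

        def debate(question, model_a, model_b, moderator, rounds=4):
            transcript = [{"role": "user", "content": question}]
            notes = []
            challenge = ("Challenge anything unsupported above, "
                         "and cite sources where you can.")
            for i in range(rounds):
                speaker = model_a if i % 2 == 0 else model_b
                reply = chat(speaker, transcript)
                transcript.append({"role": "assistant", "content": reply})
                # Push the next speaker to challenge rather than agree.
                transcript.append({"role": "user", "content": challenge})
                # The moderator keeps running notes on claims and disputes.
                note_prompt = ("Summarize the key claims and open "
                               "disagreements:\n" + reply)
                notes.append(chat(moderator,
                                  [{"role": "user",
                                    "content": note_prompt}]))
            return notes

        print(debate("Does dataset curation fix hallucinations?",
                     "model-a", "model-b", "moderator-model"))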
        
       | wokwokwok wrote:
       | Does anyone really believe that having a good corpus will remove
       | hallucinations?
       | 
        | Is this article even written by a person? Hard to know; they have
        | a real blog with real articles, but stuff like this reads
        | strangely. Maybe they're just not a native English speaker?
       | 
       | > Hallucinations are certainly the toughest nut to crack and
       | their negative impact is basically only slightly lessened by good
       | confidence estimates and reliable citations (sources).
       | 
       | > The impact of contradictions in the training data.
       | 
        | (was this a prompt header you forgot to remove?)
       | 
       | > LLM are incapable of "self-inspection" on their training data
       | to find logical inconsistencies in it but in the input context
       | window they should be able to find logical inconsistencies.
       | 
       | Annnnyway...
       | 
       | Hallucinations cannot be fixed by a good corpus in a non-
       | deterministic (ie. temp > 0) LLM system where you've introduced a
       | random factor.
       | 
       | Period. QED. If you think it can, do more reading.
       | 
        | The idea that a good corpus can _significantly improve_ the error
        | rate is an open question, but the research I've seen _tends_ to
        | fall on the side of "to some degree, but curating a 'perfect'
        | dataset like that, of a sufficiently large size, is basically
        | impossible".
       | 
       | So, it's a pipe dream.
       | 
       | Yes, if you could have a perfect corpus, absolutely, you would
       | get a better model.
       | 
       | ...but _how_ do you plan to _get_ that perfect corpus of training
       | data?
       | 
        | If it were that easy, the people spending _millions and millions
        | of dollars_ making LLMs would, I guess, probably have come up
        | with a solution for it. They're not stupid. If you could easily
        | do it, it would already have been done.
       | 
       | my $0.02:
       | 
       | This is a dead end of research, because it's impossible.
       | 
        | Using LLMs which are finetuned to evaluate the output of _other_
        | LLMs, and using multi-sample / voting to reduce the incidence of
        | hallucinations that make it past the API barrier, is both actively
        | used and far, far more effective.
       | 
        | (ie. it doesn't matter if your LLM hallucinates 1 time in 10; if
        | you can reliably _detect_ that 1 instance, sample again, and
        | return a non-hallucination).
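        | 
        | For concreteness, a rough sketch of that sample-and-check loop;
        | generate() and judge() are placeholders for your generator model
        | and a separately finetuned checker, not any specific API:
        | 
        # Sketch: sample from a generator, let one or more "judge" models
        # flag hallucinations, and resample until a candidate passes.
        import random

        def generate(prompt):
            # Placeholder for the generator LLM.
            return f"answer to: {prompt}"

        def judge(prompt, answer):
            # Placeholder for a finetuned checker; True = "looks grounded".
            return random.random() > 0.1

        def answer_with_check(prompt, judges=3, max_tries=5):
            for _ in range(max_tries):
                candidate = generate(prompt)
                votes = sum(judge(prompt, candidate) for _ in range(judges))
                if votes > judges // 2:      # majority of judges accept it
                    return candidate
            return None                      # refuse rather than bullshit

        print(answer_with_check("When was the Treaty of Utrecht signed?"))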
       | 
       | Other solutions... I'm skeptical; most of the ones I've seen
       | haven't worked when you actually try to use them.
        
         | thntk wrote:
         | I've seen such articles more and more recently. In the past,
         | when people had a vague idea, they had to do research before
         | writing. During this process, they often realized some flaws
         | and thoroughly revised the idea or gave up writing. Nowadays,
         | research can be bypassed with the help of eloquent LLMs,
         | allowing any vague idea to turn into a write-up.
        
         | comcuoglu wrote:
          | Thank you. It seems largely ignored that LLMs still sample from
          | a set of tokens based on estimated probability and the given
          | temperature - but not on factuality or the "confidence
          | estimate" described in the article. RAG etc. only move the
          | estimated probabilities in a more factually grounded direction,
          | but do not change the sampling itself.
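          | 
          | For anyone who hasn't looked at the sampling step, it really is
          | just something like this (a toy sketch; the logits at the
          | bottom are made up):
          | 
          # Toy sketch of temperature sampling: divide the logits by the
          # temperature, softmax, then draw a token at random. Nothing in
          # this step knows or cares whether a token is factually right.
          import math
          import random

          def sample(logits, temperature=0.8):
              scaled = [l / temperature for l in logits]
              m = max(scaled)
              exps = [math.exp(l - m) for l in scaled]
              total = sum(exps)
              r = random.random()
              cumulative = 0.0
              for token_id, e in enumerate(exps):
                  cumulative += e / total
                  if r <= cumulative:
                      return token_id
              return len(exps) - 1

          # Made-up logits for three candidate tokens; with temperature > 0
          # the lower-probability (possibly wrong) tokens still get sampled
          # some of the time.
          print(sample([2.0, 1.5, 0.2]))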
        
         | JohnVideogames wrote:
         | It's obvious that you can't solve hallucinations by curating
         | the dataset when you think about arithmetic.
         | 
         | It's trivial to create a corpus of True Maths Facts and verify
         | that they're correct. But an LLM (as they're currently
         | structured) will never generalise to new mathematical problems
         | with 100% success rate because they do not fundamentally work
         | like that.
        
         | tidenly wrote:
          | I wonder to what extent hallucination is a result of a "must
          | answer" bias?
         | 
          | When sampling data all over the internet, your data set only
          | represents people who _did_ write, _did_ respond to questions -
          | with no representation of those who didn't. Add to that
          | confidently wrong people - people who respond to questions on,
          | say, StackOverflow even when they're wrong - and suddenly you
          | have a data set that prefers replying with bullshit, because
          | there's no data from the people who _didn't_ know the answer
          | and wrote nothing.
         | 
         | Inherently there's no representation in the datasets of "I
         | don't know" null values.
         | 
         | LLMs are _forced_ to reply, in contrast, so they  "bullshit" a
         | response that sounds right even though not answering or saying
         | you don't know would be more appropriate - because no-one does
         | that on the internet.
         | 
         | I always assumed this was a big factor, but am I completely off
         | the mark?
        
         | sean_pedersen wrote:
         | I wrote up this blog post in 30 mins, that's why it reads a
         | little rough. I could not find explicit research on the impact
         | of contradicting training data, only on the general need for
         | high-quality training data.
         | 
          | Maybe it is a pipe dream to drastically improve on
          | hallucinations by curating a self-consistent data set, but I am
          | still interested in how much it actually impacts the quality of
          | the final model.
         | 
         | I described one possible way to create such a self-consistent
         | data set in this very blog post.
        
       | fatbird wrote:
       | In my mind LLMs are already fatally compromised. Proximity
        | matching via vector embeddings that offer no guarantees of
        | completeness or correctness has already surrendered the
        | essential advantage of technological advances.
       | 
       | Imagine a dictionary where the words are only mostly in
       | alphabetical order. If you look up a word and don't find it, you
       | can't be certain it's not in there. It's as useful as asking
       | someone else, or several other people, but it's value _as a
       | reference_ is zero, and there 's no shortage of other people on
       | the planet.
        
         | TeMPOraL wrote:
         | > _Proximity matching via vector embeddings that offer no
          | guarantees of completeness or correctness has already
         | surrendered the essential advantage of technological advances._
         | 
         | On the contrary, it's arguably _the_ breakthrough that allowed
         | us to model _concepts_ and meaning in computers. A sufficiently
         | high-dimensional embedding space can model arbitrary
         | relationships between embedded entities, which allows each of
         | them to be defined in terms of its associations to all the
          | others. This is more or less how we define concepts too, if you
         | dig down into it.
         | 
         | > _Imagine a dictionary where the words are only mostly in
          | alphabetical order. If you look up a word and don't find it,
         | you can't be certain it's not in there._
         | 
          | It's already the case with dictionaries. Dictionaries have
          | mistakes, words out of order; they get outdated, and most
          | importantly, they're _descriptive_. If a word isn't in it, or
          | isn't defined in a particular way, you cannot be certain it
          | doesn't exist or doesn't mean anything other than what the
          | dictionary says it does.
         | 
          | > _It's as useful as asking someone else, or several other
          | people_
          | 
          | Which is _very_ useful, because it _saves you the hassle of
          | dealing with other people_. Especially when it's as useful as
          | asking _an expert_, which saves you the effort of finding one.
          | Now scale that up to being able to ask about whole topics of
          | interest, instead of single words.
         | 
          | > _its value as a reference is zero_
         | 
         | Obviously. So is the value of asking even an expert for an
         | immediate, snap answer, and going with that.
         | 
          | > _and there's no shortage of other people on the planet_
         | 
          | Again, dealing with people is stupidly expensive in time,
          | energy and effort, starting with having to find _the right
          | people_. An LLM is just a function call away.
        
           | fatbird wrote:
           | Technology advances by supplanting human mechanisms, not by
           | amplifying or cheapening them. A loom isn't a more nimble
           | hand, it's a different mechanical approach to weaving. Wheels
           | and roads aren't better legs, they're different conveyances.
           | LLMs as a replacement for dealing with people but offering
           | only the same certainty aren't an advance.
           | 
           | LLMs do math by trying to match an answer to a prompt.
           | Mathematica does better than that.
        
             | TeMPOraL wrote:
              | Wheels and roads do the same thing as legs in several major
              | use cases, only they do it better. Same with jet engines
              | and flapping wings. Same with loom vs. hand, and same with
              | LLMs vs. people.
             | 
             | > _LLMs do math by trying to match an answer to a prompt.
             | Mathematica does better than that._
             | 
             | Category error. Pencil and paper or theorem prover are
             | better at doing complex math than snap judgment of an
             | expert, but an expert using those tools according to their
             | judgement is the best. LLMs compete with snap judgement,
             | not heavily algorithmic tasks.
             | 
             | Still, it's a somewhat pointless discussion, because the
             | premise behind your argument is that LLMs aren't a big
             | breakthrough, which is in disagreement with facts obvious
             | to anyone who hasn't been living under a rock for the past
             | year.
        
       | thntk wrote:
        | We know high-quality data can help, as evidenced by the Phi
        | models. However, this alone can never eliminate hallucination
        | because data can never be both consistent and complete. Moreover,
        | hallucination is an inherent flaw of intelligence in general if
        | we think of intelligence as (lossy) compression.
        
       | jillesvangurp wrote:
        | There has been steady improvement since the release of chat gpt
        | into the wild, which was less than two years ago (easy to
        | forget). I've been getting a lot of value out of chat gpt 4o,
        | like lots of other people. I find that with each model generation
        | my dependence on this stuff for day-to-day work goes up as the
        | soundness of its answers and reasoning improves.
       | 
       | There are still lots of issues and limitations but it's a very
       | different experience than with gpt 3 early on. A lot of the
       | smaller OSS models are a bit of a mixed bag in terms of
       | hallucinations and utility. But they can be useful if you apply
       | some skills. Half the success is actually learning to prompt
       | these things and learning to spot when it starts to hallucinate.
       | 
       | One thing I find useful is to run ideas by it in kind of a
       | socratic mode where I try to get it to flesh out brain farts I
       | have for algorithms or other kinds of things. This can be coding
       | related topics but also non technical kinds of things. It will
       | get some things wrong and when you spot it, you can often get a
       | better answer simply by pointing it out and maybe nudging it in a
       | different direction. A useful trick with code is to also let it
       | generate tests for its own code. When the tests fail to run, you
       | can ask it to fix it. Or you can ask it for some alternative
       | implementation of the same thing. Often you get something that is
       | 95% close to what you asked for and then you can just do the
       | remaining few percent yourself.
       | 
       | Doing TDD with an LLM is a power move. Good tests are easy enough
       | to understand and once they pass, it's hard to argue with the
       | results. And you can just ask it to identify edge cases and add
       | more tests for those. LLMs take a lot of the tediousness out of
       | writing tests. I'm a big picture kind of guy and my weakness is
       | skipping unit tests to fast forward to having working code.
        | Spelling out all the stupid little assertions is mind-numbing
        | work that I don't have to bother with anymore. I just let the
        | AI generate good test cases. LLMs make TDD a lot less tedious.
       | It's like having a really diligent junior pair programmer doing
       | all the easy bits.
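        | 
        | If you want to script that loop instead of doing it by hand, it
        | is roughly the following (just a sketch; ask_llm() and
        | write_files() are placeholders for your model and your own file
        | handling):
        | 
        # Sketch of the generate-tests / run / feed-failures-back loop.
        import subprocess

        def ask_llm(prompt):
            raise NotImplementedError("call your model of choice here")

        def write_files(reply):
            raise NotImplementedError("parse the reply and write files out")

        def run_tests():
            r = subprocess.run(["pytest", "-q"],
                               capture_output=True, text=True)
            return r.returncode == 0, r.stdout + r.stderr

        def tdd_loop(spec, max_rounds=5):
            reply = ask_llm(f"Write a module plus pytest tests for: {spec}")
            write_files(reply)
            for _ in range(max_rounds):
                ok, output = run_tests()
                if ok:
                    return reply
                reply = ask_llm("These tests failed:\n" + output +
                                "\nFix the code (not the tests) and "
                                "return it in full.")
                write_files(reply)
            return None   # still red after a few rounds; take over manually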
       | 
       | And if you apply SOLID principles to your own code (which is a
       | good thing in any case), a lot of code is self contained enough
       | that you can easily fit it in a small file that is small enough
       | to fit into the context window of chat gpt (which is quite large
       | these days). So, a thing I often do is just gather relevant code,
        | copy-paste it and then tell it to make some reasonable assumptions
       | about missing things and make some modifications to the code. Add
       | a function that does X; how would I need to modify this code to
       | address Y; etc. I also get it to iterate on its own code. And a
       | neat trick is to ask it to compare its solution to other
       | solutions out there and then get it to apply some of the same
       | principles and optimizations.
       | 
        | One thing with RAG is that we're still under-utilizing LLMs for
        | this. It's a lot easier to get an LLM to ask good questions than
        | it is to get it to provide the right answers. With RAG, you can
        | use good old information retrieval to answer the questions. IMHO
        | limiting RAG to just vector search is a big mistake. It actually
        | doesn't work that well for structured data, and you could just
        | ask the LLM to query some API based on a specification, or use
        | some SQL, XPath, or whatever query language. And why ask just 1
        | question? Maybe engage in a dialog where it zooms in on the
        | solution via querying and iteratively coming up with better
        | questions until the context has all the data needed to come up
        | with the answer.
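        | 
        | Sketched out, that dialog looks something like this (again just a
        | sketch; ask_llm() and run_query() are stand-ins for your model
        | and your database/API):
        | 
        # Sketch of iterative RAG: instead of a single vector lookup, let
        # the model keep requesting queries (SQL, an API call, a search)
        # until it decides it has enough context to answer.
        def ask_llm(prompt):
            raise NotImplementedError("call your model here")

        def run_query(query):
            raise NotImplementedError("run SQL / call an API / search here")

        def iterative_rag(question, max_steps=5):
            context = []
            for _ in range(max_steps):
                prompt = (f"Question: {question}\n"
                          f"Context so far: {context}\n"
                          "Reply with either QUERY: <a query that would "
                          "help> or ANSWER: <the final answer>.")
                reply = ask_llm(prompt)
                if reply.startswith("ANSWER:"):
                    return reply[len("ANSWER:"):].strip()
                context.append(run_query(reply[len("QUERY:"):].strip()))
            return ask_llm(f"Best-effort answer to: {question}\n"
                           f"Given context: {context}")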
       | 
       | If you think about it, this is how most knowledge workers address
       | problems themselves. They are not oracles of wisdom that know
       | everything but merely aggregators and filters of external
       | knowledge. A good knowledge worker / researcher / engineer is one
       | that knows how to ask the right questions in order to come up
       | with an iterative process that converges on a solution.
       | 
       | Once you stop using LLMs as one shot oracles that give you an
       | answer given a question, they become a lot more useful.
       | 
        | As for AGI, a human enhanced by AGI is a powerful combination.
        | I kind of like the vision behind neuralink, where the core idea
        | is basically improving the bandwidth between our brains and
        | external tools and intelligence. Using a chat bot is a
        | low-bandwidth kind of thing. I actually find it tedious.
        
         | stephc_int13 wrote:
          | This is very close to my use case with Claude 3.5. I used to
          | only write tests when I was forced to; now it is part of the
          | routine to double-check everything while improving the
          | codebase. I also really enjoy the socratic discussions when
          | thinking about new ideas. What it says is mostly of generic
          | Wikipedia quality, but this is useful when I am exploring
          | domains where I have knowledge gaps.
        
       | luke-stanley wrote:
        | As I understand it, the Phi models are trained on much more
        | selective training data; the Tiny Stories research was one of the
        | starts of that. They used GPT-4 to make stories, encyclopedia-like
        | training data, and code for Phi to learn from, which probably
        | helps with logical structuring too. I think they did add in some
        | real web data as well, but I think it was fairly selective.
        | 
        | Maybe something between Cyc and Google's math and geometry LLMs
        | could help.
        
       | lsy wrote:
       | We can't develop a universally coherent data set because what we
       | understand as "truth" is so intensely contextual that we can't
       | hope to cover the amount of context needed to make the things
       | work how we want, not to mention the numerous social situations
       | where writing factual statements would be awkward or disastrous.
       | 
       | Here are a few examples of statements that are not "factual" in
       | the sense of being derivable from a universally coherent data
       | set, and that nevertheless we would expect a useful intelligence
       | to be able to generate:
       | 
       | "There is a region called Hobbiton where someone named Frodo
       | Baggins lives."
       | 
       | "We'd like to announce that Mr. Ousted is transitioning from his
       | role as CEO to an advisory position while he looks for a new
       | challenge. We are grateful to Mr. Ousted for his contributions
       | and will be sad to see him go."
       | 
       | "The earth is round."
       | 
       | "Nebraska is flat."
        
         | smokel wrote:
         | _> We can 't develop a universally coherent data set because_
         | 
         | Yet every child seems to manage, when raised by a small
         | village, over a period of about 18 years. I guess we just need
         | to give these LLMs a little more love and attention.
        
           | bugglebeetle wrote:
           | Or maybe hundreds of millions of years of evolutionary
           | pressure to build unbelievably efficient function
           | approximation.
        
           | antisthenes wrote:
           | And then you go out into the real world, talk to real adults,
           | and discover that the majority of people don't have a
           | coherent mental model of the world, and have completely
           | ridiculous ideas that aren't anywhere near an approximation
           | of the real physical world.
        
             | mistermann wrote:
             | > and discover that the majority of people don't have a
             | coherent mental model of the world
             | 
             | "Coherent" is doing a lot of lifting here. All humans have
             | highly flawed models, and we've been culturally conditioned
             | to grade on a curve to hide the problem from ourselves.
        
         | js8 wrote:
          | You're right. We don't really know how to handle uncertainty
          | and fuzziness in logic properly (to avoid logical
          | contradictions). There have been many mathematical attempts to
          | model uncertainty (just to name a few: probability, Dempster-
          | Shafer theory, fuzzy logic, non-monotone logics, etc.), but
          | they all suffer from some kind of paradox.
         | 
          | At the end of the day, none of these theoretical techniques
          | prevailed in the field of AI, and we ended up with empirically
          | successful neural networks (and LLMs specifically). We know
          | they model uncertainty, but we have no clue how they do it
          | conceptually, or whether they even have a coherent conception
          | of uncertainty.
         | 
          | So I would posit that the problem isn't that we don't have the
          | technology, but rather that we don't understand what we want
          | from it. I have yet to see a coherent theory of how humans
          | manipulate language to express uncertainty that would encompass
          | a broad (if not complete) range of how people use language.
          | Without that, you can't define what a hallucination of an LLM
          | is. Maybe it's making a joke (some believe the point of a joke
          | is to highlight a subtle logical error of some sort), because,
          | you know, it read a lot of them and concluded that's what
          | humans do.
         | 
          | So AI eventually prevailed (over humans) in fields where we
          | were able to precisely define the goal. But what is our goal
          | vis-a-vis human language? What do we want AI to answer to our
          | prompts? I think we are stuck at the lack of a definition of
          | that.
        
       | RamblingCTO wrote:
        | My biggest problem with them is that I can't quite get them to
        | behave the way I want. I built myself a "therapy/coaching"
        | telegram bot (I'm healthy, but like to reflect a lot, no
        | worries). I even built a self-reflecting memory component that
        | generates insights (sometimes spot on, sometimes random af). But
        | the more I use it, the more I notice that neither the memory nor
        | the prompt matters much. I just can't get it to behave like a
        | therapist would. In other words: I can't find the inputs that
        | produce the desired prediction from the SOTA LLMs. And I think
        | that's a big problem if they're to be more than shallow hype.
        
         | coldtea wrote:
          | > _I just can't get it to behave like a therapist would_
          | 
          import time
          import random

          SESSION_DURATION = 50 * 60
          start_time = time.time()

          while True:
              current_time = time.time()
              elapsed_time = current_time - start_time

              if elapsed_time >= SESSION_DURATION:
                  print("Our time is up. That will be $150. See you next week!")
                  break

              _ = input("")
              print(random.choice(["Mmm hmm", "Tell me more",
                                   "How does that make you feel?"]))
              time.sleep(1)
         | 
         | Thank me later!
        
           | RamblingCTO wrote:
           | haha, good one! although I'm German and it was free for me
           | when I did it. I just had the best therapist. $150 a session
           | is insane!
        
       | DolphinAsa wrote:
        | I'm surprised he didn't mention the way we are solving the
        | issue at Amazon. It's not a secret at this point: giving LLMs
        | hands, or agentic systems that run code or do things that get
        | feedback in a loop, DRAMATICALLY REDUCES hallucinations.
        
       | nyrikki wrote:
       | > ...manually curate a high-quality (consistent) text corpus
       | based on undisputed, well curated wikipedia articles and battle
       | tested scientific literature.
       | 
        | This rests on the mistaken assumption that science is about
        | objective truth.
       | 
       | It is confusing the map for the territory. Scientific models are
       | intended to be useful, not perfect.
       | 
       | Statistical learning, vs symbolic learning is about existential
       | quantification vs universal quantification respectively.
       | 
        | All models are wrong, some are useful; this applies even to the
        | most unreasonably accurate versions like QFT and GR.
       | 
        | Spherical cows, no matter how useful, are hotly debated outside
        | of the didactic half-truths of low-level courses.
       | 
        | The corpus that the above seeks doesn't exist in academic
        | circles, only in popular science, where people don't see that
        | practical, useful models are far more important than 'correct'
        | ones.
        
       | mitthrowaway2 wrote:
       | LLMs don't only hallucinate because of mistaken statements in
       | their training data. It just comes hand-in-hand with the model's
       | ability to remix, interpolate, and extrapolate answers to other
       | questions that aren't directly answered in the dataset. For
       | example if I ask ChatGPT a legal question, it might cite as
       | precedent a case that doesn't exist at all (but which seems
       | plausible, being interpolated from cases that do exist). It's not
       | necessarily because it drew that case from a TV episode. It works
       | the same way that GPT-3 wrote news releases that sounded
       | convincing, matching the structure and flow of real articles.
       | 
       | Training only on factual data won't solve this.
       | 
       | Anyway, I can't help but feel saddened sometimes to see our
       | talented people and investment resources being drawn in to
       | developing these AI chatbots. These problems are solvable, but
       | are we really making a better world by solving them?
        
         | CamperBob2 wrote:
         | _These problems are solvable, but are we really making a better
         | world by solving them?_
         | 
         | When you ask yourself that question -- and you do ask yourself
         | that, right? -- what's _your_ answer?
        
           | mitthrowaway2 wrote:
           | I do, all the time! My answer is "most likely not". (I
           | assumed that answer was implied by my expressing sadness
           | about all the work being invested in them.) This is why,
           | although I try to keep up-to-date with and understand these
           | technologies, I am not being paid to develop them.
        
         | dweinus wrote:
         | 100% I think the author is really misunderstanding the issue
         | here. "Hallucination" is a fundamental aspect of the design of
         | Large Language Models. Narrowing the distribution of the
         | training data will reduce the LLM's ability to generalize, but
         | it won't stop hallucinations.
        
           | sean_pedersen wrote:
            | I agree that a perfectly consistent dataset won't completely
            | stop _statistical_ language models from hallucinating, but it
            | will reduce it. I think it is established that data quality
            | is more important than quantity. Bullshit in -> bullshit out,
            | so a focus on data quality is good and needed IMO.
            | 
            | I am also saying LMs' output should cite sources and come
            | with confidence scores (which reflect how much the output is
            | in or out of the training distribution).
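            | 
            | As a crude example of the kind of confidence score I mean: if
            | the serving stack exposes per-token log-probabilities, they
            | can already be turned into a rough signal (a toy sketch with
            | made-up numbers; the logprobs would come from whatever API
            | you use):
            | 
            # Toy sketch: average token log-probability as a crude
            # confidence score. A low average (high perplexity) suggests
            # the output is far from the training distribution.
            import math

            def confidence(token_logprobs):
                avg = sum(token_logprobs) / len(token_logprobs)
                return 1.0 / math.exp(-avg)   # in (0, 1]; higher = surer

            print(confidence([-0.1, -0.3, -0.2]))   # ~0.82, fairly confident
            print(confidence([-2.5, -3.1, -1.9]))   # ~0.08, much shakier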
        
             | rtkwe wrote:
             | I think the problem is you need an extremely large quantity
             | of data just to get the machine to work in the first place.
             | So much so that there may not be enough to get it working
             | on just "quality" data.
        
       | Animats wrote:
       | Plausible idea which needs a big training budget. Was it funded?
        
       ___________________________________________________________________
       (page generated 2024-07-18 23:09 UTC)