[HN Gopher] LLMs aren't "trained on the internet" anymore
       ___________________________________________________________________
        
       LLMs aren't "trained on the internet" anymore
        
       Author : ingve
       Score  : 90 points
       Date   : 2024-06-01 20:59 UTC (2 hours ago)
        
 (HTM) web link (allenpike.com)
 (TXT) w3m dump (allenpike.com)
        
       | zer00eyz wrote:
        | You, sir, get an F in history, and the industry does too.
       | 
       | Does no one remember why expert systems fell apart? Because you
       | have to keep paying experts to feed the beast. Because they are
       | bound to the whims and limitations of experts. Making up data
        | isn't going to get us there; we already failed with this method
       | ONCE.
       | 
        | OpenAI's bet with MS and the resignation of all the safety
        | people say everything you need to know. MS gets everything up
        | to AGI... If you thought you were close, if you thought you
        | were going to get there with a bigger model and more data,
        | then you MIGHT want MS's money. And MS had its own ML folks
        | publish papers with "hints of AGI", the Google engineer saying
        | "it's AGI" before getting laughed at...
       | 
       | I suspect that everyone at OpenAI was high on their own supply.
       | That they thought AGI would emerge, or sapience, or sentience if
       | they shoved enough data at it. I think the safety minding folks
       | leaving points to the fact that they found the practical
       | limitations.
       | 
        | Show me the paper that has progress on hallucination. Show me
        | the paper that doubles effectiveness and halves the size. These
        | are where we need progress for this to become more than grift,
        | more than NFTs.
        
         | sebzim4500 wrote:
         | They aren't going to show you any papers at all, they like
         | money.
        
         | falcor84 wrote:
         | >Show me the paper that doubles effectiveness and halves the
         | size.
         | 
          | LLMs have pretty clearly been the most rapidly advancing
          | technology in the history of humankind. Are you not
          | entertained?!
        
         | solidasparagus wrote:
         | > Does no one remember why expert systems fell apart?
         | 
          | Many of the current generation of AI experts either did not
          | pay much attention to the history of AI or believe this time
          | is completely different. They would do well to spend more
          | time learning about history.
         | 
          | However, your view doesn't strike me as correct either.
          | Expert systems fell apart because the world was more complex
          | than researchers realized and enumeration was essentially
          | discovered to be infeasible (more or less as you say). But
          | the impossibility of enumerating the world isn't news;
          | everyone knows "the bitter lesson". And this isn't the past -
          | now everyone on earth carries around a computer, a video
          | camera and a microphone. They talk to each other through the
          | internet. Remote workers' screens are recorded. Billions of
          | vehicles with absurd numbers of sensors are roaming around
          | the world. More of the arenas that matter to humanity are
          | digital and thus effective domains for automated exploration
          | and data generation.
         | 
         | The information about how the world operates exists or can be
         | generated, the only real question is how to get your hands on
         | it.
        
           | discreteevent wrote:
           | > The information about how the world operates exists or can
           | be generated, the only real question is how to get your hands
           | on it.
           | 
           | I'm sure I could read all the information for an astrophysics
           | course in a relatively short time. Understanding it is a
           | different matter.
        
             | solidasparagus wrote:
              | Understanding is a loaded term. But large transformers
              | seem pretty good at learning from datasets (of more or
              | less any modality) to the extent that they can create
              | useful new datapoints and allow you to work with
              | existing datapoints in useful, structured ways.
        
           | zer00eyz wrote:
           | > The information about how the world operates exists or can
           | be generated
           | 
            | The hubris of mathematics. At what scale does weather
            | prediction become 100 percent accurate? How large a model
            | do you need, and how big a computer to run it?
            | 
            | Do we think that reducing the world to a model and feeding
            | it through (what isn't even close to a model of) "thought"
            | or "interaction" or... whatever you want to bill an LLM
            | as, is going to be any more accurate than weather
            | prediction?
        
       | HL33tibCe7 wrote:
       | > Admittedly, this used to be true! And is still mostly true. But
       | it's increasingly becoming less true.
       | 
       | So the headline is bullshit then.
       | 
        | The thrust of this article is essentially "LLMs are trained on
        | the internet, _but wait_, they are also trained on other stuff
        | in these very rarified and specific cases". So, as even the
        | author concedes, LLMs are still, for the most part, trained on
        | the internet.
        
         | neilv wrote:
         | I used ta do Web corpus. I still do. But I used ta, too.
         | 
         | (Apologies to Mitch Hedberg.
         | https://www.youtube.com/watch?v=VqHA5CIL0fg )
        
         | gmuslera wrote:
          | "Trained", putting everything in the same bag, hides the
          | possibility that not all training data have the same weight,
          | confidence, or even deep tags to differentiate an expert
          | opinion from a 4chan post.
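The weighting idea above can be sketched in a few lines. A minimal, hypothetical illustration — the source names and trust weights here are invented for the example, not anything an actual lab is known to use:

```python
import random
from collections import Counter

# Hypothetical per-source trust weights: the point is that "trained on X"
# hides the fact that not all training data need carry equal weight.
SOURCE_WEIGHTS = {"expert_review": 1.0, "forum_post": 0.3, "anonymous_board": 0.05}

def sample_training_batch(examples, k, rng=random):
    """Sample k (text, source) pairs, biased toward higher-trust sources.

    Sources not listed in SOURCE_WEIGHTS default to a low weight.
    """
    weights = [SOURCE_WEIGHTS.get(source, 0.1) for _, source in examples]
    return rng.choices(examples, weights=weights, k=k)

corpus = [("peer-reviewed answer", "expert_review"),
          ("reddit comment", "forum_post"),
          ("drive-by post", "anonymous_board")]

batch = sample_training_batch(corpus, k=1000, rng=random.Random(0))
counts = Counter(source for _, source in batch)  # expert_review dominates
```

The same per-source weights could equally be applied as loss weights at training time rather than as sampling probabilities; this sketch only shows the sampling variant.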
        
         | hn_throwaway_99 wrote:
         | I agree the article title is clickbait. But the article makes
         | the good point that people often say LLMs are "trained on the
         | Internet" to imply all of the statistical problems with that
         | (e.g. the type of content on the Internet, and populations who
         | are more likely to post on the Internet, are not representative
          | samples of knowledge). The article's point, I feel, is that
          | so much is now being invested in private data that it's no
          | longer really fair to make that implication by default.
        
         | emporas wrote:
          | Somewhat clickbait, but "it's increasingly becoming less
          | true" exponentially. The human population produces written
          | data exponentially, but on a less steep slope than LLMs
          | themselves. Human text may double every year; LLM-generated
          | text may double every day, or every second.
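As back-of-envelope arithmetic on those doubling rates (the rates are the commenter's assumptions, not measured figures):

```python
# If human-written text doubles yearly and LLM-generated text doubles
# daily, the gap between the two explodes within weeks.
def growth_factor(days, doubling_period_days):
    return 2 ** (days / doubling_period_days)

days = 30
human = growth_factor(days, 365)  # ~1.06x after a month
llm = growth_factor(days, 1)      # 2**30, over a billion-fold
```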
        
       | mrkramer wrote:
        | Data is valuable, so, for example, I don't understand why
        | Reddit and Stack Overflow gave their valuable data to OpenAI
        | and Google for pennies, when they could've made their own
        | chatbots and beaten OpenAI and Google at their own game.
        
         | pizza wrote:
         | Stack Overflow did try this but imo their LLM wasn't that good
        
           | mrkramer wrote:
            | So after a few iterations they decided to give up and
            | take the quick buck? So shortsighted of them.
        
             | solardev wrote:
             | Do they have much time left? Seems like once the answers
             | are all scraped and trained, that site wouldn't be able to
             | survive long anyway?
        
               | bandrami wrote:
               | The next version of libfoo released after 2023 will have
               | a new set of options in /etc/foo.conf and at some point a
               | human being who knows that will have to answer a question
               | about it for an LLM to know that.
        
               | reducesuffering wrote:
               | No, future LLMs will ingest the codebase, any docs, and
               | be able to answer the question anyway. That is, if the
               | LLM didn't generate the code base itself...
        
         | Rucadi wrote:
         | I assume they thought that OpenAI and Google would have used
         | that data anyways without a clear way to prove otherwise.
        
         | marcinzm wrote:
         | It's not their core business model and so far LLMs aren't very
         | profitable. Investing a ton of money into a whole new business
         | area with heavy competition that may be profitable eventually
         | while not swimming in cash is often how companies die.
        
         | yreg wrote:
         | StackOverflow is Creative Commons, so until courts/regulators
         | decide otherwise, anyone can probably claim it's fair game to
         | train on it, same as Wikipedia.
         | 
         | Actually,
        
         | saintfire wrote:
         | Apparently (according to AI enthusiasts) all publicly
         | accessible data is free from copyright when used as training
         | data. It doesn't really look like "owning" the data is worth
         | very much money at all.
        
         | asadotzler wrote:
          | OpenAI would have stolen it if they didn't license it, so
          | it's "free money" for Reddit or Stack Overflow when OpenAI
          | or any other comes along and offers money up front.
        
       | RecycledEle wrote:
       | Wow. They quoted me in the article.
        
         | RecycledEle wrote:
         | This is one of the longest articles anyone ever wrote to prove
         | me wrong.
         | 
         | I agree that LLMs are not 100% trained on Internet posts, but
         | that they are mostly trained on good Internet posts.
         | 
            | When I ask an LLM a question, I expect a simulation of a
            | good answer from an Internet discussion board specializing
            | in that topic.
        
           | mattgreenrocks wrote:
           | In true HN comment style, the article comes out both with
           | guns blazing and leads with a "well actually..." :)
        
       | logrot wrote:
       | The dream is collapsing.
        
         | benreesman wrote:
         | As Something of a Vagueposter Myself, I'll bite.
         | 
         | Which dream is collapsing? I'm not disputing it, legitimately
         | curious which of several collapsing dreams you mean.
        
           | rzzzt wrote:
           | It's the title of a Hans Zimmer song from the Inception
           | soundtrack, pretty intense!
        
       | benreesman wrote:
       | I think this post makes a few good points, certainly the
       | parabolic trajectory Scale seems to be on is at least suggestive
       | if not conclusive that there's a lot more going on now than just
       | big text crawls.
       | 
        | And Phi-3 is _something else_, even from relatively limited
        | time playing with it, so that's useful signal for anyone who
        | hadn't looked at it yet. Wildly cool stuff.
       | 
       | It seems weird to not mention Anthropic or Mistral or FAIR pretty
       | much at all: they're all pretty clearly on more modern
       | architectures at least as concerns capability per weight and
       | Instruct-style stuff. I'm part of a now nontrivial group who
       | regards Opus as basically shattering GPT-4-{0125, 1106}-Preview
       | (which is basically the same as 4o for pure language modalities)
       | on basically everything I care about, and LLaMA3 is just about
       | there as well, maybe not quite Opus, comparable if you ignore
       | trivially gamed metrics like MMLU.
       | 
       | And I have no idea why we're talking about GPT-5 when there's
       | little if any verifiable evidence it even exists as a training
       | run tracking to completion. Maybe it is, maybe not, but let's get
       | a look at it rather than just assume that it's going to lap the
       | labs that are currently pushing the pace now?
        
         | ein0p wrote:
         | Have you actually tried using Opus side by side with GPT4 on
         | "work" related stuff? GPT4 is way better in my experience, to
         | the point where I cancelled my Opus subscription after just a
         | couple of months.
        
           | benreesman wrote:
           | I make an effort to use both and several high-capability open
           | tunes every day (it's not literally every day but I have
           | keyboard shortcuts for all of them).
           | 
           | Opus historically had issues with minor typographical errors,
           | though recently that seems to not happen often, lots of very
           | sharp people at Anthropic.
           | 
           | So a month ago if I wanted something from Opus I'd run it
           | through a cleanup pass courtesy of one of the other ones, but
           | even my old standby dolphin-8x7 can clean up typos. 1106 can
           | as well, but all else equal I don't want to be sending my
           | stuff to any black box data warehouse and I'm always
           | surprised so many other sophisticated people don't share the
           | preference.
           | 
           | My personal eyeball capability check is to posit a gauge
           | symmetry and ask what it thinks the implied conserved
           | quantity is, and I've yet to see Opus not crush that relative
           | to anything else, including real footnotes.
           | 
           | On coding I usually hand it a Prisma schema and ask for a
           | proto3/gRPC definition that is a good way to interact with
           | it, Opus in my personal experience also dominates there.
           | 
           | If you have an example of a task that represents a counter
           | example I'd be grateful for another little integration test
           | for my personal ad-hoc model card. I want to know the best
           | tool for every job.
        
       | delusional wrote:
        | This sounds terrible. We are paying PhD-level experts to
        | produce novel work exclusively available through an overly
        | optimistic, lying robot.
        | 
        | What if those experts just published the stuff freely online
        | instead? Surely that would be more productive and trustworthy.
        | Reality is truly stupid.
        
         | jltsiren wrote:
         | Is that different from a company doing R&D without releasing
         | the results to the public? Experts working for private gain is
         | the norm, not the exception.
         | 
         | That kind of work also sounds incredibly boring. They probably
         | have to pay their experts a lot more than it would normally
         | cost to hire the same caliber of experts. Which would mean they
         | are not generating very much private data.
        
         | civilized wrote:
         | Hopefully it'll all be available in a fire sale when these
         | companies finally have to be stripped for parts.
        
         | bossyTeacher wrote:
         | > We are paying PhD level experts to produce novel work
         | exclusively available through an overly optimistic, lying,
         | robot.
         | 
          | We are not. OpenAI is.
        
       | jacobsenscott wrote:
        | OpenAI etc. will be paying irresistible sums of money to
        | companies that promised to keep data private. Think Slack (and
        | their recent "opt out" fiasco), Atlassian, Dropbox...
        
         | amelius wrote:
         | "It's easier to ask for forgiveness" is the main modus operandi
         | nowadays ...
        
           | goatlover wrote:
           | As long as you can pay the lawyers.
        
           | moogly wrote:
           | They don't even need to do that...
        
       | sdfgtr wrote:
       | > While some of this is for annotation and ratings on data that
       | came from the web or LLMs, they also create new training data
       | whole-hog:
       | 
        | The article states that this human data comes from PhDs,
        | poets, and other experts, but my recollection from some info
        | about programming-LLM training is that there was a small army
        | of low-paid Indian programmers feeding it with data.
        | 
        | Even if it's actually experts now, I have to wonder when that
        | will switch to third-worlders making $1/hour.
        
       | astrea wrote:
       | It's fascinating to watch the whole "Data is the new oil" thing
       | grow and morph into something truly horrible.
        
         | freeone3000 wrote:
         | What have been your experiences with oil?
        
       | hole_in_foot wrote:
        | For a company with "open" in its name, we sure don't know
        | what data OpenAI trains its models on. Why?
        
       | surfingdino wrote:
       | > For example, if your model is hallucinating because you don't
       | have enough training examples of people expressing uncertainty,
       | or biased because it has unrepresentative data, then generate
       | some better examples!
       | 
        | Or, as the case may be... humans are biased? Also "generate
        | some better examples" sounds like fudging data to fit the
        | expected outcome. It smells of clutching at straws, hoping to
        | come up with something before the world loses interest and
        | investor money runs out.
       | 
        | If you want to see how LLMs fail at coming up with original
        | responses, ask your favourite hallucinating bot to come up
        | with fifty different ways of encouraging people to "Click the
        | Subscribe button" in a YT video. Not only will it not come up
        | with anything original, it will simply start repeating itself
        | (well, not itself - it will start repeating phrases found in
        | YT video transcripts).
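The repetition described above is easy to quantify once you have the model's fifty outputs as a list of strings. A crude sketch using stdlib string similarity — the example phrases are invented, and the 0.8 threshold is an arbitrary choice:

```python
import difflib

def near_duplicate_ratio(phrases, threshold=0.8):
    """Fraction of phrases that closely resemble an earlier phrase,
    using difflib's similarity ratio as a rough repetition proxy."""
    dupes = 0
    for i, phrase in enumerate(phrases):
        if any(difflib.SequenceMatcher(None, phrase.lower(),
                                       earlier.lower()).ratio() >= threshold
               for earlier in phrases[:i]):
            dupes += 1
    return dupes / len(phrases) if phrases else 0.0

outputs = [
    "Don't forget to smash that subscribe button!",
    "Hit subscribe and ring the bell!",
    "Don't forget to smash that subscribe button below!",
]
repetition = near_duplicate_ratio(outputs)  # nonzero: 1st and 3rd overlap
```

For real transcripts you would want embedding-based similarity rather than character-level matching, but this is enough to see a model recycling phrasings.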
        
       | stephc_int13 wrote:
        | The current state of LLMs would be several orders of magnitude
        | more impressive if they were only trained on data scraped from
        | the web.
       | 
        | But this is not the reality of modern LLMs by a long shot.
        | They are trained, in increasingly large part, on custom-built
        | datasets created by countless paid individuals hidden behind
        | stringent NDAs.
        | 
        | The author seems to see that as a strength, an opportunity for
        | unbounded growth and potential. I think it is the opposite:
        | this approach is close to a gigantic game of whack-a-mole,
        | effectively unbounded, but in the wrong way.
        
         | jackblemming wrote:
          | Reminds me of the same issues with self-driving. It seems
          | we need a completely different approach to solve this class
          | of problems.
        
       ___________________________________________________________________
       (page generated 2024-06-01 23:00 UTC)