[HN Gopher] LLMs aren't "trained on the internet" anymore
___________________________________________________________________
LLMs aren't "trained on the internet" anymore
Author : ingve
Score : 90 points
Date : 2024-06-01 20:59 UTC (2 hours ago)
(HTM) web link (allenpike.com)
(TXT) w3m dump (allenpike.com)
| zer00eyz wrote:
| You, sir, get an F in history, and the industry does too.
|
| Does no one remember why expert systems fell apart? Because you
| have to keep paying experts to feed the beast. Because they are
| bound to the whims and limitations of experts. Making up data
| isn't going to get us there; we already failed with this method
| ONCE.
|
| OpenAI's bet with MS and the resignation of all the safety
| people says everything you need to know. MS gets everything up
| to AGI... IF you thought you were close, if you thought you were
| going to get there with a bigger model and more data, then you
| MIGHT want MS's money. And MS had its own ML folks publish
| papers with "hints of AGI", the Google engineer saying "it's
| AGI" before getting laughed at...
|
| I suspect that everyone at OpenAI was high on their own supply.
| That they thought AGI would emerge, or sapience, or sentience if
| they shoved enough data at it. I think the safety-minded folks
| leaving points to the fact that they found the practical
| limitations.
|
| Show me the paper that has progress on hallucination. Show me
| the paper that doubles effectiveness and halves the size. These
| are where we need progress for this to become more than grift,
| more than NFTs.
| sebzim4500 wrote:
| They aren't going to show you any papers at all, they like
| money.
| falcor84 wrote:
| >Show me the paper that doubles effectiveness and halves the
| size.
|
| LLMs have pretty clearly been the most rapidly advancing
| technology in the history of humankind. Are you not
| entertained?!
| solidasparagus wrote:
| > Does no one remember why expert systems fell apart?
|
| Many of the current generation of AI experts either did not
| pay much attention to the history of AI or believe this time
| is completely different. They would do well to spend more
| time learning about that history.
|
| However, your view doesn't strike me as correct either. Expert
| systems fell apart because the world was more complex than
| researchers realized and enumeration was essentially discovered
| to be infeasible (more or less as you say). But the
| impossibility of enumerating the world isn't news; everyone
| knows "the bitter lesson". And this isn't the past - now
| everyone on earth carries around a computer, a video camera and
| a microphone. They talk to each other through the internet.
| Remote workers' screens are recorded. Billions of vehicles with
| absurd numbers of sensors are roaming around the world. More of
| the arenas that matter to humanity are digital and thus
| effective domains for automated exploration and data
| generation.
|
| The information about how the world operates exists or can be
| generated, the only real question is how to get your hands on
| it.
| discreteevent wrote:
| > The information about how the world operates exists or can
| be generated, the only real question is how to get your hands
| on it.
|
| I'm sure I could read all the information for an astrophysics
| course in a relatively short time. Understanding it is a
| different matter.
| solidasparagus wrote:
| Understanding is a loaded term. But large transformers seem
| pretty good at learning from datasets (of more or less any
| modality) to the extent that they can create useful new
| datapoints and allow you to work with existing datapoints
| in useful, structured ways.
| zer00eyz wrote:
| > The information about how the world operates exists or can
| be generated
|
| The hubris of mathematics. At what scale does weather
| prediction become 100 percent accurate? How large a model do
| you need, and how big a computer to run it?
|
| Do we think that reducing the world to a model and feeding it
| through what isn't even close to a model of "thought" or
| "interaction" or... whatever you want to bill an LLM as is
| going to be any more accurate than weather prediction?
| HL33tibCe7 wrote:
| > Admittedly, this used to be true! And is still mostly true. But
| it's increasingly becoming less true.
|
| So the headline is bullshit then.
|
| The thrust of this article is essentially "LLMs are trained on
| the internet, _but wait_, they are also trained on other stuff
| in these very rarefied and specific cases". So even you concede
| that LLMs are still, for the most part, trained on the internet.
| neilv wrote:
| I used ta do Web corpus. I still do. But I used ta, too.
|
| (Apologies to Mitch Hedberg.
| https://www.youtube.com/watch?v=VqHA5CIL0fg )
| gmuslera wrote:
| "Trained", putting everything in the same bag, hides the
| possibility that not all training data have the same weight,
| confidence, or even deep tags to differentiate an expert
| opinion from a 4chan post.
| hn_throwaway_99 wrote:
| I agree the article title is clickbait. But the article makes
| the good point that people often say LLMs are "trained on the
| Internet" to imply all of the statistical problems with that
| (e.g. the type of content on the Internet, and populations who
| are more likely to post on the Internet, are not representative
| samples of knowledge). The article's point, I feel, is that so
| much is being invested in private data that it's no longer
| really fair to make that implication by default.
| emporas wrote:
| Somewhat of a clickbait, but "it's increasingly becoming less
| true" exponentially. The human population produces written
| data exponentially, but on a less steep slope than LLMs
| themselves. Human text may double every year; LLM-generated
| text may double every day, or every second.
| mrkramer wrote:
| Data is valuable, so for example I don't understand why Reddit
| and Stack Overflow gave their valuable data to OpenAI and
| Google for pennies, when they could've made their own chatbots
| and beaten OpenAI and Google at their own game.
| pizza wrote:
| Stack Overflow did try this but imo their LLM wasn't that good
| mrkramer wrote:
| So after a few iterations they decided to give up and take
| the quick buck? So shortsighted of them.
| solardev wrote:
| Do they have much time left? Seems like once the answers
| are all scraped and trained on, the site wouldn't be able
| to survive long anyway.
| bandrami wrote:
| The next version of libfoo released after 2023 will have
| a new set of options in /etc/foo.conf and at some point a
| human being who knows that will have to answer a question
| about it for an LLM to know that.
| reducesuffering wrote:
| No, future LLMs will ingest the codebase, any docs, and
| be able to answer the question anyway. That is, if the
| LLM didn't generate the code base itself...
| Rucadi wrote:
| I assume they thought that OpenAI and Google would have used
| that data anyways without a clear way to prove otherwise.
| marcinzm wrote:
| It's not their core business model and so far LLMs aren't very
| profitable. Investing a ton of money into a whole new business
| area with heavy competition that may be profitable eventually
| while not swimming in cash is often how companies die.
| yreg wrote:
| StackOverflow is Creative Commons, so until courts/regulators
| decide otherwise, anyone can probably claim it's fair game to
| train on it, same as Wikipedia.
|
| Actually,
| saintfire wrote:
| Apparently (according to AI enthusiasts) all publicly
| accessible data is free from copyright when used as training
| data. It doesn't really look like "owning" the data is worth
| very much money at all.
| asadotzler wrote:
| OpenAI would have stolen it if they didn't license it, so it's
| "free money" for Reddit or Stack Overflow when OpenAI or anyone
| else comes along and offers money up front.
| RecycledEle wrote:
| Wow. They quoted me in the article.
| RecycledEle wrote:
| This is one of the longest articles anyone ever wrote to prove
| me wrong.
|
| I agree that LLMs are not 100% trained on Internet posts, but
| that they are mostly trained on good Internet posts.
|
| When I ask an LLM a question I expect a simulation of a good
| answer from an Internet discussion board specializing in that
| topic.
| mattgreenrocks wrote:
| In true HN comment style, the article comes out both with
| guns blazing and leads with a "well actually..." :)
| logrot wrote:
| The dream is collapsing.
| benreesman wrote:
| As Something of a Vagueposter Myself, I'll bite.
|
| Which dream is collapsing? I'm not disputing it, legitimately
| curious which of several collapsing dreams you mean.
| rzzzt wrote:
| It's the title of a Hans Zimmer song from the Inception
| soundtrack, pretty intense!
| benreesman wrote:
| I think this post makes a few good points; certainly the
| parabolic trajectory Scale seems to be on is at least
| suggestive, if not conclusive, that there's a lot more going
| on now than just big text crawls.
|
| And Phi-3 is _something else_, even from relatively limited
| time playing with it, so that's useful signal for anyone who
| hadn't looked at it yet. Wildly cool stuff.
|
| It seems weird to not mention Anthropic or Mistral or FAIR pretty
| much at all: they're all pretty clearly on more modern
| architectures at least as concerns capability per weight and
| Instruct-style stuff. I'm part of a now nontrivial group who
| regards Opus as basically shattering GPT-4-{0125, 1106}-Preview
| (which is basically the same as 4o for pure language modalities)
| on basically everything I care about, and LLaMA3 is just about
| there as well, maybe not quite Opus, comparable if you ignore
| trivially gamed metrics like MMLU.
|
| And I have no idea why we're talking about GPT-5 when there's
| little if any verifiable evidence it even exists as a training
| run tracking to completion. Maybe it does, maybe not, but let's
| get a look at it rather than just assume it's going to lap the
| labs currently pushing the pace.
| ein0p wrote:
| Have you actually tried using Opus side by side with GPT4 on
| "work" related stuff? GPT4 is way better in my experience, to
| the point where I cancelled my Opus subscription after just a
| couple of months.
| benreesman wrote:
| I make an effort to use both and several high-capability open
| tunes every day (it's not literally every day but I have
| keyboard shortcuts for all of them).
|
| Opus historically had issues with minor typographical errors,
| though recently that seems not to happen often; lots of very
| sharp people at Anthropic.
|
| So a month ago if I wanted something from Opus I'd run it
| through a cleanup pass courtesy of one of the other ones, but
| even my old standby dolphin-8x7 can clean up typos. 1106 can
| as well, but all else equal I don't want to be sending my
| stuff to any black box data warehouse and I'm always
| surprised so many other sophisticated people don't share the
| preference.
|
| My personal eyeball capability check is to posit a gauge
| symmetry and ask what it thinks the implied conserved
| quantity is, and I've yet to see Opus not crush that relative
| to anything else, including real footnotes.
|
| On coding I usually hand it a Prisma schema and ask for a
| proto3/gRPC definition that is a good way to interact with it;
| in my personal experience Opus also dominates there.
|
| If you have an example of a task that represents a
| counterexample, I'd be grateful for another little integration
| test for my personal ad hoc model card. I want to know the
| best tool for every job.
| delusional wrote:
| This sounds terrible. We are paying PhD-level experts to
| produce novel work exclusively available through an overly
| optimistic, lying robot.
|
| What if those experts just published the stuff freely online
| instead? Surely that would be more productive and trustworthy.
| Reality is truly stupid.
| jltsiren wrote:
| Is that different from a company doing R&D without releasing
| the results to the public? Experts working for private gain is
| the norm, not the exception.
|
| That kind of work also sounds incredibly boring. They probably
| have to pay their experts a lot more than it would normally
| cost to hire the same caliber of experts, which would mean
| they are not generating very much private data.
| civilized wrote:
| Hopefully it'll all be available in a fire sale when these
| companies finally have to be stripped for parts.
| bossyTeacher wrote:
| > We are paying PhD level experts to produce novel work
| exclusively available through an overly optimistic, lying,
| robot.
|
| We are not. OpenAI is.
| jacobsenscott wrote:
| OpenAI et al. will be paying irresistible sums of money to
| companies that promised to keep data private. Think Slack (and
| their recent "opt out" fiasco), Atlassian, Dropbox...
| amelius wrote:
| "It's easier to ask for forgiveness" is the main modus operandi
| nowadays ...
| goatlover wrote:
| As long as you can pay the lawyers.
| moogly wrote:
| They don't even need to do that...
| sdfgtr wrote:
| > While some of this is for annotation and ratings on data that
| came from the web or LLMs, they also create new training data
| whole-hog:
|
| The article states that this human data comes from PhDs,
| poets, and other experts, but my recollection from some info
| about programming LLM training is that there was a small army
| of low-paid Indian programmers feeding it with data.
|
| Even if it's actually experts now, I have to wonder when that
| will switch to third-worlders making $1/hour.
| astrea wrote:
| It's fascinating to watch the whole "Data is the new oil" thing
| grow and morph into something truly horrible.
| freeone3000 wrote:
| What have been your experiences with oil?
| hole_in_foot wrote:
| For a company with "open" in its name, we sure don't know what
| data OpenAI trains its models on. Why?
| surfingdino wrote:
| > For example, if your model is hallucinating because you don't
| have enough training examples of people expressing uncertainty,
| or biased because it has unrepresentative data, then generate
| some better examples!
|
| Or, as the case may be... humans are biased? Also, "generate
| some better examples" sounds like fudging data to fit the
| expected outcome. It smells of clutching at straws, hoping to
| come up with something before the world loses interest and
| investor money runs out.
|
| If you want to see how LLMs fail at coming up with original
| responses, ask your favourite hallucinating bot to come up with
| fifty different ways of encouraging people to "Click the
| Subscribe button" in a YT video. Not only will it not come up
| with anything original, it will simply start repeating itself
| (well, not itself - it will start repeating phrases found in
| YT video transcripts).
| stephc_int13 wrote:
| The current state of LLMs would be several orders of magnitude
| more impressive if they were trained only on data scraped from
| the web.
|
| But this is not the reality of modern LLMs by a long shot; they
| are trained in increasingly large part from custom-built
| datasets created by countless paid individuals hidden behind
| stringent NDAs.
|
| The author seems to see that as a strength, an opportunity for
| unbounded growth and potential. I think it is the opposite:
| this approach is close to a gigantic game of whack-a-mole,
| effectively unbounded, but in the wrong way.
| jackblemming wrote:
| Reminds me of the same issues with self-driving. Seems like we
| need a completely different approach to solve this class of
| problems.
___________________________________________________________________
(page generated 2024-06-01 23:00 UTC)