[HN Gopher] AI hype is built on flawed test scores
___________________________________________________________________
AI hype is built on flawed test scores
Author : antondd
Score : 135 points
Date : 2023-10-10 09:20 UTC (13 hours ago)
(HTM) web link (www.technologyreview.com)
(TXT) w3m dump (www.technologyreview.com)
| javier_e06 wrote:
| As a developer when I work with ChatGPT I can see ChatGPT
| eventually taking over my JIRA stories. Then ChatGPT will take
| over management creating product roadmaps, prioritizing and
| assigning tasks to itself. All dictated by customer feedback. The
| clock is ticking. But reasoning like a human? No.
| janalsncm wrote:
| The only test I need is the amount of time it takes me to do
| common tasks with and without ChatGPT. I'm aware it's not perfect
| but perfect was never necessary.
| aldousd666 wrote:
| Only idiots are basing their excitement about what's possible on
| those test scores. They're just an attempt to measure one bot
| against another. There is a strong possibility that they are only
| measuring how well the bot takes the test, and nothing at all
| about what the tests themselves purport to measure. I mean, those
| tests are probably similar to stuff that's in the training data.
| ehutch79 wrote:
| Yeah... there's a lot of idiots out there.
| epups wrote:
| I think ironically there has been an "AI-anti-hype hype", with
| people like Gary Marcus trying to blow up every single possible
| issue into a deal breaker. Most of the claims in this article are
| based on tests performed only on GPT-3, and researchers often
| seem to make tests in a way that proves their point - see an
| earlier comment from me here with an example:
| https://news.ycombinator.com/item?id=37503944
|
| I agree there have been many attention-grabbing headlines that
| are due to simple issues like contamination. However, I think AI
| has
| already proved its business value far beyond those issues, as
| anyone using ChatGPT with a code base not present in their
| dataset can attest.
| smcl wrote:
| I think some amount of that is necessary though, no? We have
| people claiming that this generation of AI will replace jobs -
| and plenty of companies have taken the bait and tried to get
| started with LLM-based bots. We even had a pretty high-profile
| case of a Google AI engineer going public with claims that
| their LaMDA AI was sentient. Regardless of what you think of
| that individual or Google's AI efforts, this resonates with the
| public. Additionally a pretty common sentiment I've seen has
| been non-tech people suggesting AI should handle content
| moderation - the idea being that since they're not human and
| don't have "feelings" they won't have biases and won't attempt
| to "silence" any one political group (without realising that
| bias can be built in via the training data).
|
| It seems pretty important to counter that and to debunk any
| wild claims such as these. To provide context and to educate
| the world on their shortcomings.
| epups wrote:
| I think skepticism is always welcome and we should continue
| to explore what LLMs can and cannot do. However, what I'm
| referring to is trying to get a quick win by defeating some
| inferior version of GPT or trying to apply a test which you
| don't even expect most humans to pass.
|
| The article is actually fine and pretty balanced, but it is a
| bit unfortunate that 80% of their examples are not
| illustrative of current capabilities. At least for me, most
| of my optimism about the utility of LLMs comes from GPT-4
| specifically.
| danielvaughn wrote:
| I remember watching a documentary about an old blues guitar
| player from the 1920's. They were trying to learn more about him
| and track down his whereabouts during certain periods of his
| life.
|
| At one point, they showed some old footage which featured a
| montage of daily life in a small Mississippi town. You'd see
| people shopping for groceries, going on walks, etc. Some would
| stop and wave at the camera.
|
| In the documentary, they noted that this footage exists because
| at the time, they'd show it on screen during intermission at
| movie theaters. Film was still in its infancy at that time, and
| was so novel that people loved seeing themselves and other people
| on the big screen. It was an interesting use of a new technology,
| and today it's easy to understand why it died out. Of course, it
| likely wasn't obvious at the time.
|
| I say all that because I don't think we can _know_ at this point
| what AI is capable of, and how we want to use it, but we should
| expect to see lots of failure while we figure it out. Over the
| next decade there are undoubtedly going to be countless ventures
| similar to the "show the townspeople on the movie screen" idea,
| blinded by the novelty of technological change. But failed
| ventures have no relevance to the overall impact or worth of the
| technology itself.
| kenjackson wrote:
| What died out? Film?
| actionfromafar wrote:
| Showing locals little movie clips of themselves in
| intermissions at the local theater.
| savanaly wrote:
| >What died out?
|
| The custom of showing film consisting of footage of the
| general public in movie theaters.
| danielvaughn wrote:
| The practice of filming a montage around your local
| neighborhood or town to play during intermission. Though you
| could say intermission as well, since that was a legacy
| concept that was inherited from plays and eventually died out
| as well.
| iFire wrote:
| Selfies and 15 second videos still exist as shorts and tiktoks.
| mattkrause wrote:
| > it's easy to understand why it died out
|
| I think it's probably more sociological than technical. People
| _love_ to see themselves and their friends/family. My work has
| screens that show photos of events and it always causes a bit
| of a stir ("Did you see X's photo from the summer picnic?")
| Yearbooks are perennially popular and there's a whole slew of
| social media.
|
| However, for this to be "fun", there must be a decent chance
| that most people in the audience know a few people in a few of
| the pictures. I can't imagine this working well in a big city,
| for example, or a rural theatre that draws from a huge area.
| refulgentis wrote:
| This is my favorite new AI argument; it took me a few months to see
| it. Enjoyed it at first.
|
| You start with everyone knows there's AI hype from tech bros.
| Then you introduce a PhD or two at institutions with good names.
| Then they start grumbling about anthropomorphizing and who knows
| what AI is anyway.
|
| Somehow, if it's long enough, you forget that this kind of has
| nothing to do with anything. There is no argument. Just imagining
| other people must believe crazy things and working backwards from
| there to find something to critique.
|
| Took me a bit to realize it's not even an argument, just
| parroting "it's a stochastic parrot!" It assumes other people
| are dunces who genuinely believe it's a mini-human. I can't
| believe MIT Tech Review is going for this; the only argument
| here is that the tests are flawed if you think they're supposed
| to show the AI model is literally human.
| PeterisP wrote:
| It's not built on high test scores - while academics do benchmark
| models on various tests, all the many people who built up the
| hype mostly did it based on their personal experience with a
| chatbot, not by running some long (and expensive) tests on those
| datasets.
|
| The tests are used (and, despite their flaws, useful) to compare
| various facets of model A to model B - however, the validation
| whether a model is _good_ now comes from users, and that
| validation really can't be flawed much - if it's helpful (or
| not) to someone, then it is what it is, the proof of the pudding
| is in the eating.
| rvz wrote:
| Most of the hype comes from the AI grifters who need to find the
| next sucker to dump their VC shares onto, the next greater fool
| to purchase their ChatGPT-wrapper snake oil project at an
| overvalued asking price.
|
| The ones who have to dismantle the hype are the proper
| technologists, such as Yann LeCun and Grady Booch, who know
| exactly what they are talking about.
| rahimnathwani wrote:
| "People have been giving human intelligence tests--IQ tests and
| so on--to machines since the very beginning of AI," says Melanie
| Mitchell, an artificial-intelligence researcher at the Santa Fe
| Institute in New Mexico. "The issue throughout has been what it
| means when you test a machine like this. It doesn't mean the same
| thing that it means for a human."
|
| The last sentence above is an important point that most people
| don't consider.
| api wrote:
| It seems a bit like having a human face off in a race against a
| car and then concluding that cars have exceeded human physical
| dexterity.
|
| It's not an apples/apples comparison. The nature of the
| capability profile of a human vs. any known machine is
| radically different. Machines are intentionally designed to
| have extreme peaks of performance in narrow areas. Present-
| generation AI might be wider in its capabilities than what
| we've previously built, but it's still rather narrow as you
| quickly discover if you start trying to use it on real tasks.
| waynenilsen wrote:
| This article is absurd.
|
| > But when a large language model scores well on such tests, it
| is not clear at all what has been measured. Is it evidence of
| actual understanding? A mindless statistical trick? Rote
| repetition?
|
| It is measuring how well it does _at REPLACING HUMANS_. It is
| hard to believe that the author does not understand this. I
| don't care how it obtains its results.
|
| GPT-4 is like a hyperspeed entry- to mid-level dev that has
| almost no ability to contextualize. Tools built on top of the
| 32k-token context window will allow repo ingestion.
|
| This is the worst it will ever be.
| dartos wrote:
| Which tests test specifically for "replacing humans?" That
| seems like a wild metric to try and capture in a test.
|
| Also an aside:
|
| > This is the worst it will ever be.
|
| I hear this a lot and it really bothers me. Just because
| something is the worst it'll ever be doesn't mean it'll get
| much better. There could always be a plateau on the horizon.
|
| It's akin to "just have faith." A real weird sentiment that I
| didn't notice in tech before 2021.
| RandomLensman wrote:
| It is measuring how well it does replacing humans - in those
| tests.
| GuB-42 wrote:
| I don't think test scores have anything to do with the hype. Most
| people don't even realize test scores exist.
|
| One is just the wow factor. It will be short-lived. A bit like
| VR, which is awesome when you first try it, but wears off
| quickly. Here, you can have a bot write convincing stories and
| generate nice-looking images, which is awesome until you notice
| that the story doesn't make sense and that the images have many
| details wrong. This is not just a score, it is something you can
| see and experience.
|
| And there is also the real thing. People are starting to use GPT
| for real work. I have used it to document my code, for instance,
| and it works really well; with it I can do a better job than
| without, and I can do it faster. Many students use it to do
| their homework, which may not be something you want, but it is
| no less of a real use. Many artists are strongly protesting
| against generative AI; this in itself is telling, as it means
| the technology is taken seriously. At the same time, other
| artists are making use of it.
|
| It is even used to great effect where you don't notice. Phone
| cameras are a good example: by enhancing details using AI, they
| give you much better pictures than what the optics are capable
| of. Some people don't like that because the pictures are "not
| real", but most enjoy the better perceived quality. Then there
| are image classifiers, speech-to-text and OCR, fuzzy searching,
| content ranking algorithms we love to hate, etc., all of which
| make use of AI.
|
| Note: here AI = machine learning with neural networks, which is
| what the hype is about. AI is a vague term that can mean just
| about anything.
| Jensson wrote:
| > I don't think test scores have anything to do with the hype.
| Most people don't even realize test scores exist.
|
| They put the test scores front and center in the initial
| announcement with a huge image showing improvements on AP
| exams, it was the main thing people talked about during the
| announcement and the first thing anyone who read anything about
| gpt-4 sees.
|
| I don't think many who are hyped about these things missed
| that.
|
| https://openai.com/research/gpt-4
| bondarchuk wrote:
| > _But there's a problem: there is little agreement on what those
| results really mean. Some people are dazzled by what they see as
| glimmers of human-like intelligence; others aren't convinced one
| bit._
|
| I find the whole hype & anti-hype dynamic so tiresome. Some are
| over-hyping, others are responding with over-anti-hyping.
| Somewhere in-between are many reasonable, moderate and caveated
| opinions, but neither the hypesters nor the anti-hypesters will
| listen to these (considering all of them to come from people at
| the
| opposite extreme), nor will outside commentators (somehow being
| unable to categorize things as anything more complicated than
| this binary).
| Closi wrote:
| Depends on whether the hype is invalid - let's remember that "There
| will be a computer in every home!" was once considered hype.
|
| There is a possible world where AI will be a truly
| transformative technology in ways we can't possibly understand.
|
| There is a possible world where this tech fizzles out.
|
| So one of the reasons that there is a broad 'hype' dynamic here
| is that the range of possibilities is broad.
|
| I sit firmly in the first camp though - I believe it's truly a
| transformative technology, and struggle to see the perspective
| of the 'anti-hype' crowd.
| TerrifiedMouse wrote:
| I'm in the second camp. To every hyped up tech, all I can say
| is "prove it". Give me actual real world results.
|
| There are millions of hustlers out there pushing snake oil.
| The probability that something is the real deal and not snake
| oil is small. Better to assume the glass is half empty.
| Closi wrote:
| There will be millions of hustlers regardless of whether the
| technology is transformative or not.
|
| The early PC market was filled with hustlers, but that
| doesn't mean that the PC didn't match the hype.
|
| The .com boom was filled with hustlers, but that doesn't
| mean that the Internet wasn't transformative.
|
| Actual real world results... well the technology is already
| responsible for c40% of code on Github. Image recognition
| technologies are soaring and self-driving feels within
| reach. Few people doubt that a real-world Jarvis will be in
| your home within 12 months. The Turing test is smashed, and
| LLMs are already replacing live chat operatives. And this
| is just the start of the technology...
| TerrifiedMouse wrote:
| > The .com boom was filled with hustlers, but that
| doesn't mean that the Internet wasn't transformative.
|
| But a lot of .com projects were BS. If you were to pick
| at random, the probability you got a winner was low. Thus
| it's wise to be skeptical of all hyped stuff until it has
| proven itself.
|
| > Actual real world results... well the technology is
| already responsible for c40% of code on Github.
|
| Quite sure you misread that article. It says 40% of the
| code checked in by people who use Copilot is AI-
| generated. Not 40% of all code.
|
| That's how some programmers are I guess. I have heard of
| people copy pasting code directly from stack overflow
| without a second thought about how it works. That's
| probably Copilot's audience.
| Closi wrote:
| I think your reasoning is flawed - the fact a lot of .com
| projects were BS does not imply that the underlying
| technology (the internet) wasn't transformative.
|
| Are we really saying that people who were saying the
| internet was a transformative technology in the
| mid-1990s were wrong? It _was_ transformative, but it
| was hard to see which parts of the technology would stick
| around. Of course it doesn't mean that every single
| company and investment was going to be profitable, that's
| not true of anything ever. People investing in Amazon and
| Google were winners though - these are companies that
| have in many ways reinvented the market they operate in.
|
| > Quite sure you misread that article. It says 40% of the
| code checked in by people who use Copilot is AI-
| generated. Not 40% of all code.
|
| OK, I'll accept that it's 40% for Copilot users. That's
| still 40% of some programmers' code!
| dleslie wrote:
| Two years ago I didn't use AI at all. Now I wouldn't go without
| it; I have Copilot integrated with Emacs, VSCode, and Rider. I
| consider it a ground-breaking productivity accelerator, a leap
| similar to when I transitioned from Turbo Pascal 2 to Visual C++ 6.
|
| That's why I'm hyped. If it's that good for me, and it's
| generalizable, then it's going to rock the world.
| thomasfromcdnjs wrote:
| Lifelong programmer, and same sentiments; I use it
| everywhere I can.
|
| I am currently transliterating a language PDF into a formatted
| lexicon. I wouldn't even be able to do this without Copilot; it
| has turned a seemingly impossibly arduous task into a
| pleasurable one.
| airstrike wrote:
| Coding on something without copilot these days feels like
| having my hands tied. I'm looking at you, Xcode and Colab...
| derbOac wrote:
| This was interesting to me but mostly because of a question I
| thought it was going to focus on, which is how should we
| interpret these tests when a human takes them?
|
| I wasn't sure that the phenomena they discussed were as relevant
| to the question of whether AI is overhyped as they made it out to
| be, but I did think a lot of questions about the meaning of the
| performances were important.
|
| What's interesting to me is you could flip this all on its head
| and, instead of asking "what can we infer about the machine
| processes these test scores are measuring?", we could ask "what
| does this imply about the human processes these test scores are
| measuring?"
|
| A lot of these tests are well-validated but overinterpreted, I
| think, and leaned on too heavily to make inferences about people.
| If a machine can pass a test, for instance, what does it say
| about the test as used in people? Should we be putting as much
| weight on them as we do?
|
| I'm not arguing these tests are useless or something, just that
| maybe we read into them too much to begin with.
| kfk wrote:
| AI hype is really problematic in Enterprise. Big companies are
| now spending C-suite executive time figuring out a company "AI
| strategy". This is going to be another cycle of money wasted and
| businesses upset, very similar to what I have seen with Big
| Data. The thing in Enterprise is that everyone serious about biz
| operations knows AI test scores and AI quality are not there,
| but very few are able to communicate these concerns in a
| constructive way; rather, everyone is embracing the hype
| because, maybe they get a promotion? Tech, as usual, is very
| happy to feed the hype and, as usual, never tells businesses
| honestly that, at best, this is an incremental productivity
| improvement, nothing life-changing. I think the issue is an
| overall lack of honesty, professionalism, and accountability
| across the board, with tech leading this terrible way of pushing
| product and "adding value".
| huijzer wrote:
| > Tech, as usual, is very happy to feed the hype
|
| I agree completely with you on this.
|
| In defence of the executives, however: some businesses will be
| seriously affected. Call centres and plagiarism scanners have
| already been affected, but it's unclear which other industries
| will be too. Maybe the probability is low, but the impact could
| be very high. I think this reasoning is driving the executives.
| kfk wrote:
| Look, I am going to wait and see on this; maybe new facts will
| make me reconsider. In the meanwhile, GitHub Copilot is just a
| cost to my company; I haven't seen much additional productivity.
| I guess my concern, given how hard it is to hire developers and
| technologists, is replacing simpler job roles, like a customer
| service representative, with complicated new ones, like "MLOps
| Engineer".
| janalsncm wrote:
| Identifying an "AI strategy" seems backwards. What they should
| be doing is identifying the current problems and goals of the
| company and reassessing how best to accomplish them given the
| new capabilities which have surfaced. Perhaps "AI" is the best
| way. Or maybe simpler ways are better.
|
| I've said it before, but as someone to whom "AI" means
| something more than making API calls to some SAAS, I look
| forward to the day they hire me at $300/hour to replace their
| "AI strategy" with something that can be run locally off of a
| consumer-grade GPU or cheaper.
| gmerc wrote:
| Publicly listed companies whose traditional business model is
| under pressure are incentivized to hype, because if they don't
| inspire their wary investors with an idea of sustained growth,
| the cautionary tale of Twitter (a valuation low enough to lose
| control) awaits.
|
| In capitalism, you grow or you die, and sometimes you need to
| bullshit people about growth potential to buy yourself time.
| hashtag-til wrote:
| Yes, sad but true.
| chasd00 wrote:
| In consulting all we hear is sell, sell, sell AI, so I'm sure my
| industry isn't helping at all. I'm not on board yet; I just
| don't see a use case in enterprise beyond learning a knowledge
| base to make a more conversational self-help search and things
| like that. It's great that it can help write a function in
| JavaScript, but that's not a watershed moment... yet. Curious to
| see AI project sales at the end of 2024 (everything in my biz is
| measured in units of $$).
| padjo wrote:
| It's rational herd dynamics for the execs. Going against the
| herd and being wrong is a career ender. Going with the herd and
| being wrong will be neutral at worst.
| hdjjhhvvhga wrote:
| > because, maybe they get a promotion?
|
| While I agree with you in general, I don't think this bit is
| particularly fair. I'd say we know the limitations, and we also
| know that using LLMs might bring some advantage, and the
| companies that are able to use it properly will have a better
| position, so it makes sense to at least investigate the
| options.
| hashtag-til wrote:
| Agreed. I think there is a FOMO phenomenon among C-level execs
| that is generating a gigantic waste of money and time, creating
| distractions around "AI strategy".
|
| It started a few years back and is now really inflamed by LLMs,
| because of the consumer-level hype and general media reporting
| about it.
|
| You can perceive that by the multiple AI startups capturing
| millions in VC capital for absolutely bogus value propositions.
| Bizarre!
| epups wrote:
| The problem with your premise is that you're already drawing
| conclusions about the potential of AI and deciding it is hype.
| Perhaps decades ago someone could have equally criticised
| "Internet hype" and "mobile hype" and look foolish now.
| soco wrote:
| Also decades ago someone criticised "bigdata hype" and
| "microservices hype" and looks right now. Doing things just
| out of FOMO is rarely a good business decision. It can pay
| off - even a broken clock is right twice a day - but it's
| definitely bad to follow every new thing just because Gartner
| mentioned it. I'm not giving advice of course, but having
| seen enterprises betting good money even on NFTs, I tend to
| treat every new enterprise powerpoint idea with a certain
| dose of skepticism.
| epups wrote:
| Yes, hype exists and some things we thought were promising
| turned out not to be. However, if anyone is making the case
| that we know enough today to claim that AI is mostly hype,
| I think that's foolish.
| pixl97 wrote:
| Business can work on more than one thing at once.
| Businesses typically take any number of risks they invest
| in. Proper risk management ensures you've not over
| committed assets to the point of an unrecoverable loss.
|
| Some businesses in some industries can follow a strategy of
| "never do anything until it's a well established process",
| others cannot.
| dboreham wrote:
| > AI hype is really problematic in Enterprise.
|
| This only appears so because we here have some insight into the
| domain. But there have always been hype cycles. We just didn't
| notice them so readily.
|
| The speed with which this happens makes me suspect there is a
| hidden "generic hype army" that was already in place,
| presumably hyping the last thing, and ready to jump on this
| thing.
| jacobr1 wrote:
| Blindly following a trend will likely not end well. But even
| with previous hype cycles, those companies that identified good
| use cases, validated those use cases, and had solid execution
| of the projects leaped ahead. Big Data was genuinely of value
| to plenty of organizations, and a waste of time for others. IoT
| was crazy for plenty of orgs ... but also was really valuable
| to certain segments. Gartner's hype cycle ends with the plateau
| of productivity for a reason ... you just have to go through
| the trough of disillusionment first, which is going to come
| from a the great multitudes of failed and ill-conceived
| projects.
| JaDogg wrote:
| This is exactly correct.
| randcraw wrote:
| The debate over whether LLMs are "intelligent" seems a lot like
| the old debate among NLP experts over whether English must be
| modeled as a context-free grammar (pushdown automaton) or a
| finite-state
| machine (regular expression). Yes, any language can be modeled
| using regular expressions; you just need an insane number of FSMs
| (perhaps billions). And that seems to be the model that LLMs are
| using to model cognition today.
|
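| To make the bounded-depth point concrete, here is a toy Python
| sketch (illustrative only, and nothing to do with how LLMs are
| implemented): a regular expression can accept balanced
| parentheses only up to a depth hard-coded into the pattern, so
| covering deeper nesting means stamping out ever more machinery -
| the "insane number of FSMs" problem in miniature.
|
|     import re
|
|     # Build a regex that accepts balanced parens up to a fixed
|     # nesting depth; each extra level inflates the pattern.
|     def balanced_regex(depth):
|         pattern = r"\(\)"  # depth 1: ()
|         for _ in range(depth - 1):
|             # wrap the pattern to allow one more nesting level
|             pattern = r"\((?:%s)*\)" % pattern
|         return re.compile(r"^(?:%s)*$" % pattern)
|
|     rx = balanced_regex(3)
|     print(bool(rx.match("(()(()))")))    # True: depth 3
|     print(bool(rx.match("((((()))))")))  # False: depth 5
|
| A pushdown automaton handles any depth with one rule; the
| finite-state version must enumerate every level it will see.
|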
| LLMs seem to use little or no abstract reasoning (is-a) or
| hierarchical perception (has-a), as humans do -- both of which
| are grounded in semantic abstraction. Instead, LLMs can memorize
| a brute-force explosion of finite state machines (interconnected
| with Word2Vec-like associations) and then traverse those machines
| and associations as some kind of mashup, akin to a coherent
| abstract concept. Then as LLMs get bigger and bigger, they just
| memorize more and more mashup clusters of FSMs augmented with
| associations.
|
| Of course, that's not how a human learns, or reasons. It seems
| likely that synthetic cognition of this kind will fail to enable
| various kinds of reasoning that humans perceive as essential and
| normal (like common sense based on abstraction, or physically-
| grounded perception, or goal-based or counterfactual reasoning,
| much less insight into the thought processes / perceptions of
| other sentient beings). Even as ever-larger LLMs "know more" by
| memorizing ever more FSMs, I suspect they'll continue to surprise
| us with persistent cognitive and perceptual deficits that would
| never arise in organic beings that _do_ use abstract reasoning
| and physically grounded perception.
| dmezzetti wrote:
| This video from Yann LeCun gives a great summary on where things
| stand. https://www.youtube.com/watch?v=pd0JmT6rYcI
|
| He is of the opinion that the current generation of transformer
| architectures is flawed and that it will take a new generation of
| models to get close to the hype.
| chewxy wrote:
| I note something very interesting in the AI hype, and I would
| like someone to help explain it.
|
| Whenever there's a news item or article noting the limits of current
| LLM tech (especially the GPT class of models from OpenAI),
| there's always a comment that says something along the lines of
| "ah did you test it on GPT-4"?
|
| Or if it's clear that it's a limitation of GPT-4, then you have
| comments along the lines of "what's the prompt?", or "the prompt
| is poor". Usually, it's someone who hasn't in the past indicated
| that they understand that prompt engineering is model-specific,
| and that the paper's point is to make a more general claim as
| opposed
| to a claim on one model.
|
| Can anyone explain this? It's like the mere mention of LLMs being
| limited in X, Y, Z fashion offends their lifestyle/core beliefs.
| Or perhaps it's a weird form of astroturfing. To which, I ask, to
| what end?
| jazzyjackson wrote:
| The output of any model is essentially random and whether it is
| useful or impressive is a coin flip. While most people get a
| mix of heads and tails, there are a few people at any time that
| are getting streaks of one head after another or vice versa.
|
| So my perception is that this leads to people who have good luck
| and perceive LLMs as near-AGI because it arrives at a useful
| answer
| more often than not, and these people cannot believe there are
| others who have bad luck and get worthless output from their
| LLM, like someone at a roulette table exhorting "have you tried
| betting it all on black? worked for me!"
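|
| You can see the streak effect with plain coin flips: among
| enough users, impressive runs are statistically guaranteed
| somewhere. A quick illustrative Python sketch (numbers made up):
|
|     import random
|
|     def longest_streak(flips):
|         best = run = 0
|         last = None
|         for f in flips:
|             run = run + 1 if f == last else 1
|             last = f
|             best = max(best, run)
|         return best
|
|     # 10,000 users, 50 "was the answer useful?" coin flips each
|     random.seed(0)
|     users = [[random.random() < 0.5 for _ in range(50)]
|              for _ in range(10_000)]
|     print(max(longest_streak(u) for u in users))
|     # the luckiest users typically see runs of 15+ in a row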
| stevenhuang wrote:
| Because they're saying it can't do something when they're
| holding it wrong.
|
| It's a weird thing to get hung up on if you ask me.
| TeMPOraL wrote:
| > _there's always a comment that says something along the
| lines of "ah did you test it on GPT-4"?_
|
| Perhaps because whenever there's "a news or article noting the
| limits of current LLM tech", it's a bit like someone tried to
| play a modern game on a machine they found in their parents'
| basement, and the only appropriate response to this is, "have
| you tried running it on something other than a potato"? This
| has been happening so often over the past few months that it's
| the first red flag you check for.
|
| GPT-4 is still _qualitatively_ ahead of all other LLMs, so
| outside of articles addressing specialized aspects of different
| model families, the claims are invalid unless they were tested
| on GPT-4.
|
| (Half the time the problem is that the author used the ChatGPT
| web app and did not even realize there are two models and they've
| been using the toy one.)
| wrsh07 wrote:
| 1. Just like it's frustrating when a paper is published making
| claims that are hard to verify, it's frustrating when somebody
| says "x can't do y" in a way that is hard to verify^^
|
| 2. LLMs, in spite of the complaints about the research leaders,
| are fairly democratic. I have access to several of the best
| LLMs currently in existence and the ones I can't access haven't
| been polished for general usage anyway. If you make a claim
| with a prompt, it's easy for me to verify it
|
| 3. I've been linked legitimate ChatGPT prompts where someone
| gets incorrect data from ChatGPT - my instinct is to help them
| refine their prompt to get correct data
|
| 4. If you make a claim about these cool new tools (not making a
| claim about what they're good for!) all of these kick in. I
| want to verify, refine, etc.
|
| Of course some people are on the bandwagon and it is akin to
| insulting their religion (it is with religious fervor they hold
| their beliefs!) but at least most folks on hn are just excited
| and trying to engage
|
| ^^ I actually think making this claim is in bad form generally.
| It's like looking for the existence of aliens on a planet.
| Absence of evidence is not evidence of absence
| abm53 wrote:
| Perhaps they are trying to help people get the best out of a
| tool which they themselves find very useful?
| epups wrote:
| If someone comes here and says "<insert programming language>
| cannot do X" and that is wrong, or perhaps outdated, don't you
| feel that the reaction would be similar?
|
| If you are trying to make categorical statements about what AI
| is unable to do, at the very least you should use a state-of-
| the-art system, which conveniently is easily available for
| everyone.
| jacobr1 wrote:
| As someone who has this instinct myself, there is a line of
| reactionism to modern AI/ML that says, "this is just a toy,
| look, it can't do something simple." But it is often the case
| that it _can_ do that thing with either a more advanced model or
| a more built-out system. So the instinct is to try and explain
| that the pessimism is wrong. That we really can push the
| boundary and do more, even if it isn't going to work out of the
| box yet. I react that way against all forms of poppy snipping.
| Jensson wrote:
| Hyping up tech based on what you think it will be able to do
| in the future is the misplaced overhyping that is the
| problem. The issues people say are easy to fix aren't easy to
| fix.
|
| Expect the model to continue to perform like it does today,
| and then lots of dumb integrations added to it, and you will
| get a very accurate prediction of how most of new tech hype
| turns out. Dumb integrations can't add intelligence, but they
| can add a lot of value, so the rational hype still sees this
| as a very valuable and exciting thing, but it isn't a
| complete revolution in its current form.
| Cloudef wrote:
| AI is honestly the wrong word to use. These are ML models, and
| they are able to do only the task they have been specifically
| trained for (not saying the results aren't impressive!). There
| really isn't competition either, as the only people who can
| train these
| giant models are those who have the cash.
| pixl97 wrote:
| >AI is honestly wrong word to use
|
| https://en.wikipedia.org/wiki/AI_effect
|
| Just because you don't like how poorly the term AI is defined,
| doesn't mean it is the wrong term.
|
| AI can never be well defined because the word intelligence
| itself is not well defined.
| TeMPOraL wrote:
| > _These are ML models and they are able to only do the task
| they have been specifically trained for_
|
| Yes, but the models we're talking about have been trained
| specifically on the task of "complete arbitrary textual input
| in a way that makes sense to humans", for _arbitrary textual
| input_, and then further tuned for "complete it as if you were
| a person having conversation with a human", again for arbitrary
| text input - and trained until they could do so convincingly.
|
| (Or, you could say that with instruct fine-tuning, they were
| further trained to _behave as if they were an AI chatbot_ - the
| kind of AI people know from sci-fi. Fake it 'till you make it,
| via backpropagation.)
|
| In short, they've been trained on an open-ended, general task
| of communicating with humans using plain text. That's very
| different to typical ML models which are tasked to predict some
| very specific data in a specialized domain. It's like comparing
| a Python interpreter to Notepad - both are just regular
| software, but there's a meaningful difference in capabilities.
|
| As for seeing glimpses of understanding in SOTA LLMs - this
| makes sense under the compression argument: understanding is
| lossy compression of observations, and this is what the
| training process is trying to force to happen, squeezing more
| and more knowledge into a fixed set of model weights.
| Cloudef wrote:
| Yes, this is why I think the LLM and image generation models
| are still impressive. Knowing they are ML models in the end
| and still produce results that surprise us, makes you
| wonder what we are in the end. Could we essentially simulate
| something similar to us, given enough inputs and parameters in
| the network, enough memory and computing power, and a training
| process that would aim to simulate a human with emotions? I
| would imagine the training process alone would need a bunch of
| other models to teach the final model "concepts" and from there
| perhaps "reasoning".
|
| Why I think AI is not the appropriate term is that if it were
| AI, the AI would have already figured everything out for us (or
| for itself). An LLM can only chain text; it does not really
| understand the content of the text, and it can't come up with
| novel solutions (or if it accidentally does, it's due to
| hallucination). This can be easily confirmed by giving current
| LLMs some simple puzzles, math problems and so on. Image models
| have similar issues.
| iambateman wrote:
| This really is a good article, and is seriously researched. But
| the conclusion in the headline - "AI hype is built on flawed test
| scores" - feels like a poor summary of the article.
|
| It _is_ correct to say that an LLM is not ready to be a medical
| doctor, even if it can pass the test.
|
| But I think a better conclusion is that test scores don't help us
| understand LLM capabilities like we think they do.
|
| Using a human test for an LLM is like measuring a car's "muscles"
| and calling it horsepower. They're just different.
|
| But the AI hype is justified, even if we struggle to measure it.
| aidenn0 wrote:
| Any task that gets solved with AI retroactively becomes something
| that doesn't require reasoning.
| janalsncm wrote:
| I wouldn't say that. Chess certainly requires reasoning even if
| that reasoning is minimax.
|
| I suppose in the context of this article "AI" means statistical
| language models.
| robertlagrant wrote:
| > AI hype is built on high test scores
|
| No, it's built on people using DALLE and Midjourney and ChatGPT.
| yCombLinks wrote:
| Exactly. ChatGPT is double-checking my homework problems and
| pointing out my errors; it's teaching me the material better
| than any of my lectures. It's writing tons of code I'm getting
| paid for, with way less overhead than trying to explain the
| problem to a junior, fewer mistakes, and faster iteration. Test
| scores, ridiculous.
| Kalanos wrote:
| Didn't it perform well on both the SAT and LSAT though?
| mg wrote:
| I don't think the "hype" is built on test scores.
|
| It is built on the observation of how fast AI is getting better. If
| the speed of improvement stays anywhere near the level it was the
| last two years, then over the next two decades, it will lead to
| massive changes in how we work and which skills are valuable.
|
| Just two years ago, I was mesmerized by GPT-3's ability to
| understand concepts:
|
| https://twitter.com/marekgibney/status/1403414210642649092
|
| Nowadays, using it daily in a productive fashion feels completely
| normal.
|
| Yesterday, I was annoyed with how cumbersome it is to play long
| mp3s on my iPad. I asked GPT-4 something like "Write an html page
| which lets me select an mp3, play it via play/pause buttons and
| offers me a field to enter a time to jump to". And the result was
| usable out of the box and is my default mp3 player now.
|
| Two years ago it didn't even dawn on me that this would be my way
| of writing software in the near future. I have been coding for
| over 20 years. But for little tools like this, it is faster to
| ask ChatGPT now.
|
| It's hard to imagine where we will be in 20 years.
| gmerc wrote:
| This.
|
| We are in a Cambrian explosion on the software side, and
| hardware hasn't yet reacted to it. There are a few years of mad
| discovery in front of us.
|
| People have different impressions as to the shape of the curve
| that's going up and to the right, but only a fool would not
| stop and carefully take in what is happening.
| kossTKR wrote:
| Exactly, and things are actually getting crazy now. Pardon the
| tangent, but for some reason this hasn't reached the front page
| on HN yet: https://github.com/OpenBMB/ChatDev
|
| Making your own "internal family system" of AI's is a making
| this exponential (and frightening), like an ensemble on top
| of the ensemble, with specific "mindsets", that with shared
| memory can build and do stuff continuously. Found this from a
| comp sci professor on TikTok, so be warned:
| https://www.tiktok.com/@lizthedeveloper/video/72835773820264...
|
| I remember a couple of comments here on HN when the hype
| began about how some dude thought he had figured out how to
| actually make an AGI - can't find it now, but it was
| something about having multiple AIs with different
| personalities discoursing with a shared memory - and now it
| seems to be happening.
|
| Couple this with access to Linux containers that can be
| spawned on demand, and we are in for a wild ride!
| ChatGTP wrote:
| [flagged]
| dartos wrote:
| I saw chatdev on hn and have been pretty disappointed with
| it :(
|
| Haven't had it make anything usable that's more complicated
| than a mad lib yet
| [deleted]
| happycube wrote:
| I got curious and did this myself. Needed a bit of nudging to
| get where I wanted, but I even had it make an Electron wrapper:
|
| https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...
| dchuk wrote:
| This is awesome, thanks for sharing.
|
| Do you (or anyone) know of any products that allow for
| iterating on the generated output through further chatting
| with the ai? What I mean, is that each subsequent prompt here
| either generated a new whole output, or new chunks to add to
| the output. Ideally, whether generating code or prose, I'd
| want to keep prompting about the generated output and the AI
| further modifies the existing output until it's refined to
| the degree I want it.
|
| Or is that effectively what Copilot/cursor do and I'm just a
| bad operator?
| happycube wrote:
| No problem, it was a fun morning exercise for me :)
|
| Copilot, at least from what little I did in vscode, isn't
| as powerful as this. I think there's a GPT4 mode for it
| that I haven't played with that'd be a _lot_ closer to
| this.
| robertlagrant wrote:
| > Do you (or anyone) know of any products that allow for
| iterating on the generated output through further chatting
| with the ai? What I mean, is that each subsequent prompt
| here either generated a new whole output, or new chunks to
| add to the output. Ideally, whether generating code or
| prose, I'd want to keep prompting about the generated
| output and the AI further modifies the existing output
| until it's refined to the degree I want it.
|
| ChatGPT does this.
| RC_ITR wrote:
| You can check my post history to see how unpopular this point
| of view is, but the big "reveal" that will come up is as
| follows:
|
| The way that LLMs and humans "think" is inherently different.
| Giving an LLM a test designed for humans is akin to giving a
| camera a 'drawing test.'
|
| A camera can make a better narrow final output than a human,
| but it cannot do the subordinate tasks that a human illustrator
| could, like changing shadings, line width, etc.
|
| An LLM can answer really well on tests, but it often fails at
| subordinate tasks like 'applying symbolic reasoning to
| unfamiliar situations.'
|
| Eventually the thinking styles may converge in a way that makes
| the LLMs practically more capable than humans on those
| subordinate tasks, but we are not there yet.
| james-revisoai wrote:
| A lot of the progress in the last 3-4 years was predictable
| from GPT-2 and especially GPT-3 onwards - combining instruction
| following and reinforcement learning with scaling GPT. With
| research being more closed, this isn't so true anymore. The mp3
| case was predictable in 2020 - some early twitter GIFs showed
| vaguely similar stuff. Can you predict what will happen in
| 2026/7 though, with multimodal tech?
|
| I simply don't see it as being the same today. The obvious
| element of scaling, or techniques that imply a useful overlap,
| isn't there. Whereas researchers working toward GPT-3 brought
| together excellent and groundbreaking performance on different
| benchmarks and areas, since 2020 little has been predictable
| except instruction following.
|
| Multimodal could change everything (things like the ScienceQA
| paper suggest so), but also, it might not shift benchmarks.
| It's just not so clear that the future is as predictable, or
| that progress will be faster than in the last few years. I do
| have my own beliefs, similar to Yann LeCun's, about what
| architecture, or rather infrastructure, makes most sense
| intuitively going forward, and there's not really the openness
| we used to have from top labs to know if they are going these
| ways or not. So you are absolutely right that it's hard to
| imagine where we will be in 20 years, but in a strange way,
| because it is much less clear than it was in 2020 where we will
| be three years on, I would say progress is much less guaranteed
| than many feel it is...
| huijzer wrote:
| I was also thinking about how quickly AI may progress and am
| curious for your or other people's thoughts. When estimating
| AI progress, estimating orders of magnitude sounds like the
| most plausible way to do it, just like Moore's law has
| guessed the magnitude correctly for years. For AI, it is
| known that performance increases linearly when the model size
| increases exponentially. Funding currently increases
| exponentially meaning that performance will increase
| linearly. So, AI will increase linearly as long as the
| funding does too. On top of this, algorithms may be made more
| efficient, which may occasionally make an order of magnitude
| improvement. Does this reasoning make sense? I think it does
| but I could be completely wrong.
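|
| As a back-of-the-envelope check on that "exponential size,
| linear gains" intuition, here is a sketch in Python using a
| power-law loss curve of the kind reported in scaling-law papers
| (the constants are illustrative assumptions, not measurements):
|
|     # loss(N) = (Nc / N) ** alpha  -- illustrative constants
|     Nc, alpha = 8.8e13, 0.076
|     for N in [1e8, 1e9, 1e10, 1e11]:  # each step is 10x params
|         print(f"N={N:.0e}  loss={(Nc / N) ** alpha:.2f}")
|     # Each 10x in parameters divides loss by the same constant
|     # factor (10**alpha ~ 1.19): linear gains in log(N).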
| denton-scratch wrote:
| > I was mesmerized by GPT-3's ability to understand concepts
|
| This language embodies the anthropomorphic assumptions that the
| author is attacking.
| dboreham wrote:
| Or the corollary: that there's really no such thing as
| anthropomorphic. There's inputs and outputs, and an
| observer's opinion on how well the outputs relate to the
| inputs. Thing producing the outputs, and the observer, can be
| human or not human. Same difference.
| smcin wrote:
| It absolutely is anthropomorphizing to claim "GPT-3's
| ability to _understand concepts_" rather than simply
| calling it "reproducing, mixing and matching text from an
| enormous corpus". And we can totally legitimately compare to
| a jury
| of human observers' opinions on how well(/badly) the output
| generated relates to the inputs.
|
| For the specific example the OP cited _"War: like being
| eaten by a dragon and then having it spit you out"_
|
| then unless its answer to "Where were you in between being
| eaten by a dragon and before it spat you out?" is "in the
| dragon's digestive system" that isn't understanding.
|
| And I'm curious to see it answer "Dragons only exist in
| mythology; does your analogy mean war doesn't exist either?
| Why not compare to an animal that exists?"
| nomel wrote:
| > "War: like being eaten by a dragon and then having it
| spit you out"
|
| This exact text, and the response (several attempts), is
| flagged and censored in the ChatGPT-4 web interface. :-|
| adrians1 wrote:
| > If the speed of improvement stays anywhere near the level it
| was the last two years, then over the next two decades, it will
| lead to massive changes in how we work and which skills are
| valuable.
|
| That's a big assumption to make. You can't assume that the rate
| of improvement will stay the same, especially over a period of
| 2 decades, which is a very long time. Every advance in
| technology hits diminishing returns at some point.
| mg wrote:
| Why do you think so?
|
| Technological progress seems to be accelerating rather than
| diminishing to me.
|
| Computers are a great example: They have been getting more
| capable exponentially over the last decades.
|
| In terms of performance (memory, speed, bandwidth) and in
| terms of impact. First we had calculators, then we had
| desktop applications, then the Internet and now we have AI.
|
| And AI will help us get to the next stage even faster.
| hashtag-til wrote:
| I'm not putting my coins on these advances.
|
| More likely this will become the new "search" technology
| and get polluted with ads. People will lose trust and it
| will decay.
| Jensson wrote:
| > Two years ago it didn't even dawn on me that this would be my
| way of writing software in the near future
|
| So you were ignorant two years ago; GitHub Copilot was already
| available to users back then. The only big new thing of the past
| two years was GPT-4, and nothing suggests anything similar will
| come in the next two years. There are no big new things on the
| horizon; we knew for quite a while that GPT-4 was coming, but
| there isn't anything like that this time.
| mg wrote:
| Copilot was not around when I wrote the Tweet.
|
| But when Copilot came out, I was indeed ignorant! I remember
| when a friend showed it to me for the first time. I was like
| "Yeah, it outputs almost correct boilerplate code for you.
| But thankfully my coding is such that I don't have to write
| boilerplate". I didn't expect it to be able to write fully
| functional tools and understand them well enough to actually
| write pretty _nice_ code!
|
| Regarding "there isn't anything like that this time": quite
| the opposite! We have not figured out where using larger
| models and throwing more data at them will level off! This
| could go on for quite a while. With FSD 12, Tesla is already
| testing self-driving with a single large neural net, without
| any glue code. I am super curious how that will turn out.
|
| The whole thing is just starting.
| Jensson wrote:
| Well, my point is that you perceive progress to be fast
| since you went from not understanding what existed to later
| getting in on it. That doesn't mean progress was that fast;
| it means that you just discovered a new domain.
|
| Trying to extrapolate actual progress is bad in itself, but
| trying to extrapolate your perceived progress is even
| worse.
| james-revisoai wrote:
| Yeah, you have hit the nail on the head here. A lot was
| predictable from seeing that GPT-2 could reasonably stay
| within language and generate early coherent structures;
| that, coming at the same time as instruction-following with
| the T5 work and the widespread use of embeddings from BERT,
| told us this direction was likely. It's just that for many
| people this came to awareness in 2021/22 rather than over the
| 2018-2020 ramp-up the field/hobbyists experienced.
| [deleted]
| isaacfung wrote:
| Whisper, Stable Diffusion, VoiceBox, GPT-4 vision, DALL-E 3
|
| Other breakthroughs in graph machine learning
| https://towardsdatascience.com/graph-ml-in-2023-the-state-of...
| Jensson wrote:
| Those are image/voice generation; the topic is about the
| potential replacement of knowledge workers such as coders.
| The discussion about image/voice generation is a very
| different topic, since nobody thinks those are moving
| towards AGI and nobody argued they were "conscious", etc.
| isaacfung wrote:
| [flagged]
| rob74 wrote:
| The article doesn't say that LLMs aren't useful - the "hype"
| they mean is overestimating their capabilities. An LLM may be
| able to pass a "theory of mind" test, or it may fail
| spectacularly, depending on how you prompt it. And that's
| because, despite all of its training data, it's not capable of
| actually _reasoning_. That may change in the future, but we're
| not there yet, and (AFAIK) nobody can tell how long it will
| take to get there.
| YetAnotherNick wrote:
| > it's not capable of actually reasoning
|
| Define reasoning. Because by my definition GPT-4 can reason
| without doubt. It definitely can't reason better than experts
| in the field, but it can reason better than, say, interns.
| mcguire wrote:
| What is your definition?
| YetAnotherNick wrote:
| If it can solve basic logic problems, then it can reason.
| And if it can write code for a new game with new logic, then
| it can reason for sure.
|
| Example of basic problem: In a shop, there are 4 dolls of
| different heights P,Q,R and S. S is neither as tall as P
| nor as short as R. Q is shorter than S but taller than R.
| If Kittu wants to purchase the tallest doll, which one
| should she purchase? Think step by step.
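|
| The puzzle is small enough to brute-force; a quick Python check
| of the stated constraints (orderings run shortest to tallest):
|
|     from itertools import permutations
|
|     # "S is neither as tall as P nor as short as R" -> P > S > R
|     # "Q is shorter than S but taller than R"       -> S > Q > R
|     for order in permutations("PQRS"):
|         rank = {d: i for i, d in enumerate(order)}
|         if (rank["P"] > rank["S"] > rank["R"]
|                 and rank["S"] > rank["Q"] > rank["R"]):
|             print(order)
|     # only ('R', 'Q', 'S', 'P') survives: P is the tallest doll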
| kaibee wrote:
| Seems to handle it easily.
| https://chat.openai.com/share/4d8ab2af-f824-44c8-9311-e3893c...
| NovemberWhiskey wrote:
| I don't have access to GPT 4 but I'd be interested to see
| how it does on a question like this:
|
| _"Say I have a container with 50 red balls and 50 blue
| balls, and every time I draw a blue ball from the
| container, I add two white balls back. After drawing 100
| balls, how many of each different color ball are left in
| the container? Explain why."_
|
| ... because on GPT 3.5 the answer begins like the below and
| then gets worse:
|
| _"Let's break down the process step by step:
|
| Initially, you have 50 red balls and 50 blue balls in the
| container.
|
| 1) When you draw a blue ball from the container, you remove
| one blue ball, and you add two white balls back. So, after
| drawing a blue ball, you have 49 blue balls (due to
| removal) and you add 2 white balls, making it a total of 52
| white balls (due to addition) ..."_
|
| If I was hiring interns this dumb, I'd be in trouble.
|
| EDIT: judging by the GPT-4 responses, I remain of the
| opinion I'd be in trouble if my interns were this dumb.
| magospietato wrote:
| GPT-4:
|
| ...the composition of what's left in the container.
|
| There's a couple of scenarios, which depend on when you
| run out of blue balls:
|
| 1) If you draw all 50 blue balls first, for every blue
| ball you draw, you are adding 2 white balls. Hence, once
| you've drawn all the blue balls, you have replaced them
| with 100 white balls. This means you've drawn 50 balls
| total and there's now 100 white balls and 50 red balls
| (150 balls) in the container. Then, you continue and draw
| the next 50 balls. These could be any combination of
| white and red balls, we don't know which color is
| preferred when you continue drawing.
|
| 2a) If you draw all red balls in these next 50 draws, you
| would have 50 red balls out and 50 blue balls out. So,
| you would end with 100 white balls in the container.
|
| 2b) If you draw all white balls in these next 50 draws
| you would have 50 red balls, 50 blue balls, and 50 white
| balls out. So, you would end with 50 white balls and 50
| red balls in the container.
|
| 2c) If you draw a mixture of red and white balls in these
| next 50 draws, the number leftover in the container would
| vary according to the specific combination drawn.
|
| Remember, the order in which the balls are drawn matters
| to this question since we have a rule in place that
| changes the overall total (or population) of balls within
| the container. In each of these scenarios, you've drawn
| 50 times initially (all blue) and then 50 times again (in
| some unknown combination of red and white). You've drawn
| 100 times total and changed the number of white balls in
| the container from zero to an amount dependent on how
| many times you drew a white ball on your second round of
| 50 draws.
| NovemberWhiskey wrote:
| Yeah, that's still pretty much nonsense, isn't it?
|
| _2b) If you draw all white balls in these next 50 draws
| you would have 50 red balls, 50 blue balls, and 50 white
| balls out. So, you would end with 50 white balls and 50
| red balls in the container._
|
| ... so after removing 100 balls, I've removed 150 balls?
| And the 150 balls that I've removed are red, white and
| blue despite the fact that I removed 50 blue balls
| initially and then 50 white ones.
| mountainriver wrote:
| Just because it fails one test in a particular way
| doesn't mean it lacks reasoning entirely. It clearly does
| have reasoning, based on all the benchmarks it passes.
|
| You are really trying to make it not have reasoning for
| your own benefit.
| kenjackson wrote:
| I asked GPT4 and it gave a similar response. So then I
| asked my wife and she said, "do you want more white balls
| at the end or not?" And I realized that as a CS or math
| question we assume that the draw is random. Other people
| assume that you're picking which ball to draw.
|
| So I clarified to ChatGPT that the drawing is random. And
| it replied: "The exact numbers can vary based on the
| randomness and can be precisely modeled with a simulation
| or detailed probabilistic analysis."
|
| I asked for a detailed probabilistic analysis and it
| gives a very simplified analysis. And then basically says
| that a Monte Carlo approach would be easier. That
| actually sounds more like most people I know than most
| people I know. :-)
| YetAnotherNick wrote:
| This is such a flawed puzzle. And GPT-4 answers it
| correctly. It is a long answer, but the last sentence is
| "This is one possible scenario. However, there could be
| other scenarios based on the order in which balls are
| drawn. But in any case, the same logic can be applied to
| find the number of each color of ball left in the
| container."
| NovemberWhiskey wrote:
| The ability to identify that there isn't a simple
| closed-form result is actually a key component of reasoning.
| Can you stick the answer it gives in a gist or something?
| The GPT-3.5 response is pure, self-contradictory word salad,
| delivered of course in a highly confident tone.
| YetAnotherNick wrote:
| https://pastebin.com/r9bNi8GD
|
| GPT 4 goes into detail about one example scenario, which
| most humans wouldn't do, but it is a technically correct
| answer, as it said it depends on the order.
| NovemberWhiskey wrote:
| But the reasoning is total garbage, right?
|
| It says the number of blue balls drawn is _x_ and the
| number of red balls drawn is _y_, and then asserts
| _x + y = 100_, which is wrong.
|
| Then it proceeds to "solve" an equation which reduces to
| _x = x_ to conclude _x = 0_.
|
| It then uses that to "prove" that _y_ = 100, which is a
| problem as there are only 50 red balls in the container
| and nothing causes any more to be added.
|
| It's like "mistakes bad students make in Algebra 1".
| albedoa wrote:
| > but it is technically correct answer as it said it
| depends on the order.
|
| It should give you pause that you had to pick not only
| the _line_ by which to judge the answer but the _part_ of
| the line. The sentence immediately before that is
| objectively wrong:
|
| > This is one possible scenario.
| Jensson wrote:
| Its answer isn't correct, this isn't a possible ending
| scenario:
|
| - *Ending Scenario:* - Red Balls (RB): 0 (all have been
| drawn) - Blue Balls (BB): 50 (none have been drawn) -
| White Balls (WB): 0 (since no blue balls were drawn, no
| white balls were added) - Total Balls: 50
| sebzim4500 wrote:
| I don't understand the question. Surely the answer depends
| on which order you withdraw balls in? Is the idea that you
| blindly withdraw a ball at every step, and you are asking
| for the expected number of balls of each color at the end
| of the process?
|
| Seems like quite a difficult question to compute exactly.
|
| I reworded the question to make it clearer and then it
| was able to simulate a bunch of scenarios as a Monte
| Carlo simulation. Was your hope to calculate it exactly
| with dynamic programming? GPT-4 was not able to do this,
| but I suspect neither could a lot of your interns.
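|
| (For what it's worth, the exact expectation looks tractable
| without simulation: memoize over reds left, blues left, and
| draws made, since the white count is determined by those
| three. The sketch below assumes the setup discussed
| upthread, 50 red plus 50 blue to start, each drawn blue
| adding two whites, 100 draws; it is my code, not GPT-4's.)
|
|   from functools import lru_cache
|
|   R0, B0, DRAWS = 50, 50, 100
|
|   def whites(r, b, k):
|       # whites added = 2 * blues drawn; whites drawn =
|       # total draws - red draws - blue draws
|       return 2 * (B0 - b) - (k - (R0 - r) - (B0 - b))
|
|   @lru_cache(maxsize=None)
|   def expected(r, b, k):
|       # expected (red, blue, white) left at the end, from here
|       w = whites(r, b, k)
|       if k == DRAWS:
|           return (r, b, w)
|       n = r + b + w
|       out = [0.0, 0.0, 0.0]
|       # branch on drawing a red, a blue, or a white next
|       for cnt, nxt in ((r, (r - 1, b)), (b, (r, b - 1)),
|                        (w, (r, b))):
|           if cnt:
|               for i, v in enumerate(expected(*nxt, k + 1)):
|                   out[i] += cnt / n * v
|       return tuple(out)
|
|   print(expected(R0, B0, 0))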
| NovemberWhiskey wrote:
| > _I don't understand the question. Surely the answer
| depends on which order you withdraw balls in? Is the idea
| that you blindly withdraw a ball at every step, and you are
| asking for the expected number of balls of each color at
| the end of the process?_
|
| These are very good questions that anyone with the
| ability to reason would ask if given this problem.
| rafark wrote:
| GPT 3.5 is VERY dumb when compared to GPT 4. Like, the
| difference is massive.
| Jensson wrote:
| GPT 4 still does a lot of dumb stuff on this question,
| you see several people post outright wrong answers and say
| "Look how gpt-4 solved it!". That happens quite a lot in
| these discussions, so it seems like the magic to get
| gpt-4 to work is that you just don't check its answers
| properly.
| kaibee wrote:
| https://chat.openai.com/share/a9806bd1-e5a9-4fea-981b-284
| 3e6...
|
| Took a bit of massaging and I enabled the Data Analysis
| plugin which lets it write python code and run it. It
| looks like the simulation code is correct though.
| Kim_Bruning wrote:
| I came at it from a different angle. The simulation code
| in my case had a bug which I needed to point out. Then it
| got a similar final answer.
| NovemberWhiskey wrote:
| > _Let's assume you draw x blue balls in 100 draws. Then
| you would have drawn 100-x red balls._
|
| Uhm.
| pertique wrote:
| This is what I got on a basically brand new OpenAI
| account: https://chat.openai.com/share/5199c972-478d-406f
| -9092-061a6b...
|
| All told, I'd say it's a decent answer.
|
| Edit: I took it to completion: https://chat.openai.com/c/6
| cdd92f1-487a-4e1c-ab94-f2bdbf282d...
|
| These were the first responses each time, with no
| massaging/retries/leading answers. I will say it's not
| entirely there. I re-ran the initial question a few times
| afterwards and one was basically gibberish.
| [deleted]
| szundi wrote:
| Really?
| _trapexit wrote:
| It's not reasoning. It's word prediction. At least at the
| individual model level. OpenAI is likely using a collection
| of models.
| denton-scratch wrote:
| > And that's because, despite all of its training data, it's
| not capable of actually reasoning. That may change in the
| future [...]
|
| I don't think so. When you say "it's not capable of actually
| reasoning", that's because it's a LLM; and if it "changes in
| the future", that's because the new system must no longer be
| a pure LLM. The appearance of reasoning in LLMs is an
| illusion.
| kromem wrote:
| That's not the case. It's very much in the realm of "we don't
| know what's going on in the network."
|
| Rather than a binary, it's much more likely a mix of
| factors: basic reasoning capabilities developed from the
| training data (much like the board representations and
| state-tracking abilities that developed from feeding board
| game moves into a toy model in Othello-GPT), as well as
| statistics-driven autocomplete.
|
| In fact, when I've seen GPT-4 get hung up on logic puzzle
| variations such as transparency, it often seems like the
| latter is overriding the former. Changing tokens to emoji
| representations, or having it always repeat the adjectives
| attached to nouns so it preserves the variation's context,
| gets it over the hump to reproducible solutions (as would
| be expected from a network capable of reasoning); by
| default, though, it falls into the pattern of the
| normative cases.
|
| For something as complex as SotA neural networks, sweeping
| binary statements seem rather unlikely to actually be
| representative...
| nsagent wrote:
| As a PhD student in NLP who's graduating soon, my
| perspective is that language models do not demonstrate
| "reasoning" in the way most people colloquially use the
| term.
|
| These models have no capacity to plan ahead, which is a
| requirement for many "reasoning" problems. If it's not in
| the context, the model is unlikely to use it for predicting
| the next token. That's why techniques like chain-of-thought
| are popular; they cause the model to parrot a list of facts
| before making a decision. This increases the likelihood
| that the context might contain parts of the answer.
|
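| (A minimal sketch of that, using the openai Python
| client's ChatCompletion API as it stands in 2023; the
| model choice and question are just placeholders:)
|
|   import openai  # assumes OPENAI_API_KEY is set in the environment
|
|   q = ("A bat and a ball cost $1.10 in total. The bat costs "
|        "$1.00 more than the ball. How much does the ball cost?")
|
|   def ask(prompt):
|       resp = openai.ChatCompletion.create(
|           model="gpt-4",
|           messages=[{"role": "user", "content": prompt}],
|       )
|       return resp["choices"][0]["message"]["content"]
|
|   direct = ask(q)
|   # chain-of-thought: have the model emit intermediate facts
|   # first, so the final answer can condition on them in context
|   cot = ask(q + "\n\nLet's think step by step, then answer.")
|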
| Unfortunately, this means the "reasoning" exhibited by
| language models is limited: if the training data does not
| contain a set of generalizable text applicable to a
| particular domain, a language model is unlikely to make a
| correct inference when confronted with a novel version of a
| similar situation.
|
| That said, I do think adding reasoning capabilities is an
| active area of research, but we don't have a clear time
| horizon on when that might happen. Current prompting
| approaches are stopgaps until research identifies a
| promising approach for developing reasoning, e.g. combining
| latent space representations with planning algorithms over
| knowledge bases, constraining the logits based on an
| external knowledge verifier, etc. (these are just random
| ideas, not saying they are what people are working on,
| rather are examples of possible approaches to the problem).
|
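| (A toy version of the logit-constraining idea: mask the
| next-token logits to whatever an external verifier allows
| before sampling. Hypothetical sketch, not a real system:)
|
|   import numpy as np
|
|   def constrain(logits, allowed_ids):
|       # tokens the verifier rejects get probability zero
|       masked = np.full_like(logits, -np.inf)
|       masked[allowed_ids] = logits[allowed_ids]
|       return masked
|
|   logits = np.array([2.0, 0.5, -1.0, 1.5])
|   p = np.exp(constrain(logits, [0, 3]))
|   p /= p.sum()  # only tokens 0 and 3 remain sampleable
|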
| In my opinion, language models have been good enough since
| the GPT-2 era, but have been held back by a lack of
| reasoning and efficient memory. Making the language models
| larger and trained on more data helps make them more useful
| by incorporating more facts with increased computational
| capacity, but the approach is fundamentally a dead end for
| higher level reasoning capability.
| Kim_Bruning wrote:
| > These models have no capacity to plan ahead
|
| How would you describe the behavior of "GPT Advanced Data
| Analysis"?
| Der_Einzige wrote:
| Are you going to school in Langley, Virginia?
| NovemberWhiskey wrote:
| NSA is more commonly associated with Fort Meade, MD, for
| what that's worth.
| visarga wrote:
| > if the training data does not contain a set of
| generalizable text applicable to a particular domain, a
| language model is unlikely to make a correct inference
| when confronted with a novel version of a similar
| situation.
|
| True. But look at the Phi-1.5 model - it punches 5x above
| its weight. The trick is in the dataset:
|
| > Our training data for phi-1.5 is a combination of
| phi-1's training data (7B tokens) and newly created
| synthetic, "textbook-like" data (roughly 20B tokens) for
| the purpose of teaching common sense reasoning and
| general knowledge of the world (science, daily
| activities, theory of mind, etc.). We carefully selected
| 20K topics to seed the generation of this new synthetic
| data. In our generation prompts, we use samples from web
| datasets for diversity. We point out that the only non-
| synthetic part in our training data for phi-1.5 consists
| of the 6B tokens of filtered code dataset used in phi-1's
| training (see [GZA+ 23]).
|
| > We remark that the experience gained in the process of
| creating the training data for both phi-1 and phi-1.5
| leads us to the conclusion that the creation of a robust
| and comprehensive dataset demands more than raw
| computational power: It requires intricate iterations,
| strategic topic selection, and a deep understanding of
| knowledge gaps to ensure quality and diversity of the
| data. We speculate that the creation of synthetic
| datasets will become, in the near future, an important
| technical skill and a central topic of research in AI.
|
| https://arxiv.org/pdf/2309.05463.pdf
|
| Synthetic data has its advantages: less bias, more
| diversity, scalability, higher average quality. But more
| importantly, it can cover all the permutations and
| combinations of skills, concepts, and situations. That's
| why a small model of just 1.5B parameters like Phi was
| able to work like a 7B model; usually models at that
| scale are not coherent.
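|
| A crude illustration of that coverage point (the topic and
| angle lists here are made up, not Phi's actual seeds):
|
|   from itertools import product
|
|   topics = ["photosynthesis", "supply and demand",
|             "theory of mind"]
|   angles = ["a simple explanation", "a worked example",
|             "a common misconception"]
|
|   # each (topic, angle) pair seeds one "textbook-like"
|   # generation prompt; permuting the lists is what buys
|   # coverage that raw web text lacks
|   prompts = [f"Write a short textbook passage about {t}, "
|              f"featuring {a}." for t, a in product(topics, angles)]
|   print(len(prompts), prompts[0])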
| smaddox wrote:
| > And that's because, despite all of its training data, it's
| not capable of actually reasoning.
|
| Your conclusion doesn't follow from your premise.
|
| None of these models are trained to do their best on any kind
| of test. They're just trained to predict the next word. The
| fact that they do well at all on tests they haven't seen is
| miraculous, and demonstrates something very akin to
| reasoning. Imagine how they might do if you actually trained
| them or something like them to do well on tests, using
| something like RL.
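|
| For concreteness, "predict the next word" means minimizing
| a loss like the following (numpy sketch); note there is no
| test-taking objective anywhere in it:
|
|   import numpy as np
|
|   def next_token_loss(logits, targets):
|       # logits: (seq_len, vocab) scores for the next token;
|       # targets: (seq_len,) the token that actually came next
|       z = logits - logits.max(axis=-1, keepdims=True)
|       logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
|       return -logp[np.arange(len(targets)), targets].mean()
|
|   logits = np.random.randn(3, 4)  # toy: 4-token vocabulary
|   targets = np.array([1, 3, 0])
|   print(next_token_loss(logits, targets))  # training minimizes this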
| Jensson wrote:
| > None of these models are trained to do their best on any
| kind of test
|
| How do you know GPT-4 wasn't trained to do well on these
| tests? They didn't disclose what they did for it, so you
| can't say it wasn't trained to do well on these tests. That
| could be the magic sauce for it.
| thesz wrote:
| >The fact that they do well at all on tests they haven't
| seen
|
| Haven't they seen these tests?
|
| We know little to nothing of how these models get trained.
| mcguire wrote:
| " _When Horace He, a machine-learning engineer, tested GPT-4 on
| questions taken from Codeforces, a website that hosts coding
| competitions, he found that it scored 10/10 on coding tests
| posted before 2021 and 0/10 on tests posted after 2021. Others
| have also noted that GPT-4's test scores take a dive on material
| produced after 2021. Because the model's training data only
| included text collected before 2021, some say this shows that
| large language models display a kind of memorization rather than
| intelligence._"
|
| I'm sure that is just a matter of prompt engineering, though.
___________________________________________________________________
(page generated 2023-10-10 23:00 UTC)