[HN Gopher] AI hype is built on flawed test scores
       ___________________________________________________________________
        
       AI hype is built on flawed test scores
        
       Author : antondd
       Score  : 135 points
       Date   : 2023-10-10 09:20 UTC (13 hours ago)
        
 (HTM) web link (www.technologyreview.com)
 (TXT) w3m dump (www.technologyreview.com)
        
       | javier_e06 wrote:
       | As a developer when I work with ChatGPT I can see ChatGPT
       | eventually taking over my JIRA stories. Then ChatGPT will take
       | over management creating product roadmaps, prioritizing and
       | assigning tasks to itself. All dictated by customer feedback. The
       | clock is ticking. But reasoning like a human? No.
        
       | janalsncm wrote:
       | The only test I need is the amount of time it takes me to do
       | common tasks with and without ChatGPT. I'm aware it's not perfect
       | but perfect was never necessary.
        
       | aldousd666 wrote:
       | Only idiots are basing their excitement about what's possible on
       | those test scores. They're just an attempt to measure one bot
       | against another. There is a strong possibility that they are only
       | measuring how well the bot takes the test, and nothing at all
       | about what the tests themselves purport to measure. I mean, those
       | tests are probably similar to stuff that's in the training data.
        
         | ehutch79 wrote:
         | Yeah... there's a lot of idiots out there.
        
       | epups wrote:
       | I think ironically there has been an "AI-anti-hype hype", with
       | people like Gary Marcus trying to blow up every single possible
       | issue into a deal breaker. Most of the claims in this article are
       | based on tests performed only on GPT-3, and researchers often
       | seem to make tests in a way that proves their point - see an
       | earlier comment from me here with an example:
       | https://news.ycombinator.com/item?id=37503944
       | 
       | I agree there have been many attention-grabbing headlines that are
       | due to simple issues like contamination. However, I think AI has
       | already proved its business value far beyond those issues, as
       | anyone using ChatGPT with a code base not present in their
       | dataset can attest.
        
         | smcl wrote:
         | I think some amount of that is necessary though, no? We have
         | people claiming that this generation of AI will replace jobs -
         | and plenty of companies have taken the bait and tried to get
         | started with LLM-based bots. We even had a pretty high-profile
         | case of a Google AI engineer going public with claims that
         | their LaMDA AI was sentient. Regardless of what you think of
         | that individual or Google's AI efforts, this resonates with the
         | public. Additionally a pretty common sentiment I've seen has
         | been non-tech people suggesting AI should handle content
         | moderation - the idea being that since they're not human and
         | don't have "feelings" they won't have biases and won't attempt
         | to "silence" any one political group (without realising that
         | bias can be built in via the training data).
         | 
         | It seems pretty important to counter that and to debunk any
         | wild claims such as these. To provide context and to educate
         | the world on their shortcomings.
        
           | epups wrote:
           | I think skepticism is always welcome and we should continue
           | to explore what LLMs can and cannot do. However, what I'm
           | referring to is trying to get a quick win by defeating some
           | inferior version of GPT or trying to apply a test which you
           | don't even expect most humans to pass.
           | 
           | The article is actually fine and pretty balanced, but it is a
           | bit unfortunate that 80% of their examples are not
           | illustrative of current capabilities. At least for me, most
           | of my optimism about the utility of LLMs comes from GPT-4
           | specifically.
        
       | danielvaughn wrote:
       | I remember watching a documentary about an old blues guitar
       | player from the 1920's. They were trying to learn more about him
       | and track down his whereabouts during certain periods of his
       | life.
       | 
       | At one point, they showed some old footage which featured a
       | montage of daily life in a small Mississippi town. You'd see
       | people shopping for groceries, going on walks, etc. Some would
       | stop and wave at the camera.
       | 
       | In the documentary, they noted that this footage exists because
       | at the time, they'd show it on screen during intermission at
       | movie theaters. Film was still in its infancy at that time, and
       | was so novel that people loved seeing themselves and other people
       | on the big screen. It was an interesting use of a new technology,
       | and today it's easy to understand why it died out. Of course, it
       | likely wasn't obvious at the time.
       | 
       | I say all that because I don't think we can _know_ at this point
       | what AI is capable of, and how we want to use it, but we should
       | expect to see lots of failure while we figure it out. Over the
       | next decade there's undoubtedly going to be countless ventures
       | similar to the "show the townspeople on the movie screen" idea,
       | blinded by the novelty of technological change. But failed
       | ventures have no relevance to the overall impact or worth of the
       | technology itself.
        
         | kenjackson wrote:
         | What died out? Film?
        
           | actionfromafar wrote:
           | Showing locals little movie clips of themselves in
           | intermissions at the local theater.
        
           | savanaly wrote:
           | >What died out?
           | 
           | The custom of showing film consisting of footage of the
           | general public in movie theaters.
        
           | danielvaughn wrote:
           | The practice of filming a montage around your local
           | neighborhood or town to play during intermission. Though you
           | could say intermission as well, since that was a legacy
           | concept that was inherited from plays and eventually died out
           | as well.
        
         | iFire wrote:
         | Selfies and 15 second videos still exist as shorts and tiktoks.
        
         | mattkrause wrote:
         | > it's easy to understand why it died out
         | 
         | I think it's probably more sociological than technical. People
         | _love_ to see themselves and their friends/family. My work has
         | screens that show photos of events and it always causes a bit
         | of a stir ("Did you see X's photo from the summer picnic?")
         | Yearbooks are perennially popular and there's a whole slew of
         | social media.
         | 
         | However, for this to be "fun", there must be a decent chance
         | that most people in the audience know a few people in a few of
         | the pictures. I can't imagine this working well in a big city,
         | for example, or a rural theatre that draws from a huge area.
        
       | refulgentis wrote:
       | This is my favorite new AI argument, took me a few months to see
       | it. Enjoyed it at first.
       | 
       | You start with everyone knows there's AI hype from tech bros.
       | Then you introduce a PhD or two at institutions with good names.
       | Then they start grumbling about anthropomorphizing and who knows
       | what AI is anyway.
       | 
       | Somehow, if it's long enough, you forget that this kind of has
       | nothing to do with anything. There is no argument. Just imagining
       | other people must believe crazy things and working backwards from
       | there to find something to critique.
       | 
       | Took me a bit to realize it's not even an argument, just
       | parroting "it's a stochastic parrot!" Assumes other people are
       | dunces and genuinely believe it's a minihuman. I can't believe
       | MIT Tech Review is going for this, the only argument here is the
       | tests are flawed if you think they're supposed to show the AI
       | model is literally human.
        
       | PeterisP wrote:
       | It's not built on high test scores - while academics do benchmark
       | models on various tests, all the many people who built up the
       | hype mostly did it based on their personal experience with a
       | chatbot, not by running some long (and expensive) tests on those
       | datasets.
       | 
       | The tests are used (and, despite their flaws, useful) to compare
       | various facets of model A to model B - however, the validation
       | of whether a model is _good_ now comes from users, and that
       | validation really can't be flawed much - if it's helpful (or
       | not) to someone, then it is what it is, the proof of the pudding
       | is in the eating.
        
       | rvz wrote:
       | Most of the hype comes from the AI grifters who need to find the
       | next sucker to dump their VC shares onto - the next greater fool
       | to purchase their ChatGPT-wrapper snake oil project at an
       | overvalued asking price.
       | 
       | The ones who have to dismantle the hype are the proper
       | technologists, such as Yann LeCun and Grady Booch, who know
       | exactly what they are talking about.
        
       | rahimnathwani wrote:
       | "People have been giving human intelligence tests--IQ tests and
       | so on--to machines since the very beginning of AI," says Melanie
       | Mitchell, an artificial-intelligence researcher at the Santa Fe
       | Institute in New Mexico. "The issue throughout has been what it
       | means when you test a machine like this. It doesn't mean the same
       | thing that it means for a human."
       | 
       | The last sentence above is an important point that most people
       | don't consider.
        
         | api wrote:
         | It seems a bit like having a human face off in a race against a
         | car and then concluding that cars have exceeded human physical
         | dexterity.
         | 
         | It's not an apples/apples comparison. The nature of the
         | capability profile of a human vs. any known machine is
         | radically different. Machines are intentionally designed to
         | have extreme peaks of performance in narrow areas. Present-
         | generation AI might be wider in its capabilities than what
         | we've previously built, but it's still rather narrow as you
         | quickly discover if you start trying to use it on real tasks.
        
       | waynenilsen wrote:
       | This article is absurd.
       | 
       | > But when a large language model scores well on such tests, it
       | is not clear at all what has been measured. Is it evidence of
       | actual understanding? A mindless statistical trick? Rote
       | repetition?
       | 
       | It is measuring how well it does _at REPLACING HUMANS_. It is
       | hard to believe that the author does not understand this.
       | I don't care how it obtains its results.
       | 
       | GPT-4 is like a hyperspeed entry- to mid-level dev that has
       | almost no ability to contextualize. Tools built on top of the
       | 32k context window will allow
       | repo ingestion.
       | 
       | This is the worst it will ever be.
        
         | dartos wrote:
         | Which tests test specifically for "replacing humans?" That
         | seems like a wild metric to try and capture in a test.
         | 
         | Also an aside:
         | 
         | > This is the worst it will ever be.
         | 
         | I hear this a lot and it really bothers me. Just because
         | something is the worst it'll ever be doesn't mean it'll get
         | much better. There could always be a plateau on the horizon.
         | 
         | It's akin to "just have faith." A real weird sentiment that I
         | didn't notice in tech before 2021.
        
         | RandomLensman wrote:
         | It is measuring how well it does replacing humans - in those
         | tests.
        
       | GuB-42 wrote:
       | I don't think test scores have anything to do with the hype. Most
       | people don't even realize test scores exist.
       | 
       | One is just the wow factor. It will be short-lived. A bit like VR,
       | which is awesome when you first try it, but it wears out quickly.
       | Here, you can have a bot write convincing stories and generate
       | nice looking images, which is awesome until you notice that the
       | story doesn't make sense and that the images have many details
       | wrong. This is not just a score, it is something you can see and
       | experience.
       | 
       | And there is also the real thing. People start using GPT for real
       | work. I have used it to document my code for instance, and it
       | works really well, with it I can do a better job than without,
       | and I can do it faster. Many students use it to do their
       | homework, which may not be something you want, but it is no less of
       | a real use. Many artists are strongly protesting against
       | generative AI, this in itself is telling, it means it is taken
       | seriously, and at the same time, other artists are making use of
       | it.
       | 
       | It is even used to great effect where you don't notice it. Phone
       | cameras are a good example: by enhancing details using AI, they
       | give you much better pictures than what the optics are capable
       | of. Some people don't like that because the pictures are "not
       | real", but most enjoy the better perceived quality. Then, there
       | are image classifiers, speech-to-text and OCR, fuzzy searching,
       | content ranking algorithms we love to hate, etc... that all make
       | use of AI.
       | 
       | Note: here AI = machine learning with neural networks, which is
       | what the hype is about. AI is a vague term that can mean just
       | about anything.
        
         | Jensson wrote:
         | > I don't think test scores have anything to do with the hype.
         | Most people don't even realize test scores exist.
         | 
         | They put the test scores front and center in the initial
         | announcement with a huge image showing improvements on AP
         | exams, it was the main thing people talked about during the
         | announcement and the first thing anyone who read anything about
         | gpt-4 sees.
         | 
         | I don't think many who are hyped about these things missed
         | that.
         | 
         | https://openai.com/research/gpt-4
        
       | bondarchuk wrote:
       | > _But there's a problem: there is little agreement on what those
       | results really mean. Some people are dazzled by what they see as
       | glimmers of human-like intelligence; others aren't convinced one
       | bit._
       | 
       | I find the whole hype & anti-hype dynamic so tiresome. Some are
       | over-hyping, others are responding with over-anti-hyping.
       | Somewhere in-between are many reasonable, moderate and caveated
       | opinions, but neither the hypesters or anti-hypesters will listen
       | to these (considering all of them to come from people at the
       | opposite extreme), nor will outside commentators (somehow being
       | unable to categorize things as anything more complicated than
       | this binary).
        
         | Closi wrote:
         | Depends on whether the hype is invalid - let's remember that "There
         | will be a computer in every home!" was once considered hype.
         | 
         | There is a possible world where AI will be a truly
         | transformative technology in ways we can't possibly understand.
         | 
         | There is a possible world where this tech fizzles out.
         | 
         | So one of the reasons that there is a broad 'hype' dynamic here
         | is because the range of possibilities is broad.
         | 
         | I sit firmly in the first camp though - I believe it's truly a
         | transformative technology, and struggle to see the perspective
         | of the 'anti-hype' crowd.
        
           | TerrifiedMouse wrote:
           | I'm in the second camp. To every hyped up tech, all I can say
           | is "prove it". Give me actual real world results.
           | 
           | There are millions of hustlers out there pushing snake oil.
           | The probability that something is the real deal and not snake
           | oil is small. Better to assume the glass is half empty.
        
             | Closi wrote:
             | There will be millions of hustlers regardless of whether the
             | technology is transformative or not.
             | 
             | The invention of the PC market was filled with hustlers but
             | that doesn't mean that the PC didn't match the hype.
             | 
             | The .com boom was filled with hustlers, but that doesn't
             | mean that the Internet wasn't transformative.
             | 
             | Actual real world results... well the technology is already
             | responsible for c40% of code on GitHub. Image recognition
             | technologies are soaring and self-driving feels within
             | reach. Few people doubt that a real-world Jarvis will be in
             | your home within 12 months. The Turing test is smashed, and
             | LLMs are already replacing live chat operatives. And this
             | is just the start of the technology...
        
               | TerrifiedMouse wrote:
               | > The .com boom was filled with hustlers, but that
               | doesn't mean that the Internet wasn't transformative.
               | 
               | But a lot of .com projects were BS. If you were to pick
               | at random, the probability you got a winner is low. Thus
               | it's wise to be skeptical of all hyped stuff until it has
               | proven itself.
               | 
               | > Actual real world results... well the technology is
               | already responsible for c40% of code on Github.
               | 
               | Quite sure you misread that article. It says 40% of the
               | code checked in by people who use Copilot is AI-
               | generated. Not 40% of all code.
               | 
               | That's how some programmers are I guess. I have heard of
               | people copy pasting code directly from stack overflow
               | without a second thought about how it works. That's
               | probably Copilot's audience.
        
               | Closi wrote:
               | I think your reasoning is flawed - the fact a lot of .com
               | projects were BS does not imply that the underlying
               | technology (the internet) wasn't transformative.
               | 
               | Are we really saying that people who were saying the
               | internet was a transformative technology in the
               | mid-1990's were wrong? It _was_ transformative, but it
               | was hard to see which parts of the technology would stick
               | around. Of course it doesn't mean that every single
               | company and investment was going to be profitable, that's
               | not true of anything ever. People investing in Amazon and
               | Google were winners though - these are companies that
               | have in many ways reinvented the market they operate in.
               | 
               | > Quite sure you misread that article. It says 40% of the
               | code checked in by people who use Copilot is AI-
               | generated. Not 40% of all code.
               | 
               | Ok, I'll take it that it's 40% of Copilot users. That's
               | still 40% of some programmers' code!
        
       | dleslie wrote:
       | Two years ago I didn't use AI at all. Now I wouldn't go without
       | it; I have Copilot integrated with Emacs, VSCode, and Rider. I
       | consider it a ground-breaking productivity accelerator, a leap
       | similar to when I transitioned from Turbo Pascal 2 to Visual C 6.
       | 
       | That's why I'm hyped. If it's that good for me, and it's
       | generalizable, then it's going to rock the world.
        
         | thomasfromcdnjs wrote:
         | Lifelong programmer, and same sentiments, I use it
         | everywhere I can.
         | 
         | I am currently transliterating a language PDF into a formatted
         | lexicon. I wouldn't even be able to do this without Copilot;
         | it has turned this seemingly impossibly arduous task into a
         | pleasurable one.
        
         | airstrike wrote:
         | Coding on something without copilot these days feels like
         | having my hands tied. I'm looking at you, XCode and Colab...
        
       | derbOac wrote:
       | This was interesting to me but mostly because of a question I
       | thought it was going to focus on, which is how should we
       | interpret these tests when a human takes them?
       | 
       | I wasn't sure that the phenomena they discussed were as relevant
       | to the question of whether AI is overhyped as they made it out to
       | be, but I did think a lot of questions about the meaning of the
       | performances were important.
       | 
       | What's interesting to me is you could flip this all on its head
       | and, instead of asking "what can we infer about the machine
       | processes these test scores are measuring?", we could ask "what
       | does this imply about the human processes these test scores are
       | measuring?"
       | 
       | A lot of these tests are well-validated but overinterpreted, I
       | think, and leaned on too heavily to make inferences about people.
       | If a machine can pass a test, for instance, what does it say
       | about the test as used in people? Should we be putting as much
       | weight on them as we do?
       | 
       | I'm not arguing these tests are useless or something, just that
       | maybe we read into them too much to begin with.
        
       | kfk wrote:
       | AI hype is really problematic in Enterprise. Big companies are
       | now spending C-suite executive time figuring out a company "AI
       | strategy". This is going to be another cycle of money wasted and
       | business upset, very similar to what I have seen with Big Data.
       | The thing in Enterprise is that everyone serious about business
       | operations knows AI test scores and AI quality are not there,
       | but very few are able to communicate these concerns in a
       | constructive way; rather, everyone is embracing the hype
       | because, maybe, they get a promotion? Tech, as usual, is very
       | happy to feed the hype and never, as usual, tells businesses
       | honestly that, at best, this is an incremental productivity
       | improvement, nothing life changing. I think the issue is an
       | overall lack of honesty, professionalism, and accountability
       | across the board, with tech leading this terrible way of pushing
       | product and "adding value".
        
         | huijzer wrote:
         | > Tech, as usual, is very happy to feed the hype
         | 
         | I agree completely with you on this.
         | 
         | In defence of the executives, however, some businesses will
         | be seriously affected. Call centres and plagiarism scanners
         | have already been affected, but it's unclear which industries
         | will be affected too. Maybe the probability is low, but the
         | impact could be very high. I think this reasoning is driving
         | the executives.
        
           | kfk wrote:
           | Look, I am going to wait and see on this; maybe new facts
           | will make me reconsider. In the meanwhile, GitHub Copilot is
           | just a cost to my company; I haven't seen much additional
           | productivity. I guess my concern, given how hard it is to
           | hire developers and technologists, is replacing simpler job
           | roles, like a customer service representative, with
           | complicated new ones, like "MLOps Engineer".
        
         | janalsncm wrote:
         | Identifying an "AI strategy" seems backwards. What they should
         | be doing is identifying the current problems and goals of the
         | company and reassessing how best to accomplish them given the
         | new capabilities which have surfaced. Perhaps "AI" is the best
         | way. Or maybe simpler ways are better.
         | 
         | I've said it before, but as someone to whom "AI" means
         | something more than making API calls to some SAAS, I look
         | forward to the day they hire me at $300/hour to replace their
         | "AI strategy" with something that can be run locally off of a
         | consumer-grade GPU or cheaper.
        
         | gmerc wrote:
         | Publicly listed companies whose traditional business model is
         | under pressure are incentivized to hype, because if they don't
         | inspire their wary investors with an idea of sustained growth,
         | the cautionary tale in the form of Twitter (a valuation low
         | enough to lose control) exists.
         | 
         | In capitalism, you grow or you die, and sometimes you need to
         | bullshit people about growth potential to buy yourself time.
        
           | hashtag-til wrote:
           | Yes, sad but true.
        
         | chasd00 wrote:
         | In consulting all we hear is sell, sell, sell AI, so I'm sure
         | my industry isn't helping at all. I'm not on board yet; I just
         | don't see a use case in enterprise beyond learning a knowledge
         | base to make a more conversational self-help search and things
         | like that. It's great that it can help write a function in
         | JavaScript but that's not a watershed moment... yet. Curious to
         | see AI project sales at the end of 2024 (everything in my biz
         | is measured in units of $$).
        
         | padjo wrote:
         | It's rational herd dynamics for the execs. Going against the
         | herd and being wrong is a career ender. Going with the herd and
         | being wrong will be neutral at worst.
        
         | hdjjhhvvhga wrote:
         | > because, maybe they get a promotion?
         | 
         | While I agree with you in general, I don't think this bit is
         | particularly fair. I'd say we know the limitations, and we also
         | know that using LLMs might bring some advantage, and the
         | companies that are able to use it properly will have a better
         | position, so it makes sense to at least investigate the
         | options.
        
         | hashtag-til wrote:
         | Agreed. I think there is a FOMO phenomenon among C-level execs
         | that is generating a gigantic waste of money and time, creating
         | distractions around "AI strategy".
         | 
         | It started a few years back and it is now really inflamed with
         | LLMs, because of the consumer-level hype and general media
         | reporting about it.
         | 
         | You can perceive that by the multiple AI startups capturing
         | millions in VC capital for absolutely bogus value propositions.
         | Bizarre!
        
         | epups wrote:
         | The problem with your premise is that you're already drawing
         | conclusions about the potential of AI and deciding it is hype.
         | Perhaps decades ago someone could have equally criticised
         | "Internet hype" and "mobile hype" and look foolish now.
        
           | soco wrote:
           | Also decades ago someone criticised "bigdata hype" and
           | "microservices hype" and looks right now. Doing things just
           | out of FOMO is rarely a good business decision. It can pay
           | off - even a broken clock is right twice a day - but it's
           | definitely bad to follow every new thing just because Gartner
           | mentioned it. I'm not giving advice of course, but having
           | seen enterprises betting good money even on NFTs I tend to
           | treat every new enterprise PowerPoint idea with a certain
           | dose of skepticism.
        
             | epups wrote:
             | Yes, hype exists and some things we thought were promising
             | turned out not to be. However, if anyone is making the case
             | that we know enough today to claim that AI is mostly hype,
             | I think that's foolish.
        
             | pixl97 wrote:
             | Business can work on more than one thing at once.
             | Businesses typically take on any number of risks they
             | invest in. Proper risk management ensures you've not
             | overcommitted assets to the point of an unrecoverable loss.
             | 
             | Some businesses in some industries can follow a strategy of
             | "never do anything until it's a well established process",
             | others cannot.
        
         | dboreham wrote:
         | > AI hype is really problematic in Enterprise.
         | 
         | This only appears so because we here have some insight into the
         | domain. But there have always been hype cycles. We just didn't
         | notice them so readily.
         | 
         | The speed with which this happens makes me suspect there is a
         | hidden "generic hype army" that was already in place,
         | presumably hyping the last thing, and ready to jump on this
         | thing.
        
         | jacobr1 wrote:
         | Blindly following a trend will likely not end well. But even
         | with previous hype cycles, those companies that identified good
         | use cases, validated those use cases, and had solid execution
         | of the projects leaped ahead. Big Data was genuinely of value
         | to plenty of organizations, and a waste of time for others. IoT
         | was crazy for plenty of orgs ... but also was really valuable
         | to certain segments. Gartner's hype cycle ends with the plateau
         | of productivity for a reason ... you just have to go through
         | the trough of disillusionment first, which is going to come
         | from a the great multitudes of failed and ill-conceived
         | projects.
        
         | JaDogg wrote:
         | This is exactly correct.
        
       | randcraw wrote:
       | The debate over whether LLMs are "intelligent" seems a lot like
       | the old debate among NLP experts over whether English must be
       | modeled as a context-free grammar (pushdown automaton) or a
       | finite-state machine (regular expression). Yes, any language can
       | be modeled
       | using regular expressions; you just need an insane number of FSMs
       | (perhaps billions). And that seems to be the model that LLMs are
       | using to model cognition today.
       | 
       | LLMs seem to use little or no abstract reasoning (is-a) or
       | hierarchical perception (has-a), as humans do -- both of which
       | are grounded in semantic abstraction. Instead, LLMs can memorize
       | a brute force explosion in finite state machines (interconnected
       | with Word2Vec-like associations) and then traverse those machines
       | and associations as some kind of mashup, akin to a coherent
       | abstract concept. Then as LLMs get bigger and bigger, they just
       | memorize more and more mashup clusters of FSMs augmented with
       | associations.
       | 
       | Of course, that's not how a human learns, or reasons. It seems
       | likely that synthetic cognition of this kind will fail to enable
       | various kinds of reasoning that humans perceive as essential and
       | normal (like common sense based on abstraction, or physically-
       | grounded perception, or goal-based or counterfactual reasoning,
       | much less insight into the thought processes / perceptions of
       | other sentient beings). Even as ever-larger LLMs "know more" by
       | memorizing ever more FSMs, I suspect they'll continue to surprise
       | us with persistent cognitive and perceptual deficits that would
       | never arise in organic beings that _do_ use abstract reasoning
       | and physically grounded perception.
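       | 
       | A rough sketch of the formal-language point above (illustrative
       | only, not a claim about how transformers are implemented): a
       | fixed regular expression only recognizes nesting up to a
       | hard-coded depth, while a pushdown-style check handles any depth.
       | 
       |     import re
       |     
       |     # Balanced parentheses, hard-coded to nesting depth 3.
       |     # Every extra level of depth needs a bigger (but still
       |     # finite) pattern - the "insane number of FSMs" idea.
       |     DEPTH_3 = re.compile(r"^(\((\((\(\))*\))*\))*$")
       |     
       |     def balanced(s: str) -> bool:
       |         """Pushdown-style check: a counter plays the stack."""
       |         depth = 0
       |         for ch in s:
       |             if ch == "(":
       |                 depth += 1
       |             elif ch == ")":
       |                 depth -= 1
       |                 if depth < 0:
       |                     return False
       |         return depth == 0
       |     
       |     for s in ["((()))", "(((())))", "(()"]:
       |         print(s, bool(DEPTH_3.match(s)), balanced(s))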
        
       | dmezzetti wrote:
       | This video from Yann LeCun gives a great summary on where things
       | stand. https://www.youtube.com/watch?v=pd0JmT6rYcI
       | 
       | He is of the opinion that the current generation transformer
       | architecture is flawed and it will take a new generation of
       | models to get close to the hype.
        
       | chewxy wrote:
       | I note something very interesting in the AI hype, and I would
       | like someone to help explain it.
       | 
       | Whenever there's a news story or article noting the limits of
       | current LLM tech (especially the GPT class of models from
       | OpenAI),
       | there's always a comment that says something along the lines of
       | "ah did you test it on GPT-4"?
       | 
       | Or if it's clear that it's the limitation of GPT-4, then you have
       | comments along the lines of "what's the prompt?", or "the prompt
       | is poor". Usually, it's someone who hasn't in the past indicated
       | that they understand that prompt engineering is model specific,
       | and the papers' point is to make a more general claim as opposed
       | to a claim on one model.
       | 
       | Can anyone explain this? It's like the mere mention of LLMs being
       | limited in X, Y, Z fashion offends their lifestyle/core beliefs.
       | Or perhaps it's a weird form of astroturfing. To which, I ask, to
       | what end?
        
         | jazzyjackson wrote:
         | The output of any model is essentially random and whether it is
         | useful or impressive is a coin flip. While most people get a
         | mix of heads and tails, there are a few people at any time that
         | are getting streaks of one head after another or vice versa.
         | 
         | So my perception is this leads to people who have good luck and
         | perceive LLMs as near AGI because it arrives at a useful answer
         | more often than not, and these people cannot believe there are
         | others who have bad luck and get worthless output from their
         | LLM, like someone at a roulette table exhorting "have you tried
         | betting it all on black? worked for me!"
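         | 
         | A toy simulation of that coin-flip picture (all numbers are
         | invented, and the 50/50 odds are purely an assumption for
         | illustration):
         | 
         |     import random
         |     
         |     random.seed(0)
         |     users, prompts = 10_000, 10   # made-up population
         |     all_good = all_bad = 0
         |     
         |     for _ in range(users):
         |         # each answer is a fair coin flip: useful or not
         |         wins = sum(random.random() < 0.5
         |                    for _ in range(prompts))
         |         if wins == prompts:
         |             all_good += 1
         |         elif wins == 0:
         |             all_bad += 1
         |     
         |     print(all_good, "users saw only great answers")
         |     print(all_bad, "users saw only junk")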
        
         | stevenhuang wrote:
         | Because they're saying it can't do something when they're
         | holding it wrong.
         | 
         | It's a weird thing to get hung up on if you ask me.
        
         | TeMPOraL wrote:
         | > _there's always a comment that says something along the
         | lines of "ah did you test it on GPT-4"?_
         | 
         | Perhaps because whenever there's "a news or article noting the
         | limits of current LLM tech", it's a bit like someone tried to
         | play a modern game on a machine they found in their parents'
         | basement, and the only appropriate response to this is, "have
         | you tried running it on something other than a potato"? This
         | has been happening so often over the past few months that it's
         | the first red flag you check for.
         | 
         | GPT-4 is still _qualitatively_ ahead of all other LLMs, so
         | outside of articles addressing specialized aspects of different
         | model families, the claims are invalid unless they were tested
         | on GPT-4.
         | 
         | (Half the time the problem is that the author used ChatGPT web
         | app and did not even realize there are two models and they've
         | been using the toy one.)
        
         | wrsh07 wrote:
         | 1. Just like it's frustrating when a paper is published making
         | claims that are hard to verify, it's frustrating when somebody
         | says "x can't do y" in a way that is hard to verify^^
         | 
         | 2. LLMs, in spite of the complaints about the research leaders,
         | are fairly democratic. I have access to several of the best
         | LLMs currently in existence and the ones I can't access haven't
         | been polished for general usage anyway. If you make a claim
         | with a prompt, it's easy for me to verify it
         | 
         | 3. I've been linked legitimate ChatGPT prompts where someone
         | gets incorrect data from ChatGPT - my instinct is to help them
         | refine their prompt to get correct data
         | 
         | 4. If you make a claim about these cool new tools (not making a
         | claim about what they're good for!) all of these kick in. I
         | want to verify, refine, etc.
         | 
         | Of course some people are on the bandwagon and it is akin to
         | insulting their religion (it is with religious fervor they hold
         | their beliefs!) but at least most folks on hn are just excited
         | and trying to engage
         | 
         | ^^ I actually think making this claim is in bad form generally.
         | It's like looking for the existence of aliens on a planet.
         | Absence of evidence is not evidence of absence
        
         | abm53 wrote:
         | Perhaps they are trying to help people get the best out of a
         | tool which they themselves find very useful?
        
         | epups wrote:
         | If someone comes here and says "<insert programming language>
         | cannot do X" and that is wrong, or perhaps outdated, don't you
         | feel that the reaction would be similar?
         | 
         | If you are trying to make categorical statements about what AI
         | is unable to do, at the very least you should use a state-of-
         | the-art system, which conveniently is easily available for
         | everyone.
        
         | jacobr1 wrote:
         | As someone who has this instinct myself, there is a line of
         | reactionism to modern AI/ML that says, "this is just a toy,
         | look it can't do something simple." But often it is the case
         | that it _can_ do that thing with either a more advanced model,
         | or a
         | more built-out system. So the instinct is to try and explain
         | that the pessimism is wrong. That we really can push the
         | boundary and do more, even if it isn't going to work out of the
         | box yet. I react that way against all forms of poppy snipping.
        
           | Jensson wrote:
           | Hyping up tech based on what you think it will be able to do
           | in the future is the misplaced overhyping that is the
           | problem. The issues people say are easy to fix aren't easy to
           | fix.
           | 
           | Expect the model to continue to perform like it does today,
           | with lots of dumb integrations added to it, and you will
           | get a very accurate prediction of how most new tech hype
           | turns out. Dumb integrations can't add intelligence, but they
           | can add a lot of value, so the rational hype still sees this
           | as a very valuable and exciting thing, but it isn't a
           | complete revolution in its current form.
        
       | Cloudef wrote:
       | AI is honestly the wrong word to use. These are ML models and
       | they are able to do only the task they have been specifically
       | trained
       | for (not saying the results aren't impressive!). There really
       | isn't competition either as the only people who can train these
       | giant models are those who have the cash.
        
         | pixl97 wrote:
         | >AI is honestly wrong word to use
         | 
         | https://en.wikipedia.org/wiki/AI_effect
         | 
         | Just because you don't like how poorly the term AI is defined,
         | doesn't mean it is the wrong term.
         | 
         | AI can never be well defined because the word intelligence
         | itself is not well defined.
        
         | TeMPOraL wrote:
         | > _These are ML models and they are able to only do the task
         | they have been specifically trained for_
         | 
         | Yes, but the models we're talking about have been trained
         | specifically on the task of "complete arbitrary textual input
         | in a way that makes sense to humans", for _arbitrary textual
         | input_, and then further tuned for "complete it as if you were
         | a person having conversation with a human", again for arbitrary
         | text input - and trained until they could do so convincingly.
         | 
         | (Or, you could say that with instruct fine-tuning, they were
         | further trained to _behave as if they were an AI chatbot_ - the
         | kind of AI people know from sci-fi. Fake it 'till you make it,
         | via backpropagation.)
         | 
         | In short, they've been trained on an open-ended, general task
         | of communicating with humans using plain text. That's very
         | different to typical ML models which are tasked to predict some
         | very specific data in a specialized domain. It's like comparing
         | a Python interpreter to Notepad - both are just regular
         | software, but there's a meaningful difference in capabilities.
         | 
         | As for seeing glimpses of understanding in SOTA LLMs - this
         | makes sense under the compression argument: understanding is
         | lossy compression of observations, and this is what the
         | training process is trying to force to happen, squeezing more
         | and more knowledge into a fixed set of model weights.
        
           | Cloudef wrote:
           | Yes, this is why I think the LLM and image generation models
           | are still impressive. Knowing they are ML models in the end
           | and still produce results that surprise us makes you wonder
           | what we are in the end. Could we essentially simulate
           | something similar to us, given enough inputs and parameters
           | in the network, with enough memory, computing power and a
           | training process that would aim to simulate a human with
           | emotions? I would imagine the training process alone would
           | need a bunch of other models to teach the final model
           | "concepts" and from there perhaps "reasoning".
           | 
           | Why I think AI is not the appropriate term is that if it were
           | AI, the AI would have already figured everything out for us
           | (or for itself). An LLM can only chain text; it does not
           | really understand the content of the text, and can't come up
           | with novel solutions (or if it accidentally does, it's due to
           | hallucination). This can be easily confirmed by giving
           | current LLMs some simple puzzles, math problems and so on.
           | Image models have similar issues.
        
       | iambateman wrote:
       | This really is a good article, and is seriously researched. But
       | the conclusion in the headline - "AI hype is built on flawed test
       | scores" - feels like a poor summary of the article.
       | 
       | It _is_ correct to say that an LLM is not ready to be a medical
       | doctor, even if it can pass the test.
       | 
       | But I think a better conclusion is that test scores don't help us
       | understand LLM capabilities like we think they do.
       | 
       | Using a human test for an LLM is like measuring a car's "muscles"
       | and calling it horsepower. They're just different.
       | 
       | But the AI hype is justified, even if we struggle to measure it.
        
       | aidenn0 wrote:
       | Any task that gets solved with AI retroactively becomes something
       | that doesn't require reasoning.
        
         | janalsncm wrote:
         | I wouldn't say that. Chess certainly requires reasoning even if
         | that reasoning is minimax.
         | 
         | I suppose in the context of this article "AI" means statistical
         | language models.
        
       | robertlagrant wrote:
       | > AI hype is built on high test scores
       | 
       | No, it's built on people using DALLE and Midjourney and ChatGPT.
        
         | yCombLinks wrote:
         | Exactly. ChatGPT is double-checking my homework problems and
         | pointing out my errors; it's teaching me the material better
         | than any of my lectures. It's writing tons of code I'm getting
         | paid for, with way less overhead than trying to explain the
         | problem to a junior, fewer mistakes and faster iteration. Test
         | scores, ridiculous.
        
       | Kalanos wrote:
       | Didn't it perform well on both the SAT and LSAT though?
        
       | mg wrote:
       | I don't think the "hype" is built on test scores.
       | 
       | It is built on the observation of how fast AI is getting better. If
       | the speed of improvement stays anywhere near the level it was the
       | last two years, then over the next two decades, it will lead to
       | massive changes in how we work and which skills are valuable.
       | 
       | Just two years ago, I was mesmerized by GPT-3's ability to
       | understand concepts:
       | 
       | https://twitter.com/marekgibney/status/1403414210642649092
       | 
       | Nowadays, using it daily in a productive fashion feels completely
       | normal.
       | 
       | Yesterday, I was annoyed with how cumbersome it is to play long
       | mp3s on my iPad. I asked GPT-4 something like "Write an html page
       | which lets me select an mp3, play it via play/pause buttons and
       | offers me a field to enter a time to jump to". And the result was
       | usable out of the box and is my default mp3 player now.
       | 
       | Two years ago it didn't even dawn on me that this would be my way
       | of writing software in the near future. I have been coding for
       | over 20 years. But for little tools like this, it is faster to
       | ask ChatGPT now.
       | 
       | It's hard to imagine where we will be in 20 years.
        
         | gmerc wrote:
         | This.
         | 
         | We are in a Cambrian Explosion on the software side and
         | hardware hasn't yet reacted to it. There's a few years of mad
         | discovery in front of us.
         | 
         | People have different impressions as to the shape of the curve
         | that's going up and right, but only a fool would not stop and
         | carefully take in what is happening.
        
           | kossTKR wrote:
           | Exactly and things are actually getting crazy now. Pardon the
           | tangent but for some reason this hasn't reached the frontpage
           | on HN yet: https://github.com/OpenBMB/ChatDev
           | 
           | Making your own "internal family system" of AIs is making
           | this exponential (and frightening), like an ensemble on top
           | of the ensemble, with specific "mindsets", that with shared
           | memory can build and do stuff continuously. Found this from a
           | comp sci professor on Tiktok so be warned: https://www.tiktok
           | .com/@lizthedeveloper/video/72835773820264...
           | 
           | I remember a couple of comments here on HN when the hype
           | began about how some dude thought he had figured out how to
           | actually make an AGI - can't find it now, but it was
           | something about having multiple ai's with different
           | personalities discoursing with a shared memory - and now it
           | seems to be happening.
           | 
           | This, coupled with access to Linux containers that can be
           | spawned on demand, means we are in for a wild ride!
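           | 
           | A bare-bones sketch of that pattern - several personas taking
           | turns over a shared transcript. The llm() function below is a
           | placeholder, not a real API, and this is far simpler than
           | what ChatDev actually does:
           | 
           |     def llm(prompt: str) -> str:
           |         """Stand-in for a real model call."""
           |         return f"[reply to: {prompt[:30]}...]"
           |     
           |     personas = {
           |         "planner": "Break the goal into steps.",
           |         "coder": "Write code for the next step.",
           |         "reviewer": "Point out flaws in the code.",
           |     }
           |     
           |     memory = ["Goal: build a small todo app."]
           |     
           |     # Each persona reads the shared memory, then
           |     # appends its own turn to it.
           |     for _ in range(2):
           |         for name, role in personas.items():
           |             turn = llm(role + "\n" + "\n".join(memory))
           |             memory.append(f"{name}: {turn}")
           |     
           |     print("\n".join(memory))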
        
             | ChatGTP wrote:
             | [flagged]
        
             | dartos wrote:
             | I saw chatdev on hn and have been pretty disappointed with
             | it :(
             | 
             | Haven't had it make anything usable that's more complicated
             | than a mad lib yet
        
         | [deleted]
        
         | happycube wrote:
         | I got curious and did this myself. Needed a bit of nudging to
         | get where I wanted, but I even had it make an Electron wrapper:
         | 
         | https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...
        
           | dchuk wrote:
           | This is awesome, thanks for sharing.
           | 
           | Do you (or anyone) know of any products that allow for
           | iterating on the generated output through further chatting
           | with the AI? What I mean is that each subsequent prompt here
           | either generated a new whole output, or new chunks to add to
           | the output. Ideally, whether generating code or prose, I'd
           | want to keep prompting about the generated output and the AI
           | further modifies the existing output until it's refined to
           | the degree I want it.
           | 
           | Or is that effectively what Copilot/Cursor do and I'm just a
           | bad operator?
        
             | happycube wrote:
             | No problem, it was a fun morning exercise for me :)
             | 
             | Copilot, at least from what little I did in VS Code, isn't
             | as powerful as this. I think there's a GPT-4 mode for it
             | that I haven't played with that'd be a _lot_ closer to
             | this.
        
             | robertlagrant wrote:
             | > Do you (or anyone) know of any products that allow for
             | iterating on the generated output through further chatting
             | with the AI? What I mean is that each subsequent prompt
             | here either generated a new whole output, or new chunks to
             | add to the output. Ideally, whether generating code or
             | prose, I'd want to keep prompting about the generated
             | output and the AI further modifies the existing output
             | until it's refined to the degree I want it.
             | 
             | ChatGPT does this.
        
         | RC_ITR wrote:
         | You can check my post history to see how unpopular this point
         | of view is, but the big "reveal" that will come up is as
         | follows:
         | 
         | The way that LLMs and humans "think" is inherently different.
         | Giving an LLM a test designed for humans is akin to giving a
         | camera a 'drawing test.'
         | 
         | A camera can make a better narrow final output than a human,
         | but it cannot do the subordinate tasks that a human illustrator
         | could, like changing shadings, line width, etc.
         | 
         | An LLM can answer really well on tests, but it often fails at
         | subordinate tasks like 'applying symbolic reasoning to
         | unfamiliar situations.'
         | 
         | Eventually the thinking styles may converge in a way that makes
         | the LLMs practically more capable than humans on those
         | subordinate tasks, but we are not there yet.
        
         | james-revisoai wrote:
         | A lot of the progress in the last 3-4 years was predictable
         | from GPT-2 and especially GPT-3 onwards - combining instruction
         | following and reinforcement learning with scaling GPT. With
         | research being more closed, this isn't so true anymore. The mp3
         | case was predictable in 2020 - some early twitter GIFs showed
         | vaguely similar stuff. Can you predict what will happen in
         | 2026/7 though, with multimodal tech?
         | 
         | I simply don't see it as being the same today. The obvious
         | element of scaling or techniques that imply a useful overlap
         | isn't there. Whereas before, researchers brought together
         | excellent and groundbreaking performance on different
         | benchmarks and areas as they worked on GPT-3, since 2020,
         | except for instruction following, less has been predictable.
         | 
         | Multimodal could change everything (things like the ScienceQA
         | paper suggest so), but also, it might not shift benchmarks.
         | It's just not so clear that the future is as predictable or
         | will be faster than the last few years. I do have my own
         | beliefs similar to Yann LeCun about what architecture or rather
         | infrastructure makes most sense intuitively going forward, and
         | there's not really the openness we used to have from top labs
         | to know if they are going these ways, or not. So you are
         | absolutely right that it's hard to imagine where we will be in
         | 20 years, but in a strange way, because it is much less clear
         | than in 2020 where we will be in 3 years' time and onwards, I
         | would say progress is much less guaranteed than many feel it
         | is...
        
           | huijzer wrote:
           | I was also thinking about how quickly AI may progress and am
           | curious for your or other people's thoughts. When estimating
           | AI progress, estimating orders of magnitude sounds like the
           | most plausible way to do it, just like Moore's law has
           | predicted the magnitude correctly for years. For AI, it is
           | known that performance increases linearly when the model size
           | increases exponentially. Funding currently increases
           | exponentially, meaning that performance will increase
           | linearly. So, AI performance will increase linearly as long
           | as the funding keeps growing exponentially. On top of this,
           | algorithms may be made more
           | efficient, which may occasionally make an order of magnitude
           | improvement. Does this reasoning make sense? I think it does
           | but I could be completely wrong.
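           | 
           | A back-of-the-envelope version of that argument (all the
           | constants below are made up for illustration): if a score
           | grows with the log of compute, and compute grows by a fixed
           | factor per year, then the yearly score gains stay constant.
           | 
           |     import math
           |     
           |     compute = 1.0      # arbitrary starting budget
           |     growth = 10.0      # assume 10x more compute each year
           |     a, b = 20.0, 50.0  # made-up scaling-curve constants
           |     
           |     for year in range(5):
           |         score = a * math.log10(compute) + b
           |         print(year, f"{compute:.0f}x", f"{score:.0f}")
           |         compute *= growth
           |     
           |     # prints 50, 70, 90, 110, 130: equal gains per year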
        
         | denton-scratch wrote:
         | > I was mesmerized by GPT-3's ability to understand concepts
         | 
         | This language embodies the anthropomorphic assumptions that the
         | author is attacking.
        
           | dboreham wrote:
           | Or the corollary: that there's really no such thing as
           | anthropomorphic. There's inputs and outputs, and an
           | observer's opinion on how well the outputs relate to the
           | inputs. The thing producing the outputs, and the observer,
           | can be human or not human. Same difference.
        
             | smcin wrote:
             | It absolutely is anthropomorphizing to claim "GPT-3's
             | ability to _understand concepts_" rather than simply
             | calling it "reproduce, mix and match text from an enormous
             | corpus". And we can totally legitimately compare to a jury
             | of human observers' opinions on how well(/badly) the output
             | generated relates to the inputs.
             | 
             | For the specific example the OP cited _"War: like being
             | eaten by a dragon and then having it spit you out"_
             | 
             | then unless its answer to "Where were you in between being
             | eaten by a dragon and before it spat you out?" is "in the
             | dragon's digestive system" that isn't understanding.
             | 
             | And I'm curious to see it answer "Dragons only exist in
             | mythology; does your analogy mean war doesn't exist either?
             | Why not compare to an animal that exists?"
        
               | nomel wrote:
               | > "War: like being eaten by a dragon and then having it
               | spit you out"
               | 
               | This exact text, and the response (several attempts), is
               | flagged and censored with the ChatGPT-4 web interface. :-|
        
         | adrians1 wrote:
         | > If the speed of improvement stays anywhere near the level it
         | was the last two years, then over the next two decades, it will
         | lead to massive changes in how we work and which skills are
         | valuable.
         | 
         | That's a big assumption to make. You can't assume that the rate
         | of improvement will stay the same, especially over a period of
         | 2 decades, which is a very long time. Every advance in
         | technology hits diminishing returns at some point.
        
           | mg wrote:
           | Why do you think so?
           | 
           | Technological progress seems to be accelerating rather
           | than diminishing to me.
           | 
           | Computers are a great example: They have been getting more
           | capable exponentially over the last decades.
           | 
           | In terms of performance (memory, speed, bandwidth) and in
           | terms of impact. First we had calculators, then we had
           | desktop applications, then the Internet and now we have AI.
           | 
           | And AI will help us get to the next stage even faster.
        
             | hashtag-til wrote:
             | I'm not putting my coins on these advances.
             | 
             | More likely this will become the new "search" technology
             | and get polluted with ads. People will lose trust and it
             | will decay.
        
         | Jensson wrote:
         | > Two years ago it didn't even dawn on me that this would be my
         | way of writing software in the near future
         | 
         | So you were ignorant two years ago; GitHub Copilot was
         | already available to users back then. The only big new thing
         | in the past two years was GPT-4, and nothing suggests
         | anything similar will come in the next two years. There are
         | no big new things on the horizon; we knew for quite a while
         | that GPT-4 was coming, but there isn't anything like that
         | this time.
        
           | mg wrote:
           | Copilot was not around when I wrote the Tweet.
           | 
           | But when Copilot came out, I was indeed ignorant! I
           | remember when a friend showed it to me for the first time.
           | I was like "Yeah, it outputs almost-correct boilerplate
           | code for you. But thankfully I write my code in a way that
           | avoids boilerplate". I didn't expect it to be able to
           | write fully functional tools and understand them well
           | enough to actually write pretty _nice_ code!
           | 
           | Regarding "there isn't anything like that this time." : Quite
           | the opposite! We have not figured out where using larger
           | models and throwing more data at them will level off! This
           | could go on for quite a while. With FSD 12, Tesla is already
           | testing self driving with a single large neural net, without
           | any glue code. I am super curious how that will turn out.
           | 
           | The whole thing is just starting.
        
             | Jensson wrote:
             | Well, my point is that you perceive progress to be fast
             | since you went from not understanding what existed to later
             | getting in on it. That doesn't mean progress was that fast,
             | it means that you just discovered a new domain.
             | 
             | Trying to extrapolate actual progress is bad in itself, but
             | trying to extrapolate your perceived progress is even
             | worse.
        
               | james-revisoai wrote:
               | Yeah, you have hit the nail on the head here. A lot
               | was predictable once GPT-2 showed it could stay
               | reasonably within a language and generate early
               | coherent structures. That, coming at the same time as
               | instruction tuning with the T5 work and the widespread
               | use of embeddings from BERT, told us this direction
               | was likely; it's just that for many people this came
               | to awareness in 2021/22 rather than during the
               | 2018-2020 ramp-up that the field and hobbyists
               | experienced.
        
               | [deleted]
        
           | isaacfung wrote:
           | Whisper, Stable Diffusion, VoiceBox, GPT4 vision, DALL.E3
           | 
           | Other breakthroughs in graph machine learning
           | https://towardsdatascience.com/graph-ml-in-2023-the-state-
           | of...
        
             | Jensson wrote:
             | Those are image/voice generation; the topic is the
             | potential replacement of knowledge workers such as
             | coders. Image/voice generation is a very different
             | discussion, since nobody thinks those are moving towards
             | AGI and nobody argued they were "conscious", etc.
        
               | isaacfung wrote:
               | [flagged]
        
         | rob74 wrote:
         | The article doesn't say that LLMs aren't useful - the "hype"
         | they mean is overestimating their capabilities. An LLM may be
         | able to pass a "theory of mind" test, or it may fail
         | spectacularly, depending on how you prompt it. And that's
         | because, despite all of its training data, it's not capable of
         | actually _reasoning_. That may change in the future, but we're
         | not there yet, and (AFAIK) nobody can tell how long it will
         | take to get there.
        
           | YetAnotherNick wrote:
           | > it's not capable of actually reasoning
           | 
           | Define reasoning. Because by my definition GPT-4 can
           | reason, without doubt. It definitely can't reason better
           | than experts in the field, but it can reason better than,
           | say, interns.
        
             | mcguire wrote:
             | What is your definition?
        
               | YetAnotherNick wrote:
               | If it can solve basic logic problems, then it can
               | reason. And if it can write the code for a new game
               | with new logic, then it can reason for sure.
               | 
               | Example of basic problem: In a shop, there are 4 dolls of
               | different heights P,Q,R and S. S is neither as tall as P
               | nor as short as R. Q is shorter than S but taller than R.
               | If Kittu wants to purchase the tallest doll, which one
               | should she purchase? Think step by step.
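               | 
               | (For reference, a quick brute-force check, just a
               | sketch that assigns ranks 1-4 to the dolls and tests
               | the two stated constraints, confirms the intended
               | answer is P.)
               | 
               |     from itertools import permutations
               | 
               |     # Constraints: S is shorter than P but taller than R;
               |     # Q is shorter than S but taller than R.
               |     for p, q, r, s in permutations(range(1, 5)):
               |         if r < s < p and r < q < s:
               |             dolls = {"P": p, "Q": q, "R": r, "S": s}
               |             # prints "P" for every valid assignment
               |             print(max(dolls, key=dolls.get))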
        
               | kaibee wrote:
               | Seems to handle it easily. https://chat.openai.com/share/
               | 4d8ab2af-f824-44c8-9311-e3893c...
        
             | NovemberWhiskey wrote:
             | I don't have access to GPT 4 but I'd be interested to see
             | how it does on a question like this:
             | 
             |  _" Say I have a container with 50 red balls and 50 blue
             | balls, and every time I draw a blue ball from the
             | container, I add two white balls back. After drawing 100
             | balls, how many of each different color ball are left in
             | the container? Explain why."_
             | 
             | ... because on GPT 3.5 the answer begins like the below and
             | then gets worse:
             | 
             |  _" Let's break down the process step by step:
             | 
             | Initially, you have 50 red balls and 50 blue balls in the
             | container.
             | 
             | 1) When you draw a blue ball from the container, you remove
             | one blue ball, and you add two white balls back. So, after
             | drawing a blue ball, you have 49 blue balls (due to
             | removal) and you add 2 white balls, making it a total of 52
             | white balls (due to addition) ..."_
             | 
             | If I was hiring interns this dumb, I'd be in trouble.
             | 
             | EDIT: judging by the GPT-4 responses, I remain of the
             | opinion I'd be in trouble if my interns were this dumb.
        
               | magospietato wrote:
               | GPT-4:
               | 
               | [...] the composition of what's left in the container.
               | 
               | There's a couple of scenarios, which depend on when you
               | run out of blue balls:
               | 
               | 1) If you draw all 50 blue balls first, for every blue
               | ball you draw, you are adding 2 white balls. Hence, once
               | you've drawn all the blue balls, you have replaced them
               | with 100 white balls. This means you've drawn 50 balls
               | total and there's now 100 white balls and 50 red balls
               | (150 balls) in the container. Then, you continue and draw
               | the next 50 balls. These could be any combination of
               | white and red balls, we don't know which color is
               | preferred when you continue drawing.
               | 
               | 2a) If you draw all red balls in these next 50 draws, you
               | would have 50 red balls out and 50 blue balls out. So,
               | you would end with 100 white balls in the container.
               | 
               | 2b) If you draw all white balls in these next 50 draws
               | you would have 50 red balls, 50 blue balls, and 50 white
               | balls out. So, you would end with 50 white balls and 50
               | red balls in the container.
               | 
               | 2c) If you draw a mixture of red and white balls in these
               | next 50 draws, the number leftover in the container would
               | vary according to the specific combination drawn.
               | 
               | Remember, the order in which the balls are drawn matters
               | to this question since we have a rule in place that
               | changes the overall total (or population) of balls within
               | the container. In each of these scenarios, you've drawn
               | 50 times initially (all blue) and then 50 times again (in
               | some unknown combination of red and white). You've drawn
               | 100 times total and changed the number of white balls in
               | the container from zero to an amount dependent on how
               | many times you drew a white ball on your second round of
               | 50 draws.
        
               | NovemberWhiskey wrote:
               | Yeah, that's still pretty much nonsense isn't it?
               | 
               |  _2b) If you draw all white balls in these next 50 draws
               | you would have 50 red balls, 50 blue balls, and 50 white
               | balls out. So, you would end with 50 white balls and 50
               | red balls in the container._
               | 
               | ... so after removing 100 balls, I've removed 150 balls?
               | And the 150 balls that I've removed are red, white and
               | blue despite the fact that I removed 50 blue balls
               | initially and then 50 white ones.
        
               | mountainriver wrote:
               | Just because it fails one test in a particular way
               | doesn't mean it lacks reasoning entirely. It clearly does
               | have reasoning, based on all the benchmarks it passes.
               | 
               | You are really trying to make it not have reasoning for
               | your own benefit
        
               | kenjackson wrote:
               | I asked GPT4 and it gave a similar response. So then I
               | asked my wife and she said, "do you want more white balls
               | at the end or not?" And I realized that, as a CS or
               | math question, we assume the draw is random. Other
               | people assume that you're picking which ball to draw.
               | 
               | So I clarified to ChatGPT that the drawing is random. And
               | it replied: "The exact numbers can vary based on the
               | randomness and can be precisely modeled with a simulation
               | or detailed probabilistic analysis."
               | 
               | I asked for a detailed probabilistic analysis and it
               | gives a very simplified analysis. And then basically says
               | that a Monte Carlo approach would be easier. That
               | actually sounds more like most people I know than most
               | people I know. :-)
        
               | YetAnotherNick wrote:
               | This is such a flawed puzzle. And GPT 4 answers it
               | correctly. It is a long answer, but the last sentence
               | is
               | "This is one possible scenario. However, there could be
               | other scenarios based on the order in which balls are
               | drawn. But in any case, the same logic can be applied to
               | find the number of each color of ball left in the
               | container."
        
               | NovemberWhiskey wrote:
               | The ability to identify that there isn't a simple closed
               | form result is actually a key component of reasoning. Can
               | you stick the answer it gives on a gist or something? The
               | GPT 3.5 response is pure, self-contradictory word salad
               | and of course delivered in a highly confident tone.
        
               | YetAnotherNick wrote:
               | https://pastebin.com/r9bNi8GD
               | 
               | GPT 4 goes into detail about one example scenario,
               | which most humans won't do, but it is a technically
               | correct answer, as it said it depends on the order.
        
               | NovemberWhiskey wrote:
               | But the reasoning is total garbage, right?
               | 
               | It says the number of blue balls drawn is _x_ and the
               | number of red balls drawn is _y_, and then asserts
               | _x + y = 100_, which is wrong.
               | 
               | Then it proceeds to "solve" an equation which reduces to
               | _x = x_ to conclude _x = 0_.
               | 
               | It then uses that to "prove" that _y_ = 100, which is a
               | problem as there are only 50 red balls in the container
               | and nothing causes any more to be added.
               | 
               | It's like "mistakes bad students make in Algebra 1".
        
               | albedoa wrote:
               | > but it is a technically correct answer, as it said
               | it depends on the order.
               | 
               | It should give you pause that you had to pick not only
               | the _line_ by which to judge the answer but the _part_ of
               | the line. The sentence immediately before that is
               | objectively wrong:
               | 
               | > This is one possible scenario.
        
               | Jensson wrote:
               | Its answer isn't correct; this isn't a possible
               | ending scenario:
               | 
               | - *Ending Scenario:*
               |   - Red Balls (RB): 0 (all have been drawn)
               |   - Blue Balls (BB): 50 (none have been drawn)
               |   - White Balls (WB): 0 (since no blue balls were
               |     drawn, no white balls were added)
               |   - Total Balls: 50
        
               | sebzim4500 wrote:
               | I don't understand the question. Surely the answer
               | depends on which order you withdraw balls in? Is the
               | idea that you blindly withdraw a ball at every step,
               | and you are asking for the expected number of balls of
               | each color at the end of the process?
               | 
               | Seems like quite a difficult question to compute exactly.
               | 
               | I reworded the question to make it clearer and then it
               | was able to simulate a bunch of scenarios as a monte
               | carlo simulation. Was your hope to calculate it exactly
               | with dynamic programming? GPT-4 was not able to do this,
               | but I suspect neither could a lot of your interns.
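               | 
               | (For what it's worth, a minimal Monte Carlo sketch of
               | the clarified question, assuming uniformly random
               | draws, could look like the following; the trial count
               | is arbitrary.)
               | 
               |     import random
               | 
               |     def trial():
               |         # 50 red and 50 blue to start; each drawn blue ball
               |         # is removed and two white balls are added.
               |         balls = ["red"] * 50 + ["blue"] * 50
               |         for _ in range(100):
               |             drawn = balls.pop(random.randrange(len(balls)))
               |             if drawn == "blue":
               |                 balls += ["white", "white"]
               |         return balls
               | 
               |     runs = 10000
               |     totals = {"red": 0, "blue": 0, "white": 0}
               |     for _ in range(runs):
               |         for color in trial():
               |             totals[color] += 1
               |     for color, count in totals.items():
               |         # average number of each color left after 100 draws
               |         print(color, count / runs)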
        
               | NovemberWhiskey wrote:
               |  _I don't understand the question. Surely the answer
               | depends on which order you withdraw balls in? Is the
               | idea that you blindly withdraw a ball at every step,
               | and you are asking for the expected number of balls of
               | each color at the end of the process?_
               | 
               | These are very good questions that anyone with the
               | ability to reason would ask if given this problem.
        
               | rafark wrote:
               | GPT 3.5 is VERY dumb when compared to GPT 4. Like, the
               | difference is massive.
        
               | Jensson wrote:
               | GPT 4 still does a lot of dumb stuff on this question;
               | you see several people post outright wrong answers and
               | say "Look how gpt-4 solved it!". That happens quite a
               | lot in these discussions, so it seems like the magic
               | to get gpt-4 to work is that you just don't check its
               | answers properly.
        
               | kaibee wrote:
               | https://chat.openai.com/share/a9806bd1-e5a9-4fea-981b-284
               | 3e6...
               | 
               | Took a bit of massaging and I enabled the Data Analysis
               | plugin which lets it write python code and run it. It
               | looks like the simulation code is correct though.
        
               | Kim_Bruning wrote:
               | I came at it from a different angle. The simulation code
               | in my case had a bug which I needed to point out. Then it
               | got a similar final answer.
        
               | NovemberWhiskey wrote:
               |  _Let's assume you draw x blue balls in 100 draws. Then
               | you would have drawn 100-x red balls._
               | 
               | Uhm.
        
               | pertique wrote:
               | This is what I got on a basically brand new OpenAI
               | account: https://chat.openai.com/share/5199c972-478d-406f
               | -9092-061a6b...
               | 
               | All told, I'd say it's a decent answer.
               | 
               | Edit: I took it to completion: https://chat.openai.com
               | /c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...
               | 
               | These were the first responses each time, with no
               | massaging/retries/leading answers. I will say it's not
               | entirely there. I re-ran the initial question a few
               | times afterwards and one was basically gibberish.
        
             | [deleted]
        
             | szundi wrote:
             | Really?
        
             | _trapexit wrote:
             | It's not reasoning. It's word prediction. At least at the
             | individual model level. OpenAI is likely using a collection
             | of models.
        
           | denton-scratch wrote:
           | > And that's because, despite all of its training data, it's
           | not capable of actually reasoning. That may change in the
           | future [...]
           | 
           | I don't think so. When you say "it's not capable of
           | actually reasoning", that's because it's an LLM; and if
           | that "changes in the future", it will be because the new
           | system is no longer a pure LLM. The appearance of
           | reasoning in LLMs is an illusion.
        
           | kromem wrote:
           | That's not the case. It's very much in the realm of "we don't
           | know what's going on in the network."
           | 
           | Rather than a binary, it's much more likely a mix of
           | factors going into the results: basic reasoning
           | capabilities developed from the training data (much like
           | the board representations and state-tracking abilities
           | that emerged from feeding board game moves into a toy
           | model in Othello-GPT) as well as statistics-driven
           | autocomplete.
           | 
           | In fact, when I've seen GPT-4 get hung up on logic puzzle
           | variations (such as adding transparency), it often seems
           | like the latter is overriding the former: changing tokens
           | to emoji representations, or having it always repeat the
           | adjectives attached to nouns so it preserves the
           | variation's context, gets it over the hump to reproducible
           | solutions (as you would expect from a network capable of
           | reasoning), but by default it falls into the pattern of
           | the normative cases.
           | 
           | For something as complex as SotA neural networks, binary
           | sweeping statements seem rather unlikely to actually be
           | representative...
        
             | nsagent wrote:
             | As a PhD student in NLP who's graduating soon, my
             | perspective is that language models do not demonstrate
             | "reasoning" in the way most people colloquially use the
             | term.
             | 
             | These models have no capacity to plan ahead, which is a
             | requirement for many "reasoning" problems. If it's not in
             | the context, the model is unlikely to use it for predicting
             | the next token. That's why techniques like chain-of-thought
             | are popular; they cause the model to parrot a list of facts
             | before making a decision. This increases the likelihood
             | that the context might contain parts of the answer.
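             | 
             | (A toy illustration of the idea, with made-up prompt
             | text rather than anything from a real system: the only
             | difference is an extra instruction that makes the model
             | write its intermediate steps into the context first.)
             | 
             |     # Hypothetical prompts, for illustration only.
             |     puzzle = ("In a shop there are 4 dolls P, Q, R and S ... "
             |               "Which one is the tallest?")
             |     direct_prompt = puzzle
             |     cot_prompt = puzzle + " Let's think step by step."
             |     # The chain-of-thought variant nudges the model to list the
             |     # relevant comparisons before answering, so parts of the
             |     # answer end up in its own context.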
             | 
             | Unfortunately, this means the "reasoning" exhibited by
             | language models is limited: if the training data does not
             | contain a set of generalizable text applicable to a
             | particular domain, a language model is unlikely to make a
             | correct inference when confronted with a novel version of a
             | similar situation.
             | 
             | That said, I do think adding reasoning capabilities is an
             | active area of research, but we don't have a clear time
             | horizon on when that might happen. Current prompting
             | approaches are stopgaps until research identifies a
             | promising approach for developing reasoning, e.g. combining
             | latent space representations with planning algorithms over
             | knowledge bases, constraining the logits based on an
             | external knowledge verifier, etc. (these are just random
             | ideas; I'm not saying they are what people are working
             | on, rather they are examples of possible approaches to
             | the problem).
             | 
             | In my opinion, language models have been good enough since
             | the GPT-2 era, but have been held back by a lack of
             | reasoning and efficient memory. Making the language models
             | larger and trained on more data helps make them more useful
             | by incorporating more facts with increased computational
             | capacity, but the approach is fundamentally a dead end for
             | higher level reasoning capability.
        
               | Kim_Bruning wrote:
               | > These models have no capacity to plan ahead
               | 
               | How would you describe the behavior of "GPT Advanced Data
               | Analysis"?
        
               | Der_Einzige wrote:
               | Are you going to school in Langley, Virginia?
        
               | NovemberWhiskey wrote:
               | NSA is more commonly associated with Fort Meade, MD, for
               | what that's worth.
        
               | visarga wrote:
               | > if the training data does not contain a set of
               | generalizable text applicable to a particular domain, a
               | language model is unlikely to make a correct inference
               | when confronted with a novel version of a similar
               | situation.
               | 
               | True. But look at the Phi-1.5 model - it punches 5x
               | above its weight class. The trick is in the dataset:
               | 
               | > Our training data for phi-1.5 is a combination of
               | phi-1's training data (7B tokens) and newly created
               | synthetic, "textbook-like" data (roughly 20B tokens) for
               | the purpose of teaching common sense reasoning and
               | general knowledge of the world (science, daily
               | activities, theory of mind, etc.). We carefully selected
               | 20K topics to seed the generation of this new synthetic
               | data. In our generation prompts, we use samples from web
               | datasets for diversity. We point out that the only non-
               | synthetic part in our training data for phi-1.5 consists
               | of the 6B tokens of filtered code dataset used in phi-1's
               | training (see [GZA+ 23]).
               | 
               | > We remark that the experience gained in the process of
               | creating the training data for both phi-1 and phi-1.5
               | leads us to the conclusion that the creation of a robust
               | and comprehensive dataset demands more than raw
               | computational power: It requires intricate iterations,
               | strategic topic selection, and a deep understanding of
               | knowledge gaps to ensure quality and diversity of the
               | data. We speculate that the creation of synthetic
               | datasets will become, in the near future, an important
               | technical skill and a central topic of research in AI.
               | 
               | https://arxiv.org/pdf/2309.05463.pdf
               | 
               | Synthetic data has its advantages - less bias, more
               | diverse, scalable, higher average quality. But more
               | importantly, it can cover all the permutations and
               | combinations of skills, concepts, and situations.
               | That's why a small 1.5B model like Phi was able to
               | work like a 7B model. Usually models at that scale are
               | not coherent.
        
           | smaddox wrote:
           | > And that's because, despite all of its training data, it's
           | not capable of actually reasoning.
           | 
           | Your conclusion doesn't follow from your premise.
           | 
           | None of these models are trained to do their best on any kind
           | of test. They're just trained to predict the next word. The
           | fact that they do well at all on tests they haven't seen is
           | miraculous, and demonstrates something very akin to
           | reasoning. Imagine how they might do if you actually trained
           | them or something like them to do well on tests, using
           | something like RL.
        
             | Jensson wrote:
             | > None of these models are trained to do their best on any
             | kind of test
             | 
             | How do you know GPT-4 wasn't trained to do well on these
             | tests? They didn't disclose what they did for it, so you
             | can't say it wasn't trained to do well on these tests. That
             | could be the magic sauce for it.
        
             | thesz wrote:
             | >The fact that they do well at all on tests they haven't
             | seen
             | 
             | Haven't they seen these tests?
             | 
             | We know little to nothing of how these models get trained.
        
       | mcguire wrote:
       | " _When Horace He, a machine-learning engineer, tested GPT-4 on
       | questions taken from Codeforces, a website that hosts coding
       | competitions, he found that it scored 10/10 on coding tests
       | posted before 2021 and 0/10 on tests posted after 2021. Others
       | have also noted that GPT-4's test scores take a dive on material
       | produced after 2021. Because the model's training data only
       | included text collected before 2021, some say this shows that
       | large language models display a kind of memorization rather than
       | intelligence._"
       | 
       | I'm sure that is just a matter of prompt engineering, though.
        
       ___________________________________________________________________
       (page generated 2023-10-10 23:00 UTC)