[HN Gopher] Comparing humans, GPT-4, and GPT-4V on abstraction a...
       ___________________________________________________________________
        
       Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning
       tasks
        
       Author : mpweiher
       Score  : 201 points
       Date   : 2023-11-19 11:36 UTC (11 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | tovej wrote:
       | Conclusion is obvious, but the paper is still probably necessary.
       | 
        | Of course LLMs can't reason. They pattern-match answers to
        | previously asked questions, and humans read the text as a
        | reasonable answer because we assign meaning to it. But there is
        | simply no way an LLM could use a "mental model" to "reason" about
        | a problem other than by constructing sentences out of probable
        | matches it's been trained on.
        
         | sfn42 wrote:
          | Love how the AI hype bros always downvote answers like this.
          | Everyone's just insisting that LLMs are AGI just because they
          | kinda seem that way.
        
           | viraptor wrote:
           | Ftr, I mostly agree with the comment but downvoted it for the
           | low quality "of course that's the answer, obviously" padded
           | with some unhelpful tropes about trained pattern matching.
           | (Let us know when you can prove that human communication is
           | not just trained pattern matching)
        
             | tovej wrote:
              | Ok, I'll bite. I personally don't think human communication
              | is simply pattern matching (e.g. we have drives, motives,
              | and relationships that we want to tell each other about;
              | these are internal systems, outside the visible band of
              | communication). That's a little off topic though; my post
              | was not about communication, but reasoning.
             | 
              | Pattern matching in LLMs happens specifically at the
              | language level, as in "predict the next token in a stream";
              | I probably should have specified that.
             | 
              | You seem to be referring to a more general kind of pattern
              | matching, where we're looking at a complicated structural
              | model within which humans reason about the behavior of
              | different system components and their interactions.
             | 
             | There is no way to form an internal representation of such
             | systems within an LLM, which is why I said "of course" it
             | can't reason. An LLM is not designed with this capability
             | in mind.
             | 
             | If all you have is language, and predictive text, there is
             | no way to use this to simulate and predict the rules in a
             | higher-dimensional space than "stream of tokens", or let's
             | put it another way: there would need to be an extremely
             | high correlation between the pattern in the problem space
             | and in language space for the likelihood of the model
             | guessing right to be anywhere near human levels.
        
               | stevenhuang wrote:
               | > There is no way to form an internal representation of
               | such systems within an LLM, which is why I said "of
               | course" it can't reason. An LLM is not designed with this
               | capability in mind.
               | 
                | You claim this as if it were a foregone conclusion, yet
                | it is anything but. This is an active area of study, even
                | within human cognition:
                | https://en.m.wikipedia.org/wiki/Predictive_coding
               | 
                | A complex system clearly does not need to be designed to
                | evoke some behavior. Just look to the concept of emergent
                | phenomena. See evolution through natural selection as an
                | obvious counterexample.
               | 
               | Evolution selects for fitness and not reasoning ability.
               | Yet here we are.
               | 
               | It is not known if the same isn't happening now for the
               | transformer model.
               | 
               | LLMs are universal function approximators. If reasoning
               | ability is at all representable through such a process,
               | it is possible for LLMs to find it.
               | 
               | https://en.m.wikipedia.org/wiki/Universal_approximation_t
               | heo...
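                | 
                | To see what "universal approximation" means in miniature,
                | a toy sketch: fit a small MLP to sin(x). The width and
                | step count here are arbitrary choices, purely for
                | illustration.
                | 
                |     import torch
                |     import torch.nn as nn
                | 
                |     x = torch.linspace(-torch.pi, torch.pi, 256)
                |     x = x.unsqueeze(1)
                |     y = torch.sin(x)
                |     net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                |                         nn.Linear(64, 1))
                |     opt = torch.optim.Adam(net.parameters(), lr=1e-2)
                |     for _ in range(2000):
                |         opt.zero_grad()
                |         loss = ((net(x) - y) ** 2).mean()
                |         loss.backward()
                |         opt.step()
                |     print(loss.item())  # near zero: sin approximated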
        
               | tovej wrote:
                | Neural networks are universal function approximators;
                | transformers are universal sequence-to-sequence
                | approximators. Crucially, the sequences here are the
                | token streams of an LLM. This means an LLM can do
                | arbitrary language manipulation.
               | 
               | If the reasoning task requires something that can't be
               | done using language manipulation in embeddings-space, it
               | won't be able to do it.
               | 
               | Idk, maybe my mathematical intuition is wrong, but it
               | seems obvious to me that if all you have is a function on
               | a vector (representing a token stream), then the only
               | transformations you can create are ones that can be
               | expressed in that vector space. In other words, if you
               | express a problem in a more complicated space (e.g. a
               | visual problem, or an abstract algebra problem), you will
               | not be able to solve it in the smaller token space,
               | there's not enough information. You COULD first transform
               | the vector into a different representation, perform the
               | transformation there, and then encode back.
               | 
                | This step would have to be different for every reasoning
                | problem, so the underlying transformation to solution
                | space would have to be generic and dispatch into one of N
                | reasoning subsystems.
               | 
               | Also, if we hypothetically assume this limitation doesn't
               | exist, we would still have the issue that a trained LLM
               | represents a single such transformation function.
               | 
                | If we want generic reasoning skills, the approximated
                | function would have to be different for every reasoning
                | problem. Even if we use a subset of the vector's degrees
                | of freedom to express each type of reasoning, it would
                | make any truly reasoning LLM computationally intractable,
                | because the embedding space would blow up.
        
               | int_19h wrote:
               | What couldn't you possibly represent as a bunch of
               | numbers to the level of precision necessary to process
               | it? We might not have sufficient compute power for LMs
               | large enough for some tasks yet, but I don't see how it's
               | different in principle.
               | 
               | And yes, of course such representation would be different
               | for different problems. We also have organs in our body
               | that do that for various types of sensory inputs that
               | ultimately all feed into the brain to be processed as
               | quantized impulses.
               | 
               | And GPT-4 is already supposed to be an "assembly of
               | experts", i.e. different models trained for different
               | tasks, with an orchestrating model weaving them together.
               | This is also not a surprise, given that our brains aren't
               | monolithic, either. You _should_ be able to do it with a
                | single model that's large enough, but it is likely more
               | energy- and time-efficient to get there by putting
               | smaller pieces together instead.
        
               | stevenhuang wrote:
               | > In other words, if you express a problem in a more
               | complicated space (e.g. a visual problem, or an abstract
               | algebra problem), you will not be able to solve it in the
               | smaller token space, there's not enough information
               | 
                | You're aware multimodal transformers do exactly this?
               | 
               | https://github.com/bytedance/SALMONN
        
             | sfn42 wrote:
              | I don't need to prove anything. We know for a fact that
              | LLMs are stochastic parrots. That's why they confabulate;
              | if they were "intelligent" they wouldn't just pass off
              | nonsense as fact.
              | 
              | They have been fed incredible amounts of data, which allows
              | them to stochastically parrot really complex things, and
              | this makes people think they're intelligent. Designing a
              | test that actually tests intelligence rather than parroting
              | ability is tricky, but whenever people do they fail them.
             | 
             | I just logged into ChatGPT and asked it repeatedly to
             | generate a novel thought experiment intended to prove that
             | LLMs are incapable of reasoning. I kept trying to refine it
             | by explaining what was wrong with each suggestion.
             | 
              | It completely failed. It just spat out variations of well-
              | known dilemmas or paradoxes; even when I specifically asked
              | it to create something original, it just made nonsense.
        
               | mcswell wrote:
               | "whenever people do they fail them": I assume the
               | antecedent of "they" is LLMs, not people.
        
           | danielbln wrote:
           | You don't have to be an AI "hype bro" to take issue with the
            | reductive and trite notion that LLMs are stochastic parrots
           | only. There is a continuum between that and AGI.
        
             | PaulDavisThe1st wrote:
             | A continuum like the ape climbing to the top of a tree
             | being part of a continuum that includes landing on the
             | moon?
        
               | danielbln wrote:
               | Pretty much, I'd say we're way past the fleecing and poop
               | flinging at this point and well into persistence hunting,
               | maybe even control of fire?
        
               | int_19h wrote:
               | I'd say it's more like the ape making and using its first
               | crude tool by itself.
        
             | sfn42 wrote:
             | Well I take issue with the reductive and trite notion that
             | just because an LLM can generate plausible text it's
             | suddenly maybe conscious and intelligent, maybe about to
             | end humanity etc.
             | 
             | It's exactly like the crypto hype wave. Everyone dreaming
             | up semi-plausible futures, based on a whole chain of
             | unfounded assumptions.
        
               | danielbln wrote:
                | It's plausible text, and it's useful text. LLMs aren't
                | just speculative vehicles in search of a problem, as most
                | of crypto is; they are useful today, right now. They
                | don't require any assumptions to be so, nor do they have
                | to be Skynet-style world-ending AGI to be that. You can
                | point out the problematic extrapolations of doomers
                | without being reductive towards the very real and very
                | useful capabilities of LLMs.
        
               | pixl97 wrote:
                | The only thing the LLM is missing is a self-actuation
                | loop.
               | 
                | We put a camera on a multimodal LLM; it interprets the
                | visual world and sees before it a bunch of blocks. It
                | looks at the task list it has that says "pick up red
                | blocks, put them in the blue bin". The visual component
                | identifies the red blocks and the textual component
                | issues commands to its drive unit, which calculates the
                | best path and how to use its manipulators to pick up the
                | blocks.
               | 
               | This is a very basic chain of existence. We have world
               | identification, we have actuation on motivation, we have
               | interaction with the environment. We can do this now.
               | These goals have already been achieved. Companies are
               | already testing more complex models with much more
               | general instructions such as "Pick up everything you
               | think is trash" or "Organize this room" to see the
               | emergent behaviors that come out of the models.
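                | 
                | As a rough sketch of that loop (every object and method
                | below is a hypothetical placeholder, not any existing
                | robotics or OpenAI API):
                | 
                |     # toy sense-plan-act loop; camera, llm and arm are
                |     # assumed to come from some hypothetical stack
                |     TASK = "pick up red blocks, put them in blue bin"
                | 
                |     def pick_up_red_blocks(camera, llm, arm, bin_pos):
                |         while True:
                |             scene = camera.capture()
                |             plan = llm.plan(scene, task=TASK)
                |             if not plan.targets:   # nothing left
                |                 break
                |             for block in plan.targets:
                |                 arm.move_to(block.position)
                |                 arm.grasp()
                |                 arm.move_to(bin_pos)
                |                 arm.release()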
               | 
               | You are seemingly a few years behind what has already
               | been done, and why people are starting to get concerned.
        
           | dodslaser wrote:
            | I think it's pretty funny that the moment we taught a
            | computer to talk like a human, people immediately assumed it
            | was sentient, let alone intelligent.
        
             | ben_w wrote:
             | And this assumption has only gotten more common in the ~56
             | years since (so the anecdote goes) ELIZA fooled
             | Weizenbaum's secretary.
             | 
             | "Consciousness" has ~22 different definitions; some are
             | easily met, some are not even met by the humans giving
             | those definitions (for example, I've seen people here on HN
             | who write as if consciousness can somehow magically solve
             | the halting problem).
             | 
             | There's only one test for it in AI, that I've heard of, and
             | that I think is worth exploring:
             | https://www.youtube.com/watch?v=LWf3szP6L80
        
             | jofla_net wrote:
             | I suspect this has to do with how shallowly people even
             | regard their fellow man. To the point that if the outputs
             | of an entity parrot humans well enough then, well, there
             | couldn't be any differences whatsoever, and it must bear
             | all the titles which we gift to humans.
        
             | int_19h wrote:
             | It has to do with the fact that all our definitions of
             | "sentient" and "intelligent" are extremely vague, often
             | vacuous, and generally boil down to "I know it when I see
             | it" in practice.
             | 
             | So there should be absolutely no surprise that, when
             | something comes along that behaves in certain ways very
             | much like a human (and unlike anything else), it is assumed
             | to share the basic qualities of such. It is how we judge
             | other people, after all - it's not like you do a brain scan
             | to determine whether someone you are talking to is
             | "sentient", whatever that means to you. You judge that by
             | their behavior, modelling what a "sentient" would do and
             | comparing the result to what they actually do.
        
           | jstummbillig wrote:
           | I think both of your comments are just boring and
           | condescending.
        
         | koe123 wrote:
         | On the other hand, can we conclusively say that humans aren't
         | really advanced biological stochastic parrots?
        
         | naasking wrote:
         | > Of course LLM's can't reason
         | 
         | That they are not effective at some forms of reasoning does not
         | entail they can't reason.
        
         | mcswell wrote:
         | The conclusion may be obvious to you and me (although it's hard
         | to know for certain, since these available LLMs are black
         | boxes). But it's definitely not obvious to everyone. There are
         | plenty of people saying this is the dawn of AGI, or that we're
         | a few short steps from AGI. Whereas people like Gary Marcus
          | (who knows tons more than I do) say LLMs are going off in the
         | wrong direction.
        
           | visarga wrote:
            | Yes, LLMs can't reason 100% correctly, but neither can humans.
           | We can often reason correctly, but it's not always the case.
           | 
           | Even reasoning, fundamental as it is, comes from feedback.
           | Feedback from our actions teaches us how to reason. Learning
           | from feedback is more general than reasoning - AI agents can
           | definitely learn this way too, if they have enough freedom to
           | explore. But you can't do it with supervised training sets
           | alone.
           | 
           | You need to put language models into agents in environments,
           | give them goals and rewards. Then they can make their own
           | training data and mistakes, build up their own experience.
            | You can't teach an AI based on how people make mistakes; it
            | needs to fix its own mistakes, but that means deploying it in
            | the wild, where there are consequences for errors.
           | 
           | If you remember, DeepMind first tried to train a Go model on
           | human game play, but it was limited. Then they started from
           | scratch, and learning from feedback alone they surpassed
           | human level, even though the feedback was one single bit of
           | information at the end of a whole self-play game. And it had
           | no pre-training prior to learning Go, unlike human players.
        
             | somewhereoutth wrote:
             | That presupposes that language encodes the world we live in
             | completely, whereas in fact language is meaningless without
             | the shared context of reality. 'up' and 'down' are only
             | meaningful to intelligence that can experience space.
             | 
              | Essentially, LLMs are just oracles for the shadows on the
              | wall of Plato's cave.
        
               | visarga wrote:
               | Do a reverse Chinese room experiment - remove from a
               | human all the abilities multi-modal LLMs gain after
               | training on human media. What's left? Homo ferus.
               | 
               | Most of our intelligence is encoded in the environment
               | and language, it's a collective process, not an
               | individual one. We're collectively, not individually,
               | very smart.
        
               | stefl14 wrote:
               | This isn't true. I wrote a blog post about it a while
               | back but never finished. It's complete enough to
               | demonstrate the point though so I'll plug anyway
               | https://stefanlavelle.substack.com/p/no-language-isnt-
               | enough
               | 
               | TLDR: Internal LLM representations correspond to an
               | understanding of the visual world. We've all seen the
               | Othello example, which is too constrained a world to mean
               | much, but even more interesting is that LLMs can caption
               | tokenized images with no pretraining on visual tasks
               | whatsoever. Specifically, pass an image to an encoder-
               | decoder visual model trained in a completely unsupervised
               | manner on images -> take the encoded representation ->
               | pass the encoded representation to an LLM as tokens ->
               | get accurate captions. The tests were done on gpt-j,
               | which is not multimodal and only has about 7bn params.
                | The only caveat is that a linear mapping model needs to
                | be trained to map the vector space from the encoder-
                | decoder model to the embedding space of the language
                | model, but this isn't doing any conceptual labour; it's
                | only needed to align the completely arbitrary coordinate
                | axes of the vision and language models, which were
                | trained separately (akin to an American and a European
                | agreeing to use metric or imperial -- neither's
                | conception of the world changes).
               | 
               | It's not intuitive, but it's hard to argue with these
               | results. Even small LLMs can caption images. Sure, they
               | don't get the low-level details like the texture of
               | grass, but they get the gist.
               | 
               | I keep reading your sort of analysis, but honestly, those
               | priors need updating. I had to update when learning this.
               | If 7bn params can do it, 175bn params with multimodality
               | can certainly do it.
               | 
               | It's true that humans need symbol grounding, but we don't
               | see hundreds of billions of sequences. There are
               | theoretical reasons (cf category theory) why this could
               | work, albeit probably limited to gist rather than detail.
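                | 
                | For concreteness, a minimal sketch of that kind of linear
                | bridge (the dimensions, names and batch contents here are
                | placeholder assumptions, not the actual setup used in
                | those experiments):
                | 
                |     import torch
                |     import torch.nn as nn
                | 
                |     d_img, d_model, k = 768, 4096, 4  # assumed sizes
                | 
                |     class LinearBridge(nn.Module):
                |         # only trained weights: a single linear map
                |         # from image-encoder space to k "soft tokens"
                |         def __init__(self):
                |             super().__init__()
                |             self.proj = nn.Linear(d_img, k * d_model)
                | 
                |         def forward(self, feat):    # (batch, d_img)
                |             soft = self.proj(feat)
                |             return soft.view(-1, k, d_model)
                | 
                |     bridge = LinearBridge()
                |     img_feat = torch.randn(2, d_img)  # frozen encoder
                |     prefix = bridge(img_feat)         # (2, k, d_model)
                |     cap_emb = torch.randn(2, 10, d_model)  # captions
                |     lm_in = torch.cat([prefix, cap_emb], dim=1)
                |     # lm_in goes to the frozen LLM via inputs_embeds;
                |     # train only bridge.proj with a captioning loss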
        
               | int_19h wrote:
               | The LLMs do indeed deal with Plato's shadows, but so do
               | we - what we "see", after all, is not the actual up or
               | down, but a series of neural activations from our retinas
               | (which aren't even 3D, so concepts like "behind" are only
               | captured by proxy). Such activations can all be readily
               | encoded into tokens, which is exactly what models
               | specifically trained to describe images do.
        
               | pixl97 wrote:
                | Of course, you're saying that LLMs can only train on
                | textual data, whereas we are currently developing
                | multimodal AI that takes visual, audio, and whatever
                | other kinds of sensor data and turns them into actionable
                | information.
        
           | int_19h wrote:
           | The real question isn't whether the LLM can reason.
           | 
            | The question is whether an assembly of components, one of
            | which is an LLM (others would include memory and whatever
            | else is needed to make it a self-contained loop with a notion
            | of self-identity), can reason.
        
       | baxtr wrote:
        | Could a human and an LLM submit a summary of the paper so we
        | can compare?
        
         | j2kun wrote:
         | That's what an abstract is for:
         | 
         | > Our experimental results support the conclusion that neither
         | version of GPT-4 has developed robust abstraction abilities at
         | humanlike levels.
        
         | z7 wrote:
         | >The paper investigates the abstract reasoning abilities of
         | text-only and multimodal versions of GPT-4 using the ConceptARC
         | benchmark, concluding that neither version of GPT-4 has
         | developed robust abstraction abilities at humanlike levels.
        
       | cs702 wrote:
       | Interesting. Immediate thoughts and questions:
       | 
       | * How would human beings perform on the text-only version of the
       | tasks given to GPT-4?
       | 
       | * How would human beings perform if each grid is shown on its
       | own, making it impossible to perform side-by-side visual
       | comparisons?
       | 
       | * How would human beings perform if each grid is shown on its own
        | _only once_, making it impossible to perform _any_ back-and-
       | forth comparisons?
       | 
       | * How could we give LLMs the ability to "pay attention" to
       | different parts of images, as needed, so they can make back-and-
       | forth comparisons between parts of different images to solve
       | these kinds of visual reasoning tasks?
        
         | mistermann wrote:
         | * How would human beings perform if they didn't know they were
         | being tested (ie: if in the same mode they are in when writing
         | comments on the internet)?
         | 
         | * How would human beings perform if the questions are based on
         | culture war topics, which tend to invoke System 1
         | intuitive/emotional thinking?
        
         | YetAnotherNick wrote:
          | Also, I want to know how much gain could be made by optimizing
          | the prompt for GPT and by including things like CoT. The
          | current version of the prompt is pretty bad for both humans
          | and AI.
        
         | QuadmasterXLII wrote:
         | > How could we give LLMs the ability to "pay attention" to
         | different parts of images, as needed, so they can make back-
         | and-forth comparisons between parts of different images to
         | solve these kinds of visual reasoning tasks?
         | 
         | I've got good news
        
           | oefnak wrote:
           | What is it?
        
           | phh wrote:
           | It's even all we need
        
       | air7 wrote:
       | Thank god...
        
       | firtoz wrote:
        | The best thing about research like this is that it allows new
        | models to be built, or existing ones to be improved, in ways
        | that can lead them to pass these evaluations.
        
         | lewhoo wrote:
          | Yeah, but the question remains: are they passing because of a
          | newly obtained/emergent quality, or are they passing because
          | they were taught specifically how to pass, or is there even a
          | difference at this point?
        
           | firtoz wrote:
           | Well, while training AI we make sure that we include enough
           | training for them to pass some evaluation, but then test them
           | on things that were not in the training data to make sure
            | that they are not "overfitted".
        
             | viraptor wrote:
             | I don't think parent meant training specifically on that
             | test data, but rather in that kind of task. Think more "LLM
             | trained only on images of art" doing badly on this test,
             | but then "LLM trained on images of art and IQ puzzles"
             | doing better on this test.
             | 
             | It's kind of like asking what even is IQ if you can learn
             | how to solve Mensa puzzles and improve your score. Does it
             | mean you're more intelligent?
        
               | firtoz wrote:
               | Oh, I see.
               | 
               | One guess for GPT4 is that it's an amalgamation of
               | models. So you would have various models trained more
               | specifically like the ones you mentioned, and you ask
               | them all to start answering the query, then you choose
               | which one produces better results, and then you present
               | that to the user.
               | 
                | Alternatively, you can have a decider model that knows
                | what kind of query goes to which specialised model, then
                | have that as a "hidden layer", whether at the application
                | level or in the "neuron layers".
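                | 
                | As a toy illustration of that routing shape (this is not
                | how GPT-4 is actually built; the decider and the experts
                | below are placeholder callables):
                | 
                |     from typing import Callable, Dict
                | 
                |     Handler = Callable[[str], str]
                | 
                |     def route(query: str, decider: Handler,
                |               experts: Dict[str, Handler]) -> str:
                |         # decider returns a label such as "code",
                |         # "vision" or "chat"; fall back to "chat"
                |         label = decider(query)
                |         expert = experts.get(label, experts["chat"])
                |         return expert(query)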
        
               | broast wrote:
               | If you've checked out the docs for the assistant api, you
               | can intuit that there is a higher level system which
               | decides which subsystems to use to respond. The assistant
               | determines whether to use a tool (vision, code
               | interpreter, search, retrieval), as well as which code or
               | text language model to use to generate the response.
        
               | pixl97 wrote:
               | Your last statement reminds me of the folktale of John
               | Henry. That of generalized capability versus that of
               | specialization.
               | 
                | Under a narrow range of tests, specialization is pretty
                | much guaranteed to win. Specialization, in all cases I
                | know of, comes at a cost to general capability. It's like
                | one of those quips, "Pick any two: fast, cheap, good":
                | the more you pull at one category, the more the costs
                | rise in the other categories.
        
             | EdiX wrote:
             | That's the idea, yes. However none but OpenAI knows exactly
             | what ChatGPT was trained on. In fact, the dataset that it
             | was trained on is so vast that they probably don't know
             | either if it contains any given question. IIRC last week I
             | saw a study where GPT4 can solve some leetcode problems
             | simply by giving it the number, no description. A clear
             | example of overfitting.
        
               | viraptor wrote:
               | There was more information provided. And it's possibly
               | not even overfitting. See
               | https://news.ycombinator.com/item?id=38205153
        
               | falcor84 wrote:
               | These tests should be conducted on new questions. And if
               | we as humans no longer have the ability to create
               | original questions, then maybe we should just retire.
        
       | krona wrote:
       | This paper evaluates performance compared to a 'human' which
       | presumably is an average adult human without cognitive
        | impairment. I had to dig into the references:
       | 
       |  _In the first batch of participants collected via Amazon
       | Mechanical Turk, each received 11 problems (this batch also only
       | had two "minimal Problems," as opposed to three such problems for
       | everyone else). However, preliminary data examination showed that
       | some participants did not fully follow the study instructions and
       | had to be excluded (see Section 5.2). In response, we made the
       | screening criteria more strict (requiring a Master Worker
       | qualification, 99% of HITs approved with at least 2000 HIT
       | history, as opposed to 95% approval requirement in the first
       | batch). Participants in all but the first batch were paid $10
       | upon completing the experiment. Participants in the first batch
       | were paid $5. In all batches, the median pay-per-hour exceeded
       | the U.S. minimal wage._
       | 
       | (Arseny Moskvichev et al)
       | 
       | So in conclusion, this isn't a random sample of (adult) humans,
       | and the paper doesn't give standard deviations.
       | 
        | It would've been more interesting if they had sampled an age
        | range of humans on which we could place GPT-4, rather than just
        | 'it's not as good', which is all this paper can say, really.
        
         | cs702 wrote:
         | Also, it's possible there are LLMs pretending to be human
         | beings on Mechanical Turk!
        
           | jgalt212 wrote:
            | What per cent of Mechanical Turk HITs are being performed by
           | an LLM these days?
        
             | dmd wrote:
             | Several researchers where I work used to use mturk
             | regularly and now have been forced to stop using it
             | entirely because of this. They started getting "As a large
              | language model" answers ... even in things where the only
             | possible answer was 'true' or 'false' or '1-7'.
        
               | ttul wrote:
               | "I hope this email finds you well..."
        
               | EGreg wrote:
               | I love how your CompanyName does XYZ in field ABC. Buy
               | our product. Let's talk!
        
               | kordlessagain wrote:
               | Click here for virtual coffee with me!
        
               | jgalt212 wrote:
                | According to Eric Schmidt, OpenAI is worth $90B. LLMs
                | being unable to master MTurk and a $90B valuation are
                | incongruent.
               | 
               | https://twitter.com/ericschmidt/status/172562514451990964
               | 8
               | 
               | Who's right? Who's wrong?
        
               | dmd wrote:
               | Good question. Let's go ask our expert, an internet
               | commenter who calls himself "John Galt", to step in with
               | an opinion based solely on market value!
        
           | svat wrote:
           | See the "Artificial Artificial Artificial Intelligence"
           | paper: https://arxiv.org/abs/2306.07899
        
         | colincooke wrote:
          | My wife studies people for a living (experimental cognitive
          | psychologist), and the quality of MTurk is laughable; if
          | that's our standard for higher-level cognition then the bar is
          | low. You'll see the most basic "attention check" questions
          | ("answer option C if you read the question") be failed
          | routinely. Honestly, at this point I think GPT-4 would do a
          | better job than most MTurkers at these tasks...
         | 
          | She has found that Prolific is substantially better (you have
          | to pay more for it as well); however, that may only be because
          | it's a higher-cost/newer platform.
        
           | PaulHoule wrote:
           | My take is the tasks on Turk are awful and will drive away
           | anybody decent.
           | 
           | I had a time when I was running enough HITs to get a customer
           | rep and felt I was getting OK results. I wanted to get better
           | at running HITs so I thought I would "go native" as a Turk
           | and try to make $50 or so but I could not find tasks to do
           | that were at all reasonable. Instead they'd want me to "OCR"
           | a receipt that was crumpled up and torn and unreadable in
           | spots and said they'd punish me for any mistakes.
        
           | amelius wrote:
            | > Honestly, at this point I think GPT-4 would do a better
            | job than most MTurkers at these tasks...
           | 
           | From the article:
           | 
           | > Our experimental results support the conclusion that
           | neither version of GPT-4 has developed robust abstraction
           | abilities at humanlike levels.
           | 
           | This makes the conclusion only worse for GPT-4 ...
        
             | cosmojg wrote:
             | > In the first batch of participants collected via Amazon
             | Mechanical Turk, each received 11 problems (this batch also
             | only had two "minimal Problems," as opposed to three such
             | problems for everyone else). However, preliminary data
             | examination showed that some participants did not fully
             | follow the study instructions and had to be excluded (see
             | Section 5.2).
             | 
             | If they stuck to the average Mechanical Turk worker instead
             | of filtering for "Master Workers," the parent's conclusions
             | likely would've aligned with those of the study.
              | Unfortunately, it seems the authors threw out the only data
              | that didn't support their hypothesis, as GPT-4 did, in
              | fact, outperform the median Mechanical Turk worker,
              | particularly in terms of instruction following.
        
               | nrfulton wrote:
                | _> Unfortunately, it seems the authors threw out the only
                | data that didn't support their hypothesis, as GPT-4 did,
                | in fact, outperform the median Mechanical Turk worker,
                | particularly in terms of instruction following._
               | 
                | MTurk, to a first approximation, is a marketplace that
                | pays people pennies to fill out web forms. The obvious
                | thing happens. The median Mechanical Turk worker probably
                | either isn't a human, isn't _just_ a (single) human,
                | and/or is a (single) human but is barely paying attention
                | + possibly using macros. Or even just button mashing.
               | 
               | That was true even before GPT-2. Tricks like attention
               | checks and task-specific subtle captcha checks have been
               | around for almost as long as the platform itself. Vaguely
               | psychometric tasks such as ARC are particularly difficult
               | -- designing hardened MTurk protocols in that regime is a
               | fucking nightmare.
               | 
               | The type of study that the authors ran is useful if your
               | goal is to determine whether you should use outputs from
               | a model or deal with MTurk. But results from study
               | designs like the one in the paper rarely generalize
                | beyond the _exact_ type of HIT you're studying and the
               | _exact_ workers you finally identify. And even then you
               | need constant vigilance.
               | 
                | I genuinely have no idea why academics use MTurk for
                | these types of small experiments. For a study of this
                | size, getting human participants that fit some criteria
                | to show up at a physical lab space or log in to a Zoom
                | call is easier and more robust than getting a
                | sufficiently non-noisy sample from MTurk. The first
                | derivative on your dataset size has to be like an order
                | of magnitude higher than the overall size of the task
                | they're doing for the time investment of hardening an
                | MTurk HIT to even begin to make sense.
        
               | warkdarrior wrote:
               | This is just coming up with excuses for the MTurk
               | workers. "they were barely paying attention", "they were
               | button mashing", "they weren't a single human", etc.
               | 
                | It turns out that GPT-4 does not have those problems. The
                | comparison in the paper is not really fair, since it does
                | not compare average humans vs GPT-4; it compares "humans
                | that did well at our task" vs GPT-4.
        
               | nrfulton wrote:
               | _> This is just coming up with excuses for the MTurk
               | workers_
               | 
               | No. The authors are not trying to study MTurk market
               | dynamics. They are trying to compare humans and LLMs.
               | 
               | Both questions are interesting and useful. This study is
               | only asking about the second question. That's okay.
               | Isolating specific questions and studying them without a
               | bunch of confounds is one of the basic principles of
               | experiment design. The experiment isn't intended to
               | answer every question all at once. It's intended to
               | answer one very specific question accurately.
               | 
               | LLMs can both be worse at Mensa tasks and also better
               | than humans at a variety of reasoning tasks that have
               | economic value. Or, LLMs can be worse at those reasoning
               | tasks but still reasonably good enough and therefore
               | better on a cost-adjusted basis. There's no contradiction
               | there, and I don't think the authors have this confusion.
               | 
               |  _> The comparison in the paper is not really fair_
               | 
               | The study is not trying to fairly compare these two
               | methods of getting work done in general. It's trying to
               | study whether LLMs have "abstraction abilities at
               | humanlike levels", using Mensa puzzles as a proxy.
               | 
                | You can take issue with the goal of the study (like I
                | do). But given that goal, the authors' protocols are
                | completely reasonable as a minimal quality control.
                | 
                | Or, to put this another way: why would _NOT_ filtering
                | out clickbots and humans speedrunning surveys for
                | $0.25/piece result in a more insightful study, _given the
                | authors' stated research question_?
               | 
               |  _> It turns out that GPT-4 does not have those
               | problems._
               | 
               | I think the authors would agree but also point out that
               | these problems aren't the ones they are studying in this
               | particular paper. They would probably suggest that this
               | is interesting future work for themselves, or for labor
               | economists, and that their results in this paper could be
               | incorporated into that larger study (which would
               | hopefully generalize beyond MTurk in particular, since
                | MTurk, inter alia, is such a uniquely chaotic subset of
                | the labor market).
               | 
               | For me, the problems with the study are:
               | 
               | 1. The question isn't particularly interesting because no
               | one cares about Mensa tests. These problem sets make an
               | implicit assumption that psychometric tools which have
               | some amount of predictive power for humans will have
               | similar predictive power for LLMs. I think that's a naive
               | assumption, and that even if correlations exist the
               | underlying causes are so divergent that the results are
               | difficult to operationalize. So I'm not really sure what
               | to do with studies like this until I find an ethical
               | business model that allows me to make money by automating
                | Mensa-style test-taking en masse. Which I kind of hope
               | will ever exist, to be honest.
               | 
               | 2. MTurk is a hit mess (typo, but sic). If you want to do
               | this type of study just recruit human participants in the
               | old fashioned ways.
               | 
               | But given the goal of the authors, I don't think applying
               | MTurk filters is "unfair". In fact, if anything, they're
               | probably not doing enough.
        
         | petermcneeley wrote:
         | This critique in no way invalidates the conclusions of the
         | paper.
        
           | krona wrote:
            | As I point out, the critique is simply that the paper isn't
            | very good. In fact, I'd go as far as to say it's egregiously
            | bad, even if the conclusion is 'correct' insofar as something
            | was measured and they reported it.
        
             | arp242 wrote:
             | The conclusion of "GPT-4 can't even compete against low-
             | quality human effort like that found on MTurk" seems
             | alright. The article would be better if it didn't claim to
             | compare "humans", because it doesn't - or at least, not in
             | the common sense definition of "comparing humans vs GPT-4".
             | But to be honest I think that's a bit pedantic.
             | 
             | It would be different if it had found that GPT-4
             | outperformed MTurk, but if GPT-4 can't compete against
             | MTurk on these tasks, then I don't think it will fare
             | better against you or me with our full attention.
        
           | avs733 wrote:
            | If it's making a claim about humans vs. LLMs, but shittified
            | capitalism means that the humans are just using LLMs to make
            | money churning on Mechanical Turk, then you can't make the
            | underlying claim.
            | 
            | That ignores the random-sample problems and the simplistic
            | statistical analysis... so yeah, it unfortunately does.
            | 
            | It doesn't prove they are wrong, but it's not credible to say
            | they are right.
        
             | Gare wrote:
              | It's only important that humans (or "humans") were about
              | 90% correct, whereas GPT-4 was only 33% correct. If those
              | Turkers were using AI, that would mean there is AI out
              | there much more capable than GPT-4.
        
               | avs733 wrote:
               | It doesn't...
               | 
                | The problem is that unless the data are credible and
                | reliable, any interpretation of that data is meaningless.
                | Whether it's for or against one's hypotheses doesn't
                | matter if the data is junk. If the data is junk, you
                | can't move past that; it's the end of the conversation,
                | because the data has no meaning.
        
             | LogicalRisk wrote:
             | From reading the paper and the original paper that the data
             | for the MTurk/Prolific samples are drawn from, this is a
             | convenience sample of 415 humans on two platforms. Each
             | worker received a random sample of the ConceptARC problems,
             | and the average score correct is assigned the "Human"
             | benchmark.
             | 
             | Perhaps by "random sample problems" you mean that the study
             | is not representative of all of humanity? If so we can
             | still take the paper as evaluating these 415 humans who
              | speak English against the two models. If, as you say, the
              | workers are actually just using LLMs, then this implies
              | there is some LLM that your average MTurk worker has access
              | to that outperforms GPT-4 and GPT-4V. That seems
              | *extremely* unlikely, to say the least.
             | 
             | There is no need for any complex statistical analysis here
             | since the question is simply comparing the scores on a
             | test. It's a simple difference in means. Arguably, the main
             | place that could benefit from additional statistical
             | procedures would be weighting the sample to be
             | representative of a target population, but that in no way
             | affects the results of the study at hand.
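              | 
              | As a back-of-envelope illustration, using the approximate
              | numbers quoted upthread (humans ~90% over 415 workers,
              | GPT-4 ~33%); the GPT-4 item count is an assumption, for
              | illustration only:
              | 
              |     from math import sqrt
              | 
              |     p_h, n_h = 0.90, 415  # human accuracy, workers
              |     p_g, n_g = 0.33, 480  # GPT-4 accuracy, items (assumed)
              | 
              |     diff = p_h - p_g
              |     se = sqrt(p_h * (1 - p_h) / n_h +
              |               p_g * (1 - p_g) / n_g)
              |     print(f"diff {diff:.2f} +/- {1.96 * se:.2f}")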
        
         | a1j9o94 wrote:
          | Another thing I take issue with is that this doesn't seem to
          | use known ways to improve LLM performance, such as chain-of-
          | thought and tree-of-thought prompting.
        
           | arjvik wrote:
           | There's also the "take a deep breath" and other "unreal
           | engine"-esque prompting tricks
        
             | 1024core wrote:
             | Ideally they would use the same prompt for LLMs as they use
                | for humans, and judge them in exactly the same manner. Only
             | then is it a fair comparison.
        
               | arjvik wrote:
               | Why can't the LLM be a (model, prompt prefix) pair?
        
         | hackerlight wrote:
         | I also want to see GPT-3 vs GPT-4 comparison on these tasks.
        
         | karmakaze wrote:
          | This is interesting in a 'human interest news' report way, but
          | it does no more to judge current systems than average people
          | mistaking older, less capable chatbots for humans did.
        
         | svnt wrote:
         | What is your concern exactly?
         | 
         | This was a first-pass study in a field addressing some of the
          | criticisms leveled against an earlier study where the spatial
          | reasoning problems were viewed as too hard. They seemingly
         | made the spatial reasoning questions as easy as they could.
         | 
         | The qualifications they put on MTurk are pretty standard if you
         | want humans who care about what they are doing responding to
         | your study. It costs more to do this.
         | 
         | It is a limitation of science that is both budgetary and
         | procedural.
         | 
         | By calling into question their results you seem to be
         | suggesting that an average human would only 33% of the time be
         | able to tell e.g. how many points are inside a box, or whether
         | more points are inside or outside of a box. This is extremely
         | basic spatial reasoning we are talking about.
         | 
         | The problem they were addressing with the settings is just
         | noise in the results by cheap bots and clicky humans trying to
         | earn $0.50. It is endemic on MTurk.
        
           | kbenson wrote:
           | > By calling into question their results
           | 
           | Did they? I took it more as a clarification of a point so
           | people could more easily know what it meant, and wishing they
           | did more so we could know more.
           | 
           | Their last sentence in fact notes that what the paper says is
            | correct, and they just wish we could make a more nuanced
            | statement if it were done with more specific constraints.
        
       | skepticATX wrote:
       | It has been interesting to see evidence accumulating that shows,
       | despite initial excitement bred by papers such as "Sparks", there
       | is something missing from current language models.
       | 
       | Individually none of these results will ever get the attention of
       | a "Sparks" type paper, but collectively a strong case has been
       | built.
        
         | naasking wrote:
         | Sparks of AGI is not AGI. It's also possible that we're not
         | testing LLMs fairly, or that merely slight tweaks to the
         | architecture or methods would address the issues. I think this
         | comment elaborates nicely:
         | 
         | https://news.ycombinator.com/item?id=38332420
         | 
         | I do think there might be something missing, but I also suspect
         | that it's not as far off as most think.
        
           | incongruity wrote:
           | So, in other words, perhaps what we have is a necessary
           | component but not wholly sufficient on its own?
        
             | amelius wrote:
             | Perhaps not even necessary.
        
             | pixl97 wrote:
             | That is my take on it.
             | 
              | I think embodiment and the encoding of the natural laws
              | (gravity, force, etc.) that go into that will be another
              | huge step in grounding AI. People tend to gravitate toward
              | thinking about humanoid robots when that is mentioned (and
              | thereby Terminators), but honestly I would think things
              | closer to sensor networks involving thousands or millions
              | of bodies, like a hivemind, would be more likely (why stick
              | to the human level of a single body if you didn't have
              | to?). Interaction with the world is a means of determining
              | truth... The ability to perform science.
             | 
             | And as hard as embodiment is, it will be the easy part in
             | my opinion. Continuous learning without losing the plot is
             | going to be quite the challenge. If an LLM has something
             | wrong, how does it update and change that bit of
             | information without huge amounts of power use? How do you
             | make the system learn 'important' things without filling up
             | with junk/spam it is subject to? How do you keep the system
             | aligned with a goal that is not destructive to itself or
             | others?
        
               | naasking wrote:
               | But embodiment being a bottleneck could indicate that
               | it's a data/training issue, rather than an architectural
               | issue. Multimodal training data improves GPT-4 already,
               | but that's still very little data compared to growing up
               | to a full human adult. There are still many things to
               | try.
        
             | pohl wrote:
             | That has always been my impression, despite the myriad ways
             | that LLMs impress.
             | 
             | So much potential is lost just in the request/response
             | limitation. While I'm waiting for a response from GPT-4,
             | I'm continuing to think. Imagine if the reverse were true.
             | AGI needs to be able to mull things over for spans of time.
        
               | pixl97 wrote:
                | At least for any company trying to sell a product, this
                | is going to be an issue with operations costs.
               | 
               | Also this gets into the halting problem. How many
                | resources do you expend on finding an answer? In a human,
                | issues will typically come up, like having to pee or eat,
                | or something outside our body interrupting us. For an
               | AI, how much time should it spend? Do we want to wake up
               | one day finding our data centers running at full tilt?
               | 
                | This said, there have been some attempts at working on
                | agent-based systems that reach out for answers from
                | multiple places and pool the data, then run things like
                | chain of thought on that data pool.
        
       | kenjackson wrote:
       | Can someone provide the prompts in text rather than the images
       | from the paper? That would make it easier to try and replicate
       | results.
        
         | devbrac wrote:
         | Raw json files for each task from the ConceptARC paper are
         | here: https://github.com/victorvikram/ConceptARC. It would be
         | trivial to convert these into prompts using the format in the
         | appendix of the paper.
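          | 
          | For instance, a rough sketch of such a conversion, assuming
          | the files follow the standard ARC layout of train/test pairs
          | of integer grids (check the repo and the paper's appendix for
          | the exact prompt format):
          | 
          |     import json
          | 
          |     def grid_to_text(grid):
          |         return "\n".join(" ".join(str(c) for c in row)
          |                          for row in grid)
          | 
          |     def task_to_prompt(path):
          |         with open(path) as f:
          |             task = json.load(f)
          |         parts = []
          |         for i, pair in enumerate(task["train"], 1):
          |             parts.append(f"Example {i} input:\n"
          |                          + grid_to_text(pair["input"]))
          |             parts.append(f"Example {i} output:\n"
          |                          + grid_to_text(pair["output"]))
          |         parts.append("Test input:\n"
          |                      + grid_to_text(task["test"][0]["input"]))
          |         parts.append("Test output:")
          |         return "\n\n".join(parts)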
        
       | kaoD wrote:
        | Maybe I'm missing what "abstraction" means here, but it seems
        | like the tasks were centered around grids and other spatial
        | problems, which are a very limited subset of
        | abstraction/reasoning.
       | 
       | In my experience GPT4/V is pretty bad at those specifically, not
       | necessarily around abstraction in general. Positions, rotations,
        | etc. are concepts that GPT4 finds very hard to apply, which is
       | kinda unsurprising since it has no body, no world, no space; it
       | "lives" in the realm of text. DALLE3 suffers a similar problem
       | where it has trouble with concepts like "upside down" and
       | consistently fails to apply them to generated images.
        
         | pixl97 wrote:
         | What would directions be for an intelligent creature that lives
         | in zero gravity? I just like thinking about this for the same
         | reasons humans like writing speculative science fiction. Trying
          | to guess what alien perspectives look like might also give us
         | insights when we're the ones making the alien.
        
           | Retric wrote:
            | Basically the same; gravity doesn't define left/right or
            | North, South, East, and West for us, just up and down.
        
             | trealira wrote:
             | However, North, South, East, and West are relative to the
             | poles of the Earth. Something living in zero gravity would
             | have to use some object as an anchor to determine the
             | direction.
        
               | Retric wrote:
               | You're also oriented based on objects. We don't have an
               | abstract compass pointing north 24/7 the way we can use
               | our bodies to determine left and right or gravity to
               | point down.
        
               | trealira wrote:
               | Right, that's why we use compasses, which use the poles
               | of the Earth to determine the direction.
               | 
               | Something living in zero gravity doesn't have a planet,
               | so they'd have to find something else to base the
               | directions on.
               | 
               | That's what I was trying to say before.
        
               | withinboredom wrote:
               | No, but they would have front and back, and people from
               | the bridge would share which way was "up" and "down" and
               | "left" and "right" based on the controls.
        
               | anonymouskimmer wrote:
               | The solar system has a north pole and a south pole based
               | on the rotation of the Sun. Basically the only places in
               | which there isn't something to orient against are in the
               | depths of inter-galactic-cluster voids with nothing
               | around. And if a being is stuck in one of those voids,
               | orientation is way down the list of problems they have.
        
               | trealira wrote:
               | That's a good point. The sun of a solar system could
               | possibly be what an alien society living in zero gravity
               | bases their directions on.
        
               | int_19h wrote:
               | FWIW there is some interesting variability among human
               | cultures on that, as well. There are a few that actually
               | use cardinal directions predominantly or exclusively
               | instead of body-relative ones like "left" and "right".
        
         | bloaf wrote:
         | It's also worth remembering that blind humans who can recognize
         | squares by feel do not have the ability to recognize squares by
         | sight upon gaining vision.
         | 
         | I suspect the model is bad at these kinds of "reasoning" tasks
         | in the same way that a newly-sighted person is bad at
         | recognizing squares by sight.
        
           | eulgro wrote:
            | Out of curiosity, when did blind humans gain vision?
        
             | withinboredom wrote:
              | The first one I heard about, 10-15 years ago, worked by
              | projecting an image onto the tongue. Ahh, here it is:
             | https://www.scientificamerican.com/article/device-lets-
             | blind...
        
             | bloaf wrote:
             | https://www.projectprakash.org/_files/ugd/2af8ef_5a0c6250cc
             | 3...
             | 
             | They studied people with treatable congenital blindness
             | (dense congenital bilateral cataracts)
        
         | joe_the_user wrote:
         | _In my experience GPT4 /V is pretty bad at those specifically,
         | not necessarily around abstraction in general._
         | 
          | The problem with a statement like this is that it leaves the
          | door open to accepting any kind of canned generality as
          | "abstraction in general". Abstract reasoning is indeed a
          | fuzzy/slippery concept, and spatial reasoning may not capture
          | it well, but I'm pretty sure it captures it better than a
          | general impression of ChatGPT does.
         | 
         |  _...since it has no body, no world, no space; it "lives" in
         | the realm of text._
         | 
          | There's a bizarre anthropomorphism on this thread, both the
          | reflexive comparison of this software system to a blind human
          | and the implicit call to be considerate of this thing's
          | supposed disability.
        
           | og_kalu wrote:
            | >The problem with a statement like this is that it leaves
            | the door open to accepting any kind of canned generality as
            | "abstraction in general".
           | 
           | Not really
           | 
           | https://arxiv.org/abs/2212.09196
        
             | joe_the_user wrote:
             | Nah,
             | 
              | By the very fact that there's a paper here, whatever its
              | merit, the authors of the paper have codified their concept
              | of generality, and this doesn't validate the point I was
              | replying to, which was essentially "my impression/feeling
              | is that it is better".
        
               | og_kalu wrote:
                | The point is that it's good at abstract reasoning that
                | isn't spatially grounded, as in that paper. So it's not
                | really leaving any door open. It's not a cop-out; that's
                | just how it is.
        
           | int_19h wrote:
           | Why is it bizarre to consider the limitations inherent in the
           | input data on which the model is trained? Fundamentally, it
           | still "sees" the world through text, and the extent to which
           | it can "understand" spatial relationships is defined by that.
            | It seems utterly unsurprising that this leads to a very poor
           | grasp of the actual concepts behind what things like "above"
           | or "left" are - the text that humans produce when talking
           | about such things kinda relies on the reader having their own
           | experience (if not vision, then at least body awareness) that
           | can be mapped to those concepts. You can explain "left" and
           | "right" to a human by telling them which of their hands is
            | which, and I can't help but wonder what information payload
            | this actually constitutes once you consider the bodily
            | spatial awareness that is brought into context by
            | association.
        
             | dragonwriter wrote:
             | > Fundamentally, it still "sees" the world through text
             | 
             | Fundamentally, it "sees the world" [0] through tokens,
             | which are not text.
             | 
             | [0] Also a bad metaphor, but...
        
               | Jensson wrote:
                | A string of tokens is text. Tokens are just another
                | alphabet, like Japanese, which has many representations
                | for the same sounds and where a single letter can
                | sometimes be an entire word.
        
               | int_19h wrote:
               | Technically true, but when those tokens are 1:1 mapped to
               | text, I think we can simplify this down without losing
               | anything important.
               | 
               | Of course, once you start using tokens for other things -
                | as multimodal LMs already do - that changes. But this
                | current crop of models still has visual modality in its
                | infancy IMO, and gauging the overall performance of the
                | model as a whole based on that is very questionable.
        
             | joe_the_user wrote:
             | _Why is it bizarre to consider the limitations inherent in
             | the input data on which the model is trained?_
             | 
              | Sure, the thing is limited; the study is a demonstration of
              | this (and general-purpose abilities have been claimed for
              | LLMs at various points).
             | 
             | I was pushing back against the "it's like a blind person"
             | anthropomorphizing argument [edit: especially the
             | assumption these things learn through experience and
             | _reflection_ which the parent also makes]. Maybe if the
             | thing  "had eyes", it could learn spatial information and
             | maybe it couldn't (though it would take a lot of work to
             | make that metaphor meaningful). The thing certainly doesn't
              | learn text in the fashion that a human learns speech, since
             | humans don't digest the entire Internet before they can
             | speak.
        
               | pixl97 wrote:
               | I'd recommend looking up model grounding by multi-modal
               | training. Seemingly models improve as you add more modes.
        
         | lazy_moderator1 wrote:
         | > which is kinda unsurprising since it has no body, no world,
         | no space; it "lives" in the realm of text
         | 
         | or rather the training set was lacking in this regard
        
         | Sharlin wrote:
         | > DALLE3 suffers a similar problem where it has trouble with
         | concepts like "upside down" and consistently fails to apply
         | them to generated images.
         | 
         | This has nothing to do with having "no body, no world" and
         | everything to do with the fact that training pictures where
          | things are upside down are simply vastly rarer than pictures
         | where they aren't.
        
         | mr_toad wrote:
         | > DALLE3 suffers a similar problem where it has trouble with
         | concepts like "upside down" and consistently fails to apply
         | them to generated images.
         | 
         | There's probably not many (if any) upside down images or
         | objects in the training data.
        
       | xbmcuser wrote:
        | It has been really interesting, over the last few years of
        | machine learning, to read that a model can't do this or that,
        | and then the next week or month to read that it can. ChatGPT and
        | the models that have come after seem to have accelerated this
        | back and forth a lot. Unless you keep up with it closely and
        | keep updating your information, I think what you knew it could
        | or could not do well is no longer correct.
        
       | RecycledEle wrote:
       | What every paper I have seen so far is missing is that there are
       | many ways to achieve super-human intelligence. (I need to give
        | credit to Isaac Arthur of SFIA for this.)
       | 
       | Getting results faster is one way. AIs beat me in speed.
       | 
       | Getting results cheaper is another way. AI is cheaper than I am.
       | 
       | Knowledge across many fields is better. AI beats me here too.
       | 
       | Getting better results in one narrow field is another way, but
        | only one of many ways. I love evaluations of human-produced work
        | vs. machine-produced work. If we had quality evaluations (not
        | the typo-riddled garbage most people use), if we compared AIs to
        | people who work in those fields in occupations recognized by the
        | US Dept of Labor, if we asked both sides to justify their
        | answers, and if we had statistically significant sample sizes,
        | then maybe we could get some good results on quality of work. I
        | can imagine the US DOL spending billions of dollars benchmarking
        | AIs against humans in all the occupations they recognize.
        | Alternately, this could be a very profitable company.
        
         | jacquesm wrote:
         | That's why the 'G' in AGI is so important.
        
         | varispeed wrote:
         | I view LLM as a search engine, that can find and present the
         | knowledge in a way that is easier to consume.
         | 
          | For instance: Find me a solution to 2x2. I see this as a
          | search problem, not a reasoning problem.
         | 
         | I think many scientists don't look at it from this angle and
         | push the agenda that somehow current AI can reason. It can't.
         | It's just a search engine / pattern matching contraption that
         | works so well it can fool people into thinking they interact
         | with an intelligence.
         | 
         | Frankly, many people "think" in the same way. They store facts
         | in their brain and then retrieve them when asked, but they have
         | limited ability to arrive at new facts from those they already
         | know, without external input.
         | 
          | In that sense, the "AI" can certainly replace humans in tasks
          | that don't require thinking, or where there is a need to
          | distil information from a wide array of sources in a useful
          | way.
        
           | ben_w wrote:
           | I see it as translation, not search. Search is already done
           | fantastically well at small scale with simple indexing, and
           | (before adversarial reactions from SEO) at internet scale
            | with PageRank.
           | 
           | Asking if LLMs are really reasoning or not feels like an
           | argument about terminology, like asking if A* really is route
           | planning.
        
           | prox wrote:
            | Having tried a week of using an LLM instead of search, the
            | results are usually very good as search engines go (I used
            | you.com, which adds search results).
           | 
           | But the amount of hallucinations was staggering. There is
           | nothing behind the curtain so to speak. I constantly had to
           | guide it like a little kid.
        
             | p1esk wrote:
             | Is you.com using gpt4?
        
           | pixl97 wrote:
           | The following is a logical reasoning problem. Determine which
           | answer, A,B,C, or D is correct
           | 
           | FXGB, HXIB, JXKB, _____, NXOB
           | 
           | A. OXPBB. B. KXMBC. C. LXLBD. D. LXMB
           | 
            | ChatGPT: The pattern in the sequence "FXGB, HXIB, JXKB,
            | _____, NXOB" suggests an alternating increment in the
            | letters surrounding "X".
           | 
           | Analyzing each component:
           | 
            | The first letter sequence: F, H, J, __, N
            | The pattern here is increasing by 2: F (+2) - H, H (+2) - J.
            | Following this, J (+2) - L. So, the first letter of the
            | missing term is L.
            | 
            | The third letter sequence: G, I, K, __, O
            | The pattern is the same as the first sequence: G (+2) - I, I
            | (+2) - K. So, K (+2) - M. The third letter of the missing
            | term is M.
            | 
            | The last letter sequence: B, B, B, __, B
            | The pattern is consistent: the letter B repeats. So, the last
            | letter of the missing term is B.
            | 
            | Combining these, the missing term is LXM(B).
           | 
           | The answer is C. LXL(B)D.
           | 
           | ----
           | 
           | Unfortunately I did not make up the question myself, so is it
           | already in the training set?
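            | 
            | For reference, a small sketch (reading the run-together
            | options above as OXPB, KXMB, LXLB, and LXMB) that derives
            | the missing term mechanically:
            | 
            |     # Letters 1 and 3 step forward by two each term;
            |     # letters 2 and 4 stay fixed as X and B.
            |     options = {"A": "OXPB", "B": "KXMB",
            |                "C": "LXLB", "D": "LXMB"}
            | 
            |     def step(ch, n=2):
            |         # Advance an uppercase letter by n places.
            |         return chr((ord(ch) - ord("A") + n) % 26
            |                    + ord("A"))
            | 
            |     prev = "JXKB"  # the term just before the gap
            |     missing = (step(prev[0]) + prev[1]
            |                + step(prev[2]) + prev[3])
            |     answer = [k for k, v in options.items()
            |               if v == missing]
            |     print(missing, answer)  # LXMB ['D']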
        
             | root_axis wrote:
             | This response is actually quite illustrative of the
             | stochastic nature of LLMs. Under normal circumstances a
             | human would never make this kind of error.
             | 
             | The system remains coherent long enough to produce the
              | right answer, yet it still confidently selects the wrong
             | answer because it only understands the question as a
             | function of statistical relationships between tokens,
             | unlike humans who would understand the question at a
             | conceptual level.
        
         | usaar333 wrote:
         | Note that the goal of ASI isn't to beat you per se, but the
         | best humans.
         | 
         | GPT-4 is generally not competitive with domain specialists
         | (outside areas where machines have already been competitive for
         | some time). That of course doesn't mean it isn't valuable - but
         | it becomes more for human augmentation than replacement.
        
           | RecycledEle wrote:
           | ASI stands for Artificial Superintelligence.
        
             | godelski wrote:
             | > ASI stands for Artificial Superintelligence.
             | 
             | >> Note that the goal of ASI isn't to beat you per se, but
             | the best humans.
             | 
             | I'm pretty sure they understand that.
        
           | RecycledEle wrote:
            | I don't see it as a dichotomy. I see it as a continuum.
           | 
           | There is value to replacing a percentage of human labor with
           | machine labor.
           | 
           | Jordan Peterson said there are no jobs in the USA for people
           | with IQs below 87:
           | https://www.youtube.com/watch?v=RN3w_F9r-aI&t=338s
           | 
            | Mr. Peterson says lower-IQ people do better at more
            | repetitive jobs.
           | 
           | Machines and AI can do more repetitive jobs. Technology took
           | those jobs.
           | 
           | As AI advances, the lowest useful IQ will move up. I wonder
           | where it will be in 10 years.
        
             | whatshisface wrote:
             | Maybe they will invent treatments for IQ below 90 so that
             | our ability can move up along with technology.
        
               | pixl97 wrote:
                | The rate of AI IQ gain is not limited by ethics and
               | clinical trials.
        
           | notaurus wrote:
           | This might be the case now but shouldn't be a particularly
           | useful distinction for very long-- on the grand scale of
           | intelligence, the difference between the smartest human and
           | the average human is tiny.
        
             | usaar333 wrote:
             | I'm not talking about intelligence but expertise.
             | 
             | GPT-4 is a lot better than me at unfamiliar tasks, because
             | it has the advantage of being able to readily transform all
             | data on the internet.
             | 
             | Once I have trained myself, I generally can beat GPT-4.
        
         | greenthrow wrote:
         | These are not "ways" of achieving superhuman intelligence,
         | these are potential forms of superhuman performance.
         | 
         | It's also a very broad definition, because by that definition
         | we have had superhuman performance in computation for decades.
         | And we've been able to write programs for very narrow tasks
         | that outperform humans for that time as well.
         | 
         | Generally when people are comparing artificial intelligence
         | against human intelligence, they want an AGI. Humans aren't
         | good at just one narrow thing, that's why they aren't all
          | replaced by specialized programs already. We are talking about
          | being able to describe a problem and have the artificial
          | intelligence take it from there. That's why the bar is AGI,
         | which is a much higher bar than any of your broad definitions.
        
           | pixl97 wrote:
            | Though I think we should take note that those programs
            | written for very narrow tasks that outperform humans may
            | start widening.
           | 
           | Things like object recognition have moved from "only humans
           | can do this" to machines are keeping up with us. And I think
           | that this is how it will be for at least some time. Instead
           | of one big blast where ASI is better than humans at
           | everything, more and more categories of narrow intelligence
           | will widen and become superhuman.
        
       | wouldbecouldbe wrote:
        | Never heard of Mechanical Turk, hahah. For a Dutch person it
        | sounds pretty racist. Turk is what we call Turkish people.
        
         | margalabargala wrote:
          | It's a reference to an old, famous chess-playing "machine" that
          | had a human inside it.
         | 
         | https://en.wikipedia.org/wiki/Mechanical_Turk
        
           | wouldbecouldbe wrote:
            | Yeah, I read that; it's just interesting that to non-English
            | cultures it sounds very different.
        
             | baxter001 wrote:
             | No, it doesn't, Turk is what we call Turkish people too.
        
             | WrongAssumption wrote:
             | It was invented by a Hungarian to impress an Austrian. The
             | Turk portion of the name is from the original Hungarian. It
             | toured Europe for 80+ years. I have no clue what about this
             | you think has anything to do with English culture.
        
             | renewiltord wrote:
             | That's nothing. In American English, calling someone a
             | turkey would be an insult. Turkey is how we refer to the
              | nation Türkiye Cumhuriyeti. Imagine calling an entire
             | nation an ungainly bird that we ritually consume every
             | November. Almost as bad as giving Michael Jordan his own
             | seat at the UN. But that's what capitalism and the US
             | hegemony have brought us.
        
               | esafak wrote:
               | There's an interesting story behind the name:
               | https://www.etymonline.com/word/Turkey
        
       | dr_dshiv wrote:
       | I'm really looking forward to students majoring in "machine
       | psychology."
        
         | ryzvonusef wrote:
          | https://en.wikipedia.org/wiki/Susan_Calvin
          | 
          | > Graduating with a bachelor's degree from Columbia University
          | in 2003, she began post-graduate work in cybernetics, learning
          | to construct positronic brains such that responses to given
          | stimuli could be accurately predicted. She joined US Robots in
          | 2008 as their first Robopsychologist, having earned her PhD.
          | By 2029, when she left Earth for the first time to visit Hyper
          | Base, her formal title was Head Psychologist.
         | 
         | https://en.wikipedia.org/wiki/Robopsychology
        
       | theptip wrote:
       | If you look at the appendix, you can see example transcripts. The
       | sample they provide looks like a very bad eval.
       | 
       | It's encoding an originally visual problem into a textual matrix
       | form, and then expecting GPT to recognize visual correlations.
       | You simply can't compare these two tasks! Most humans wouldn't
       | recognize the 5x5 matrix for a 4x4 square.
       | 
       | So the comparison with "human level" is completely invalid. And
       | even the valid comparison is only measuring visio-spatial
       | intelligence, not IQ.
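        | 
        | To make that concrete, a hypothetical rendering (not the
        | paper's exact appendix format) of a 4x4 square serialized as a
        | 5x5 text matrix, the way a text-only model receives it:
        | 
        |     # A filled 4x4 square inside a 5x5 grid, printed row by
        |     # row, then flattened into the single string the model
        |     # sees; the 2D structure is only implicit in it.
        |     grid = [[1 if r < 4 and c < 4 else 0 for c in range(5)]
        |             for r in range(5)]
        |     for row in grid:
        |         print(" ".join(str(c) for c in row))
        |     flat = "\n".join(" ".join(map(str, r)) for r in grid)
        |     print(repr(flat))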
        
         | Sharlin wrote:
         | Humans are biased towards grokking two-dimensional information
         | arranged in grids because that's how vision works. (Non-V) GPT
         | is not. Its natural gestalt has a vast number of dimensions. It
          | is, and should be, able to find correlations across data no
         | matter how the input is arranged. (Note that for GPT _text is
         | not linear_ even if it appears so to humans.)
         | 
         | To humans it would be qualitatively more difficult to recognize
         | patterns in four or six or ten-dimensional data than in two
         | dimensions. To GPT, it's just another variation of the same
         | problem, one that's at most _quantitatively_ more difficult.
         | 
         | > And even the valid comparison is only measuring visio-spatial
         | intelligence, not IQ.
         | 
         | You're the first one to mention IQ. The paper definitely does
         | not. In any case "visio-spatial intelligence" is a human
         | concept. GPT doesn't distinguish between "visio-spatial" and
         | any other kind of pattern recognition.
        
           | og_kalu wrote:
            | >It is, and should be, able to find correlations across
           | data no matter how the input is arranged.
           | 
           | That doesn't mean it doesn't favor certain arrangements.
           | 
           | https://arxiv.org/abs/2305.18354
           | 
           | Linear is in fact better for this benchmark
        
       | datadrivenangel wrote:
       | "Our experimental results support the conclusion that neither
       | version of GPT-4 has developed robust abstraction abilities at
       | humanlike levels."
       | 
       | Some combination of LLMs and logical reasoning systems will get
       | us much closer, but that becomes a lot more complicated.
        
       | intended wrote:
        | This new generation of AI adds an interesting twist to the
        | infinite monkeys and typewriters issue.
        | 
        | How do you actually check an infinite amount of junk to verify
        | that one piece of it is the collected works of Shakespeare?
        | 
        | The question I ask now is "what's your error rate for
        | domain-specific work?"
        | 
        | It could be faster and smarter, but it doesn't matter if it's
        | wrong.
        
       ___________________________________________________________________
       (page generated 2023-11-19 23:01 UTC)