[HN Gopher] Comparing humans, GPT-4, and GPT-4V on abstraction a...
___________________________________________________________________
Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning
tasks
Author : mpweiher
Score : 201 points
Date : 2023-11-19 11:36 UTC (11 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| tovej wrote:
| Conclusion is obvious, but the paper is still probably necessary.
|
| Of course LLMs can't reason. They pattern-match answers to
| previously asked questions, and humans read the text as a
| reasonable answer because we assign meaning to it. But there is
| simply no way an LLM could use a "mental model" to "reason" about
| a problem other than by constructing sentences out of probable
| matches it's been trained on.
| sfn42 wrote:
| Love how the ai hypebros always downvote answers like this.
| Everyone's just insisting that LLMs are AGI just because they
| kinda seem that way.
| viraptor wrote:
| Ftr, I mostly agree with the comment but downvoted it for the
| low quality "of course that's the answer, obviously" padded
| with some unhelpful tropes about trained pattern matching.
| (Let us know when you can prove that human communication is
| not just trained pattern matching)
| tovej wrote:
| Ok, I'll bite. I personally don't think human communication
| is simply pattern matching (e.g. we have drives, motives,
| relationships that we want to tell each other, these are
| internal systems, outside the visible band of
| communication). That's a little off topic though, my post
| was not about communication, but reasoning.
|
| Pattern matching in LLMs is specifically at the language
| level, as in "predict the next token in a stream", I should
| have probably specified this bit.
|
| You seem to be referring to a more general kind of pattern
| matching, where we're looking at a complicated structural
| model within which humans reason about the behavior of
| different system components and their interactions.
|
| There is no way to form an internal representation of such
| systems within an LLM, which is why I said "of course" it
| can't reason. An LLM is not designed with this capability
| in mind.
|
| If all you have is language, and predictive text, there is
| no way to use this to simulate and predict the rules in a
| higher-dimensional space than "stream of tokens", or let's
| put it another way: there would need to be an extremely
| high correlation between the pattern in the problem space
| and in language space for the likelihood of the model
| guessing right to be anywhere near human levels.
| stevenhuang wrote:
| > There is no way to form an internal representation of
| such systems within an LLM, which is why I said "of
| course" it can't reason. An LLM is not designed with this
| capability in mind.
|
| You claim this as if it is a foregone conclusion yet it
| is anything but. This is an active area of study, even
| within human cognition
| https://en.m.wikipedia.org/wiki/Predictive_coding
|
| A complex system clearly does not need to be designed to
| evoke some behavior. Just look to the concept of emergent
| phenomena. Evolution through natural selection is an
| obvious counterexample.
|
| Evolution selects for fitness, not reasoning ability.
| Yet here we are.
|
| It is not known whether the same thing is happening now
| with the transformer model.
|
| LLMs are universal function approximators. If reasoning
| ability is at all representable through such a process,
| it is possible for LLMs to find it.
|
| https://en.m.wikipedia.org/wiki/Universal_approximation_theo...
| tovej wrote:
| Neural networks are universal function approximators;
| transformers are universal sequence-to-sequence
| approximators. Crucially, the sequences here are the
| token streams in an LLM. This means an LLM can do
| arbitrary language manipulation.
|
| If the reasoning task requires something that can't be
| done using language manipulation in embeddings-space, it
| won't be able to do it.
|
| Idk, maybe my mathematical intuition is wrong, but it
| seems obvious to me that if all you have is a function on
| a vector (representing a token stream), then the only
| transformations you can create are ones that can be
| expressed in that vector space. In other words, if you
| express a problem in a more complicated space (e.g. a
| visual problem, or an abstract algebra problem), you will
| not be able to solve it in the smaller token space,
| there's not enough information. You COULD first transform
| the vector into a different representation, perform the
| transformation there, and then encode back.
|
| This step would have to be different for every reasoning
| problem, so the underlying transformation to solution
| space would have to be generic and move into one of N
| reasoning subsystems.
|
| Also, if we hypothetically assume this limitation doesn't
| exist, we would still have the issue that a trained LLM
| represents a single such transformation function.
|
| If we want generic reasoning skills, the approximated
| function would have to be different for every reasoning
| problem. Even if we use a subset of the vector's degrees
| of freedom to express each type of reasoning, it would
| make any truly reasoning LLM computationally intractable,
| because the embedding space would blow up.
| int_19h wrote:
| What couldn't you possibly represent as a bunch of
| numbers to the level of precision necessary to process
| it? We might not have sufficient compute power for LMs
| large enough for some tasks yet, but I don't see how it's
| different in principle.
|
| And yes, of course such representation would be different
| for different problems. We also have organs in our body
| that do that for various types of sensory inputs that
| ultimately all feed into the brain to be processed as
| quantized impulses.
|
| And GPT-4 is already supposed to be an "assembly of
| experts", i.e. different models trained for different
| tasks, with an orchestrating model weaving them together.
| This is also not a surprise, given that our brains aren't
| monolithic, either. You _should_ be able to do it with a
| single model that's large enough, but it is likely more
| energy- and time-efficient to get there by putting
| smaller pieces together instead.
| stevenhuang wrote:
| > In other words, if you express a problem in a more
| complicated space (e.g. a visual problem, or an abstract
| algebra problem), you will not be able to solve it in the
| smaller token space, there's not enough information
|
| You're aware multimodal transformers do exactly this?
|
| https://github.com/bytedance/SALMONN
| sfn42 wrote:
| I don't need to prove anything. We know for a fact that
| LLMs are stochastic parrots. That's why they confabulate:
| if they were "intelligent" they wouldn't just pass off
| nonsense as fact.
|
| They have been fed incredible amounts of data which allows
| them to stochastically parrot really complex things, this
| makes people think they're intelligent. Designing a test
| that actually tests intelligence rather than parroting
| ability is tricky, but whenever people do they fail them.
|
| I just logged into ChatGPT and asked it repeatedly to
| generate a novel thought experiment intended to prove that
| LLMs are incapable of reasoning. I kept trying to refine it
| by explaining what was wrong with each suggestion.
|
| It completely failed. It just spat out variations of well
| known dilemmas or paradoxes, even when I specifically asked
| it to create something original it just made nonsense.
| mcswell wrote:
| "whenever people do they fail them": I assume the
| antecedent of "they" is LLMs, not people.
| danielbln wrote:
| You don't have to be an AI "hype bro" to take issue with the
| reductive and trite notion that LLMs are stochastic parrots
| only. There is a continuum between that and AGI.
| PaulDavisThe1st wrote:
| A continuum like the ape climbing to the top of a tree
| being part of a continuum that includes landing on the
| moon?
| danielbln wrote:
| Pretty much, I'd say we're way past the fleecing and poop
| flinging at this point and well into persistence hunting,
| maybe even control of fire?
| int_19h wrote:
| I'd say it's more like the ape making and using its first
| crude tool by itself.
| sfn42 wrote:
| Well I take issue with the reductive and trite notion that
| just because an LLM can generate plausible text it's
| suddenly maybe conscious and intelligent, maybe about to
| end humanity etc.
|
| It's exactly like the crypto hype wave. Everyone dreaming
| up semi-plausible futures, based on a whole chain of
| unfounded assumptions.
| danielbln wrote:
| It's plausible text, and it's useful text. LLMs aren't
| just speculative vehicles in search for a problem, as
| most of crypto is, they are useful, today, right now.
| They don't require any assumptions to be so, nor do they
| have to be skynet world ending AGI to be that. You can
| point out problematic extrapolation of doomers without
| being reductive towards the very real and very useful
| capabilities of LLMs.
| pixl97 wrote:
| The only thing the LLM is missing is a self-actuation
| loop.
|
| We put a camera on a multimodal LLM, it interprets the
| visual world and sees before it a bunch of blocks. It
| looks at the task list it has that says "pick up red
| blocks, put them in blue bin". The visual components
| identifies the red blocks and the textual component
| issues commands to its drive unit which calculates the
| best path and how to use it's manipulators to pick up the
| blocks.
|
| This is a very basic chain of existence. We have world
| identification, we have actuation on motivation, we have
| interaction with the environment. We can do this now.
| These goals have already been achieved. Companies are
| already testing more complex models with much more
| general instructions such as "Pick up everything you
| think is trash" or "Organize this room" to see the
| emergent behaviors that come out of the models.
|
| You are seemingly a few years behind what has already
| been done, and why people are starting to get concerned.
| dodslaser wrote:
| I think it's pretty funny that the moment we taught a
| computer to talk like a human people immediately assumed it
| was sentient, let alone intelligent.
| ben_w wrote:
| And this assumption has only gotten more common in the ~56
| years since (so the anecdote goes) ELIZA fooled
| Weizenbaum's secretary.
|
| "Consciousness" has ~22 different definitions; some are
| easily met, some are not even met by the humans giving
| those definitions (for example, I've seen people here on HN
| who write as if consciousness can somehow magically solve
| the halting problem).
|
| There's only one test for it in AI, that I've heard of, and
| that I think is worth exploring:
| https://www.youtube.com/watch?v=LWf3szP6L80
| jofla_net wrote:
| I suspect this has to do with how shallowly people even
| regard their fellow man. To the point that if the outputs
| of an entity parrot humans well enough then, well, there
| couldn't be any differences whatsoever, and it must bear
| all the titles which we gift to humans.
| int_19h wrote:
| It has to do with the fact that all our definitions of
| "sentient" and "intelligent" are extremely vague, often
| vacuous, and generally boil down to "I know it when I see
| it" in practice.
|
| So there should be absolutely no surprise that, when
| something comes along that behaves in certain ways very
| much like a human (and unlike anything else), it is assumed
| to share the basic qualities of such. It is how we judge
| other people, after all - it's not like you do a brain scan
| to determine whether someone you are talking to is
| "sentient", whatever that means to you. You judge that by
| their behavior, modelling what a "sentient" would do and
| comparing the result to what they actually do.
| jstummbillig wrote:
| I think both of your comments are just boring and
| condescending.
| koe123 wrote:
| On the other hand, can we conclusively say that humans aren't
| really advanced biological stochastic parrots?
| naasking wrote:
| > Of course LLM's can't reason
|
| That they are not effective at some forms of reasoning does not
| entail they can't reason.
| mcswell wrote:
| The conclusion may be obvious to you and me (although it's hard
| to know for certain, since these available LLMs are black
| boxes). But it's definitely not obvious to everyone. There are
| plenty of people saying this is the dawn of AGI, or that we're
| a few short steps from AGI. Whereas people like Gary Marcus
| (who knows tons more than I do) says LLMs are going off in the
| wrong direction.
| visarga wrote:
| Yes, LLMs can't reason 100% correctly, but neither can humans.
| We can often reason correctly, but it's not always the case.
|
| Even reasoning, fundamental as it is, comes from feedback.
| Feedback from our actions teaches us how to reason. Learning
| from feedback is more general than reasoning - AI agents can
| definitely learn this way too, if they have enough freedom to
| explore. But you can't do it with supervised training sets
| alone.
|
| You need to put language models into agents in environments,
| give them goals and rewards. Then they can make their own
| training data and mistakes, build up their own experience.
| You can't teach an AI based on how people make mistakes; it
| needs to fix its own mistakes, but that means deploying it in
| the wild, where there are consequences for errors.
|
| If you remember, DeepMind first tried to train a Go model on
| human game play, but it was limited. Then they started from
| scratch, and learning from feedback alone they surpassed
| human level, even though the feedback was one single bit of
| information at the end of a whole self-play game. And it had
| no pre-training prior to learning Go, unlike human players.
| somewhereoutth wrote:
| That presupposes that language encodes the world we live in
| completely, whereas in fact language is meaningless without
| the shared context of reality. 'up' and 'down' are only
| meaningful to intelligence that can experience space.
|
| Essentially LLMs are just oracles for the shadows on the
| wall of Plato's cave.
| visarga wrote:
| Do a reverse Chinese room experiment - remove from a
| human all the abilities multi-modal LLMs gain after
| training on human media. What's left? Homo ferus.
|
| Most of our intelligence is encoded in the environment
| and language, it's a collective process, not an
| individual one. We're collectively, not individually,
| very smart.
| stefl14 wrote:
| This isn't true. I wrote a blog post about it a while
| back but never finished. It's complete enough to
| demonstrate the point though, so I'll plug it anyway:
| https://stefanlavelle.substack.com/p/no-language-isnt-enough
|
| TLDR: Internal LLM representations correspond to an
| understanding of the visual world. We've all seen the
| Othello example, which is too constrained a world to mean
| much, but even more interesting is that LLMs can caption
| tokenized images with no pretraining on visual tasks
| whatsoever. Specifically, pass an image to an encoder-
| decoder visual model trained in a completely unsupervised
| manner on images -> take the encoded representation ->
| pass the encoded representation to an LLM as tokens ->
| get accurate captions. The tests were done on gpt-j,
| which is not multimodal and only has about 7bn params.
| The only caveat is that a linear mapping model needs to
| be trained to map the vector space from the encoder-
| decoder model to the embedding space of the language
| model, but this isn't doing any conceptual labour, it's
| only needed to align the completely arbitrary coordinate
| axes of the vision and language models, which were
| trained separately (akin to an American and a European
| agreeing to use metric or imperial -- neither's
| conception of the world changes).
|
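| Concretely, a minimal sketch of that setup (the dimensions,
| the stand-in modules, and the way the LLM gets fed below are
| my own placeholder assumptions, not the exact recipe from
| the post):
|
|     import torch
|     import torch.nn as nn
|
|     IMG_DIM, LLM_DIM, VOCAB = 768, 4096, 50400
|
|     # Stand-ins for a frozen image encoder and frozen LLM
|     # token embeddings (placeholders for pretrained models).
|     image_encoder = nn.Linear(3 * 224 * 224, IMG_DIM)
|     token_embedding = nn.Embedding(VOCAB, LLM_DIM)
|
|     # The only trained piece: a linear map from image-feature
|     # space into the LLM's token-embedding space.
|     projector = nn.Linear(IMG_DIM, LLM_DIM)
|
|     def build_llm_input(image, prompt_ids):
|         with torch.no_grad():
|             feats = image_encoder(image.flatten())
|         soft_token = projector(feats).unsqueeze(0)
|         prompt_emb = token_embedding(prompt_ids)
|         # Prepend the projected image feature as a "soft"
|         # token; the frozen LLM then continues the sequence
|         # as a caption.
|         return torch.cat([soft_token, prompt_emb], dim=0)
|
|     x = build_llm_input(torch.rand(3, 224, 224),
|                         torch.tensor([1, 2, 3]))
|     print(x.shape)  # torch.Size([4, 4096])
|
| Only the projector gets trained; the vision model and the
| LLM stay frozen, which is the sense in which the linear map
| isn't doing the conceptual work itself.
|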
| It's not intuitive, but it's hard to argue with these
| results. Even small LLMs can caption images. Sure, they
| don't get the low-level details like the texture of
| grass, but they get the gist.
|
| I keep reading your sort of analysis, but honestly, those
| priors need updating. I had to update when learning this.
| If 7bn params can do it, 175bn params with multimodality
| can certainly do it.
|
| It's true that humans need symbol grounding, but we don't
| see hundreds of billions of sequences. There are
| theoretical reasons (cf category theory) why this could
| work, albeit probably limited to gist rather than detail.
| int_19h wrote:
| The LLMs do indeed deal with Plato's shadows, but so do
| we - what we "see", after all, is not the actual up or
| down, but a series of neural activations from our retinas
| (which aren't even 3D, so concepts like "behind" are only
| captured by proxy). Such activations can all be readily
| encoded into tokens, which is exactly what models
| specifically trained to describe images do.
| pixl97 wrote:
| Of course you're saying that LLMs can only train on
| textual data, whereas we are developing multimodal AI at
| this time that takes things like visual, audible, and
| whatever other kind of sensor data and turns it into
| actionable information.
| int_19h wrote:
| The real question isn't whether the LLM can reason.
|
| The question is whether an assembly of components, one of
| which is an LLM (others would include memory and whatever
| else is needed to make it a self-contained loop with a notion
| of self-identity) can reason.
| baxtr wrote:
| Could a human and an LLM submit a summary of the paper so we can
| compare?
| j2kun wrote:
| That's what an abstract is for:
|
| > Our experimental results support the conclusion that neither
| version of GPT-4 has developed robust abstraction abilities at
| humanlike levels.
| z7 wrote:
| >The paper investigates the abstract reasoning abilities of
| text-only and multimodal versions of GPT-4 using the ConceptARC
| benchmark, concluding that neither version of GPT-4 has
| developed robust abstraction abilities at humanlike levels.
| cs702 wrote:
| Interesting. Immediate thoughts and questions:
|
| * How would human beings perform on the text-only version of the
| tasks given to GPT-4?
|
| * How would human beings perform if each grid is shown on its
| own, making it impossible to perform side-by-side visual
| comparisons?
|
| * How would human beings perform if each grid is shown on its own
| _only once_, making it impossible to perform _any_ back-and-
| forth comparisons?
|
| * How could we give LLMs the ability to "pay attention" to
| different parts of images, as needed, so they can make back-and-
| forth comparisons between parts of different images to solve
| these kinds of visual reasoning tasks?
| mistermann wrote:
| * How would human beings perform if they didn't know they were
| being tested (ie: if in the same mode they are in when writing
| comments on the internet)?
|
| * How would human beings perform if the questions are based on
| culture war topics, which tend to invoke System 1
| intuitive/emotional thinking?
| YetAnotherNick wrote:
| I also want to know how much gain could be made by optimizing
| the prompt for GPT and by including things like CoT. The
| current version of the prompt is pretty bad for both humans
| and AI.
| QuadmasterXLII wrote:
| > How could we give LLMs the ability to "pay attention" to
| different parts of images, as needed, so they can make back-
| and-forth comparisons between parts of different images to
| solve these kinds of visual reasoning tasks?
|
| I've got good news
| oefnak wrote:
| What is it?
| phh wrote:
| It's even all we need
| air7 wrote:
| Thank god...
| firtoz wrote:
| The best thing about research like this is that it allows new
| models to be built, or improvements on the existing ones that
| can lead them to pass these evaluations.
| lewhoo wrote:
| Yeah, but the question remains: are they passing because of a
| newly obtained/emergent quality, or because they were taught
| specifically how to pass? And is there even a difference at
| this point?
| firtoz wrote:
| Well, while training AI we make sure that we include enough
| training for them to pass some evaluation, but then test them
| on things that were not in the training data to make sure
| that they are not "overfitted".
| viraptor wrote:
| I don't think parent meant training specifically on that
| test data, but rather in that kind of task. Think more "LLM
| trained only on images of art" doing badly on this test,
| but then "LLM trained on images of art and IQ puzzles"
| doing better on this test.
|
| It's kind of like asking what even is IQ if you can learn
| how to solve Mensa puzzles and improve your score. Does it
| mean you're more intelligent?
| firtoz wrote:
| Oh, I see.
|
| One guess for GPT4 is that it's an amalgamation of
| models. So you would have various models trained more
| specifically like the ones you mentioned, and you ask
| them all to start answering the query, then you choose
| which one produces better results, and then you present
| that to the user.
|
| Alternatively you can have a decider model that knows
| what kind of queries goes to what specialised one, then
| have that as a "hidden layer", whether it be in the
| application level or "neuron layers".
| broast wrote:
| If you've checked out the docs for the assistant api, you
| can intuit that there is a higher level system which
| decides which subsystems to use to respond. The assistant
| determines whether to use a tool (vision, code
| interpreter, search, retrieval), as well as which code or
| text language model to use to generate the response.
| pixl97 wrote:
| Your last statement reminds me of the folktale of John
| Henry. That of generalized capability versus that of
| specialization.
|
| Under a narrow range of tests specialization is pretty
| much guaranteed to win. Specialization in all cases I
| know of comes at a cost to general capability. It's like
| one of those quips, "Pick any two: fast, cheap, good"; the
| more you pull at one category, the more the costs rise in
| the other categories.
| EdiX wrote:
| That's the idea, yes. However none but OpenAI knows exactly
| what ChatGPT was trained on. In fact, the dataset that it
| was trained on is so vast that they probably don't know
| either if it contains any given question. IIRC last week I
| saw a study where GPT4 can solve some leetcode problems
| simply by giving it the number, no description. A clear
| example of overfitting.
| viraptor wrote:
| There was more information provided. And it's possibly
| not even overfitting. See
| https://news.ycombinator.com/item?id=38205153
| falcor84 wrote:
| These tests should be conducted on new questions. And if
| we as humans no longer have the ability to create
| original questions, then maybe we should just retire.
| krona wrote:
| This paper evaluates performance compared to a 'human' which
| presumably is an average adult human without cognitive
| impairment. I had to dig into the references:
|
| _In the first batch of participants collected via Amazon
| Mechanical Turk, each received 11 problems (this batch also only
| had two "minimal Problems," as opposed to three such problems for
| everyone else). However, preliminary data examination showed that
| some participants did not fully follow the study instructions and
| had to be excluded (see Section 5.2). In response, we made the
| screening criteria more strict (requiring a Master Worker
| qualification, 99% of HITs approved with at least 2000 HIT
| history, as opposed to 95% approval requirement in the first
| batch). Participants in all but the first batch were paid $10
| upon completing the experiment. Participants in the first batch
| were paid $5. In all batches, the median pay-per-hour exceeded
| the U.S. minimal wage._
|
| (Arseny Moskvichev et al)
|
| So in conclusion, this isn't a random sample of (adult) humans,
| and the paper doesn't give standard deviations.
|
| It would've been more interesting if they had sampled an age
| range of humans on which we would place GPT-4, rather than
| just 'it's not as good', which is all this paper can say,
| really.
| cs702 wrote:
| Also, it's possible there are LLMs pretending to be human
| beings on Mechanical Turk!
| jgalt212 wrote:
| What per cent of Mechanical Turk hits are being performed by
| an LLM these days?
| dmd wrote:
| Several researchers where I work used to use mturk
| regularly and now have been forced to stop using it
| entirely because of this. They started getting "As a large
| language model" answers ... in even things where the only
| possible answer was 'true' or 'false' or '1-7'.
| ttul wrote:
| "I hope this email finds you well..."
| EGreg wrote:
| I love how your CompanyName does XYZ in field ABC. Buy
| our product. Let's talk!
| kordlessagain wrote:
| Click here for virtual coffee with me!
| jgalt212 wrote:
| According to Eric Schmidt, OpenAI is worth $90B. LLMs
| being unable to master mturk and $90B are incongruent.
|
| https://twitter.com/ericschmidt/status/1725625144519909648
|
| Who's right? Who's wrong?
| dmd wrote:
| Good question. Let's go ask our expert, an internet
| commenter who calls himself "John Galt", to step in with
| an opinion based solely on market value!
| svat wrote:
| See the "Artificial Artificial Artificial Intelligence"
| paper: https://arxiv.org/abs/2306.07899
| colincooke wrote:
| My wife studies people for a living (experimental cognitive
| psychologist); the quality of MTurk is laughable, and if that's
| our standard for higher-level cognition then the bar is low.
| You'll see the most basic "attention check" questions ("answer
| option C if you read the question") be failed routinely;
| honestly, at this point I think GPT4 would do a better job than
| most MTurkers at these tasks...
|
| She has found that Prolific is substantially better (you have
| to pay more for it as well); however, that may only be because
| it's a higher-cost/newer platform.
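|
| (The kind of screening she does is usually nothing fancy; a
| toy sketch, with a made-up CSV export and column name:)
|
|     import csv
|
|     # Hypothetical raw export; "attn_check" is the item that
|     # told participants to answer "C" if they read it.
|     with open("responses.csv", newline="") as f:
|         rows = list(csv.DictReader(f))
|
|     passed = [r for r in rows if r.get("attn_check") == "C"]
|     print(f"kept {len(passed)} of {len(rows)} participants")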
| PaulHoule wrote:
| My take is the tasks on Turk are awful and will drive away
| anybody decent.
|
| I had a time when I was running enough HITs to get a customer
| rep and felt I was getting OK results. I wanted to get better
| at running HITs so I thought I would "go native" as a Turk
| and try to make $50 or so but I could not find tasks to do
| that were at all reasonable. Instead they'd want me to "OCR"
| a receipt that was crumpled up and torn and unreadable in
| spots and said they'd punish me for any mistakes.
| amelius wrote:
| > honestly at this point I think GPT4 would to a better job
| than most MTurkers at these tasks...
|
| From the article:
|
| > Our experimental results support the conclusion that
| neither version of GPT-4 has developed robust abstraction
| abilities at humanlike levels.
|
| That only makes the conclusion worse for GPT-4 ...
| cosmojg wrote:
| > In the first batch of participants collected via Amazon
| Mechanical Turk, each received 11 problems (this batch also
| only had two "minimal Problems," as opposed to three such
| problems for everyone else). However, preliminary data
| examination showed that some participants did not fully
| follow the study instructions and had to be excluded (see
| Section 5.2).
|
| If they stuck to the average Mechanical Turk worker instead
| of filtering for "Master Workers," the parent's conclusions
| likely would've aligned with those of the study.
| Unfortunately, it seems the authors threw out the only data
| that didn't support their hypothesis as GPT-4 did, in fact,
| outperform the median Mechanical Turk worker, particularly
| in terms of instruction following.
| nrfulton wrote:
| _> Unfortunately, it seems the authors threw out the only
| data that didn't support their hypothesis as GPT-4 did,
| in fact, outperform the median Mechanical Turk worker,
| particularly in terms of instruction following._
|
| MTurk, to a first approximation, is a marketplace that pays
| people pennies to fill out web forms. The obvious thing
| happens. The median Mechanical Turk worker probably
| either isn't a human, isn't _just_ a (single) human, and
| /or is a (single) human but is barely paying attention +
| possibly using macros. Or even just button mashing.
|
| That was true even before GPT-2. Tricks like attention
| checks and task-specific subtle captcha checks have been
| around for almost as long as the platform itself. Vaguely
| psychometric tasks such as ARC are particularly difficult
| -- designing hardened MTurk protocols in that regime is a
| fucking nightmare.
|
| The type of study that the authors ran is useful if your
| goal is to determine whether you should use outputs from
| a model or deal with MTurk. But results from study
| designs like the one in the paper rarely generalize
| beyond the _exact_ type of HIT you're studying and the
| _exact_ workers you finally identify. And even then you
| need constant vigilance.
|
| I genuinely have no idea why academics use MTurk for
| these types of small experiments. For a study of this
| size, getting human participants that fit some criteria
| to show up at a physical lab space or login to a zoom
| call is easier and more robust than getting a
| sufficiently non-noisy sample from MTurk. The first
| derivative on your dataset size has to be like an order
| of magnitude higher than the overall size of the task
| they're doing for the time investment of hardening an
| MTurk HIT to even begin to make sense.
| warkdarrior wrote:
| This is just coming up with excuses for the MTurk
| workers. "they were barely paying attention", "they were
| button mashing", "they weren't a single human", etc.
|
| It turns out that GPT-4 does not have those problems. The
| comparison in the paper is not really fair, since it does
| not compare average humans vs GPT-4, it compares "humans
| that did well at our task" vs GPT-4.
| nrfulton wrote:
| _> This is just coming up with excuses for the MTurk
| workers_
|
| No. The authors are not trying to study MTurk market
| dynamics. They are trying to compare humans and LLMs.
|
| Both questions are interesting and useful. This study is
| only asking about the second question. That's okay.
| Isolating specific questions and studying them without a
| bunch of confounds is one of the basic principles of
| experiment design. The experiment isn't intended to
| answer every question all at once. It's intended to
| answer one very specific question accurately.
|
| LLMs can both be worse at Mensa tasks and also better
| than humans at a variety of reasoning tasks that have
| economic value. Or, LLMs can be worse at those reasoning
| tasks but still reasonably good enough and therefore
| better on a cost-adjusted basis. There's no contradiction
| there, and I don't think the authors have this confusion.
|
| _> The comparison in the paper is not really fair_
|
| The study is not trying to fairly compare these two
| methods of getting work done in general. It's trying to
| study whether LLMs have "abstraction abilities at
| humanlike levels", using Mensa puzzles as a proxy.
|
| You can take issues with the goal of the study (like I
| do). But given that goal, the authors' protocols are
| completely reasonable as a minimal quality control.
|
| Or, to put this another way: why would _NOT_ filtering
| out clickbots and humans speedrunning surveys for
| $0.25/piece result in a more insightful study _given the
| author's stated research question_?
|
| _> It turns out that GPT-4 does not have those
| problems._
|
| I think the authors would agree but also point out that
| these problems aren't the ones they are studying in this
| particular paper. They would probably suggest that this
| is interesting future work for themselves, or for labor
| economists, and that their results in this paper could be
| incorporated into that larger study (which would
| hopefully generalize beyond MTurk in particular, since
| MTurk inter alia are such uniquely chaotic subsets of the
| labor market).
|
| For me, the problems with the study are:
|
| 1. The question isn't particularly interesting because no
| one cares about Mensa tests. These problem sets make an
| implicit assumption that psychometric tools which have
| some amount of predictive power for humans will have
| similar predictive power for LLMs. I think that's a naive
| assumption, and that even if correlations exist the
| underlying causes are so divergent that the results are
| difficult to operationalize. So I'm not really sure what
| to do with studies like this until I find an ethical
| business model that allows me to make money by automating
| Mensa-style test-taking en masse. Which I kind of hope
| will never exist, to be honest.
|
| 2. MTurk is a hit mess (typo, but sic). If you want to do
| this type of study just recruit human participants in the
| old fashioned ways.
|
| But given the goal of the authors, I don't think applying
| MTurk filters is "unfair". In fact, if anything, they're
| probably not doing enough.
| petermcneeley wrote:
| This critique in no way invalidates the conclusions of the
| paper.
| krona wrote:
| As I point out, the critique is simply that the paper isn't
| very good. In fact I'd go as far as to say it's egregiously
| bad, even if the conclusion is 'correct' insofar as
| something was measured and they reported it.
| arp242 wrote:
| The conclusion of "GPT-4 can't even compete against low-
| quality human effort like that found on MTurk" seems
| alright. The article would be better if it didn't claim to
| compare "humans", because it doesn't - or at least, not in
| the common sense definition of "comparing humans vs GPT-4".
| But to be honest I think that's a bit pedantic.
|
| It would be different if it had found that GPT-4
| outperformed MTurk, but if GPT-4 can't compete against
| MTurk on these tasks, then I don't think it will fare
| better against you or me with our full attention.
| avs733 wrote:
| If it's making a claim about humans vs LLMs, but shittified
| capitalism means that the humans are just using LLMs to make
| money churning on Mechanical Turk, then you can't make the
| underlying claim.
|
| That ignores the random-sampling problems and the simplistic
| statistical analysis... so yeah, it unfortunately does.
|
| It doesn't prove they are wrong, but it's not credible to say
| they are right.
| Gare wrote:
| The important thing is that humans (or "humans") were about
| 90% correct, whereas GPT-4 was only 33% correct. If those
| Turkers were using AI, that would mean there is AI out there
| much more capable than GPT-4.
| avs733 wrote:
| It doesn't...
|
| The problem is that unless the data are credible and
| reliable, any interpretation of that data is meaningless.
| For or against one's hypotheses doesn't matter if the
| data is junk. If the data is junk you can't move past
| that, it's the end of the conversation because the data
| has no meaning.
| LogicalRisk wrote:
| From reading the paper and the original paper that the data
| for the MTurk/Prolific samples are drawn from, this is a
| convenience sample of 415 humans on two platforms. Each
| worker received a random sample of the ConceptARC problems,
| and the average score correct is assigned the "Human"
| benchmark.
|
| Perhaps by "random sample problems" you mean that the study
| is not representative of all of humanity? If so we can
| still take the paper as evaluating these 415 humans who
| speak English against the two models. If, as you say, the
| workers are actually just using LLMs, then this implies
| there is some LLM that your average MTurk worker has access
| to that outperforms GPT-4 and GPT-4V. That seems
| *extremely* unlikely, to say the least.
|
| There is no need for any complex statistical analysis here
| since the question is simply comparing the scores on a
| test. It's a simple difference in means. Arguably, the main
| place that could benefit from additional statistical
| procedures would be weighting the sample to be
| representative of a target population, but that in no way
| affects the results of the study at hand.
| a1j9o94 wrote:
| Another thing I take issue with is that this doesn't seem to
| be using known ways to improve the performance of LLMs, such
| as chain-of-thought and tree-of-thought prompting.
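|
| (For illustration, the cheapest version of that is just a
| wrapper around whatever task prompt is used; the wording
| below is my own, not from the paper:)
|
|     COT_SUFFIX = (
|         "\n\nFirst describe, step by step, the rule that maps"
|         " each example input to its output. Then apply that"
|         " rule to the test input and give only the final"
|         " output grid."
|     )
|
|     def with_cot(task_prompt: str) -> str:
|         # Append a minimal chain-of-thought instruction.
|         return task_prompt + COT_SUFFIX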
| arjvik wrote:
| There's also the "take a deep breath" and other "unreal
| engine"-esque prompting tricks
| 1024core wrote:
| Ideally they would use the same prompt for LLMs as they use
| for humans, and judge them in exactly the same manner. Only
| then is it a fair comparison.
| arjvik wrote:
| Why can't the LLM be a (model, prompt prefix) pair?
| hackerlight wrote:
| I also want to see GPT-3 vs GPT-4 comparison on these tasks.
| karmakaze wrote:
| This is interesting in a 'human interest news' report way, but
| it does no more to judge current systems than average people
| did when they thought older, less capable chatbots were human.
| svnt wrote:
| What is your concern exactly?
|
| This was a first-pass study in a field, addressing some of the
| criticisms leveled against an earlier study where the spatial
| reasoning problems were viewed as too hard. They seemingly
| made the spatial reasoning questions as easy as they could.
|
| The qualifications they put on MTurk are pretty standard if you
| want humans who care about what they are doing responding to
| your study. It costs more to do this.
|
| It is a limitation of science that is both budgetary and
| procedural.
|
| By calling into question their results you seem to be
| suggesting that an average human would only 33% of the time be
| able to tell e.g. how many points are inside a box, or whether
| more points are inside or outside of a box. This is extremely
| basic spatial reasoning we are talking about.
|
| The problem they were addressing with the settings is just
| noise in the results by cheap bots and clicky humans trying to
| earn $0.50. It is endemic on MTurk.
| kbenson wrote:
| > By calling into question their results
|
| Did they? I took it more as a clarification of a point so
| people could more easily know what it meant, and wishing they
| did more so we could know more.
|
| Their last sentence in fact notes that what the paper says is
| correct, and they just wish we could make a more nuanced
| statement if it were done with more specific constraints.
| skepticATX wrote:
| It has been interesting to see evidence accumulating that shows,
| despite initial excitement bred by papers such as "Sparks", there
| is something missing from current language models.
|
| Individually none of these results will ever get the attention of
| a "Sparks" type paper, but collectively a strong case has been
| built.
| naasking wrote:
| Sparks of AGI is not AGI. It's also possible that we're not
| testing LLMs fairly, or that merely slight tweaks to the
| architecture or methods would address the issues. I think this
| comment elaborates nicely:
|
| https://news.ycombinator.com/item?id=38332420
|
| I do think there might be something missing, but I also suspect
| that it's not as far off as most think.
| incongruity wrote:
| So, in other words, perhaps what we have is a necessary
| component but not wholly sufficient on its own?
| amelius wrote:
| Perhaps not even necessary.
| pixl97 wrote:
| That is my take on it.
|
| I think embodiment and the encoding of the natural laws
| (gravity, force, etc) that go into that will be another
| huge step at grounding AI. People tend to gravitate to
| thinking about humanoid robots when that is mentioned (and
| thereby terminators), but honestly I would think things
| closer to sensor networks involving thousands or millions
| of bodies like a hivemind would be more likely (why stick
| at the human level of a single body if you didn't have to).
| Interaction with the world is a means of determining
| truth... The ability to perform science.
|
| And as hard as embodiment is, it will be the easy part in
| my opinion. Continuous learning without losing the plot is
| going to be quite the challenge. If an LLM has something
| wrong, how does it update and change that bit of
| information without huge amounts of power use? How do you
| make the system learn 'important' things without filling up
| with junk/spam it is subject to? How do you keep the system
| aligned with a goal that is not destructive to itself or
| others?
| naasking wrote:
| But embodiment being a bottleneck could indicate that
| it's a data/training issue, rather than an architectural
| issue. Multimodal training data improves GPT-4 already,
| but that's still very little data compared to growing up
| to a full human adult. There are still many things to
| try.
| pohl wrote:
| That has always been my impression, despite the myriad ways
| that LLMs impress.
|
| So much potential is lost just in the request/response
| limitation. While I'm waiting for a response from GPT-4,
| I'm continuing to think. Imagine if the reverse were true.
| AGI needs to be able to mull things over for spans of time.
| pixl97 wrote:
| At least for any company trying to sell a product, this is
| going to be an issue with operations costs.
|
| Also this gets into the halting problem. How many
| resources do you expend on finding an answer? In a human
| issues will typically come up like we have to go pee, or
| eat, or something outside our body interrupts us. For an
| AI, how much time should it spend? Do we want to wake up
| one day finding our data centers running at full tilt?
|
| This said, there have been some attempts at working on
| agent based systems that reach out for answers from
| multiple places and pool the data then run things like
| chain of thought on that data pool.
| kenjackson wrote:
| Can someone provide the prompts in text rather than the images
| from the paper? That would make it easier to try and replicate
| results.
| devbrac wrote:
| Raw json files for each task from the ConceptARC paper are
| here: https://github.com/victorvikram/ConceptARC. It would be
| trivial to convert these into prompts using the format in the
| appendix of the paper.
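|
| A minimal sketch of that conversion (this assumes the
| standard ARC task layout of "train"/"test" pairs holding
| "input"/"output" integer grids; the exact prompt wording in
| the paper's appendix differs):
|
|     import json
|
|     def grid_to_text(grid):
|         # One row of space-separated digits per line.
|         return "\n".join(" ".join(str(c) for c in row)
|                          for row in grid)
|
|     def task_to_prompt(path):
|         task = json.load(open(path))
|         parts = []
|         for i, pair in enumerate(task["train"], 1):
|             parts.append(f"Example {i} input:")
|             parts.append(grid_to_text(pair["input"]))
|             parts.append(f"Example {i} output:")
|             parts.append(grid_to_text(pair["output"]))
|         parts.append("Test input:")
|         parts.append(grid_to_text(task["test"][0]["input"]))
|         parts.append("Test output:")
|         return "\n".join(parts)
|
|     # Path is a placeholder for one of the repo's task files.
|     print(task_to_prompt("some_concept_task.json"))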
| kaoD wrote:
| Maybe I'm missing what "abstraction" means here, but it seems
| like the tasks were centered around grids and other spatial
| problems, which are a very limited subset of
| abstraction/reasoning.
|
| In my experience GPT4/V is pretty bad at those specifically, not
| necessarily around abstraction in general. Positions, rotations,
| etc. are concepts that GPT4 finds very hard to apply, which is
| kinda unsurprising since it has no body, no world, no space; it
| "lives" in the realm of text. DALLE3 suffers a similar problem
| where it has trouble with concepts like "upside down" and
| consistently fails to apply them to generated images.
| pixl97 wrote:
| What would directions be for an intelligent creature that lives
| in zero gravity? I just like thinking about this for the same
| reasons humans like writing speculative science fiction. Trying
| to guess what alien perspectives look like, might also give us
| insights when we're the ones making the alien.
| Retric wrote:
| Basically the same; gravity doesn't define left/right or
| North, South, East, and West for us, just up and down.
| trealira wrote:
| However, North, South, East, and West are relative to the
| poles of the Earth. Something living in zero gravity would
| have to use some object as an anchor to determine the
| direction.
| Retric wrote:
| You're also oriented based on objects. We don't have an
| abstract compass pointing north 24/7 the way we can use
| our bodies to determine left and right or gravity to
| point down.
| trealira wrote:
| Right, that's why we use compasses, which use the poles
| of the Earth to determine the direction.
|
| Something living in zero gravity doesn't have a planet,
| so they'd have to find something else to base the
| directions on.
|
| That's what I was trying to say before.
| withinboredom wrote:
| No, but they would have front and back, and people from
| the bridge would share which way was "up" and "down" and
| "left" and "right" based on the controls.
| anonymouskimmer wrote:
| The solar system has a north pole and a south pole based
| on the rotation of the Sun. Basically the only places in
| which there isn't something to orient against are in the
| depths of inter-galactic-cluster voids with nothing
| around. And if a being is stuck in one of those voids,
| orientation is way down the list of problems they have.
| trealira wrote:
| That's a good point. The sun of a solar system could
| possibly be what an alien society living in zero gravity
| bases their directions on.
| int_19h wrote:
| FWIW there is some interesting variability among human
| cultures on that, as well. There are a few that actually
| use cardinal directions predominantly or exclusively
| instead of body-relative ones like "left" and "right".
| bloaf wrote:
| It's also worth remembering that blind humans who can recognize
| squares by feel do not have the ability to recognize squares by
| sight upon gaining vision.
|
| I suspect the model is bad at these kinds of "reasoning" tasks
| in the same way that a newly-sighted person is bad at
| recognizing squares by sight.
| eulgro wrote:
| When did blind humans gain vision out of curiosity?
| withinboredom wrote:
| The first one I heard about was 10-15 years ago by
| projecting an image on the tongue. Ahh, here it is:
| https://www.scientificamerican.com/article/device-lets-blind...
| bloaf wrote:
| https://www.projectprakash.org/_files/ugd/2af8ef_5a0c6250cc3...
|
| They studied people with treatable congenital blindness
| (dense congenital bilateral cataracts)
| joe_the_user wrote:
| _In my experience GPT4/V is pretty bad at those specifically,
| not necessarily around abstraction in general._
|
| The problem with a statement like that is that it leaves the
| door open to accepting any kind of canned generality as
| "abstraction in general". Abstract reasoning is indeed a
| fuzzy/slippery concept, and spatial reasoning may not capture
| it well, but I'm pretty sure it captures it better than a
| general impression of ChatGPT.
|
| _...since it has no body, no world, no space; it "lives" in
| the realm of text._
|
| There's a bizarre anthropomorphism in this thread, both in the
| reflexive comparison of this software system to a blind human
| and in the implicit call to be considerate of this thing's
| supposed disability.
| og_kalu wrote:
| >The problem with a statement like is that it leaves the door
| open to accepting any kind of canned generality as
| "abstraction in general".
|
| Not really
|
| https://arxiv.org/abs/2212.09196
| joe_the_user wrote:
| Nah.
|
| By the very fact that there's a paper here, whatever its
| merit, the authors of the paper have codified their concept
| of generality, and this doesn't validate the point I was
| replying to, which was essentially "my impression/feeling is
| that it is better".
| og_kalu wrote:
| Point is that it's good at abstract reasoning that isn't
| spatially grounded like in that paper. So it's not really
| leaving any door open. It's not a cop out. That's just
| how it is.
| int_19h wrote:
| Why is it bizarre to consider the limitations inherent in the
| input data on which the model is trained? Fundamentally, it
| still "sees" the world through text, and the extent to which
| it can "understand" spatial relationships is defined by that.
| It seems utterly unsurprising that this leads to very poor
| grasp of the actual concepts behind what things like "above"
| or "left" are - the text that humans produce when talking
| about such things kinda relies on the reader having their own
| experience (if not vision, then at least body awareness) that
| can be mapped to those concepts. You can explain "left" and
| "right" to a human by telling them which of their hands is
| which, and I can't help but wonder what the actual
| information payload this constitutes when you consider the
| body spatial awareness that is brought into context by
| association.
| dragonwriter wrote:
| > Fundamentally, it still "sees" the world through text
|
| Fundamentally, it "sees the world" [0] through tokens,
| which are not text.
|
| [0] Also a bad metaphor, but...
| Jensson wrote:
| A string of tokens is text. Tokens are just another
| alphabet, like Japanese letters having many
| representations for the same sounds, where a letter can
| sometimes be an entire word.
| int_19h wrote:
| Technically true, but when those tokens are 1:1 mapped to
| text, I think we can simplify this down without losing
| anything important.
|
| Of course, once you start using tokens for other things -
| as multimodal LMs already do - that changes. But this
| current crop of models still has visual modality in its
| infancy IMO, and gauging the overall performance of a model
| as a whole based on that is very questionable.
| joe_the_user wrote:
| _Why is it bizarre to consider the limitations inherent in
| the input data on which the model is trained?_
|
| Sure the thing is limited, and the study is a demonstration
| of this (general-purpose abilities have been claimed for
| LLMs at various points).
|
| I was pushing back against the "it's like a blind person"
| anthropomorphizing argument [edit: especially the
| assumption these things learn through experience and
| _reflection_ which the parent also makes]. Maybe if the
| thing "had eyes", it could learn spatial information and
| maybe it couldn't (though it would take a lot of work to
| make that metaphor meaningful). The thing certainly doesn't
| learn text in the fashion that a human learns speech, since
| humans don't digest the entire Internet before they can
| speak.
| pixl97 wrote:
| I'd recommend looking up model grounding by multi-modal
| training. Seemingly models improve as you add more modes.
| lazy_moderator1 wrote:
| > which is kinda unsurprising since it has no body, no world,
| no space; it "lives" in the realm of text
|
| or rather the training set was lacking in this regard
| Sharlin wrote:
| > DALLE3 suffers a similar problem where it has trouble with
| concepts like "upside down" and consistently fails to apply
| them to generated images.
|
| This has nothing to do with having "no body, no world" and
| everything to do with the fact that training pictures where
| things are upside down are simply vastly rarer than pictures
| where they aren't.
| mr_toad wrote:
| > DALLE3 suffers a similar problem where it has trouble with
| concepts like "upside down" and consistently fails to apply
| them to generated images.
|
| There's probably not many (if any) upside down images or
| objects in the training data.
| xbmcuser wrote:
| It has been really interesting over the last few years of
| machine learning to read that a model can't do this or that,
| and then a week or a month later read that it can. ChatGPT and
| the models that have come after seem to have accelerated this
| back and forth a lot. Unless you keep up with it closely and
| keep updating your information, I think what you knew it could
| do well or could not do well is no longer correct.
| RecycledEle wrote:
| What every paper I have seen so far is missing is that there are
| many ways to achieve super-human intelligence. (I need to give
| credit to Isaac Arthur of SFIA for this.)
|
| Getting results faster is one way. AIs beat me in speed.
|
| Getting results cheaper is another way. AI is cheaper than I am.
|
| Knowledge across many fields is better. AI beats me here too.
|
| Getting better results in one narrow field is another way, but
| only one of many ways. I love evaluations of human-produced
| work vs. machine-produced work. If we had quality evaluations
| (not the typo-riddled garbage most people use), if we compared
| AIs to people who work in those fields in occupations
| recognized by the US Dept of Labor, if we asked both sides to
| justify their answers, and if we had statistically significant
| sample sizes, then maybe we could get some good results on
| quality of work. I can imagine the US DOL spending billions of
| dollars benchmarking AIs against humans in all the occupations
| they recognize. Alternately, this could be a very profitable
| company.
| jacquesm wrote:
| That's why the 'G' in AGI is so important.
| varispeed wrote:
| I view LLM as a search engine, that can find and present the
| knowledge in a way that is easier to consume.
|
| For instance: Find me a solution to 2x2. I see this as a search
| problem, not a reasoning problem.
|
| I think many scientists don't look at it from this angle and
| push the agenda that somehow current AI can reason. It can't.
| It's just a search engine / pattern matching contraption that
| works so well it can fool people into thinking they interact
| with an intelligence.
|
| Frankly, many people "think" in the same way. They store facts
| in their brain and then retrieve them when asked, but they have
| limited ability to arrive at new facts from those they already
| know, without external input.
|
| In that sense, the "AI" can certainly replace humans in tasks
| that don't require thinking, or where there is a need to distil
| information from a wide array of sources in a useful way.
| ben_w wrote:
| I see it as translation, not search. Search is already done
| fantastically well at small scale with simple indexing, and
| (before adversarial reactions from SEO) at internet scale
| with Page rank.
|
| Asking if LLMs are really reasoning or not feels like an
| argument about terminology, like asking if A* really is route
| planning.
| prox wrote:
| Having tried a week of LLM instead of search, the results are
| usually very good as a search engine goes (I used you.com
| that adds search results)
|
| But the amount of hallucinations was staggering. There is
| nothing behind the curtain so to speak. I constantly had to
| guide it like a little kid.
| p1esk wrote:
| Is you.com using gpt4?
| pixl97 wrote:
| The following is a logical reasoning problem. Determine which
| answer, A,B,C, or D is correct
|
| FXGB, HXIB, JXKB, _____, NXOB
|
| A. OXPBB. B. KXMBC. C. LXLBD. D. LXMB
|
| ChatGPT: The pattern in the sequence "FXGB, HXIB, JXKB, _____,
| NXOB" suggests an alternating increment in the letters
| surrounding "X".
|
| Analyzing each component:
|
| The first letter sequence: F, H, J, __, N. The pattern here is
| increasing by 2: F (+2) - H, H (+2) - J. Following this,
| J (+2) - L. So, the first letter of the missing term is L.
|
| The third letter sequence: G, I, K, __, O. The pattern is the
| same as the first sequence: G (+2) - I, I (+2) - K. So,
| K (+2) - M. The third letter of the missing term is M.
|
| The last letter sequence: B, B, B, __, B. The pattern is
| consistent: the letter B repeats. So, the last letter of the
| missing term is B.
|
| Combining these, the missing term is LXM(B).
|
| The answer is C. LXL(B)D.
|
| ----
|
| Unfortunately I did not make up the question myself, so is it
| already in the training set?
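|
| (For what it's worth, the intended rule is easy to check
| mechanically; the derived term matches option D:)
|
|     # First and third letters advance by 2; the second is
|     # always X and the last is always B.
|     def step(term):
|         nxt = lambda c, k: chr(ord(c) + k)
|         return nxt(term[0], 2) + "X" + nxt(term[2], 2) + "B"
|
|     seq = ["FXGB", "HXIB", "JXKB"]
|     missing = step(seq[-1])
|     print(missing)                  # LXMB
|     print(step(missing) == "NXOB")  # True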
| root_axis wrote:
| This response is actually quite illustrative of the
| stochastic nature of LLMs. Under normal circumstances a
| human would never make this kind of error.
|
| The system remains coherent long enough to produce the
| right answer, yet it still confidently selects the wrong
| answer because it only understands the question as a
| function of statistical relationships between tokens,
| unlike humans who would understand the question at a
| conceptual level.
| usaar333 wrote:
| Note that the goal of ASI isn't to beat you per se, but the
| best humans.
|
| GPT-4 is generally not competitive with domain specialists
| (outside areas where machines have already been competitive for
| some time). That of course doesn't mean it isn't valuable - but
| it becomes more for human augmentation than replacement.
| RecycledEle wrote:
| ASI stands for Artificial Superintelligence.
| godelski wrote:
| > ASI stands for Artificial Superintelligence.
|
| >> Note that the goal of ASI isn't to beat you per se, but
| the best humans.
|
| I'm pretty sure they understand that.
| RecycledEle wrote:
| I don't see it as a dialectic. I see it as a continuum.
|
| There is value to replacing a percentage of human labor with
| machine labor.
|
| Jordan Peterson said there are no jobs in the USA for people
| with IQs below 87:
| https://www.youtube.com/watch?v=RN3w_F9r-aI&t=338s
|
| Mr. Peterson says lower-IQ people do better at more
| repetitive jobs.
|
| Machines and AI can do more repetitive jobs. Technology took
| those jobs.
|
| As AI advances, the lowest useful IQ will move up. I wonder
| where it will be in 10 years.
| whatshisface wrote:
| Maybe they will invent treatments for IQ below 90 so that
| our ability can move up along with technology.
| pixl97 wrote:
| The rate of AI IQ gain is not limited by ethics and
| clinical trials.
| notaurus wrote:
| This might be the case now but shouldn't be a particularly
| useful distinction for very long-- on the grand scale of
| intelligence, the difference between the smartest human and
| the average human is tiny.
| usaar333 wrote:
| I'm not talking about intelligence but expertise.
|
| GPT-4 is a lot better than me at unfamiliar tasks, because
| it has the advantage of being able to readily transform all
| data on the internet.
|
| Once I have trained myself, I generally can beat GPT-4.
| greenthrow wrote:
| These are not "ways" of achieving superhuman intelligence,
| these are potential forms of superhuman performance.
|
| It's also a very broad definition, because by that definition
| we have had superhuman performance in computation for decades.
| And we've been able to write programs for very narrow tasks
| that outperform humans for that time as well.
|
| Generally when people are comparing artificial intelligence
| against human intelligence, they want an AGI. Humans aren't
| good at just one narrow thing, that's why they aren't all
| replaced by specialized programs already. We are talking about
| being able to describe a problem and the artificial
| intelligence can take it from there. That's why the bar is AGI,
| which is a much higher bar than any of your broad definitions.
| pixl97 wrote:
| Though I think we should take note that the programs written
| for very narrow tasks that outperform humans may start to
| widen in scope.
|
| Things like object recognition have moved from "only humans
| can do this" to machines are keeping up with us. And I think
| that this is how it will be for at least some time. Instead
| of one big blast where ASI is better than humans at
| everything, more and more categories of narrow intelligence
| will widen and become superhuman.
| wouldbecouldbe wrote:
| Never heard of Mechanical Turk, hahah for a Dutch person it
| sounds pretty racist. Turk is what we call Turkish people.
| margalabargala wrote:
| It's a reference to an old famous chess-playing "machine" that
| had a human inside it.
|
| https://en.wikipedia.org/wiki/Mechanical_Turk
| wouldbecouldbe wrote:
| Yeah, I read that; it's just interesting how different it
| sounds in non-English cultures.
| baxter001 wrote:
| No, it doesn't, Turk is what we call Turkish people too.
| WrongAssumption wrote:
| It was invented by a Hungarian to impress an Austrian. The
| Turk portion of the name is from the original Hungarian. It
| toured Europe for 80+ years. I have no clue what about this
| you think has anything to do with English culture.
| renewiltord wrote:
| That's nothing. In American English, calling someone a
| turkey would be an insult. Turkey is how we refer to the
| nation Turkiye Cumhuriyeti. Imagine calling an entire
| nation an ungainly bird that we ritually consume every
| November. Almost as bad as giving Michael Jordan his own
| seat at the UN. But that's what capitalism and the US
| hegemony have brought us.
| esafak wrote:
| There's an interesting story behind the name:
| https://www.etymonline.com/word/Turkey
| dr_dshiv wrote:
| I'm really looking forward to students majoring in "machine
| psychology."
| ryzvonusef wrote:
| https://en.wikipedia.org/wiki/Susan_Calvin
|
| > Graduating with a bachelor's degree from Columbia University in
| 2003, she began post-graduate work in cybernetics, learning to
| construct positronic brains such that responses to given
| stimuli could be accurately predicted. She joined US Robots in
| 2008 as their first Robopsychologist, having earned her PhD. By
| 2029, when she left Earth for the first time to visit Hyper
| Base, her formal title was Head Psychologist.
|
| https://en.wikipedia.org/wiki/Robopsychology
| theptip wrote:
| If you look at the appendix, you can see example transcripts. The
| sample they provide looks like a very bad eval.
|
| It's encoding an originally visual problem into a textual matrix
| form, and then expecting GPT to recognize visual correlations.
| You simply can't compare these two tasks! Most humans wouldn't
| recognize the 5x5 matrix as a 4x4 square.
|
| So the comparison with "human level" is completely invalid. And
| even the valid comparison is only measuring visio-spatial
| intelligence, not IQ.
| Sharlin wrote:
| Humans are biased towards grokking two-dimensional information
| arranged in grids because that's how vision works. (Non-V) GPT
| is not. Its natural gestalt has a vast number of dimensions. It
| is, and it should, be able to find correlations across data no
| matter how the input is arranged. (Note that for GPT _text is
| not linear_ even if it appears so to humans.)
|
| To humans it would be qualitatively more difficult to recognize
| patterns in four or six or ten-dimensional data than in two
| dimensions. To GPT, it's just another variation of the same
| problem, one that's at most _quantitatively_ more difficult.
|
| > And even the valid comparison is only measuring visio-spatial
| intelligence, not IQ.
|
| You're the first one to mention IQ. The paper definitely does
| not. In any case "visio-spatial intelligence" is a human
| concept. GPT doesn't distinguish between "visio-spatial" and
| any other kind of pattern recognition.
| og_kalu wrote:
| >It is, and it should, be able to find correlations across
| data no matter how the input is arranged.
|
| That doesn't mean it doesn't favor certain arrangements.
|
| https://arxiv.org/abs/2305.18354
|
| Linear is in fact better for this benchmark
| datadrivenangel wrote:
| "Our experimental results support the conclusion that neither
| version of GPT-4 has developed robust abstraction abilities at
| humanlike levels."
|
| Some combination of LLMs and logical reasoning systems will get
| us much closer, but that becomes a lot more complicated.
| intended wrote:
| This new gen of AI adds an interesting twist to the infinite
| monkeys and typewriters issue.
|
| How do you actually check an infinite amount of junk to verify
| that one of the pieces is the collected works of Shakespeare?
|
| The question I ask now is "what's your error rate for
| domain-specific work?"
|
| It could be faster and smarter, but it doesn't matter if it's
| wrong.
___________________________________________________________________
(page generated 2023-11-19 23:01 UTC)