[HN Gopher] Are you smarter than a language model?
___________________________________________________________________
Are you smarter than a language model?
Author : JoelEinbinder
Score : 91 points
Date : 2024-08-17 19:21 UTC (3 hours ago)
(HTM) web link (joel.tools)
(TXT) w3m dump (joel.tools)
| JoelEinbinder wrote:
| I made a little game/quiz where you try to guess the next word in
| a bunch of Hacker News comments and compete against various
| language models. I used llama2 to generate three alternative
| completions for each comment, creating a multiple-choice
| question. For the local language models that you are competing
| against, I consider them to have picked the answer with the
| lowest total perplexity of prompt + answer. I am able to
| replicate this behavior with the OpenAI models by setting a
| logit_bias that restricts the LLM to picking only one of the
| allowed answers. I tried
| just giving the full multiple choice question as a prompt and
| having it pick an answer, but that led to really poor results. So
| I'm not able to compare with Claude or any online LLMs that don't
| have logit_bias.
|
| I wouldn't call the quiz fun, exactly. After playing with it a
| lot, I think I've been able to consistently get above 50% of
| questions right. I've slowed down a lot when answering each
| question, which is something I think LLMs have trouble doing.
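|
| Roughly, the local-model picking works like this (a simplified
| sketch using the Hugging Face transformers library and an example
| model name, not the exact quiz code):
|
|     # Pick the candidate whose prompt + answer has the lowest
|     # perplexity under a local causal language model.
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "meta-llama/Llama-2-7b-hf"  # example model name
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     def perplexity(text):
|         ids = tok(text, return_tensors="pt").input_ids
|         with torch.no_grad():
|             # labels=ids makes the model return the mean
|             # cross-entropy; exp() of that is the perplexity.
|             loss = model(ids, labels=ids).loss
|         return torch.exp(loss).item()
|
|     def pick(prompt, candidates):
|         return min(candidates,
|                    key=lambda c: perplexity(prompt + " " + c))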
| jonahx wrote:
| "This exercise helped me to understand how language models work
| on a much deeper level."
|
| I'd like to hear more on this.
| 0xDEADFED5 wrote:
| It's an interesting test, pretty cool idea. Thanks for sharing
| silisili wrote:
| Was mine broken? One of my prompts was just '>'. So of course I
| guessed a random word. The answer key showed I got it wrong, but
| showed the right answer inserted into a longer prompt. Or is that
| how it's supposed to work?
| JoelEinbinder wrote:
| That isn't how it's supposed to work. I mean sometimes you get
| a super annoying prompt like ">", but if you guess the right
| answer it should give you the point. I just checked the two
| prompts like that, and they seem to work for me.
| silisili wrote:
| Right, I got the answer incorrect, so that part worked right.
| I just wasn't sure if the question was intentionally clipped
| and missing that context, but it does sound intentional. I
| guess I make a poor LLM!
| mjcurl wrote:
| 5/15, so the same as choosing the most common word.
|
| I think I did worse when the prompt was shorter. It just becomes a
| guessing game then and I find myself thinking more like a
| language model.
| toxik wrote:
| Yeah, it should be sentences that have low next-token
| distribution entropy, i.e. ones where an LLM is sure what the
| next word is. I bet people do real well on those too. By the way,
| I also had 5/15.
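|
| Something like this could measure it (a rough sketch, assuming a
| Hugging Face causal LM and tokenizer are already loaded):
|
|     import torch
|     import torch.nn.functional as F
|
|     def next_token_entropy(model, tok, prompt):
|         # Shannon entropy (in bits) of the next-token
|         # distribution; low entropy means the model is
|         # confident about the next token.
|         ids = tok(prompt, return_tensors="pt").input_ids
|         with torch.no_grad():
|             logits = model(ids).logits[0, -1]
|         probs = F.softmax(logits, dim=-1)
|         return -(probs * torch.log2(probs + 1e-12)).sum().item()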
| dalton01 wrote:
| It says choosing the most common word was just 1/5 (and their
| best LLM was 4/15)
| jsnell wrote:
| It's a neat idea, though not what I expected from the title
| talking about "smart" :)
|
| You might want to replace the single page format with showing
| just one question at a time, and giving instant feedback after
| each answer.
|
| First, it'd be more engaging. Even the small version of the quiz
| is a bit long for something where you don't know what the payoff
| will be. Second, you'd get to see the correct answer while still
| having the context on why you replied the way you did.
| JoelEinbinder wrote:
| If you want to practice it one question at a time, you can set the
| question count to 1. https://joel.tools/smarter/?questions=1
|
| When I tested it this way it resulted in less of an emotional
| reaction.
| KTibow wrote:
| If you're looking for "knowledge" try
| https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...
| codetrotter wrote:
| > not what I expected from the title talking about "smart"
|
| I think the title is mainly a reference to the TV show "Are you
| smarter than a fifth grader?"
|
| Fittingly, a lot of the questions they asked on that TV show
| were mostly trivia, which I also don't think of as a
| particularly important characteristic of being "smart".
|
| When I think of "smart" people, I think of people who can take
| a limited amount of information and connect dots in ways that
| others can't. Of course it also builds on knowledge. You need
| to have specific knowledge in the first place to make
| connections. But knowing facts like "the battle of so and so
| happened on August 18th 1924, one hundred years ago today"
| alone is not "smart". A smart person is someone who uses
| knowledge in a surprising way. Or in a way that others would
| not have been able to. After the smart person made the
| connection others might also go like "oh that's so obvious why
| didn't I think about that" or even "yeah that's really obvious,
| I could've thought of that too". And yet the first person to
| actually make, and properly communicate that connection was the
| smart one. Smart exactly because they did.
| wesselbindt wrote:
| I like the website, but it could be a bit more explicit about the
| point it's trying to make. Given that a lot of people tend to
| think of an LLM as somehow a thinking entity rather than a
| statistical model for guessing the most likely next word, most
| will probably look at these questions and think the website is
| broken.
| layer8 wrote:
| This is also a good test for noticing that you spend too much
| time reading HN comments.
| TacticalCoder wrote:
| My computer can compute 573034897183834790x3019487439184798 in
| less than a millisecond. Doesn't make it smarter than me.
| nyrikki wrote:
| 7/10 This is more about set shattering than 'smarts'
|
| LLMs are effectively DAGs; they literally have to unroll infinite
| possibilities in the absence of larger context into finite
| options.
|
| You can unroll a cyclic graph into a DAG, but you constrict the
| solution space.
|
| Take the 'spoken' sentence:
|
| "I never said she stole my money"
|
| And say it multiple times with emphasis on each word and notice
| how the meaning changes.
|
| That is text being a forgetful functor.
|
| As you can describe PAC learning as compression, which is
| exactly equivalent to the finite set shattering above, you can
| assign probabilities to next tokens.
|
| But that is existential quantification, limited to your corpus
| and based on pattern matching and finding.
|
| I guess if "Smart" is defined as pattern matching and finding it
| would apply.
|
| But this is exactly why there was a split between symbolic AI,
| which targeted universal quantification, and statistical learning,
| which targets existential quantification.
|
| Even if ML had never been invented, I would assume that there
| were mechanical methods to stack rank next tokens from a corpus.
|
| This isn't a case of 'smarter', but just different. Whether that
| difference is meaningful depends on context.
| akira2501 wrote:
| Yes. I can tell you about things that happened this morning. Your
| language model cannot.
| Garlef wrote:
| I like it. It's a humorous reversal of the usual articles that
| boil down to "Look! I made the AI fail at something!"
| User23 wrote:
| With some brief experimentation, I found that ChatGPT also fails
| this test.
| lostmsu wrote:
| It might make sense: any kind of fine-tuning of LLMs usually
| reduces generalization capabilities, and instruction-tuning is
| a kind of fine-tuning.
| stackghost wrote:
| This is just a test of how likely you are to generate the same
| word _as the LLM_. The LLM does not produce the "correct" next
| word as there are multiple correct words that fit grammatically
| and can be used to continue the sentence while maintaining
| context.
|
| I don't see what this has to do with being "smarter" than
| anything. Example:
|
| 1. I see a business decision here. Arm cores have licensing fees
| attached to them. Arm is becoming ____
|
| a) ether
|
| b) a
|
| c) the
|
| d) more
|
| But who's to say which is "correct"? Arm is becoming a household
| name. Arm is becoming the premier choice for new CPU
| architectures. Arm is becoming more valuable by the day. Any of
| b), c), or d) are equally good choices. What is there to be
| gained in divining which one the LLM would pick?
| JoelEinbinder wrote:
| The LLM didn't generate the next word. Hacker News commenters
| did. You can see the source of the comment on the results
| screen.
| sigbottle wrote:
| Do LLMs generate words on the fly, or can they sort of "go
| back" and correct themselves? stackghost brought up a good
| point I didn't think about before
| DiscourseFan wrote:
| At this point, we've all gotten quite used to the "style" of
| LLM outputs, and personally I doubt this is the case,
| _however_ , it is possible that there is some, shall we say,
| _corruption_ of the data here, since it was not possible to
| measure the ability of LLMs to predict the next word _before
| there were LLMs_.
|
| I propose you do the same thing, but only include HN content
| from before the existence of LLMs. That should ensure there
| is no bias towards any of the models.
| JoelEinbinder wrote:
| If I used old comments then it's likely that the models
| would have been trained on them. I haven't tested whether that
| makes a
| difference though.
| raggi wrote:
| an unbiased llm shouldn't be producing "style", it should
| be generating outputs that closely match the training set,
| as such their introduction should constitute only some
| biasing toward the average, which also happens in language
| usage in humans over time. the outcome is likely
| indistinguishable for large general data sets and large
| models. i am interested to see how chatbot outputs produce
| human output bias in generations growing up with them
| though, that seems likely and will probably be substantial
| zoklet-enjoyer wrote:
| You scored 6/15. The best language model, gpt-4o, scored 6/15.
| The unigram model, which just picks the most common word without
| reading the prompt, scored 2/15.
|
| Keep in mind that you took 204 seconds to answer the questions,
| whereas the slowest language model was llama-3-8b taking only 10
| seconds!
| e12e wrote:
| you: 8/15
| gpt-4o: 2/15
| gpt-4: 4/15
| gpt-4o-mini: 4/15
| llama-2-7b: 5/15
| llama-3-8b: 5/15
| mistral-7b: 6/15
| unigram: 5/15
|
| > You scored 8/15. The best language model, mistral-7b, scored
| 6/15. The unigram model, which just picks the most common word
| without reading the prompt, scored 5/15.
|
| (In I think 120 seconds - didn't copy that part).
|
| Interesting that results differ this much between runs (for the
| LLMs).
|
| Surely someone did better than me on their first run?
|
| Edit: I wonder if the human scores correlate with the age of
| one's HN account?
| lostmsu wrote:
| I think this is a good joke on the nay-sayers. But if the author
| is here, I would like clarification: is the user picking the next
| token or the next word? Because if it is the latter, I think this
| test is invalid.
| JoelEinbinder wrote:
| The language model generating the candidate answers generates
| tokens until a full word is produced. The language models
| picking their answer choose the completion that results in the
| lowest perplexity independent of the tokenization.
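|
| In rough pseudo-code, the "generate tokens until a full word is
| produced" step looks something like this (a simplified greedy
| sketch to show the stopping rule, not the exact implementation):
|
|     import torch
|
|     def next_word(model, tok, prompt, max_new_tokens=8):
|         # Decode tokens one at a time and stop once a complete
|         # word (text followed by a word boundary) has appeared.
|         ids = tok(prompt, return_tensors="pt").input_ids
|         start = ids.shape[1]
|         word = ""
|         for _ in range(max_new_tokens):
|             with torch.no_grad():
|                 logits = model(ids).logits[0, -1]
|             next_id = logits.argmax().view(1, 1)
|             ids = torch.cat([ids, next_id], dim=1)
|             text = tok.decode(ids[0, start:]).lstrip()
|             if " " in text:
|                 # A second word has started, so the first word
|                 # is complete.
|                 return text.split()[0]
|             word = text
|         return word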
| lostmsu wrote:
| I'd say the test is still not quite valid; it sits somewhere
| between the original "valid" task and "guess what the LLM would
| say" as suggested in another comment here. The reason is: it
| might be easier for LLMs to choose the completion out of their
| own generated variants (1) than from the real token
| distribution.
|
| 1. perhaps even out of variants generated by other LLMs
| ZoomerCretin wrote:
| > 8. All of local politics in the muni I live in takes place in a
| forum like this, on Facebook[.] The electeds in our muni post on
| it; I've gotten two different local laws done by posting there
| (and I'm working on a bigger third); I met someone whose campaign
| I funded and helped run who is now a local elected. It is crazy
| to think you can HN-effortpost your way to changing the laws of
| the place you live in but I'm telling you right now that you can.
|
| This is a magical experience. I've done something similar in my
| university's CS department when I pointed out how the learning
| experience in the first programming course varies too much
| depending upon who the professor is.
|
| I've never experienced this anywhere else. American politicians
| at all levels don't appear to be the least bit responsive to the
| needs and issues of anyone but the wealthy and powerful.
| xanderlewis wrote:
| I feel like I recognise the comment about tensors from HN a few
| days ago, haha.
| shakna wrote:
| So... If I picked the same results, in the same timeframe... And
| I don't think glue should go on pizza... Does that mean LLMs are
| completely useless to me?
| Kiro wrote:
| Where do the incorrect options come from?
| EugeneOZ wrote:
| Just proves why IQ tests are worthless.
| moritzwarhier wrote:
| This is the best interactive website about LLMs at a meta level
| (so excluding prompt interfaces for actual AIs) that I've seen so
| far.
|
| Quizzes can be magical.
|
| Haven't seen any cooler new language-related interactive fun-
| project on the web since:
|
| https://wikispeedruns.com/
|
| It would be great if the quiz included an intro or note about the
| training data, but as-is it also succeeds because it's obvious
| from the quiz prompts/questions that they're related to HN
| comments.
|
| Sharing this with a general audience could spark funny
| discussions about bubbles and biases :)
| ChrisArchitect wrote:
| Related:
|
| _Who's Smarter: AI or a 5-Year-Old?_
|
| https://nautil.us/whos-smarter-ai-or-a-5-year-old-776799/
|
| (https://news.ycombinator.com/item?id=41263363)
| moralestapia wrote:
| >the quintessential language model task of predicting the next
| word?
|
| Based on what? The whole test is flawed because of this. Even
| different LLMs would choose different answers and there's no
| objective argument to make for which one is the best.
| sorokod wrote:
| The one provided in the original post.
| anikan_vader wrote:
| Got 8/15, best AI model got 7/15, and unigram got 1/15.
|
| Finally a use for all the wasted hours I've spent on HN -- my
| next word prediction is marginally better than that of the AI.
| sethammons wrote:
| I have wasted an inordinate amount of time on HN. I scored 2/15
| StefanBatory wrote:
| 7/15, 90 seconds. I'll blame it on the fact that I'm not a native
| English speaker, right? Right?
|
| On a more serious note it was a cool thing to go through! It
| seemed like something that should have been so easy at first
| glance.
| seabass-labrax wrote:
| I am a native English speaker and only got 5/15 - and it took
| me over 100 seconds. You have permission to bask in the glory
| of your superiority over both GPT4 and your fellow HN readers!
___________________________________________________________________
(page generated 2024-08-17 23:00 UTC)