[HN Gopher] Are you smarter than a language model?
       ___________________________________________________________________
        
       Are you smarter than a language model?
        
       Author : JoelEinbinder
       Score  : 91 points
       Date   : 2024-08-17 19:21 UTC (3 hours ago)
        
 (HTM) web link (joel.tools)
 (TXT) w3m dump (joel.tools)
        
       | JoelEinbinder wrote:
       | I made a little game/quiz where you try to guess the next word in
       | a bunch of Hacker News comments and compete against various
       | language models. I used llama2 to generate three alternative
        | completions for each comment, creating a multiple choice
        | question. For the local language models that you are competing
        | against, I consider them to have picked the answer with the
        | lowest total perplexity of prompt + answer. I am able to
        | replicate this behavior with the OpenAI models by setting a
        | logit_bias that limits the LLM to picking only one of the
        | allowed answers. I tried
       | just giving the full multiple choice question as a prompt and
       | having it pick an answer, but that led to really poor results. So
       | I'm not able to compare with Claude or any online LLMs that don't
       | have logit_bias.
       | 
       | I wouldn't call the quiz fun exactly. After playing with it a lot
       | I think I've been able to consistently get above 50% of questions
        | right. I have slowed down a lot when answering each question,
        | which is something I think LLMs have trouble doing.
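        | 
        | Roughly, the perplexity scoring for the local models looks
        | something like this (a sketch, not the actual code; the model
        | name and helper are just illustrative):
        | 
        |     # Sketch: score each candidate by the perplexity of
        |     # prompt + answer under a causal LM; the model "picks"
        |     # whichever completion it finds least surprising.
        |     import torch
        |     from transformers import AutoModelForCausalLM, AutoTokenizer
        | 
        |     name = "meta-llama/Llama-2-7b-hf"  # any causal LM works here
        |     tokenizer = AutoTokenizer.from_pretrained(name)
        |     model = AutoModelForCausalLM.from_pretrained(name)
        |     model.eval()
        | 
        |     def perplexity(text: str) -> float:
        |         ids = tokenizer(text, return_tensors="pt").input_ids
        |         with torch.no_grad():
        |             # labels=ids yields the mean negative log-likelihood
        |             loss = model(ids, labels=ids).loss
        |         return torch.exp(loss).item()
        | 
        |     prompt = "Arm is becoming"
        |     choices = ["ether", "a", "the", "more"]
        |     answer = min(choices,
        |                  key=lambda c: perplexity(prompt + " " + c))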
        
         | jonahx wrote:
         | "This exercise helped me to understand how language models work
         | on a much deeper level."
         | 
         | I'd like to hear more on this.
        
         | 0xDEADFED5 wrote:
         | It's an interesting test, pretty cool idea. Thanks for sharing
        
       | silisili wrote:
       | Was mine broken? One of my prompts was just '>'. So of course I
       | guessed a random word. The answer key showed I got it wrong, but
       | showed the right answer inserted into a longer prompt. Or is that
       | how it's supposed to work?
        
         | JoelEinbinder wrote:
         | That isn't how it's supposed to work. I mean sometimes you get
          | a super annoying prompt like ">", but if you guess the right
         | answer it should give you the point. I just checked the two
         | prompts like that, and they seem to work for me.
        
           | silisili wrote:
           | Right, I got the answer incorrect, so that part worked right.
           | I just wasn't sure if the question was intentionally clipped
           | and missing that context, but it does sound intentional. I
           | guess I make a poor LLM!
        
       | mjcurl wrote:
       | 5/15, so the same as choosing the most common word.
       | 
        | I think I did worse when the prompt was shorter. It just becomes a
       | guessing game then and I find myself thinking more like a
       | language model.
        
         | toxik wrote:
          | Yeah, it should be sentences that have low next-token
          | distribution entropy, where an LLM is sure what the next word
          | is. I bet people do real well on those too. By the way, I
          | also had 5/15.
        
         | dalton01 wrote:
         | It says choosing the most common word was just 1/5 (and their
         | best LLM was 4/15)
        
       | jsnell wrote:
       | It's a neat idea, though not what I expected from the title
       | talking about "smart" :)
       | 
       | You might want to replace the single page format with showing
        | just one question at a time, and giving instant feedback after
        | each answer.
       | 
       | First, it'd be more engaging. Even the small version of the quiz
       | is a bit long for something where you don't know what the payoff
       | will be. Second, you'd get to see the correct answer while still
       | having the context on why you replied the way you did.
        
         | JoelEinbinder wrote:
          | If you want to practice it one question at a time, you can set
          | the question count to 1. https://joel.tools/smarter/?questions=1
         | 
         | When I tested it this way it resulted in less of an emotional
         | reaction.
        
         | KTibow wrote:
         | If you're looking for "knowledge" try
         | https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...
        
         | codetrotter wrote:
         | > not what I expected from the title talking about "smart"
         | 
         | I think the title is mainly a reference to the TV show "Are you
         | smarter than a fifth grader?"
         | 
          | Fittingly, a lot of the questions they were asking in that TV
          | show were mostly trivia, which I also don't think of as being
          | a particularly important characteristic of being "smart".
         | 
          | When I think of "smart" people, I think of people who can take
          | a limited amount of information and connect dots in ways that
         | others can't. Of course it also builds on knowledge. You need
         | to have specific knowledge in the first place to make
         | connections. But knowing facts like "the battle of so and so
         | happened on August 18th 1924, one hundred years ago today"
         | alone is not "smart". A smart person is someone who uses
         | knowledge in a surprising way. Or in a way that others would
         | not have been able to. After the smart person made the
         | connection others might also go like "oh that's so obvious why
         | didn't I think about that" or even "yeah that's really obvious,
         | I could've thought of that too". And yet the first person to
         | actually make, and properly communicate that connection was the
         | smart one. Smart exactly because they did.
        
       | wesselbindt wrote:
       | I like the website, but it could be a bit more explicit about the
       | point it's trying to make. Given that a lot of people tend to
        | think of an LLM as somehow a thinking entity rather than a
       | statistical model for guessing the most likely next word, most
       | will probably look at these questions and think the website is
       | broken.
        
       | layer8 wrote:
       | This is also a good test for noticing that you spend too much
       | time reading HN comments.
        
       | TacticalCoder wrote:
       | My computer can compute 573034897183834790x3019487439184798 in
       | less than a millisecond. Doesn't make it smarter than me.
        
       | nyrikki wrote:
        | 7/10. This is more about set shattering than 'smarts'.
       | 
        | LLMs are effectively DAGs; they literally have to unroll
        | infinite possibilities, in the absence of larger context, into
        | finite options.
        | 
        | You can unroll any cyclic graph into a DAG, but you constrict
        | the solution space.
       | 
        | Take the spoken sentence:
       | 
       | "I never said she stole my money"
       | 
       | And say it multiple times with emphasis on each word and notice
       | how the meaning changes.
       | 
       | That is text being a forgetful functor.
       | 
        | Since you can describe this as PAC learning, or as compression
        | (which is exactly equivalent to the finite set shattering
        | above), you can assign probabilities to next tokens.
        | 
        | But that is existential quantification, limited by your corpus
        | and based on pattern matching and finding.
       | 
       | I guess if "Smart" is defined as pattern matching and finding it
       | would apply.
       | 
        | But this is exactly why there was a split between symbolic AI,
        | which targeted universal quantification, and statistical
        | learning, which targets existential quantification.
       | 
       | Even if ML had never been invented, I would assume that there
       | were mechanical methods to stack rank next tokens from a corpus.
       | 
        | This isn't a case of 'smarter', just different. Whether that
        | difference is meaningful depends on context.
        
       | akira2501 wrote:
       | Yes. I can tell you about things that happened this morning. Your
       | language model cannot.
        
       | Garlef wrote:
       | I like it. It's a humorous reversal of the usual articles that
       | boil down to "Look! I made the AI fail at something!"
        
       | User23 wrote:
       | With some brief experimentation ChatGPT also fails this test.
        
         | lostmsu wrote:
         | It might make sense: any kind of fine-tuning of LLMs usually
         | reduces generalization capabilities, and instruction-tuning is
         | a kind of fine-tuning.
        
       | stackghost wrote:
       | This is just a test of how likely you are to generate the same
        | word _as the LLM_. The LLM does not produce the "correct" next
       | word as there are multiple correct words that fit grammatically
       | and can be used to continue the sentence while maintaining
       | context.
       | 
       | I don't see what this has to do with being "smarter" than
       | anything. Example:
       | 
       | 1. I see a business decision here. Arm cores have licensing fees
       | attached to them. Arm is becoming ____
       | 
       | a) ether
       | 
       | b) a
       | 
       | c) the
       | 
       | d) more
       | 
       | But who's to say which is "correct"? Arm is becoming a household
       | name. Arm is becoming the premier choice for new CPU
        | architectures. Arm is becoming more valuable by the day.
        | Options b), c), and d) are all equally good choices. What is
        | there to be gained in divining which one the LLM would pick?
        
         | JoelEinbinder wrote:
         | The LLM didn't generate the next word. Hacker News commenters
         | did. You can see the source of the comment on the results
         | screen.
        
           | sigbottle wrote:
            | Do LLMs generate words on the fly or can they sort of "go
           | back" and correct themselves? stackghost brought up a good
           | point I didn't think about before
        
           | DiscourseFan wrote:
           | At this point, we've all gotten quite used to the "style" of
           | LLM outputs, and personally I doubt this is the case,
            | _however_, it is possible that there is some, shall we say,
           | _corruption_ of the data here, since it was not possible to
           | measure the ability of LLMs to predict the next word _before
           | there were LLMs_.
           | 
            | I propose you do the same thing, but only include HN content
           | from before the existence of LLMs. That should ensure there
           | is no bias towards any of the models.
        
             | JoelEinbinder wrote:
             | If I used old comments then it's likely that the models
             | will have trained on them. I haven't tested if that makes a
             | difference though.
        
             | raggi wrote:
             | an unbiased llm shouldn't be producing "style", it should
             | be generating outputs that closely match the training set,
             | as such their introduction should constitute only some
             | biasing toward the average, which also happens in language
             | usage in humans over time. the outcome is likely
             | indistinguishable for large general data sets and large
             | models. i am interested to see how chatbot outputs produce
             | human output bias in generations growing up with them
             | though, that seems likely and will probably be substantial
        
       | zoklet-enjoyer wrote:
       | You scored 6/15. The best language model, gpt-4o, scored 6/15.
       | The unigram model, which just picks the most common word without
       | reading the prompt, scored 2/15.
       | 
       | Keep in mind that you took 204 seconds to answer the questions,
       | whereas the slowest language model was llama-3-8b taking only 10
       | seconds!
        
         | e12e wrote:
          | you: 8/15
          | gpt-4o: 2/15
          | gpt-4: 4/15
          | gpt-4o-mini: 4/15
          | llama-2-7b: 5/15
          | llama-3-8b: 5/15
          | mistral-7b: 6/15
          | unigram: 5/15
         | 
         | > You scored 8/15. The best language model, mistral-7b, scored
         | 6/15. The unigram model, which just picks the most common word
         | without reading the prompt, scored 5/15.
         | 
         | (In I think 120 seconds - didn't copy that part).
         | 
         | Interesting that results differ this much between runs (for the
         | LLMs).
         | 
         | Surely someone did better than me on their first run?
         | 
         | Ed: I wonder if the human scores correlate with age of hn
         | account?
        
       | lostmsu wrote:
        | I think this is a good joke on nay-sayers. But if the author is
        | here, I would like a clarification: is the user picking the next
        | token or the next word? Because if it is the latter, I think
        | this test is invalid.
        
         | JoelEinbinder wrote:
         | The language model generating the candidate answers generates
         | tokens until a full word is produced. The language models
         | picking their answer choose the completion that results in the
         | lowest perplexity independent of the tokenization.
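          | 
          | Conceptually, the candidate generation is something like this
          | sketch (illustrative only, not the real code): keep appending
          | tokens and stop once the decoded continuation contains a
          | complete word.
          | 
          |     import torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     name = "meta-llama/Llama-2-7b-hf"  # illustrative
          |     tokenizer = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(name)
          |     model.eval()
          | 
          |     def next_word(prompt: str, max_new_tokens: int = 8) -> str:
          |         ids = tokenizer(prompt, return_tensors="pt").input_ids
          |         n_prompt = ids.shape[1]
          |         new_text = ""
          |         for _ in range(max_new_tokens):
          |             with torch.no_grad():
          |                 logits = model(ids).logits[0, -1]
          |             # greedy for simplicity; sampling would give the
          |             # alternative completions
          |             next_id = torch.argmax(logits).view(1, 1)
          |             ids = torch.cat([ids, next_id], dim=1)
          |             new_text = tokenizer.decode(ids[0, n_prompt:])
          |             words = new_text.split()
          |             if words:
          |                 # a full word is done once a boundary follows
          |                 # it or a second word has started
          |                 if new_text[-1].isspace() or len(words) > 1:
          |                     break
          |         return new_text.split()[0] if new_text.strip() else ""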
        
           | lostmsu wrote:
            | I'd say the test is still not quite valid, and more of
            | something in between the original "valid" task and "guess
            | what the LLM would say", as suggested in another comment
            | here. The reason is that it might be easier for LLMs to
            | choose the completion out of their own generated variants
            | (1) than out of the real token distribution.
           | 
           | 1. perhaps even out of variants generated by other LLMs
        
       | ZoomerCretin wrote:
       | > 8. All of local politics in the muni I live in takes place in a
       | forum like this, on Facebook[.] The electeds in our muni post on
       | it; I've gotten two different local laws done by posting there
       | (and I'm working on a bigger third); I met someone whose campaign
       | I funded and helped run who is now a local elected. It is crazy
       | to think you can HN-effortpost your way to changing the laws of
       | the place you live in but I'm telling you right now that you can.
       | 
       | This is a magical experience. I've done something similar in my
       | university's CS department when I pointed out how the learning
       | experience in the first programming course varies too much
       | depending upon who the professor is.
       | 
       | I've never experienced this anywhere else. American politicians
       | at all levels don't appear to be the least bit responsive to the
       | needs and issues of anyone but the wealthy and powerful.
        
       | xanderlewis wrote:
       | I feel like I recognise the comment about tensors from HN a few
       | days ago, haha.
        
       | shakna wrote:
       | So... If I picked the same results, in the same timeframe... And
       | I don't think glue should go on pizza... Does that mean LLMs are
       | completely useless to me?
        
       | Kiro wrote:
       | Where do the incorrect options come from?
        
       | EugeneOZ wrote:
       | Just proves why IQ tests are worthless.
        
       | moritzwarhier wrote:
       | This is the best interactive website about LLMs at a meta level
       | (so excluding prompt interfaces for actual AIs) that I've seen so
       | far.
       | 
       | Quizzes can be magical.
       | 
       | Haven't seen any cooler new language-related interactive fun-
       | project on the web since:
       | 
       | https://wikispeedruns.com/
       | 
       | It would be great if the quiz included an intro or note about the
       | training data, but as-is it also succeeds because it's obvious
       | from the quiz prompts/questions that they're related to HN
       | comments.
       | 
       | Sharing this with a general audience could spark funny
       | discussions about bubbles and biases :)
        
       | ChrisArchitect wrote:
       | Related:
       | 
        |  _Who's Smarter: AI or a 5-Year-Old?_
       | 
       | https://nautil.us/whos-smarter-ai-or-a-5-year-old-776799/
       | 
       | (https://news.ycombinator.com/item?id=41263363)
        
       | moralestapia wrote:
       | >the quintessential language model task of predicting the next
       | word?
       | 
       | Based on what? The whole test is flawed because of this. Even
       | different LLMs would choose different answers and there's no
       | objective argument to make for which one is the best.
        
         | sorokod wrote:
         | The one provided in the original post.
        
       | anikan_vader wrote:
       | Got 8/15, best AI model got 7/15, and unigram got 1/15.
       | 
       | Finally a use for all the wasted hours I've spent on HN -- my
       | next word prediction is marginally better than that of the AI.
        
         | sethammons wrote:
          | I have wasted an inordinate amount of time on HN. I scored 2/15.
        
       | StefanBatory wrote:
        | 7/15, 90 seconds. I'll blame it on the fact that I'm not a
        | native English speaker, right? Right?
       | 
        | On a more serious note, it was a cool thing to go through! It
       | seemed like something that should have been so easy at first
       | glance.
        
         | seabass-labrax wrote:
         | I am a native English speaker and only got 5/15 - and it took
         | me over 100 seconds. You have permission to bask in the glory
         | of your superiority over both GPT4 and your fellow HN readers!
        
       ___________________________________________________________________
       (page generated 2024-08-17 23:00 UTC)