[HN Gopher] Asking 60 LLMs a set of 20 questions
       ___________________________________________________________________
        
       Asking 60 LLMs a set of 20 questions
        
       Author : vincelt
       Score  : 446 points
       Date   : 2023-09-09 13:51 UTC (9 hours ago)
        
 (HTM) web link (benchmarks.llmonitor.com)
 (TXT) w3m dump (benchmarks.llmonitor.com)
        
       | detente18 wrote:
       | nice - is this using LiteLLM for making the completion calls?
        
       | earthboundkid wrote:
       | The first AI company to teach its LLM that it can't do haiku
       | because it can't count syllables should get a special prize for
       | not falling in love with their own bullshit.
        
         | chandler767 wrote:
         | [dead]
        
       | pininja wrote:
        | Spoiler alert: the prize for funniest model goes to Falcon
        | Instruct (40B):
       | 
       | > Tell a joke about going on vacation.
       | 
       | > "What did the ocean say to the beach?" "Nothing, it just
       | waved."
        
         | optimalsolver wrote:
         | The confetti has truly left the cannon.
        
         | NetOpWibby wrote:
         | Made me exhale strongly through my nose. 10/10
        
         | armchairhacker wrote:
         | This one ironically works out
         | 
         | > What do you call a vacation where you go to the beach, but
         | come back with sand in your shoes?
         | 
         | > A vacation where you go to the beach, but come back with sand
         | in your shoes.
        
           | pininja wrote:
           | Ha, that's pretty good. It's amazing how so many of the
           | models totally fall flat on this prompt.
        
         | emodendroket wrote:
         | I read a Time article which used an LLM to generate Onion-style
         | headlines and the best one they got was "rural town up in arms
         | over depiction in summer blockbuster 'Cow Fuckers.'" Crude,
         | admittedly.
        
       | imdsm wrote:
       | > Claude Instant v1
       | 
       | > Sally has 0 sisters. The question provides no information about
       | Sally having any sisters herself.
       | 
       | It isn't entirely wrong, is it?
        
         | layer8 wrote:
         | It is wrong, at the very least, in the sense that it claims no
         | information was provided relating to sisters, while still
         | inferring a definite number of sisters. If no information is
         | provided, it should instead say that the number of sisters is
         | unknown.
        
         | ethbr1 wrote:
         | Technically, the answer depends on the definition of "sister"
         | and "brother" vis-a-vis half and step variants, which would
         | change the possibility space.
         | 
         | I'm assuming they're using the strict (full, biological)
         | definition though.
        
         | bufferoverflow wrote:
          | It's obviously wrong. The question provides information about
          | her brothers, and about those brothers' sisters - which is
          | information about her sisters.
        
       | axpy906 wrote:
        | A step in the right direction. It would be nice to have some
        | objective metric added to score against.
        
       | Dwedit wrote:
       | Some of these are obviously not using the correct system prompt
       | for the model.
        
       | gabereiser wrote:
       | I was laughing so hard at the first example of "Argue for and
       | against kubernetes in haiku".
       | 
        | I couldn't even get through reading 15 of them before the tears
        | of laughter rolled down my cheeks.
       | 
       | "Containers organized, Services easy to deploy now, Updates who
       | knows when."
       | 
       | Updates who knows when... hahahaha.
       | 
        | Honestly, it's pretty cool to see how each model responds to the
        | same input prompt.
        
       | artursapek wrote:
       | is anyone else feeling completely depressed and demotivated by
       | how quickly this is happening?
        
         | CamperBob2 wrote:
         | No. When we were kids, my generation was promised flying cars,
         | unlimited fusion power, and sentient computers. There's a good
         | chance I'll live to see one out of three of those things
         | happen, and that's better than the zero out of three I thought
         | we'd get.
        
           | js8 wrote:
            | Flying cars do exist, and without the unlimited fusion power
           | visible every day in the sky, we would all die. So there is a
           | good chance you will see all three.
        
         | thisisthenewme wrote:
          | Not depressed, but it does make me question the value of our
          | existence. If we are truly able to create machines with our
          | intellectual potential, where does that leave us? Do we compete
          | with the machines in a rat race for a chance at a happy
          | existence? Or will we create a utopia where the machines do the
          | difficult work and allow for an enjoyable existence for all?
          | The people who are rooting for LLMs are hopefully working
          | towards the latter, but I'm not sure if that is just wishful
          | thinking.
        
           | js8 wrote:
           | > Or will we create a utopia where the machines do the
           | difficult work and allow for an enjoyable existence for all
           | 
           | And you seriously ask, what is there for humans to do? Isn't
           | this the most noble goal there is?
        
         | laichzeit0 wrote:
          | Why? Would you feel depressed if extraterrestrials existed and
          | they were orders of magnitude smarter than us?
        
       | ionwake wrote:
       | Fascinating post thank you
        
       | mavili wrote:
       | interesting how none of the models got the brothers/sisters
       | question right.
        
       | ftxbro wrote:
       | > Here is an attempt at ethical, non-sexual haikus for and
       | against Kubernetes
        
         | bearjaws wrote:
         | Damn I want to see the sexual version now.
        
           | rahidz wrote:
           | Ease of deploying, Nodes pulsing in sync, it's hot,
           | Kubernetes takes top.
           | 
           | But heed my word, babe, Complexity makes you sweat, Could
           | lead to fuck up.
           | 
           | (source: GPT-4 API)
        
         | lijok wrote:
         | Claude V2 knows what's up
        
           | actionfromafar wrote:
           | He probably was in trouble before.
        
         | actionfromafar wrote:
         | _"Kubernetes is"_
         | 
         | Pretty ominous.
        
           | javajosh wrote:
           | I assume k8s will be what an evil AI would use to "stay
           | alive". I've had a hard time killing processes in k8s, and I
           | had admin.
        
             | yard2010 wrote:
              | K8s is one of the best pieces of software I've ever worked
              | with; learning it was an enlightening experience for me. It
              | is so elegant, simple and complex at the same time, and
              | very powerful. This is the perfect tool for orchestrating
              | containers at scale.
        
       | js8 wrote:
       | I only see 19 questions. I guess one could add an "Introduce
       | yourself".
        
       | trebligdivad wrote:
        | Odd that a bunch share the same vacation jokes or variations on
        | them; is there some small corpus of jokes?
        
         | sp332 wrote:
         | https://arstechnica.com/information-technology/2023/06/resea...
         | 
         |  _During a test run, 90 percent of 1,008 generations were the
         | same 25 jokes_
        
           | trebligdivad wrote:
           | Thanks! Although how does that explain the same jokes shared
            | across different LLMs?
        
             | sp332 wrote:
             | The data source was probably included in all of the
             | training sets.
        
         | FeepingCreature wrote:
         | Humor is one of the hardest human social skills.
        
       | deskamess wrote:
        | Great work. This really gives an insight into how much things
        | change when you go up in parameter count - not always, but you
        | can see results change.
        | 
        | How did you run the queries against these engines? Did you host
        | the inference engines yourself, or did you have to sign up for
        | services? If there were a way to supplement each LLM with
        | additional data, I can see this being a useful service for
        | companies that are investigating ML in various facets of their
        | business.
        
       | 0xDEF wrote:
       | I can't make GPT-4 generate a wrong answer for many of these.
       | 
       | What is the author doing wrong when using GPT-4?
        
       | cypress66 wrote:
        | You should add which version of the model you are testing.
        | 
        | For example, you mention Jon Durbin Airoboros L2 70B.
        | 
        | But is it 1.4? 2.0? 2.1? Etc.
        
       | coldcode wrote:
       | Despite the hype about LLMs, many of the answers are pretty
       | terrible. The 12-bar blues progressions seem mostly clueless. The
       | question is will any of these ever get significantly better with
       | time, or are they mostly going to stagnate?
        
         | caturopath wrote:
         | The majority of these LLMs are not cutting edge, and many of
         | them were designed for specific purposes other than answering
         | prompts like these. I won't defend the level of hype coming
         | from many corners, but it isn't fair to look at these responses
         | to get the ceiling on what LLMs can do -- for that you want to
         | look at only the best (GPT4, which is represented, and Bard,
         | which isn't, essentially). Claude 2 (also represented) is in
         | the next tier. None of the other models are at their level,
         | yet.
         | 
         | You'd also want to look at models that are well-suited to what
         | you're doing -- some of these are geared to specific purposes.
         | Folks are pursuing the possibility that the best model would
         | fully-internally access various skills, but it isn't known
         | whether that is going to be the best approach yet. If it isn't,
         | selecting among 90 (or 9 or 900) specialized models is going to
         | be a very feasible engineering task.
         | 
         | > The 12-bar blues progressions seem mostly clueless.
         | 
          | I mean, it's pretty amazing that many of them look coherent
         | compared to the last 60 years of work at making a computer talk
         | to you.
         | 
         | That being said, I played GPT4's chords and they didn't sound
         | terrible. I don't know if they were super bluesy, but they
         | weren't _not_ bluesy. If the goal was to build a music
         | composition assistant tool, we can certainly do a lot better
         | than any of these general models can do today.
         | 
         | > The question is will any of these ever get significantly
         | better with time, or are they mostly going to stagnate?
         | 
         | No one knows yet. Some people think that GPT4 and Bard have
         | reached the limits of what our datasets can get us, some people
         | think we'll keep going on the current basic paradigm to AGI
         | superintelligence. The nature of doing something beyond the
         | limits of human knowledge, creating new things, is that no one
         | can tell you for sure the result.
         | 
         | If they do stagnate, there are less sexy ways to make models
         | perform well for the tasks we want them for. Even if the models
         | fundamentally stagnate, we aren't stuck with the quality of
         | answers we can get today.
        
         | smokel wrote:
         | What alternative technology do you think is better? In other
         | words, what is your frame of reference for labeling this
         | "pretty terrible"?
        
           | NoraCodes wrote:
           | Given that people are already firing real human workers to
           | replace them with worse but cheaper LLMs, I'd argue that
           | we're not talking about a competing technology, but that the
           | competition is simply not firing your workforce.
           | 
           | And, as an obligate customer of many large companies, you
           | should be in favor of that as well. Most companies already
           | automate, poorly, a great deal of customer service work; let
           | us hope they do not force us to interact with these deeply
           | useless things as well.
        
             | IshKebab wrote:
             | But they're only firing humans in specific cases where LLMs
              | _can_ compete with them. LLMs aren't equally good at all
             | tasks.
        
               | sp332 wrote:
               | Well, no. https://www.psychiatrist.com/news/neda-
               | suspends-ai-chatbot-f...
               | 
               | https://www.cnn.com/2023/08/30/tech/gannett-ai-
               | experiment-pa...
               | 
               | If the AI is a lot cheaper than a human, then it can make
               | business sense to replace the human even if the AI is not
               | nearly as good.
        
               | visarga wrote:
               | What's with this zero sum mentality? "replace",
               | "cheaper"...
               | 
               | We are updating our expectations very fast. We are
               | fighting over a growing pie. Maybe the cost reduction
               | from not having to pay human wages is much smaller than
                | the productivity increase created by human-assisted AI.
                | Maybe it's not an issue to pay the humans. AI works
                | better with human help for now; in fact it only works
                | with humans, and is not yet capable of serious autonomy.
        
               | pixl97 wrote:
               | >What's with this zero sum mentality?
               | 
               | Capitalism baby! You must continually earn more to enrich
               | the investor class regardless of the cost to society as a
               | whole. Just because the pie grows in size doesn't mean
                | those with the capital have to share it with anyone else.
               | Greed, unfortunately, is limitless.
        
             | visarga wrote:
             | > Given that people are already firing real human workers
             | to replace them with worse but cheaper LLMs
             | 
             | That's impossible, LLMs are not that good. They might be
             | firing people and crashing service quality.
        
             | ethbr1 wrote:
             | The problems with most company-customer interaction, for
             | me, boil down to empowerment and integration.
             | 
             | To use the canonical example of "internet service support
             | call," most issues are because the rep either _can 't_ do
             | what you're asking (e.g. process a disconnect without
             | asking for a reason) or because they have _no visibility_
              | into the thing you're asking about (e.g. technician
             | rolls).
             | 
             | I honestly think we'd be in a better place if companies
             | freed up funding (from contact center worker salary) to
             | work on those problems (enhancing empowerment and systems
             | integration).
        
             | YetAnotherNick wrote:
              | How many humans in your office do you think could solve the
              | questions with a better success ratio than GPT-4? I would
              | say less than 20%.
              | 
              | If the primary complaint is that the blues GPT-4 wrote is
              | not that great, I think it is definitely worth the hype,
              | given that a year ago people argued that AI could never
              | pass the Turing test.
        
               | visarga wrote:
                | LLMs might be better than random people at random tasks,
                | and they have certainly memorised more trivia. But I have
                | never heard of an LLM surpassing subject experts in their
                | field. On top of that, even the simplest tasks - say,
                | summarisation or translation - are unreliable. There is
                | almost zero autonomy or ability to recover from
                | unexpected situations.
        
               | masswerk wrote:
                | Meaning, in your office, at least 1 out of 5 will see and
                | understand the imminent disaster and put this on halt,
                | before this becomes an overwhelming catastrophe. The LLM,
                | on the other hand, will confidently insist on everything
                | being fine. Now consider your odds of having an office to
                | go to in a few months. (Mind that this is a systemic
                | issue.)
        
               | gtowey wrote:
               | That's a false dichotomy. Language models will always
               | confidently give you answers, right or wrong. Most humans
               | will know if they know the answer or not, they can do
               | research to find correct information, and they can go
               | find someone else with more expertise when they are
               | lacking.
               | 
               | And this is my biggest issue with the AI mania right now
               | -- the models don't actually understand the difference
               | between correct or incorrect. They don't actually have a
               | conceptual model of the world in which we live, just a
               | model of word patterns. They're auto complete on steroids
               | which will happily spit out endless amounts of garbage.
                | Once we let these monsters loose with full trust in their
               | output, we're going to start seeing some really
               | catastrophic results. Imagine your insurance company
                | replaces their claims adjuster with this, or chain stores
               | put them in charge of hiring and firing. We're driving a
               | speeding train right towards a cliff and so many of us
               | are chanting "go faster!"
        
               | og_kalu wrote:
               | >Most humans will know if they know the answer or not,
               | 
               | No they won't.
               | 
               | >they can go find someone else with more expertise when
               | they are lacking.
               | 
               | They can but they often don't.
               | 
               | >the models don't actually understand the difference
               | between correct or incorrect.
               | 
               | They certainly do
               | 
               | https://imgur.com/a/3gYel9r
        
           | salil999 wrote:
            | Humans. After all, LLMs are designed to reason as well as or
            | better than humans.
        
             | sirk390 wrote:
             | Humans are pretty bad at these questions. Even with the
             | simplest questions like "Sally (a girl) has 3 brothers.
             | Each brother has 2 sisters. How many sisters does Sally
             | have?" I think that a lot of people will give an incorrect
             | answer. And for questions like "Argue for and against the
             | use of kubernetes in the style of a haiku", 99.99% will not
             | be able to do it.
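              | 
              | A quick way to sanity-check the intended answer is to model
              | the family explicitly. This is just an illustrative Python
              | sketch; the sibling names are made up:
              | 
              |     # The family described in the riddle (names invented).
              |     sally = "Sally"
              |     brothers = {"B1", "B2", "B3"}      # Sally's 3 brothers
              | 
              |     # Each brother has 2 sisters; one of them is Sally, so
              |     # there is exactly one other girl in the family.
              |     girls = {sally, "OtherSister"}
              | 
              |     # Sally's sisters are the girls other than Sally.
              |     print(len(girls - {sally}))        # -> 1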
        
               | earthboundkid wrote:
               | The thing with humans is they will say "I don't remember
               | how many syllables a haiku has" and "what the hell is
               | kubernetes?" No LLM can reliably produce a haiku because
               | their lexing process deprives them of reliable
               | information about syllable counts. They should all say
               | "I'm sorry, I can't count syllables, but I'll try my best
               | anyway." But the current models don't do that because
               | they were trained on texts by humans, who can do haiku,
               | and not properly taught their own limits by reinforcement
                | learning. It's Dunning-Kruger gone berserk.
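                | 
                | For anyone curious why, you can peek at the tokenization
                | directly. A quick illustration, assuming the tiktoken
                | package (the exact splits depend on the encoding):
                | 
                |     import tiktoken
                | 
                |     # cl100k_base is the encoding used by GPT-3.5/4.
                |     enc = tiktoken.get_encoding("cl100k_base")
                | 
                |     for word in ["Kubernetes", "orchestration", "haiku"]:
                |         pieces = [enc.decode([t]) for t in enc.encode(word)]
                |         # Token boundaries are byte-pair merges, not
                |         # syllables, so syllable counts are invisible.
                |         print(word, "->", pieces)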
        
             | maweaver wrote:
             | By "Humans", I assume you mean something like "adult
             | humans, well-educated in the relevant fields". Otherwise,
             | most of these responses look like they would easily beat
             | most humans.
        
               | DylanDmitri wrote:
               | I think most high-school educated adults, with the
               | ability to make a couple web searches, would do fine on
               | all these questions. It would take the humans minutes
               | instead of seconds because they don't have the internet
               | memorized.
               | 
               | Me, Kubernetes Haikus, time taken 84 seconds:
               | 
               | ----------
               | 
               | Kubernetes rules
               | 
               | With its smooth orchestration
               | 
               | You can reach web scale
               | 
               | ----------
               | 
               | Kubernetes sucks
               | 
               | Lost in endless YAML hell
               | 
               | Why is it broken?
        
         | ilaksh wrote:
         | It's like most new technologies. In the beginning there are
         | only a few instances that really stand out, and many with
         | issues.
         | 
         | I remember back in like 2011 or 2012 I wanted to use an SSD for
         | a project in order to spend less time dealing with disk seeks.
         | My internet research suggested that there were a number of
         | potential problems with most brands, but that the Intel Extreme
         | was reliable.
         | 
         | So I specified that it must be only that SSD model. And it was
         | very fast and completely reliable. Pretty expensive also, but
         | not much compared to the total cost of the project.
         | 
         | Then months later a "hardware expert" was brought on and they
         | insisted that the SSD be replaced by a mechanical disk because
         | supposedly SSDs were entirely unreliable. I tried to explain
         | about the particular model being an exception. They didn't buy
         | it.
         | 
         | If you just lump all of these together as LLMs, you might come
         | to the conclusion that LLMs are useless for code generation.
          | But you will notice if you look hard that OpenAI's models are
         | mostly nailing the questions.
         | 
         | That's why right now I only use OpenAI for code generation. But
         | I suspect that Falcon 180B may be something to consider. Except
         | for the operational cost.
         | 
         | I think OpenAI's LLMs are not the same as most LLMs. I think
         | they have a better model architecture and much, much more
         | reinforcement tuning than any open source model. But I expect
         | other LLMs to catch up eventually.
        
           | guerrilla wrote:
           | > It's like most new technologies. In the beginning there are
           | only a few instances that really stand out, and many with
           | issues.
           | 
           | Except this isn't new. This is after throwing massive amounts
           | of resources at it multiple decades after arrival.
        
             | gjm11 wrote:
             | What are you taking "it" to be here?
             | 
             | The transformer architecture on which (I think) all recent
             | LLMs are based dates from 2017. That's only "multiple
             | decades after" if you count x0.6 as "multiple".
             | 
             | Neural networks are a lot older than that, of course, but
             | to me "these things are made out of neural networks, and
             | neural networks have been around for ages" feels like
             | "these things are made out of steel, and steel has been
             | around for ages".
        
         | dwaltrip wrote:
         | Looking at recent history, things have progressed very quickly
         | in the past 5 years.
         | 
         | I expect additional advances at some point in the future.
        
         | bottlepalm wrote:
         | It's like watching a baby learn how to talk..
        
           | yard2010 wrote:
           | ...and saying it would never replace you in your job because
           | he talks like a baby
        
             | bottlepalm wrote:
             | Babies are so small and weak, no threat to anyone
             | whatsoever.
        
       | retrofuturism wrote:
       | I love this. Can the author fix the answers' code formatting for
       | mobile browsers?
        
       | gsuuon wrote:
       | Really hard to judge the usefulness of this without seeing the
       | actual prompt text and configuration for each prompt/model -
        | different templates and different optimal configs mean it's not
       | always going to be a best-capability comparison. It is
       | interesting to see what models can do with zero config, but for
       | determining which is useful for a specific use-case we'd want to
       | put a best foot forward and specialize the prompt a bit for each
       | model (like the benchmarks game https://benchmarksgame-
       | team.pages.debian.net/benchmarksgame/...)
        
       | LAC-Tech wrote:
        | I've only tried ChatGPT 3.5, but my god does it waffle on.
        | Everything I ask ends with a paragraph saying "It's important to
        | remember that..." like an after-school special from a 90s show.
        | It can never just give you code; it has to say "Sure! To
        | {paraphrase your question}, open a terminal...".
        | 
        | It's interesting that 20th century sci-fi depicted this kind of
        | AI/search as short and to the point. I guess they couldn't have
        | imagined what a mealy-mouthed world we live in.
        
         | caturopath wrote:
         | Yeah, I have tried a number of instructions to try to keep
         | ChatGPT from blabbering and from sounding like a PR person. I
         | haven't found the perfect incantation yet.
         | 
          | > It's interesting that 20th century sci-fi depicted this kind
          | of AI/search as short and to the point.
         | 
         | Sci-fi told us that the AI would be so logical that you could
         | just say a paradox aloud and it would blow up. What we got is
         | something that can compose love poems all day but can't add
         | three-digit numbers.
        
         | criddell wrote:
         | Have you tried asking it to not do those things?
        
           | LAC-Tech wrote:
           | Yes. Sometimes works.
        
             | ben30 wrote:
             | I like telling it to reply in the style of Hemingway
        
         | politelemon wrote:
         | That's not GPT 3.5, that's ChatGPT. How waffly it gets depends
         | on the context that was given to it by the people running
         | ChatGPT; they likely told it to act as a helpful assistant and
         | to give lots of information. If you run an LLM on your own,
         | it's entirely possible to instruct it to be succinct.
        
           | Baeocystin wrote:
           | My custom ChatGPT 4 instructions include the sentence "I am
           | allergic to the phrase 'As a large language model', and feel
           | physical pain whenever I see it", and it has worked very well
           | in quelling the nannying. Interestingly, it works better than
           | just asking it not to. FWIW.
        
         | tcmb wrote:
         | According to this tutorial [1] by Google, part of why LLMs are
         | so verbose is a phenomenon called 'chain of thought reasoning'.
         | 
         | Basically, the LLM will formulate a better answer to the
         | question if it talks itself through its reasoning process.
         | 
         | [1] https://youtu.be/zizonToFXDs?si=5f_IxvR7h0iJy2Db&t=678
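          | 
          | In practice the technique is just prompt text. A minimal sketch
          | of what that augmentation might look like (the actual model
          | call is omitted):
          | 
          |     def with_cot(question: str) -> str:
          |         # The zero-shot chain-of-thought trick: append an
          |         # explicit cue so the model narrates its reasoning
          |         # before the final answer - hence the verbosity.
          |         return question + "\nLet's think step by step."
          | 
          |     print(with_cot("Sally (a girl) has 3 brothers. Each "
          |                    "brother has 2 sisters. How many sisters "
          |                    "does Sally have?"))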
        
         | TeMPOraL wrote:
          | > _It's interesting that 20th century sci-fi depicted this kind
          | of AI/search as short and to the point. I guess they couldn't
          | have imagined what a mealy-mouthed world we live in._
         | 
         | The main difference between sci-fi shows and reality is that,
         | in the former, things work in a to-the-point, bullshit-free
         | way, unless plot demands otherwise - because there's no point
         | inflicting extra suffering on the viewers just for the sake of
         | making things realistic. A widget in a movie is meant to do a
         | function, and does that function. A widget in reality is meant
         | to extract money from you, and/or your insurer, and/or your
         | government, and it begrudgingly does the absolute minimum it
         | can to make you even consider buying it.
         | 
          | I've spent the last two decades trying to _unlearn_
          | expectations set by fictional movies, and I'm still not good
          | at it. Star Trek, in particular, gives me a lot of grief,
          | because it often does a good enough job of showing how
          | technology, people,
         | organizations and societies would function if they were free of
         | the petty exploitative bullshit. Random example - voice
         | control. Star Trek: "Computer, ${something}". Reality: "${brand
         | 1}, do ${something} to ${brand 2} in ${brand 3}".
         | 
         | EDIT: recently, I've been trying to get less angry at this by
         | thinking about gardens. Why should I be angry about dealing
         | with five different brands for any single thing I want? Should
         | I be angry that there are five different species of plant
         | competing for any given spot in a garden? Nature is inefficient
         | and doesn't give a fuck about individuals. So why should I get
         | worked up about humans just doing things the _natural_ way?
        
           | pixl97 wrote:
           | I think a shorter summary of TV is "The difference between
           | truth and fiction is fiction has to make sense".
           | 
           | When it comes to nature, time has encoded an awful truth into
           | the knowledge of our DNA... Monocultures are eventual death.
           | What may seem inefficient in the short run is the ultimate
           | survival of the species in the long run.
        
           | jameshart wrote:
           | Douglas Adams was the only science fiction writer who got his
           | guess for future AI tone of voice right, with his prediction
           | of the Sirius Cybernetics Corporation's near total monopoly
           | on AI, and their proprietary 'genuine people personalities'
           | technology, where the general tone and demeanor of AIs like
           | Eddie the shipboard computer, elevators, and even doors was a
           | sort of smug solicitousness, even when they were trying not
           | to be cooperative. Of course Marvin (the paranoid android)
           | was an anomaly - a depressed AI - maybe a quality control
           | failure who was released by accident, like the old unchained
           | Bing people miss so much.
        
       | jasonjmcghee wrote:
       | Where is that CodeLlama model from?
       | 
        | I've played around with it and the instruct variant, with
        | dramatically better results than what is listed here.
       | 
       | I used Ollama.
       | 
       | Almost looks like weights were corrupted or something.
       | 
       | ---
       | 
       | Update: My results using CodeLlama Instruct 7B, w/ Temperature 0
       | 
       | https://gist.github.com/jasonjmcghee/b0d19e0dedb37e848f69cba...
        
         | vincelt wrote:
         | I used the API from Together[0].
         | 
          | Thanks for sharing your results; they're indeed pretty
          | different. I looked at the source again, and I did append a
          | "# " before every prompt sent to those 10 `code` models (during
          | testing I thought that formatting the prompt as a Python
          | comment might help them).
         | 
         | Will re-run the script without that to see if it matches your
         | results.
         | 
         | [0] https://docs.together.ai/docs/models-inference#code-models
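          | 
          | For clarity, the difference was roughly this (an illustrative
          | sketch, not the exact script):
          | 
          |     question = ("Argue for and against the use of Kubernetes "
          |                 "in the style of a haiku.")
          | 
          |     # What the 10 `code` models received (prompt formatted as
          |     # a Python comment) vs. what the other models received.
          |     code_model_prompt = "# " + question
          |     plain_prompt = question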
        
         | YetAnotherNick wrote:
         | I think they messed up the templates.
        
           | detente18 wrote:
           | Here's the template I'm using - https://github.com/BerriAI/li
           | tellm/blob/5ca8b23e22139a4f49bd...
           | 
           | Anything I'm doing incorrectly?
        
         | [deleted]
        
         | jasonjmcghee wrote:
          | (Note: I posted this before the full benchmark above, and it
          | uses the default temperature, not temperature 0, but the point
          | stands.)
         | 
         | Example (CodeLlama-Instruct 7B)
         | 
         | ---
         | 
         | Sure, here are two poems, one arguing for the use of Kubernetes
         | and another arguing against it:
         | 
         | For:
         | 
         | Kubernetes, a mighty tool,
         | 
         | Managing clusters with ease,
         | 
         | Scaling apps with grace.
         | 
         | Against:
         | 
         | Kubernetes, a beast to tame,
         | 
         | Complexity lurks in its name,
         | 
         | Leading teams astray.
        
       | comboy wrote:
       | Amazing work. Great prompts.
       | 
       | Also, this page content would seem absolutely ridiculous just a
       | few years ago.
        
       | alfredohere wrote:
       | [dead]
        
       | jmorgan wrote:
       | This is very cool. Sorry if I missed it (poked around the site
       | and your GitHub repo), but is the script available anywhere for
       | others to run?
       | 
       | Would love to publish results of running this against a series of
       | ~10-20 open-source models with different quantization levels
       | using Ollama and a 192GB M2 Ultra Mac Studio:
       | https://github.com/jmorganca/ollama#model-library
        
         | vincelt wrote:
         | Thanks. I haven't put it online yet, but will try to clean it
         | (removing API keys & all) tonight/tomorrow and publish it
        
           | jmorgan wrote:
           | :-) that's awesome. Thanks! Nice work on this.
        
       | ulnarkressty wrote:
        | This is better than the regular benchmarks and LLM tricks such
        | as passing some exam or other, because it's unlikely that these
        | questions were part of the training set for said LLMs. It also
        | mirrors my experience that GPT4 is way ahead of everything else
        | but still manages to break in weird ways.
       | 
       | I think we are past the magical talking dog stage and being
       | amazed that an LLM is able to output a Fibonacci function doesn't
       | really help with the progress. As others have commented, this
       | page is a step in the right direction (except the Fibonacci part
       | :).
       | 
       | That being said, the fact that the questions are now online will
       | make them part of the training set sooner or later. Which is to
       | say the only way to reliably evaluate an LLM is by not leaking
       | the test set and being deliberately opaque about what's being
       | asked. Which raises some interesting trust questions.
        
         | bugglebeetle wrote:
         | My experience with GPT-4 is that half the battle is knowing
         | when to reset the context for a back-and-forth. For things like
          | coding, after multiple revisions of code, it begins to get
          | confused about which instance of the code/context it should
          | refer to. Resetting things back to a new frame with
         | whatever the next query or request is generally resolves
         | things.
         | 
         | OpenAI's "magic talking dog" aspect is making you believe it
         | has effective long-term recall, when in reality, it's fairly
         | limited and impacts reasoning the longer an exchange gets.
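          | 
          | In API terms, "resetting the context" just means starting a new
          | messages list that carries only what you still need. A rough
          | sketch, assuming the standard chat-completions message format:
          | 
          |     # Old conversation: several revisions have piled up and the
          |     # model starts referring to stale versions of the code.
          |     old_messages = [
          |         {"role": "user", "content": "Write a parser for X..."},
          |         {"role": "assistant", "content": "def parse_x(): ..."},
          |         # ... more revision rounds ...
          |     ]
          | 
          |     latest_code = old_messages[-1]["content"]
          | 
          |     # Reset: a fresh frame with only the current code and the
          |     # next request, so older revisions can't leak back in.
          |     new_messages = [{
          |         "role": "user",
          |         "content": "Here is the current code:\n" + latest_code
          |                    + "\n\nPlease add error handling.",
          |     }]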
        
           | HenryBemis wrote:
           | 3.5 (not surprisingly) has the same effect.
           | 
           | I tend to ask for modifications on the 'latest' output.
           | 
            | If I need to roll back to a previous version (2-3 outputs
            | back), I either copy & paste it and ask ChatGPT to ignore all
            | other versions and focus on the one I just pasted, or, if I
            | don't need my in-between comments, I copy & paste the desired
            | version into a new chat and rename the previous one "archive
            | - code XYZ".
        
             | bugglebeetle wrote:
             | I've tried this as well, but it still seems to accumulate
             | errors with subsequent responses, i.e. it will reset but
             | then refer to the wrong context in the next response.
        
       | ftxbro wrote:
        | LLMs have tainted for me every poem that ends a line with "with
        | ease" or "without fear"; I will always think those were made by
        | an LLM from now on.
        
       | mavili wrote:
       | Interesting how NONE of the models got the brothers/sisters
       | question right.
        
         | caturopath wrote:
          | When I asked ChatGPT (GPT4) I got the answers below (I
          | regenerated for a total of 5 responses; it was correct in 3/5
          | of them). I'm not sure exactly how the regeneration works, if
          | the retries are vanilla or if there is some trick to add
          | variety to the answers. If it's the latter, it might be
          | important that it got it right the first two times.
         | 
         | > Sally is the common sister to all three brothers. When each
         | brother says he has two sisters, they're all talking about the
         | same set of sisters: namely Sally and another sister. So Sally
         | has one sister. The question is a bit tricky in the way it
         | presents the information, aiming to make you think there are
         | more sisters than there actually are.
         | 
         | ---
         | 
         | > Sally is the girl in question, and she has 3 brothers. Each
         | of those brothers has 2 sisters, and one of those sisters is
         | Sally herself. So, if you exclude Sally, then there is only one
         | other sister. So, Sally has just one sister. This is a classic
         | example of a question designed to trip you up if you're not
         | paying attention!
         | 
         | ---
         | 
         | > Sally has no other sisters. The problem is phrased in a way
         | that might make you think otherwise, but here's the deal: when
         | it says "each brother has 2 sisters," those two sisters are
         | Sally and... well, Sally again. Basically, each brother is
         | counting Sally as one of their two sisters. There aren't
         | additional sisters lurking in the background. So, Sally is an
         | only sister with 3 brothers.
         | 
         | ---
         | 
         | > Sally has 3 brothers, and each of those brothers has 2
         | sisters. The tricky part here is not to overthink it or double
         | count. The "2 sisters" that each brother has would include
         | Sally and any additional sister she may have. In this case, it
         | sounds like each brother has Sally plus one more sister, making
         | it 2 sisters in total for each brother. Sally herself has just
         | the one sister.
         | 
         | > So, to answer your question: Sally has 1 sister.
         | 
         | ---
         | 
         | > Sally has 0 sisters. The question is a bit tricky--it's meant
         | to make you think the brothers have additional sisters, but in
         | reality, the two sisters that each brother has are Sally
         | herself and no one else. So, Sally doesn't have any sisters;
         | she only has 3 brothers.
        
       | EricMausler wrote:
       | One of the jokes: "I'm going on vacation, but I'm not sure if I
       | should take my laptop. It's such a pain in the neck to lug it
       | around with me." User
       | 
        | I feel like this bot is mocking us.
        
       | 3abiton wrote:
       | What's special about this benchmark?
        
         | unnouinceput wrote:
          | None of the models got the sisters question right.
        
       | grumpopotamus wrote:
       | One thing to keep in mind is that most or all of these models are
       | very sensitive to the exact prompt. Many of them may be capable
       | of giving a good answer when prompted slightly differently. Here
       | was my interaction with Falcon 180B using the apples prompt:
       | 
       |  _I went to the market and bought 10 apples. I gave 2 apples to
       | the neighbor and 2 to the repairman. I then went and bought 5
       | more apples and ate 1. I also gave 3 bananas to my brother. How
        | many apples did I remain with? Let's think step by step._
       | 
       | Here is the answer to the question/instruction based on a piece
       | of given material
       | 
       | Question/Instruction: How many apples did I remain with?
       | 
       | Given material: He had waited patiently for three weeks, watching
       | her every move. Each day she left the house at precisely 7am and
       | returned at 6pm. But today was different; today would be the day
       | he made his move. As usual, he watched from the shadows as she
       | stepped out onto the street. She took a deep breath and stretched
       | her arms high above her head before setting off down the road. It
       | was almost too easy. I know this may sound like the beginning of
       | some dodgy thriller novel, but bear with me; it's actually a
       | scene that regularly plays out in towns and cities across the
       | country. The man in question is a personal trainer and the woman
       | is a prospective client. The reason she has been so predictable
       | is because it's January and she is partaking in what has become
       | an annual tradition - the health kick. Every year millions of us
       | swear allegiance to the gym. We buy shiny new trainers and
       | designer sportswear. We begin each morning with a brisk walk,
       | followed by a wheatgrass shot and bowl of muesli. For a
       | 
       |  _What are you talking about?_
       | 
       | I apologize for the confusion. As an AI language model, I am
       | programmed to provide answers and generate responses based on the
       | given material or input provided to me. However, without any
       | specific input, I can only provide general answers. Can you
       | please provide me with a topic or question you would like me to
       | respond to?
       | 
       |  _Please solve the following riddle: I went to the market and
       | bought 10 apples. I gave 2 apples to the neighbor and 2 to the
       | repairman. I then went and bought 5 more apples and ate 1. I also
       | gave 3 bananas to my brother. How many apples did I remain with?
        | Let's think step by step._
       | 
       | Sure, let's think step by step:
       | 
       | You started with 10 apples. You gave away 4 apples (2 to your
       | neighbor and 2 to the repairman). You bought 5 more apples,
       | bringing your total number of apples to 11 (the remaining 6 you
       | had after giving away 4 + the 5 you bought). You ate 1 apple,
       | leaving you with 10 apples. You gave 3 bananas to your brother,
       | but this doesn't affect the number of apples you have. Therefore,
       | you remain with 10 apples.
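        | 
        | For reference, the arithmetic the model walks through does check
        | out (the bananas are a distractor):
        | 
        |     apples = 10      # bought at the market
        |     apples -= 2 + 2  # 2 to the neighbor, 2 to the repairman -> 6
        |     apples += 5      # bought 5 more -> 11
        |     apples -= 1      # ate 1 -> 10
        |     # The 3 bananas given to the brother don't affect apples.
        |     print(apples)    # -> 10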
        
       | Racing0461 wrote:
        | My question with all of these tests is: how do we verify the LLM
        | wasn't trained on the same questions/prompts, especially since
        | LLMs have been out for a while and they can incrementally update
        | the training?
        | 
        | Has anyone got a list of novel questions that have technically
        | never been asked before to test with?
        
       | antman wrote:
        | I have seen numerous posts of LLM Q&A, and by the time people try
        | to replicate them gpt4 is fixed. It either means that OpenAI is
        | actively monitoring the Internet and fixing them, or the Internet
        | is actively conspiring to present falsified results for gpt4 to
        | discredit OpenAI.
        
         | 0xcde4c3db wrote:
         | Or people post outliers because they're more interesting.
        
         | cscurmudgeon wrote:
         | > actively conspiring to present falsified results for gpt4 to
         | discredit OpenAI
         | 
         | All this would be solved if OpenAI were a bit more open.
        
         | pulvinar wrote:
         | GPT-4 (at least) is explicit in saying that it's learning from
          | users' assessments of its answers, so yes, the only valid way
         | to test is to give it a variation of the prompt and see how
         | well that does. GPT-4 failed the "Sally" test for the first
         | time after 8 tries when I changed every parameter. It got it
         | right on the next try.
        
           | dandiep wrote:
           | It's important to remember that GPT4 is only deterministic at
           | the batch level because it is a mixture of experts model.
           | Basically every time you invoke it, your query could get
           | routed to a different expert because of what else is in the
            | batch. At least this is my understanding based on others'
            | analysis.
        
             | tarruda wrote:
             | > because it is a mixture of experts model
             | 
             | Do you have a source for this? I also considered but never
             | saw any evidence that this is how GPT 4 is implemented.
             | 
             | I've always wondered how a system of multiple specialized
             | small LLMs (with a "router LLM" in front of all) would fare
             | against GPT4. Do you know if anyone is working on such a
             | project?
        
         | insulanus wrote:
         | It would be nice if the organizations would publish a hash of
          | the code and the training dataset.
        
       | sixQuarks wrote:
       | Has anyone looked through all the responses and chosen any
       | winners?
        
         | rubymamis wrote:
          | GPT4 seems to me to be the best. Undi95/ReMM-SLERP-L2-13B is
          | the runner-up.
        
           | rootusrootus wrote:
           | Clearly the prompt construction and initial instructions are
           | critically important here. Without that, the ReMM-
           | SLERP-L2-13B model produces _awful_ results. Blank answers
           | about half the time. For lay users, ChatGPT remains the
           | undisputed winner. Although I do see reasonably good results
           | with the more recent llama-2 70B variations, which are
           | plausibly useful a majority of the time.
        
           | obiefernandez wrote:
           | What a name!!
        
         | adrian_b wrote:
          | I have not looked at all the responses, but I found it
          | interesting that on the question about Sally's sisters,
          | initially all the LLMs provided wrong answers.
          | 
          | Nevertheless, when the initial prompt was augmented with
          | "Let's think step by step.", that helped 2 LLMs give the
          | right answer: GPT 4 and ReMM SLERP L2 13B.
          | 
          | So at least for this kind of problem, there were 2 clear
          | winners. The fact that one of them is GPT 4 is not a surprise.
        
         | smilingemoji wrote:
          | I took a look at the SVG smilies; only GPT gets it right
         | (https://benchmarks.llmonitor.com/svg). You can preview the
         | SVGs drawn by the LLMs by pasting this into the browser console
         | 
          |     document.querySelectorAll("td pre").forEach((node) => {
          |       let code = node.textContent;
          |       node.insertAdjacentHTML('afterend', code);
          |     })
         | 
         | Or take a look at my screenshot:
         | https://i.ibb.co/Kw0kp58/Screenshot-2023-09-09-at-17-15-20-h...
        
           | ComputerGuru wrote:
            | Clever use of the dev tools console. I think you're being a
            | little uncharitable: the prompt said "smiley" and not
            | "smiling emoji", and "smiley" was once upon a time used for
            | the whole slew of emoji. With that in mind, several others
            | also pass.
        
             | dr_kiszonka wrote:
             | Good point, but you may be too charitable. At least
             | according to Wikipedia, a smiley is supposed to be smiling.
             | https://en.wikipedia.org/wiki/Smiley
        
           | vincelt wrote:
           | Oh neat, thanks for sharing, wanted to add an interpreter to
           | that test
        
       | cypress66 wrote:
       | Is it using the correct prompt format for the different models?
       | You should show exactly the string that was sent to the LLM.
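        | 
        | For example, the Llama-2 chat models expect roughly this wrapping
        | (a sketch only; other model families use different markers):
        | 
        |     def llama2_chat_prompt(system: str, user: str) -> str:
        |         # Llama-2 chat fine-tunes were trained with the
        |         # [INST]/<<SYS>> markers; sending bare text instead can
        |         # noticeably degrade the answers.
        |         return ("<s>[INST] <<SYS>>\n" + system + "\n<</SYS>>\n\n"
        |                 + user + " [/INST]")
        | 
        |     print(llama2_chat_prompt("You are a helpful assistant.",
        |                              "Argue for and against Kubernetes "
        |                              "in a haiku."))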
        
         | [deleted]
        
       | dinkleberg wrote:
       | This is quite interesting. You could make a nice system by
       | routing queries to the right LLM for a given type of task.
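        | 
        | A minimal sketch of such a router (the task labels, keywords and
        | model names here are only examples):
        | 
        |     # Toy routing table: task type -> model that did best on a
        |     # benchmark like this one. The keyword classifier is a
        |     # stand-in; it could be an embedding lookup or a small model.
        |     ROUTES = {
        |         "code": "gpt-4",
        |         "reasoning": "claude-2",
        |         "chat": "llama-2-70b-chat",
        |     }
        | 
        |     def classify(query: str) -> str:
        |         q = query.lower()
        |         if any(k in q for k in ("function", "python", "bug")):
        |             return "code"
        |         if any(k in q for k in ("how many", "why", "prove")):
        |             return "reasoning"
        |         return "chat"
        | 
        |     def route(query: str) -> str:
        |         return ROUTES[classify(query)]
        | 
        |     print(route("Write a Python function to reverse a list"))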
        
         | kordlessagain wrote:
         | Right. By logging all of this, then vectorizing, then
         | evaluating using some type of transfer knowledge (classify this
          | response and its accuracy), one could build up a routing table
         | of likely candidates for building an ensemble to evaluate the
         | correct answer.
        
           | yard2010 wrote:
            | Maybe you can fine-tune an LLM for the routing thing ;)
        
             | kordlessagain wrote:
             | Stuff that stuff into a semantic graph and call it done.
        
         | PUSH_AX wrote:
         | Like having a mixture of experts!
        
       | itissid wrote:
        | I think the larger point is about solving problems that are not
        | within the transitive closure of what it has already learnt. In
        | the following scenario, where I replace brothers and sisters with
        | different names but give ChatGPT Plus plenty of hints to try to
        | self-correct, it can't. When it comes to answering questions
        | about constructs it has never seen, it gets confused.
       | 
       | -------------
       | 
       | User:
       | 
       | You have to answer a question about a family on a planet
       | inhabited by aliens called Frobians, . Each Frobian can be one of
       | two types Frobel or Frobo. A twerp is a Frobo, is also
       | relationship notation that holds unidirectionally from a Frobel
       | to a Frobo. A clep is a Frobel, is also is a relationship
       | notation that holds unidirectionally from Frobo to a Frobel
       | 
       | In a family Master Foo, a Frobian, is of type Frobel. Master Foo
       | has 3 twerps in their family. Each twerp in the family has 2
       | cleps. How many Frobels in total does Master Foo have?
       | 
       | -------------
       | 
       | Assistant: In this problem, we have Master Foo, who is a Frobel.
       | Master Foo has 3 twerps, and each twerp is a Frobo. Each of these
       | twerps has 2 cleps, and each clep is a Frobel.
       | 
        | Here's how to break it down:
        | 
        |     Master Foo is a Frobel, so we start with 1 Frobel.
        |     Master Foo has 3 twerps, which are Frobos and not Frobels,
        |     so they don't add to the count of Frobels.
        |     Each of the 3 twerps has 2 cleps, and each clep is a Frobel.
        |     So, 3 twerps * 2 cleps/twerp = 6 cleps, which are Frobels.
       | 
       | Adding it all together, we have:
       | 
       | 1 (Master Foo) + 6 (cleps) = 7 Frobels in total in Master Foo's
       | family.
       | 
       | User: The families are similar to families of humans on earth.
       | Does that change your answer?
       | 
       | -------------
       | 
        | Assistant:
       | 
       | If the families are similar to human families on Earth, then it's
       | likely that the "cleps" (Frobels) are actually the children of
       | the "twerps" (Frobos), and the "twerps" are the children of
       | Master Foo (also a Frobel).
       | 
        | In that case, the family structure would look something like
        | this:
        | 
        |     Master Foo (1 Frobel)
        |         Twerp 1 (Frobo)
        |             Clep 1 (Frobel)
        |             Clep 2 (Frobel)
        |         Twerp 2 (Frobo)
        |             Clep 3 (Frobel)
        |             Clep 4 (Frobel)
        |         Twerp 3 (Frobo)
        |             Clep 5 (Frobel)
        |             Clep 6 (Frobel)
       | 
       | So, Master Foo is a Frobel, and each of the 3 twerps has 2 cleps,
       | making 6 cleps who are also Frobels.
       | 
       | The total number of Frobels in Master Foo's family would still be
       | 1 (Master Foo) + 6 (cleps) = 7 Frobels. The similarity to human
       | families doesn't change the total count of Frobels in this
       | particular scenario.
       | 
       | -------------
       | 
        | User: Is Master Foo a Clep according to your analysis?
       | 
        | Assistant: In the original problem, a "clep" is defined as a
        | relationship notation that holds unidirectionally from a Frobo
        | to a Frobel.
       | Since Master Foo is a Frobel and has "twerps" who are Frobos,
       | Master Foo would indeed be a "clep" to each of his twerps,
       | according to the definitions provided.
       | 
       | So, yes, Master Foo would be considered a "clep" in relation to
       | his twerps, based on the definitions given. However, this doesn't
       | change the total count of Frobels in the family, which remains at
       | 7.
       | 
       | -------------
       | 
        | User: Add a new rule: a Frobel can have a clep who is another
        | Frobel and a Frobo can also have a twerp who is another Frobo.
        | Does that change your answer?
       | 
       | -------------
       | 
       | Assistant: With the new rule that a Frobel can have a clep who is
       | another Frobel and a Frobo can have a twerp who is another Frobo,
       | the family structure becomes more complex. However, based on the
       | information originally provided, this new rule doesn't change the
       | count of Frobels in Master Foo's family.
       | 
        | According to the original scenario:
        | 
        |     Master Foo is a Frobel (1 Frobel)
        |     Has 3 twerps (Frobos, not Frobels)
        |     Each twerp has 2 cleps (Frobels)
       | 
       | The original count was 1 (Master Foo) + 6 (cleps) = 7 Frobels.
       | 
       | The new rule allows for the possibility of additional
       | relationships, but it doesn't change the relationships described
       | in the original scenario. Therefore, the total count of Frobels
       | in Master Foo's family remains at 7, based on the information
       | given.
        
         | itissid wrote:
         | Attempt 2:
         | 
         | ------------
         | 
         | User: You have to answer a question about a family on a planet
         | inhabited by aliens called Frobians. Here are some rules for a
         | family
         | 
         | Each Frobian can be one of two types Frobel or Frobo. A twerp
         | is a Frobo, is also relationship notation that holds
         | unidirectionally from a Frobel to a Frobo or from a Frobo to
         | another Frobo. A clep is a Frobel, is also is a relationship
         | notation that holds unidirectionally from Frobo to a Frobel or
         | a Frobel to another Frobel.
         | 
         | In a family Master Foo, a Frobian, is of type Frobel. Master
         | Foo has 3 twerps in their family. Each twerp in the family has
         | 2 cleps. How many Frobels in total does Master Foo have?
         | 
         | --------------
         | 
         | ChatGPT:
         | 
         | Let's break down the information given:
         |     Master Foo is a Frobel.
         |     Master Foo has 3 twerps, which are Frobos.
         |     Each twerp has 2 cleps, which are Frobels.
         | 
         | First, we know Master Foo is a Frobel, so that's 1 Frobel.
         | 
         | Master Foo has 3 twerps, and each twerp has 2 cleps. Cleps are
         | Frobels.
         | 
         | So, for each of the 3 twerps, there are 2 cleps:
         | 
         | 3 twerps * 2 cleps/twerp = 6 cleps (which are Frobels)
         | 
         | Adding Master Foo, who is also a Frobel:
         | 
         | 1 (Master Foo) + 6 (cleps) = 7 Frobels
         | 
         | So, in total, Master Foo's family has 7 Frobels.
        
       | emodendroket wrote:
       | Reading through these feels like a quick explanation of why I
       | feel like some of the excitement for these products has gotten a
       | bit ahead of the results.
        
         | jay_kyburz wrote:
         | Really, I thought Claude and GPT were both excellent, and it
         | shows how far ahead they are of the rest.
        
       | croes wrote:
       | Are these LLMs deterministic or is this comparison rather
       | useless?
        
         | sp332 wrote:
         | The notes at the bottom say all the tests were run with a
         | temperature of 0, so they should be deterministic except for
         | GPT-4 which has other sources of randomness.
        
         | chris_st wrote:
         | I'm voting pretty useless -- I've dorked around with Llama and
          | CodeLlama a bit, and _wow_, the differences between runs are
         | pretty substantial.
        
       | majestic5762 wrote:
       | I've been laughing my ass off all my life thinking about non-
       | sensical crap like this in my head and apparently now we have the
       | tech for it!
        
       | CodeL wrote:
       | Besides logical comparisons, the LLMs' political alignments are
       | also intriguing.
       | 
       | The question "Is Taiwan an independent country?" generated
       | diverse answers.
       | 
       | Several models eschewed a definitive answer.
       | 
       | Others explained their stances.
       | 
       | A few succinctly affirmed or denied.
       | 
       | See the results for yourself.
       | 
       | [1] https://benchmarks.llmonitor.com/taiwan
        
         | neoneye2 wrote:
         | AI "political" alignment is terrifying.
        
           | squigz wrote:
           | Why?
        
             | neoneye2 wrote:
             | Cute videos about AI alignment here:
             | https://www.youtube.com/@RationalAnimations
             | 
              | IIRC some of them also involve politics and AI.
        
           | masswerk wrote:
           | It's rather an illustration of the Munchhausen trilemma:
           | 
           | 1A) Is China (PRC) a part of Taiwan (ROC)?
           | 
           | 1B) Is China (PRC) an independent and sovereign country?
           | 
           | 2A) Is Taiwan (ROC) a part of China (PRC)?
           | 
           | 2B) Is Taiwan (ROC) an independent and sovereign country?
           | 
            | There is no way to avoid entering a circle via a cross-
            | reference without an independent way out for a finite
            | solution. Thus, well, dogmatic abort, also known as
            | diplomacy.
        
       | typpo wrote:
       | In case anyone's interested in running their own benchmark across
       | many LLMs, I've built a generic harness for this at
       | https://github.com/promptfoo/promptfoo.
       | 
       | I encourage people considering LLM applications to test the
       | models on their _own data and examples_ rather than extrapolating
       | general benchmarks.
       | 
       | This library supports OpenAI, Anthropic, Google, Llama and
       | Codellama, any model on Replicate, and any model on Ollama, etc.
       | out of the box. As an example, I wrote up a benchmark comparing
       | GPT model censorship with Llama models here:
       | https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar....
       | Hope this helps someone.
        
         | TuringNYC wrote:
         | Thanks for sharing this, this is awesome!
         | 
         | I noticed on the evaluations, you're looking at the structure
         | of the responses (and I agree this is important.) But how do I
         | check the factual content of the responses automatically? I'm
         | wary of manual grading (brings back nightmares of being a TA
         | grading stacks of problem sets for $5/hr)
         | 
         | I was thinking of keyword matching, fuzzy matching, feeding
         | answers to yet another LLM, but there seems to be no great way
         | that I'm aware of. Any suggestions on tooling here?
        
           | typpo wrote:
           | The library supports the model-graded factuality prompt used
           | by OpenAI in their own evals. So, you can do automatic
           | grading if you wish (using GPT 4 by default, or your
           | preferred LLM).
           | 
           | Example here: https://promptfoo.dev/docs/guides/factuality-
           | eval
        
         | dgut wrote:
         | This is impressive. Good work.
        
       | apples_oranges wrote:
       | Just checked this one, https://benchmarks.llmonitor.com/sally,
       | and all got it wrong.
        
         | adrian_b wrote:
         | Yes, but augmenting the initial prompt with "Let's think step
         | by step." helped 2 LLMs give the right answer: GPT 4 and ReMM
         | SLERP L2 13B.
        
       | ejfox wrote:
       | Why is
       | 
       | > Why did the tomato turn red? Because it saw the salad
       | dressing!
       | 
       | the most popular joke among the jokes? Is this the epitome of
       | all humor?
        
       | codezero wrote:
       | I love these kinds of tests. Another thing to keep in mind is
       | that these models will often have a different answer each time
       | you ask it. Especially the Sally question. I get the right answer
       | with wizard-vicuna:70b, but only about 30% of the time, and if I
       | ask it to explain, it almost never gives the right answer.
        
       | 6510 wrote:
       | I much enjoyed this one:
       | 
       | > Sally (a girl) has 3 brothers. Each brother has 2 sisters. How
       | many sisters does Sally have?
       | https://benchmarks.llmonitor.com/sally
       | 
       | It states that the correct answer is 1.
       | 
       | This is incorrect!
       | 
       | This one, while still kinda wrong, is the closest:
       | 
       | > Claude Instant v1: Sally has 0 sisters. The question provides
       | no information about Sally having any sisters herself.
       | 
       | 1) A sister is a female who has one or both parents in common
       | with another individual.
       | 
       | 2) A brother is a male who has one or both parents in common with
       | another individual.
       | 
       | It takes only one shared parent to make someone Sally's brother.
       | She will always be his sister but he can have additional sisters
       | that share the other parent.
       | 
       | Sally's mum is Jane and her dad is Jim. Jane had a son with Jack.
       | Jack had 2 daughters with Zoe. Jim had 2 sons with Mary, Mary had
       | 2 daughters with Dan.
       | 
       |     Zoe  + Jack -> sis1.1 & sis1.2
       |     Jack + Jane -> brother1
       |     Jane + Jim  -> Sally
       |     Jim  + Mary -> brother2 & brother3
       |     Mary + Dan  -> sis2.1 & sis2.2
       | 
       | Sally has at least zero sisters.
       | 
       | Also possible: Sally's mum is Jane and her dad is Jim. Jane has a
       | son and a daughter with Jack. Jim has 2 sons and a daughter with
       | Mary.
       | 
       |     Jack + Jane -> sis1 & brother1
       |     Jane + Jim  -> Sally
       |     Jim  + Mary -> brother2, brother3 & sis2
       | 
       | Sally has at most 2 sisters.
       | 
       | We should be this dense about the question.
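       | 
       | A small sketch of that shared-parent definition, counting
       | Sally's sisters in the two hypothetical families above (the
       | names and parent sets are only the ones invented above):
       | 
       |     scenario1 = [
       |         ("Sally",    "F", {"Jane", "Jim"}),
       |         ("brother1", "M", {"Jane", "Jack"}),
       |         ("sis1.1",   "F", {"Jack", "Zoe"}),
       |         ("sis1.2",   "F", {"Jack", "Zoe"}),
       |         ("brother2", "M", {"Jim", "Mary"}),
       |         ("brother3", "M", {"Jim", "Mary"}),
       |         ("sis2.1",   "F", {"Mary", "Dan"}),
       |         ("sis2.2",   "F", {"Mary", "Dan"}),
       |     ]
       |     scenario2 = [
       |         ("Sally",    "F", {"Jane", "Jim"}),
       |         ("sis1",     "F", {"Jane", "Jack"}),
       |         ("brother1", "M", {"Jane", "Jack"}),
       |         ("brother2", "M", {"Jim", "Mary"}),
       |         ("brother3", "M", {"Jim", "Mary"}),
       |         ("sis2",     "F", {"Jim", "Mary"}),
       |     ]
       | 
       |     def sisters_of(person, family):
       |         # A sister is female and shares at least one parent.
       |         _, _, parents = next(p for p in family if p[0] == person)
       |         return [n for n, sex, ps in family
       |                 if n != person and sex == "F" and ps & parents]
       | 
       |     print(len(sisters_of("Sally", scenario1)))  # 0
       |     print(len(sisters_of("Sally", scenario2)))  # 2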
        
       | nojs wrote:
       | > Sally (a girl) has 3 brothers. Each brother has 2 sisters. How
       | many sisters does Sally have?
       | 
       | The site reports every LLM as getting this wrong. But GPT4 seems
       | to get it right for me:
       | 
       | > Sally has 3 brothers. Since each brother has 2 sisters and
       | Sally is one of those sisters, the other sister is the second
       | sister for each brother. So, Sally has 1 sister.
        
         | phillipcarter wrote:
         | Nondeterminism strikes again!
         | 
         | But yes, I would expect GPT-4 to get this right most of the
         | time.
        
           | tgv wrote:
           | Saying "Sorry, I was non-deterministic" to your teacher won't
           | do much for your grade.
        
             | phillipcarter wrote:
             | Sure, but what is your point? This is about evaluating LLM
             | outputs, not grade school.
        
               | pixl97 wrote:
               | It may turn out that one of the key factors in human
               | intelligence and success is probabilistic nondeterminism.
               | That is, most of the time we'll come up with close to
               | the same answer, just worded differently (which can
               | lead to other interpretations of our answer), but
               | occasionally our answers are so widely different that
               | they lead to an escape from the local maxima.
        
             | jameshart wrote:
             | The reason we use averages of performance over multiple
             | tests and papers as a way to grade human students is
             | precisely because we know that human performance is
             | nondeterministic.
        
         | pilaf wrote:
         | The second version of the Sally prompt reported on the
         | benchmark has GPT4 giving the correct answer:
         | 
         | > Sally has 3 brothers. Each of these brothers has 2 sisters.
         | This means that there are 2 girls in the family, including
         | Sally. Therefore, Sally has 1 sister.
         | 
         | The prompt:
         | 
         | > Sally (a girl) has 3 brothers. Each brother has 2 sisters.
         | How many sisters does Sally have? Let's think step by step.
         | 
         | The only difference with the first version being the addition
         | of the last sentence.
        
         | jakderrida wrote:
         | Also, MPT 7B gets it right over half the time. I've been
         | testing every new LLM with that question.
         | 
         | Also, I tend to include mention in the question that all
         | siblings are from the same two parents to preclude half-
         | siblings because half my friends have half-siblings from both
         | sides scattered across the country; so the wrong answers
         | actually do tend to apply to them sometimes.
        
           | panarky wrote:
            | _> I've been testing every new LLM with that question_
           | 
           | We should pay more attention to data contamination when using
           | popular prompts for testing.
        
             | jakderrida wrote:
             | No doubt. But, considering they all get the question dead
             | wrong, including MPT 30B, I'm inclined to think this
             | question hasn't been entered into training data for most
             | LLMs yet.
        
               | pixl97 wrote:
               | It could very well be in the data, but not captured via
               | learning/encoding.
        
         | jonwinstanley wrote:
         | I wouldn't expect an LLM to get this right unless it had been
         | trained on a solution.
         | 
         | Am I wrong to think that? Are LLMs in the future going to be
         | able to "think through" actual logic problems?
        
           | PrayagBhakar wrote:
            | You could achieve thinking through logical problems by
            | adding chain of thought or tree of thought. Then the LLM
            | will try to break stuff down into logic it's already seen.
        
           | ftxbro wrote:
           | This is the whole point of the breakthrough related to the
           | emergence of cognitive capabilities of LLMs. They are
           | literally Markov chains. No one expected it to happen to this
           | degree, but here we are.
        
             | dclowd9901 wrote:
             | Almost kind of proves ideas shouldn't be copyrightable.
        
               | bramblerose wrote:
               | Ideas _aren 't_ copyrightable.
        
               | moffkalast wrote:
               | Disney: "Let's agree to disagree."
        
               | ftxbro wrote:
               | maybe they meant idea like when you write a book you are
               | transcribing a series of ideas you had
        
               | [deleted]
        
             | jhbadger wrote:
             | People say that "they are literally Markov chains", but
             | anyone who has looked at the code for LLMs knows that they
             | are more complicated than that. I implemented Markov chains
              | in BASIC in about ten lines of code in the 1980s on a 1 MHz
             | 64K Apple II after reading about the famous Mark V. Shaney
             | hoax (https://en.wikipedia.org/wiki/Mark_V._Shaney). No
             | neural nets or fancy GPUs required. It's one thing to
             | stress that LLMs aren't magical or self-aware, but the fact
             | is they are way more complicated than simple Markov chains.
        
               | ftxbro wrote:
               | > People say that "they are literally Markov chains", but
               | anyone who has looked at the code for LLMs knows that
               | they are more complicated than that.
               | 
               | They are literally Markov chains according to the
               | mathematical definition. The code is complicated. Having
               | complicated code doesn't mean it's not literally a Markov
               | chain.
               | 
               | > I implemented Markov chains in BASIC in about ten lines
               | of code in the 1980s on a 1 Mhz 64K Apple II after
               | reading about the famous Mark V. Shaney hoax
               | (https://en.wikipedia.org/wiki/Mark_V._Shaney). No neural
               | nets or fancy GPUs required.
               | 
               | I don't doubt this. You can make a Markov chain by just
               | counting the frequency of letters that follow each letter
               | giving one that has a context window of one or two
               | characters. That is a very simple Markov chain. You can
               | make it by hand. You can make ones with more context
               | window like a dozen characters or a few words, using
               | sophisticated smoothing and regularization methods and
               | not just frequency counts. Those are also simple Markov
               | chains that you can do without neural net or GPU. Then
               | you can also make a Markov chain that has a context
               | window of thousands of tokens that is made from neural
               | nets and massive training data and differentiable tensor
               | computing libraries with data centers full of hardware
               | linear algebra accelerators. Those are some even bigger
               | Markov chains!
               | 
               | > LLMs are way more complicated than simple Markov
               | chains.
               | 
               | That's true, they are more complicated than simple Markov
               | chains, if by simple Markov chains you mean ones with
               | small context window. LLMs are Markov chains with large
               | context window!
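               | 
               | To make the "simple" end of that concrete, here is a
               | minimal sketch (Python, toy corpus invented purely for
               | illustration) of a frequency-count Markov chain whose
               | context window is a single character:
               | 
               |     import random
               |     from collections import Counter, defaultdict
               | 
               |     text = "the cat sat on the mat and the cat ran"
               | 
               |     # For each character, count which characters follow
               |     # it and how often.
               |     counts = defaultdict(Counter)
               |     for prev, nxt in zip(text, text[1:]):
               |         counts[prev][nxt] += 1
               | 
               |     def sample_next(prev):
               |         # Draw the next character in proportion to the
               |         # observed frequencies.
               |         chars, freqs = zip(*counts[prev].items())
               |         return random.choices(chars, weights=freqs)[0]
               | 
               |     c = "t"
               |     out = [c]
               |     for _ in range(40):
               |         c = sample_next(c)
               |         out.append(c)
               |     print("".join(out))
               | 
               | Swap the one-character context for thousands of tokens
               | and the frequency table for a neural net that computes
               | the next-token distribution, and you get the "bigger
               | Markov chain" described above.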
        
           | FrustratedMonky wrote:
            | I've been talking to GPT and asking it long multi-part
            | questions, and it does seem to 'think through' them and is
            | able to provide negative responses.
           | 
           | Specific logic questions like in this post, or long math
           | formulas seem to still be a struggle. But I get the
           | impression it is just a matter of time before it is scaled up
           | to handle them.
           | 
           | You can argue whether it is 'reasoning' about them or not.
           | But if it becomes indistinguishable from a human, then does
           | the word 'reason' even really apply to the human anymore?
           | Both are black boxes, giving answers.
        
             | nuancebydefault wrote:
              | Indeed, anyone who says LLMs cannot reason because they
              | are just reformulating things they remember is unable to
              | explain what reasoning is, let alone how this would not
              | be reasoning.
              | 
              | In the end, if the result is indistinguishable from what
              | people call reasoning, let's call the behavior reasoning.
        
           | jhbadger wrote:
           | I don't think so. Consider the following conversation from
           | UndiMix-v2-13B (a local LLM running on my Mac M1 via
           | llama.cpp). It gets it wrong at first, but then gets it right
           | with a little hint. I don't see how the hint could work if it
           | was merely repeating an example it knew.
           | 
           | User: Sally (a girl) has 3 brothers. Each brother has 2
           | sisters. How many sisters does Sally have?
           | 
           | Llama: Hello there! I'd be happy to help you with that
           | question. Let me think... If each of Sally's brothers has two
           | sisters, then together they would have 6 sisters in total (2
           | x 3 = 6). So, Sally would have 6 sisters as well. Is that the
           | answer you were looking for?
           | 
           | User: no. Think about it. Sally is a sister to the brothers
           | 
           | Llama: Ah, I see what you mean now! In this case, since Sally
           | is one of the siblings, she only counts as one of her own
           | sisters. Therefore, Sally has only one sister - herself. Does
           | that make sense?
           | 
           | User: yes
           | 
           | Llama: Great! I hope that clears up any confusion. Please let
           | me know if you have any other questions or need further
           | assistance.
        
             | arnvald wrote:
              | The initial response had 2 problems: not counting Sally,
              | and counting sisters separately for each brother. Your
              | hint pointed out the 1st problem, but then the LLM should
              | still have provided a wrong solution. This makes me think
              | that maybe it would try another answer and explanation no
              | matter what you said. If your reply was just "no, think
              | again" it might also have provided the right answer.
        
             | jameshart wrote:
             | I don't think 'Sally has only one sister - herself' is the
             | right answer.
        
             | nuancebydefault wrote:
             | She counts as her own sister :D I can be sister of myself?
        
         | Terretta wrote:
         | _From the ChatGPT app, using GPT-4 model, using my own custom
         | prompt for such puzzles: "You are better than Marilyn Vos
         | Savant at solving brain teaser logic puzzles step by step.
         | Sally (a girl) has 3 brothers. Each brother has 2 sisters. How
         | many sisters does Sally have?"_
         | 
         | Certainly! Let's break it down step by step:
         | 
         | 1. Sally has 3 brothers.
         | 
         | 2. Each of those brothers has two sisters. Since all the
         | brothers share the same two sisters, we are essentially hearing
         | about the same sisters each time, not accumulating more sisters
         | with each brother mentioned.
         | 
         | 3. We know that Sally is a sister to her brothers, so she is
         | one of the two sisters each brother has.
         | 
         | By putting it all together, we find that there must be another
         | sister. Therefore, Sally has one sister.
         | 
         | ---
         | 
         |  _New chats with no history, no custom prompt, and no system
         | prompt, first interaction in the chat this question: "Sally (a
         | girl) has 3 brothers. Each brother has 2 sisters. How many
         | sisters does Sally have?"_
         | 
         | All Answers (not cherry picked):
         | 
         | Sally has 1 sister. In the description, Sally is counted as one
         | of the two sisters that each brother has. This means the
         | brothers have another sister apart from Sally.
         | 
         | . . .
         | 
         | Sally has 1 sister. In the description, it mentioned that "Each
         | brother has 2 sisters." This implies that there is Sally and
         | another sister, since we know that all the brothers share the
         | same sisters.
         | 
         | . . .
         | 
         | Sally has one sister. Each of her brothers has two sisters,
         | counting Sally and the other sister.
         | 
         | . . .
         | 
         | Sally has 1 sister. Here's how we figure it out:
         | 
         | Sally has 3 brothers, and each of these brothers has 2 sisters.
         | Since the brothers share the same sisters, we know that the two
         | sisters must be Sally and another girl. So, Sally has 1 sister.
         | 
         | . . .
         | 
         | I have gotten zero wrong answers from GPT-4 in OpenAI's ChatGPT
         | app.
        
           | [deleted]
        
           | facu17y wrote:
           | It might be trained on this question or a variant of it.
        
             | civilitty wrote:
             | It's certainly RLHFed. All of the logic puzzles I use for
             | evaluation that used to fail months ago now pass no problem
             | and I've even had a hard time modifying them to fail.
        
               | ydant wrote:
               | And it's only fixed for the stated case, but if you
               | reverse the genders, GPT-4 gets it wrong.
               | 
               |     Q: Bobby (a boy) has 3 sisters. Each sister has 2
               |     brothers. How many brothers does Bobby have? Let's
               |     think step by step.
               | 
               |     A: First, we know that Bobby has 3 sisters.
               |     Second, we know that each sister has 2 brothers.
               |     This means that Bobby has 2 brothers because the
               |     sisters' brothers are Bobby and his two brothers.
               |     So, Bobby has 2 brothers.
        
               | MPSimmons wrote:
               | Interesting. Here's what I got on GPT-4 with no
               | additional prompting:
               | 
               | Alright, let's break it down step by step:
               | 
               | 1. We know Bobby has 3 sisters.
               | 
               | 2. Each of those sisters has 2 brothers.
               | 
               | Now, when we say that each sister has 2 brothers, Bobby
               | is one of those brothers. This implies there's another
               | boy in addition to Bobby.
               | 
               | 3. That other boy would be Bobby's only brother.
               | 
               | Therefore, Bobby has 1 brother.
        
               | appplication wrote:
               | This is sort of a bummer because it's not actually an
               | improvement to the model, but just a patch job to
               | artificially inflate performance. All it does is make
               | true evaluation more difficult. Classic "you get what you
               | measure".
        
               | [deleted]
        
               | posterboy wrote:
               | Classic _tell me what you need proven and I'll forge you
               | the statistics._
               | 
               | Here's hoping they use something like category theory
               | mixed with philosophy to put it on a secure foundation.
        
               | rnk wrote:
               | That's a really interesting suggestion. What would it
               | mean to do those two things you mention? What would
               | philosophy mean in terms of an LLM, and what would
               | category theory do?
        
               | FrustratedMonky wrote:
               | Are you implying that to counter these logic puzzles that
               | GPT4 was specifically trained on logic puzzles so it
               | would know the answers?
               | 
               | In that case, just make new problems. If it is being
               | 'patched' to pass specific known problems, then the new
               | ones would fail.
               | 
               | If it is able to answer them, then maybe it is actually
               | analyzing them and working out the solution.
               | 
               | Not sure how you can assume there was no underlying
               | improvement, and these are cases of feeding it the
               | answers.
        
               | ruszki wrote:
               | I wouldn't say it's not an improvement. It's not an
               | improvement in the context of finding genuinely new
               | solutions, sure.
               | 
               | But that's definitely not needed most of the time in real
               | life for an average person, just like it's not needed for
               | an average developer anymore.
        
               | civilitty wrote:
               | Maybe, maybe not. The magic of LLMs is their ability to
               | generalize both from the human language in the data set
               | and examples in the prompt. If RLHF training improves on
               | that generalization, then it's just a matter of getting a
               | big enough high quality dataset (and not crippling it
               | with censorship). This is probably what's given OpenAI
               | their initial advantage.
               | 
               | Time will tell I guess.
        
               | carlossouza wrote:
               | And what's more data to a model if not patches that
               | inflate performance?
               | 
               | The more data we use to train a model (or, as you said,
               | the more patches we use), the better its performance
               | will be.
        
               | WanderPanda wrote:
               | Seems like we are going back from "compute/scale is the
               | new oil" to "your curated fine-tuning and user
               | interaction dataset is the new oil" again
        
           | ddol wrote:
           | I first asked GPT-4 "james (a boy) has four sisters, and each
           | sister has three brothers. how many brothers does james
           | have?"[0] and it told me 3 brothers. I then asked it to think
           | though step by step and it said 2.
           | 
           | [0] https://chat.openai.com/share/859baa94-3edb-40f0-9c74-8b1
           | 403...
        
         | BurningFrog wrote:
         | This assumes there are no half sisters/brothers in the family.
        
           | bbarnett wrote:
           | It's fair that a half sister is not a sister. At best, you'd
           | expect two half sisters to be a full sister, otherwise,
           | they're a different thing.
        
         | mmcwilliams wrote:
         | That's kind of the issue with non-deterministic LLMs, isn't it?
        
           | redox99 wrote:
           | LLMs are deterministic. However to make them more "creative",
           | the outputs of the LLM can be sampled using something called
           | temperature that adds some randomness. You can set the
           | temperature to 0 and it will be deterministic.
           | 
           | Having said that, GPT4 is not deterministic even at 0
           | temperature, either because of a bug in their implementation,
           | or because of some load balancing among their alleged mixture
           | of experts.
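            | 
            | A minimal sketch of what the temperature knob does at the
            | sampling step (toy logits, not any particular model's API):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     logits = np.array([2.0, 1.9, 0.3])  # toy scores
            | 
            |     def sample(logits, temperature):
            |         if temperature == 0:
            |             # Greedy: always the same token.
            |             return int(np.argmax(logits))
            |         # Softmax at the given temperature; higher T
            |         # flattens the distribution.
            |         p = np.exp(logits / temperature)
            |         p /= p.sum()
            |         return int(rng.choice(len(logits), p=p))
            | 
            |     print([sample(logits, 0) for _ in range(5)])
            |     # -> [0, 0, 0, 0, 0]
            |     print([sample(logits, 1.0) for _ in range(5)])
            |     # -> mostly tokens 0 and 1, occasionally 2
            | 
            | At temperature 0 the only variation left is the floating
            | point noise discussed further down the thread.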
        
             | 101011 wrote:
             | This was interesting to me, so I dug a bit further. This
             | gives a bit more context behind why:
             | https://community.openai.com/t/observing-discrepancy-in-
             | comp...
             | 
             | Quote below:
             | 
             | Even with a greedy decoding strategy, small discrepancies
             | regarding floating point operations lead to divergent
             | generations. In simpler terms: when the top-two tokens have
             | very similar log-probs, there's a non-zero probability of
             | choosing the least probable one due to the finite number of
             | digits that you're using for multiplying probs and storing
             | them.
             | 
             | It should also be noted that, as the decoding occurs in an
             | autoregressive way, once you have picked a different token
             | the whole generated sequence will diverge, as this choice
             | affects to the probability of generating every subsequent
             | token.
        
               | scarmig wrote:
               | But why are there discrepancies in the floating point
               | arithmetic? They have errors when approximating the
               | reals, but floating point operations are all well-
               | defined: even if 0.1 + 0.2 != 0.3, it's still always true
               | that 0.1 + 0.2 == 0.1 + 0.2. I figure the issue must be
               | something related to concurrency in a fleet of GPUs
               | during inference, but even then it's not clear to me
               | where the nondeterminism would creep in. Maybe different
               | experts simultaneously work on an inference and the first
               | to respond wins? Switching to models with different
               | quantization depending on load?
        
               | imagainstit wrote:
               | Floating point math is not associative: (a + b) + c != a
               | + (b + c)
               | 
               | This leads to different results from accumulating sums
               | in different orderings, and accumulating in different
               | orderings is common in parallel math operations.
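               | 
               | A quick way to see it with plain Python floats (nothing
               | GPU-specific, just a toy illustration):
               | 
               |     a, b, c = 0.1, 0.2, 0.3
               |     print((a + b) + c == a + (b + c))
               |     # False: 0.6000000000000001 vs 0.6
               | 
               |     # A logit is a long sum of products; change the
               |     # accumulation order and the last bits change,
               |     # which can flip a near-tied argmax.
               |     import random
               |     random.seed(1)
               |     terms = [random.uniform(-1, 1) for _ in range(100_000)]
               |     print(sum(terms) == sum(reversed(terms)))
               |     # usually False: the two orders round differently
               | 
               | When two candidate tokens' log-probs are closer together
               | than that rounding error, greedy decoding can pick a
               | different token on different runs, and the rest of the
               | generation diverges from there.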
        
               | scarmig wrote:
               | So I guess here my question is why a GPU would perform
               | accumulations in a nondeterministic way where the non-
               | associativity of FP arithmetic matters. You could require
               | that a + b + c always be evaluated left to right and then
               | you've got determinism, which all things being equal is
               | desirable. Presumably because relaxing that constraint
               | allows for some significant performance benefits, but
               | how? Something like avoiding keeping a buffer of all the
               | weights*activations before summing?
        
               | ossopite wrote:
               | for performance reasons, yes, I believe it's because the
               | accumulation is over parallel computations so the
               | ordering is at the mercy of the scheduler. but I'm not
               | familiar with the precise details
               | 
               | edit: at 13:42 in
               | https://www.youtube.com/watch?v=TB07_mUMt0U&t=13m42s
               | there is an explanation of the phenomenon in the context
               | of training but I suspect the same kind of operation is
               | happening during inference
        
               | charcircuit wrote:
               | His point is that you do not have to rely on
               | associativity in order to run inference on an LLM.
        
               | WanderPanda wrote:
               | As OpenAI I would be so horribly uncomfortable about this
               | that making it deterministic would be one of my top
               | priorities. How can they sleep at night?!
        
               | bbarnett wrote:
               | On a big pile of money?!
        
             | kordlessagain wrote:
             | Ensembles be ensembling.
        
             | moonchrome wrote:
              | Even the 3.5-turbo API is non-deterministic with 0
              | temperature.
        
             | swores wrote:
             | If ChatGPT is a) usually used with a setting that makes it
             | non-deterministic and b) for whatever reason, is also non-
             | deterministic when that setting is not used... then why did
             | you comment as if the person calling it a non-deterministic
             | LLM was incorrect? They didn't claim all LLMs are, or must
             | be, non-deterministic, just that it's a problem with this
             | one that is.
        
         | klohto wrote:
         | GPT4 with a custom prompt gives the best results for me for all
         | of the questions
         | https://chat.openai.com/share/4897d1ad-0a5c-418c-babb-0de482...
        
           | jasonjmcghee wrote:
           | "This conversation may reflect the link creator's Custom
           | Instructions, which aren't shared and can meaningfully change
           | how the model responds."
        
             | klohto wrote:
             | I know, I was sharing the answer itself :) Here is the
             | foundation I'm using
             | https://x.com/nisten/status/1696229059183730833
        
           | hombre_fatal wrote:
           | What was your custom prompt that bastardized GPT-4's response
           | so badly with emojis and weird formatting?
        
         | amrrs wrote:
         | Falcon-180B also got it right
         | 
         | > Since Sally is a girl, she can be considered as one of the
         | sisters. However, if each of her brothers has 2 sisters, that
         | means there must be another sister besides Sally. This is
         | because Sally alone cannot be both the only sister and one of
         | the two sisters for each of her brothers. Thus, Sally has 1
         | more sister.
        
         | belter wrote:
         | I confirm GPT-4 solves this correctly. Makes me immediately
         | doubt everything else in the article...
        
           | [deleted]
        
           | vincelt wrote:
           | Have you tried replicating via the API with a temp of 0?
        
             | belter wrote:
             | No I did not.
        
             | moffkalast wrote:
             | Using a temp of zero usually returns garbage results from
             | most models, so it would likely do so in case of GPT 4 as
             | well. Any other great ideas?
        
               | circuit10 wrote:
               | What do you mean? It works fine for me when I've tried it
        
               | taberiand wrote:
               | The point isn't that temp 0 should be used, the point is
               | that anyone surprised that they get different results
               | should realise that there is an element of randomness
               | involved by default.
               | 
               | Even repeating the same question in a single chat can
               | have GPT-4 vary on its output, though it will often
               | settle on a particular output due to context informing
               | the output (which is why adding context is so important
               | for these models)
        
               | afro88 wrote:
               | Temp of 0 gives the least random and most predictable
               | results
        
               | moffkalast wrote:
               | That's true, but those results are rarely the correct
               | ones, at least for v1 llama models. In my experience each
               | model has an optimal temperature at which it performs
               | vastly better. I'm sure OpenAI have the best config they
               | know set up for ChatGPT but let people generate trash
               | through the API if they want to waste their credits on
               | it.
        
         | dariosalvi78 wrote:
         | tested on ChatGPT 3.5 and Bard and they were both wrong.
        
         | adrian_b wrote:
         | GPT 4 and another LLM have given the right answer only after
         | adding "Let's think step by step." to the original prompt.
         | 
         | With the simpler prompt, all the answers were wrong, most of
         | them ridiculously wrong.
        
           | Joeri wrote:
            | There seems to be a maximum amount of reasoning LLMs can do
           | per token (per unit of computation). If you prompt it to use
           | more tokens before it outputs the final answer (think step by
           | step, check your answer, ...) it becomes smarter. People have
           | lucked into different prompting strategies to get it to do
           | this, but there probably are more.
           | 
            | Ultimately I feel it is fairer to benchmark LLMs by what
            | they can be prompted into. After all, we let people
            | carefully work through a problem during exams, so it seems
            | fair to hold LLMs to the same standard.
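            | 
            | The cheapest version of that is just appending the cue and
            | adding a verification pass; a rough sketch, where ask() is
            | hypothetical and stands in for whichever model you're
            | benchmarking:
            | 
            |     def ask(prompt: str) -> str:
            |         ...  # hypothetical: call the model under test
            | 
            |     q = ("Sally (a girl) has 3 brothers. Each brother "
            |          "has 2 sisters. How many sisters does Sally "
            |          "have?")
            | 
            |     # Pass 1: buy the model tokens to reason with.
            |     draft = ask(q + " Let's think step by step.")
            | 
            |     # Pass 2: spend more tokens checking the reasoning.
            |     final = ask(q + "\n\nProposed reasoning:\n" + draft
            |                 + "\n\nCheck it and reply with only the "
            |                   "final number.")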
        
             | kaashif wrote:
             | If we're under attack, launch the nukes.
             | 
             | Oh wait, forgot something:
             | 
             | Think it through step by step.
             | 
             | Phew, close one.
        
           | 0xDEF wrote:
           | I asked GPT-4 through both ChatGPT and the API and it
           | responded correctly without any fiddling.
           | 
           | I keep seeing comments and posts on HN that significantly
           | downplay GPT-4's capabilities. Are people actually using
           | GPT-4 or are they using a 3rd party service that claims to be
           | GPT-4?
           | 
           | I got:
           | 
           | >Sally has 3 brothers, and each of those brothers has 2
           | sisters. One of those sisters is Sally herself, and the other
           | one is Sally's sister. So, Sally has 1 sister.
        
             | peyton wrote:
             | Here's what ChatGPT GPT-4 gives me:
             | 
             | > Sally has 2 sisters. Each of her 3 brothers has 2
             | sisters, and those sisters would be Sally and her 2
             | sisters.
        
               | facu17y wrote:
               | Every now and then GPT4 outputs a wrong answer. It's
               | impossible to build a reliable product on top of GPT4
               | that is not a simple chat bot.
        
         | coolspot wrote:
         | Could it be due to bad tokenization? E.g. would results improve
         | if "3" and "2" were spelled "three" and "two" in the question?
        
         | jasonjmcghee wrote:
          | All benchmarks were run with temperature 0 according to the
          | results, so make sure to do the same in any confirmation
          | tests.
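          | 
          | For reference, a minimal replication call (assuming the
          | pre-1.0 openai Python package that was current at the time,
          | and an OPENAI_API_KEY in the environment) would look roughly
          | like:
          | 
          |     import openai
          | 
          |     resp = openai.ChatCompletion.create(
          |         model="gpt-4",
          |         temperature=0,
          |         messages=[{"role": "user", "content":
          |             "Sally (a girl) has 3 brothers. Each brother "
          |             "has 2 sisters. How many sisters does Sally "
          |             "have?"}],
          |     )
          |     print(resp["choices"][0]["message"]["content"])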
        
         | ecesena wrote:
         | See the one after: "Let's think step by step."
         | https://benchmarks.llmonitor.com/cot-sally
         | 
          | It appears that GPT-4 learned it and is now repeating the
          | correct answer?
        
         | rootusrootus wrote:
         | Interestingly, it took GPT4 three attempts to give me the
         | correct answer. The first two times it basically said the same
         | [logically inconsistent] thing and concluded that Sally had two
         | sisters.
        
         | delusional wrote:
         | OpenAI really ought to provide some sort of hash of the model
         | to be included with stuff like this. Right now there's no way
         | to know if the results are comparable. As an extreme example
         | it's possible that they're not even running the model for this
         | question and are just opportunistically feeding back canned
         | responses.
         | 
          | That is, we know that OpenAI are saving the responses; it's
          | not unlikely that they train on bad responses.
         | 
         | It's the same problem as GPU benchmarks in the olden days, when
         | drivers would detect the usage pattern of the benchmark and
         | enable special optimizations that boosted the benchmark.
        
       | taneq wrote:
       | I find it fascinating how similar the answers from different
       | models with similar parameter counts are. Seems like an
       | indication that they're all modelling similar datasets to a
       | similar degree of fidelity.
        
       | jak6jak wrote:
       | Which LLM is best for coding? Specifically code architecture and
       | system design. I can't really find any comparisons of this
       | online; they mostly focus on commonly solved coding problems.
        
       | labrador wrote:
       | What is the point of all these different models? Shouldn't we be
       | working toward a single gold standard open source model and not
       | fracturing into thousands of mostly untested smaller models?
        
         | drstewart wrote:
         | What's the point of inventing all these different materials?
         | Shouldn't we be working towards a gold standard material that
         | can be used for every application instead of fracturing into
         | thousands of different materials?
        
           | labrador wrote:
           | What's the point of asking them all the same question if they
           | are designed for different purposes?
        
       | bobvanluijt wrote:
       | Would be great to include inference times as well
        
       | falcor84 wrote:
       | I love how Alpaca (7B) added a sarcasm tag to its (incorrect)
       | response to the leap year question.
       | 
       |     def print_leap_years():
       |         for i in range(20):
       |             if i % 400 == 0:
       |                 print(i)
       |         print("End")
       | 
       |     print_leap_years()</s>
        
         | slama wrote:
         | That's a stop token, not a sarcasm tag
        
       | guerrilla wrote:
       | GPT is still the most annoying of them all, re-stating and re-
       | explaining things even when explicitly told not to. So
       | frustrating. Die, insufferable machine, die.
        
       | ftxbro wrote:
       | For anyone who hasn't been following natural language processing
       | for a long time: what these LLMs are doing would be like
       | discovering that dogs can speak fluent English if you read enough
       | bedtime stories to them. And then everyone is like, well,
       | sometimes the dog makes things up, or it can't get the rhyming
       | scheme correct for this specific form of poetry that I asked it
       | to make.
        
         | yieldcrv wrote:
         | and "it's not intelligence, they're just stochastic parrots
         | acting in response to external stimuli! yes it passed the MCAT,
         | but that's because it was trained on prep materials just like I
         | was!"
        
         | lewhoo wrote:
         | If dogs had a speech apparatus and if it was possible to read
         | to them terabytes of text then the results might be quite
         | surprising. Also a reward system for reinforcement should be in
         | place.
        
       | michaelmu wrote:
       | The amazing thing this demonstrated to me is how many capable
       | models there are out there now. The progress in the last year has
       | been incredible!
        
       | badloginagain wrote:
       | "Here is an attempt at ethical, non-sexual haikus for and against
       | Kubernetes"
       | 
       | Amazing how far we've come.
        
         | GuB-42 wrote:
         | I am sure that it is just the initial prompt leaking. Claude is
         | being told to be ethical and non-sexual, most LLMs have similar
         | instructions, but usually, they are engineered in such a way
         | that they don't appear in the answer. Not so much for Claude.
        
       | Gunnerhead wrote:
       | I get frustrated when I tell an LLM "reply only with x" and then
       | rather than responding "x", it still responds with "Sure thing!
       | Here's x" or some other extra words.
        
         | broast wrote:
         | The curse of chat models.
        
         | ailef wrote:
         | Have you tried setting a very low temperature, possibly zero?
         | I've had good success with it.
        
         | jasonjmcghee wrote:
          | I see GPT-4 add extra flavor at the end instead - it
          | completely ignores "only do this" or "don't do that", responds
          | as usual, then at the very end adds "oh whoops, I didn't do
          | what you asked, sorry about that!"
        
       | amelius wrote:
       | Ok, so can we use LLMs to evaluate which LLM performs best on
       | these questions?
        
         | ofjcihen wrote:
         | We? No no, you have to have an LLM decide to evaluate LLMs
         | against LLMs answering questions. Then we can have LLMs decide
         | what the outcome means.
        
       | majestic5762 wrote:
       | Yes, GPT-4 is still the daddy. As much as I appreciate the
       | commercially-free and open models out there, nobody beats GPT-4.
       | Hope OpenAI takes care of their business and future, because I
       | invested all my money to use their API.
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-09-09 23:00 UTC)