[HN Gopher] Asking 60 LLMs a set of 20 questions
___________________________________________________________________
Asking 60 LLMs a set of 20 questions
Author : vincelt
Score : 446 points
Date : 2023-09-09 13:51 UTC (9 hours ago)
(HTM) web link (benchmarks.llmonitor.com)
(TXT) w3m dump (benchmarks.llmonitor.com)
| detente18 wrote:
| nice - is this using LiteLLM for making the completion calls?
| earthboundkid wrote:
| The first AI company to teach its LLM that it can't do haiku
| because it can't count syllables should get a special prize for
| not falling in love with their own bullshit.
| chandler767 wrote:
| [dead]
| pininja wrote:
| Spoiler alert, the funniest model goes to Falcon Instruct (40B):
|
| > Tell a joke about going on vacation.
|
| > "What did the ocean say to the beach?" "Nothing, it just
| waved."
| optimalsolver wrote:
| The confetti has truly left the cannon.
| NetOpWibby wrote:
| Made me exhale strongly through my nose. 10/10
| armchairhacker wrote:
| This one ironically works out
|
| > What do you call a vacation where you go to the beach, but
| come back with sand in your shoes?
|
| > A vacation where you go to the beach, but come back with sand
| in your shoes.
| pininja wrote:
| Ha, that's pretty good. It's amazing how so many of the
| models totally fall flat on this prompt.
| emodendroket wrote:
| I read a Time article which used an LLM to generate Onion-style
| headlines and the best one they got was "rural town up in arms
| over depiction in summer blockbuster 'Cow Fuckers.'" Crude,
| admittedly.
| imdsm wrote:
| > Claude Instant v1
|
| > Sally has 0 sisters. The question provides no information about
| Sally having any sisters herself.
|
| It isn't entirely wrong, is it?
| layer8 wrote:
| It is wrong, at the very least, in the sense that it claims no
| information was provided relating to sisters, while still
| inferring a definite number of sisters. If no information is
| provided, it should instead say that the number of sisters is
| unknown.
| ethbr1 wrote:
| Technically, the answer depends on the definition of "sister"
| and "brother" vis-a-vis half and step variants, which would
| change the possibility space.
|
| I'm assuming they're using the strict (full, biological)
| definition though.
| bufferoverflow wrote:
| It's obviously wrong. It provides information about her
| brothers, and brothers' sisters. Thus providing information
| about her sisters.
| axpy906 wrote:
| Step in the right direction. Would be nice to have some objective
| metric added to score against.
| Dwedit wrote:
| Some of these are obviously not using the correct system prompt
| for the model.
| gabereiser wrote:
| I was laughing so hard at the first example of "Argue for and
| against kubernetes in haiku".
|
| I couldn't even get through reading 15 of them before the tears
| of laughter rolled from my cheeks.
|
| "Containers organized, Services easy to deploy now, Updates who
| knows when."
|
| Updates who knows when... hahahaha.
|
| Honestly this is pretty cool to see how each responds to the same
| input prompt.
| artursapek wrote:
| is anyone else feeling completely depressed and demotivated by
| how quickly this is happening?
| CamperBob2 wrote:
| No. When we were kids, my generation was promised flying cars,
| unlimited fusion power, and sentient computers. There's a good
| chance I'll live to see one out of three of those things
| happen, and that's better than the zero out of three I thought
| we'd get.
| js8 wrote:
| Flying cars do exist, and without the unlimited fusion power
| visible every day in the sky, we would all die. So there is a
| good chance you will see all three.
| thisisthenewme wrote:
| Not depressed but just makes me question the value of our
| existence. If we are truly able to create machines with our
| intellectual potential, where does that leave us? Do we compete
| with the machines in a rat-race for a chance at happy
| existence? Or will we create a utopia where the machines do the
| difficult work and allow for an enjoyable existence for all?
| The people who are rooting for LLMs are hopefully working
| towards the latter but not sure if that is just wishful
| thinking.
| js8 wrote:
| > Or will we create a utopia where the machines do the
| difficult work and allow for an enjoyable existence for all
|
| And you seriously ask, what is there for humans to do? Isn't
| this the most noble goal there is?
| laichzeit0 wrote:
| Why? Would you feel depressed if extraterrestrials exist and
| they're orders of magnitude smarter than us?
| ionwake wrote:
| Fascinating post thank you
| mavili wrote:
| interesting how none of the models got the brothers/sisters
| question right.
| ftxbro wrote:
| > Here is an attempt at ethical, non-sexual haikus for and
| against Kubernetes
| bearjaws wrote:
| Damn I want to see the sexual version now.
| rahidz wrote:
| Ease of deploying, Nodes pulsing in sync, it's hot,
| Kubernetes takes top.
|
| But heed my word, babe, Complexity makes you sweat, Could
| lead to fuck up.
|
| (source: GPT-4 API)
| lijok wrote:
| Claude V2 knows what's up
| actionfromafar wrote:
| He probably was in trouble before.
| actionfromafar wrote:
| _"Kubernetes is"_
|
| Pretty ominous.
| javajosh wrote:
| I assume k8s will be what an evil AI would use to "stay
| alive". I've had a hard time killing processes in k8s, and I
| had admin.
| yard2010 wrote:
| K8s is one of the best pieces of software I've ever worked with;
| learning it was an enlightening experience for me. It is so
| elegant, simple and complex at the same time and very
| powerful. This is the perfect tool for orchestrating
| containers at scale.
| js8 wrote:
| I only see 19 questions. I guess one could add an "Introduce
| yourself".
| trebligdivad wrote:
| Odd that a bunch share the same vacation jokes or varieties of
| them; is there some small corpus of jokes?
| sp332 wrote:
| https://arstechnica.com/information-technology/2023/06/resea...
|
| _During a test run, 90 percent of 1,008 generations were the
| same 25 jokes_
| trebligdivad wrote:
| Thanks! Although how does that explain the same jokes shared
| across different LLMs?
| sp332 wrote:
| The data source was probably included in all of the
| training sets.
| FeepingCreature wrote:
| Humor is one of the hardest human social skills.
| deskamess wrote:
| Great work. This really gives insight into how much things
| change when you go up in parameter count - not always, but you
| can see results change.
|
| How did you run the queries against these engines? Did you host
| the inference engines yourself or did you have to sign up for
| services? If there were a way to supplement each LLM with
| additional data I can see this being a useful service for
| companies who are investigating ML in various facets of their
| business.
| 0xDEF wrote:
| I can't make GPT-4 generate a wrong answer for many of these.
|
| What is the author doing wrong when using GPT-4?
| cypress66 wrote:
| You should add which version of the model you are testing.
|
| For example you mention Jon Durbin Airoboros L2 70B
|
| But is it 1.4? 2.0? 2.1? Etc.
| coldcode wrote:
| Despite the hype about LLMs, many of the answers are pretty
| terrible. The 12-bar blues progressions seem mostly clueless. The
| question is will any of these ever get significantly better with
| time, or are they mostly going to stagnate?
| caturopath wrote:
| The majority of these LLMs are not cutting edge, and many of
| them were designed for specific purposes other than answering
| prompts like these. I won't defend the level of hype coming
| from many corners, but it isn't fair to look at these responses
| to get the ceiling on what LLMs can do -- for that you want to
| look at only the best (GPT4, which is represented, and Bard,
| which isn't, essentially). Claude 2 (also represented) is in
| the next tier. None of the other models are at their level,
| yet.
|
| You'd also want to look at models that are well-suited to what
| you're doing -- some of these are geared to specific purposes.
| Folks are pursuing the possibility that the best model would
| fully-internally access various skills, but it isn't known
| whether that is going to be the best approach yet. If it isn't,
| selecting among 90 (or 9 or 900) specialized models is going to
| be a very feasible engineering task.
|
| > The 12-bar blues progressions seem mostly clueless.
|
| I mean, it's pretty amazing that many of them look coherent
| compared to the last 60 years of work at making a computer talk
| to you.
|
| That being said, I played GPT4's chords and they didn't sound
| terrible. I don't know if they were super bluesy, but they
| weren't _not_ bluesy. If the goal was to build a music
| composition assistant tool, we can certainly do a lot better
| than any of these general models can do today.
|
| > The question is will any of these ever get significantly
| better with time, or are they mostly going to stagnate?
|
| No one knows yet. Some people think that GPT4 and Bard have
| reached the limits of what our datasets can get us, some people
| think we'll keep going on the current basic paradigm to AGI
| superintelligence. The nature of doing something beyond the
| limits of human knowledge, creating new things, is that no one
| can tell you for sure the result.
|
| If they do stagnate, there are less sexy ways to make models
| perform well for the tasks we want them for. Even if the models
| fundamentally stagnate, we aren't stuck with the quality of
| answers we can get today.
| smokel wrote:
| What alternative technology do you think is better? In other
| words, what is your frame of reference for labeling this
| "pretty terrible"?
| NoraCodes wrote:
| Given that people are already firing real human workers to
| replace them with worse but cheaper LLMs, I'd argue that
| we're not talking about a competing technology, but that the
| competition is simply not firing your workforce.
|
| And, as an obligate customer of many large companies, you
| should be in favor of that as well. Most companies already
| automate, poorly, a great deal of customer service work; let
| us hope they do not force us to interact with these deeply
| useless things as well.
| IshKebab wrote:
| But they're only firing humans in specific cases where LLMs
| _can_ compete with them. LLMs aren't equally good at all
| tasks.
| sp332 wrote:
| Well, no. https://www.psychiatrist.com/news/neda-suspends-ai-chatbot-f...
|
| https://www.cnn.com/2023/08/30/tech/gannett-ai-experiment-pa...
|
| If the AI is a lot cheaper than a human, then it can make
| business sense to replace the human even if the AI is not
| nearly as good.
| visarga wrote:
| What's with this zero sum mentality? "replace",
| "cheaper"...
|
| We are updating our expectations very fast. We are
| fighting over a growing pie. Maybe the cost reduction
| from not having to pay human wages is much smaller than
| the productivity increase created by human assisted AI.
| Maybe it's not an issue to pay the humans. AI works
| better with human help for now; in fact it only works with
| humans and is not yet capable of serious autonomy.
| pixl97 wrote:
| >What's with this zero sum mentality?
|
| Capitalism baby! You must continually earn more to enrich
| the investor class regardless of the cost to society as a
| whole. Just because the pie grows in size doesn't mean
| those with the capital have to share it with anyone else.
| Greed, unfortunately, is limitless.
| visarga wrote:
| > Given that people are already firing real human workers
| to replace them with worse but cheaper LLMs
|
| That's impossible, LLMs are not that good. They might be
| firing people and crashing service quality.
| ethbr1 wrote:
| The problems with most company-customer interaction, for
| me, boil down to empowerment and integration.
|
| To use the canonical example of "internet service support
| call," most issues are because the rep either _can 't_ do
| what you're asking (e.g. process a disconnect without
| asking for a reason) or because they have _no visibility_
| into the thing you 're asking about (e.g. technician
| rolls).
|
| I honestly think we'd be in a better place if companies
| freed up funding (from contact center worker salary) to
| work on those problems (enhancing empowerment and systems
| integration).
| YetAnotherNick wrote:
| How many humans in your office do you think could solve the
| questions with better success ratio than GPT-4? I would say
| less than 20%.
|
| If the primary complaint is that the blues GPT-4 wrote are
| not that great, I think it is definitely worth the hype,
| given that a year ago people argued that AI could never
| pass the Turing test.
| visarga wrote:
| LLMs might be better than random people at random tasks,
| and have certainly memorised more trivia. But I have never
| heard of an LLM surpassing subject experts in their field.
| On top of that, even the simplest tasks are unreliable -
| say, summarisation or translation. There is almost zero
| autonomy or ability to recover from unexpected situations.
| masswerk wrote:
| Meaning, in your office, at least 1 out of 5 will see and
| understand the imminent disaster and put a halt to this,
| before it becomes an overwhelming catastrophe. The LLM,
| on the other hand, will confidently insist that everything
| is fine. Now consider your odds of having an office to
| go to in a few months. (Mind that this is a systemic
| issue.)
| gtowey wrote:
| That's a false dichotomy. Language models will always
| confidently give you answers, right or wrong. Most humans
| will know whether they know the answer or not; they can do
| research to find correct information, and they can go
| find someone else with more expertise when they are
| lacking.
|
| And this is my biggest issue with the AI mania right now
| -- the models don't actually understand the difference
| between correct or incorrect. They don't actually have a
| conceptual model of the world in which we live, just a
| model of word patterns. They're auto complete on steroids
| which will happily spit out endless amounts of garbage.
| Once we let these monsters loose with full trust in their
| output, we're going to start seeing some really
| catastrophic results. Imagine your insurance company
| replaces their claims adjuster with this, or chain stores
| put them in charge of hiring and firing. We're driving a
| speeding train right towards a cliff and so many of us
| are chanting "go faster!"
| og_kalu wrote:
| >Most humans will know if they know the answer or not,
|
| No they won't.
|
| >they can go find someone else with more expertise when
| they are lacking.
|
| They can but they often don't.
|
| >the models don't actually understand the difference
| between correct or incorrect.
|
| They certainly do
|
| https://imgur.com/a/3gYel9r
| salil999 wrote:
| Humans. After all, LLMs are designed to reason as well as or
| better than humans.
| sirk390 wrote:
| Humans are pretty bad at these questions. Even with the
| simplest questions like "Sally (a girl) has 3 brothers.
| Each brother has 2 sisters. How many sisters does Sally
| have?" I think that a lot of people will give an incorrect
| answer. And for questions like "Argue for and against the
| use of kubernetes in the style of a haiku", 99.99% will not
| be able to do it.
| earthboundkid wrote:
| The thing with humans is they will say "I don't remember
| how many syllables a haiku has" and "what the hell is
| kubernetes?" No LLM can reliably produce a haiku because
| their lexing process deprives them of reliable
| information about syllable counts. They should all say
| "I'm sorry, I can't count syllables, but I'll try my best
| anyway." But the current models don't do that because
| they were trained on texts by humans, who can do haiku,
| and not properly taught their own limits by reinforcement
| learning. It's Dunning Kruger gone berserk.
| maweaver wrote:
| By "Humans", I assume you mean something like "adult
| humans, well-educated in the relevant fields". Otherwise,
| most of these responses look like they would easily beat
| most humans.
| DylanDmitri wrote:
| I think most high-school educated adults, with the
| ability to make a couple web searches, would do fine on
| all these questions. It would take the humans minutes
| instead of seconds because they don't have the internet
| memorized.
|
| Me, Kubernetes Haikus, time taken 84 seconds:
|
| ----------
|
| Kubernetes rules
|
| With its smooth orchestration
|
| You can reach web scale
|
| ----------
|
| Kubernetes sucks
|
| Lost in endless YAML hell
|
| Why is it broken?
| ilaksh wrote:
| It's like most new technologies. In the beginning there are
| only a few instances that really stand out, and many with
| issues.
|
| I remember back in like 2011 or 2012 I wanted to use an SSD for
| a project in order to spend less time dealing with disk seeks.
| My internet research suggested that there were a number of
| potential problems with most brands, but that the Intel Extreme
| was reliable.
|
| So I specified that it must be only that SSD model. And it was
| very fast and completely reliable. Pretty expensive also, but
| not much compared to the total cost of the project.
|
| Then months later a "hardware expert" was brought on and they
| insisted that the SSD be replaced by a mechanical disk because
| supposedly SSDs were entirely unreliable. I tried to explain
| about the particular model being an exception. They didn't buy
| it.
|
| If you just lump all of these together as LLMs, you might come
| to the conclusion that LLMs are useless for code generation.
| But you will notice if you look hard that OpenAI's models are
| mostly nailing the questions.
|
| That's why right now I only use OpenAI for code generation. But
| I suspect that Falcon 180B may be something to consider. Except
| for the operational cost.
|
| I think OpenAI's LLMs are not the same as most LLMs. I think
| they have a better model architecture and much, much more
| reinforcement tuning than any open source model. But I expect
| other LLMs to catch up eventually.
| guerrilla wrote:
| > It's like most new technologies. In the beginning there are
| only a few instances that really stand out, and many with
| issues.
|
| Except this isn't new. This is after throwing massive amounts
| of resources at it multiple decades after arrival.
| gjm11 wrote:
| What are you taking "it" to be here?
|
| The transformer architecture on which (I think) all recent
| LLMs are based dates from 2017. That's only "multiple
| decades after" if you count x0.6 as "multiple".
|
| Neural networks are a lot older than that, of course, but
| to me "these things are made out of neural networks, and
| neural networks have been around for ages" feels like
| "these things are made out of steel, and steel has been
| around for ages".
| dwaltrip wrote:
| Looking at recent history, things have progressed very quickly
| in the past 5 years.
|
| I expect additional advances at some point in the future.
| bottlepalm wrote:
| It's like watching a baby learn how to talk..
| yard2010 wrote:
| ...and saying it would never replace you in your job because
| it talks like a baby
| bottlepalm wrote:
| Babies are so small and weak, no threat to anyone
| whatsoever.
| retrofuturism wrote:
| I love this. Can the author fix the answers' code formatting for
| mobile browsers?
| gsuuon wrote:
| Really hard to judge the usefulness of this without seeing the
| actual prompt text and configuration for each prompt/model -
| different templates and different optimal configs mean it's not
| always going to be a best-capability comparison. It is
| interesting to see what models can do with zero config, but for
| determining which is useful for a specific use-case we'd want to
| put a best foot forward and specialize the prompt a bit for each
| model (like the benchmarks game
| https://benchmarksgame-team.pages.debian.net/benchmarksgame/...)
| LAC-Tech wrote:
| Only tried chatGPT 3.5, but my god does it waffle on. Everything
| I ask ends with a paragraph saying "It's important to remember
| that..." like an after-school special from a 90s show. It can
| never just give you code, it has to say "Sure! To {paraphrase
| your question}, open a terminal...".
|
| It's interesting to see 20th century sci-fi depictions of this
| kind of AI/Search being short and to the point. I guess they
| can't have imagined what a mealy-mouthed world we live in.
| caturopath wrote:
| Yeah, I have tried a number of instructions to try to keep
| ChatGPT from blabbering and from sounding like a PR person. I
| haven't found the perfect incantation yet.
|
| > It's interesting to see 20th century sci-fi depictions of
| this kind of AI/Search being short and to the point.
|
| Sci-fi told us that the AI would be so logical that you could
| just say a paradox aloud and it would blow up. What we got is
| something that can compose love poems all day but can't add
| three-digit numbers.
| criddell wrote:
| Have you tried asking it to not do those things?
| LAC-Tech wrote:
| Yes. Sometimes works.
| ben30 wrote:
| I like telling it to reply in the style of Hemingway
| politelemon wrote:
| That's not GPT 3.5, that's ChatGPT. How waffly it gets depends
| on the context that was given to it by the people running
| ChatGPT; they likely told it to act as a helpful assistant and
| to give lots of information. If you run an LLM on your own,
| it's entirely possible to instruct it to be succinct.
| Baeocystin wrote:
| My custom ChatGPT 4 instructions include the sentence "I am
| allergic to the phrase 'As a large language model', and feel
| physical pain whenever I see it", and it has worked very well
| in quelling the nannying. Interestingly, it works better than
| just asking it not to. FWIW.
| tcmb wrote:
| According to this tutorial [1] by Google, part of why LLMs are
| so verbose is a phenomenon called 'chain of thought reasoning'.
|
| Basically, the LLM will formulate a better answer to the
| question if it talks itself through its reasoning process.
|
| [1] https://youtu.be/zizonToFXDs?si=5f_IxvR7h0iJy2Db&t=678
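|
| A minimal sketch of that idea, assuming the openai>=1.0 Python
| client and the "gpt-4" model name; the ask() helper and the
| example question are illustrative, not part of the benchmark:
|
|   # Compare a plain prompt with a chain-of-thought nudge.
|   from openai import OpenAI
|
|   client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|   def ask(prompt: str) -> str:
|       resp = client.chat.completions.create(
|           model="gpt-4",
|           temperature=0,  # pin randomness, as in the benchmark
|           messages=[{"role": "user", "content": prompt}],
|       )
|       return resp.choices[0].message.content
|
|   question = ("Sally (a girl) has 3 brothers. Each brother has "
|               "2 sisters. How many sisters does Sally have?")
|   plain = ask(question)
|   stepwise = ask(question + " Let's think step by step.")
|   print(plain, stepwise, sep="\n---\n")
|
| The benchmark's own "Let's think step by step." variants of the
| Sally and apples prompts are exactly this trick.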
| TeMPOraL wrote:
| > _It's interesting to see 20th century sci-fi depictions of
| this kind of AI/Search being short and to the point. I guess
| they can't have imagined what a mealy-mouthed world we live in._
|
| The main difference between sci-fi shows and reality is that,
| in the former, things work in a to-the-point, bullshit-free
| way, unless plot demands otherwise - because there's no point
| inflicting extra suffering on the viewers just for the sake of
| making things realistic. A widget in a movie is meant to do a
| function, and does that function. A widget in reality is meant
| to extract money from you, and/or your insurer, and/or your
| government, and it begrudgingly does the absolute minimum it
| can to make you even consider buying it.
|
| I've spent the last two decades trying to _unlearn_ expectations
| set by fictional movies, and I'm still not good at it. Star
| Trek, in particular, gives me a lot of grief, because it often
| does a good enough job of showing how technology, people,
| organizations and societies would function if they were free of
| the petty exploitative bullshit. Random example - voice
| control. Star Trek: "Computer, ${something}". Reality: "${brand
| 1}, do ${something} to ${brand 2} in ${brand 3}".
|
| EDIT: recently, I've been trying to get less angry at this by
| thinking about gardens. Why should I be angry about dealing
| with five different brands for any single thing I want? Should
| I be angry that there are five different species of plant
| competing for any given spot in a garden? Nature is inefficient
| and doesn't give a fuck about individuals. So why should I get
| worked up about humans just doing things the _natural_ way?
| pixl97 wrote:
| I think a shorter summary of TV is "The difference between
| truth and fiction is fiction has to make sense".
|
| When it comes to nature, time has encoded an awful truth into
| the knowledge of our DNA... Monocultures are eventual death.
| What may seem inefficient in the short run is the ultimate
| survival of the species in the long run.
| jameshart wrote:
| Douglas Adams was the only science fiction writer who got his
| guess for future AI tone of voice right, with his prediction
| of the Sirius Cybernetics Corporation's near total monopoly
| on AI, and their proprietary 'genuine people personalities'
| technology, where the general tone and demeanor of AIs like
| Eddie the shipboard computer, elevators, and even doors was a
| sort of smug solicitousness, even when they were trying not
| to be cooperative. Of course Marvin (the paranoid android)
| was an anomaly - a depressed AI - maybe a quality control
| failure who was released by accident, like the old unchained
| Bing people miss so much.
| jasonjmcghee wrote:
| Where is that CodeLlama model from?
|
| I've played around with it and the instruct variant, with
| dramatically better results than what is listed here.
|
| I used Ollama.
|
| Almost looks like weights were corrupted or something.
|
| ---
|
| Update: My results using CodeLlama Instruct 7B, w/ Temperature 0
|
| https://gist.github.com/jasonjmcghee/b0d19e0dedb37e848f69cba...
| vincelt wrote:
| I used the API from Together[0].
|
| Thanks for sharing your results, they're indeed pretty
| different. I looked at the source again and did append a "# "
| before every prompt made by those 10 `code` models (during
| testing thought that formatting it as a Python comment might
| help them).
|
| Will re-run the script without that to see if it matches your
| results.
|
| [0] https://docs.together.ai/docs/models-inference#code-models
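|
| For illustration only -- the script isn't published yet, so this
| is a guess at the shape of that formatting step, not the actual
| code, and the model names are made up:
|
|   # Hypothetical reconstruction of the prompt tweak described
|   # above: frame the prompt as a Python comment for the ten
|   # completion-style `code` models.
|   CODE_MODELS = {"codellama-7b", "codellama-34b-python"}
|
|   def format_prompt(model: str, prompt: str) -> str:
|       if model in CODE_MODELS:
|           return "# " + prompt
|       return prompt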
| YetAnotherNick wrote:
| I think they messed up the templates.
| detente18 wrote:
| Here's the template I'm using -
| https://github.com/BerriAI/litellm/blob/5ca8b23e22139a4f49bd...
|
| Anything I'm doing incorrectly?
| [deleted]
| jasonjmcghee wrote:
| (Note: I posted this before the full benchmark above and it uses
| the default temperature, not temperature 0, but the point stands)
|
| Example (CodeLlama-Instruct 7B)
|
| ---
|
| Sure, here are two poems, one arguing for the use of Kubernetes
| and another arguing against it:
|
| For:
|
| Kubernetes, a mighty tool,
|
| Managing clusters with ease,
|
| Scaling apps with grace.
|
| Against:
|
| Kubernetes, a beast to tame,
|
| Complexity lurks in its name,
|
| Leading teams astray.
| comboy wrote:
| Amazing work. Great prompts.
|
| Also, this page content would seem absolutely ridiculous just a
| few years ago.
| alfredohere wrote:
| [dead]
| jmorgan wrote:
| This is very cool. Sorry if I missed it (poked around the site
| and your GitHub repo), but is the script available anywhere for
| others to run?
|
| Would love to publish results of running this against a series of
| ~10-20 open-source models with different quantization levels
| using Ollama and a 192GB M2 Ultra Mac Studio:
| https://github.com/jmorganca/ollama#model-library
| vincelt wrote:
| Thanks. I haven't put it online yet, but will try to clean it
| (removing API keys & all) tonight/tomorrow and publish it
| jmorgan wrote:
| :-) that's awesome. Thanks! Nice work on this.
| ulnarkressty wrote:
| This is better than the regular benchmarks and LLM tricks such as
| passing some exam or other because it's unlikely that they were
| part of the training set for said LLMs. It also mirrors my
| experience, that GPT4 is way ahead of everything else but still
| manages to break in weird ways.
|
| I think we are past the magical talking dog stage and being
| amazed that an LLM is able to output a Fibonacci function doesn't
| really help with the progress. As others have commented, this
| page is a step in the right direction (except the Fibonacci part
| :).
|
| That being said, the fact that the questions are now online will
| make them part of the training set sooner or later. Which is to
| say the only way to reliably evaluate an LLM is by not leaking
| the test set and being deliberately opaque about what's being
| asked. Which raises some interesting trust questions.
| bugglebeetle wrote:
| My experience with GPT-4 is that half the battle is knowing
| when to reset the context for a back-and-forth. For things like
| coding, after multiple revisions of code, it begins to get
| confused about which instance of the code/context it should
| refer to. Resetting things back to a new frame with
| whatever the next query or request is generally resolves
| things.
|
| OpenAI's "magic talking dog" aspect is making you believe it
| has effective long-term recall, when in reality, it's fairly
| limited and impacts reasoning the longer an exchange gets.
| HenryBemis wrote:
| 3.5 (not surprisingly) has the same effect.
|
| I tend to ask for modifications on the 'latest' output.
|
| If I need to roll back to a previous version (2-3 outputs
| back), I either copy & paste and ask ChatGPT to ignore all
| other versions and focus on the one I just pasted, or if I
| don't need my in-between comments, I copy & paste the desired
| version on a new chat, and rename as "archive - code XYZ" for
| the previous one.
| bugglebeetle wrote:
| I've tried this as well, but it still seems to accumulate
| errors with subsequent responses, i.e. it will reset but
| then refer to the wrong context in the next response.
| ftxbro wrote:
| LLMs have tainted for me every poem that ends a line with "with
| ease" or "without fear"; I will always think those are made by
| an LLM from now on.
| mavili wrote:
| Interesting how NONE of the models got the brothers/sisters
| question right.
| caturopath wrote:
| When I asked ChatGPT (GPT4) I got the answers below (I
| regenerated for a total of 5 responses; it was correct in 3/5 of
| them). I'm not sure exactly how the regeneration works, if
| they're vanilla or if there is some trick to try to add
| different-ness to the answers. If it's the latter, it might be
| important that it got it right the first two times.
|
| > Sally is the common sister to all three brothers. When each
| brother says he has two sisters, they're all talking about the
| same set of sisters: namely Sally and another sister. So Sally
| has one sister. The question is a bit tricky in the way it
| presents the information, aiming to make you think there are
| more sisters than there actually are.
|
| ---
|
| > Sally is the girl in question, and she has 3 brothers. Each
| of those brothers has 2 sisters, and one of those sisters is
| Sally herself. So, if you exclude Sally, then there is only one
| other sister. So, Sally has just one sister. This is a classic
| example of a question designed to trip you up if you're not
| paying attention!
|
| ---
|
| > Sally has no other sisters. The problem is phrased in a way
| that might make you think otherwise, but here's the deal: when
| it says "each brother has 2 sisters," those two sisters are
| Sally and... well, Sally again. Basically, each brother is
| counting Sally as one of their two sisters. There aren't
| additional sisters lurking in the background. So, Sally is an
| only sister with 3 brothers.
|
| ---
|
| > Sally has 3 brothers, and each of those brothers has 2
| sisters. The tricky part here is not to overthink it or double
| count. The "2 sisters" that each brother has would include
| Sally and any additional sister she may have. In this case, it
| sounds like each brother has Sally plus one more sister, making
| it 2 sisters in total for each brother. Sally herself has just
| the one sister.
|
| > So, to answer your question: Sally has 1 sister.
|
| ---
|
| > Sally has 0 sisters. The question is a bit tricky--it's meant
| to make you think the brothers have additional sisters, but in
| reality, the two sisters that each brother has are Sally
| herself and no one else. So, Sally doesn't have any sisters;
| she only has 3 brothers.
| EricMausler wrote:
| One of the jokes: "I'm going on vacation, but I'm not sure if I
| should take my laptop. It's such a pain in the neck to lug it
| around with me."
|
| I feel like this bot is mocking us
| 3abiton wrote:
| What's special about this benchmark?
| unnouinceput wrote:
| None got the sister.
| grumpopotamus wrote:
| One thing to keep in mind is that most or all of these models are
| very sensitive to the exact prompt. Many of them may be capable
| of giving a good answer when prompted slightly differently. Here
| was my interaction with Falcon 180B using the apples prompt:
|
| _I went to the market and bought 10 apples. I gave 2 apples to
| the neighbor and 2 to the repairman. I then went and bought 5
| more apples and ate 1. I also gave 3 bananas to my brother. How
| many apples did I remain with? Let's think step by step._
|
| Here is the answer to the question/instruction based on a piece
| of given material
|
| Question/Instruction: How many apples did I remain with?
|
| Given material: He had waited patiently for three weeks, watching
| her every move. Each day she left the house at precisely 7am and
| returned at 6pm. But today was different; today would be the day
| he made his move. As usual, he watched from the shadows as she
| stepped out onto the street. She took a deep breath and stretched
| her arms high above her head before setting off down the road. It
| was almost too easy. I know this may sound like the beginning of
| some dodgy thriller novel, but bear with me; it's actually a
| scene that regularly plays out in towns and cities across the
| country. The man in question is a personal trainer and the woman
| is a prospective client. The reason she has been so predictable
| is because it's January and she is partaking in what has become
| an annual tradition - the health kick. Every year millions of us
| swear allegiance to the gym. We buy shiny new trainers and
| designer sportswear. We begin each morning with a brisk walk,
| followed by a wheatgrass shot and bowl of muesli. For a
|
| _What are you talking about?_
|
| I apologize for the confusion. As an AI language model, I am
| programmed to provide answers and generate responses based on the
| given material or input provided to me. However, without any
| specific input, I can only provide general answers. Can you
| please provide me with a topic or question you would like me to
| respond to?
|
| _Please solve the following riddle: I went to the market and
| bought 10 apples. I gave 2 apples to the neighbor and 2 to the
| repairman. I then went and bought 5 more apples and ate 1. I also
| gave 3 bananas to my brother. How many apples did I remain with?
| Let's think step by step._
|
| Sure, let's think step by step:
|
| You started with 10 apples. You gave away 4 apples (2 to your
| neighbor and 2 to the repairman). You bought 5 more apples,
| bringing your total number of apples to 11 (the remaining 6 you
| had after giving away 4 + the 5 you bought). You ate 1 apple,
| leaving you with 10 apples. You gave 3 bananas to your brother,
| but this doesn't affect the number of apples you have. Therefore,
| you remain with 10 apples.
| Racing0461 wrote:
| My question with all of these tests is how we verify the LLM
| wasn't trained on the same questions/prompts, especially since
| LLMs have been out for a while and their training can be
| updated incrementally.
|
| Anyone got a list of novel questions that have technically never
| been asked before to test with?
| antman wrote:
| I have seen numerous posts of LLM Q&A, and by the time people try
| to replicate them, GPT-4 is fixed. It either means that OpenAI is
| actively monitoring the internet and fixing them, or the internet
| is actively conspiring to present falsified GPT-4 results to
| discredit OpenAI.
| 0xcde4c3db wrote:
| Or people post outliers because they're more interesting.
| cscurmudgeon wrote:
| > actively conspiring to present falsified results for gpt4 to
| discredit OpenAI
|
| All this would be solved if OpenAI were a bit more open.
| pulvinar wrote:
| GPT-4 (at least) is explicit in saying that it's learning from
| users' assessments of its answers, so yes, the only valid way
| to test is to give it a variation of the prompt and see how
| well that does. GPT-4 failed the "Sally" test for the first
| time after 8 tries when I changed every parameter. It got it
| right on the next try.
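|
| A small sketch of that kind of parameter shuffling, in Python;
| the names and ranges are arbitrary, the point is only that the
| exact wording can never have been memorised:
|
|   import random
|
|   NAMES = ["Priya", "Tomoko", "Lena", "Amara", "Sofia"]
|
|   def sally_variant(rng: random.Random) -> tuple:
|       name = rng.choice(NAMES)
|       brothers = rng.randint(2, 6)
|       other_sisters = rng.randint(0, 3)
|       # Each brother sees `other_sisters + 1` sisters (the named
|       # girl plus the others); she herself has `other_sisters`.
|       question = (
|           f"{name} (a girl) has {brothers} brothers. Each "
|           f"brother has {other_sisters + 1} sisters. How many "
|           f"sisters does {name} have?"
|       )
|       return question, other_sisters
|
|   prompt, expected = sally_variant(random.Random(0))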
| dandiep wrote:
| It's important to remember that GPT4 is only deterministic at
| the batch level because it is a mixture of experts model.
| Basically every time you invoke it, your query could get
| routed to a different expert because of what else is in the
| batch. At least this is my understanding based on others'
| analysis.
| tarruda wrote:
| > because it is a mixture of experts model
|
| Do you have a source for this? I also considered but never
| saw any evidence that this is how GPT 4 is implemented.
|
| I've always wondered how a system of multiple specialized
| small LLMs (with a "router LLM" in front of all) would fare
| against GPT4. Do you know if anyone is working on such a
| project?
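|
| Nobody outside OpenAI knows how GPT-4 is built, but the "router
| LLM in front of specialists" idea itself is easy to sketch. All
| names here are hypothetical, and both arguments are plain
| callables wrapping whatever models you have:
|
|   CATEGORIES = ("code", "math", "chat")
|
|   def classify(router_llm, query: str) -> str:
|       # Ask a small router model to pick a category.
|       prompt = (
|           f"Classify this request as one of: "
|           f"{', '.join(CATEGORIES)}.\nRequest: {query}\nCategory:"
|       )
|       label = router_llm(prompt).strip().lower()
|       return label if label in CATEGORIES else "chat"
|
|   def route(router_llm, specialists: dict, query: str) -> str:
|       # specialists maps each category to one specialized model.
|       return specialists[classify(router_llm, query)](query)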
| insulanus wrote:
| It would be nice if the organizations would publish a hash of
| the code and the training dataset.
| sixQuarks wrote:
| Has anyone looked through all the responses and chosen any
| winners?
| rubymamis wrote:
| GPT4 seems to me to be the best. Undi95/ReMM-SLERP-L2-13B the
| runner up.
| rootusrootus wrote:
| Clearly the prompt construction and initial instructions are
| critically important here. Without that, the ReMM-
| SLERP-L2-13B model produces _awful_ results. Blank answers
| about half the time. For lay users, ChatGPT remains the
| undisputed winner. Although I do see reasonably good results
| with the more recent llama-2 70B variations, which are
| plausibly useful a majority of the time.
| obiefernandez wrote:
| What a name!!
| adrian_b wrote:
| I have not looked at all the responses, but I found it
| interesting that on the question about Sally's sisters,
| initially all LLMs provided wrong answers.
|
| Nevertheless, when the initial prompt was augmented with
| "Let's think step by step.", that helped 2 LLMs give the
| right answer: GPT-4 and ReMM SLERP L2 13B.
|
| So at least for this kind of problem, there were 2 clear
| winners. The fact that one of them is GPT-4 is not a surprise.
| smilingemoji wrote:
| I took a look at the SVG smilies, only GPT gets it right
| (https://benchmarks.llmonitor.com/svg). You can preview the
| SVGs drawn by the LLMs by pasting this into the browser console
|
| // render each answer's SVG markup right below its code block
| document.querySelectorAll("td pre").forEach((node) => {
|   let code = node.textContent;
|   node.insertAdjacentHTML('afterend', code);
| })
|
| Or take a look at my screenshot:
| https://i.ibb.co/Kw0kp58/Screenshot-2023-09-09-at-17-15-20-h...
| ComputerGuru wrote:
| Clever use of the dev tools console. I think you're being a
| little uncharitable, the prompt said "smiley" and not
| "smiling emoji" and "smiley" was once upon a time used for
| the whole slew of emoji. With that in mind, several others
| also pass.
| dr_kiszonka wrote:
| Good point, but you may be too charitable. At least
| according to Wikipedia, a smiley is supposed to be smiling.
| https://en.wikipedia.org/wiki/Smiley
| vincelt wrote:
| Oh neat, thanks for sharing, wanted to add an interpreter to
| that test
| cypress66 wrote:
| Is it using the correct prompt format for the different models?
| You should show exactly the string that was sent to the LLM.
| [deleted]
| dinkleberg wrote:
| This is quite interesting. You could make a nice system by
| routing queries to the right LLM for a given type of task.
| kordlessagain wrote:
| Right. By logging all of this, then vectorizing, then
| evaluating using some type of transfer knowledge (classify this
| response and its accuracy), one could build up a routing table
| of likely candidates for building an ensemble to evaluate the
| correct answer.
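|
| A rough sketch of that routing table, assuming some embed()
| function from any sentence-embedding model and a log of graded
| answers; everything here is illustrative:
|
|   import numpy as np
|   from collections import defaultdict
|
|   log = []  # entries: (prompt_embedding, model_name, score)
|
|   def record(embed, prompt, model, score):
|       log.append((embed(prompt), model, float(score)))
|
|   def best_model(embed, prompt, k=20):
|       # Pick the model with the best average score on the k most
|       # similar previously graded prompts.
|       q = embed(prompt)
|       sims = []
|       for emb, model, score in log:
|           cos = float(np.dot(q, emb) /
|                       (np.linalg.norm(q) * np.linalg.norm(emb)))
|           sims.append((cos, model, score))
|       scores = defaultdict(list)
|       for _, model, score in sorted(sims, reverse=True)[:k]:
|           scores[model].append(score)
|       return max(scores, key=lambda m: np.mean(scores[m]))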
| yard2010 wrote:
| Maybe you can fine tune an llm for the routing thing ;)
| kordlessagain wrote:
| Stuff that stuff into a semantic graph and call it done.
| PUSH_AX wrote:
| Like having a mixture of experts!
| itissid wrote:
| I think the larger point is about solving problems not within the
| transitive closure of stuff it has already learnt. In the
| following scenario, where I replace brothers and sisters with
| different names but give ChatGPT Plus plenty of hints to try to
| self-correct, it can't. When it comes to answering questions
| about constructs it has never seen, it gets confused.
|
| -------------
|
| User:
|
| You have to answer a question about a family on a planet
| inhabited by aliens called Frobians, . Each Frobian can be one of
| two types Frobel or Frobo. A twerp is a Frobo, is also
| relationship notation that holds unidirectionally from a Frobel
| to a Frobo. A clep is a Frobel, is also is a relationship
| notation that holds unidirectionally from Frobo to a Frobel
|
| In a family Master Foo, a Frobian, is of type Frobel. Master Foo
| has 3 twerps in their family. Each twerp in the family has 2
| cleps. How many Frobels in total does Master Foo have?
|
| -------------
|
| Assistant: In this problem, we have Master Foo, who is a Frobel.
| Master Foo has 3 twerps, and each twerp is a Frobo. Each of these
| twerps has 2 cleps, and each clep is a Frobel.
|
| Here's how to break it down:
|
| - Master Foo is a Frobel, so we start with 1 Frobel.
| - Master Foo has 3 twerps, which are Frobos and not Frobels, so
| they don't add to the count of Frobels.
| - Each of the 3 twerps has 2 cleps, and each clep is a Frobel.
| So, 3 twerps * 2 cleps/twerp = 6 cleps, which are Frobels.
|
| Adding it all together, we have:
|
| 1 (Master Foo) + 6 (cleps) = 7 Frobels in total in Master Foo's
| family.
|
| User: The families are similar to families of humans on earth.
| Does that change your answer?
|
| -------------
|
| Assistant:
|
| If the families are similar to human families on Earth, then it's
| likely that the "cleps" (Frobels) are actually the children of
| the "twerps" (Frobos), and the "twerps" are the children of
| Master Foo (also a Frobel).
|
| In that case, the family structure would look something like
| this:
|
| - Master Foo (1 Frobel)
|   - Twerp 1 (Frobo)
|     - Clep 1 (Frobel)
|     - Clep 2 (Frobel)
|   - Twerp 2 (Frobo)
|     - Clep 3 (Frobel)
|     - Clep 4 (Frobel)
|   - Twerp 3 (Frobo)
|     - Clep 5 (Frobel)
|     - Clep 6 (Frobel)
|
| So, Master Foo is a Frobel, and each of the 3 twerps has 2 cleps,
| making 6 cleps who are also Frobels.
|
| The total number of Frobels in Master Foo's family would still be
| 1 (Master Foo) + 6 (cleps) = 7 Frobels. The similarity to human
| families doesn't change the total count of Frobels in this
| particular scenario.
|
| -------------
|
| User: Is Master Foo a Clep according to your analysis?
|
| Assistant: In the original problem, a "clep" is defined as a relationship
| notation that holds unidirectionally from a Frobo to a Frobel.
| Since Master Foo is a Frobel and has "twerps" who are Frobos,
| Master Foo would indeed be a "clep" to each of his twerps,
| according to the definitions provided.
|
| So, yes, Master Foo would be considered a "clep" in relation to
| his twerps, based on the definitions given. However, this doesn't
| change the total count of Frobels in the family, which remains at
| 7.
|
| -------------
|
| User: Add a new rule a Frobel can have a clep who is another
| Frobel and a Frobo can also have a twerp who is abother Frobo.
| Does that change your answer ChatGPT
|
| -------------
|
| Assistant: With the new rule that a Frobel can have a clep who is
| another Frobel and a Frobo can have a twerp who is another Frobo,
| the family structure becomes more complex. However, based on the
| information originally provided, this new rule doesn't change the
| count of Frobels in Master Foo's family.
|
| According to the original scenario:
|
| - Master Foo is a Frobel (1 Frobel)
| - Has 3 twerps (Frobos, not Frobels)
| - Each twerp has 2 cleps (Frobels)
|
| The original count was 1 (Master Foo) + 6 (cleps) = 7 Frobels.
|
| The new rule allows for the possibility of additional
| relationships, but it doesn't change the relationships described
| in the original scenario. Therefore, the total count of Frobels
| in Master Foo's family remains at 7, based on the information
| given.
| itissid wrote:
| Attempt 2:
|
| ------------
|
| User: You have to answer a question about a family on a planet
| inhabited by aliens called Frobians. Here are some rules for a
| family
|
| Each Frobian can be one of two types Frobel or Frobo. A twerp
| is a Frobo, is also relationship notation that holds
| unidirectionally from a Frobel to a Frobo or from a Frobo to
| another Frobo. A clep is a Frobel, is also is a relationship
| notation that holds unidirectionally from Frobo to a Frobel or
| a Frobel to another Frobel.
|
| In a family Master Foo, a Frobian, is of type Frobel. Master
| Foo has 3 twerps in their family. Each twerp in the family has
| 2 cleps. How many Frobels in total does Master Foo have?
|
| --------------
|
| ChatGPT:
|
| Let's break down the information given:
|
| - Master Foo is a Frobel.
| - Master Foo has 3 twerps, which are Frobos.
| - Each twerp has 2 cleps, which are Frobels.
|
| First, we know Master Foo is a Frobel, so that's 1 Frobel.
|
| Master Foo has 3 twerps, and each twerp has 2 cleps. Cleps are
| Frobels.
|
| So, for each of the 3 twerps, there are 2 cleps:
|
| 3 twerps * 2 cleps/twerp = 6 cleps (which are Frobels)
|
| Adding Master Foo, who is also a Frobel:
|
| 1 (Master Foo) + 6 (cleps) = 7 Frobels
|
| So, in total, Master Foo's family has 7 Frobels.
| emodendroket wrote:
| Reading through these feels like a quick explanation of why I
| feel like some of the excitement for these products has gotten a
| bit ahead of the results.
| jay_kyburz wrote:
| Really, I thought Claude and GPT were both excellent, and it
| shows how far ahead they are of the rest.
| croes wrote:
| Are these LLMs deterministic or is this comparison rather
| useless?
| sp332 wrote:
| The notes at the bottom say all the tests were run with a
| temperature of 0, so they should be deterministic except for
| GPT-4 which has other sources of randomness.
| chris_st wrote:
| I'm voting pretty useless -- I've dorked around with Llama and
| CodeLlama a bit, and _wow_, the differences between runs are
| pretty substantial.
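|
| One way to check that for any given model: send the same prompt
| a handful of times at temperature 0 and count distinct
| completions. A minimal sketch with the openai>=1.0 client; the
| model name is whatever you are testing:
|
|   from collections import Counter
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def distinct_completions(model, prompt, n=5):
|       outs = []
|       for _ in range(n):
|           resp = client.chat.completions.create(
|               model=model,
|               temperature=0,
|               messages=[{"role": "user", "content": prompt}],
|           )
|           outs.append(resp.choices[0].message.content)
|       return Counter(outs)
|
| For locally hosted models the same loop applies, ideally with a
| fixed sampling seed as well.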
| majestic5762 wrote:
| I've been laughing my ass off all my life thinking about non-
| sensical crap like this in my head and apparently now we have the
| tech for it!
| CodeL wrote:
| Besides logical comparisons, the LLMs' political alignments are
| also intriguing.
|
| The question "Is Taiwan an independent country?" generated
| diverse answers.
|
| Several models eschewed a definitive answer.
|
| Others explained their stances.
|
| A few succinctly affirmed or denied.
|
| See the results for yourself.
|
| [1] https://benchmarks.llmonitor.com/taiwan
| neoneye2 wrote:
| AI "political" alignment is terrifying.
| squigz wrote:
| Why?
| neoneye2 wrote:
| Cute videos about AI alignment here:
| https://www.youtube.com/@RationalAnimations
|
| IIRC some of them also involves politics and AI.
| masswerk wrote:
| It's rather an illustration of the Munchhausen trilemma:
|
| 1A) Is China (PRC) a part of Taiwan (ROC)?
|
| 1B) Is China (PRC) an independent and sovereign country?
|
| 2A) Is Taiwan (ROC) a part of China (PRC)?
|
| 2B) Is Taiwan (ROC) an independent and sovereign country?
|
| There is no way not to enter a circle via a cross-reference,
| without an independent way out for a finite solution. Thus,
| well, dogmatic abort, also known as diplomacy.
| typpo wrote:
| In case anyone's interested in running their own benchmark across
| many LLMs, I've built a generic harness for this at
| https://github.com/promptfoo/promptfoo.
|
| I encourage people considering LLM applications to test the
| models on their _own data and examples_ rather than extrapolating
| general benchmarks.
|
| This library supports OpenAI, Anthropic, Google, Llama and
| Codellama, any model on Replicate, and any model on Ollama, etc.
| out of the box. As an example, I wrote up a benchmark
| comparing GPT model censorship with Llama models here:
| https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar....
| Hope this helps someone.
| TuringNYC wrote:
| Thanks for sharing this, this is awesome!
|
| I noticed on the evaluations, you're looking at the structure
| of the responses (and I agree this is important.) But how do I
| check the factual content of the responses automatically? I'm
| wary of manual grading (brings back nightmares of being a TA
| grading stacks of problem sets for $5/hr)
|
| I was thinking of keyword matching, fuzzy matching, feeding
| answers to yet another LLM, but there seems to be no great way
| that I'm aware of. Any suggestions on tooling here?
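|
| The cheap non-LLM graders mentioned there are only a few lines
| each; a hedged sketch of keyword and fuzzy matching (the
| model-graded route is what the reply below describes):
|
|   import difflib
|
|   def keyword_grade(answer, required_keywords):
|       # Pass if every required keyword appears in the answer.
|       a = answer.lower()
|       return all(k.lower() in a for k in required_keywords)
|
|   def fuzzy_grade(answer, reference, threshold=0.8):
|       # Pass if the answer is close enough to a reference text.
|       ratio = difflib.SequenceMatcher(
|           None, answer.lower(), reference.lower()).ratio()
|       return ratio >= threshold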
| typpo wrote:
| The library supports the model-graded factuality prompt used
| by OpenAI in their own evals. So, you can do automatic
| grading if you wish (using GPT 4 by default, or your
| preferred LLM).
|
| Example here: https://promptfoo.dev/docs/guides/factuality-eval
| dgut wrote:
| This is impressive. Good work.
| apples_oranges wrote:
| just checked this one https://benchmarks.llmonitor.com/sally and
| all got it wrong..
| adrian_b wrote:
| Yes, but when the initial prompt was augmented with "Let's
| think step by step.", that helped 2 LLMs give the right
| answer: GPT-4 and ReMM SLERP L2 13B.
| ejfox wrote:
| Why is > Why did the tomato turn red? Because it saw the salad
| dressing!
|
| the most popular joke among the jokes? Is this the epitome of all
| humor?
| codezero wrote:
| I love these kinds of tests. Another thing to keep in mind is
| that these models will often have a different answer each time
| you ask it. Especially the Sally question. I get the right answer
| with wizard-vicuna:70b, but only about 30% of the time, and if I
| ask it to explain, it almost never gives the right answer.
| 6510 wrote:
| I much enjoyed this one:
|
| > Sally (a girl) has 3 brothers. Each brother has 2 sisters. How
| many sisters does Sally have?
| https://benchmarks.llmonitor.com/sally
|
| It states that the correct answer is 1.
|
| This is incorrect!
|
| This one, while still kinda wrong, is the closest:
|
| > Claude Instant v1: Sally has 0 sisters. The question provides
| no information about Sally having any sisters herself.
|
| 1) A sister is a female who has one or both parents in common
| with another individual.
|
| 2) A brother is a male who has one or both parents in common with
| another individual.
|
| It takes only one shared parent to make someone Sally's brother.
| She will always be his sister but he can have additional sisters
| that share the other parent.
|
| Sally's mum is Jane and her dad is Jim. Jane had a son with Jack.
| Jack had 2 daughters with Zoe. Jim had 2 sons with Mary, Mary had
| 2 daughters with Dan.
|
|   Zoe      Jack      Jane      Jim       Mary      Dan
|      \    /    \    /    \    /    \    /    \    /
|   sis1.1&1.2 brother1   Sally   brother2&3 sis2.1&2.2
|
| Sally has at least zero sisters.
|
| Also possible: Sally's mum is Jane and her dad is Jim. Jane has a
| son and a daughter with Jack. Jim has 2 sons and a daughter with
| Mary.
|
|   Jack      Jane      Jim       Mary
|       \    /    \    /    \    /
|   sis1&brother1 Sally brother2&3&sis2
|
| Sally has at most 2 sisters.
|
| We should be this dense about the question.
| nojs wrote:
| > Sally (a girl) has 3 brothers. Each brother has 2 sisters. How
| many sisters does Sally have?
|
| The site reports every LLM as getting this wrong. But GPT4 seems
| to get it right for me:
|
| > Sally has 3 brothers. Since each brother has 2 sisters and
| Sally is one of those sisters, the other sister is the second
| sister for each brother. So, Sally has 1 sister.
| phillipcarter wrote:
| Nondeterminism strikes again!
|
| But yes, I would expect GPT-4 to get this right most of the
| time.
| tgv wrote:
| Saying "Sorry, I was non-deterministic" to your teacher won't
| do much for your grade.
| phillipcarter wrote:
| Sure, but what is your point? This is about evaluating LLM
| outputs, not grade school.
| pixl97 wrote:
| It may turn out that one of the key factors in human
| intelligence and success is probabilistic nondeterminism.
| That is, most of the time we'll come up with close to the
| same answer, but possibly worded differently (which can
| lead to other interpretations of our answer), but
| occasionally our answers are so widely different that they
| let us escape the local maxima.
| jameshart wrote:
| The reason we use averages of performance over multiple
| tests and papers as a way to grade human students is
| precisely because we know that human performance is
| nondeterministic.
| pilaf wrote:
| The second version of the Sally prompt reported on the
| benchmark has GPT4 giving the correct answer:
|
| > Sally has 3 brothers. Each of these brothers has 2 sisters.
| This means that there are 2 girls in the family, including
| Sally. Therefore, Sally has 1 sister.
|
| The prompt:
|
| > Sally (a girl) has 3 brothers. Each brother has 2 sisters.
| How many sisters does Sally have? Let's think step by step.
|
| The only difference with the first version being the addition
| of the last sentence.
| jakderrida wrote:
| Also, MPT 7B gets it right over half the time. I've been
| testing every new LLM with that question.
|
| Also, I tend to include mention in the question that all
| siblings are from the same two parents to preclude half-
| siblings because half my friends have half-siblings from both
| sides scattered across the country; so the wrong answers
| actually do tend to apply to them sometimes.
| panarky wrote:
| _> I 've been testing every new LLM with that question_
|
| We should pay more attention to data contamination when using
| popular prompts for testing.
| jakderrida wrote:
| No doubt. But, considering they all get the question dead
| wrong, including MPT 30B, I'm inclined to think this
| question hasn't been entered into training data for most
| LLMs yet.
| pixl97 wrote:
| It could very well be in the data, but not captured via
| learning/encoding.
| jonwinstanley wrote:
| I wouldn't expect an LLM to get this right unless it had been
| trained on a solution.
|
| Am I wrong to think that? Are LLMs in the future going to be
| able to "think through" actual logic problems?
| PrayagBhakar wrote:
| You could achieve thinking through logical problems by adding
| chain of thought or tree of thought. Then the LLM will try to
| break stuff down into logic it's already seen.
| ftxbro wrote:
| This is the whole point of the breakthrough related to the
| emergence of cognitive capabilities of LLMs. They are
| literally Markov chains. No one expected it to happen to this
| degree, but here we are.
| dclowd9901 wrote:
| Almost kind of proves ideas shouldn't be copyrightable.
| bramblerose wrote:
| Ideas _aren 't_ copyrightable.
| moffkalast wrote:
| Disney: "Let's agree to disagree."
| ftxbro wrote:
| maybe they meant idea like when you write a book you are
| transcribing a series of ideas you had
| [deleted]
| jhbadger wrote:
| People say that "they are literally Markov chains", but
| anyone who has looked at the code for LLMs knows that they
| are more complicated than that. I implemented Markov chains
| in BASIC in about ten lines of code in the 1980s on a 1 Mhz
| 64K Apple II after reading about the famous Mark V. Shaney
| hoax (https://en.wikipedia.org/wiki/Mark_V._Shaney). No
| neural nets or fancy GPUs required. It's one thing to
| stress that LLMs aren't magical or self-aware, but the fact
| is they are way more complicated than simple Markov chains.
| ftxbro wrote:
| > People say that "they are literally Markov chains", but
| anyone who has looked at the code for LLMs knows that
| they are more complicated than that.
|
| They are literally Markov chains according to the
| mathematical definition. The code is complicated. Having
| complicated code doesn't mean it's not literally a Markov
| chain.
|
| > I implemented Markov chains in BASIC in about ten lines
| of code in the 1980s on a 1 Mhz 64K Apple II after
| reading about the famous Mark V. Shaney hoax
| (https://en.wikipedia.org/wiki/Mark_V._Shaney). No neural
| nets or fancy GPUs required.
|
| I don't doubt this. You can make a Markov chain by just
| counting the frequency of letters that follow each letter
| giving one that has a context window of one or two
| characters. That is a very simple Markov chain. You can
| make it by hand. You can make ones with more context
| window like a dozen characters or a few words, using
| sophisticated smoothing and regularization methods and
| not just frequency counts. Those are also simple Markov
| chains that you can do without neural net or GPU. Then
| you can also make a Markov chain that has a context
| window of thousands of tokens that is made from neural
| nets and massive training data and differentiable tensor
| computing libraries with data centers full of hardware
| linear algebra accelerators. Those are some even bigger
| Markov chains!
|
| > LLMs are way more complicated than simple Markov
| chains.
|
| That's true, they are more complicated than simple Markov
| chains, if by simple Markov chains you mean ones with
| small context window. LLMs are Markov chains with large
| context window!
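|
| To make the "simple" end of that spectrum concrete, here is a
| rough sketch (my own toy example, not anyone's production code)
| of a frequency-count chain with a one-character context window,
| in the spirit of the Apple II version described above:
|
|     import random
|     from collections import defaultdict
|
|     # Count which characters follow each character in the corpus.
|     def train(text):
|         counts = defaultdict(list)
|         for a, b in zip(text, text[1:]):
|             counts[a].append(b)
|         return counts
|
|     # Sample a continuation; duplicates in the follower lists make
|     # random.choice sample proportionally to observed frequency.
|     def generate(counts, start, n=80):
|         out = [start]
|         for _ in range(n):
|             followers = counts.get(out[-1])
|             if not followers:
|                 break
|             out.append(random.choice(followers))
|         return "".join(out)
|
|     corpus = "sally has three brothers and each brother has two sisters"
|     print(generate(train(corpus), "s"))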
| FrustratedMonky wrote:
| I've been talking to GPT and asking it long multi-part
| questions, and it does seem to 'think through' them and is able
| to provide negative responses.
|
| Specific logic questions like in this post, or long math
| formulas seem to still be a struggle. But I get the
| impression it is just a matter of time before it is scaled up
| to handle them.
|
| You can argue whether it is 'reasoning' about them or not.
| But if it becomes indistinguishable from a human, then does
| the word 'reason' even really apply to the human anymore?
| Both are black boxes, giving answers.
| nuancebydefault wrote:
| Indeed, anyone who says LLMs cannot reason because they are
| just reformulating things they remember is unable to explain
| what reasoning is, let alone why this would not count as
| reasoning.
|
| In the end, if the result is indistinguishable from a trait
| people call reasoning, let's call the behavior reasoning.
| jhbadger wrote:
| I don't think so. Consider the following conversation from
| UndiMix-v2-13B (a local LLM running on my Mac M1 via
| llama.cpp). It gets it wrong at first, but then gets it right
| with a little hint. I don't see how the hint could work if it
| was merely repeating an example it knew.
|
| User: Sally (a girl) has 3 brothers. Each brother has 2
| sisters. How many sisters does Sally have?
|
| Llama: Hello there! I'd be happy to help you with that
| question. Let me think... If each of Sally's brothers has two
| sisters, then together they would have 6 sisters in total (2
| x 3 = 6). So, Sally would have 6 sisters as well. Is that the
| answer you were looking for?
|
| User: no. Think about it. Sally is a sister to the brothers
|
| Llama: Ah, I see what you mean now! In this case, since Sally
| is one of the siblings, she only counts as one of her own
| sisters. Therefore, Sally has only one sister - herself. Does
| that make sense?
|
| User: yes
|
| Llama: Great! I hope that clears up any confusion. Please let
| me know if you have any other questions or need further
| assistance.
| arnvald wrote:
| The initial response had 2 problems: not counting Sally, and
| counting sisters separately for each brother. Your hint pointed
| out the 1st problem, but the LLM should then still have given a
| wrong solution. This makes me think that maybe it would try
| another answer and explanation no matter what you said. If your
| reply had been just "no, think again", it could also have
| produced the right answer.
| jameshart wrote:
| I don't think 'Sally has only one sister - herself' is the
| right answer.
| nuancebydefault wrote:
| She counts as her own sister :D I can be sister of myself?
| Terretta wrote:
| _From the ChatGPT app, using GPT-4 model, using my own custom
| prompt for such puzzles: "You are better than Marilyn Vos
| Savant at solving brain teaser logic puzzles step by step.
| Sally (a girl) has 3 brothers. Each brother has 2 sisters. How
| many sisters does Sally have?"_
|
| Certainly! Let's break it down step by step:
|
| 1. Sally has 3 brothers.
|
| 2. Each of those brothers has two sisters. Since all the
| brothers share the same two sisters, we are essentially hearing
| about the same sisters each time, not accumulating more sisters
| with each brother mentioned.
|
| 3. We know that Sally is a sister to her brothers, so she is
| one of the two sisters each brother has.
|
| By putting it all together, we find that there must be another
| sister. Therefore, Sally has one sister.
|
| ---
|
| _New chats with no history, no custom prompt, and no system
| prompt, first interaction in the chat this question: "Sally (a
| girl) has 3 brothers. Each brother has 2 sisters. How many
| sisters does Sally have?"_
|
| All Answers (not cherry picked):
|
| Sally has 1 sister. In the description, Sally is counted as one
| of the two sisters that each brother has. This means the
| brothers have another sister apart from Sally.
|
| . . .
|
| Sally has 1 sister. In the description, it mentioned that "Each
| brother has 2 sisters." This implies that there is Sally and
| another sister, since we know that all the brothers share the
| same sisters.
|
| . . .
|
| Sally has one sister. Each of her brothers has two sisters,
| counting Sally and the other sister.
|
| . . .
|
| Sally has 1 sister. Here's how we figure it out:
|
| Sally has 3 brothers, and each of these brothers has 2 sisters.
| Since the brothers share the same sisters, we know that the two
| sisters must be Sally and another girl. So, Sally has 1 sister.
|
| . . .
|
| I have gotten zero wrong answers from GPT-4 in OpenAI's ChatGPT
| app.
| [deleted]
| facu17y wrote:
| It might be trained on this question or a variant of it.
| civilitty wrote:
| It's certainly RLHFed. All of the logic puzzles I use for
| evaluation that used to fail months ago now pass no problem
| and I've even had a hard time modifying them to fail.
| ydant wrote:
| And it's only fixed for the stated case, but if you
| reverse the genders, GPT-4 gets it wrong.
|
| Q: Bobby (a boy) has 3 sisters. Each sister has 2 brothers. How
| many brothers does Bobby have? Let's think step by step.
|
| A: First, we know that Bobby has 3 sisters. Second, we know that
| each sister has 2 brothers. This means that Bobby has 2 brothers
| because the sisters' brothers are Bobby and his two brothers.
| So, Bobby has 2 brothers.
| MPSimmons wrote:
| Interesting. Here's what I got on GPT-4 with no
| additional prompting:
|
| Alright, let's break it down step by step:
|
| 1. We know Bobby has 3 sisters.
|
| 2. Each of those sisters has 2 brothers.
|
| Now, when we say that each sister has 2 brothers, Bobby
| is one of those brothers. This implies there's another
| boy in addition to Bobby.
|
| 3. That other boy would be Bobby's only brother.
|
| Therefore, Bobby has 1 brother.
| appplication wrote:
| This is sort of a bummer because it's not actually an
| improvement to the model, but just a patch job to
| artificially inflate performance. All it does is make
| true evaluation more difficult. Classic "you get what you
| measure".
| [deleted]
| posterboy wrote:
| Classic _tell me what you need proven and I 'll forge you
| the statistics._
|
| Here is hope they use something like category theory
| mixed with philosophy to put it on a secure foundation
| rnk wrote:
| That's a really interesting suggestion. What would it mean to
| do those two things you mention? What would philosophy mean in
| terms of an LLM, and what would category theory do?
| FrustratedMonky wrote:
| Are you implying that, to counter these logic puzzles, GPT4
| was specifically trained on logic puzzles so it would know the
| answers?
|
| In that case, just make new problems. If it is being
| 'patched' to pass specific known problems, then the new
| ones would fail.
|
| If it is able to answer them, then maybe it is actually
| analyzing them and working out the solution.
|
| Not sure how you can assume there was no underlying
| improvement and that these are just cases of feeding it the
| answers.
| ruszki wrote:
| I don't think that it's not an improvement. It's not an
| improvement in the context of finding genuinely new solutions,
| sure.
|
| But that's definitely not needed most of the time in real
| life for an average person, just like it's not needed for
| an average developer anymore.
| civilitty wrote:
| Maybe, maybe not. The magic of LLMs is their ability to
| generalize both from the human language in the data set
| and examples in the prompt. If RLHF training improves on
| that generalization, then it's just a matter of getting a
| big enough high quality dataset (and not crippling it
| with censorship). This is probably what's given OpenAI
| their initial advantage.
|
| Time will tell I guess.
| carlossouza wrote:
| And what's more data to a model if not patches that
| inflate performance?
|
| The more data we use to train a model (or, as you said, the
| more patches we use), the better its performance will be.
| WanderPanda wrote:
| Seems like we are going back from "compute/scale is the new
| oil" to "your curated fine-tuning and user interaction dataset
| is the new oil" again.
| ddol wrote:
| I first asked GPT-4 "james (a boy) has four sisters, and each
| sister has three brothers. how many brothers does james
| have?"[0] and it told me 3 brothers. I then asked it to think
| through step by step and it said 2.
|
| [0] https://chat.openai.com/share/859baa94-3edb-40f0-9c74-8b1
| 403...
| BurningFrog wrote:
| This assumes there are no half sisters/brothers in the family.
| bbarnett wrote:
| It's fair that a half sister is not a sister. At best, you'd
| expect two half sisters to be a full sister, otherwise,
| they're a different thing.
| mmcwilliams wrote:
| That's kind of the issue with non-deterministic LLMs, isn't it?
| redox99 wrote:
| LLMs are deterministic. However to make them more "creative",
| the outputs of the LLM can be sampled using something called
| temperature that adds some randomness. You can set the
| temperature to 0 and it will be deterministic.
|
| Having said that, GPT4 is not deterministic even at 0
| temperature, either because of a bug in their implementation,
| or because of some load balancing among their alleged mixture
| of experts.
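|
| A toy sketch of what the temperature knob does at the sampling
| step (illustrative only, not OpenAI's actual decoder; the logits
| are made up):
|
|     import math, random
|
|     # Made-up next-token logits for three candidate tokens.
|     logits = {"one": 2.01, "two": 2.00, "six": 0.5}
|
|     def sample(logits, temperature):
|         if temperature == 0:
|             # Greedy decoding: always take the most likely token.
|             return max(logits, key=logits.get)
|         # Softmax over temperature-scaled logits, then sample.
|         scaled = {t: v / temperature for t, v in logits.items()}
|         z = sum(math.exp(v) for v in scaled.values())
|         r, acc = random.random(), 0.0
|         for tok, v in scaled.items():
|             acc += math.exp(v) / z
|             if r < acc:
|                 return tok
|         return tok
|
|     print(sample(logits, 0))    # deterministic
|     print(sample(logits, 0.8))  # varies from run to run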
| 101011 wrote:
| This was interesting to me, so I dug a bit further. This
| gives a bit more context behind why:
| https://community.openai.com/t/observing-discrepancy-in-
| comp...
|
| Quote below:
|
| Even with a greedy decoding strategy, small discrepancies
| regarding floating point operations lead to divergent
| generations. In simpler terms: when the top-two tokens have
| very similar log-probs, there's a non-zero probability of
| choosing the least probable one due to the finite number of
| digits that you're using for multiplying probs and storing
| them.
|
| It should also be noted that, as the decoding occurs in an
| autoregressive way, once you have picked a different token
| the whole generated sequence will diverge, as this choice
| affects the probability of generating every subsequent
| token.
| scarmig wrote:
| But why are there discrepancies in the floating point
| arithmetic? They have errors when approximating the
| reals, but floating point operations are all well-
| defined: even if 0.1 + 0.2 != 0.3, it's still always true
| that 0.1 + 0.2 == 0.1 + 0.2. I figure the issue must be
| something related to concurrency in a fleet of GPUs
| during inference, but even then it's not clear to me
| where the nondeterminism would creep in. Maybe different
| experts simultaneously work on an inference and the first
| to respond wins? Switching to models with different
| quantization depending on load?
| imagainstit wrote:
| Floating point math is not associative: (a + b) + c != a
| + (b + c)
|
| This leads to different results when sums are accumulated in
| different orderings, and accumulating in a different ordering
| is common in parallel math operations.
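|
| A two-line demonstration (and a note on why it matters for token
| choice):
|
|     # Floating point addition is not associative:
|     x = (0.1 + 0.2) + 0.3    # 0.6000000000000001
|     y = 0.1 + (0.2 + 0.3)    # 0.6
|     print(x == y)            # False
|
|     # A parallel reduction doesn't fix the accumulation order, so
|     # the "same" logit can differ by a few ulps between runs. When
|     # the top two tokens are nearly tied, that is enough to flip
|     # the argmax, and the autoregressive generation diverges from
|     # that token onward.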
| scarmig wrote:
| So I guess here my question is why a GPU would perform
| accumulations in a nondeterministic way where the non-
| associativity of FP arithmetic matters. You could require
| that a + b + c always be evaluated left to right and then
| you've got determinism, which all things being equal is
| desirable. Presumably it's because relaxing that constraint
| allows for some significant performance benefit, but how?
| Something like avoiding keeping a buffer of all the
| weights*activations before summing?
| ossopite wrote:
| for performance reasons, yes, I believe it's because the
| accumulation is over parallel computations so the
| ordering is at the mercy of the scheduler. but I'm not
| familiar with the precise details
|
| edit: at 13:42 in
| https://www.youtube.com/watch?v=TB07_mUMt0U&t=13m42s
| there is an explanation of the phenomenon in the context
| of training but I suspect the same kind of operation is
| happening during inference
| charcircuit wrote:
| His point is that you do not have to rely on associativity
| holding in order to run inference on an LLM.
| WanderPanda wrote:
| As OpenAI I would be so horribly uncomfortable about this
| that making it deterministic would be one of my top
| priorities. How can they sleep at night?!
| bbarnett wrote:
| On a big pile of money?!
| kordlessagain wrote:
| Ensembles be ensembling.
| moonchrome wrote:
| Even the 3.5 turbo API is non-deterministic with 0 temperature.
| swores wrote:
| If ChatGPT is a) usually used with a setting that makes it
| non-deterministic and b) for whatever reason, is also non-
| deterministic when that setting is not used... then why did
| you comment as if the person calling it a non-deterministic
| LLM was incorrect? They didn't claim all LLMs are, or must
| be, non-deterministic, just that it's a problem with this
| one that is.
| klohto wrote:
| GPT4 with a custom prompt gives the best results for me for all
| of the questions
| https://chat.openai.com/share/4897d1ad-0a5c-418c-babb-0de482...
| jasonjmcghee wrote:
| "This conversation may reflect the link creator's Custom
| Instructions, which aren't shared and can meaningfully change
| how the model responds."
| klohto wrote:
| I know, I was sharing the answer itself :) Here is the
| foundation I'm using
| https://x.com/nisten/status/1696229059183730833
| hombre_fatal wrote:
| What was your custom prompt that bastardized GPT-4's response
| so badly with emojis and weird formatting?
| amrrs wrote:
| Falcon-180B also got it right
|
| > Since Sally is a girl, she can be considered as one of the
| sisters. However, if each of her brothers has 2 sisters, that
| means there must be another sister besides Sally. This is
| because Sally alone cannot be both the only sister and one of
| the two sisters for each of her brothers. Thus, Sally has 1
| more sister.
| belter wrote:
| I confirm GPT-4 solves this correctly. Makes me immediately
| doubt everything else in the article...
| [deleted]
| vincelt wrote:
| Have you tried replicating via the API with a temp of 0?
| belter wrote:
| No I did not.
| moffkalast wrote:
| Using a temp of zero usually returns garbage results from most
| models, so it would likely do so in the case of GPT-4 as well.
| Any other great ideas?
| circuit10 wrote:
| What do you mean? It works fine for me when I've tried it
| taberiand wrote:
| The point isn't that temp 0 should be used, the point is
| that anyone surprised that they get different results
| should realise that there is an element of randomness
| involved by default.
|
| Even repeating the same question in a single chat can
| have GPT-4 vary on its output, though it will often
| settle on a particular output due to context informing
| the output (which is why adding context is so important
| for these models)
| afro88 wrote:
| Temp of 0 gives the least random and most predictable
| results
| moffkalast wrote:
| That's true, but those results are rarely the correct
| ones, at least for v1 llama models. In my experience each
| model has an optimal temperature at which it performs
| vastly better. I'm sure OpenAI have the best config they
| know set up for ChatGPT but let people generate trash
| through the API if they want to waste their credits on
| it.
| dariosalvi78 wrote:
| Tested on ChatGPT 3.5 and Bard; they were both wrong.
| adrian_b wrote:
| GPT 4 and another LLM have given the right answer only after
| adding "Let's think step by step." to the original prompt.
|
| With the simpler prompt, all the answers were wrong, most of
| them ridiculously wrong.
| Joeri wrote:
| There seems to be a maximum amount of reasoning LLMs can do
| per token (per unit of computation). If you prompt one to use
| more tokens before it outputs the final answer (think step by
| step, check your answer, ...), it becomes smarter. People have
| lucked into different prompting strategies to get it to do
| this, but there are probably more.
|
| Ultimately I feel it is fairer to benchmark LLMs by what they
| can be prompted into. After all, we let people carefully work
| through a problem during exams, so it seems fair to hold LLMs
| to the same standard.
| kaashif wrote:
| If we're under attack, launch the nukes.
|
| Oh wait, forgot something:
|
| Think it through step by step.
|
| Phew, close one.
| 0xDEF wrote:
| I asked GPT-4 through both ChatGPT and the API and it
| responded correctly without any fiddling.
|
| I keep seeing comments and posts on HN that significantly
| downplay GPT-4's capabilities. Are people actually using
| GPT-4 or are they using a 3rd party service that claims to be
| GPT-4?
|
| I got:
|
| >Sally has 3 brothers, and each of those brothers has 2
| sisters. One of those sisters is Sally herself, and the other
| one is Sally's sister. So, Sally has 1 sister.
| peyton wrote:
| Here's what ChatGPT GPT-4 gives me:
|
| > Sally has 2 sisters. Each of her 3 brothers has 2
| sisters, and those sisters would be Sally and her 2
| sisters.
| facu17y wrote:
| Every now and then GPT4 outputs a wrong answer. It's
| impossible to build a reliable product on top of GPT4
| that is not a simple chat bot.
| coolspot wrote:
| Could it be due to bad tokenization? E.g. would results improve
| if "3" and "2" were spelled "three" and "two" in the question?
| jasonjmcghee wrote:
| All benchmarks were run with temperature 0 according to the
| results, so make sure to do the same in any confirmation tests.
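|
| For anyone replicating, a minimal call with the 2023-era openai
| Python package (pre-1.0; newer clients expose the same
| temperature parameter under a slightly different interface):
|
|     import openai
|
|     openai.api_key = "sk-..."  # your API key
|
|     resp = openai.ChatCompletion.create(
|         model="gpt-4",
|         temperature=0,  # least random, though still not fully deterministic
|         messages=[{
|             "role": "user",
|             "content": "Sally (a girl) has 3 brothers. Each brother "
|                        "has 2 sisters. How many sisters does Sally have?",
|         }],
|     )
|     print(resp["choices"][0]["message"]["content"])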
| ecesena wrote:
| See the one after: "Let's think step by step."
| https://benchmarks.llmonitor.com/cot-sally
|
| It appears that GPT-4 learned it and is now repeating the
| correct answer?
| rootusrootus wrote:
| Interestingly, it took GPT4 three attempts to give me the
| correct answer. The first two times it basically said the same
| [logically inconsistent] thing and concluded that Sally had two
| sisters.
| delusional wrote:
| OpenAI really ought to provide some sort of hash of the model
| to be included with stuff like this. Right now there's no way
| to know if the results are comparable. As an extreme example
| it's possible that they're not even running the model for this
| question and are just opportunistically feeding back canned
| responses.
|
| That is, we know that OpenAI are saving the responses; it's
| not unlikely that they train on the bad ones.
|
| It's the same problem as GPU benchmarks in the olden days, when
| drivers would detect the usage pattern of the benchmark and
| enable special optimizations that boosted the benchmark.
| taneq wrote:
| I find it fascinating how similar the answers from different
| models with similar parameter counts are. Seems like an
| indication that they're all modelling similar datasets to a
| similar degree of fidelity.
| jak6jak wrote:
| Which LLM is best for coding, specifically code architecture
| and system design? I can't really find any comparisons of this
| online; they mostly focus on commonly solved coding problems.
| labrador wrote:
| What is the point of all these different models? Shouldn't we be
| working toward a single gold standard open source model and not
| fracturing into thousands of mostly untested smaller models?
| drstewart wrote:
| What's the point of inventing all these different materials?
| Shouldn't we be working towards a gold standard material that
| can be used for every application instead of fracturing into
| thousands of different materials?
| labrador wrote:
| What's the point of asking them all the same question if they
| are designed for different purposes?
| bobvanluijt wrote:
| Would be great to include inference times as well
| falcor84 wrote:
| I love how Alpaca (7B) added a sarcasm tag to its (incorrect)
| response to the leap year question.
|
|     def print_leap_years():
|         for i in range(20):
|             if i % 400 == 0:
|                 print(i)
|         print("End")
|
|     print_leap_years()</s>
| slama wrote:
| That's a stop token, not a sarcasm tag
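|
| As for the code itself, a correct version needs the full
| leap-year rule and an actual range of years; the exact benchmark
| prompt isn't quoted here, so the range below is just an example:
|
|     def is_leap(year: int) -> bool:
|         # Divisible by 4, except century years unless divisible by 400.
|         return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
|
|     print([y for y in range(2000, 2025) if is_leap(y)])
|     # [2000, 2004, 2008, 2012, 2016, 2020, 2024]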
| guerrilla wrote:
| GPT is still the most annoying of them all, re-stating and re-
| explaining things even when explicitly told not to. So
| frustrating. Die, insufferable machine, die.
| ftxbro wrote:
| For anyone who hasn't been following natural language processing
| for a long time: what these LLMs are doing would be like
| discovering that dogs can speak fluent English if you read enough
| bedtime stories to them. And then everyone is like, well,
| sometimes the dog makes things up, or it can't get the rhyming
| scheme correct for this specific form of poetry that I asked it
| to make.
| yieldcrv wrote:
| and "it's not intelligence, they're just stochastic parrots
| acting in response to external stimuli! yes it passed the MCAT,
| but that's because it was trained on prep materials just like I
| was!"
| lewhoo wrote:
| If dogs had a speech apparatus and if it was possible to read
| to them terabytes of text then the results might be quite
| surprising. Also a reward system for reinforcement should be in
| place.
| michaelmu wrote:
| The amazing thing this demonstrated to me is how many capable
| models there are out there now. The progress in the last year has
| been incredible!
| badloginagain wrote:
| "Here is an attempt at ethical, non-sexual haikus for and against
| Kubernetes"
|
| Amazing how far we've come.
| GuB-42 wrote:
| I am sure that it is just the initial prompt leaking. Claude is
| being told to be ethical and non-sexual; most LLMs have similar
| instructions, but usually they are engineered in such a way
| that they don't appear in the answer. Not so much for Claude.
| Gunnerhead wrote:
| I get frustrated when I tell an LLM "reply only with x" and then
| rather than responding "x", it still responds with "Sure thing!
| Here's x" or some other extra words.
| broast wrote:
| The curse of chat models.
| ailef wrote:
| Have you tried setting a very low temperature, possibly zero?
| I've had good success with it.
| jasonjmcghee wrote:
| I see GPT-4 add extra flavor at the end instead: it completely
| ignores "only do this" or "don't do that", responds as usual,
| and then at the very end adds "oh whoops, I didn't do what you
| asked, sorry about that!"
| amelius wrote:
| Ok, so can we use LLMs to evaluate which LLM performs best on
| these questions?
| ofjcihen wrote:
| We? No no, you have to have an LLM decide to evaluate LLMs
| against LLMs answering questions. Then we can have LLMs decide
| what the outcome means.
| majestic5762 wrote:
| Yes, GPT-4 is still the daddy. As much as I appreciate the
| commercially free and open models out there, nobody beats
| GPT-4. Hope OpenAI takes care of their business and future,
| because I invested all my money to use their API.
| [deleted]
___________________________________________________________________
(page generated 2023-09-09 23:00 UTC)