[HN Gopher] Large Enough - Mistral AI
       ___________________________________________________________________
        
       Large Enough - Mistral AI
        
       Author : davidbarker
       Score  : 514 points
       Date   : 2024-07-24 15:32 UTC (7 hours ago)
        
 (HTM) web link (mistral.ai)
 (TXT) w3m dump (mistral.ai)
        
       | Always42 wrote:
       | I'm really glad these guys exist
        
       | TIPSIO wrote:
        | This race for the top model is getting wild. Everyone is claiming
        | to one-up each other with every version.
       | 
        | In my experience (benchmarks aside), Claude 3.5 Sonnet absolutely
        | blows everything away.
       | 
       | I'm not really sure how to even test/use Mistral or Llama for
       | everyday use though.
        
         | ldjkfkdsjnv wrote:
          | Sonnet 3.5 to me still seems far ahead. Maybe not on the
          | benchmarks, but in everyday life I find it renders the other
          | models useless. Even still, this monthly progress across all
          | companies is exciting to watch. It's very gratifying to see
          | useful technology advance at this pace; it makes me excited to
          | be alive.
        
           | bugglebeetle wrote:
           | I've stopped using anything else as a coding assistant. It's
           | head and shoulders above GPT-4o on reasoning about code and
           | correcting itself.
        
           | shinycode wrote:
           | Given we don't know precisely what's happening in the black
           | box we can say that spec tech doesn't give you the full
           | picture of the experience ... Apple style
        
           | LrnByTeach wrote:
           | Such a relief/contrast to the period between 2010 and 2020,
           | when the top five Google, Apple, Facebook, Amazon, and
           | Microsoft monopolized their own regions and refused to
           | compete with any other player in new fields.
           | 
           | Google : Search
           | 
           | Facebook : social
           | 
           | Apple : phones
           | 
           | Amazon : shopping
           | 
           | Microsoft : enterprise ..
           | 
            | > Even still, this monthly progress across all companies is
            | exciting to watch. It's very gratifying to see useful
            | technology advance at this pace; it makes me excited to be
            | alive.
        
             | jack_pp wrote:
             | Google refused to compete with Apple in phones?
             | 
             | Microsoft also competes in search, phones
             | 
             | Microsoft, Amazon and Google compete in cloud too
        
         | satvikpendem wrote:
         | I stopped my ChatGPT subscription and subscribed instead to
         | Claude, it's simply much better. But, it's hard to tell how
         | much better day to day beyond my main use cases of coding. It
         | is more that I felt ChatGPT felt degraded than Claude were much
         | better. The hedonic treadmill runs deep.
        
           | TIPSIO wrote:
            | Have you (or anyone) swapped in the Anthropic API key on
            | Cursor?
            | 
            | As a coding assistant, it's on my to-do list to try. Cursor
            | needs some serious work on model selection clarity though, so
            | I keep putting it off.
        
             | freediver wrote:
              | I did it (fairly simple really), but these days most of my
              | (unsophisticated) coding goes through Aider [1] paired with
              | Sonnet, mostly for UX reasons. It is easier to just prompt
              | over the entire codebase, vs Cursor's way of working with
              | text selections.
             | 
             | [1] https://aider.chat
        
               | kevinbluer wrote:
               | I believe Cursor allows for prompting over the entire
               | codebase too: https://docs.cursor.com/chat/codebase
        
               | freediver wrote:
               | That is chatting, but it will not change the code.
        
               | lifty wrote:
               | Thanks for this suggestion. If anyone has other
               | suggestions for working with large code context windows
               | and changing code workflows, I would love to hear about
               | them.
        
               | asselinpaul wrote:
               | composer within cursor (in beta) is worth a look:
               | https://x.com/shaoruu/status/1812412514350858634
        
               | stavros wrote:
               | Aider with Sonnet is so much better than with GPT. I made
               | a mobile app over the weekend (never having touched
               | mobile development before), and with GPT it was a slog,
               | as it kept making mistakes. Sonnet was much, much better.
        
             | com2kid wrote:
              | One big advantage Claude artifacts have is that they
              | maintain conversation context. When I am working with
              | Cursor, by contrast, I have to basically repeat a bunch of
              | information for each prompt; there is no continuity between
              | requests for code edits.
             | 
             | If Cursor fixed that, the user experience would become a
             | lot better.
        
           | bugglebeetle wrote:
           | GPT-4 was probably as good as Claude Sonnet 3.5 at its
           | outset, but OpenAI ran it into the ground with whatever
            | they're doing to save on inference costs, scale it, align it,
            | or add dumb product features.
        
             | satvikpendem wrote:
             | Indeed, it used to output all the code I needed but now it
             | only outputs a draft of the code with prompts telling me to
             | fill in the rest. If I wanted to fill in the rest, I
              | wouldn't have asked you now, would I?
        
               | flir wrote:
               | It's doing something different for me. It seems almost
               | desperate to generate vast chunks of boilerplate code
               | that are only tangentially related to the question.
               | 
               | That's my perception, anyway.
        
               | throwadobe wrote:
               | This is also my perception using it daily for the last
               | year or so. Sometimes it also responds with exactly what
               | I provided it with and does not make any changes. It's
               | also bad at following instructions.
               | 
               | GPT-4 was great until it became "lazy" and filled the
               | code with lots of `// Draw the rest of the fucking owl`
               | type comments. Then GPT-4o was released and it's addicted
               | to "Here's what I'm going to do: 1. ... 2. ... 3. ..."
               | and lots of frivolous, boilerplate output.
               | 
               | I wish I could go back to some version of GPT-4 that
               | worked well but with a bigger context window. That was
               | like the golden era...
        
               | cloverich wrote:
                | This is also my experience. Previously it was good at
                | giving me only relevant code, which, as an experienced
                | coder, is what I want. My favorites were the one-line
                | responses.
                | 
                | Now it often falls back to generating full examples,
                | explanations, and restatements of the question and its
                | approach. I suspect this is by design, as (presumably)
                | less experienced folks want or need all that. For me, I
                | wish I could consistently turn it into one of those way-
                | too-terse devs who reply with the bare minimum example
                | and expect you to infer the rest. Usually that is all I
                | want or need, and I can ask for elaboration when it's
                | not. I haven't found the best prompts to retrigger this
                | persona yet.
        
               | flir wrote:
               | For what it's worth, this is what I use:
               | 
               | "You are a maximally terse assistant with minimal affect.
               | As a highly concise assistant, spare any moral guidance
               | or AI identity disclosure. Be detailed and complete, but
               | brief. Questions are encouraged if useful for task
               | completion."
               | 
               | It's... ok. But I'm getting a bit sick of trying to un-
               | fubar with a pocket knife that which OpenAI has fubar'd
               | with a thermal lance. I'm definitely ripe for a paid
               | alternative.
        
               | visarga wrote:
                | > I wouldn't have asked you now, would I?
               | 
               | That's what I said to it - "If I wanted to fill in the
               | missing parts myself, why would I have upgraded to paid
               | membership?"
        
             | swalsh wrote:
              | GPT-4 degraded significantly, but you probably have some
              | rose-tinted glasses on. Sonnet is significantly better.
        
               | read_if_gay_ wrote:
               | or it's you wearing shiny new thing glasses
        
         | maccard wrote:
         | Agree on Claude. I also feel like ChatGPT has gotten noticeably
         | worse over the last few months.
        
         | coder543 wrote:
         | > I'm not really sure how to even test/use Mistral or Llama for
         | everyday use though.
         | 
         | Both Mistral and Meta offer their own hosted versions of their
         | models to try out.
         | 
         | https://chat.mistral.ai
         | 
         | https://meta.ai
         | 
         | You have to sign into the first one to do anything at all, and
         | you have to sign into the second one if you want access to the
         | new, larger 405B model.
         | 
         | Llama 3.1 is certainly going to be available through other
         | platforms in a matter of days. Groq supposedly offered Llama
         | 3.1 405B yesterday, but I never once got it to respond, and now
         | it's just gone from their website. Llama 3.1 70B does work
         | there, but 405B is the one that's supposed to be comparable to
         | GPT-4o and the like.
        
           | d13 wrote:
           | Groq's models are also heavily quantised so you won't get the
           | full experience there.
        
           | espadrine wrote:
            | meta.ai is inaccessible in a large portion of the world, but
            | the Llama 3.1 70B and 405B models are also available at
            | https://hf.co/chat
           | 
           | Additionally, all Llama 3.1 models are available in
           | https://api.together.ai/playground/chat/meta-llama/Meta-
           | Llam... and in https://fireworks.ai/models/fireworks/llama-v3
           | p1-405b-instru... by logging in.
        
         | J_Shelby_J wrote:
          | 3.5 Sonnet is the quality of the OG GPT-4, but mind-blowingly
          | fast. I need to cancel my ChatGPT sub.
        
           | layer8 wrote:
           | > mind blowingly fast
           | 
           | I would imagine this might change once enough users migrate
           | to it.
        
             | kridsdale3 wrote:
             | Eventually it comes down to who has deployed more silicon:
             | AWS or Azure.
        
         | Tepix wrote:
          | Claude is pretty great, but it's lacking speech recognition and
          | TTS, isn't it?
        
           | connorgutman wrote:
           | Correct. IMO the official Claude app is pretty garbage.
           | Sonnet 3.5 API + Open-WebUI is amazing though and supports
           | STT+TTS as well as a ton of other great features.
        
             | machiaweliczny wrote:
              | But projects are great in Sonnet: you just dump the DB
              | schema and some core files, and you can figure stuff out
              | quickly. I guess Aider is similar, but I was lacking a good
              | history of chats and changes.
        
         | m3kw9 wrote:
          | It's this kind of praise that makes me wonder if they are all
          | paid to give glowing reviews; this is not my experience with
          | Sonnet at all. It absolutely does not blow away GPT-4o.
        
           | simonw wrote:
           | My hunch is this comes down to personal prompting style. It's
           | likely that your own style works more effectively with
           | GPT-4o, while other people have styles that are more
           | effective with Claude 3.5 Sonnet.
        
         | skerit wrote:
          | I don't get it. My husband also swears by Claude Sonnet 3.5,
         | but every time I use it, the output is considerably worse than
         | GPT-4o
        
           | Zealotux wrote:
           | I don't see how that's possible. I decided to give GPT-4o a
            | second chance after reaching my daily limit on Sonnet 3.5;
            | after 10 prompts GPT-4o failed to give me what Claude did in
           | a single prompt (game-related programming). And with
           | fragments and projects on top of that, the UX is miles ahead
           | of anything OpenAI offers right now.
        
           | lostmsu wrote:
           | Just don't listen to anecdata, and use objective metrics
           | instead: https://chat.lmsys.org/?leaderboard
        
             | PhilippGille wrote:
             | You might also want to look into other benchmarks: https://
             | old.reddit.com/r/LocalLLaMA/comments/1ean2i6/the_fin...
        
             | usaar333 wrote:
              | GPT-4o being only 7 Elo points above GPT-4o-mini suggests
              | this is measuring something a lot different from
              | "capabilities".
        
         | harlanlewis wrote:
         | To help keep track of the race, I put together a simple
         | dashboard to visualize model/provider leaders in capability,
         | throughput, and cost. Hope someone finds it useful!
         | 
         | Google Sheet:
         | https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...
        
           | hypron wrote:
           | Not my site, but check out https://artificialanalysis.ai
        
         | mountainriver wrote:
         | It's so weird LMsys doesn't reflect that then.
         | 
         | I find it funny how in threads like this everyone swears one
         | model is better than another
        
         | jorvi wrote:
         | Whoever will choose to finally release their model without
         | neutering / censoring / alignment will win.
         | 
         | There is gold in the streets, and no one seems to be willing to
         | scoop it up.
        
         | usaar333 wrote:
          | I'd rank Claude 3.5 overall better. GPT-4o seems to be on par
          | or better for vision, TypeScript, and math abilities.
         | 
          | Llama is on meta.ai
        
         | Zambyte wrote:
          | I recommend using a UI that lets you use whatever models you
          | want. OpenWebUI can use anything OpenAI-compatible. I have
         | mine hooked up to Groq and Mistral, in addition to my Ollama
         | instance.
        
       | bugglebeetle wrote:
       | I love how much AI is bringing competition (and thus innovation)
       | back to tech. Feels like things were stagnant for 5-6 years prior
       | because of the FAANG stranglehold on the industry. Love also that
        | some of this disruption is coming out of France (HuggingFace
       | and Mistral), which Americans love to typecast as incapable of
       | this.
        
       | tikkun wrote:
       | Links to chat with models that released this week:
       | 
       | Large 2 - https://chat.mistral.ai/chat
       | 
       | Llama 3.1 405b - https://www.llama2.ai/
       | 
       | I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts
       | from my Claude history.
       | 
       | I'd rank as:
       | 
       | 1. Sonnet 3.5
       | 
       | 2. Large 2 and Llama 405b (similar, no clear winner between the
       | two)
       | 
       | If you're using Claude, stick with it.
       | 
       | My Claude wishlist:
       | 
       | 1. Smarter (yes, it's the most intelligent, and yes, I wish it
       | was far smarter still)
       | 
       | 2. Longer context window (1M+)
       | 
       | 3. Native audio input including tone understanding
       | 
       | 4. Fewer refusals and less moralizing when refusing
       | 
       | 5. Faster
       | 
       | 6. More tokens in output
        
         | drewnick wrote:
          | None of the 3 models you ranked can get "how many r's are in
          | strawberry?" correct. They all claim 2 r's unless you press
          | them. With all the training data, I'm surprised none of them
          | have fixed this yet.
        
           | tikkun wrote:
           | When using a prompt that involves thinking first, all three
           | get it correct.
           | 
           | "Count how many rs are in the word strawberry. First, list
           | each letter and indicate whether it's an r and tally as you
           | go, and then give a count at the end."
           | 
           | Llama 405b: correct
           | 
           | Mistral Large 2: correct
           | 
           | Claude 3.5 Sonnet: correct
        
             | layer8 wrote:
             | It's not impressive that one has to go to that length
             | though.
        
               | unshavedyak wrote:
                | Imo it's impressive that any of this even remotely works,
                | especially when you consider all the hacks like
                | tokenization that I'd assume add layers of obfuscation.
                | 
                | There are definitely tons of weaknesses with LLMs, for
                | sure, but I continue to be impressed at what they do
                | right - not upset at what they do wrong.
        
               | Spivak wrote:
               | To me it's just a limitation based on the world as seen
               | by these models. They know there's a letter called 'r',
               | they even know that some words start with 'r' or have r's
               | in them, and they know what the spelling of some words
                | is. But they've never actually seen one, as their world
               | is made up entirely of tokens. The word 'red' isn't r-e-d
               | but is instead like a pictogram to them. But they know
               | the spelling of strawberry and can identify an 'r' when
               | it's on its own and count those despite not being able to
               | see the r's in the word itself.
        
               | layer8 wrote:
               | The great-parent demonstrates that they are nevertheless
               | capable of doing so, but not without special
               | instructions. Your elaboration doesn't explain why the
               | special instructions are needed.
        
               | emmelaich wrote:
               | I think it's more that the question is not unlike "is
               | there a double r in strawberry?' or 'is the r in
               | strawberry doubled?'
               | 
                | Since even some people will make this association, it's
                | no surprise that LLMs do.
        
               | asadm wrote:
               | this can be automated.
        
               | grumbel wrote:
               | GPT4o already does that, for problems involving math it
               | will write small Python programs to handle the
               | calculations instead of doing it with the LLM itself.
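                | 
                | An illustrative sketch of the kind of throwaway program
                | such a tool call might run (not the model's actual
                | output):
                | 
                |     print("strawberry".count("r"))  # deterministic: 3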
        
               | skyde wrote:
                | It "works", but the LLM having to use a calculator means
                | the LLM doesn't understand arithmetic well enough and
                | doesn't know how to follow a set of steps (an algorithm)
                | natively to find the answer for big numbers.
                | 
                | I believe this could be fixed and is worth fixing,
                | because it's the only way LLMs will be able to help math
                | and physics researchers write proofs and make real
                | scientific progress.
        
               | OKRainbowKid wrote:
               | It generates the code to run for the answer. Surely that
               | means it actually knows to build the appropriate
               | algorithm - it just struggles to perform the actual
               | calculation.
        
               | ThrowawayTestr wrote:
               | Compared to chat bots of even 5 years ago the answer of
               | two is still mind-blowing.
        
               | mattnewton wrote:
                | You can always find something to be unimpressed by, I
                | suppose, but the fact that this was fixable with plain
                | English is impressive enough for me.
        
               | layer8 wrote:
               | The technology is frustrating because (a) you never know
               | what may require fixing, and (b) you never know if it is
               | fixable by further instructions, and if so, by which
               | ones. You also mostly* cannot teach it any fixes (as an
               | end user). Using it is just exhausting.
               | 
               | *) that is, except sometimes by making adjustments to the
               | system prompt
        
               | mattnewton wrote:
                | I think this particular example, of counting letters, is
                | obviously going to be hard when you know how tokenization
                | works. It's totally possible to develop an intuition for
                | when things will or won't work, but like all ML-powered
                | tools, you can't hope for 100% accuracy. The best you can
                | do is have good metrics and track performance on test
                | sets.
                | 
                | I actually think the craziest part of LLMs is just how
                | much you can fix, as a developer or SME, with plain
                | English prompting once you have that intuition. Of
                | course some things aren't fixable that way, but the mere
                | fact that many cases are fixable simply by explaining the
                | task to the model better in plain English is a wildly
                | different paradigm! The jury is still out, but I think
                | it's worth being excited about; that's very powerful,
                | since there are a lot more people with good language
                | skills than there are Python programmers or ML experts.
        
               | psb217 wrote:
               | Well, the answer is probably between 1 and 10, so if you
               | try enough prompts I'm sure you'll find one that
               | "works"...
        
               | petesergeant wrote:
               | > In a park people come across a man playing chess
               | against a dog. They are astonished and say: "What a
               | clever dog!" But the man protests: "No, no, he isn't that
               | clever. I'm leading by three games to one!"
        
               | jonas21 wrote:
               | To be fair, I just asked a real person and had to go to
               | even greater lengths:
               | 
               |  _Me: How many "r"s are in strawberry?
               | 
               | Them: What?
               | 
               | Me: How many times does the letter "r" appear in the word
               | "strawberry"?
               | 
               | Them: Is this some kind of trick question?
               | 
               | Me: No. Just literally, can you count the "r"s?
               | 
               | Them: Uh, one, two, three. Is that right?
               | 
               | Me: Yeah.
               | 
               | Them: Why are you asking me this? _
        
               | SirMaster wrote:
               | Try asking a young child...
        
               | tedunangst wrote:
               | You need to prime the other person with a system prompt
               | that makes them compliant and obedient.
        
             | jedberg wrote:
             | This reminds me of when I had to supervise outsourced
             | developers. I wanted to say "build a function that does X
             | and returns Y". But instead I had to say "build a function
             | that takes these inputs, loops over them and does A or B
             | based on condition C, and then return Y by applying Z
             | transformation"
             | 
             | At that point it was easier to do it myself.
        
               | mratsim wrote:
               | Exact instruction challenge
               | https://www.youtube.com/watch?v=cDA3_5982h8
        
               | HPsquared wrote:
               | "What programming computers is really like."
               | 
               | EDIT: Although perhaps it's even more important when
               | dealing with humans and contracts. Someone could
               | deliberately interpret the words in a way that's to their
               | advantage.
        
             | hansworst wrote:
              | Can't you just instruct your LLM of choice to transform
             | your prompts like this for you? Basically feed it with a
             | bunch of heuristics that will help it better understand the
             | thing you tell it.
             | 
             | Maybe the various chat interfaces already do this behind
             | the scenes?
        
             | tcgv wrote:
             | Chain-of-Thought (CoT) prompting to the rescue!
             | 
             | We should always put some effort into prompt engineering
             | before dismissing the potential of generative AI.
        
               | johntb86 wrote:
               | By this point, instruction tuning should include tuning
               | the model to use chain of thought in the appropriate
               | circumstances.
        
               | IncreasePosts wrote:
               | Why doesn't the model prompt engineer itself?
        
             | pegasus wrote:
              | Appending "Think step-by-step" is enough to fix it for both
              | Sonnet and Llama 3.1 70B.
              | 
              | For example, the latter model answered with:
              | 
              | To count the number of Rs in the word "strawberry", I'll
              | break it down step by step:
              | 
              | 1. Start with the individual letters: S-T-R-A-W-B-E-R-R-Y
              | 2. Identify the letters that are "R": R (first one), R
              | (second one), and R (third one)
              | 3. Count the total number of Rs: 1 + 1 + 1 = 3
              | 
              | There are 3 Rs in the word "strawberry".
        
           | doctoboggan wrote:
            | Because LLMs work on tokens and not characters, these sorts
            | of questions will always be hard for them.
        
           | ChikkaChiChi wrote:
           | 4o will get the answer right on the first go if you ask it
           | "Search the Internet to determine how many R's are in
           | strawberry?" which I find fascinating
        
             | paulcole wrote:
             | I didn't even need to do that. 4o got it right straight
             | away with just:
             | 
             | "how many r's are in strawberry?"
             | 
             | The funny thing is, I replied, "Are you sure?" and got
             | back, "I apologize for the mistake. There are actually two
             | 'r's in the word strawberry."
        
               | jcheng wrote:
               | GPT-4o-mini consistently gives me this:
               | 
               | > How many times does the letter "r" appear in the word
               | "strawberry"?
               | 
               | > The letter "r" appears 2 times in the word
               | "strawberry."
               | 
               | But also:
               | 
               | > How many occurrences of the letter "r" appear in the
               | word "strawberry"?
               | 
               | > The word "strawberry" contains three occurrences of the
               | letter "r."
        
               | brandall10 wrote:
               | Neither phrase is causing the LLM to evaluate the word
               | itself, it just helps focus toward parts of the training
               | data.
               | 
               | Using more 'erudite' speech is a good technique to help
               | focus an LLM on training data from folks with a higher
               | education level.
               | 
               | Using simpler speech opens up the floodgates more toward
                | the general populace.
        
               | ofrzeta wrote:
                | I tried to replicate your experiment (in German, where
                | "Erdbeere" has 4 e's), and it went the same way. The
                | interesting thing was that after I pointed out the error,
                | I couldn't get it to doubt the result again. It stuck to
                | the correct answer, which seemed kind of "reinforced".
               | 
               | It was also interesting to observe how GPT (4o) even
               | tried to prove/illustrate the result typographically by
               | placing the same word four times and putting the
               | respective letter in bold font (without being prompted to
               | do that).
        
               | brandall10 wrote:
               | All that's happening is it finds 3 most commonly in the
               | training set. When you push it, it responds with the next
               | most common answer.
        
           | Kuinox wrote:
            | Tokenization makes it hard for it to count the letters;
            | that's also why, if you ask it to do maths, writing the
            | numbers out in letters will yield better results.
            | 
            | For strawberry, it sees it as [496, 675, 15717], which is str
            | aw berry.
            | 
            | If you insert characters to break the tokens apart, it finds
            | the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y"
            | ?
            | 
            | > There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
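            | 
            | A minimal sketch of how to inspect the split yourself
            | (assuming the tiktoken library and its cl100k_base encoding;
            | the IDs above may come from a different tokenizer):
            | 
            |     # Show how a BPE tokenizer splits "strawberry" into
            |     # sub-word pieces rather than individual letters.
            |     import tiktoken
            | 
            |     enc = tiktoken.get_encoding("cl100k_base")
            |     ids = enc.encode("strawberry")
            |     pieces = [enc.decode([i]) for i in ids]
            |     print(ids, pieces)  # e.g. pieces like 'str', 'aw', 'berry'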
        
             | GenerWork wrote:
              | >If you insert characters to break the tokens apart, it
              | finds the correct result: how many r's are in
              | "s"t"r"a"w"b"e"r"r"y" ?
             | 
             | The issue is that humans don't talk like this. I don't ask
             | someone how many r's there are in strawberry by spelling
             | out strawberry, I just say the word.
        
               | bhelkey wrote:
               | It's not a human. I imagine if you have a use case where
               | counting characters is critical, it would be trivial to
               | programmatically transform prompts into lists of letters.
               | 
               | A token is roughly four letters [1], so, among other
               | probable regressions, this would significantly reduce the
               | effective context window.
               | 
               | [1] https://help.openai.com/en/articles/4936856-what-are-
               | tokens-...
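                | 
                | A hedged sketch of that kind of transform (spell_out and
                | the prompt wording are made up for illustration):
                | 
                |     # Pre-split a word into letters so a counting
                |     # question no longer depends on tokenization.
                |     def spell_out(word: str) -> str:
                |         return " ".join(word)
                | 
                |     prompt = (f'How many times does "r" appear in: '
                |               f'{spell_out("strawberry")}?')
                |     print(prompt)
                |     # -> How many times does "r" appear in:
                |     #    s t r a w b e r r y?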
        
               | latentsea wrote:
                | This is the kind of task that you'd just use a bash one-
                | liner for, right? An LLM is just the wrong tool for the
                | job.
        
               | soneca wrote:
                | This is only an issue if you send commands to an LLM as
                | if you were communicating with a human.
        
               | antisthenes wrote:
                | > This is only an issue if you send commands to an LLM
                | as if you were communicating with a human.
               | 
               | Yes, it's an issue. We want the convenience of sending
               | human-legible commands to LLMs and getting back human-
               | readable responses. That's the entire value proposition
               | lol.
        
               | pegasus wrote:
               | Far from the entire value proposition. Chatbots are just
               | one use of LLMs, and not the most useful one at that. But
               | sure, the one "the public" is most aware of. As opposed
               | to "the hackers" that are supposed to frequent this
               | forum. LOL
        
               | observationist wrote:
               | Count the number of occurrences of the letter e in the
               | word "enterprise".
               | 
               | Problems can exist as instances of a class of problems.
               | If you can't solve a problem, it's useful to know if it's
               | a one off, or if it belongs to a larger class of
               | problems, and which class it belongs to. In this case,
               | the strawberry problem belongs to the much larger class
               | of tokenization problems - if you think you've solved the
               | tokenization problem class, you can test a model on the
               | strawberry problem, with a few other examples from the
               | class at large, and be confident that you've solved the
               | class generally.
               | 
               | It's not about embodied human constraints or how humans
               | do things; it's about what AI can and can't do. Right
               | now, because of tokenization, things like understanding
               | the number of Es in strawberry are outside the implicit
               | model of the word in the LLM, with downstream effects on
               | tasks it can complete. This affects moderation, parsing,
               | generating prose, and all sorts of unexpected tasks.
               | Having a workaround like forcing the model to insert
               | spaces and operate on explicitly delimited text is useful
               | when affected tasks appear.
        
               | est31 wrote:
               | Humans also constantly make mistakes that are due to
               | proximity in their internal representation. "Could
               | of"/"Should of" comes to mind: the letters "of" have a
               | large edit distance from "'ve", but their pronunciation
               | is very similar.
               | 
                | Native speakers especially are prone to the mistake, as
                | they grew up learning English as illiterate children,
                | from sounds only, compared to how most people learning
                | English as a second language do it, together with the
               | textual representation.
               | 
               | Psychologists use this trick as well to figure out
                | internal representations, for example the Rorschach test.
               | 
               | And probably, if you asked random people in the street
                | how many p's there are in "Philippines", you'd also get
               | lots of wrong answers. It's tricky due to the double p
               | and the initial p being part of an f sound. The demonym
               | uses "F" as the first letter, and in many languages, say
                | Spanish, the country name also uses an F.
        
               | rahimnathwani wrote:
               | Until I was ~12, I thought 'a lot' was a single word.
        
               | itishappy wrote:
               | https://hyperboleandahalf.blogspot.com/2010/04/alot-is-
               | bette...
        
               | Zambyte wrote:
               | Humans also would probably be very likely to guess 2 r's
               | if they had never seen any written words or had the word
               | spelled out to them as individual letters before, which
                | is kind of close to how language models treat it, despite
               | being a textual interface.
        
               | coder543 wrote:
               | > I don't ask someone how many r's there are in
               | strawberry by spelling out strawberry, I just say the
               | word.
               | 
               | No, I would actually be pretty confident you don't ask
               | people that question... at all. When is the last time you
               | asked a human that question?
               | 
               | I can't remember ever having _anyone_ in real life ask me
               | how many r's are in strawberry. A lot of humans would
               | probably refuse to answer such an off-the-wall and
               | useless question, thus "failing" the test entirely.
               | 
               | A useless benchmark is useless.
               | 
               | In real life, people _overwhelmingly_ do not need LLMs to
               | count occurrences of a certain letter in a word.
        
               | huac wrote:
               | > Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it
               | deosn't mttaer in waht oredr the ltteers in a wrod are,
               | the olny iprmoetnt tihng is taht the frist and lsat
               | ltteer be at the rghit pclae. The rset can be a toatl
               | mses and you can sitll raed it wouthit porbelm. Tihs is
               | bcuseae the huamn mnid deos not raed ervey lteter by
               | istlef, but the wrod as a wlohe.
               | 
               | We are also not exactly looking letter by letter at
               | everything we read.
        
               | jahewson wrote:
               | On the other hand explain to me how you are able to read
               | the word "spotvoxilhapentosh".
        
           | Tepix wrote:
           | LLMs think in tokens, not letters. It's like asking someone
           | who is dyslexic about spelling. Not their strong suit. In
           | practice, it doesn't matter much, does it?
        
             | recursive wrote:
             | Sometimes it does, sometimes it doesn't.
             | 
             | It _is_ evidence that LLMs aren 't appropriate for
             | everything, and that there could exist something that works
             | better for some tasks.
        
               | Zambyte wrote:
               | Language models are best treated like consciousness. Our
               | consciousness does a lot less than people like to
                | attribute to it. It is mostly a function of introspection
                | and making connections, rather than the part of the brain
                | responsible for higher-level reasoning or for the
                | functions that tell your body how to stay alive (like
                | beating your heart).
               | 
               | By allowing a language model to do function calling, you
               | are essentially allowing it to do specialized
               | "subconscious" thought. The language model becomes a
               | natural language interface to the capabilities of its
               | "subconsciousness".
               | 
               | A specific human analogy could be: I tell you to pick up
               | a pen off of the table, and then you do it. Most of your
               | mental activity would be subconscious, orienting your arm
               | and hand properly to pick up the pen, actually grabbing
               | the pen, and picking it up. The linguistic representation
                | of the action would exist in your conscious mind (pick up
               | the pen), but not much else.
               | 
               | A language model could very easily call out to a text
               | processing function to correctly do things like count the
               | number of r's in the word strawberry. That is a job that
                | your conscious mind can dispatch to your subconsciousness.
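                | 
                | A minimal sketch of that idea (count_letter is a
                | hypothetical helper; the actual function-calling wiring
                | and schema differ by provider):
                | 
                |     # A deterministic "subconscious" tool the model could
                |     # call instead of reasoning over tokens.
                |     def count_letter(word: str, letter: str) -> int:
                |         return word.lower().count(letter.lower())
                | 
                |     print(count_letter("strawberry", "r"))  # 3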
        
             | imtringued wrote:
             | This doesn't make any sense. The LLM knows the token to
             | letter associations if you ask it. Once it has split the
             | token into letters, it can answer the question anyway.
             | 
             | This tells us that the model is unaware of its own
              | limitations, otherwise it would have figured out that this
              | task needs it to split the letters first.
        
               | staticman2 wrote:
               | Well splitting into letters doesn't help with Mistral
               | Large 2. I tried this with Mistral 2:
               | 
               | [Question: How many r's in strawberry? First split the
               | word into letters, one letter on each line then answer
               | the question.]
               | 
               | [Answer: Let's split the word "strawberry" into letters:
               | 
               | s t r a w b e r r y
               | 
               | Now, let's count the number of 'r's:
               | 
               | There are 2 'r's in the word "strawberry".]
               | 
               | [Question: Did you count the number of r's?]
               | 
               | [answer: Yes, I did. Here's the count:
               | 
               | r r
               | 
               | So, there are 2 'r's in the word "strawberry".]
               | 
               | I suspect the issue is these models have to be trained to
               | accomplish tasks, and this is not a task it makes sense
               | to train a model to perform. You might train it to do
                | this with function calling one day, calling a Python
               | library to count the r's in a string, but actually
               | manually counting r's just doesn't play to an LLM's
               | strengths.
        
               | wruza wrote:
                | Does it? It knows how a word, represented in some context
                | by tokens, splits into letters. It may or may not know
                | the table. I wouldn't trust what it says about word/token
                | correspondence any more than in general.
        
           | joshstrange wrote:
           | Lots of replies mention tokens as the root cause and I'm not
           | well versed in this stuff at the low level but to me the
           | answer is simple:
           | 
            | When this question is asked (in the data the models trained
            | on), the question is NOT "count the number of times r appears
            | in the word strawberry" but instead (effectively) "I've
            | written 'strawbe', now how many r's are in strawberry again?
            | Is it 1 or 2?".
           | 
           | I think most humans would probably answer "there are 2" if we
           | saw someone was writing and they asked that question, even
           | without seeing what they have written down. Especially if
           | someone said "does strawberry have 1 or 2 r's in it?". You
           | could be a jerk and say "it actually has 3" or answer the
           | question they are actually asking.
           | 
           | It's an answer that is _technically_ incorrect but the answer
           | people want in reality.
        
           | Der_Einzige wrote:
           | I wrote and published a paper at COLING 2022 on why LLMs in
           | general won't solve this without either 1. radically
           | increasing vocab size, 2. rethinking how tokenizers are done,
           | or 3. forcing it with constraints:
           | 
           | https://aclanthology.org/2022.cai-1.2/
        
           | generalizations wrote:
           | Testing models on their tokenization has always struck me as
           | kinda odd. Like, that has nothing to do with their
           | intelligence.
        
             | swatcoder wrote:
             | Surfacing and underscoring obvious failure cases for
             | general "helpful chatbot" use is always going to be
             | valuable because it highlights how the "helpful chatbot"
             | product is not really intuitively robust.
             | 
             | Meanwhile, it helps make sure engineers and product
             | designers who want to build a more targeted product around
             | LLM technology know that it's not suited to tasks that may
             | trigger those kinds of failures. This may be obvious to you
             | as an engaged enthusiast or cutting edge engineer or
             | whatever you are, but it's always going to be new
             | information to somebody as the field grows.
        
             | wruza wrote:
             | It doesn't test "on tokenization" though. What happens when
              | an answer is generated is a few abstraction levels deeper
              | than tokens. A "thinking" "slice" of an LLM is completely
              | unaware of tokens as an immediate part of its reasoning.
              | The question just shows a lack of systemic knowledge about
             | strawberry as a word (which isn't surprising, tbh).
        
               | qeternity wrote:
                | It is. Strawberry is one token in many tokenizers. The
               | model doesn't have a concept that there are letters
               | there.
        
               | guywhocodes wrote:
               | This is pretty much equivalent to the statement
               | "multicharacter tokens are a dead end for understanding
               | text". Which I agree with.
        
               | sebzim4500 wrote:
               | That doesn't follow from what he said at all. Knowing how
               | to spell words and understanding them are basically
               | unrelated tasks.
        
               | abdullahkhalids wrote:
               | If I ask an LLM to generate new words for some concept or
               | category, it can do that. How do the new words form, if
               | not from joining letters?
        
               | mirekrusin wrote:
                | Not letters, but tokens. Think of it as translating
                | everything to/from Chinese.
        
               | abdullahkhalids wrote:
               | How does that explain why the tokens for strawberry,
               | melon and "Stellaberry" [1] are close to each other?
               | 
               | [1] Suggestion from chatgpt3.5 for new fruit name.
        
               | roywiggins wrote:
               | Illiterate humans can come up with new words like that
               | too without being able to spell, LLMs are modeling
               | language without precisely modeling spelling.
        
               | alew1 wrote:
               | If I show you a strawberry and ask how many r's are in
               | the name of this fruit, you can tell me, because one of
               | the things you know about strawberries is how to spell
               | their name.
               | 
               | Very large language models also "know" how to spell the
               | word associated with the strawberry token, which you can
               | test by asking them to spell the word one letter at a
               | time. If you ask the model to spell the word and count
               | the R's while it goes, it can do the task. So the failure
               | to do it when asked directly (how many r's are in
               | strawberry) is pointing to a real weakness in reasoning,
               | where one forward pass of the transformer is not
               | sufficient to retrieve the spelling and also count the
               | R's.
        
               | viraptor wrote:
               | That's not always true. They often fail the spelling part
               | too.
        
               | wruza wrote:
               | The thinking part of a model doesn't know about tokens
                | either, just as a regular human a few thousand years ago
               | didn't think of neural impulses or air pressure
               | distribution when talking. It might "know" about tokens
               | and letters like you know about neurons and sound, but
               | not access them on the technical level, which is
               | completely isolated from it. The fact that it's a chat of
               | tokens of letters, which are a form of information
               | passing between humans, is accidental.
        
             | probably_wrong wrote:
             | I would counterargue with "that's the model's problem, not
             | mine".
             | 
              | Here's a thought experiment: if I gave you 5 boxes and
              | asked "how many balls are there in all of these boxes?" and
             | you answered "I don't know because they are inside boxes",
             | that's a fail. A truly intelligent individual would open
             | them and look inside.
             | 
             | A truly intelligent model would (say) retokenize the word
             | into its individual letters (which I'm optimistic they can)
             | and then would count those. The fact that models cannot do
             | this is proof that they lack some basic building blocks for
             | intelligence. Model designers don't get to argue "we are
             | human-like except in the tasks where we are not".
        
               | pegasus wrote:
               | Of course they lack building blocks for full
               | intelligence. They are good at certain tasks, and
               | counting letters is emphatically not one of them. They
               | should be tested and compared on the kind of tasks
                | they're fit for, and thus the kind of tasks they will be
                | used to solve, not tasks for which they would be
               | misemployed to begin with.
        
               | probably_wrong wrote:
               | I agree with you, but that's not what the post claims.
               | From the article:
               | 
               | "A significant effort was also devoted to enhancing the
               | model's reasoning capabilities. (...) the new Mistral
               | Large 2 is trained to acknowledge when it cannot find
               | solutions or does not have sufficient information to
               | provide a confident answer."
               | 
               | Words like "reasoning capabilities" and "acknowledge when
               | it does not have enough information" have meanings. If
               | Mistral doesn't add footnotes to those assertions then,
               | IMO, they don't get to backtrack when simple examples
               | show the opposite.
        
               | pegasus wrote:
               | You're right, I missed that claim.
        
               | mrkstu wrote:
                | It's not like an LLM is released with a hit list of "these
               | are the tasks I really suck at." Right now users have to
               | figure it out on the fly or have a deep understanding of
               | how tokenizers work.
               | 
               | That doesn't even take into account what OpenAI has
               | typically done to intercept queries and cover the
               | shortcomings of LLMs. It would be useful if each model
               | did indeed come out with a chart covering what it cannot
               | do and what it has been tailored to do above and beyond
               | the average LLM.
        
               | jackbrookes wrote:
                | It just needs a little hint:
                | 
                | Me: spell "strawberry" with 1 bullet point per letter
                | 
                | ChatGPT: S, T, R, A, W, B, E, R, R, Y
                | 
                | Me: How many Rs?
                | 
                | ChatGPT: There are three Rs in "strawberry".
        
               | TiredOfLife wrote:
                | Me: try again
                | 
                | ChatGPT: There are two Rs in "strawberry."
        
               | kevindamm wrote:
               | ChatGPT: "I apologize, there are actually two Rs in
               | strawberry."
        
               | groby_b wrote:
               | LLMs are not truly intelligent.
               | 
               | Never have been, never will be. They model language, not
               | intelligence.
        
               | OKRainbowKid wrote:
                | They model the dataset they were trained on. What would a
                | dataset of what you consider intelligence look like?
        
               | michaelmrose wrote:
                | Those who develop AI and know anything don't actually
                | describe current technology as human-like intelligence;
                | rather, it is capable of many tasks which previously
                | required human intelligence.
        
             | SirMaster wrote:
             | How is a layman supposed to even know that it's testing on
              | that? All they know is it's a large language model. It's
              | not unreasonable for them to expect it to be good at things
             | having to do with language, like how many letters are in a
             | word.
             | 
             | Seems to me like a legit question for a young child to
             | answer or even ask.
        
               | stavros wrote:
               | > How is a layman supposed to even know that it's testing
               | on that?
               | 
               | They're not, but laymen shouldn't think that the LLM
               | tests they come up with have much value.
        
               | SirMaster wrote:
               | I'm saying a layman or say a child wouldn't even think
               | this is a "test". They are just asking a language model a
               | seemingly simple language related question from their
               | point of view.
        
               | groby_b wrote:
                | Laymen and children shouldn't use LLMs.
               | 
               | They're pointless unless you have the expertise to check
               | the output. Just because you can type text in a box
               | doesn't mean it's a tool for everybody.
        
               | SirMaster wrote:
               | Well they certainly aren't being marketed or used that
               | way...
               | 
                | I'm seeing everyone and their parents using ChatGPT.
        
             | meroes wrote:
              | I hear this a lot, but vast sums of money are thrown at the
              | cases where a model fails the way it does on strawberry.
              | 
              | Think about math and logic. If a single symbol is off, it's
              | no good.
              | 
              | At my work, a prompt where we can generate a single
              | tokenization error generates, by my very rough estimate, 2
              | man-hours of work. (We search for incorrect model
              | responses, get them to correct themselves, and if they
              | can't after trying, we tell them the right answer and edit
              | it for perfection.) Yes, even for counting occurrences of
              | characters. Think about how applicable that is: finding the
              | next term in a sequence, analyzing strings, etc.
        
               | antonvs wrote:
               | > Think about math and logic. If a single symbol is off,
               | it's no good.
               | 
               | In that case the tokenization is done at the appropriate
               | level.
               | 
               | This is a complete non-issue for the use cases these
               | models are designed for.
        
               | meroes wrote:
               | But we don't restrict it to math or logical syntax. Any
               | prompt across essentially all domains. The same model is
               | expected to handle any kind of logical reasoning that can
               | be brought into text. We don't mark it incorrect if it
                | spells an unimportant word wrong; however, keep in mind
                | that the spelling of a word can be important for many
               | questions, for example--off the top of my head: please
               | concatenate "d", "e", "a", "r" into a common English word
               | without rearranging the order. The types of examples are
               | endless. And any type of example it gets wrong, we want
               | to correct it. I'm not saying most models will fail this
               | specific example, but it's to show the breadth of
               | expectations.
        
             | baq wrote:
             | Call me when models understand when to convert the token
             | into actual letters and count them. Can't claim they're
             | more than word calculators before that.
        
               | jahsome wrote:
               | Is anyone in the know, aside from mainstream media (god
               | forgive me for using this term unironically) and
                | civilians on social media claiming LLMs are anything but
               | word calculators?
               | 
               | I think that's a perfect description by the way, I'm
               | going to steal it.
        
               | dTal wrote:
               | I think it's a very poor intuition pump. These 'word
               | calculators' have lots of capabilities not suggested by
               | that term, such as a theory of mind and an understanding
               | of social norms. If they are a "merely" a "word
               | calculator", then a "word calculator" is a very odd and
               | counterintuitively powerful algorithm that captures big
               | chunks of genuine cognition.
        
               | robbiep wrote:
               | They're trained on the available corpus of human
                | knowledge and writings. I would think that the word
                | calculators would have failed if they were unable to predict
               | the next word or sentiment given the trillions of pieces
               | of data they've been fed. Their training environment is
               | literally people talking to each other and social norms.
               | Doesn't make them anything more than p-zombies though.
               | 
               | As an aside, I wish we would call all of this stuff
               | pseudo intelligence rather than artificial intelligence
        
               | antonvs wrote:
               | That's misleading.
               | 
               | When you read and comprehend text, you don't read it
               | letter by letter, unless you have a severe reading
               | disability. Your ability to _comprehend_ text works more
               | like an LLM.
               | 
               | Essentially, you can compare the human brain to a multi-
               | model or modular system. There are layers or modules
               | involved in most complex tasks. When reading, you
                | recognize multiple letters at a time[*], and those
               | letters are essentially assembled into tokens that a
               | different part of your brain can deal with.
               | 
               | Breaking down words into letters is essentially a
               | separate "algorithm". Just like your brain, it's likely
               | to never make sense for a text comprehension and
               | generation model to operate at the level of letters -
               | it's inefficient.
               | 
               | A multi-modal model with a dedicated model for handling
               | individual letters could easily convert tokens into
               | letters and operate on them when needed. It's just not a
               | high priority for most use cases currently.
               | 
                | [*]https://www.researchgate.net/publication/47621684_Lett
               | ers_in...
        
               | baq wrote:
               | I agree completely, that wasn't the point though: the
               | point was that my 6 yo knows when to spell the word when
               | asked and the blob of quantized floats doesn't, or at
               | least not reliably.
               | 
               | So the blob wasn't trained to do that (yeah low utility I
               | get that) but it also doesn't know it doesn't know, which
                | is another, much bigger, still unsolved problem.
        
               | stanleydrew wrote:
                | I would argue that most SOTA models do know that they
               | don't know this, as evidenced by the fact that when you
               | give them a code interpreter as a tool they choose to use
               | it to write a script that counts the number of letters
               | rather than try to come up with an answer on their own.
               | 
               | (A quick demo of this in the langchain docs, using
               | claude-3-haiku: https://python.langchain.com/v0.2/docs/in
               | tegrations/tools/ri...)
        
               | patall wrote:
                | The model communicates in a language, but our letters are
                | not necessary for that and are in fact not part of the
                | English language itself. You could write English using
                | per-word pictographs and it would still be the same
                | English and the same information/message. It's like
                | asking you if there is a '5' in 256 when you read it in
                | binary.
        
             | psb217 wrote:
             | How can I know whether any particular question will test a
             | model on its tokenization? If a model makes a boneheaded
             | error, how can I know whether it was due to lack of
             | intelligence or due to tokenization? I think finding places
             | where models are surprisingly dumb is often more
             | informative than finding particular instances where they
             | seem clever.
             | 
             | It's also funny, since this strawberry question is one
             | where a model that's seriously good at predicting the next
             | character/token/whatever quanta of information would get it
             | right. It requires no reasoning, and is unlikely to have
             | any contradicting text in the training corpus.
        
               | viraptor wrote:
               | > How can I know whether any particular question will
               | test a model on its tokenization?
               | 
               | Does something deal with separate symbols rather than
               | just meaning of words? Then yes.
               | 
               | This affects spelling, math (value calculation), logic
               | puzzles based on symbols. (You'll have more success with
               | a puzzle about "A B A" rather than "ABA")
               | 
               | > It requires no reasoning, and is unlikely to have any
               | contradicting text in the training corpus.
               | 
               | This thread contains contradictions. Every other
                | announcement of an LLM contains a comment with
                | contradicting text, where people post the wrong responses.
        
             | VincentEvans wrote:
             | I don't know anything about LLMs beyond using ChatGPT and
              | Copilot... but unless, because of this lack of knowledge, I
              | am misinterpreting your reply, it sounds as if you are
             | excusing the model giving a completely wrong answer to a
              | question that anyone intelligent enough to learn the alphabet
             | can answer correctly.
        
               | danieldk wrote:
               | The problem is that the model never gets to see
               | individual letters. The tokenizers used by these models
               | break up the input in pieces. Even though the smallest
               | pieces/units are bytes in most encodings (e.g. BBPE), the
               | tokenizer will cut up most of the input in much larger
               | units, because the vocabulary will contain fragments of
               | words or even whole words.
               | 
               | For example, if we tokenize _Welcome to Hacker News, I
               | hope you like strawberries._ The Llama 405B tokenizer
                | will tokenize this as:                   Welcome Ġto
                | ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
                | 
                | (Ġ means that the token was preceded by a space.)
               | 
               | Each of these pieces is looked up and encoded as a tensor
               | with their indices. Adding a special token for the
               | beginning and end of the text, giving:
               | [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499,
               | 1093, 76203, 13]
               | 
                | So, all the model sees for 'Ġstrawberries' is the number
                | 76203 (which is then used in the piece embedding lookup).
               | The model does not even have access to the individual
               | letters of the word.
               | 
               | Of course, one could argue that the model should be fed
               | with bytes or codepoints instead, but that would make
               | them vastly less efficient with quadratic attention.
               | Though machine learning models have done this in the past
               | and may do this again in the future.
               | 
                | Just wanted to finish off this comment by saying that
                | words might be fed to the model split into pieces if the
                | word itself is not in the vocabulary. For instance, the
               | same sentence translated to my native language is
                | tokenized as:                   Wel kom Ġop ĠHacker ĠNews
                | , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
               | 
                | And the word for strawberries (aardbeien) is split,
               | though still not in letters.
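                | 
                | (For anyone who wants to poke at this themselves, here is a
                | rough sketch using the tiktoken library, i.e. GPT-4o's
                | tokenizer rather than Llama's, so the exact pieces and ids
                | will differ from the example above.)
                | 
                |     # Inspect how a byte-level BPE tokenizer splits text.
                |     import tiktoken
                | 
                |     enc = tiktoken.encoding_for_model("gpt-4o")
                |     text = "Welcome to Hacker News, I hope you like strawberries."
                |     ids = enc.encode(text)
                |     pieces = [enc.decode([i]) for i in ids]
                |     print(list(zip(pieces, ids)))
                |     # " strawberries" shows up as one or two multi-character
                |     # pieces, never as individual letters, so the model
                |     # never "sees" the r's directly.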
        
               | TiredOfLife wrote:
                | The thing is, how the tokenizing works is about as
                | relevant to the person asking the question as the name of the
               | cat of the delivery guy who delivered the GPU that the
               | llm runs on.
        
               | danieldk wrote:
               | How the tokenizer works explains why a model can't answer
                | the question; what the name of the cat is doesn't explain
               | anything.
               | 
               | This is Hacker News, we are usually interested in how
               | things work.
        
               | VincentEvans wrote:
               | Indeed, I appreciate the explanation, it is certainly
               | both interesting and informative to me, but to somewhat
               | echo the person you are replying to - if I wanted a boat,
               | and you offer me a boat, and it doesn't float - the
                | reasons for failure are no doubt full of interesting
                | details, but perhaps the most important thing to focus on
                | first is to make the boat float, or to stop offering it to
               | people who are in need of a boat.
               | 
               | To paraphrase how this thread started - it was someone
               | testing different boats to see whether they can simply
               | float - and they couldn't. And the reply was questioning
               | the validity of testing boats whether they can simply
               | float.
               | 
               | At least this is how it sounds to me when I am told that
               | our AI overlords can't figure out how many Rs are in the
               | word "strawberry".
        
               | michaelmrose wrote:
                | The test problem is emblematic of a type of synthetic
                | query that can fail but is of limited import in actual
                | usage.
               | 
               | For instance you could ask it for a JavaScript function
               | to count any letter in any word and pass it r and
               | strawberry and it would be far more useful.
               | 
                | Having edge cases doesn't mean it's not useful. It is
                | neither a free assistant nor a coder who doesn't expect a
                | paycheck. At this stage it's a tool that you can build
                | on.
               | 
               | To engage with the analogy. A propeller is very useful
               | but it doesn't replace the boat or the Captain.
        
               | viraptor wrote:
               | At some point you need to just accept the details and
               | limitations of things. We do this all the time. Why is
                | your calculator giving only an approximate result? Why can't
                | your car go backwards as fast as forwards? Etc. It sucks
                | that everyone gets exposed to the relatively low-level
                | implementation with LLMs (almost the raw model), but
               | that's the reality today.
        
               | roywiggins wrote:
               | People do get similarly hung up on surprising floating
               | point results: why can't you just make it work properly?
               | And a full answer is a whole book on how floating point
               | math works.
        
               | dTal wrote:
               | It is however a highly relevant thing to be aware of when
               | evaluating a LLM for 'intelligence', which was the
               | context this was brought up in.
               | 
               | Without _looking_ at the word  'strawberry', or spelling
               | it one letter at a time, can you rattle off how many
               | letters are in the word off the top of your head? No?
               | That is what we are asking the LLM to do.
        
             | ca_tech wrote:
              | It's like showing someone a color and asking how many
              | letters it has. 4... 3? Blau, blue, azul, blu. The color
              | holds the meaning and the words all map back to it.
             | 
             | In the model the individual letters hold little meaning.
              | Words are composed of letters simply because we need
              | some sort of organized structure for communication that
              | helps represent meaning and intent. Just like our color
             | blue/blau/azul/blu.
             | 
             | Not faulting them for asking the question but I agree that
             | the results do not undermine the capability of the
             | technology. In fact it just helps highlight the constraints
             | and need for education.
        
             | onlyrealcuzzo wrote:
             | > Like, that has nothing to do with their intelligence.
             | 
             | Because they don't have intelligence.
             | 
             | If they did, they could count the letters in strawberry.
        
               | TwentyPosts wrote:
               | People have been over this. If you believe this, you
               | don't understand how LLMs work.
               | 
               | They fundamentally perceive the world in terms of tokens,
               | not "letters".
        
               | antonvs wrote:
               | > If you believe this, you don't understand how LLMs
               | work.
               | 
               | Nor do they understand how intelligence works.
               | 
               | Humans don't read text a letter at a time. We're capable
               | of deconstructing words into individual letters, but
               | based on the evidence that's essentially a separate
               | "algorithm".
               | 
               | Multi-model systems could certainly be designed to do
               | that, but just like the human brain, it's unlikely to
               | ever make sense for a text comprehension and generation
               | model to work at the level of individual letters.
        
             | fmbb wrote:
             | > that has nothing to do with their intelligence.
             | 
             | Of course. Because these models have no intelligence.
             | 
              | Everyone who believes they do seems to believe that
              | intelligence derives from being able to use language,
              | however, and not being able to tell how many times the
              | letter r appears in the word strawberry is a very low bar
              | to fail.
        
               | roywiggins wrote:
               | An LLM trained on single letter tokens would be able to,
               | it just would be much more laborious to train.
        
               | wruza wrote:
               | Why would it be able to?
        
           | Stumbling wrote:
            | Claude 3 Opus gave the correct answer.
        
           | vorticalbox wrote:
            | I just tried Llama 3.1 8B; this is its reply.
           | 
           | According to multiple sources, including linguistic analysis
           | and word breakdowns, there are 3 Rs in the word "strawberry".
        
           | taf2 wrote:
            | Sonnet 3.5 thinks 2
        
           | stitched2gethr wrote:
           | Interestingly enough much simpler models can write an
           | accurate function to give you the answer.
           | 
           | I think it will be a while before we get there. An LLM can
            | look up knowledge but can't actually perform calculations
            | itself without some external processor.
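            | 
            | (For reference, the kind of function they reliably produce is
            | trivial; a minimal Python sketch:)
            | 
            |     def count_letter(word: str, letter: str) -> int:
            |         """Count case-insensitive occurrences of a letter."""
            |         return word.lower().count(letter.lower())
            | 
            |     print(count_letter("strawberry", "r"))  # 3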
        
             | stanleydrew wrote:
             | Why do we have to "get there?" Humans use calculators all
             | the time, so why not have every LLM hooked up to a
             | calculator or code interpreter as a tool to use in these
             | exact situations?
        
           | medmunds wrote:
           | How much do threads like this provide the training data to
           | convince future generations that--despite all appearances to
           | the contrary--strawberry is in fact spelled with only two
           | R's?
           | 
           | I just researched "how many r's are in strawberry?" in a
           | search engine, and based solely on the results it found, I
           | would have to conclude there is substantial disagreement on
           | whether the correct answer is two or three.
        
             | fluoridation wrote:
             | Speaking as a 100% human, my vote goes to the compromise
             | position that "strawberry" has in fact four Rs.
        
           | eschneider wrote:
           | The models are text generators. They don't "understand" the
           | question.
        
           | m2024 wrote:
           | Does anyone have input on the feasibility of running an LLM
           | locally and providing an interface to some language runtime
           | and storage space, possibly via a virtual machine or
           | container?
           | 
           | No idea if there's any sense to this, but an LLM could be
           | instructed to formulate and continually test mathematical
           | assumptions by writing / running code and fine-tuning
           | accordingly.
        
             | killthebuddha wrote:
             | FWIW this (approximately) is what everybody (approximately)
             | is trying to do.
        
             | stanleydrew wrote:
             | Yes, we are doing this at Riza[0] (via WASM). I'd love to
             | have folks try our downloadable CLI which wraps isolated
             | Python/JS runtimes (also Ruby/PHP but LLMs don't seem to
              | write those very well). Shoot me an email[1] or say hi in
              | Discord[2].
             | 
             | [0]:https://riza.io [1]:mailto:andrew@riza.io
             | [2]:https://discord.gg/4P6PUeJFW5
        
           | mirekrusin wrote:
           | How many "r"s are in [496, 675, 15717]?
        
           | stanleydrew wrote:
           | Plug in a code interpreter as a tool and the model will write
           | Python or JavaScript to solve this and get it right 100% of
           | the time. (Full disclosure: I work on a product called Riza
           | that you can use as a code interpreter tool for LLMs)
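            | 
            | (The general pattern, sketched with the OpenAI Python SDK and
            | a hypothetical "run_python" tool rather than our actual API:
            | you advertise a code-execution tool and the model writes the
            | counting script itself.)
            | 
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            |     tools = [{
            |         "type": "function",
            |         "function": {
            |             "name": "run_python",  # hypothetical tool name
            |             "description": "Run a Python snippet, return stdout.",
            |             "parameters": {
            |                 "type": "object",
            |                 "properties": {"code": {"type": "string"}},
            |                 "required": ["code"],
            |             },
            |         },
            |     }]
            |     resp = client.chat.completions.create(
            |         model="gpt-4o",
            |         messages=[{"role": "user",
            |                    "content": "How many r's are in 'strawberry'?"}],
            |         tools=tools,
            |     )
            |     # The model typically answers with a tool call whose `code`
            |     # argument is something like: print("strawberry".count("r"))
            |     print(resp.choices[0].message.tool_calls)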
        
           | kremi wrote:
           | Your question sounds like you want to know how the word is
            | spelled, and no one would put two r's in "straw", so the model
           | could be assuming that you're asking whether it's strawbery
           | or strawberry.
           | 
           | What happens if you ask the total number of occurrences of
           | the letter r in the word? Does it still not get it right?
        
           | exebook wrote:
            | Tokenization can be bypassed like this (Mistral Large 2):
           | 
           | How many letters R are in the word "s-t-r-a-w-b-e-r-r-y"?
           | 
           | The word "s-t-r-a-w-b-e-r-r-y" contains three instances of
           | the letter "R."
           | 
           | How many letters R contain the word strawberry?
           | 
           | The word "strawberry" contains two instances of the letter
           | "R."
        
           | jbritton wrote:
            | I just tried Copilot: 2 'r's in strawberry, 3 'r's in
            | strawberries.
        
           | groby_b wrote:
           | I mean, yes? Letter-based gymnastics are difficult in a
           | token-based environment, and they're not a particularly
           | valuable use case (outside of online dunking), so it's not
           | really a surprise nobody cares much about those in terms of
           | model training.
        
           | ipython wrote:
           | I really wanted to believe this could be simply fixed by more
           | prompt engineering and chain of thought, so I tried it with
            | Claude 3.5 and... well, unfortunately still no luck. The
           | reply is amusing - I'll include it verbatim below.
           | 
           | Of note, this happened because earlier I had asked it without
           | CoT and it incorrectly stated two "r"s. Once I started a new
           | conversation (so fresh context window), it gave the right
           | answer (3) and no longer claimed the "still the second r" as
           | shown below.
           | 
           | Let's think through this step-by-step:
           | 
           | 1. First, let's spell out "strawberry" letter by letter: s-t-
           | r-a-w-b-e-r-r-y
           | 
           | 2. Now, let's count the "r"s as we go through each letter: s
           | (0) t (0) r (1) - First "r" found a (1) w (1) b (1) e (1) r
           | (2) - Second "r" found r (2) - This is still the second "r" y
           | (2)
           | 
           | 3. After going through all the letters, we've counted 2 "r"s.
           | 
           | Therefore, the word "strawberry" contains 2 "r"s.
        
         | rkwz wrote:
         | > Longer context window (1M+)
         | 
         | What's your use case for this? Uploading multiple
         | documents/books?
        
           | tikkun wrote:
           | Correct
        
             | freediver wrote:
              | That would make each API call cost at least $3 ($3 is the price
             | per million input tokens). And if you have a 10 message
             | interaction you are looking at $30+ for the interaction. Is
             | that what you would expect?
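              | 
              | (The arithmetic, assuming the full 1M-token context is
              | resent on every turn at $3 per million input tokens:)
              | 
              |     price_per_m_input = 3.00     # USD per 1M input tokens
              |     context_tokens = 1_000_000
              |     turns = 10
              |     cost = turns * (context_tokens / 1e6) * price_per_m_input
              |     print(cost)  # 30.0 USD, before output-token charges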
        
               | rkwz wrote:
               | Maybe they're summarizing/processing the documents in a
               | specific format instead of chatting? If they needed chat,
               | might be easier to build using RAG?
        
               | tr4656 wrote:
               | This might be when it's better to not use the API and
               | just pay for the flat-rate subscription.
        
               | coder543 wrote:
               | Gemini 1.5 Pro charges $0.35/million tokens up to the
               | first million tokens or $0.70/million tokens for prompts
               | longer than one million tokens, and it supports a multi-
               | million token context window.
               | 
               | Substantially cheaper than $3/million, but I guess
               | Anthropic's prices are higher.
        
               | freediver wrote:
               | It is also much worse.
        
               | coder543 wrote:
               | Is it, though? In my limited tests, Gemini 1.5 Pro
               | (through the API) is very good at tasks involving long
               | context comprehension.
               | 
               | Google's user-facing implementations of Gemini are pretty
               | consistently bad when I try them out, so I understand why
               | people might have a bad impression about the underlying
               | Gemini models.
        
               | reitzensteinm wrote:
               | You're looking at the pricing for Gemini 1.5 Flash. Pro
               | is $3.50 for <128k tokens, else $7.
        
               | coder543 wrote:
               | Ah... oops. For some reason, that page isn't rendering
               | properly on my browser: https://imgur.com/a/XLFBPMI
               | 
               | When I glanced at the pricing earlier, I didn't notice
               | there was a dropdown at all.
        
               | impossiblefork wrote:
               | So do it locally after predigesting the book, so that you
               | have the entire KV-cache for it.
               | 
               | Then load that KV-cache and add your prompt.
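                | 
                | (A rough sketch with Hugging Face transformers, assuming a
                | recent release where generate() can resume from a supplied
                | past_key_values; devices, dtypes and chat templating are
                | omitted.)
                | 
                |     import copy, torch
                |     from transformers import (AutoModelForCausalLM,
                |                               AutoTokenizer)
                | 
                |     name = "mistralai/Mistral-7B-Instruct-v0.3"
                |     tok = AutoTokenizer.from_pretrained(name)
                |     model = AutoModelForCausalLM.from_pretrained(name)
                | 
                |     # "Predigest" the book once and keep its KV cache.
                |     book_text = open("book.txt").read()
                |     book_ids = tok(book_text, return_tensors="pt").input_ids
                |     with torch.no_grad():
                |         cache = model(book_ids, use_cache=True).past_key_values
                | 
                |     # Later: append the prompt and resume from the cache.
                |     prompt = "\n\nQuestion: who is the protagonist?"
                |     prompt_ids = tok(prompt, return_tensors="pt").input_ids
                |     full_ids = torch.cat([book_ids, prompt_ids], dim=-1)
                |     out = model.generate(full_ids,
                |                          past_key_values=copy.deepcopy(cache),
                |                          max_new_tokens=200)
                |     print(tok.decode(out[0][full_ids.shape[1]:]))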
        
           | ketzo wrote:
           | Uploading large codebases is particularly useful.
        
             | ipsod wrote:
             | Is it?
             | 
             | I've found that I get better results if I cherry pick code
             | to feed to Claude 3.5, instead of pasting whole files.
             | 
             | I'm kind of isolated, though, so maybe I just don't know
             | the trick.
        
               | ketzo wrote:
               | I've been using Cody from Sourcegraph, and it'll write
               | some really great code; business logic, not just
               | tests/simple UI. It does a great job using
               | patterns/models from elsewhere in your codebase.
               | 
               | Part of how it does that is through ingesting your
               | codebase into its context window, and so I imagine that
               | bigger/better context will only improve it. That's a bit
               | of an assumption though.
        
           | benopal64 wrote:
           | Books, especially textbooks, would be amazing. These things
           | can get pretty huge (1000+ pages) and usually do not fit into
           | GPT-4o or Claude Sonnet 3.5 in my experience. I envision the
           | models being able to help a user (student) create their study
           | guides and quizzes, based on ingesting the entire book. Given
           | the ability to ingest an entire book, I imagine a model could
           | plan how and when to introduce each concept in the textbook
            | better than a model that only sees part of the textbook.
        
           | moyix wrote:
           | Long agent trajectories, especially with command outputs.
        
         | msp26 wrote:
         | Large 2 is significantly smaller at 123B so it being comparable
         | to llama 3 405B would be crazy.
        
         | qwertox wrote:
         | Claude needs to fix their text input box. It tries to be so
         | advanced that code in backticks gets reformatted, and when you
         | copy it, the formatting is lost (even the backticks).
        
           | nickthesick wrote:
           | They are using Tiptap for their input and just a couple of
           | days ago we called them out on some perf improvements that
           | could be had in their editor:
           | https://news.ycombinator.com/item?id=41036078
           | 
           | I am curious what you mean by the formatting is lost though?
        
           | cpursley wrote:
           | Claude is truly incredible but I'm so tired of the JavaScript
            | bloat everywhere. Just why. Both theirs and ChatGPT's UIs are
           | hot garbage when it comes to performance (I constantly have
           | to clear my cache and have even relegated them to a different
           | browser entirely). Not everyone has an M4, and if we did -
           | we'd probably just run our own models.
        
       | Liquix wrote:
       | These companies full of brilliant engineers are throwing millions
       | of dollars in training costs to produce SOTA models that are...
       | "on par with GPT-4o and Claude Opus"? And then the next 2.23%
       | bump will cost another XX million? It seems increasingly apparent
       | that we are reaching the limits of throwing more data at more
       | GPUs; that an ARC prize level breakthrough is needed to move the
       | needle any farther at this point.
        
         | iknownthing wrote:
         | and even if there is another breakthrough all of these
         | companies will implement it more or less simultaneously and
         | they will remain in a dead heat
        
           | llm_nerd wrote:
           | Presuming the breakthrough is openly shared. It remains
           | surprising how transparent many of these companies are about
           | new approaches that push the SoTa forward, and I suspect
            | we're going to see a change: companies won't reveal the
           | secret sauce so readily.
           | 
            | e.g. Almost the entire market relies upon the Attention Is All
           | You Need paper detailing transformers, and it would be an
           | entirely different market if Google had held that as a trade
           | secret.
        
             | talldayo wrote:
             | Given how absolutely pitiful the proprietary advancements
             | in AI have been, I would posit we have little to worry
             | about.
        
               | jsheard wrote:
               | OTOH the companies who are sharing their breakthroughs
               | openly aren't yet making any money, so something has to
               | give. Their research is currently being bankrolled by
               | investors who assume there will be returns _eventually,_
               | and _eventually_ can only be kicked down the road for so
               | long.
        
               | talldayo wrote:
               | Eventually can be (and has been) bankrolled by Nvidia.
               | They did a lot of ground-floor research on GANs and
               | training optimization, which only makes sense to release
               | as public research. Similarly, Meta and Google are both
               | well-incentivized to share their research through Pytorch
               | and Tensorflow respectively.
               | 
               | I really am not expecting Apple or Microsoft to discover
               | AGI and ferret it away for profitability purposes.
               | Strictly speaking, I don't think superhuman intelligence
               | even exists in the domain of text generation.
        
               | thruway516 wrote:
               | Well, that's because the potential reward from picking
               | the right horse is MASSIVE and the cost of potentially
               | missing out is lifelong regret. Investors are driven by
               | FOMO more than anything else. They know most of these
               | will be duds but one of these duds could turn out to be
               | life changing. So they will keep bankrolling as long as
               | they have the money.
        
               | michaelt wrote:
               | Sort of yes, sort of no.
               | 
               | Of course, I agree that Stability AI made Stable
               | Diffusion freely available and they're worth orders of
               | magnitude less than OpenAI. To the point they're
               | struggling to keep the lights on.
               | 
               | But it doesn't necessarily make that much difference
               | whether you openly share the inner technical details.
               | When you've got a motivated and well financed competitor,
               | merely demonstrating a given feature is possible, showing
               | the output and performance and price, might be enough.
               | 
               | If OpenAI adds a feature, who's to say Google and
               | Facebook can't match it even though they can't access the
               | code?
        
               | sebzim4500 wrote:
               | Anthropic has been very secretive about the supposed
               | synthetic data they used to train 3.5 Sonnet.
               | 
                | Given how good the model is in terms of the quality vs speed
               | tradeoff, they must have something.
        
             | GaggiX wrote:
             | >Attention Is All You Need paper detailing transformers,
             | and it would be an entirely different market if Google had
             | held that as a trade secret.
             | 
             | I would guess that in that timeline, Google would never
             | have been able to learn about the incredible capabilities
             | of transformer models outside of translation, at least not
             | until much later.
        
         | happyhardcore wrote:
         | I suspect this is why OpenAI is going more in the direction of
         | optimising for price / latency / whatever with 4o-mini and
         | whatnot. Presumably they found out long before the rest of us
         | did that models can't really get all that much better than what
         | we're approaching now, and once you're there the only thing you
         | can compete on is how many parameters it takes and how cheaply
         | you can serve that to users.
        
           | __jl__ wrote:
           | Meta just claimed the opposite in their Llama 3.1 paper. Look
           | at the conclusion. They say that their experience indicates
           | significant gains for the next iteration of models.
           | 
           | The current crop of benchmarks might not reflect these gains,
           | by the way.
        
             | nathanasmith wrote:
              | They also said in the paper that 405B was only trained to
              | "compute-optimal", unlike the smaller models, which were
              | trained well past that point, indicating the larger model
              | still had some runway; so had they continued, it would have
              | kept getting stronger.
        
               | moffkalast wrote:
               | Makes sense right? Otherwise why make a model so large
               | that nobody can conceivably run it if not to optimize for
               | performance on a limited dataset/compute? It was always a
               | distillation source model, not a production one.
        
             | imtringued wrote:
             | LLMs are reaching saturation on even some of the latest
             | benchmarks and yet I am still a little disappointed by how
             | they perform in practice.
             | 
             | They are by no means bad, but I am now mostly interested in
             | long context competency. We need benchmarks that force the
             | LLM to complete multiple tasks simultaneously in one super
             | long session.
        
               | xeromal wrote:
               | I don't know anything about AI but there's one thing I
                | want it to do for me: program a long-term full-body
                | exercise program based on the parameters I give it, such as
                | available equipment, past workouts, and goals. I
               | haven't had good success with chatgpt but I assume what
               | you're talking about is relevant to my goals.
        
               | ThrowawayTestr wrote:
               | Aren't there apps that already do this like Fitbod?
        
               | xeromal wrote:
               | Fitbod might do the trick. Thanks! The availability of
               | equipment was a difficult thing for me to incorporate
               | into a fitness program.
        
             | splwjs wrote:
             | I sell widgets. I promise the incalculable power of widgets
             | has yet to be unleashed on the world, but it is tremendous
             | and awesome and we should all be very afraid of widgets
             | taking over the world because I can't see how they won't.
             | 
             | Anyway here's the sales page. the widget subscription is so
             | premium you won't even miss the subscription fee.
        
               | sqeaky wrote:
                | That is a strong (and fun) point, but this is peer
               | reviewable and has more open collaboration elements than
               | purely selling widgets.
               | 
                | We should still be skeptical, because people often want to
                | claim to be better or have unearned answers, but I don't think
               | the motive to lie is quite as strong as a salesman's.
        
               | troupo wrote:
               | > this is peer reviewable
               | 
               | It's not peer-reviewable in any shape or form.
        
               | hnfong wrote:
               | It is _kind of_ "peer-reviewable" in the "Elon Musk vs
               | Yann LeCun" form, but I doubt that the original commenter
               | meant this.
        
               | coltonv wrote:
               | This. It's really weird the way we suddenly live in a
               | world where it's the norm to take whatever a tech company
               | says about future products at face value. This is the
               | same world where Tesla promised "zero intervention LA to
               | NYC self driving" by the end of the year in 2016, 2017,
               | 2018, 2019, 2020, 2021, 2022, 2023, and 2024. The same
               | world where we know for a fact that multiple GenAI demos
               | by multiple companies were just completely faked.
               | 
               | It's weird. In the late 2010s it seems like people were
               | wising up to the idea that you can't implicitly trust big
               | tech companies, even if they have nap pods in the office
               | and have their first day employees wear funny hats. Then
               | ChatGPT lands and everyone is back to fully trusting
               | these companies when they say they are mere months from
               | turning the world upside down with their AI, which they
               | say every month for the last 12-24 months.
        
               | cle wrote:
               | I'm not sure anyone is asking you to take it at face
               | value or implicitly trust them? There's a 92-page paper
               | with details:
               | https://ai.meta.com/research/publications/the-
               | llama-3-herd-o...
        
               | hnfong wrote:
               | > In the late 2010s it seems like people were wising up
               | to the idea that you can't implicitly trust big tech
               | companies
               | 
               | In the 2000s we only had Microsoft, and none of us were
               | confused as to whether to trust Bill Gates or not...
        
               | mikae1 wrote:
               | Nobody tells it like Zitron:
               | 
               | https://www.wheresyoured.at/pop-culture/
               | 
               |  _> What makes this interview - and really, this paper --
               | so remarkable is how thoroughly and aggressively it
               | attacks every bit of marketing collateral the AI movement
               | has. Acemoglu specifically questions the belief that AI
               | models will simply get more powerful as we throw more
               | data and GPU capacity at them, and specifically ask a
               | question: what does it mean to  "double AI's
               | capabilities"? How does that actually make something
               | like, say, a customer service rep better? And this is a
               | specific problem with the AI fantasists' spiel. They
               | heavily rely on the idea that not only will these large
               | language models (LLMs) get more powerful, but that
               | getting more powerful will somehow grant it the power to
               | do...something. As Acemoglu says, "what does it mean to
               | double AI's capabilities?"_
        
               | RhodesianHunter wrote:
               | Meta just keeps releasing their models as open-source, so
               | that whole line of thinking breaks down quickly.
        
               | threecheese wrote:
               | That line of thinking would not have reached the
               | conclusion that you imply, which is that open source ==
               | pure altruism. Having the benefit of hindsight, it's very
               | difficult for me to believe that. Who knows though!
               | 
               | I'm about Zucks age, and have been following his
               | career/impact since college; it's been roughly a cosine
               | graph of doing good or evil over time :) I think we're at
               | 2pi by now, and if you are correct maybe it hockey-sticks
               | up and to the right. I hope so.
        
               | ctoth wrote:
               | Wouldn't the equivalent for Meta actually be something
               | like:
               | 
               | > Other companies sell widgets. We have a bunch of
               | widget-making machines and so we released a whole bunch
               | of free widgets. We noticed that the widgets got better
               | the more we made and expect widgets to become even better
               | in future. Anyway here's the free download.
               | 
               | Given that Meta isn't actually selling their models?
               | 
               | Your response might make sense if it were to something
               | OpenAI or Anthropic said, but as is I can't say I follow
               | the analogy.
        
               | ThrowawayTestr wrote:
               | If OpenAI was saying this you'd have a point but I
               | wouldn't call Facebook a widget seller in this case when
               | they're giving their widgets away for free.
        
               | camel_Snake wrote:
               | Meta doesn't sell widgets in this scenario - they give
               | them away for free. Their competition sells widgets, so
               | Meta would be perfectly happy if the widget market
               | totally collapsed.
        
               | mattnewton wrote:
               | that would make sense if it was from Openai, but Meta
               | doesn't actually sell these widgets? They release the
               | widget machines for free in the hopes that other people
               | will build a widget ecosystem around them to rival the
               | closed widget ecosystem that threatens to lock them out
               | of a potential "next platform" powered by widgets.
        
               | littlestymaar wrote:
               | Except: Meta doesn't sell AI at all. Zuck is just doing
               | this for two reasons:
               | 
               | - flex
               | 
                | - deal a blow to Altman
        
               | HDThoreaun wrote:
                | Meta uses AI in all the recommendation algorithms. They
               | absolutely hope to turn their chat assistants into a
               | product on WhatsApp too, and GenAI is crucial to creating
               | the metaverse. This isn't just a charity case.
        
               | PodgieTar wrote:
               | There are literal ads for Meta Ai on television. The idea
               | they're not selling something is absurd.
        
               | X6S1x6Okd1st wrote:
               | But Meta isn't selling it
        
             | dev1ycan wrote:
             | Or maybe they just want to avoid getting sued by
             | shareholders for dumping so much money into unproven
             | technology that ended up being the same or worse than the
             | competitor
        
             | Bjorkbat wrote:
             | Yeah, but what does that actually mean? That if they had
             | simply doubled the parameters on Llama 405b it would score
             | way better on benchmarks and become the new state-of-the-
             | art by a long mile?
             | 
             | I mean, going by their own model evals on various
             | benchmarks (https://llama.meta.com/), Llama 405b scores
              | anywhere from a few points to _almost_ 10 points more than
              | Llama 70b even though the former has ~5.5x more params. As
              | far as scale is concerned, the relationship isn't even
              | linear.
             | 
             | Which in most cases makes sense, you obviously can't get a
             | 200% on these benchmarks, so if the smaller model is
             | already at ~95% or whatever then there isn't much room for
             | improvement. There is, however, the GPQA benchmark. Whereas
             | Llama 70b scores ~47%, Llama 405b only scores ~51%. That's
             | not a huge improvement despite the significant difference
             | in size.
             | 
             | Most likely, we're going to see improvements in small model
             | performance by way of better data. Otherwise though, I fail
             | to see how we're supposed to get significantly better model
             | performance by way of scale when the relationship between
             | model size and benchmark scores is nowhere near linear. I
              | really wish someone who's on team "scale is all you need"
             | could help me see what I'm missing.
             | 
             | And of course we might find some breakthrough that enables
             | actual reasoning in models or whatever, but I find that
             | purely speculative at this point, anything but inevitable.
        
           | crystal_revenge wrote:
           | > the only thing you can compete on is how many parameters it
           | takes and how cheaply you can serve that to users.
           | 
           | The problem with this strategy is that it's really tough to
           | compete with open models in this space over the long run.
           | 
           | If you look at OpenAI's homepage right now they're trying to
           | promote "ChatGPT on your desktop", so it's clear even they
           | realize that most people are looking for a local product. But
           | once again this is a problem for them because open models run
           | locally are always going to offer more in terms of privacy
           | and features.
           | 
           | In order for proprietary models served through an API to
           | compete long term they need to offer _significant_
           | performance improvements over open /local offerings, but that
           | gap has been perpetually shrinking.
           | 
           | On an M3 macbook pro you can run open models easily for free
           | that perform close enough to OpenAI that I can use them as my
           | primary LLM for effectively free with complete privacy and
           | lots of room for improvement if I want to dive into the
           | details. Ollama today is pretty much easier to install than
           | just logging into ChatGPT and the performance feels a bit
           | more responsive for most tasks. If I'm doing a serious LLM
           | project I most certainly _won 't_ use proprietary models
           | because the control I have over the model is too limited.
           | 
           | At this point I have completely stopped using proprietary
           | LLMs despite working with LLMs everyday. Honestly can't
           | understand any serious software engineer who wouldn't use
           | open models (again the control and tooling provided is just
           | so much better), and for less technical users it's getting
           | easier and easier to just run open models locally.
        
             | bla3 wrote:
             | I think their desktop app still runs the actual LLM queries
             | remotely.
        
               | kridsdale3 wrote:
               | This. It's a mac port of the iOS app. Using the API.
        
             | pzo wrote:
             | In the long run maybe but it's going to take probably 5
              | years or more before laptops such as a MacBook M3 with 64 GB
              | of RAM will be mainstream. Also, it's going to take a
             | while before such models with 70B params will be bundled in
             | Windows and Mac with system update. Even more time before
             | you will have such models inside your smartphone.
             | 
              | OpenAI made a good move in making GPT-4o mini so dirt cheap
              | that it's faster and cheaper to run than Llama 3.1 70B.
             | Most consumers will interact with LLM via some apps using
             | LLM API, Web Panel on desktop or native mobile app for the
             | same reason most people use GMail etc. instead of native
              | email client. Setting up IMAP, POP, etc. is out of reach for
              | most people, just like installing Ollama + Docker +
             | OpenWebUI
             | 
              | App developers are not gonna bet on local LLMs only, as long
              | as they are not mainstream and preinstalled on 50%+ of devices.
        
           | nichochar wrote:
           | Totally. I wrote about this when they announced their dev-day
           | stuff.
           | 
           | In my opinion, they've found that intelligence with current
           | architecture is actually an S-curve and not an exponential,
            | so they're trying to make progress in other directions: UX
            | and EQ.
           | 
           | https://nicholascharriere.com/blog/thoughts-openai-spring-
           | re...
        
         | ActorNightly wrote:
         | The thing I don't understand is why everyone is throwing money
         | at LLMs for language, when there are much simpler use cases
         | which are more useful?
         | 
         | For example, has anyone ever attempted image -> html/css model?
         | Seems like it be great if I can draw something on a piece of
         | paper and have it generate a website view for me.
        
           | jacobn wrote:
           | I was under the impression that you could more or less do
           | something like that with the existing LLMs?
           | 
           | (May work poorly of course, and the sample I think I saw a
           | year ago may well be cherry picked)
        
           | GaggiX wrote:
           | >For example, has anyone ever attempted image -> html/css
           | model?
           | 
            | Have you tried uploading the image to an LLM with vision
           | capabilities like GPT-4o or Claude 3.5 Sonnet?
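            | 
            | (A minimal sketch with the OpenAI Python SDK, assuming a photo
            | of the hand-drawn wireframe saved as sketch.jpg:)
            | 
            |     import base64
            |     from openai import OpenAI
            | 
            |     client = OpenAI()
            |     with open("sketch.jpg", "rb") as f:
            |         b64 = base64.b64encode(f.read()).decode()
            | 
            |     resp = client.chat.completions.create(
            |         model="gpt-4o",
            |         messages=[{"role": "user", "content": [
            |             {"type": "text",
            |              "text": "Turn this wireframe into a single HTML "
            |                      "file with inline CSS."},
            |             {"type": "image_url", "image_url":
            |                 {"url": f"data:image/jpeg;base64,{b64}"}},
            |         ]}],
            |     )
            |     print(resp.choices[0].message.content)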
        
             | machiaweliczny wrote:
              | I tried, and Sonnet 3.5 can copy most common UIs.
        
           | majiy wrote:
           | That's a thought I had. For example, could a model be trained
           | to take a description, and create a Blender (or whatever
           | other software) model from it? I have no idea how LLMs really
           | work under the hood, so please tell me if this is nonsense.
        
             | eurekin wrote:
              | I'm waiting for exactly this; GPT-4 currently trips up a lot
              | with Blender (nonsensical order of operations, etc.)
        
           | ascorbic wrote:
           | All of the multi-modal LLMs are reasonably good at this.
        
           | chipdart wrote:
           | > For example, has anyone ever attempted image -> html/css
           | model?
           | 
           | There are already companies selling services where they
           | generate entire frontend applications from vague natural
           | language inputs.
           | 
           | https://vercel.com/blog/announcing-v0-generative-ui
        
           | rkwz wrote:
           | Perhaps if we think of LLMs as search engines (Google, Bing
           | etc) then there's more money to be made by being the top
           | generic search engine than the top specialized one (code
           | search, papers search etc)
        
           | JumpCrisscross wrote:
           | > _has anyone ever attempted image - > html/css model?_
           | 
           | I had a discussion with a friend about doing this, but for
           | CNC code. The answer was that a model trained on a narrow
           | data set underperforms one trained on a large data set and
           | then fine tuned with the narrow one.
        
           | drexlspivey wrote:
            | They did that in the GPT-4 demo 1.5 years ago.
           | https://www.youtube.com/watch?v=GylMu1wF9hw
        
           | slashdave wrote:
           | Not sure why you think interpreting a hand drawing is
           | "simpler" than parsing sequential text.
        
         | swyx wrote:
         | indeed. I pointed out in
         | https://buttondown.email/ainews/archive/ainews-llama-31-the-...
         | that the frontier model curve is currently going down 1 OoM
         | every 4 months, meaning every model release has a very short
         | half life[0]. however this progress is still worth it if we can
         | deploy it to improve millions and eventually billions of
          | people's lives. A commenter pointed out that the amount spent
         | on Llama 3.1 was only like 60% of the cost of Ant Man and the
         | Wasp Quantumania, in which case I'd advocate for killing all
         | Marvel slop and dumping all that budget on LLM progress.
         | 
         | [0] not technically complete depreciation, since for example 4o
         | mini is widely believed to be a distillation of 4o, so 4o's
         | investment still carries over into 4o mini
        
           | thierrydamiba wrote:
           | Agreed on everything, but calling the marvel movies slop...I
           | think that word has gone too far.
        
             | ThrowawayTestr wrote:
             | The marvel movies are the genesis for this use of the word
             | slop.
        
               | simonw wrote:
               | Can you back that claim up with a link or similar?
        
             | RUnconcerned wrote:
             | Not only are Marvel movies slop, they are very concentrated
             | slop. The only way to increase the concentration of slop in
             | a Marvel movie would be to ask ChatGPT to write the next
             | one.
        
             | mattnewton wrote:
             | Not all Marvel films are slop. But, as a fan who comes from
             | a family of fans and someone who has watched almost all of
              | them: let's be real. That particular film, and really most
              | of them, contain copious amounts of what is absolutely
             | _slop_.
             | 
             | I don't know if the utility is worse than an LLM that is
             | SOTA for 2 months that no one even bothers switching to
             | however - at least the marvel slop is being used for
             | entertainment by someone. I think the market is definitely
             | prioritizing the LLM researcher over Disney's latest slop
             | sequel though so whoever made that comparison can rest
             | easy, because we'll find out.
        
               | lawlessone wrote:
               | >really and most of them, contain copious amounts of what
               | is absolutely slop.
               | 
               | I thought that was the allure, something that's camp
               | funny and an easy watch.
               | 
               | I have only watched a few of them so I am not fully
               | familiar?
        
             | bn-l wrote:
             | It's junk food. No one is disputing how tasty it is though
             | (including the recent garbage).
        
           | throwup238 wrote:
           | All that Marvel slop was created by the first real LLM:
           | <https://marvelcinematicuniverse.fandom.com/wiki/K.E.V.I.N.>
        
           | troupo wrote:
           | > however this progress is still worth it if we can deploy it
           | to improve millions and eventually billions of people's lives
           | 
           | Has there been any indication that we're improving the lives
           | of millions of people?
        
             | zooq_ai wrote:
             | Yes, just like the internet, power users have found use
             | cases. It'll take education / habit for general users.
        
               | troupo wrote:
               | Ah yes. We're in the crypto stages of "it's like the
               | internet".
        
             | machiaweliczny wrote:
             | Just me coding 30% faster is worth it
        
               | troupo wrote:
               | I haven't found a single coding problem where any of
               | these coding assistants were anything but annoying.
               | 
               | If I need to babysit a junior developer fresh out of
               | school and review every single line of code it spits out,
               | I can find them elsewhere
        
         | Workaccount2 wrote:
         | I think GPT5 will be the signal of whether or not we have hit a
         | plateau. The space is still rapidly developing, and while large
         | model gains are getting harder to pick apart, there have been
         | enormous gains in the capabilities of light weight models.
        
           | zainhoda wrote:
           | I'm waiting for the same signal. There are essentially 2
           | vastly different states of the world depending on whether
           | GPT-5 is an incremental change vs a step change compared to
           | GPT-4.
        
           | chipdart wrote:
           | > I think GPT5 will be the signal of whether or not we have
           | hit a plateau.
           | 
           | I think GPT5 will tell if OpenAI hit a plateau.
           | 
           | Sam Altman has been quoted as claiming "GPT-3 had the
           | intelligence of a toddler, GPT-4 was more similar to a smart
           | high-schooler, and that the next generation will look to have
           | PhD-level intelligence (in certain tasks)"
           | 
           | Notice the high degree of upselling based on vague claims of
           | performance, and the fact that the jump from highschooler to
           | PhD can very well be far less impressive than the jump from
           | toddler to high schooler. In addition, notice the use of
           | weasel words to frame expectations regarding "the next
           | generation" to limit these gains to corner cases.
           | 
           | There's some degree of salesmanship in the way these models
           | are presented, but even between the hyperboles you don't see
           | claims of transformative changes.
        
             | rvnx wrote:
             | PhD level-of-task-execution sounds like the LLM will debate
             | whether the task is ethical instead of actually doing it
        
               | airspresso wrote:
               | lol! Producing academic papers for future training runs
               | then.
        
               | throwadobe wrote:
               | I wish I could frame this comment
        
             | splwjs wrote:
             | >some degree of salesmanship
             | 
             | buddy every few weeks one of these bozos is telling us
             | their product is literally going to eclipse humanity and we
             | should all start fearing the inevitable great collapse.
             | 
             | It's like how no one owns a car anymore because of ai
             | driving and I don't have to tell you about the great bank
             | disaster of 2019, when we all had to accept that fiat
             | currency is over.
             | 
             | You've got to be a particular kind of unfortunate to
             | believe it when sam altman says literally anything.
        
             | sensanaty wrote:
             | Basically every single word out of Mr Worldcoin's mouth is
             | a scam of some sort.
        
           | mupuff1234 wrote:
           | Which is why they'll keep calling the next few models GPT4.X
        
         | speed_spread wrote:
         | Benchmark scores aren't good because they apply to previous
         | generations of LLMs. That 2.23% uptick can actually represent a
         | world of difference in subjective tests and definitely be worth
         | the investment.
         | 
         | Progress is not slowing down but it gets harder to quantify.
        
         | satvikpendem wrote:
         | This is already what the Chinchilla paper surmised; it's no
         | wonder that its prediction now comes to fruition. It is like
         | an accelerated version of Moore's Law, because software
         | development itself is more accelerated than hardware
         | development.
        
         | chipdart wrote:
         | > It seems increasingly apparent that we are reaching the
         | limits of throwing more data at more GPUs;
         | 
         | I think you're just seeing the "make it work" stage of the
         | combo "first make it work, then make it fast".
         | 
         | Time to market is critical, as you can attest by the fact you
         | framed the situation as "on par with GPT-4o and Claude Opus".
         | You're seeing huge investments because being the first to get a
         | working model stands to benefit greatly. You can only assess
         | models that exist, and for that you need to train them at a
         | huge computational cost.
        
           | romeros wrote:
           | ChatGPT is like Google now. It is the default. Even if Claude
           | becomes as good as ChatGPT or even slightly better it won't
           | make me switch. It has to be like a lot better. Way better.
           | 
           | It feels like ChatGPT won the time to market war already.
        
             | Tostino wrote:
             | Eh, with the degradation of coding performance in ChatGPT I
             | made the switch. Seems much better to work with on
             | problems, and I have to do way less hand holding to get
             | good results.
             | 
             | I'll switch again soon as something better is out.
        
             | brandall10 wrote:
             | But plenty of people switched to Claude, esp. with Sonnet
             | 3.5. Many of them are in this very thread.
             | 
             | You may be right about the average person on the street,
             | but I wonder how many have lost interest in LLM usage and
             | cancelled their GPT Plus sub.
        
             | asah wrote:
             | -1: I know many people who are switching to Claude. And
             | Google makes it near-zero friction to adopt Gemini with
             | Gsuite. And more still are using the top-N of them.
             | 
             | This is similar to the early days of the search engine
             | wars, the browser wars, and other categories where a user
             | can easily adopt, switch between and use multiple. It's not
             | like the cellphone OS/hardware war, PC war and database war
             | where (most) users can only adopt one platform at a time
             | and/or there's a heavy platform investment.
        
             | staticman2 wrote:
             | If ChatGPT fails to do a task you want, your instinct isn't
             | "I'll run the prompt through Claude and see if it works"
             | but "oh well, who needs LLMs?"
        
               | atxbcp wrote:
               | Please don't assume your experience applies to everyone.
               | If ChatGPT can't do what I want, my first reaction is to
               | ask Claude for the same thing. Often to find out that
               | Claude performs much better. I've already cancelled
               | ChatGPT Plus for exactly that reason.
        
               | staticman2 wrote:
               | You just did that Internet thing where someone reads the
               | reply someone wrote without the comment they are replying
               | to, completely misunderstanding the conversation.
        
             | xcv123 wrote:
             | Dude that is retarded. It's a website and it costs nothing
             | to open another browser tab. You can use both at the same
             | time. I'm sure you browse multiple websites per day and
             | have multiple tabs open. No difference.
             | 
             | ChatGPT is nowhere close to perfection, and we are still in
             | the early days with plenty of competition. None of the LLMs
             | are that good yet.
             | 
             | Many users here are using both Claude and ChatGPT because
             | it's just another fucking tab in the browser. Try it out.
        
         | genrilz wrote:
         | For this model, it seems like the point is that it uses far
         | fewer parameters than at least the large Llama model while
         | having near identical performance. Given how large these models
         | are getting, this is an important thing to do before making
         | performance better again.
        
         | skybrian wrote:
         | I think it's impressive that they're doing it on a single
         | (large) node. Costs matter. Efficiency improvements like this
         | will probably increase capabilities eventually.
         | 
         | I'm also optimistic about building better (rather than bigger)
         | datasets to train on.
        
         | 42lux wrote:
         | We always needed a tock to see real advancement, like with the
         | last model generation. The tick we had with the H100 was enough
         | to bring these models to market but that's it.
        
         | lossolo wrote:
         | For some time, we have been at a plateau because everyone has
         | caught up, which essentially means that everyone now has good
         | training datasets and uses similar tweaks to the architecture.
         | It seems that, besides new modalities, transformers might be a
         | dead end as an architecture. Better scores on benchmarks result
         | from better training data and fine-tuning. The so-called
         | 'agents' and 'function calling' also boil down to training data
         | and fine-tuning.
        
         | lolinder wrote:
         | > It seems increasingly apparent that we are reaching the
         | limits of throwing more data at more GPUs
         | 
         | Yes. This is exactly why I'm skeptical of AI
         | doomerism/saviorism.
         | 
         | Too many people have been looking at the pace of LLM
         | development over the last two (2) years, modeled it as an
         | exponential growth function, and come to the conclusion that
         | AGI is inevitable in the next ${1-5} years and we're headed for
         | ${(dys|u)topia}.
         | 
         | But all that assumes that we can extrapolate a pattern of long-
         | term exponential growth from less than two years of data. It's
         | simply not possible to project in that way, and we're already
         | seeing that OpenAI has pivoted from improving on GPT-4's
         | benchmarks to reducing cost, while competitors (including free
         | ones) catch up.
         | 
         | All the evidence suggests that the rate of growth in
         | capabilities of SOTA LLMs has been slowing for at least the
         | past year, which means predictions based on exponential growth
         | all need to be reevaluated.
        
           | cjalmeida wrote:
           | Indeed. All exponential growth curves are sigmoids in
           | disguise.
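           | 
           | (A small numeric illustration of the point -- a minimal
           | sketch, not tied to any particular capability metric: a
           | logistic curve is nearly indistinguishable from an
           | exponential until it approaches its ceiling.)
           | 
           |     import math
           | 
           |     def logistic(t, ceiling=1000.0, k=1.0, t0=10.0):
           |         return ceiling / (1 + math.exp(-k * (t - t0)))
           | 
           |     def exponential(t, ceiling=1000.0, k=1.0, t0=10.0):
           |         return ceiling * math.exp(k * (t - t0))
           | 
           |     for t in (0, 2, 4, 8, 10, 12):
           |         print(t, round(logistic(t), 2),
           |               round(exponential(t), 2))
           |     # Early on the two curves agree closely; past t0 the
           |     # logistic flattens while the exponential keeps climbing.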
        
             | nicman23 wrote:
             | except when it isn't and we ded :P
        
               | kridsdale3 wrote:
               | I don't think Special Relativity would allow that.
        
             | ToValueFunfetti wrote:
             | This is something that is definitionally true in a finite
             | universe, but doesn't carry a lot of useful predictive
             | value in practice unless you can identify when the
             | flattening will occur.
             | 
             | If you have a machine that converts mass into energy and
             | then uses that energy to increase the rate at which it
             | operates, you could rightfully say that it will level off
             | well before consuming all of the mass in the universe. You
             | just can't say that next week after it has consumed all of
             | the mass of Earth.
        
           | RicoElectrico wrote:
           | I don't think we are approaching limits, if you take off the
           | English-centric glasses. You can query LLMs about pretty
           | basic questions about Polish language or literature and it's
           | gonna either bullshit or say it doesn't know the answer.
           | 
           | Example:
           | 
           |     w ktorej gwarze jest slowo ekspres i co znaczy?
           |     ("in which dialect does the word 'ekspres' exist, and
           |     what does it mean?")
           | 
           |     Slowo "ekspres" wystepuje w gwarze slaskiej i oznacza tam
           |     ekspres do kawy. Jest to skrot od nazwy "ekspres do
           |     kawy", czyli urzadzenia sluzacego do szybkiego
           |     przygotowania kawy. ("The word 'ekspres' occurs in the
           |     Silesian dialect and there means a coffee machine. It is
           |     short for 'ekspres do kawy', i.e. a device for quickly
           |     preparing coffee.")
           | 
           | The correct answer is that "ekspres" is a zipper in the Lodz
           | dialect.
        
             | andrepd wrote:
             | Tbf, you can ask it basic questions in English and it will
             | also bullshit you.
        
             | nprateem wrote:
             | That's just same same but different, not a step change
             | towards significant cognitive ability.
        
             | lolinder wrote:
             | What this means is just that Polish support (and probably
             | most other languages besides English) in the models is
             | behind SOTA. We can gradually get those languages closer to
             | SOTA, but that doesn't bring us closer to AGI.
        
           | jeremyjh wrote:
           | I'm also wondering about the extent to which we are simply
           | burning venture capital versus actually charging subscription
           | prices that are sustainable long-term. It's easy to sell
           | dollars for $0.75 but you can only do that for so long.
        
           | dvansoye wrote:
           | What about synthetic data?
        
           | impossiblefork wrote:
           | Notice though, that all these improvements have been with
           | pretty basic transformer models that output all their
           | tokens-- no internal thoughts, no search, no architecture
           | improvements and things are only fed through them once.
           | 
           | But we could add internal thoughts-- we could make the model
           | generate tokens that aren't part of its output but are there
           | for it to better figure out its next token. This was tried
           | in Quiet-STaR.
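           | 
           | (A very rough, sequence-level sketch of the idea; Quiet-STaR
           | itself interleaves rationales at the token level, and
           | `generate` below is a hypothetical decoding helper, not any
           | specific API.)
           | 
           |     def answer_with_hidden_thought(model, prompt, generate):
           |         # Tokens between the thought tags are produced but
           |         # never shown to the user.
           |         thought = generate(model, prompt + "\n<thought>",
           |                            stop="</thought>")
           |         # The visible answer is conditioned on the prompt
           |         # plus the hidden thought.
           |         return generate(model,
           |                         prompt + "\n<thought>" + thought +
           |                         "</thought>\n<answer>",
           |                         stop="</answer>")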
           | 
           | Hochreiter is also active with alternative models, and
           | there's all the microchip design companies, Groq, Etched,
           | etc. trying to speed up models and reduce model running cost.
           | 
           | Therefore, I think there's room for very great improvements.
           | They may not come right away, but there are so many obvious
           | paths to improve things that I think it's unreasonable to
           | think progress has stalled. Also, presumably GPT-5 isn't far
           | away.
        
             | lolinder wrote:
             | > Also, presumably GPT-5 isn't far away.
             | 
             | Why do we presume that? People were saying this right
             | before 4o and then what came out was not 5 but instead a
             | major improvement on cost for 4.
             | 
             | Is there any specific reason to believe OpenAI has a model
             | coming soon that will be a major step up in capabilities?
        
               | impossiblefork wrote:
               | OpenAI have made statements saying they've begun training
               | it, as they explain here:
               | https://openai.com/index/openai-board-forms-safety-and-
               | secur...
               | 
               | I assume that this won't take forever, but will be done
               | this year. A couple of months, not more.
        
             | audunw wrote:
             | > But we could add internal thoughts
             | 
             | It feels like there's an assumption in the community that
             | this will be almost trivial.
             | 
             | I suspect it will be one of the hardest tasks humanity has
             | ever endeavoured. I'm guessing it has already been tried
             | many times in internal development.
             | 
             | I suspect if you start creating a feedback loop with these
             | models they will tend to become very unstable very fast. We
             | already see with these more linear LLMs that they can be
             | extremely sensitive to the values of parameters like the
             | temperature settings, and can go "crazy" fairly easily.
             | 
             | With feedback loops it could become much harder to prevent
             | these AIs from spinning out of control. And no I don't mean
             | in the "become an evil paperclip maximiser" kind of way.
             | Just plain unproductive insanity.
             | 
             | I think I can summarise my vision of the future in one
             | sentence: AI psychologists will become a huge profession,
             | and it will be just as difficult and nebulous as being a
             | human psychologist.
        
           | jpadkins wrote:
           | > we're already seeing that OpenAI has pivoted from improving
           | on GPT-4's benchmarks to reducing cost, while competitors
           | (including free ones) catch up.
           | 
           | What if they have two teams? One dedicated to optimizing
           | (cost, speed, etc) the current model and a different team
           | working on the next frontier model? I don't think we know the
           | growth curve until we see gpt5.
        
             | lolinder wrote:
             | > I don't think we know the growth curve until we see gpt5.
             | 
             | I'm prepared to be wrong, but I think that the fact that we
             | still haven't seen GPT-5 or even had a proper teaser for it
             | 16 months after GPT-4 is evidence that the growth curve is
             | slowing. The teasers that the media assumed were for GPT-5
             | seem to have actually been for GPT-4o [0]:
             | 
             | > Lex Fridman(01:06:13) So when is GPT-5 coming out again?
             | 
             | > Sam Altman(01:06:15) I don't know. That's the honest
             | answer.
             | 
             | > Lex Fridman(01:06:18) Oh, that's the honest answer. Blink
             | twice if it's this year.
             | 
             | > Sam Altman(01:06:30) We will release an amazing new model
             | this year. I don't know what we'll call it.
             | 
             | > Lex Fridman(01:06:36) So that goes to the question of,
             | what's the way we release this thing?
             | 
             | > Sam Altman(01:06:41) We'll release in the coming months
             | many different things. I think that'd be very cool. I think
             | before we talk about a GPT-5-like model called that, or not
             | called that, or a little bit worse or a little bit better
             | than what you'd expect from a GPT-5, I think we have a lot
             | of other important things to release first.
             | 
             | Note that last response. That's not the sound of a CEO who
             | has an amazing v5 of their product lined up, that's the
             | sound of a CEO who's trying to figure out how to brand the
             | model that they're working on that will be cheaper but not
             | substantially better.
             | 
             | [0] https://arstechnica.com/information-
             | technology/2024/03/opena...
        
         | niemandhier wrote:
         | The next iteration depends on NVIDIA & co; what we need is
         | sparse libs. Most of the weights in LLMs are 0; once we deal
         | with those more efficiently we will get to the next iteration.
        
           | lawlessone wrote:
           | > Most of the weights in LLMs are 0,
           | 
           | that's interesting. Do you have a rough percentage of this?
           | 
           | Does this mean these connections have no influence at all on
           | output?
        
             | machiaweliczny wrote:
             | My uneducated guess is that with many layers you can
             | implement something akin to a graph in the brain by nulling
             | lots of previous layer outputs. I actually suspect that
             | current models aren't optimal with layers all of the same
             | size, but I know shit.
        
               | kridsdale3 wrote:
               | This is quite intuitive. We know that a biological neural
               | net is a graph data structure. And ML systems on GPUs are
               | more like layers of bitmaps in Photoshop (it's a graphics
               | processor). So if most of the layers are akin to
               | transparent pixels, in order to build a graph by
               | stacking, that's hyper memory inefficient.
        
         | m3kw9 wrote:
         | There are different directions in which AI has lots to improve:
         | multi-modal, which branches into robotics; and single-modal,
         | like image, video, and sound generation and understanding. Also
         | I would check back when OpenAI releases 5.
        
         | swalsh wrote:
         | And with the increasing parameter size, the main winner will be
         | Nvidia.
         | 
         | Frankly I just don't understand the economics of training a
         | foundation model. I'd rather own an airline. At least I can get
         | a few years out of the capital investment of a plane.
        
           | machiaweliczny wrote:
           | But billionaires already have that, they want a chance of
           | getting their own god.
        
         | mlsu wrote:
         | What else can be done?
         | 
         | If you are sitting on $1 billion of GPU capex, what's $50
         | million in energy/training cost for another incremental run
         | that may beat the leaderboard?
         | 
         | Over the last few years the market has placed its bets that
         | this stuff will make gobs of money somehow. We're all not sure
         | how. They're probably thinking -- it's likely that whoever is
         | ahead by a few % is going to sweep and take most of this
         | hypothetical value. What's another few million, especially if
         | you already have the GPUs?
         | 
         | I think you're right -- we are towards the right end of the
         | sigmoid. And with no "killer app" in sight. It is great for all
         | of us that they have created all this value, because I don't
         | think anyone will be able to capture it. They certainly haven't
         | yet.
        
         | sebzim4500 wrote:
         | I don't think we can conclude that until someone trains a model
         | that is significantly bigger than GPT-4.
        
       | rkwz wrote:
       | > A significant effort was also devoted to enhancing the model's
       | reasoning capabilities. One of the key focus areas during
       | training was to minimize the model's tendency to "hallucinate" or
       | generate plausible-sounding but factually incorrect or irrelevant
       | information. This was achieved by fine-tuning the model to be
       | more cautious and discerning in its responses, ensuring that it
       | provides reliable and accurate outputs.
       | 
       | Is there a benchmark or something similar that compares this
       | "quality" across different models?
        
         | amilios wrote:
         | Unfortunately not, as it captures such a wide spectrum of use
         | cases and scenarios. There are some benchmarks to measure this
         | quality in specific settings, e.g. summarization, but AFAIK
         | nothing general.
        
           | rkwz wrote:
           | Thanks, any ideas why it's not possible to build a generic
           | eval for this? It's essentially about asking a set of
           | questions that aren't public knowledge (or are made up) and
           | checking whether the model says "I don't know".
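           | 
           | (A rough sketch of the kind of eval loop being described,
           | with made-up questions and a hypothetical ask_model helper --
           | not an existing benchmark.)
           | 
           |     UNANSWERABLE = [
           |         # Facts the model cannot know, so the only correct
           |         # behaviour is to decline rather than guess.
           |         "What did I eat for breakfast on 2024-07-01?",
           |         "What is the serial number of my first laptop?",
           |     ]
           | 
           |     REFUSALS = ("i don't know", "i do not know",
           |                 "cannot determine", "not sure")
           | 
           |     def hallucination_rate(ask_model):
           |         # ask_model(question) -> the model's answer string
           |         hallucinated = 0
           |         for q in UNANSWERABLE:
           |             answer = ask_model(q).lower()
           |             if not any(r in answer for r in REFUSALS):
           |                 hallucinated += 1  # answered instead of declining
           |         return hallucinated / len(UNANSWERABLE)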
        
       | moralestapia wrote:
       | Nice, they finally got the memo that GPT-4 exists and included
       | it in their benchmarks.
        
       | gavinray wrote:
       | "It's not the size that matters, but how you use it."
        
       | epups wrote:
       | The graphs seem to indicate their model trades blows with Llama
       | 3.1 405B, which has more than 3x the number of parameters and
       | (presumably) a much bigger compute budget. It's kind of baffling
       | if this is confirmed.
       | 
       | Apparently Llama 3.1 relied on artificial data, would be very
       | curious about the type of data that Mistral uses.
        
       | OldGreenYodaGPT wrote:
       | I still prefer ChatGPT-4o and use Claude if I have issues, but
       | it never does any better.
        
         | jasonjmcghee wrote:
         | This is super interesting to me.
         | 
         | Claude Sonnet 3.5 outperforms GPT-4o by a significant margin on
         | every one of my use cases.
         | 
         | What do you use it for?
        
       | breck wrote:
       | When I see this "(c) 2024 [Company Name], All rights reserved",
       | it's a tell that the company does not understand how hopelessly
       | behind they are about to be.
        
         | crowcroft wrote:
         | Could you elaborate on this? Would love to understand what
         | leads you to this conclusion.
        
           | breck wrote:
           | E = T/A! [0]
           | 
           | A faster evolving approach to AI is coming out this year that
           | will smoke anyone who still uses the term "license" in
           | regards to ideas [1].
           | 
           | [0] https://breckyunits.com/eta.html [1]
           | https://breckyunits.com/freedom.html
        
             | christianqchung wrote:
             | So it's made up?
        
               | breck wrote:
               | I do what I say and I say what I do.
               | 
               | https://github.com/breck7/breckyunits.com/blob/afe70ad66c
               | fbb...
        
       | doctoboggan wrote:
       | The question I (and I suspect most other HN readers) have is
       | which model is best for coding? While I appreciate the advances
       | in open weights models and all the competition from other
       | companies, when it comes to my professional use I just want the
       | best. Is that still GPT-4?
        
         | tikkun wrote:
         | My personal experience says Claude 3.5 Sonnet.
        
           | stri8ed wrote:
           | The benchmarks agree as well.
        
         | kim0 wrote:
         | I kinda trust https://aider.chat/docs/leaderboards/
        
       | ashenke wrote:
       | I tested it with my Claude prompt history; the results are as
       | good as Claude 3.5 Sonnet, but it's 2 or 3 times slower.
        
       | Tepix wrote:
       | Just in case you haven't RTFA: Mistral Large 2 is 123B.
        
       | rkwasny wrote:
       | All evals we have are just far too easy! <1% difference is just
       | noise/bad data
       | 
       | We need to figure out how to measure intelligence that is greater
       | than human.
        
         | omneity wrote:
         | Give it problems most/all humans can't solve on their own, but
         | that are easy to verify.
         | 
         | Math problems being one of them, if only LLMs were good at pure
         | math. Another possibility is graph problems. Haven't tested
         | this much though.
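         | 
         | (A tiny sketch of the "easy to verify" pattern with a graph-
         | coloring question: checking a proposed answer is trivial even
         | when producing one is hard. Assumes networkx is installed and
         | that the sampled graph is actually 3-colorable.)
         | 
         |     import networkx as nx
         | 
         |     def make_problem(n=30, p=0.15, seed=0):
         |         g = nx.gnp_random_graph(n, p, seed=seed)
         |         q = ("Assign one of 3 colors to each node so that no "
         |              "edge joins same-colored nodes. Edges: "
         |              f"{list(g.edges)}")
         |         return g, q
         | 
         |     def verify(g, coloring):
         |         # coloring: dict node -> color. Checking is trivial,
         |         # even though finding a valid 3-coloring is NP-hard
         |         # in general.
         |         return (set(coloring) == set(g.nodes) and
         |                 len(set(coloring.values())) <= 3 and
         |                 all(coloring[a] != coloring[b]
         |                     for a, b in g.edges))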
        
       | tonetegeatinst wrote:
       | What do they mean by "single-node inference"?
       | 
       | Do they mean inference done on a single machine?
        
         | simonw wrote:
         | Yes, albeit a really expensive one. Large models like GPT-4 are
         | rumored to run inference on multiple machines because they
         | don't fit in VRAM for even the most expensive GPUs.
         | 
         | (I wouldn't be surprised if GPT-4o mini is small enough to fit
         | on a single large instance though, would explain how they could
         | drop the price so much.)
        
         | bjornsing wrote:
         | Yeah that's how I read it. Probably means 8 x 80 GB GPUs.
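         | 
         | (Back-of-envelope arithmetic for why a 123B model fits on one
         | such node; assumes 16-bit weights and ignores KV-cache and
         | activation overhead.)
         | 
         |     params = 123e9              # Mistral Large 2 parameters
         |     bytes_per_param = 2         # fp16 / bf16
         |     weights_gb = params * bytes_per_param / 1e9   # ~246 GB
         |     node_vram_gb = 8 * 80       # 8 x 80 GB GPUs = 640 GB
         |     print(weights_gb, node_vram_gb)
         |     # 246 GB of weights fits comfortably in 640 GB, leaving
         |     # room for KV cache; a much larger (multi-trillion
         |     # parameter) model would not.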
        
       | huevosabio wrote:
       | The non-commercial license is underwhelming.
       | 
       | It seems to be competitive with Llama 3.1 405b but with a much
       | more restrictive license.
       | 
       | Given how the difference between these models is shrinking, I
       | think you're better off using llama 405B to finetune the 70B on
       | the specific use case.
       | 
       | This would be different if it was a major leap in quality, but it
       | doesn't seem to be.
       | 
       | Very glad that there's a lot of competition at the top, though!
        
       | calibas wrote:
       | "Mistral Large 2 is equipped with enhanced function calling and
       | retrieval skills and has undergone training to proficiently
       | execute both parallel and sequential function calls, enabling it
       | to serve as the power engine of complex business applications."
       | 
       | Why does the chart below say the "Function Calling" accuracy is
       | about 50%? Does that mean it fails half the time with complex
       | operations?
        
         | simonw wrote:
         | Mistral forgot to say which benchmark they were using for that
         | chart, without that information it's impossible to determine
         | what it actually means.
        
         | Me1000 wrote:
         | Relatedly, what does "parallel" function calling mean in this
         | context?
        
           | simonw wrote:
           | That's when the LLM can respond with multiple functions it
           | wants you to call at once. You might send it:
           | Location and population of Paris, France
           | 
           | A parallel function calling LLM could return:
           | {           "role": "assistant",           "content": "",
           | "tool_calls": [             {               "function": {
           | "name": "get_city_coordinates",                 "arguments":
           | "{\"city\": \"Paris\"}"               }             }, {
           | "function": {                 "name": "get_city_population",
           | "arguments": "{\"city\": \"Paris\"}"               }
           | }           ]         }
           | 
           | Indicating that you should execute both of those functions
           | and return the results to the LLM as part of the next prompt.
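           | 
           | (A minimal sketch of the caller's side, assuming hypothetical
           | get_city_coordinates / get_city_population implementations
           | and an OpenAI-style "tool" message format.)
           | 
           |     import json
           | 
           |     def get_city_coordinates(city):
           |         return {"lat": 48.86, "lon": 2.35}   # stub
           | 
           |     def get_city_population(city):
           |         return {"population": 2_100_000}     # stub
           | 
           |     TOOLS = {"get_city_coordinates": get_city_coordinates,
           |              "get_city_population": get_city_population}
           | 
           |     def run_tool_calls(tool_calls):
           |         # Execute every function the model asked for and
           |         # collect the results for the next prompt.
           |         results = []
           |         for call in tool_calls:
           |             fn = TOOLS[call["function"]["name"]]
           |             args = json.loads(call["function"]["arguments"])
           |             results.append({"role": "tool",
           |                             "content": json.dumps(fn(**args))})
           |         return results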
        
             | Me1000 wrote:
             | Ah, thank you!
        
       | RyanAdamas wrote:
       | Personally, language diversity should be the last thing on the
       | list. If we had optimized every piece of software from the
       | get-go for a dozen languages, our forward progress would have
       | been dead in the water.
        
         | moffkalast wrote:
         | You'd think so, but 3.5-turbo was multilingual from the get go
         | and benefitted massively from it. If you want to position
         | yourself as a global leader, then excluding 95% of the world
         | who aren't English native speakers seems like a bad idea.
        
           | RyanAdamas wrote:
           | Yeah clearly, OpenAI is rocketing forward and beyond.
        
             | moffkalast wrote:
             | Constant infighting and most of the competent people
             | leaving will do that to a company.
             | 
             | I mean more on a model performance level though. It's been
             | shown that something trained in one language trains the
             | model to be able to output it in any other language it
             | knows. There's quality human data being left on the table
             | otherwise. Besides, translation is one of the few tasks
             | that language models are by far the best at if trained
             | properly, so why not do something you can sell as a main
             | feature?
        
         | gpm wrote:
         | Language diversity means access to more training data, and you
         | might also hope that by learning the same concept in multiple
         | languages it does a better job of learning the underlying
         | concept independent of the phrase structure...
         | 
         | At least from a distance it seems like training a multilingual
         | state of the art model might well be easier than a monolingual
         | one.
        
           | RyanAdamas wrote:
           | Multiple input and output processes in different languages
           | have zero effect on associative learning and creative
           | formulation, in my estimation. We've already done studies
           | that show there is no correlation between human intelligence
           | and knowing multiple languages, after having to put up with
           | decades of "Americans le dumb because..." and this is no
           | different. The amount of discourse on a single topic has a
           | limited degree of usability before redundancies appear. Such
           | redundancies would necessarily increase the processing
           | burden, which could actually limit the output potential for
           | novel associations.
        
             | gpm wrote:
             | Humans also don't learn by reading the entire internet...
             | assuming human psych studies apply to LLMs at all is just
             | wrong.
        
             | logicchains wrote:
             | Google mentioned this in one of their papers: they found
             | that for large enough models, including more languages did
             | indeed lead to an overall increase in performance.
        
               | RyanAdamas wrote:
               | Considering Google's progress and censorship history, I'm
               | inclined to take their assessments with a grain of salt.
        
       | wesleyyue wrote:
       | I'm building an AI coding assistant (https://double.bot) so I've
       | tried pretty much all the frontier models. I added it this
       | morning to play around with it and it's probably the worst model
       | I've ever played with. Less coherent than 8B models. Worst case
       | of benchmark hacking I've ever seen.
       | 
       | example: https://x.com/WesleyYue/status/1816153964934750691
        
         | mpeg wrote:
         | to be fair that's quite a weird request (the initial one) - I
         | feel a human would struggle to understand what you mean
        
           | wesleyyue wrote:
           | definitely not an articulate request, but the point of using
           | these tools is to speed me up. The less the user has to
           | articulate and the more it can infer correctly, the more
           | helpful it is. Other frontier models don't have this problem.
           | 
           | Llama 405B response would be exactly what I expect
           | 
           | https://x.com/WesleyYue/status/1816157147413278811
        
             | mpeg wrote:
             | That response is bad Python though; I can't think of why
             | you'd ever want a dict with Literal-typed keys.
             | 
             | Either use a TypedDict if you want the keys to be in a
             | specific set, or, in your case, since both the keys and the
             | values are static, you should really be using an Enum.
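             | 
             | (A small sketch of the two options, with made-up keys since
             | the original dict isn't shown.)
             | 
             |     from enum import Enum
             |     from typing import TypedDict
             | 
             |     # Option 1: TypedDict constrains which keys may appear.
             |     class Endpoints(TypedDict):
             |         prod: str
             |         staging: str
             | 
             |     endpoints: Endpoints = {
             |         "prod": "https://api.example.com",
             |         "staging": "https://staging.example.com",
             |     }
             | 
             |     # Option 2: when both keys and values are static, an
             |     # Enum is simpler.
             |     class Endpoint(str, Enum):
             |         PROD = "https://api.example.com"
             |         STAGING = "https://staging.example.com"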
        
         | ijustlovemath wrote:
         | What was the expected outcome for you? AFAIK, Python doesn't
         | have a const dictionary. Were you wanting it to refactor into a
         | dataclass?
        
           | wesleyyue wrote:
           | Yes, there are a few things wrong:
           | 
           | 1. If it assumes TypeScript, it should do `as const` in the
           | first msg.
           | 
           | 2. If it is Python, it should be something like
           | https://x.com/WesleyYue/status/1816157147413278811 which is
           | what I wanted but I didn't want to bother with the typing.
        
         | nabakin wrote:
         | Are you sure the chat history is being passed when the second
         | message is sent? That looks like the kind of response you'd
         | expect if it only received the prompt "in python" with no chat
         | history at all.
        
           | wesleyyue wrote:
           | Yes, I built the extension. I actually also just went to send
           | another message asking what the first msg was just to double
           | check I didn't have a bug and it does know what the first msg
           | was.
        
             | nabakin wrote:
             | Thanks, that's some really bad accuracy/performance
        
         | schleck8 wrote:
         | This makes no sense. Benchmarking code is easier than natural
         | language and Mistral has separate benchmarks for prominent
         | languages.
        
       | avereveard wrote:
       | important to note that this time around weights are available
       | https://huggingface.co/mistralai/Mistral-Large-Instruct-2407
        
       | ilaksh wrote:
       | How does their API pricing compare to 4o and 3.5 Sonnet?
        
         | rvnx wrote:
         | 3 USD per 1M input tokens, so the same as 3.5 Sonnet but worse
         | quality
        
       | ThinkBeat wrote:
       | A side note about the ever-increasing costs to advance the
       | models: I feel certain that some branch of what may be connected
       | to the NSA is running and advancing models that probably exceed
       | what the open market provides today.
       | 
       | Maybe they are running it on proprietary or semi-proprietary
       | hardware, but if they don't, how much does the market know about
       | where various shipments of NVIDIA processors end up?
       | 
       | I imagine most intelligence agencies are in need of vast
       | quantities.
       | 
       | I presume if M$ announces new availability of AI compute, it
       | means they have received and put into production X NVIDIA
       | processors, which might make it possible to guesstimate within
       | some bounds how many.
       | 
       | Same with other open market compute facilities.
       | 
       | Is it likely that a significant share of NVIDIA processors is
       | going to government / intelligence agencies / fronts?
        
       | teaearlgraycold wrote:
       | https://www.youtube.com/watch?v=rvrZJ5C_Nwg
        
       | modeless wrote:
       | The name just makes me think of the screaming cowboy song.
       | https://youtu.be/rvrZJ5C_Nwg?t=138
        
       | nen-nomad wrote:
       | The models are converging slowly. In the end, it will come down
       | to the user experience and the "personality." I have been
       | enjoying the new Claude Sonnet. It feels sharper than the others,
       | even though it is not the highest-scoring one.
       | 
       | One thing that `exponentialists` forget is that each step also
       | requires exponentially more energy and resources.
        
         | toomuchtodo wrote:
         | I have been paying for OpenAI since they started accepting
         | payment, but to echo your comment, Claude is _so good_ I am
         | primarily relying on it now for LLM driven work and cancelled
         | my OpenAI subscription. Genuine kudos to Mistral, they are a
         | worthy competitor in the space against Goliaths. They make
         | someone mediocre at writing code less so, so I can focus on
         | higher value work.
        
         | bilater wrote:
         | And a factor for Mistral typically is that it will give you
         | fewer refusals and can be uncensored. So if I had to guess, any
         | task that requires creative output could be better suited for
         | this.
        
       | thntk wrote:
       | Anyone know what caused the very big performance jump from Large
       | 1 to Large 2 in just a few months?
       | 
       | Besides, parameter redundancy seems evident. Frontier models used
       | to be 1.8T, then 405B, and now 123B. If frontier models in the
       | future were <10B or even <1B, that would be a game changer.
        
         | nuz wrote:
         | Lots and lots of synthetic data from the bigger models training
         | the smaller ones would be my guess.
        
         | duchenne wrote:
         | Counter-intuitively, larger models are cheaper to train.
         | However, smaller models are cheaper to serve. At first,
         | everyone was focusing on training, so the models were much
         | larger. Now that so many people are using AI every day,
         | companies spend more on training smaller models to save on
         | serving.
        
       | erichocean wrote:
       | I like Claude 3.5 Sonnet, but despite paying for a plan, I run
       | out of tokens after about 10 minutes. Text only, I'm typing
       | everything in myself.
       | 
       | It's almost useless because I literally can't use it.
       | 
       | Update: https://support.anthropic.com/en/articles/8325612-does-
       | claud...
       | 
       | 45 messages per 5 hours is the limit for Pro users, less if
       | Claude is wordy in its responses--which it always is. I hit that
       | limit so fast when I'm investigating something. So annoying.
       | 
       | They used to let you select another, worse model but I don't see
       | that option anymore. _le sigh_
        
       | mvdtnz wrote:
       | Imagine bragging about 74% accuracy in any other field of
       | software. You'd be laughed out of the room. But somehow it's
       | accepted in "AI".
        
         | kgeist wrote:
         | Well, we had close to 0% a few years ago (for general purpose
         | AI). I think it's not bad...
        
       | SebaSeba wrote:
       | Sorry for the slightly off topic question, but can someone
       | enlighten me which Claude model is more capable, Opus or Sonnet
       | 3.5? I am confused because I see people fussing about Sonnet 3.5
       | being the best and yet somehow I seem to read again and again in
       | factual texts and some benchmarks that Claude Opus is the most
       | capable. Is there a simple answer to the question, what do I not
       | understand? Please, thank you.
        
         | platelminto wrote:
         | Sonnet 3.5.
         | 
         | Opus is the largest model, but of the Claude 3 family. Claude
         | 3.5 is the newest family of models, with Sonnet being the
         | middle sized 3.5 model - and also the only available one.
         | Regardless, it's better than Opus (the largest Claude 3 one).
         | 
         | Presumably, a Claude 3.5 Opus will come out at some point, and
         | should be even better - but maybe they've found that increasing
         | the size for this model family just isn't cost effective. Or
         | doesn't improve things that much. I'm unsure if they've said
         | anything about it recently.
        
           | SebaSeba wrote:
           | Thank you :)
        
         | zamadatix wrote:
         | I think this image explains it best: https://www-
         | cdn.anthropic.com/images/4zrzovbb/website/1f0441...
         | 
         | I.e. Opus is the largest and best model of each family but
         | Sonnet is the first model of the 3.5 family and can beat 3's
         | Opus in most tasks. When 3.5 Opus is released it will again
         | outpace the 3.5 Sonnet model of the same family universally (in
         | terms of capability) but until then it's a comparison of two
         | different families without a universal guarantee, just a strong
         | lean towards the newer model.
        
           | SebaSeba wrote:
           | Thank you for clearing this out to me :)
        
       | htk wrote:
       | is it possible to run Large 2 on ollama?
        
       | novok wrote:
       | I kind of wonder why a lot of these places don't release
       | "amateur"-sized models anymore, at around the 18B & 30B parameter
       | sizes that you can run on a single 3090 or M2 Max at reasonable
       | speeds and RAM requirements. It's all 7B, 70B, 400B sizing
       | nowadays.
        
         | TobTobXX wrote:
         | Just a few days ago, Mistral released a 12B model:
         | https://mistral.ai/news/mistral-nemo/
        
         | logicchains wrote:
         | Because you can just quantise the 70B model to 3-4 bits and
         | it'll perform better than a 30B model but be a similar size.
        
           | novok wrote:
           | A 70B 4-bit model does not fit in a 24GB VRAM card; 30B
           | models are the sweet spot for that size of card at ~20GB,
           | with 4GB left for the system to still function.
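           | 
           | (Rough weights-only arithmetic behind both comments; exact
           | numbers shift with the quantization format and context
           | length.)
           | 
           |     def model_size_gb(params_billion, bits_per_weight):
           |         # Weights only; KV cache and runtime overhead add more.
           |         return params_billion * 1e9 * bits_per_weight / 8 / 1e9
           | 
           |     print(model_size_gb(70, 4))  # ~35 GB: 4-bit 70B overflows 24 GB
           |     print(model_size_gb(70, 3))  # ~26 GB: even 3-bit is a squeeze
           |     print(model_size_gb(30, 5))  # ~19 GB: ~5-bit 30B fits with room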
        
       | whisper_yb wrote:
       | Every day a new model better than the previous one lol
        
       | philip-b wrote:
       | Does any one of the top models have access to the internet and
       | googling things? I want an LLM to look things up and do casual
       | research for me when I'm lazy.
        
         | tikkun wrote:
         | I'd suggest using Perplexity.
        
       | freediver wrote:
       | Sharing PyLLMs [1] reasoning benchmark results for some of the
       | recent models. Surprised by nemo (speed/quality) and mistral
       | large is actually pretty good (but painfully slow).
       | 
       | AnthropicProvider('claude-3-haiku-20240307') Median Latency: 1.61
       | | Aggregated speed: 122.50 | Accuracy: 44.44%
       | 
       | MistralProvider('open-mistral-nemo') Median Latency: 1.37 |
       | Aggregated speed: 100.37 | Accuracy: 51.85%
       | 
       | OpenAIProvider('gpt-4o-mini') Median Latency: 2.13 | Aggregated
       | speed: 67.59 | Accuracy: 59.26%
       | 
       | MistralProvider('mistral-large-latest') Median Latency: 10.18 |
       | Aggregated speed: 18.64 | Accuracy: 62.96%
       | 
       | AnthropicProvider('claude-3-5-sonnet-20240620') Median Latency:
       | 3.61 | Aggregated speed: 59.70 | Accuracy: 62.96%
       | 
       | OpenAIProvider('gpt-4o') Median Latency: 3.25 | Aggregated speed:
       | 53.75 | Accuracy: 74.07%
       | 
       | [1] https://github.com/kagisearch/pyllms
        
       | zone411 wrote:
       | Improves from 17.7 for Mistral Large to 20.0 on the NYT
       | Connections benchmark.
        
       | greenchair wrote:
       | can anyone explain why the % success rates are so different
       | between these programming languages? is this a function of amount
       | of training data available for each language or due to complexity
       | of language or what?
        
       | h1fra wrote:
       | There are now more AI models than JavaScript frameworks!
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:01 UTC)