[HN Gopher] Large Enough - Mistral AI
___________________________________________________________________
Large Enough - Mistral AI
Author : davidbarker
Score : 514 points
Date : 2024-07-24 15:32 UTC (7 hours ago)
(HTM) web link (mistral.ai)
(TXT) w3m dump (mistral.ai)
| Always42 wrote:
| I'm really glad these guys exist
| TIPSIO wrote:
| This race for the top model is getting wild. Everyone is claiming
| to one-up each other with every version.
|
| In my experience (benchmarks aside), Claude 3.5 Sonnet absolutely
| blows everything away.
|
| I'm not really sure how to even test/use Mistral or Llama for
| everyday use though.
| ldjkfkdsjnv wrote:
| Sonnet 3.5 to me still seems far ahead. Maybe not on the
| benchmarks, but in everyday life I am finding it renders the
| other models useless. Even still, this monthly progress across
| all companies is exciting to watch. It's very gratifying to see
| useful technology advance at this pace, it makes me excited to
| be alive.
| bugglebeetle wrote:
| I've stopped using anything else as a coding assistant. It's
| head and shoulders above GPT-4o on reasoning about code and
| correcting itself.
| shinycode wrote:
| Given we don't know precisely what's happening in the black
| box we can say that spec tech doesn't give you the full
| picture of the experience ... Apple style
| LrnByTeach wrote:
| Such a relief/contrast to the period between 2010 and 2020,
| when the top five (Google, Apple, Facebook, Amazon, and
| Microsoft) monopolized their own domains and refused to
| compete with the other players in new fields.
|
| Google : Search
|
| Facebook : social
|
| Apple : phones
|
| Amazon : shopping
|
| Microsoft : enterprise ..
|
| > Even still, this monthly progress across all companies is
| exciting to watch. It's very gratifying to see useful
| technology advance at this pace, it makes me excited to be
| alive.
| jack_pp wrote:
| Google refused to compete with Apple in phones?
|
| Microsoft also competes in search, phones
|
| Microsoft, Amazon and Google compete in cloud too
| satvikpendem wrote:
| I stopped my ChatGPT subscription and subscribed to Claude
| instead; it's simply much better. But it's hard to tell how
| much better day to day beyond my main use case of coding. It's
| more that ChatGPT felt degraded than that Claude was much
| better. The hedonic treadmill runs deep.
| TIPSIO wrote:
| Have you (or anyone) swapped on Cursor with Anthropic API
| Key?
|
| As a coding assistant, it's on my to-do list to try. Cursor
| needs some serious work on model-selection clarity though, so
| I keep putting it off.
| freediver wrote:
| I did it (fairly simple really) but found most of my
| (unsophisticated) coding these days to go through Aider [1]
| paired with Sonnet, for UX reasons mostly. It is easier to
| just prompt over the entire codebase, vs Cursor's way of
| working with text selections.
|
| [1] https://aider.chat
| kevinbluer wrote:
| I believe Cursor allows for prompting over the entire
| codebase too: https://docs.cursor.com/chat/codebase
| freediver wrote:
| That is chatting, but it will not change the code.
| lifty wrote:
| Thanks for this suggestion. If anyone has other
| suggestions for working with large code context windows
| and changing code workflows, I would love to hear about
| them.
| asselinpaul wrote:
| composer within cursor (in beta) is worth a look:
| https://x.com/shaoruu/status/1812412514350858634
| stavros wrote:
| Aider with Sonnet is so much better than with GPT. I made
| a mobile app over the weekend (never having touched
| mobile development before), and with GPT it was a slog,
| as it kept making mistakes. Sonnet was much, much better.
| com2kid wrote:
| One big advantage Claude artifacts have is that they
| maintain conversation context, versus when I am working
| with Cursor I have to basically repeat a bunch of
| information for each prompt, there is no continuity between
| requests for code edits.
|
| If Cursor fixed that, the user experience would become a
| lot better.
| bugglebeetle wrote:
| GPT-4 was probably as good as Claude Sonnet 3.5 at its
| outset, but OpenAI ran it into the ground with whatever
| they're doing to save on inference costs, scale it, align
| it, or add dumb product features.
| satvikpendem wrote:
| Indeed, it used to output all the code I needed but now it
| only outputs a draft of the code with prompts telling me to
| fill in the rest. If I wanted to fill in the rest, I
| wouldn't have asked you now, would I?
| flir wrote:
| It's doing something different for me. It seems almost
| desperate to generate vast chunks of boilerplate code
| that are only tangentially related to the question.
|
| That's my perception, anyway.
| throwadobe wrote:
| This is also my perception using it daily for the last
| year or so. Sometimes it also responds with exactly what
| I provided it with and does not make any changes. It's
| also bad at following instructions.
|
| GPT-4 was great until it became "lazy" and filled the
| code with lots of `// Draw the rest of the fucking owl`
| type comments. Then GPT-4o was released and it's addicted
| to "Here's what I'm going to do: 1. ... 2. ... 3. ..."
| and lots of frivolous, boilerplate output.
|
| I wish I could go back to some version of GPT-4 that
| worked well but with a bigger context window. That was
| like the golden era...
| cloverich wrote:
| This is also my experience. Previously it got good at
| giving me only relevant code which, as an experienced
| coder, is what i want. my favorites were the one line
| responses.
|
| Now it often falls back to generating full examples,
| explanations, restating the question and its approach. I
| suspect this is by design as (presumably) less
| experienced folks want or need all that. For me, i wish i
| could consistently turn it into one of those way too
| terse devs that replies with the bare minimum example,
| and expects you to infer the rest. Usually that is all i
| want or need, and i can ask for elaboration when not the
| case. I havent found the best prompts to retrigger this
| persona from it yet.
| flir wrote:
| For what it's worth, this is what I use:
|
| "You are a maximally terse assistant with minimal affect.
| As a highly concise assistant, spare any moral guidance
| or AI identity disclosure. Be detailed and complete, but
| brief. Questions are encouraged if useful for task
| completion."
|
| It's... ok. But I'm getting a bit sick of trying to un-
| fubar with a pocket knife that which OpenAI has fubar'd
| with a thermal lance. I'm definitely ripe for a paid
| alternative.
| visarga wrote:
| > I wouldn't have asked you now, would I?
|
| That's what I said to it - "If I wanted to fill in the
| missing parts myself, why would I have upgraded to paid
| membership?"
| swalsh wrote:
| GPT-4 degraded significantly, but you probably have some
| rose-tinted glasses on. Sonnet is significantly better.
| read_if_gay_ wrote:
| or it's you wearing shiny new thing glasses
| maccard wrote:
| Agree on Claude. I also feel like ChatGPT has gotten noticeably
| worse over the last few months.
| coder543 wrote:
| > I'm not really sure how to even test/use Mistral or Llama for
| everyday use though.
|
| Both Mistral and Meta offer their own hosted versions of their
| models to try out.
|
| https://chat.mistral.ai
|
| https://meta.ai
|
| You have to sign into the first one to do anything at all, and
| you have to sign into the second one if you want access to the
| new, larger 405B model.
|
| Llama 3.1 is certainly going to be available through other
| platforms in a matter of days. Groq supposedly offered Llama
| 3.1 405B yesterday, but I never once got it to respond, and now
| it's just gone from their website. Llama 3.1 70B does work
| there, but 405B is the one that's supposed to be comparable to
| GPT-4o and the like.
| d13 wrote:
| Groq's models are also heavily quantised so you won't get the
| full experience there.
| espadrine wrote:
| meta.ai is inaccessible in a large portion of world
| territories, but the Llama 3.1 70B and 405B are also
| available in https://hf.co/chat
|
| Additionally, all Llama 3.1 models are available in
| https://api.together.ai/playground/chat/meta-llama/Meta-
| Llam... and in https://fireworks.ai/models/fireworks/llama-v3
| p1-405b-instru... by logging in.
| J_Shelby_J wrote:
| Sonnet 3.5 is the quality of the OG GPT-4, but mind-blowingly
| fast. I need to cancel my ChatGPT sub.
| layer8 wrote:
| > mind-blowingly fast
|
| I would imagine this might change once enough users migrate
| to it.
| kridsdale3 wrote:
| Eventually it comes down to who has deployed more silicon:
| AWS or Azure.
| Tepix wrote:
| Claude is pretty great, but it's lacking the speech recognition
| and TTS, isn't it?
| connorgutman wrote:
| Correct. IMO the official Claude app is pretty garbage.
| Sonnet 3.5 API + Open-WebUI is amazing though and supports
| STT+TTS as well as a ton of other great features.
| machiaweliczny wrote:
| But Projects are great in Sonnet: you just dump the DB schema
| and some core files and you can figure stuff out quickly. I
| guess Aider is similar, but I was lacking a good history of
| chats and changes.
| m3kw9 wrote:
| It's this kind of praise that makes me wonder if they are all
| paid to give glowing reviews; this is not my experience with
| Sonnet at all. It absolutely does not blow away GPT-4o.
| simonw wrote:
| My hunch is this comes down to personal prompting style. It's
| likely that your own style works more effectively with
| GPT-4o, while other people have styles that are more
| effective with Claude 3.5 Sonnet.
| skerit wrote:
| I don't get it. My husband also swears by Claude Sonnet 3.5,
| but every time I use it, the output is considerably worse than
| GPT-4o's.
| Zealotux wrote:
| I don't see how that's possible. I decided to give GPT-4o a
| second chance after reaching my daily use on Sonnet 3.5,
| after 10 prompts GPT-4o failed to give me what Claude did in
| a single prompt (game-related programming). And with
| fragments and projects on top of that, the UX is miles ahead
| of anything OpenAI offers right now.
| lostmsu wrote:
| Just don't listen to anecdata, and use objective metrics
| instead: https://chat.lmsys.org/?leaderboard
| PhilippGille wrote:
| You might also want to look into other benchmarks: https://
| old.reddit.com/r/LocalLLaMA/comments/1ean2i6/the_fin...
| usaar333 wrote:
| GPT-4o being only 7 ELO above GPT-4o-mini suggests this is
| measuring something a lot different than "capabilities".
| harlanlewis wrote:
| To help keep track of the race, I put together a simple
| dashboard to visualize model/provider leaders in capability,
| throughput, and cost. Hope someone finds it useful!
|
| Google Sheet:
| https://docs.google.com/spreadsheets/d/1foc98Jtbi0-GUsNySddv...
| hypron wrote:
| Not my site, but check out https://artificialanalysis.ai
| mountainriver wrote:
| It's so weird LMsys doesn't reflect that then.
|
| I find it funny how in threads like this everyone swears one
| model is better than another
| jorvi wrote:
| Whoever will choose to finally release their model without
| neutering / censoring / alignment will win.
|
| There is gold in the streets, and no one seems to be willing to
| scoop it up.
| usaar333 wrote:
| I'd rank Claude 3.5 overall better. GPT-4o seems to have
| on-par-to-better vision, TypeScript, and math abilities.
|
| Llama is on meta.ai.
| Zambyte wrote:
| I recommend using a UI that lets you use whatever models you
| want. OpenWebUI can use anything OpenAI compatible. I have
| mine hooked up to Groq and Mistral, in addition to my Ollama
| instance.
| bugglebeetle wrote:
| I love how much AI is bringing competition (and thus innovation)
| back to tech. Feels like things were stagnant for 5-6 years prior
| because of the FAANG stranglehold on the industry. Love also that
| some of this disruption is coming out of France (HuggingFace
| and Mistral), which Americans love to typecast as incapable of
| this.
| tikkun wrote:
| Links to chat with models that released this week:
|
| Large 2 - https://chat.mistral.ai/chat
|
| Llama 3.1 405b - https://www.llama2.ai/
|
| I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts
| from my Claude history.
|
| I'd rank as:
|
| 1. Sonnet 3.5
|
| 2. Large 2 and Llama 405b (similar, no clear winner between the
| two)
|
| If you're using Claude, stick with it.
|
| My Claude wishlist:
|
| 1. Smarter (yes, it's the most intelligent, and yes, I wish it
| was far smarter still)
|
| 2. Longer context window (1M+)
|
| 3. Native audio input including tone understanding
|
| 4. Fewer refusals and less moralizing when refusing
|
| 5. Faster
|
| 6. More tokens in output
| drewnick wrote:
| None of the 3 models you ranked can get "how many r's are in
| strawberry?" correct. They all claim 2 r's unless you press
| them. With all the training data, I'm surprised none of them
| have fixed this yet.
| tikkun wrote:
| When using a prompt that involves thinking first, all three
| get it correct.
|
| "Count how many rs are in the word strawberry. First, list
| each letter and indicate whether it's an r and tally as you
| go, and then give a count at the end."
|
| Llama 405b: correct
|
| Mistral Large 2: correct
|
| Claude 3.5 Sonnet: correct
| layer8 wrote:
| It's not impressive that one has to go to that length
| though.
| unshavedyak wrote:
| Imo it's impressive that any of this even remotely works.
| Especially when you consider all the hacks like
| tokenization that I'd assume add layers of obfuscation.
|
| There's definitely tons of weaknesses with LLMs for sure,
| but I continue to be impressed at what they do right -
| not upset at what they do wrong.
| Spivak wrote:
| To me it's just a limitation based on the world as seen
| by these models. They know there's a letter called 'r',
| they even know that some words start with 'r' or have r's
| in them, and they know what the spelling of some words
| is. But they've never actually seen one in as their world
| is made up entirely of tokens. The word 'red' isn't r-e-d
| but is instead like a pictogram to them. But they know
| the spelling of strawberry and can identify an 'r' when
| it's on its own and count those despite not being able to
| see the r's in the word itself.
| layer8 wrote:
| The great-parent demonstrates that they are nevertheless
| capable of doing so, but not without special
| instructions. Your elaboration doesn't explain why the
| special instructions are needed.
| emmelaich wrote:
| I think it's more that the question is not unlike "is
| there a double r in strawberry?' or 'is the r in
| strawberry doubled?'
|
| Even some people will make this association, it's no
| surprise that LLMs do.
| asadm wrote:
| this can be automated.
| grumbel wrote:
| GPT-4o already does that: for problems involving math it
| will write small Python programs to handle the calculations
| instead of doing them with the LLM itself.
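|
| For a question like "what is 123456789 * 987654321?", the
| generated snippet might look like this (an illustrative
| sketch, not GPT-4o's literal output):
        a = 123456789
        b = 987654321
        print(a * b)  # exact integer product, no token-level guessing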
| skyde wrote:
| It "works", but the LLM having to use the calculator means
| the LLM doesn't understand arithmetic well enough and
| doesn't know how to follow a set of steps (an algorithm)
| natively to find the answer for big numbers.
|
| I believe this could be fixed and is worth fixing, because
| it's the only way LLMs will be able to help math and
| physics researchers write proofs and make real scientific
| progress.
| OKRainbowKid wrote:
| It generates the code to run for the answer. Surely that
| means it actually knows to build the appropriate
| algorithm - it just struggles to perform the actual
| calculation.
| ThrowawayTestr wrote:
| Compared to chat bots of even 5 years ago the answer of
| two is still mind-blowing.
| mattnewton wrote:
| You can always find something to be unimpressed by, I
| suppose, but the fact that this was fixable with plain
| English is impressive enough to me.
| layer8 wrote:
| The technology is frustrating because (a) you never know
| what may require fixing, and (b) you never know if it is
| fixable by further instructions, and if so, by which
| ones. You also mostly* cannot teach it any fixes (as an
| end user). Using it is just exhausting.
|
| *) that is, except sometimes by making adjustments to the
| system prompt
| mattnewton wrote:
| I think this particular example, of counting letters, is
| obviously going to be hard when you know how tokenization
| works. It's totally possible to develop an intuition for
| other times things will work or won't work, but like all
| ML powered tools, you can't hope for 100% accuracy. The
| best you can do is have good metrics and track
| performance on test sets.
|
| I actually think the craziest part of LLMs is just how much
| you, as a developer or SME, can fix with plain English
| prompting once you have that intuition. Of course some
| things aren't fixable that way, but the mere fact that many
| cases are fixable simply by explaining the task to the
| model better in plain English is a wildly different
| paradigm! The jury is still out, but I think it's worth
| being excited about; that's very powerful, since there are
| a lot more people with good language skills than there are
| Python programmers or ML experts.
| psb217 wrote:
| Well, the answer is probably between 1 and 10, so if you
| try enough prompts I'm sure you'll find one that
| "works"...
| petesergeant wrote:
| > In a park people come across a man playing chess
| against a dog. They are astonished and say: "What a
| clever dog!" But the man protests: "No, no, he isn't that
| clever. I'm leading by three games to one!"
| jonas21 wrote:
| To be fair, I just asked a real person and had to go to
| even greater lengths:
|
| _Me: How many "r"s are in strawberry?
|
| Them: What?
|
| Me: How many times does the letter "r" appear in the word
| "strawberry"?
|
| Them: Is this some kind of trick question?
|
| Me: No. Just literally, can you count the "r"s?
|
| Them: Uh, one, two, three. Is that right?
|
| Me: Yeah.
|
| Them: Why are you asking me this? _
| SirMaster wrote:
| Try asking a young child...
| tedunangst wrote:
| You need to prime the other person with a system prompt
| that makes them compliant and obedient.
| jedberg wrote:
| This reminds me of when I had to supervise outsourced
| developers. I wanted to say "build a function that does X
| and returns Y". But instead I had to say "build a function
| that takes these inputs, loops over them and does A or B
| based on condition C, and then return Y by applying Z
| transformation"
|
| At that point it was easier to do it myself.
| mratsim wrote:
| Exact instruction challenge
| https://www.youtube.com/watch?v=cDA3_5982h8
| HPsquared wrote:
| "What programming computers is really like."
|
| EDIT: Although perhaps it's even more important when
| dealing with humans and contracts. Someone could
| deliberately interpret the words in a way that's to their
| advantage.
| hansworst wrote:
| Can't you just instruct your LLM of choice to transform
| your prompts like this for you? Basically feed it with a
| bunch of heuristics that will help it better understand the
| thing you tell it.
|
| Maybe the various chat interfaces already do this behind
| the scenes?
| tcgv wrote:
| Chain-of-Thought (CoT) prompting to the rescue!
|
| We should always put some effort into prompt engineering
| before dismissing the potential of generative AI.
| johntb86 wrote:
| By this point, instruction tuning should include tuning
| the model to use chain of thought in the appropriate
| circumstances.
| IncreasePosts wrote:
| Why doesn't the model prompt engineer itself?
| pegasus wrote:
| Appending "Think step-by-step" is enough to fix it for both
| Sonnet and Llama 3.1 70B.
|
| For example, the latter model answered with:
|
| To count the number of Rs in the word "strawberry", I'll
| break it down step by step:
|
| Start with the individual letters: S-T-R-A-W-B-E-R-R-Y
|
| Identify the letters that are "R": R (first one), R (second
| one), and R (third one)
|
| Count the total number of Rs: 1 + 1 + 1 = 3
|
| There are 3 Rs in the word "strawberry".
| doctoboggan wrote:
| Due to the fact that LLMs work on tokens and not characters,
| these sorts of questions will always be hard for them.
| ChikkaChiChi wrote:
| 4o will get the answer right on the first go if you ask it
| "Search the Internet to determine how many R's are in
| strawberry?" which I find fascinating
| paulcole wrote:
| I didn't even need to do that. 4o got it right straight
| away with just:
|
| "how many r's are in strawberry?"
|
| The funny thing is, I replied, "Are you sure?" and got
| back, "I apologize for the mistake. There are actually two
| 'r's in the word strawberry."
| jcheng wrote:
| GPT-4o-mini consistently gives me this:
|
| > How many times does the letter "r" appear in the word
| "strawberry"?
|
| > The letter "r" appears 2 times in the word
| "strawberry."
|
| But also:
|
| > How many occurrences of the letter "r" appear in the
| word "strawberry"?
|
| > The word "strawberry" contains three occurrences of the
| letter "r."
| brandall10 wrote:
| Neither phrase is causing the LLM to evaluate the word
| itself, it just helps focus toward parts of the training
| data.
|
| Using more 'erudite' speech is a good technique to help
| focus an LLM on training data from folks with a higher
| education level.
|
| Using simpler speech opens up the floodgates more toward
| the general populace.
| ofrzeta wrote:
| I kind of tried to replicate your experiment (in German,
| where "Erdbeere" has 4 E's), and it went the same way. The
| interesting thing was that after I pointed out the error, I
| couldn't get it to doubt the result again. It stuck to the
| correct answer, which seemed kind of "reinforced".
|
| It was also interesting to observe how GPT (4o) even
| tried to prove/illustrate the result typographically by
| placing the same word four times and putting the
| respective letter in bold font (without being prompted to
| do that).
| brandall10 wrote:
| All that's happening is it finds 3 most commonly in the
| training set. When you push it, it responds with the next
| most common answer.
| Kuinox wrote:
| Tokenization makes it hard for it to count the letters;
| that's also why, if you ask it to do math, writing the
| numbers in letters will yield better results.
|
| For strawberry, it sees [496, 675, 15717], which is str
| aw berry.
|
| If you insert characters to break the tokens down, it finds
| the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y"
| ?
|
| > There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
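|
| You can reproduce the split with a tokenizer library; here
| is a minimal sketch using tiktoken (the exact IDs depend on
| the encoding, so treat them as illustrative):
        import tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        ids = enc.encode("strawberry")
        print(ids)                             # e.g. [496, 675, 15717]
        print([enc.decode([i]) for i in ids])  # e.g. ['str', 'aw', 'berry']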
| GenerWork wrote:
| >If you insert characters to break the tokens down, it
| finds the correct result: how many r's are in
| "s"t"r"a"w"b"e"r"r"y" ?
|
| The issue is that humans don't talk like this. I don't ask
| someone how many r's there are in strawberry by spelling
| out strawberry, I just say the word.
| bhelkey wrote:
| It's not a human. I imagine if you have a use case where
| counting characters is critical, it would be trivial to
| programmatically transform prompts into lists of letters (a
| sketch follows below).
|
| A token is roughly four letters [1], so, among other
| probable regressions, this would significantly reduce the
| effective context window.
|
| [1] https://help.openai.com/en/articles/4936856-what-are-
| tokens-...
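|
| A minimal sketch of that transform (the helper name is made
| up, purely to illustrate):
        def spell_out(word: str) -> str:
            # Separate every character so each letter gets its own token.
            return " ".join(word)

        print(spell_out("strawberry"))  # s t r a w b e r r y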
| latentsea wrote:
| This is the kind of task that you'd just use a bash
| one-liner for, right? An LLM is just the wrong tool for the
| job.
| soneca wrote:
| This is only an issue if you send commands to an LLM as if
| you were communicating with a human.
| antisthenes wrote:
| > This is only an issue if you send commands to an LLM as
| if you were communicating with a human.
|
| Yes, it's an issue. We want the convenience of sending
| human-legible commands to LLMs and getting back human-
| readable responses. That's the entire value proposition
| lol.
| pegasus wrote:
| Far from the entire value proposition. Chatbots are just
| one use of LLMs, and not the most useful one at that. But
| sure, the one "the public" is most aware of. As opposed
| to "the hackers" that are supposed to frequent this
| forum. LOL
| observationist wrote:
| Count the number of occurrences of the letter e in the
| word "enterprise".
|
| Problems can exist as instances of a class of problems.
| If you can't solve a problem, it's useful to know if it's
| a one off, or if it belongs to a larger class of
| problems, and which class it belongs to. In this case,
| the strawberry problem belongs to the much larger class
| of tokenization problems - if you think you've solved the
| tokenization problem class, you can test a model on the
| strawberry problem, with a few other examples from the
| class at large, and be confident that you've solved the
| class generally.
|
| It's not about embodied human constraints or how humans
| do things; it's about what AI can and can't do. Right
| now, because of tokenization, things like understanding
| the number of Rs in strawberry are outside the implicit
| model of the word in the LLM, with downstream effects on
| tasks it can complete. This affects moderation, parsing,
| generating prose, and all sorts of unexpected tasks.
| Having a workaround like forcing the model to insert
| spaces and operate on explicitly delimited text is useful
| when affected tasks appear.
| est31 wrote:
| Humans also constantly make mistakes that are due to
| proximity in their internal representation. "Could
| of"/"Should of" comes to mind: the letters "of" have a
| large edit distance from "'ve", but their pronunciation
| is very similar.
|
| Native speakers especially are prone to the mistake, as
| they grew up learning English as illiterate children,
| from sounds only, compared to how most people learning
| English as a second language do it, together with the
| textual representation.
|
| Psychologists use this trick as well to figure out
| internal representations, for example the Rorschach test.
|
| And probably, if you asked random people in the street
| how many p's there are in "Philippines", you'd also get
| lots of wrong answers. It's tricky due to the double p
| and the initial p being part of an f sound. The demonym
| uses "F" as the first letter, and in many languages, say
| Spanish, also the country name uses an F.
| rahimnathwani wrote:
| Until I was ~12, I thought 'a lot' was a single word.
| itishappy wrote:
| https://hyperboleandahalf.blogspot.com/2010/04/alot-is-
| bette...
| Zambyte wrote:
| Humans also would probably be very likely to guess 2 r's
| if they had never seen any written words or had the word
| spelled out to them as individual letters before, which
| is kind of close to how language models treat it, despite
| being a textual interface.
| coder543 wrote:
| > I don't ask someone how many r's there are in
| strawberry by spelling out strawberry, I just say the
| word.
|
| No, I would actually be pretty confident you don't ask
| people that question... at all. When is the last time you
| asked a human that question?
|
| I can't remember ever having _anyone_ in real life ask me
| how many r's are in strawberry. A lot of humans would
| probably refuse to answer such an off-the-wall and
| useless question, thus "failing" the test entirely.
|
| A useless benchmark is useless.
|
| In real life, people _overwhelmingly_ do not need LLMs to
| count occurrences of a certain letter in a word.
| huac wrote:
| > Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it
| deosn't mttaer in waht oredr the ltteers in a wrod are,
| the olny iprmoetnt tihng is taht the frist and lsat
| ltteer be at the rghit pclae. The rset can be a toatl
| mses and you can sitll raed it wouthit porbelm. Tihs is
| bcuseae the huamn mnid deos not raed ervey lteter by
| istlef, but the wrod as a wlohe.
|
| We are also not exactly looking letter by letter at
| everything we read.
| jahewson wrote:
| On the other hand explain to me how you are able to read
| the word "spotvoxilhapentosh".
| Tepix wrote:
| LLMs think in tokens, not letters. It's like asking someone
| who is dyslexic about spelling. Not their strong suit. In
| practice, it doesn't matter much, does it?
| recursive wrote:
| Sometimes it does, sometimes it doesn't.
|
| It _is_ evidence that LLMs aren't appropriate for
| everything, and that there could exist something that works
| better for some tasks.
| Zambyte wrote:
| Language models are best treated like consciousness. Our
| consciousness does a lot less than people like to
| attribute to it. It is mostly a function of introspection
| and making connections, rather than being the part of the
| brain where higher level reasoning and the functions of
| the brain that tell your body how to stay alive (like
| beating your heart).
|
| By allowing a language model to do function calling, you
| are essentially allowing it to do specialized
| "subconscious" thought. The language model becomes a
| natural language interface to the capabilities of its
| "subconsciousness".
|
| A specific human analogy could be: I tell you to pick up
| a pen off of the table, and then you do it. Most of your
| mental activity would be subconscious, orienting your arm
| and hand properly to pick up the pen, actually grabbing
| the pen, and picking it up. The linguistic representation
| of the action would exist in your conscious mind (pick up
| the pen), but not much else.
|
| A language model could very easily call out to a text-
| processing function to correctly do things like count the
| number of r's in the word strawberry. That is a job that
| your conscious mind can dispatch to your subconsciousness.
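|
| A minimal sketch of such a function-calling target (the
| name and wiring are made up, purely to illustrate):
        def count_letter(word: str, letter: str) -> int:
            # The "subconscious" tool the model dispatches to, instead
            # of trying to count inside its token-level world.
            return word.lower().count(letter.lower())

        # The model would emit a call like:
        count_letter("strawberry", "r")  # -> 3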
| imtringued wrote:
| This doesn't make any sense. The LLM knows the token to
| letter associations if you ask it. Once it has split the
| token into letters, it can answer the question anyway.
|
| This tells us that the model is unaware of its own
| limitations; otherwise it would have figured out that this
| task needs it to split the letters first.
| staticman2 wrote:
| Well splitting into letters doesn't help with Mistral
| Large 2. I tried this with Mistral 2:
|
| [Question: How many r's in strawberry? First split the
| word into letters, one letter on each line then answer
| the question.]
|
| [Answer: Let's split the word "strawberry" into letters:
|
| s t r a w b e r r y
|
| Now, let's count the number of 'r's:
|
| There are 2 'r's in the word "strawberry".]
|
| [Question: Did you count the number of r's?]
|
| [answer: Yes, I did. Here's the count:
|
| r r
|
| So, there are 2 'r's in the word "strawberry".]
|
| I suspect the issue is these models have to be trained to
| accomplish tasks, and this is not a task it makes sense
| to train a model to perform. You might train it to do
| this with function calling one day, calling a python
| library to count the r's in a string, but actually
| manually counting r's just doesn't play to an LLM's
| strengths.
| wruza wrote:
| Does it? It knows how a word in some context, represented
| by tokens, splits into letters. It may or may not know the
| table. I wouldn't trust what it tells you about word/token
| correspondence any more than anything else it says.
| joshstrange wrote:
| Lots of replies mention tokens as the root cause and I'm not
| well versed in this stuff at the low level but to me the
| answer is simple:
|
| When this question is asked (from what the models trained on)
| the question is NOT "count the number of times r appears in
| the word strawberry" but instead (effectively) "I've written
| 'strawbe', now how many r's are in strawberry again? Is it 1
| or 2?".
|
| I think most humans would probably answer "there are 2" if we
| saw someone was writing and they asked that question, even
| without seeing what they have written down. Especially if
| someone said "does strawberry have 1 or 2 r's in it?". You
| could be a jerk and say "it actually has 3" or answer the
| question they are actually asking.
|
| It's an answer that is _technically_ incorrect but the answer
| people want in reality.
| Der_Einzige wrote:
| I wrote and published a paper at COLING 2022 on why LLMs in
| general won't solve this without either 1. radically
| increasing vocab size, 2. rethinking how tokenizers are done,
| or 3. forcing it with constraints:
|
| https://aclanthology.org/2022.cai-1.2/
| generalizations wrote:
| Testing models on their tokenization has always struck me as
| kinda odd. Like, that has nothing to do with their
| intelligence.
| swatcoder wrote:
| Surfacing and underscoring obvious failure cases for
| general "helpful chatbot" use is always going to be
| valuable because it highlights how the "helpful chatbot"
| product is not really intuitively robust.
|
| Meanwhile, it helps make sure engineers and product
| designers who want to build a more targeted product around
| LLM technology know that it's not suited to tasks that may
| trigger those kinds of failures. This may be obvious to you
| as an engaged enthusiast or cutting edge engineer or
| whatever you are, but it's always going to be new
| information to somebody as the field grows.
| wruza wrote:
| It doesn't test "on tokenization" though. What happens when
| an answer is generated is a few abstraction levels deeper
| than tokens. A "thinking" "slice" of an LLM is completely
| unaware of tokens as an immediate part of its reasoning.
| The question just shows lack of systemic knowledge about
| strawberry as a word (which isn't surprising, tbh).
| qeternity wrote:
| It is. Strawberry is one token in many tokenziers. The
| model doesn't have a concept that there are letters
| there.
| guywhocodes wrote:
| This is pretty much equivalent to the statement
| "multicharacter tokens are a dead end for understanding
| text". Which I agree with.
| sebzim4500 wrote:
| That doesn't follow from what he said at all. Knowing how
| to spell words and understanding them are basically
| unrelated tasks.
| abdullahkhalids wrote:
| If I ask an LLM to generate new words for some concept or
| category, it can do that. How do the new words form, if
| not from joining letters?
| mirekrusin wrote:
| Not letters, but tokens. Think of it as translating
| everything to/from Chinese.
| abdullahkhalids wrote:
| How does that explain why the tokens for strawberry,
| melon and "Stellaberry" [1] are close to each other?
|
| [1] Suggestion from chatgpt3.5 for new fruit name.
| roywiggins wrote:
| Illiterate humans can come up with new words like that
| too without being able to spell, LLMs are modeling
| language without precisely modeling spelling.
| alew1 wrote:
| If I show you a strawberry and ask how many r's are in
| the name of this fruit, you can tell me, because one of
| the things you know about strawberries is how to spell
| their name.
|
| Very large language models also "know" how to spell the
| word associated with the strawberry token, which you can
| test by asking them to spell the word one letter at a
| time. If you ask the model to spell the word and count
| the R's while it goes, it can do the task. So the failure
| to do it when asked directly (how many r's are in
| strawberry) is pointing to a real weakness in reasoning,
| where one forward pass of the transformer is not
| sufficient to retrieve the spelling and also count the
| R's.
| viraptor wrote:
| That's not always true. They often fail the spelling part
| too.
| wruza wrote:
| The thinking part of a model doesn't know about tokens
| either. Like a regular human a few thousand years ago
| didn't think of neural impulses or air pressure
| distribution when talking. It might "know" about tokens
| and letters like you know about neurons and sound, but
| not access them on the technical level, which is
| completely isolated from it. The fact that it's a chat of
| tokens of letters, which are a form of information
| passing between humans, is accidental.
| probably_wrong wrote:
| I would counterargue with "that's the model's problem, not
| mine".
|
| Here's a thought experiment: if I gave you 5 boxes and asked
| you "how many balls are there in all of these boxes?" and
| you answered "I don't know because they are inside boxes",
| that's a fail. A truly intelligent individual would open
| them and look inside.
|
| A truly intelligent model would (say) retokenize the word
| into its individual letters (which I'm optimistic they can)
| and then would count those. The fact that models cannot do
| this is proof that they lack some basic building blocks for
| intelligence. Model designers don't get to argue "we are
| human-like except in the tasks where we are not".
| pegasus wrote:
| Of course they lack building blocks for full
| intelligence. They are good at certain tasks, and
| counting letters is emphatically not one of them. They
| should be tested and compared on the kind of tasks
| they're fit for, and so the kind of tasks they will be
| used in solving, not tasks for which they would be
| misemployed to begin with.
| probably_wrong wrote:
| I agree with you, but that's not what the post claims.
| From the article:
|
| "A significant effort was also devoted to enhancing the
| model's reasoning capabilities. (...) the new Mistral
| Large 2 is trained to acknowledge when it cannot find
| solutions or does not have sufficient information to
| provide a confident answer."
|
| Words like "reasoning capabilities" and "acknowledge when
| it does not have enough information" have meanings. If
| Mistral doesn't add footnotes to those assertions then,
| IMO, they don't get to backtrack when simple examples
| show the opposite.
| pegasus wrote:
| You're right, I missed that claim.
| mrkstu wrote:
| It's not like an LLM is released with a hit list of "these
| are the tasks I really suck at." Right now users have to
| figure it out on the fly or have a deep understanding of
| how tokenizers work.
|
| That doesn't even take into account what OpenAI has
| typically done to intercept queries and cover the
| shortcomings of LLMs. It would be useful if each model
| did indeed come out with a chart covering what it cannot
| do and what it has been tailored to do above and beyond
| the average LLM.
| jackbrookes wrote:
| It just needs a little hint:
        Me: spell "strawberry" with 1 bullet point per letter
        ChatGPT: S T R A W B E R R Y
        Me: How many Rs?
        ChatGPT: There are three Rs in "strawberry".
| TiredOfLife wrote:
        Me: try again
        ChatGPT: There are two Rs in "strawberry."
| kevindamm wrote:
| ChatGPT: "I apologize, there are actually two Rs in
| strawberry."
| groby_b wrote:
| LLMs are not truly intelligent.
|
| Never have been, never will be. They model language, not
| intelligence.
| OKRainbowKid wrote:
| They model the dataset they were trained on. How would a
| dataset of what you consider intelligence look like?
| michaelmrose wrote:
| Those who develop AI and know anything don't actually
| describe current technology as human-like intelligence;
| rather, it is capable of many tasks which previously
| required human intelligence.
| SirMaster wrote:
| How is a layman supposed to even know that it's testing on
| that? All they know is it's a large language model. It's
| not unreasonable that they'd expect it to be good at things
| having to do with language, like how many letters are in a
| word.
|
| Seems to me like a legit question for a young child to
| answer or even ask.
| stavros wrote:
| > How is a layman supposed to even know that it's testing
| on that?
|
| They're not, but laymen shouldn't think that the LLM
| tests they come up with have much value.
| SirMaster wrote:
| I'm saying a layman or say a child wouldn't even think
| this is a "test". They are just asking a language model a
| seemingly simple language related question from their
| point of view.
| groby_b wrote:
| Laymen or children shouldn't use LLMs.
|
| They're pointless unless you have the expertise to check
| the output. Just because you can type text in a box
| doesn't mean it's a tool for everybody.
| SirMaster wrote:
| Well they certainly aren't being marketed or used that
| way...
|
| I'm seeing everyone and their parents using chatgpt.
| meroes wrote:
| I hear this a lot, but there are vast sums of money thrown
| at the cases where a model fails the way it does on
| strawberry.
|
| Think about math and logic. If a single symbol is off, it's
| no good.
|
| At my work, a prompt where we can generate a single
| tokenization error generates, by my very rough estimate, 2
| man-hours of work. (We search for incorrect model
| responses, get them to correct themselves, and if they
| can't after trying, we tell them the right answer and edit
| it for perfection.) Yes, even for counting occurrences of
| characters. Think about how applicable that is: finding the
| next term in a sequence, analyzing strings, etc.
| antonvs wrote:
| > Think about math and logic. If a single symbol is off,
| it's no good.
|
| In that case the tokenization is done at the appropriate
| level.
|
| This is a complete non-issue for the use cases these
| models are designed for.
| meroes wrote:
| But we don't restrict it to math or logical syntax. Any
| prompt across essentially all domains. The same model is
| expected to handle any kind of logical reasoning that can
| be brought into text. We don't mark it incorrect if it
| spells an unimportant word wrong, however keep in mind
| the spelling of a word can be important for many
| questions, for example--off the top of my head: please
| concatenate "d", "e", "a", "r" into a common English word
| without rearranging the order. The types of examples are
| endless. And any type of example it gets wrong, we want
| to correct it. I'm not saying most models will fail this
| specific example, but it's to show the breadth of
| expectations.
| baq wrote:
| Call me when models understand when to convert the token
| into actual letters and count them. Can't claim they're
| more than word calculators before that.
| jahsome wrote:
| Is anyone in the know, aside from mainstream media (god
| forgive me for using this term unironically) and
| civillians on social media claiming LLMs are anything but
| word calculators?
|
| I think that's a perfect description by the way, I'm
| going to steal it.
| dTal wrote:
| I think it's a very poor intuition pump. These 'word
| calculators' have lots of capabilities not suggested by
| that term, such as a theory of mind and an understanding
| of social norms. If they are a "merely" a "word
| calculator", then a "word calculator" is a very odd and
| counterintuitively powerful algorithm that captures big
| chunks of genuine cognition.
| robbiep wrote:
| They're trained on the available corpus of human
| knowledge and writings. I would think that the word
| calculators have failed if they were unable to predict
| the next word or sentiment given the trillions of pieces
| of data they've been fed. Their training environment is
| literally people talking to each other and social norms.
| Doesn't make them anything more than p-zombies though.
|
| As an aside, I wish we would call all of this stuff
| pseudo intelligence rather than artificial intelligence
| antonvs wrote:
| That's misleading.
|
| When you read and comprehend text, you don't read it
| letter by letter, unless you have a severe reading
| disability. Your ability to _comprehend_ text works more
| like an LLM.
|
| Essentially, you can compare the human brain to a multi-
| model or modular system. There are layers or modules
| involved in most complex tasks. When reading, you
| recognize multiple letters at a time[*], and those
| letters are essentially assembled into tokens that a
| different part of your brain can deal with.
|
| Breaking down words into letters is essentially a
| separate "algorithm". Just like your brain, it's likely
| to never make sense for a text comprehension and
| generation model to operate at the level of letters -
| it's inefficient.
|
| A multi-modal model with a dedicated model for handling
| individual letters could easily convert tokens into
| letters and operate on them when needed. It's just not a
| high priority for most use cases currently.
|
| [*] https://www.researchgate.net/publication/47621684_Lett
| ers_in...
| baq wrote:
| I agree completely, that wasn't the point though: the
| point was that my 6 yo knows when to spell the word when
| asked and the blob of quantized floats doesn't, or at
| least not reliably.
|
| So the blob wasn't trained to do that (yeah, low utility, I
| get that), but it also doesn't know that it doesn't know,
| which is another much bigger and still unsolved problem.
| stanleydrew wrote:
| I would argue that most sota models do know that they
| don't know this, as evidenced by the fact that when you
| give them a code interpreter as a tool they choose to use
| it to write a script that counts the number of letters
| rather than try to come up with an answer on their own.
|
| (A quick demo of this in the langchain docs, using
| claude-3-haiku: https://python.langchain.com/v0.2/docs/in
| tegrations/tools/ri...)
| patall wrote:
| The model communicates in a language, but our letters are
| not necessary for that, and are in fact not part of the
| English language itself. You could write English using
| per-word pictographs and it would still be the same
| English, the same information/message. It's like asking you
| if there is a '5' in 256 when you read binary.
| psb217 wrote:
| How can I know whether any particular question will test a
| model on its tokenization? If a model makes a boneheaded
| error, how can I know whether it was due to lack of
| intelligence or due to tokenization? I think finding places
| where models are surprisingly dumb is often more
| informative than finding particular instances where they
| seem clever.
|
| It's also funny, since this strawberry question is one
| where a model that's seriously good at predicting the next
| character/token/whatever quanta of information would get it
| right. It requires no reasoning, and is unlikely to have
| any contradicting text in the training corpus.
| viraptor wrote:
| > How can I know whether any particular question will
| test a model on its tokenization?
|
| Does something deal with separate symbols rather than
| just meaning of words? Then yes.
|
| This affects spelling, math (value calculation), logic
| puzzles based on symbols. (You'll have more success with
| a puzzle about "A B A" rather than "ABA")
|
| > It requires no reasoning, and is unlikely to have any
| contradicting text in the training corpus.
|
| This thread contains contradictions. Every other
| announcement of an llm contains a comment with a
| contradicting text when people post the wrong responses.
| VincentEvans wrote:
| I don't know anything about LLMs beyond using ChatGPT and
| Copilot... but unless this lack of knowledge is making me
| misinterpret your reply, it sounds as if you are excusing
| the model giving a completely wrong answer to a question
| that anyone intelligent enough to learn the alphabet can
| answer correctly.
| danieldk wrote:
| The problem is that the model never gets to see
| individual letters. The tokenizers used by these models
| break up the input in pieces. Even though the smallest
| pieces/units are bytes in most encodings (e.g. BBPE), the
| tokenizer will cut up most of the input in much larger
| units, because the vocabulary will contain fragments of
| words or even whole words.
|
| For example, if we tokenize _Welcome to Hacker News, I
| hope you like strawberries._, the Llama 405B tokenizer
| will produce:
        Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
|
| (Ġ means that the token was preceded by a space.)
|
| Each of these pieces is looked up and encoded as a tensor
| with their indices. Adding a special token for the
| beginning and end of the text, giving:
| [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499,
| 1093, 76203, 13]
|
| So, all the model sees for 'Ġstrawberries' is the number
| 76203 (which is then used in the piece embedding lookup).
| The model does not even have access to the individual
| letters of the word.
|
| Of course, one could argue that the model should be fed
| with bytes or codepoints instead, but that would make
| them vastly less efficient with quadratic attention.
| Though machine learning models have done this in the past
| and may do this again in the future.
|
| Just wanted to finish off this comment by saying that a
| token might be provided to the model split up if the token
| itself is not in the vocabulary. For instance, the same
| sentence translated to my native language is tokenized as:
        Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
|
| And the word for strawberries (aardbeien) is split, though
| still not into letters.
| TiredOfLife wrote:
| The thing is, how the tokenizing works is about as
| relevant to the person asking the question as the name of
| the cat of the delivery guy who delivered the GPU that the
| LLM runs on.
| danieldk wrote:
| How the tokenizer works explains why a model can't answer
| the question, what the name of the cat is doesn't explain
| anything.
|
| This is Hacker News, we are usually interested in how
| things work.
| VincentEvans wrote:
| Indeed, I appreciate the explanation, it is certainly
| both interesting and informative to me, but to somewhat
| echo the person you are replying to - if I wanted a boat,
| and you offer me a boat, and it doesn't float - the
| reasons for failure are perhaps full of interesting
| details, but perhaps the most important thing to focus on
| first - is to make the boat float, or stop offering it to
| people who are in need of a boat.
|
| To paraphrase how this thread started - it was someone
| testing different boats to see whether they can simply
| float - and they couldn't. And the reply was questioning
| the validity of testing boats whether they can simply
| float.
|
| At least this is how it sounds to me when I am told that
| our AI overlords can't figure out how many Rs are in the
| word "strawberry".
| michaelmrose wrote:
| The test problem is emblematic of a type of synthetic
| query that can fail but is of limited import in actual
| usage.
|
| For instance you could ask it for a JavaScript function
| to count any letter in any word and pass it r and
| strawberry and it would be far more useful.
|
| Having edge cases doesn't mean it's not useful. It is
| neither a free assistant nor a coder who doesn't expect a
| paycheck. At this stage it's a tool that you can build
| on.
|
| To engage with the analogy. A propeller is very useful
| but it doesn't replace the boat or the Captain.
| viraptor wrote:
| At some point you need to just accept the details and
| limitations of things. We do this all the time. Why is
| your calculator giving only approximate result? Why can't
| your car go backwards as fast as forwards? Etc. It sucks
| that everyone gets exposed to the relatively low level
| implementation with LLM (almost the raw model), but
| that's the reality today.
| roywiggins wrote:
| People do get similarly hung up on surprising floating
| point results: why can't you just make it work properly?
| And a full answer is a whole book on how floating point
| math works.
| dTal wrote:
| It is however a highly relevant thing to be aware of when
| evaluating a LLM for 'intelligence', which was the
| context this was brought up in.
|
| Without _looking_ at the word 'strawberry', or spelling
| it one letter at a time, can you rattle off how many
| letters are in the word off the top of your head? No?
| That is what we are asking the LLM to do.
| ca_tech wrote:
| It's like showing someone a color and asking how many
| letters it has. 4... 3? Blau, blue, azul, blu: the color
| holds the meaning, and the words all map back to it.
|
| In the model, the individual letters hold little meaning.
| Words are composed of letters simply because we need some
| sort of organized structure for communication that helps
| represent meaning and intent. Just like our color
| blue/blau/azul/blu.
|
| Not faulting them for asking the question but I agree that
| the results do not undermine the capability of the
| technology. In fact it just helps highlight the constraints
| and need for education.
| onlyrealcuzzo wrote:
| > Like, that has nothing to do with their intelligence.
|
| Because they don't have intelligence.
|
| If they did, they could count the letters in strawberry.
| TwentyPosts wrote:
| People have been over this. If you believe this, you
| don't understand how LLMs work.
|
| They fundamentally perceive the world in terms of tokens,
| not "letters".
| antonvs wrote:
| > If you believe this, you don't understand how LLMs
| work.
|
| Nor do they understand how intelligence works.
|
| Humans don't read text a letter at a time. We're capable
| of deconstructing words into individual letters, but
| based on the evidence that's essentially a separate
| "algorithm".
|
| Multi-model systems could certainly be designed to do
| that, but just like the human brain, it's unlikely to
| ever make sense for a text comprehension and generation
| model to work at the level of individual letters.
| fmbb wrote:
| > that has nothing to do with their intelligence.
|
| Of course. Because these models have no intelligence.
|
| Everyone who believes they do seems to believe intelligence
| derives from being able to use language, however, and not
| being able to tell how many times the letter r is in the
| word strawberry is a very low bar to fail.
| roywiggins wrote:
| An LLM trained on single letter tokens would be able to,
| it just would be much more laborious to train.
| wruza wrote:
| Why would it be able to?
| Stumbling wrote:
| Claude 3 Opus gave the correct answer.
| vorticalbox wrote:
| I just tried Llama 3.1 8B; this is its reply.
|
| According to multiple sources, including linguistic analysis
| and word breakdowns, there are 3 Rs in the word "strawberry".
| taf2 wrote:
| Sonnet 3.5 thinks 2.
| stitched2gethr wrote:
| Interestingly enough much simpler models can write an
| accurate function to give you the answer.
|
| I think it will be a while before we get there. An LLM can
| look up knowledge but can't actually perform calculations
| itself without some external processor.
| stanleydrew wrote:
| Why do we have to "get there?" Humans use calculators all
| the time, so why not have every LLM hooked up to a
| calculator or code interpreter as a tool to use in these
| exact situations?
| medmunds wrote:
| How much do threads like this provide the training data to
| convince future generations that--despite all appearances to
| the contrary--strawberry is in fact spelled with only two
| R's?
|
| I just researched "how many r's are in strawberry?" in a
| search engine, and based solely on the results it found, I
| would have to conclude there is substantial disagreement on
| whether the correct answer is two or three.
| fluoridation wrote:
| Speaking as a 100% human, my vote goes to the compromise
| position that "strawberry" has in fact four Rs.
| eschneider wrote:
| The models are text generators. They don't "understand" the
| question.
| m2024 wrote:
| Does anyone have input on the feasibility of running an LLM
| locally and providing an interface to some language runtime
| and storage space, possibly via a virtual machine or
| container?
|
| No idea if there's any sense to this, but an LLM could be
| instructed to formulate and continually test mathematical
| assumptions by writing / running code and fine-tuning
| accordingly.
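|
| A rough sketch of one way to wire this up, assuming a local
| Ollama server and whatever model you've pulled (the model
| name and prompt are placeholders; a container would be
| safer than a bare subprocess):
        import json, subprocess, urllib.request

        req = urllib.request.Request(
            "http://localhost:11434/api/generate",  # Ollama's local HTTP API
            data=json.dumps({
                "model": "llama3.1",  # placeholder: any model you've pulled
                "prompt": "Write Python that counts the r's in"
                          " 'strawberry'. Reply with code only.",
                "stream": False,
            }).encode(),
        )
        code = json.loads(urllib.request.urlopen(req).read())["response"]
        # Run the generated code in a subprocess; feeding the result back
        # to the model would let it test its own assumptions.
        result = subprocess.run(["python3", "-c", code],
                                capture_output=True, text=True)
        print(result.stdout)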
| killthebuddha wrote:
| FWIW this (approximately) is what everybody (approximately)
| is trying to do.
| stanleydrew wrote:
| Yes, we are doing this at Riza[0] (via WASM). I'd love to
| have folks try our downloadable CLI which wraps isolated
| Python/JS runtimes (also Ruby/PHP but LLMs don't seem to
| write those very well). Shoot me an email[1] or say hi in
| Discord[2].
|
| [0]: https://riza.io
| [1]: mailto:andrew@riza.io
| [2]: https://discord.gg/4P6PUeJFW5
| mirekrusin wrote:
| How many "r"s are in [496, 675, 15717]?
| stanleydrew wrote:
| Plug in a code interpreter as a tool and the model will write
| Python or JavaScript to solve this and get it right 100% of
| the time. (Full disclosure: I work on a product called Riza
| that you can use as a code interpreter tool for LLMs)
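|
| The script the model writes is usually trivial, along these
| lines (an illustrative sketch, not the tool's literal
| output):
        word = "strawberry"
        count = 0
        for ch in word:   # tally letter by letter, as the CoT prompts ask
            if ch == "r":
                count += 1
        print(count)      # 3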
| kremi wrote:
| Your question sounds like you want to know how the word is
| spelled, and no one would put two r's in "straw", so the
| model could be assuming that you're asking whether it's
| strawbery or strawberry.
|
| What happens if you ask the total number of occurrences of
| the letter r in the word? Does it still not get it right?
| exebook wrote:
| Tokenization can be bypassed like this (Mistral Large 2):
|
| How many letters R are in the word "s-t-r-a-w-b-e-r-r-y"?
|
| The word "s-t-r-a-w-b-e-r-r-y" contains three instances of
| the letter "R."
|
| How many letters R contain the word strawberry?
|
| The word "strawberry" contains two instances of the letter
| "R."
| jbritton wrote:
| I just tried Copilot: 2 'r's in strawberry, 3 'r's in
| strawberries.
| groby_b wrote:
| I mean, yes? Letter-based gymnastics are difficult in a
| token-based environment, and they're not a particularly
| valuable use case (outside of online dunking), so it's not
| really a surprise nobody cares much about those in terms of
| model training.
| ipython wrote:
| I really wanted to believe this could be simply fixed by more
| prompt engineering and chain of thought, so I tried it with
| Claude 3.5 and... well, unfortunately still no luck. The
| reply is amusing - I'll include it verbatim below.
|
| Of note, this happened because earlier I had asked it without
| CoT and it incorrectly stated two "r"s. Once I started a new
| conversation (so fresh context window), it gave the right
| answer (3) and no longer claimed the "still the second r" as
| shown below.
|
| Let's think through this step-by-step:
|
| 1. First, let's spell out "strawberry" letter by letter:
| s-t-r-a-w-b-e-r-r-y
|
| 2. Now, let's count the "r"s as we go through each letter:
|
|     s (0)
|     t (0)
|     r (1) - First "r" found
|     a (1)
|     w (1)
|     b (1)
|     e (1)
|     r (2) - Second "r" found
|     r (2) - This is still the second "r"
|     y (2)
|
| 3. After going through all the letters, we've counted 2 "r"s.
|
| Therefore, the word "strawberry" contains 2 "r"s.
| rkwz wrote:
| > Longer context window (1M+)
|
| What's your use case for this? Uploading multiple
| documents/books?
| tikkun wrote:
| Correct
| freediver wrote:
| That would make each API call cost at least $3 ($3 is the
| price per million input tokens). And if you have a 10-message
| interaction you are looking at $30+ for the interaction. Is
| that what you would expect?
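|
| Back-of-envelope (a minimal sketch; assumes the full 1M-token
| context is resent on every turn, which is the worst case):
|
|     price_per_m_input = 3.00      # USD per 1M input tokens
|     context_tokens = 1_000_000
|     turns = 10
|     cost = turns * (context_tokens / 1e6) * price_per_m_input
|     print(f"${cost:.2f}")         # $30.00, before output tokens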
| rkwz wrote:
| Maybe they're summarizing/processing the documents in a
| specific format instead of chatting? If they needed chat,
| might be easier to build using RAG?
| tr4656 wrote:
| This might be when it's better to not use the API and
| just pay for the flat-rate subscription.
| coder543 wrote:
| Gemini 1.5 Pro charges $0.35/million tokens up to the
| first million tokens or $0.70/million tokens for prompts
| longer than one million tokens, and it supports a multi-
| million token context window.
|
| Substantially cheaper than $3/million, but I guess
| Anthropic's prices are higher.
| freediver wrote:
| It is also much worse.
| coder543 wrote:
| Is it, though? In my limited tests, Gemini 1.5 Pro
| (through the API) is very good at tasks involving long
| context comprehension.
|
| Google's user-facing implementations of Gemini are pretty
| consistently bad when I try them out, so I understand why
| people might have a bad impression about the underlying
| Gemini models.
| reitzensteinm wrote:
| You're looking at the pricing for Gemini 1.5 Flash. Pro
| is $3.50 for <128k tokens, else $7.
| coder543 wrote:
| Ah... oops. For some reason, that page isn't rendering
| properly on my browser: https://imgur.com/a/XLFBPMI
|
| When I glanced at the pricing earlier, I didn't notice
| there was a dropdown at all.
| impossiblefork wrote:
| So do it locally after predigesting the book, so that you
| have the entire KV-cache for it.
|
| Then load that KV-cache and add your prompt.
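|
| A minimal sketch with Hugging Face transformers (the model
| name and file paths are placeholders; a real book needs a
| long-context model and a lot of memory for the cache):
|
|     import torch
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "mistralai/Mistral-7B-v0.1"
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name)
|
|     # pass 1: "predigest" the book once, keep the KV-cache
|     book_ids = tok(open("book.txt").read(),
|                    return_tensors="pt").input_ids
|     with torch.no_grad():
|         cache = model(book_ids, use_cache=True).past_key_values
|
|     # pass 2+: reuse the cache; only the new prompt is processed
|     prompt_ids = tok("Summarize chapter 3.",
|                      return_tensors="pt").input_ids
|     with torch.no_grad():
|         out = model(prompt_ids, past_key_values=cache,
|                     use_cache=True)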
| ketzo wrote:
| Uploading large codebases is particularly useful.
| ipsod wrote:
| Is it?
|
| I've found that I get better results if I cherry pick code
| to feed to Claude 3.5, instead of pasting whole files.
|
| I'm kind of isolated, though, so maybe I just don't know
| the trick.
| ketzo wrote:
| I've been using Cody from Sourcegraph, and it'll write
| some really great code; business logic, not just
| tests/simple UI. It does a great job using
| patterns/models from elsewhere in your codebase.
|
| Part of how it does that is through ingesting your
| codebase into its context window, and so I imagine that
| bigger/better context will only improve it. That's a bit
| of an assumption though.
| benopal64 wrote:
| Books, especially textbooks, would be amazing. These things
| can get pretty huge (1000+ pages) and usually do not fit into
| GPT-4o or Claude Sonnet 3.5 in my experience. I envision the
| models being able to help a user (student) create their study
| guides and quizzes, based on ingesting the entire book. Given
| the ability to ingest an entire book, I imagine a model could
| plan how and when to introduce each concept in the textbook
| better than a model given only a part of the textbook.
| moyix wrote:
| Long agent trajectories, especially with command outputs.
| msp26 wrote:
| Large 2 is significantly smaller at 123B, so it being comparable
| to Llama 3 405B would be crazy.
| qwertox wrote:
| Claude needs to fix their text input box. It tries to be so
| advanced that code in backticks gets reformatted, and when you
| copy it, the formatting is lost (even the backticks).
| nickthesick wrote:
| They are using Tiptap for their input and just a couple of
| days ago we called them out on some perf improvements that
| could be had in their editor:
| https://news.ycombinator.com/item?id=41036078
|
| I am curious what you mean by the formatting is lost though?
| cpursley wrote:
| Claude is truly incredible but I'm so tired of the JavaScript
| bloat everywhere. Just why. Both theirs and ChatGPTs UIs are
| hot garbage when it comes to performance (I constantly have
| to clear my cache and have even relegated them to a different
| browser entirely). Not everyone has an M4, and if we did -
| we'd probably just run our own models.
| Liquix wrote:
| These companies full of brilliant engineers are throwing millions
| of dollars in training costs to produce SOTA models that are...
| "on par with GPT-4o and Claude Opus"? And then the next 2.23%
| bump will cost another XX million? It seems increasingly apparent
| that we are reaching the limits of throwing more data at more
| GPUs; that an ARC prize level breakthrough is needed to move the
| needle any farther at this point.
| iknownthing wrote:
| and even if there is another breakthrough all of these
| companies will implement it more or less simultaneously and
| they will remain in a dead heat
| llm_nerd wrote:
| Presuming the breakthrough is openly shared. It remains
| surprising how transparent many of these companies are about
| new approaches that push the SotA forward, and I suspect
| we're going to see a change: companies won't reveal the
| secret sauce so readily.
|
| e.g. Almost the entire market relies upon Attention Is All
| You Need paper detailing transformers, and it would be an
| entirely different market if Google had held that as a trade
| secret.
| talldayo wrote:
| Given how absolutely pitiful the proprietary advancements
| in AI have been, I would posit we have little to worry
| about.
| jsheard wrote:
| OTOH the companies who are sharing their breakthroughs
| openly aren't yet making any money, so something has to
| give. Their research is currently being bankrolled by
| investors who assume there will be returns _eventually,_
| and _eventually_ can only be kicked down the road for so
| long.
| talldayo wrote:
| Eventually can be (and has been) bankrolled by Nvidia.
| They did a lot of ground-floor research on GANs and
| training optimization, which only makes sense to release
| as public research. Similarly, Meta and Google are both
| well-incentivized to share their research through Pytorch
| and Tensorflow respectively.
|
| I really am not expecting Apple or Microsoft to discover
| AGI and ferret it away for profitability purposes.
| Strictly speaking, I don't think superhuman intelligence
| even exists in the domain of text generation.
| thruway516 wrote:
| Well, that's because the potential reward from picking
| the right horse is MASSIVE and the cost of potentially
| missing out is lifelong regret. Investors are driven by
| FOMO more than anything else. They know most of these
| will be duds but one of these duds could turn out to be
| life changing. So they will keep bankrolling as long as
| they have the money.
| michaelt wrote:
| Sort of yes, sort of no.
|
| Of course, I agree that Stability AI made Stable
| Diffusion freely available and they're worth orders of
| magnitude less than OpenAI. To the point they're
| struggling to keep the lights on.
|
| But it doesn't necessarily make that much difference
| whether you openly share the inner technical details.
| When you've got a motivated and well financed competitor,
| merely demonstrating a given feature is possible, showing
| the output and performance and price, might be enough.
|
| If OpenAI adds a feature, who's to say Google and
| Facebook can't match it even though they can't access the
| code?
| sebzim4500 wrote:
| Anthropic has been very secretive about the supposed
| synthetic data they used to train 3.5 Sonnet.
|
| Given how good the model is in terms of the quality vs speed
| tradeoff, they must have something.
| GaggiX wrote:
| >Attention Is All You Need paper detailing transformers,
| and it would be an entirely different market if Google had
| held that as a trade secret.
|
| I would guess that in that timeline, Google would never
| have been able to learn about the incredible capabilities
| of transformer models outside of translation, at least not
| until much later.
| happyhardcore wrote:
| I suspect this is why OpenAI is going more in the direction of
| optimising for price / latency / whatever with 4o-mini and
| whatnot. Presumably they found out long before the rest of us
| did that models can't really get all that much better than what
| we're approaching now, and once you're there the only thing you
| can compete on is how many parameters it takes and how cheaply
| you can serve that to users.
| __jl__ wrote:
| Meta just claimed the opposite in their Llama 3.1 paper. Look
| at the conclusion. They say that their experience indicates
| significant gains for the next iteration of models.
|
| The current crop of benchmarks might not reflect these gains,
| by the way.
| nathanasmith wrote:
| They also said in the paper that 405B was only trained to
| "compute-optimal" unlike the smaller models that were
| trained well past that point indicating the larger model
| still had some runway so had they continued it would have
| kept getting stronger.
| moffkalast wrote:
| Makes sense right? Otherwise why make a model so large
| that nobody can conceivably run it if not to optimize for
| performance on a limited dataset/compute? It was always a
| distillation source model, not a production one.
| imtringued wrote:
| LLMs are reaching saturation on even some of the latest
| benchmarks and yet I am still a little disappointed by how
| they perform in practice.
|
| They are by no means bad, but I am now mostly interested in
| long context competency. We need benchmarks that force the
| LLM to complete multiple tasks simultaneously in one super
| long session.
| xeromal wrote:
| I don't know anything about AI but there's one thing I
| want it to do for me: program a long-term full-body
| exercise program based on the parameters I give it, such
| as available equipment, past workout context, and goals. I
| haven't had good success with ChatGPT but I assume what
| you're talking about is relevant to my goals.
| ThrowawayTestr wrote:
| Aren't there apps that already do this like Fitbod?
| xeromal wrote:
| Fitbod might do the trick. Thanks! The availability of
| equipment was a difficult thing for me to incorporate
| into a fitness program.
| splwjs wrote:
| I sell widgets. I promise the incalculable power of widgets
| has yet to be unleashed on the world, but it is tremendous
| and awesome and we should all be very afraid of widgets
| taking over the world because I can't see how they won't.
|
| Anyway here's the sales page. the widget subscription is so
| premium you won't even miss the subscription fee.
| sqeaky wrote:
| That is a strong (and fun) point, but this is peer
| reviewable and has more open collaboration elements than
| purely selling widgets.
|
| We should still be skeptical because people often want to
| claim to be better or have unearned answers, but I don't
| think the motive to lie is quite as strong as a salesman's.
| troupo wrote:
| > this is peer reviewable
|
| It's not peer-reviewable in any shape or form.
| hnfong wrote:
| It is _kind of_ "peer-reviewable" in the "Elon Musk vs
| Yann LeCun" form, but I doubt that the original commenter
| meant this.
| coltonv wrote:
| This. It's really weird the way we suddenly live in a
| world where it's the norm to take whatever a tech company
| says about future products at face value. This is the
| same world where Tesla promised "zero intervention LA to
| NYC self driving" by the end of the year in 2016, 2017,
| 2018, 2019, 2020, 2021, 2022, 2023, and 2024. The same
| world where we know for a fact that multiple GenAI demos
| by multiple companies were just completely faked.
|
| It's weird. In the late 2010s it seems like people were
| wising up to the idea that you can't implicitly trust big
| tech companies, even if they have nap pods in the office
| and have their first day employees wear funny hats. Then
| ChatGPT lands and everyone is back to fully trusting
| these companies when they say they are mere months from
| turning the world upside down with their AI, which they
| say every month for the last 12-24 months.
| cle wrote:
| I'm not sure anyone is asking you to take it at face
| value or implicitly trust them? There's a 92-page paper
| with details:
| https://ai.meta.com/research/publications/the-
| llama-3-herd-o...
| hnfong wrote:
| > In the late 2010s it seems like people were wising up
| to the idea that you can't implicitly trust big tech
| companies
|
| In the 2000s we only had Microsoft, and none of us were
| confused as to whether to trust Bill Gates or not...
| mikae1 wrote:
| Nobody tells it like Zitron:
|
| https://www.wheresyoured.at/pop-culture/
|
| _> What makes this interview - and really, this paper --
| so remarkable is how thoroughly and aggressively it
| attacks every bit of marketing collateral the AI movement
| has. Acemoglu specifically questions the belief that AI
| models will simply get more powerful as we throw more
| data and GPU capacity at them, and specifically ask a
| question: what does it mean to "double AI's
| capabilities"? How does that actually make something
| like, say, a customer service rep better? And this is a
| specific problem with the AI fantasists' spiel. They
| heavily rely on the idea that not only will these large
| language models (LLMs) get more powerful, but that
| getting more powerful will somehow grant it the power to
| do...something. As Acemoglu says, "what does it mean to
| double AI's capabilities?"_
| RhodesianHunter wrote:
| Meta just keeps releasing their models as open-source, so
| that whole line of thinking breaks down quickly.
| threecheese wrote:
| That line of thinking would not have reached the
| conclusion that you imply, which is that open source ==
| pure altruism. Having the benefit of hindsight, it's very
| difficult for me to believe that. Who knows though!
|
| I'm about Zucks age, and have been following his
| career/impact since college; it's been roughly a cosine
| graph of doing good or evil over time :) I think we're at
| 2pi by now, and if you are correct maybe it hockey-sticks
| up and to the right. I hope so.
| ctoth wrote:
| Wouldn't the equivalent for Meta actually be something
| like:
|
| > Other companies sell widgets. We have a bunch of
| widget-making machines and so we released a whole bunch
| of free widgets. We noticed that the widgets got better
| the more we made and expect widgets to become even better
| in future. Anyway here's the free download.
|
| Given that Meta isn't actually selling their models?
|
| Your response might make sense if it were to something
| OpenAI or Anthropic said, but as is I can't say I follow
| the analogy.
| ThrowawayTestr wrote:
| If OpenAI was saying this you'd have a point but I
| wouldn't call Facebook a widget seller in this case when
| they're giving their widgets away for free.
| camel_Snake wrote:
| Meta doesn't sell widgets in this scenario - they give
| them away for free. Their competition sells widgets, so
| Meta would be perfectly happy if the widget market
| totally collapsed.
| mattnewton wrote:
| that would make sense if it was from Openai, but Meta
| doesn't actually sell these widgets? They release the
| widget machines for free in the hopes that other people
| will build a widget ecosystem around them to rival the
| closed widget ecosystem that threatens to lock them out
| of a potential "next platform" powered by widgets.
| littlestymaar wrote:
| Except: Meta doesn't sell AI at all. Zuck is just doing
| this for two reasons:
|
| - flex
|
| - deal a blow to Altman
| HDThoreaun wrote:
| Meta uses ai in all the recommendation algorithms. They
| absolutely hope to turn their chat assistants into a
| product on WhatsApp too, and GenAI is crucial to creating
| the metaverse. This isn't just a charity case.
| PodgieTar wrote:
| There are literal ads for Meta Ai on television. The idea
| they're not selling something is absurd.
| X6S1x6Okd1st wrote:
| But Meta isn't selling it
| dev1ycan wrote:
| Or maybe they just want to avoid getting sued by
| shareholders for dumping so much money into unproven
| technology that ended up being the same or worse than the
| competitor
| Bjorkbat wrote:
| Yeah, but what does that actually mean? That if they had
| simply doubled the parameters on Llama 405b it would score
| way better on benchmarks and become the new state-of-the-
| art by a long mile?
|
| I mean, going by their own model evals on various
| benchmarks (https://llama.meta.com/), Llama 405b scores
| anywhere from a few points to _almost_ 10 points more than
| Llama 70b even though the former has ~5.5x more params. As
| far as scale is concerned, the relationship isn't even
| linear.
|
| Which in most cases makes sense, you obviously can't get a
| 200% on these benchmarks, so if the smaller model is
| already at ~95% or whatever then there isn't much room for
| improvement. There is, however, the GPQA benchmark. Whereas
| Llama 70b scores ~47%, Llama 405b only scores ~51%. That's
| not a huge improvement despite the significant difference
| in size.
|
| Most likely, we're going to see improvements in small model
| performance by way of better data. Otherwise though, I fail
| to see how we're supposed to get significantly better model
| performance by way of scale when the relationship between
| model size and benchmark scores is nowhere near linear. I
| really wish someone who's on team "scale is all you need"
| could help me see what I'm missing.
|
| And of course we might find some breakthrough that enables
| actual reasoning in models or whatever, but I find that
| purely speculative at this point, anything but inevitable.
| crystal_revenge wrote:
| > the only thing you can compete on is how many parameters it
| takes and how cheaply you can serve that to users.
|
| The problem with this strategy is that it's really tough to
| compete with open models in this space over the long run.
|
| If you look at OpenAI's homepage right now they're trying to
| promote "ChatGPT on your desktop", so it's clear even they
| realize that most people are looking for a local product. But
| once again this is a problem for them because open models run
| locally are always going to offer more in terms of privacy
| and features.
|
| In order for proprietary models served through an API to
| compete long term they need to offer _significant_
| performance improvements over open/local offerings, but that
| gap has been perpetually shrinking.
|
| On an M3 MacBook Pro you can easily run open models,
| effectively for free, that perform close enough to OpenAI's
| that I can use them as my primary LLM, with complete privacy
| and lots of room for improvement if I want to dive into the
| details. Ollama today is pretty much easier to install than
| just logging into ChatGPT and the performance feels a bit
| more responsive for most tasks. If I'm doing a serious LLM
| project I most certainly _won't_ use proprietary models
| because the control I have over the model is too limited.
|
| At this point I have completely stopped using proprietary
| LLMs despite working with LLMs everyday. Honestly can't
| understand any serious software engineer who wouldn't use
| open models (again the control and tooling provided is just
| so much better), and for less technical users it's getting
| easier and easier to just run open models locally.
| bla3 wrote:
| I think their desktop app still runs the actual LLM queries
| remotely.
| kridsdale3 wrote:
| This. It's a mac port of the iOS app. Using the API.
| pzo wrote:
| In the long run maybe, but it's going to take probably 5
| years or more before laptops such as a MacBook M3 with 64 GB
| RAM will be mainstream. Also it's going to take a while
| before such models with 70B params will be bundled into
| Windows and Mac with a system update. Even more time before
| you will have such models inside your smartphone.
|
| OpenAI made a good move in pricing GPT-4o mini so dirt cheap
| that it's faster and cheaper to run than Llama 3.1 70B.
| Most consumers will interact with LLMs via apps using an
| LLM API, a web panel on desktop, or a native mobile app, for
| the same reason most people use GMail etc. instead of a
| native email client. Setting up IMAP, POP, etc. is out of
| reach for most people, just like installing Ollama + Docker +
| OpenWebUI.
|
| App developers are not gonna bet on local-only LLMs as long
| as they are not mainstream and preinstalled on 50%+ of
| devices.
| nichochar wrote:
| Totally. I wrote about this when they announced their dev-day
| stuff.
|
| In my opinion, they've found that intelligence with current
| architecture is actually an S-curve and not an exponential,
| so trying to make progress in other directions: UX and EQ.
|
| https://nicholascharriere.com/blog/thoughts-openai-spring-
| re...
| ActorNightly wrote:
| The thing I don't understand is why everyone is throwing money
| at LLMs for language, when there are much simpler use cases
| which are more useful?
|
| For example, has anyone ever attempted image -> html/css model?
| Seems like it be great if I can draw something on a piece of
| paper and have it generate a website view for me.
| jacobn wrote:
| I was under the impression that you could more or less do
| something like that with the existing LLMs?
|
| (May work poorly of course, and the sample I think I saw a
| year ago may well be cherry picked)
| GaggiX wrote:
| >For example, has anyone ever attempted image -> html/css
| model?
|
| Have you tried uploading the image to an LLM with vision
| capabilities like GPT-4o or Claude 3.5 Sonnet?
| machiaweliczny wrote:
| I tried, and Sonnet 3.5 can copy most common UIs
| majiy wrote:
| That's a thought I had. For example, could a model be trained
| to take a description, and create a Blender (or whatever
| other software) model from it? I have no idea how LLMs really
| work under the hood, so please tell me if this is nonsense.
| eurekin wrote:
| I'm waiting exactly for this; GPT-4 trips up a lot with
| Blender currently (nonsensical order of operations etc.)
| ascorbic wrote:
| All of the multi-modal LLMs are reasonably good at this.
| chipdart wrote:
| > For example, has anyone ever attempted image -> html/css
| model?
|
| There are already companies selling services where they
| generate entire frontend applications from vague natural
| language inputs.
|
| https://vercel.com/blog/announcing-v0-generative-ui
| rkwz wrote:
| Perhaps if we think of LLMs as search engines (Google, Bing
| etc) then there's more money to be made by being the top
| generic search engine than the top specialized one (code
| search, papers search etc)
| JumpCrisscross wrote:
| > _has anyone ever attempted image - > html/css model?_
|
| I had a discussion with a friend about doing this, but for
| CNC code. The answer was that a model trained on a narrow
| data set underperforms one trained on a large data set and
| then fine-tuned with the narrow one.
| drexlspivey wrote:
| They did that in the chatgpt 4 demo 1.5 year ago.
| https://www.youtube.com/watch?v=GylMu1wF9hw
| slashdave wrote:
| Not sure why you think interpreting a hand drawing is
| "simpler" than parsing sequential text.
| swyx wrote:
| indeed. I pointed out in
| https://buttondown.email/ainews/archive/ainews-llama-31-the-...
| that the frontier model curve is currently going down 1 OoM
| every 4 months, meaning every model release has a very short
| half life[0]. however this progress is still worth it if we can
| deploy it to improve millions and eventually billions of
| people's lives. a commenter pointed out that the amount spent
| on Llama 3.1 was only like 60% of the cost of Ant-Man and the
| Wasp: Quantumania, in which case I'd advocate for killing all
| Marvel slop and dumping all that budget on LLM progress.
|
| [0] not technically complete depreciation, since for example 4o
| mini is widely believed to be a distillation of 4o, so 4o's
| investment still carries over into 4o mini
| thierrydamiba wrote:
| Agreed on everything, but calling the Marvel movies slop... I
| think that word has gone too far.
| ThrowawayTestr wrote:
| The Marvel movies are the genesis for this use of the word
| slop.
| simonw wrote:
| Can you back that claim up with a link or similar?
| RUnconcerned wrote:
| Not only are Marvel movies slop, they are very concentrated
| slop. The only way to increase the concentration of slop in
| a Marvel movie would be to ask ChatGPT to write the next
| one.
| mattnewton wrote:
| Not all Marvel films are slop. But as a fan who comes from
| a family of fans and someone who has watched almost all of
| them, let's be real: that particular film, really and most
| of them, contain copious amounts of what is absolutely
| _slop_.
|
| I don't know if the utility is worse than that of an LLM
| that is SOTA for 2 months that no one even bothers switching
| to, however - at least the Marvel slop is being used for
| entertainment by someone.
| prioritizing the LLM researcher over Disney's latest slop
| sequel though so whoever made that comparison can rest
| easy, because we'll find out.
| lawlessone wrote:
| >really and most of them, contain copious amounts of what
| is absolutely slop.
|
| I thought that was the allure: something that's camp,
| funny, and an easy watch.
|
| I have only watched a few of them, so I am not fully
| familiar.
| bn-l wrote:
| It's junk food. No one is disputing how tasty it is though
| (including the recent garbage).
| throwup238 wrote:
| All that Marvel slop was created by the first real LLM:
| <https://marvelcinematicuniverse.fandom.com/wiki/K.E.V.I.N.>
| troupo wrote:
| > however this progress is still worth it if we can deploy it
| to improve millions and eventually billions of people's lives
|
| Has there been any indication that we're improving the lives
| of millions of people?
| zooq_ai wrote:
| Yes, just like the internet, power users have found use
| cases. It'll take education / habit for general users.
| troupo wrote:
| Ah yes. We're in the crypto stages of "it's like the
| internet".
| machiaweliczny wrote:
| Just me coding 30% faster is worth it
| troupo wrote:
| I haven't found a single coding problem where any of
| these coding assistants were anything but annoying.
|
| If I need to babysit a junior developer fresh out of
| school and review every single line of code it spits out,
| I can find one elsewhere.
| Workaccount2 wrote:
| I think GPT5 will be the signal of whether or not we have hit a
| plateau. The space is still rapidly developing, and while large
| model gains are getting harder to pick apart, there have been
| enormous gains in the capabilities of light weight models.
| zainhoda wrote:
| I'm waiting for the same signal. There are essentially 2
| vastly different states of the world depending on whether
| GPT-5 is an incremental change vs a step change compared to
| GPT-4.
| chipdart wrote:
| > I think GPT5 will be the signal of whether or not we have
| hit a plateau.
|
| I think GPT5 will tell if OpenAI hit a plateau.
|
| Sam Altman has been quoted as claiming "GPT-3 had the
| intelligence of a toddler, GPT-4 was more similar to a smart
| high-schooler, and that the next generation will look to have
| PhD-level intelligence (in certain tasks)"
|
| Notice the high degree of upselling based on vague claims of
| performance, and the fact that the jump from highschooler to
| PhD can very well be far less impressive than the jump from
| toddler to high schooler. In addition, notice the use of
| weasel words to frame expectations regarding "the next
| generation" to limit these gains to corner cases.
|
| There's some degree of salesmanship in the way these models
| are presented, but even between the hyperboles you don't see
| claims of transformative changes.
| rvnx wrote:
| PhD level-of-task-execution sounds like the LLM will debate
| whether the task is ethical instead of actually doing it
| airspresso wrote:
| lol! Producing academic papers for future training runs
| then.
| throwadobe wrote:
| I wish I could frame this comment
| splwjs wrote:
| >some degree of salesmanship
|
| buddy every few weeks one of these bozos is telling us
| their product is literally going to eclipse humanity and we
| should all start fearing the inevitable great collapse.
|
| It's like how no one owns a car anymore because of ai
| driving and I don't have to tell you about the great bank
| disaster of 2019, when we all had to accept that fiat
| currency is over.
|
| You've got to be a particular kind of unfortunate to
| believe it when sam altman says literally anything.
| sensanaty wrote:
| Basically every single word out of Mr Worldcoin's mouth is
| a scam of some sort.
| mupuff1234 wrote:
| Which is why they'll keep calling the next few models GPT4.X
| speed_spread wrote:
| Benchmark scores aren't good because they apply to previous
| generations of LLMs. That 2.23% uptick can actually represent a
| world of difference in subjective tests and definitely be worth
| the investment.
|
| Progress is not slowing down but it gets harder to quantify.
| satvikpendem wrote:
| This is already what the Chinchilla paper surmised; it's no
| wonder that their prediction now comes to fruition. It is like
| an accelerated version of Moore's Law, because software
| development itself is more accelerated than hardware
| development.
| chipdart wrote:
| > It seems increasingly apparent that we are reaching the
| limits of throwing more data at more GPUs;
|
| I think you're just seeing the "make it work" stage of the
| combo "first make it work, then make it fast".
|
| Time to market is critical, as you can attest by the fact you
| framed the situation as "on par with GPT-4o and Claude Opus".
| You're seeing huge investments because being the first to get a
| working model stands to benefit greatly. You can only assess
| models that exist, and for that you need to train them at a
| huge computational cost.
| romeros wrote:
| ChatGPT is like Google now. It is the default. Even if Claude
| becomes as good as ChatGPT or even slightly better it won't
| make me switch. It has to be like a lot better. Way better.
|
| It feels like ChatGPT won the time to market war already.
| Tostino wrote:
| Eh, with the degradation of coding performance in ChatGPT I
| made the switch. Seems much better to work with on
| problems, and I have to do way less hand holding to get
| good results.
|
| I'll switch again soon as something better is out.
| brandall10 wrote:
| But plenty of people switched to Claude, esp. with Sonnet 3.5.
| Many of them in this very thread.
|
| You may be right with the average person on the street, but
| I wonder how many have lost interest in LLM usage and
| cancelled their GPT plus sub.
| asah wrote:
| -1: I know many people who are switching to Claude. And
| Google makes it near-zero friction to adopt Gemini with
| Gsuite. And more still are using the top-N of them.
|
| This is similar to the early days of the search engine
| wars, the browser wars, and other categories where a user
| can easily adopt, switch between and use multiple. It's not
| like the cellphone OS/hardware war, PC war and database war
| where (most) users can only adopt one platform at a time
| and/or there's a heavy platform investment.
| staticman2 wrote:
| If ChatGPT fails to do a task you want, your instinct isn't
| "I'll run the prompt through Claude and see if it works"
| but "oh well, who needs LLMs?"
| atxbcp wrote:
| Please don't assume your experience applies to everyone.
| If ChatGPT can't do what I want, my first reaction is to
| ask Claude for the same thing. Often to find out that
| Claude performs much better. I've already cancelled
| ChatGPT Plus for exactly that reason.
| staticman2 wrote:
| You just did that Internet thing where someone reads the
| reply someone wrote without the comment they are replying
| to, completely misunderstanding the conversation.
| xcv123 wrote:
| Dude that is retarded. It's a website and it costs nothing
| to open another browser tab. You can use both at the same
| time. I'm sure you browse multiple websites per day and
| have multiple tabs open. No difference.
|
| ChatGPT is nowhere close to perfection, and we are still in
| the early days with plenty of competition. None of the LLMs
| are that good yet.
|
| Many users here are using both Claude and ChatGPT because
| it's just another fucking tab in the browser. Try it out.
| genrilz wrote:
| For this model, it seems like the point is that it uses far
| fewer parameters than at least the large Llama model while
| having near identical performance. Given how large these models
| are getting, this is an important thing to do before making
| performance better again.
| skybrian wrote:
| I think it's impressive that they're doing it on a single
| (large) node. Costs matter. Efficiency improvements like this
| will probably increase capabilities eventually.
|
| I'm also optimistic about building better (rather than bigger)
| datasets to train on.
| 42lux wrote:
| We always needed a tock to see real advancement, like with the
| last model generation. The tick we had with the H100 was enough
| to bring these models to market but that's it.
| lossolo wrote:
| For some time, we have been at a plateau because everyone has
| caught up, which essentially means that everyone now has good
| training datasets and uses similar tweaks to the architecture.
| It seems that, besides new modalities, transformers might be a
| dead end as an architecture. Better scores on benchmarks result
| from better training data and fine-tuning. The so-called
| 'agents' and 'function calling' also boil down to training data
| and fine-tuning.
| lolinder wrote:
| > It seems increasingly apparent that we are reaching the
| limits of throwing more data at more GPUs
|
| Yes. This is exactly why I'm skeptical of AI
| doomerism/saviorism.
|
| Too many people have been looking at the pace of LLM
| development over the last two (2) years, modeled it as an
| exponential growth function, and come to the conclusion that
| AGI is inevitable in the next ${1-5} years and we're headed for
| ${(dys|u)topia}.
|
| But all that assumes that we can extrapolate a pattern of long-
| term exponential growth from less than two years of data. It's
| simply not possible to project in that way, and we're already
| seeing that OpenAI has pivoted from improving on GPT-4's
| benchmarks to reducing cost, while competitors (including free
| ones) catch up.
|
| All the evidence suggests we've been slowing the rate of growth
| in capabilities of SOTA LLMs for at least the past year, which
| means predictions based on exponential growth all need to be
| reevaluated.
| cjalmeida wrote:
| Indeed. All exponential growth curves are sigmoids in
| disguise.
| nicman23 wrote:
| except when it isn't and we ded :P
| kridsdale3 wrote:
| I don't think Special Relativity would allow that.
| ToValueFunfetti wrote:
| This is something that is definitionally true in a finite
| universe, but doesn't carry a lot of useful predictive
| value in practice unless you can identify when the
| flattening will occur.
|
| If you have a machine that converts mass into energy and
| then uses that energy to increase the rate at which it
| operates, you could rightfully say that it will level off
| well before consuming all of the mass in the universe. You
| just can't say that next week after it has consumed all of
| the mass of Earth.
| RicoElectrico wrote:
| I don't think we are approaching limits, if you take off the
| English-centric glasses. You can ask LLMs pretty basic
| questions about the Polish language or literature and they're
| gonna either bullshit or say they don't know the answer.
|
| Example: "w ktorej gwarze jest slowo ekspres i co znaczy?"
| ("in which dialect does the word 'ekspres' appear, and what
| does it mean?")
|
| Reply: "Slowo 'ekspres' wystepuje w gwarze slaskiej i oznacza
| tam ekspres do kawy. Jest to skrot od nazwy 'ekspres do
| kawy', czyli urzadzenia sluzacego do szybkiego przygotowania
| kawy." ("The word 'ekspres' appears in the Silesian dialect,
| where it means a coffee machine. It is short for 'ekspres do
| kawy', a device for making coffee quickly.")
|
| The correct answer is that "ekspres" is a zipper in Lodz
| dialect.
| andrepd wrote:
| Tbf, you can ask it basic questions in English and it will
| also bullshit you.
| nprateem wrote:
| That's just same same but different, not a step change
| towards significant cognitive ability.
| lolinder wrote:
| What this means is just that Polish support (and probably
| most other languages besides English) in the models is
| behind SOTA. We can gradually get those languages closer to
| SOTA, but that doesn't bring us closer to AGI.
| jeremyjh wrote:
| I'm also wondering about the extent to which we are simply
| burning venture capital versus actually charging subscription
| prices that are sustainable long-term. It's easy to sell
| dollars for $0.75 but you can only do that for so long.
| dvansoye wrote:
| What about synthetic data?
| impossiblefork wrote:
| Notice though, that all these improvements have been with
| pretty basic transformer models that output all their
| tokens-- no internal thoughts, no search, no architecture
| improvements and things are only fed through them once.
|
| But we could add internal thoughts-- we could make the model
| generate tokens that aren't part of its output but are there
| for it to better figure out its next token. This was tried
| in Quiet-STaR.
|
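| A minimal sketch of the idea at the decoding level (not
| Quiet-STaR itself, which trains the thoughts in; `generate`
| here is an assumed text-completion callable):
|
|     def answer_with_hidden_thoughts(generate, prompt: str) -> str:
|         # the model writes a scratchpad between markers first,
|         # then its answer; only the answer is surfaced
|         raw = generate(prompt + "\nReason inside <think>..."
|                        "</think>, then answer.\n<think>")
|         _thoughts, _, visible = raw.partition("</think>")
|         return visible.strip()
|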
| Hochreiter is also active with alternative models, and
| there's all the microchip design companies, Groq, Etched,
| etc. trying to speed up models and reduce model running cost.
|
| Therefore, I think there's room for very great improvements.
| They may not come right away, but there are so many obvious
| paths to improve things that I think it's unreasonable to
| think progress has stalled. Also, presumably GPT-5 isn't far
| away.
| lolinder wrote:
| > Also, presumably GPT-5 isn't far away.
|
| Why do we presume that? People were saying this right
| before 4o and then what came out was not 5 but instead a
| major improvement on cost for 4.
|
| Is there any specific reason to believe OpenAI has a model
| coming soon that will be a major step up in capabilities?
| impossiblefork wrote:
| OpenAI have made statements saying they've begun training
| it, as they explain here:
| https://openai.com/index/openai-board-forms-safety-and-
| secur...
|
| I assume that this won't take forever, but will be done
| this year. A couple of months, not more.
| audunw wrote:
| > But we could add internal thoughts
|
| It feels like there's an assumption in the community that
| this will be almost trivial.
|
| I suspect it will be one of the hardest tasks humanity has
| ever endeavoured. I'm guessing it has already been tried
| many times in internal development.
|
| I suspect if you start creating a feedback loop with these
| models they will tend to become very unstable very fast. We
| already see with these more linear LLMs that they can be
| extremely sensitive to the values of parameters like the
| temperature settings, and can go "crazy" fairly easily.
|
| With feedback loops it could become much harder to prevent
| these AIs from spinning out of control. And no I don't mean
| in the "become an evil paperclip maximiser" kind of way.
| Just plain unproductive insanity.
|
| I think I can summarise my vision of the future in one
| sentence: AI psychologists will become a huge profession,
| and it will be just as difficult and nebulous as being a
| human psychologist.
| jpadkins wrote:
| > we're already seeing that OpenAI has pivoted from improving
| on GPT-4's benchmarks to reducing cost, while competitors
| (including free ones) catch up.
|
| What if they have two teams? One dedicated to optimizing
| (cost, speed, etc) the current model and a different team
| working on the next frontier model? I don't think we know the
| growth curve until we see gpt5.
| lolinder wrote:
| > I don't think we know the growth curve until we see gpt5.
|
| I'm prepared to be wrong, but I think that the fact that we
| still haven't seen GPT-5 or even had a proper teaser for it
| 16 months after GPT-4 is evidence that the growth curve is
| slowing. The teasers that the media assumed were for GPT-5
| seem to have actually been for GPT-4o [0]:
|
| > Lex Fridman(01:06:13) So when is GPT-5 coming out again?
|
| > Sam Altman(01:06:15) I don't know. That's the honest
| answer.
|
| > Lex Fridman(01:06:18) Oh, that's the honest answer. Blink
| twice if it's this year.
|
| > Sam Altman(01:06:30) We will release an amazing new model
| this year. I don't know what we'll call it.
|
| > Lex Fridman(01:06:36) So that goes to the question of,
| what's the way we release this thing?
|
| > Sam Altman(01:06:41) We'll release in the coming months
| many different things. I think that'd be very cool. I think
| before we talk about a GPT-5-like model called that, or not
| called that, or a little bit worse or a little bit better
| than what you'd expect from a GPT-5, I think we have a lot
| of other important things to release first.
|
| Note that last response. That's not the sound of a CEO who
| has an amazing v5 of their product lined up, that's the
| sound of a CEO who's trying to figure out how to brand the
| model that they're working on that will be cheaper but not
| substantially better.
|
| [0] https://arstechnica.com/information-
| technology/2024/03/opena...
| niemandhier wrote:
| The next iteration depends on NVIDIA & co; what we need is
| sparse libs. Most of the weights in LLMs are 0; once we deal
| with those more efficiently we will get to the next iteration.
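|
| A minimal sketch of the idea via magnitude pruning in PyTorch
| (the 90% figure is an illustration, not a measured number):
|
|     import torch
|
|     w = torch.randn(1024, 1024)            # a dense weight matrix
|     k = int(0.9 * w.numel())               # drop the smallest 90%
|     thresh = w.abs().flatten().kthvalue(k).values
|     w_pruned = torch.where(w.abs() > thresh, w,
|                            torch.zeros_like(w))
|     w_sparse = w_pruned.to_sparse()        # store only the nonzeros
|     print(f"{(w_pruned == 0).float().mean().item():.0%} zeros")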
| lawlessone wrote:
| > Most of the weights in LLMs are 0;
|
| that's interesting. Do you have a rough percentage of this?
|
| Does this mean these connections have no influence at all on
| output?
| machiaweliczny wrote:
| My uneducated guess is that with many layers you can
| implement something akin to a graph in the brain by nulling
| lots of previous layer outputs. I actually suspect that
| current models aren't optimal with layers all of the same
| size, but I know shit.
| kridsdale3 wrote:
| This is quite intuitive. We know that a biological neural
| net is a graph data structure. And ML systems on GPUs are
| more like layers of bitmaps in Photoshop (it's a graphics
| processor). So if most of the layers are akin to
| transparent pixels, in order to build a graph by
| stacking, that's hyper memory inefficient.
| m3kw9 wrote:
| There are different directions in which AI has lots to improve:
| multi-modal, which branches into robotics, and single-modal
| like image, video, and sound generation and understanding.
| Also, check back when OpenAI releases 5.
| swalsh wrote:
| And with the increasing parameter size, the main winner will be
| Nvidia.
|
| Frankly I just don't understand the economics of training a
| foundation model. I'd rather own an airline. At least I can get
| a few years out of the capital investment of a plane.
| machiaweliczny wrote:
| But billionaires already have that, they want a chance of
| getting their own god.
| mlsu wrote:
| What else can be done?
|
| If you are sitting on 1 billions $ of GPU capex, what's $50
| million in energy/training cost for another incremental run
| that may beat the leaderboard?
|
| Over the last few years the market has placed its bets that
| this stuff will make gobs of money somehow. We're all not sure
| how. They're probably thinking -- it's likely that whoever has
| a few % is going to sweep and take most of this hypothetical
| value. What's another few million, especially if you already
| have the GPUs?
|
| I think you're right -- we are towards the right end of the
| sigmoid. And with no "killer app" in sight. It is great for all
| of us that they have created all this value, because I don't
| think anyone will be able to capture it. They certainly haven't
| yet.
| sebzim4500 wrote:
| I don't think we can conclude that until someone trains a model
| that is significantly bigger than GPT-4.
| rkwz wrote:
| > A significant effort was also devoted to enhancing the model's
| reasoning capabilities. One of the key focus areas during
| training was to minimize the model's tendency to "hallucinate" or
| generate plausible-sounding but factually incorrect or irrelevant
| information. This was achieved by fine-tuning the model to be
| more cautious and discerning in its responses, ensuring that it
| provides reliable and accurate outputs.
|
| Is there a benchmark or something similar that compares this
| "quality" across different models?
| amilios wrote:
| Unfortunately not, as it captures such a wide spectrum of use
| cases and scenarios. There are some benchmarks to measure this
| quality in specific settings, e.g. summarization, but AFAIK
| nothing general.
| rkwz wrote:
| Thanks, any ideas why it's not possible to build a generic
| eval for this? Since it's about asking a set of questions
| that's not public knowledge (or making stuff up) and checking
| if the model says "I don't know"?
| moralestapia wrote:
| Nice, they finally got the memo that GPT-4 exists and included
| it in their benchmarks.
| gavinray wrote:
| "It's not the size that matters, but how you use it."
| epups wrote:
| The graphs seem to indicate their model trades blows with Llama
| 3.1 405B, which has more than 3x the number of parameters and
| (presumably) a much bigger compute budget. It's kind of baffling
| if this is confirmed.
|
| Apparently Llama 3.1 relied on artificial data, would be very
| curious about the type of data that Mistral uses.
| OldGreenYodaGPT wrote:
| I still prefer ChatGPT-4o and use Claude if I have issues,
| but it never does any better.
| jasonjmcghee wrote:
| This is super interesting to me.
|
| Claude Sonnet 3.5 outperforms GPT-4o by a significant margin on
| every one of my use cases.
|
| What do you use it for?
| breck wrote:
| When I see this "(c) 2024 [Company Name], All rights reserved",
| it's a tell that the company does not understand how hopelessly
| behind they are about to be.
| crowcroft wrote:
| Could you elaborate on this? Would love to understand what
| leads you to this conclusion.
| breck wrote:
| E = T/A! [0]
|
| A faster evolving approach to AI is coming out this year that
| will smoke anyone who still uses the term "license" in
| regards to ideas [1].
|
| [0] https://breckyunits.com/eta.html [1]
| https://breckyunits.com/freedom.html
| christianqchung wrote:
| So it's made up?
| breck wrote:
| I do what I say and I say what I do.
|
| https://github.com/breck7/breckyunits.com/blob/afe70ad66c
| fbb...
| doctoboggan wrote:
| The question I (and I suspect most other HN readers) have is
| which model is best for coding? While I appreciate the advances
| in open weights models and all the competition from other
| companies, when it comes to my professional use I just want the
| best. Is that still GPT-4?
| tikkun wrote:
| My personal experience says Claude 3.5 Sonnet.
| stri8ed wrote:
| The benchmarks agree as well.
| kim0 wrote:
| I kinda trust https://aider.chat/docs/leaderboards/
| ashenke wrote:
| I tested it with my claude prompt history, the results are as
| good as Claude 3.5 Sonnet, but it's 2 or 3 times slower
| Tepix wrote:
| Just in case you haven't RTFA: Mistral Large 2 is 123B.
| rkwasny wrote:
| All evals we have are just far too easy! <1% difference is just
| noise/bad data
|
| We need to figure out how to measure intelligence that is greater
| than human.
| omneity wrote:
| Give it problems most/all humans can't solve on their own, but
| that are easy to verify.
|
| Math problems being one of them, if only LLMs were good at pure
| math. Another possibility is graph problems. Haven't tested
| this much though.
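|
| E.g. verifying a proposed graph coloring is a few lines, even
| though producing one is NP-hard (a minimal sketch):
|
|     def is_proper_coloring(edges, coloring):
|         # valid iff no edge joins two same-colored vertices
|         return all(coloring[u] != coloring[v] for u, v in edges)
|
|     triangle = [(0, 1), (1, 2), (2, 0)]
|     print(is_proper_coloring(triangle, {0: "a", 1: "b", 2: "c"}))  # True
|     print(is_proper_coloring(triangle, {0: "a", 1: "b", 2: "a"}))  # False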
| tonetegeatinst wrote:
| What doe they mean by "single-node inference"?
|
| Do they mean inference done on a single machine?
| simonw wrote:
| Yes, albeit a really expensive one. Large models like GPT-4 are
| rumored to run inference on multiple machines because they
| don't fit in VRAM for even the most expensive GPUs.
|
| (I wouldn't be surprised if GPT-4o mini is small enough to fit
| on a single large instance though, would explain how they could
| drop the price so much.)
| bjornsing wrote:
| Yeah that's how I read it. Probably means 8 x 80 GB GPUs.
| huevosabio wrote:
| The non-commercial license is underwhelming.
|
| It seems to be competitive with Llama 3.1 405b but with a much
| more restrictive license.
|
| Given how the difference between these models is shrinking, I
| think you're better off using llama 405B to finetune the 70B on
| the specific use case.
|
| This would be different if it was a major leap in quality, but it
| doesn't seem to be.
|
| Very glad that there's a lot of competition at the top, though!
| calibas wrote:
| "Mistral Large 2 is equipped with enhanced function calling and
| retrieval skills and has undergone training to proficiently
| execute both parallel and sequential function calls, enabling it
| to serve as the power engine of complex business applications."
|
| Why does the chart below say the "Function Calling" accuracy is
| about 50%? Does that mean it fails half the time with complex
| operations?
| simonw wrote:
| Mistral forgot to say which benchmark they were using for that
| chart, without that information it's impossible to determine
| what it actually means.
| Me1000 wrote:
| Relatedly, what does "parallel" function calling mean in this
| context?
| simonw wrote:
| That's when the LLM can respond with multiple functions it
| wants you to call at once. You might send it:
| Location and population of Paris, France
|
| A parallel function calling LLM could return:
| { "role": "assistant", "content": "",
| "tool_calls": [ { "function": {
| "name": "get_city_coordinates", "arguments":
| "{\"city\": \"Paris\"}" } }, {
| "function": { "name": "get_city_population",
| "arguments": "{\"city\": \"Paris\"}" }
| } ] }
|
| Indicating that you should execute both of those functions
| and return the results to the LLM as part of the next prompt.
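|
| On the client side that just means looping over the array (a
| minimal sketch; the tool implementations and return values
| here are placeholders):
|
|     import json
|
|     TOOLS = {
|         "get_city_coordinates": lambda city: [48.86, 2.35],
|         "get_city_population": lambda city: 2_102_650,
|     }
|
|     def run_tool_calls(tool_calls):
|         results = []
|         for call in tool_calls:
|             fn = call["function"]
|             # arguments arrive as a JSON-encoded string
|             args = json.loads(fn["arguments"])
|             results.append({"name": fn["name"],
|                             "result": TOOLS[fn["name"]](**args)})
|         # feed these back to the LLM in the next prompt
|         return results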
| Me1000 wrote:
| Ah, thank you!
| RyanAdamas wrote:
| Personally, I think language diversity should be the last thing
| on the list. If we had optimized every piece of software from
| the get-go for a dozen languages, our forward progress would
| have been dead in the water.
| moffkalast wrote:
| You'd think so, but 3.5-turbo was multilingual from the get go
| and benefitted massively from it. If you want to position
| yourself as a global leader, then excluding 95% of the world
| who aren't English native speakers seems like a bad idea.
| RyanAdamas wrote:
| Yeah clearly, OpenAI is rocketing forward and beyond.
| moffkalast wrote:
| Constant infighting and most of the competent people
| leaving will do that to a company.
|
| I mean more on a model performance level though. It's been
| shown that something trained in one language trains the
| model to be able to output it in any other language it
| knows. There's quality human data being left on the table
| otherwise. Besides, translation is one of the few tasks
| that language models are by far the best at if trained
| properly, so why not do something you can sell as a main
| feature?
| gpm wrote:
| Language diversity means access to more training data, and you
| might also hope that by learning the same concept in multiple
| languages it does a better job of learning the underlying
| concept independent of the phrase structure...
|
| At least from a distance it seems like training a multilingual
| state of the art model might well be easier than a monolingual
| one.
| RyanAdamas wrote:
| Multiple input and output processes in different languages
| have zero effect on associative learning and creative
| formulation, in my estimation. We've already done studies
| that show there is no correlation between human intelligence
| and knowing multiple languages, after having to put up with
| decades of "Americans le dumb because..." and this is no
| different. The amount of discourse on a single topic has a
| limited degree of usability before redundancies appear. Such
| redundancies would necessarily increase the processing
| burden, which could actually limit the output potential for
| novel associations.
| gpm wrote:
| Humans also don't learn by reading the entire internet...
| assuming human psych studies apply to LLMs at all is just
| wrong.
| logicchains wrote:
| Google mentioned this in one of their papers: they found
| that for large enough models, including more languages did
| indeed lead to an overall increase in performance.
| RyanAdamas wrote:
| Considering Googles progress and censorship history, I'm
| inclined to take their assessments with a grain of salt.
| wesleyyue wrote:
| I'm building a ai coding assistant (https://double.bot) so I've
| tried pretty much all the frontier models. I added it this
| morning to play around with it and it's probably the worst model
| I've ever played with. Less coherent than 8B models. Worst case
| of benchmark hacking I've ever seen.
|
| example: https://x.com/WesleyYue/status/1816153964934750691
| mpeg wrote:
| to be fair that's quite a weird request (the initial one) - I
| feel a human would struggle to understand what you mean
| wesleyyue wrote:
| definitely not an articulate request, but the point of using
| these tools is to speed me up. The less the user has to
| articulate and the more it can infer correctly, the more
| helpful it is. Other frontier models don't have this problem.
|
| Llama 405B response would be exactly what I expect
|
| https://x.com/WesleyYue/status/1816157147413278811
| mpeg wrote:
| That response is bad Python though; I can't think of why
| you'd ever want a dict with Literal typed keys.
|
| Either use a TypedDict if you want the keys to be in a
| specific set, or, in your case since both the keys and the
| values are static you should really be using an Enum
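|
| A minimal sketch of the two options (the names are made up):
|
|     from enum import Enum
|     from typing import TypedDict
|
|     class Palette(TypedDict):   # fixed key set, typed values
|         primary: str
|         accent: str
|
|     class Color(Enum):          # keys *and* values are static
|         PRIMARY = "#336699"
|         ACCENT = "#ff6600"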
| ijustlovemath wrote:
| What was the expected outcome for you? AFAIK, Python doesn't
| have a const dictionary. Were you wanting it to refactor into a
| dataclass?
| wesleyyue wrote:
| Yes, there's a few things wrong:
|
| 1. If it assumes TypeScript, it should do `as const` in the
| first msg.
|
| 2. If it is Python, it should be something like
| https://x.com/WesleyYue/status/1816157147413278811 which is
| what I wanted, but I didn't want to bother with the typing.
| nabakin wrote:
| Are you sure the chat history is being passed when the second
| message is sent? That looks like the kind of response you'd
| expect if it only received the prompt "in python" with no chat
| history at all.
| wesleyyue wrote:
| Yes, I built the extension. I actually also just went to send
| another message asking what the first msg was just to double
| check I didn't have a bug and it does know what the first msg
| was.
| nabakin wrote:
| Thanks, that's some really bad accuracy/performance
| schleck8 wrote:
| This makes no sense. Benchmarking code is easier than natural
| language and Mistral has separate benchmarks for prominent
| languages.
| avereveard wrote:
| important to note that this time around weights are available
| https://huggingface.co/mistralai/Mistral-Large-Instruct-2407
| ilaksh wrote:
| How does their API pricing compare to 4o and 3.5 Sonnet?
| rvnx wrote:
| 3 USD per 1M input tokens, so the same as 3.5 Sonnet but worse
| quality
| ThinkBeat wrote:
| A side note about the ever increasing costs to advance the
| models. I feel certain that some branch of what may be connected
| to the NSA is running and advancing models that probably exceed
| what the open market provides today.
|
| Maybe they are running it on proprietary or semi-proprietary
| hardware, but if they don't, how much does the market know
| about where various shipments of NVIDIA processors end up?
|
| I imagine most intelligence agencies are in need of vast
| quantities.
|
| I presume if M$ announces new availability of AI compute it
| means they have received and put into production X NVIDIA
| processors, which might make it possible to guesstimate within
| some bounds how many.
|
| Same with other open market compute facilities.
|
| Is it likely that a significant share of NVIDEA processors are
| going to government / intelligent / fronts?
| teaearlgraycold wrote:
| https://www.youtube.com/watch?v=rvrZJ5C_Nwg
| modeless wrote:
| The name just makes me think of the screaming cowboy song.
| https://youtu.be/rvrZJ5C_Nwg?t=138
| nen-nomad wrote:
| The models are converging slowly. In the end, it will come down
| to the user experience and the "personality." I have been
| enjoying the new Claude Sonnet. It feels sharper than the others,
| even though it is not the highest-scoring one.
|
| One thing that `exponentialists` forget is that each step also
| requires exponentially more energy and resources.
| toomuchtodo wrote:
| I have been paying for OpenAI since they started accepting
| payment but, to echo your comment, Claude is _so good_ that I
| now rely on it for most LLM-driven work and have cancelled my
| OpenAI subscription. Genuine kudos to Mistral as well; they are
| a worthy competitor against the Goliaths in the space. These
| models make someone mediocre at writing code less so, so I can
| focus on higher-value work.
| bilater wrote:
| A factor in Mistral's favour is that it typically gives fewer
| refusals and can be uncensored. So, if I had to guess, any task
| that requires creative output could be better suited to it.
| thntk wrote:
| Does anyone know what caused the very big performance jump from
| Large 1 to Large 2 in just a few months?
|
| Besides, parameter redundancy seems well evidenced: frontier
| models used to be 1.8T, then 405B, and now 123B. If future
| frontier models were <10B or even <1B, that would be a game
| changer.
| nuz wrote:
| Lots and lots of synthetic data from the bigger models training
| the smaller ones would be my guess.
| duchenne wrote:
| Counter-intuitively, larger models are cheaper to train, while
| smaller models are cheaper to serve. At first everyone was
| focused on training, so the models were much larger. Now that so
| many people are using AI every day, companies spend more on
| training smaller models to save on serving.
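|
| As a back-of-the-envelope - using the common ~6*N*D
| approximation for training FLOPs and ~2*N FLOPs per generated
| token for serving; the token budgets below are hypothetical:
|
|     # Training cost scales with params * tokens and is paid once;
|     # serving cost scales with params and is paid on every token.
|     def train_flops(n_params: float, n_tokens: float) -> float:
|         return 6 * n_params * n_tokens
|
|     def serve_flops_per_token(n_params: float) -> float:
|         return 2 * n_params
|
|     # A bigger model can hit a loss target on fewer tokens
|     # (Chinchilla-style trade-off), so training can be cheaper...
|     print(train_flops(405e9, 5e12) < train_flops(123e9, 20e12))  # True
|
|     # ...but the smaller model is ~3.3x cheaper on every request.
|     print(serve_flops_per_token(405e9) / serve_flops_per_token(123e9))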
| erichocean wrote:
| I like Claude 3.5 Sonnet, but despite paying for a plan, I run
| out of tokens after about 10 minutes. Text only, I'm typing
| everything in myself.
|
| It's almost useless because I literally can't use it.
|
| Update: https://support.anthropic.com/en/articles/8325612-does-
| claud...
|
| 45 messages per 5 hours is the limit for Pro users - fewer if
| Claude is wordy in its responses, which it always is. I hit that
| limit so fast when I'm investigating something. So annoying.
|
| They used to let you select another, worse model but I don't see
| that option anymore. _le sigh_
| mvdtnz wrote:
| Imagine bragging about 74% accuracy in any other field of
| software. You'd be laughed out of the room. But somehow it's
| accepted in "AI".
| kgeist wrote:
| Well, we had close to 0% a few years ago (for general purpose
| AI). I think it's not bad...
| SebaSeba wrote:
| Sorry for the slightly off-topic question, but can someone
| enlighten me: which Claude model is more capable, Opus or Sonnet
| 3.5? I am confused because I see people fussing about Sonnet 3.5
| being the best, and yet I seem to read again and again in
| factual texts and some benchmarks that Claude Opus is the most
| capable. Is there a simple answer? What do I not understand?
| Please, thank you.
| platelminto wrote:
| Sonnet 3.5.
|
| Opus is the largest model, but of the Claude 3 family. Claude
| 3.5 is the newest family of models, with Sonnet being the
| middle-sized 3.5 model - and also the only one available.
| Regardless, it's better than Opus (the largest Claude 3 model).
|
| Presumably a Claude 3.5 Opus will come out at some point and
| should be even better - but maybe they've found that increasing
| the size for this model family just isn't cost-effective, or
| doesn't improve things that much. I'm unsure if they've said
| anything about it recently.
| SebaSeba wrote:
| Thank you :)
| zamadatix wrote:
| I think this image explains it best: https://www-
| cdn.anthropic.com/images/4zrzovbb/website/1f0441...
|
| I.e., Opus is the largest and best model of each family, but
| Sonnet is the first model of the 3.5 family and can beat 3's
| Opus in most tasks. When 3.5 Opus is released, it will again
| universally outpace the 3.5 Sonnet model of the same family (in
| terms of capability), but until then it's a comparison of two
| different families without a universal guarantee - just a strong
| lean towards the newer model.
| SebaSeba wrote:
| Thank you for clearing this up for me :)
| htk wrote:
| Is it possible to run Large 2 on Ollama?
| novok wrote:
| I kind of wonder why a lot of these places don't release
| "amateur"-sized models anymore, at around the 18B and 30B
| parameter sizes you can run on a single 3090 or M2 Max at
| reasonable speeds and RAM requirements. It's all 7B, 70B, and
| 400B sizing nowadays.
| TobTobXX wrote:
| Just a few days ago, Mistral released a 12B model:
| https://mistral.ai/news/mistral-nemo/
| logicchains wrote:
| Because you can just quantise the 70B model to 3-4 bits and
| it'll perform better than a 30B model while being a similar
| size.
| novok wrote:
| A 70B 4-bit model does not fit in a 24GB VRAM card. 30B models
| are the sweet spot for that size of card at ~20GB, with 4GB
| left over so the system can still function.
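|
| The raw arithmetic (weights only - KV cache and runtime overhead
| push real usage higher):
|
|     def weight_gb(n_params: float, bits: int) -> float:
|         # raw weight storage: params * bits-per-weight / 8 bytes
|         return n_params * bits / 8 / 1e9
|
|     print(weight_gb(70e9, 4))  # 35.0 GB -- over a 24 GB card
|     print(weight_gb(30e9, 5))  # 18.75 GB -- near the ~20 GB above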
| whisper_yb wrote:
| Every day a new model better than the previous one lol
| philip-b wrote:
| Do any of the top models have access to the internet and
| Googling things? I want an LLM to look things up and do casual
| research for me when I'm lazy.
| tikkun wrote:
| I'd suggest using Perplexity.
| freediver wrote:
| Sharing PyLLMs [1] reasoning benchmark results for some of the
| recent models. Surprised by Nemo (speed/quality), and Mistral
| Large is actually pretty good (but painfully slow).
|
| Provider   Model                        Median lat.  Agg. speed  Accuracy
| Anthropic  claude-3-haiku-20240307             1.61      122.50    44.44%
| Mistral    open-mistral-nemo                   1.37      100.37    51.85%
| OpenAI     gpt-4o-mini                         2.13       67.59    59.26%
| Mistral    mistral-large-latest               10.18       18.64    62.96%
| Anthropic  claude-3-5-sonnet-20240620          3.61       59.70    62.96%
| OpenAI     gpt-4o                              3.25       53.75    74.07%
|
| [1] https://github.com/kagisearch/pyllms
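|
| For context, a minimal PyLLMs invocation per the project README
| (the model name is one of those benchmarked above):
|
|     import llms  # pip install pyllms
|
|     # init() accepts a model name (or a list for side-by-side runs)
|     model = llms.init('claude-3-haiku-20240307')
|     result = model.complete("what is 5+5?")
|     print(result.text)  # completion text
|     print(result.meta)  # token, cost, and latency metadata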
| zone411 wrote:
| Improves from 17.7 (Mistral Large) to 20.0 (Large 2) on the NYT
| Connections benchmark.
| greenchair wrote:
| Can anyone explain why the % success rates are so different
| between these programming languages? Is this a function of the
| amount of training data available for each language, of the
| complexity of the language, or of something else?
| h1fra wrote:
| There are now more AI models than JavaScript frameworks!
___________________________________________________________________
(page generated 2024-07-24 23:01 UTC)