[HN Gopher] LLaMA now goes faster on CPUs
       ___________________________________________________________________
        
       LLaMA now goes faster on CPUs
        
       Author : lawrencechen
       Score  : 1136 points
       Date   : 2024-04-01 02:17 UTC (20 hours ago)
        
 (HTM) web link (justine.lol)
 (TXT) w3m dump (justine.lol)
        
       | bottlepalm wrote:
       | I think it's a good idea for everyone to download and be able to
       | run a LLM locally, even if you have the minimum of requirements.
       | As a pseudo-backup of a large chunk of human knowledge.
        
         | TaylorAlexander wrote:
         | I contend that most human knowledge is not written down or if
         | it is written down it's not publicly available on the internet
         | and so does not exist in these datasets.
         | 
         | There's so much subtle knowledge like the way a mother learns
         | to calm her child or the way a carpenter learns to work
         | different kinds of wood which may be written down in part, but
         | may also be learned through lived experience or transferred
         | from human to human such that little of it gets written down
         | and posted online.
        
           | mickdarling wrote:
           | Wait till all the videos ever created are tokenized and
           | ingested into a training dataset. Carpentry techniques are
            | certainly there. The subtleties of parenting may be harder to
           | derive from that, but maybe lots of little snippets of
           | people's lives will add up to a general understanding of
           | parenting. There have certainly been bigger surprises in the
           | field.
        
             | oblio wrote:
             | What about smells or tastes? Or feelings?
             | 
             | I can't help but feel we're at the "aliens watch people eat
             | from space and recreate chemically identical food that has
             | no taste" phase of AI development.
        
               | skeledrew wrote:
               | If the food is chemically identical then the taste would
               | be the same though, since taste (and smell) is about
               | chemistry. I do get what you're saying though.
        
               | nyokodo wrote:
               | > If the food is chemically identical...
               | 
               | If it were 99.9% chemically identical but they left out
               | the salt and spices...
        
               | skeledrew wrote:
               | I'd say that, when it comes to chemistry, only 100%
               | reproduction can be considered identical. Anything less
               | is to be deemed similar to some degree.
               | 
               | And so without the correct amount of salt and/or spices,
               | we're talking about food that's very similar, and not
               | identical.
        
               | samus wrote:
               | Their perception is very likely to be totally different.
               | 
               | * They might not perceive some substances at all, others
               | that we don't notice might make it unpalatable.
               | 
               | * Some substances might be perceived differently than us,
               | or be indistinguishable from others.
               | 
               | * And some might require getting used to.
               | 
               | Note that all of the above phenomena also occur in humans
               | because of genetics, cultural background, or experiences!
        
               | skeledrew wrote:
               | This may come off as pedantic, but "identical" is a very
               | strong term when it comes to something like chemistry.
               | The smallest chemical difference can manifest as a large
               | physical difference. Consider that genetically, humans
                | are about 60% similar to the fruit fly, yet phenotypically,
               | the similarity could be considered under 1%.
        
               | dekhn wrote:
               | https://en.wikipedia.org/wiki/Knowledge_argument
        
               | mickdarling wrote:
               | Well, I have synesthetic smell/color senses, so I don't
               | even know what other humans experience, nor they me. But,
               | I have described it in detail to many people and they
               | seem to get the idea, and can even predict how certain
               | smells will "look" to me. All that took was using words
               | to describe things.
        
               | nyokodo wrote:
               | > All that took was using words to describe things.
               | 
               | All that took was words and a shared experience of
               | smelling.
        
               | mickdarling wrote:
               | How rude, what do our bathing habits have to do with
               | this? ;-)
               | 
               | But, fair point. The gist I was trying to get across is
               | that I don't even know what a plant smells like to you,
               | and you don't know what a plant smells like to me. Those
               | aren't comparable with any objective data. We make
               | guesses, and we try to get close with our descriptions,
                | which are in words. That's the best we can do to share our
                | senses. Asking more from computers seems overly
               | picky to me.
        
               | visarga wrote:
               | I think we can safely say that any taste, smell,
               | sensation or emotion of any importance has been described
               | 1000 times over in the text corpus of GPT. Even though it
               | is fragmented, by sheer volume there is enough signal in
               | the training set, otherwise it would not be able to
               | generate coherent text. In this case I think the map
               | (language) is asymptotically close to the territory
               | (sensations & experience in general).
        
             | andersa wrote:
             | What makes you think they aren't already?
        
           | spacephysics wrote:
           | For sure agree, however as the storage of information
           | evolves, it's becoming more efficient over time
           | 
           | From oral tradition to tablets to scrolls to books to mass
           | produced books to digital and now these LLMs, I think it's
           | still a good idea to preserve what we have the best we can.
           | Not as a replacement, but a hedge against a potential library
           | of Alexandria incident.
           | 
           | I could imagine a time in the near future where the models
           | are domain-specific, and just like there are trusted
           | encyclopedia publishers there are trusted model publishers
           | that guarantee a certain level of accuracy.
           | 
           | It's not like reading a book, but I for sure had an easier
           | time learning golang talking with ChatGPT than a book
        
             | nyokodo wrote:
             | > a hedge against a potential library of Alexandria
             | incident
             | 
             | What would cause a Library of Alexandria incident wiping
             | out all human knowledge elsewhere, that would also allow
             | you to run a local LLM?
        
               | AnthonyMouse wrote:
               | To run a local LLM you need the device it currently runs
               | on and electricity. There are actually quite a lot of
               | ways to generate electricity, but to name one, a diesel
               | generator that can run on vegetable oil.
               | 
               | What you're really asking is, what could cause a modern
               | Library of Alexandria incident? But the fact is we keep
               | the only copy of too many things on the servers of the
               | major cloud providers. Which are then intended to have
               | their own internal redundancy, but that doesn't protect
               | you against a targeted attack or a systemic failure when
               | all the copies are under the same roof and you lose every
               | redundant copy at once from a single mistake replicated
               | in a monoculture.
        
               | spacephysics wrote:
                | A more doomsday-prepping approach would call for some
                | heavy lead-lined Faraday cage to store the storage media in
                | the event of an EMP/major solar flare.
               | 
               | Or more Sci-fi related, some hyper computer virus that
               | ends up infecting all internet connected devices.
               | 
                | Not too far-fetched: if we can conceive of some AI-enabled
                | worm that mutates depending on the target, I could imagine
                | a model of sorts being feasible within the next 5-10 years.
        
           | _ache_ wrote:
           | I think you underestimate the amount of information contained
           | in books and the extent to which our society (as a whole)
           | depends on them.
        
             | Barrin92 wrote:
             | society depends much more on social networks, mentorship
             | and tacit knowledge than books. It's easy to test this.
              | Just run the thought experiment by a few people: if you
              | could get only one, would you take an Ivy League degree
              | without the education, or the education without the degree?
             | 
              | Venture capital in tech is a good example of this. The book
              | knowledge is effectively globally distributed and almost
              | free, yet success happens in a few geographically
              | concentrated counties.
        
           | skeledrew wrote:
            | I'd contend that those are skills (gained through experience)
           | rather than knowledge (gained through rote learning).
        
             | TaylorAlexander wrote:
             | I think it's worth expanding your definition of knowledge.
        
           | bamboozled wrote:
           | Yes but it contains enough hints to help someone find their
            | way on these types of tasks.
        
           | nicklecompte wrote:
           | It's not even "human knowledge" that can't be written down -
           | it seems all vertebrates understand causality, quantity (in
           | the sense of intuitively understanding what numbers are), and
           | object permanence. Good luck writing those concepts down in a
           | way that GPT can use!
           | 
           | In general AI in 2024 is not even close to understanding
           | these ideas, nor does any AI developer have a clue how to
           | build an AI with this understanding. The best we can do is
           | imitating object permanence for a small subset of perceptible
           | objects, a limitation not found in dogs or spiders.
        
           | wruza wrote:
           | That's where humans suck. The classic "you're not doing it
           | right" then proceeds to quickly show how to do it without
           | verbalizing any info on learning process, pitfalls, failure
           | modes, etc, as if just showing it was enough for themselves
           | to learn. Most people do[n't do] that, not even a sign of
           | reflection.
           | 
           | My worst case was with a guy who asked me to write an
           | arbitrage betting bot. When I asked how to calculate coeffs,
           | he pointed at two values and said "look, there <x>, there <y>
            | _thinks for a minute_ then it's <z>!". When I asked how
           | exactly did he calculate it, he simply repeated with
           | different numbers.
        
             | samus wrote:
             | > When I asked how exactly did he calculate it, he simply
             | repeated with different numbers.
             | 
             | Now you know how an LLM feels during training!
        
               | stavros wrote:
               | Probably during inference, as well.
        
             | Aerroon wrote:
             | People often don't know how to verbalize them in the first
             | place. Some of these topics are very complex, but our
             | intuition gets us halfway there.
             | 
             | Once upon a time I was good at a video game. Everyone
             | realized that positioning is extremely important in this
             | game.
             | 
             | I have good positioning in that game and was asked many
             | times to make a guide about positioning. I never did,
             | because I don't really know how. There is too much
              | information that you need to convey to cover all the
             | various situations.
             | 
             | I think you would first have to come up with a framework on
             | positioning to be able to really teach this to someone
             | else. Some kind of base truths/patterns that you can then
             | use to convey the meaning. I believe the same thing applies
             | to a lot of these processes that aren't verbalized.
        
               | snovv_crash wrote:
               | Often for this kind of problem writing a closed form
               | solution is simply intractable. However, it's often still
               | possible to express the cost function of at least a big
               | portion of what goes into a human-optimal solution. From
               | here you can sample your space, do gradient descent or
               | whatever to find some acceptable solution that has a more
               | human-intuitive property.
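                | 
                | A toy sketch of that idea (the cost terms, weights and
                | positions below are invented purely for illustration, and
                | scipy's minimize stands in for "gradient descent or
                | whatever"):
                | 
                |     # Toy "positioning" cost: made-up danger and distance terms;
                |     # the weights encode the human-optimal tradeoff.
                |     import numpy as np
                |     from scipy.optimize import minimize
                | 
                |     enemies = np.array([[0.0, 0.0], [4.0, 1.0]])
                |     objective = np.array([5.0, 5.0])
                | 
                |     def cost(pos):
                |         danger = np.sum(1.0 / (np.linalg.norm(enemies - pos, axis=1) + 0.1))
                |         distance = np.linalg.norm(objective - pos)
                |         return 3.0 * danger + distance
                | 
                |     # Optimize to find an acceptable, not necessarily perfect, spot.
                |     best = minimize(cost, x0=np.array([2.0, 2.0]))
                |     print(best.x)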
        
               | michaelt wrote:
               | It's not necessarily that it's intractable - just that a
               | thing can be very hard to describe, under some
               | circumstances.
               | 
               | Imagine someone learning English has written "The
               | experiment reached it's conclusion" and you have to
                | correct their grammar. Almost any English speaker can
               | correct "it's" to "its" but unless they (and the person
               | they're correcting) know a bunch of terms like 'noun' and
               | 'pronoun' and 'possessive' they'll have a very hard time
               | explaining why.
        
             | Shorel wrote:
             | I wouldn't say this is where humans suck. On the contrary,
              | this is how we find that human language is such a fantastic
              | tool to serialize and deserialize human mental processes.
             | 
             | Language is so good, that an artificial language tool,
             | without any understanding of these mental processes, can
             | appear semi-intelligent to us.
             | 
             | A few people unable to do this serialization doesn't mean
             | much on the larger scale. Just that their ideas and mental
             | processes will be forgotten.
        
           | HarHarVeryFunny wrote:
           | > I contend that most human knowledge is not written down
           | 
           | Yes - the available training data is essentially mostly a
           | combination of declarative knowledge (facts - including
           | human-generated artifacts) and procedural knowledge (how to
           | do things). What is missing is the learning process of taking
           | a description of how to do something, and trying to apply
           | that yourself in a specific situation.
           | 
           | No amount of reading books, or reading other people's blogs
           | on how they did something, can avoid the need for hands-on
           | experience if you want to learn how to do it yourself.
           | 
           | It's not just a matter of information that might be missing
           | or unclear in instructional material, including how to cope
           | with every type of failure and unexpected outcome, but
           | crucially how to do this _yourself_ - if you are to be the
            | actor, then it's the predictive process in _your_ mind that
           | matters.
           | 
           | Partly for this reason, and partly because current AI's
           | (transformer-based LLMs) don't support online learning (try &
           | fail skill acquisition), I think we're going to see two
           | distinct phases of AI.
           | 
           | 1) The current "GenAI" phase where AI can only produce mash-
            | ups of things it saw in its pre-training data, augmented by
           | similar "book learning" provided in-context which can be
            | utilized by in-context learning. I'd characterize what this
            | type of AI is useful for, and capable of, as "automation".
           | Applying that book (incl. anecdotal) knowledge to new
           | situations where mash-up is all you need.
           | 
           | 2) The second phase is where we have something closer to AGI,
           | even if still below human level, which is no longer just a
           | pre-trained transformer, but also has online learning and is
           | agentic - taking actions predicated on innate traits like
           | curiosity and boredom, so that given the book knowledge it
           | can (& will!) then learn to apply that by
           | experimentation/practice and learning from its own mistakes.
           | 
           | There will no doubt be advances beyond this "phase two" as
           | well, but it seems we're likely to be stuck at "phase one"
           | for a while (even as models become much better at phase one
           | capabilities), until architectures fundamentally advance
           | beyond transformers to allow this type of on-the-job training
           | and skill acquisition.
        
         | texuf wrote:
         | Any recommendations for the latest and greatest way to run
         | these locally?
        
           | speps wrote:
           | llamafile as per TFA...
        
           | etc-hosts wrote:
           | https://justine.lol/oneliners/
        
             | threecheese wrote:
             | This looks amazing, but the docs mention .llamafiles exceed
             | the Windows executable size limit, and there are
             | workarounds to externalize the weights. Do you think this
             | is an impediment to its becoming popular? Or is MS consumer
             | hardware just far enough behind (w/o dedi gpu) that
             | "there's time"?
        
           | fragmede wrote:
           | ollama
        
           | slowmotiony wrote:
           | I use a tool called LM Studio, makes it trivial to run these
           | models on a Mac. You can also use it as a local API so it
           | kinda acts like a drop-in replacement for the openAI API.
        
           | chown wrote:
           | I am the author of Msty [1]. My goal is to make it as
           | straightforward as possible with just one click (once you
           | download the app). If you end up trying it, I would love to
           | hear your feedback.
           | 
           | 1: https://msty.app
        
         | mikewarot wrote:
         | I don't see LLMs as a large chunk of knowledge, I see them as
         | an emergent alien intelligence snapshotted at the moment it
         | appeared to stop learning. It's further hobbled by the limited
         | context window it has to use, and the probabilistic output
         | structure that allows for outside random influences to pick its
         | next word.
         | 
         | Both the context window and output structure are, in my
         | opinion, massive impedance mismatches for the emergent
         | intellect embedded in the weights of the model.
         | 
         | If there were a way to match the impedance, I strongly suspect
         | we'd already have AGI on our hands.
        
           | bamboozled wrote:
            | What is alien about them?
           | 
           | LLMs are of this earth and created by our species. Seems
           | quite familiar to me.
        
             | jfoster wrote:
             | They can write in a way similar to how a human might write,
             | but they're not human.
             | 
             | The chat interfaces (Claude, ChatGPT) certainly have a
             | particular style of writing, but the underlying LLMs are
              | definitely capable of impersonating our species in the
             | medium of text.
        
               | bamboozled wrote:
               | But they're extremely relatable to us because it's
               | regurgitating us.
               | 
               | I saw this talk with Geoffrey Hinton the other day and he
               | said he was astonished at the capabilities of ChatGPT-4
               | because he asked it what the relationship between a
               | compost heap and a nuclear bomb was, and he couldn't
                | believe it answered; he really thought it was proof the
               | thing could reason. Totally mind blown.
               | 
               | However I got it right away with zero effort.
               | 
               | Either I'm a super genius or this has been discussed
                | before and made its way into the training data.
               | 
               | Usual disclaimer: I don't think this invalidates the
               | usefulness of AI or LLMs, just that we might be
               | bamboozling ourselves into the idea that we've created an
               | alien intelligence.
        
               | EMM_386 wrote:
               | > Either I'm a super genius or this has been discussed
                | before and made its way into the training data.
               | 
                | If an LLM can tell you the relationship between a compost
               | heap and nuclear bomb, that doesn't mean that was in the
               | training data.
               | 
               | It could be because a compost heap "generates heat", and
               | a nuclear bomb also "generates heat" and due to that
               | relationship they have something in common. The model
                | will pick up on these similar patterns. The tokens are
               | positioned closer to each other in the high dimensional
               | vector space.
               | 
               | But for any given "what does x have in common with y",
               | that doesn't necessarily mean someone has asked that
               | before and it's in the training data. Is that reasoning?
               | I don't know ... how does the brain do it?
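                | 
                | A loose way to see that "closer in the vector space" idea
                | with sentence embeddings (not the LLM's internal states,
                | just the flavor of it; the phrases and model name are only
                | an example):
                | 
                |     # Compare which pair of phrases lands closer in embedding space.
                |     from sentence_transformers import SentenceTransformer, util
                | 
                |     model = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model
                |     heap, bomb, tax = model.encode(
                |         ["a compost heap", "a nuclear bomb", "a tax return"])
                |     print(util.cos_sim(heap, bomb))  # heap vs bomb similarity
                |     print(util.cos_sim(heap, tax))   # heap vs tax return similarity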
        
               | moffkalast wrote:
               | > how does the brain do it?
               | 
               | It's a lot of organic matmuls. ;)
        
               | bamboozled wrote:
                | I mean that's what sucks about OpenAI, isn't it? They
               | won't tell us what is in the training data so we don't
               | know. All I'm saying is that it wouldn't be surprising if
               | this was discussed previously somewhere in a pop science
               | book.
               | 
                | That answer was close btw!
        
             | fragmede wrote:
             | They don't think, they don't reason, they don't understand.
             | Except they do. But it's hard for human words for thought
             | processes to apply when giving it an endless string of
             | AAAAA's makes it go bananas.
             | 
             | That's not familiar behavior. Nor is the counting reddit
             | derived output. It's also not familiar for a single person
             | to have the breadth and depth of knowledge that ChatGPT
             | has. Sure, some people know more than others, but even
             | without hitting the Internet, it has a ridiculous amount of
             | knowledge, far surpassing a human, making it, to me, alien.
              | Though, its inability to do math sometimes is humanizing
             | to me for some reason.
             | 
             | ChatGPT's memory is also unhuman. It has a context window
             | which is a thing, but also it only knows about things
             | you've told it in each chat. Make a new chat and it's
             | totally forgotten the nickname you gave it.
             | 
              | I don't think of H.R. Giger's work, though made by a human,
              | as familiar to me. It feels quite alien to me, and it's not
             | just me, either. Dali, Bosch, and Escher are other human
              | artists whose work can be unfamiliar and alien. So being
             | created by our species doesn't automatically imbue
             | something with familiar human processes.
             | 
             | So it dot products, it matrix multiplies, instead of
             | reasoning and understanding. It's the Chinese room
             | experiment on steroids; it turns out a sufficiently large
             | corpus on a sufficiently large machine does make it look
             | like something"understands".
        
               | trimethylpurine wrote:
               | The word "alien" works in this context but, as the
               | previous commenter mentioned, it also carries the
               | implication of foreign origin. You could use "uncanny"
               | instead. Maybe that's less arbitrary and more specific to
               | these examples.
               | 
               | "Alien" still works, but then you might have to add all
               | the context at length, as you've done in this last
               | comment.
        
               | fire_lake wrote:
               | Hype people do this all the time - take a word that has a
               | particular meaning in a narrow context and move it to a
               | broader context where people will give it a sexier
                | meaning.
                | 
                |     AI researchers unveil alien intelligence
                | 
                | Is way better headline.
        
               | samus wrote:
               | The context window is comparable to human short-term
               | memory. LLMs are missing episodic memory and means to
               | migrate knowledge between the different layers and into
               | its weights.
               | 
               | Math is mostly impeded by the tokenization, but it would
               | still make more sense to adapt them to use RAG to process
               | questions that are clearly calculations or chains of
               | logical inference. With proper prompt engineering, they
               | can process the latter though, and deviating from
               | strictly logical reasoning is sometimes exactly what we
               | want.
               | 
               | The ability to reset the text and to change that history
               | is a powerful tool! It can make the model roleplay and
               | even help circumvent alignment.
               | 
               | I think that LLMs could one day serve as the language
               | center of an AGI.
        
               | taneq wrote:
                | In all fairness, going up to some random human and yelling
               | AAAAAAAAAAAAAA... at them for long enough will produce
               | some out-of-distribution responses too.
        
               | cloudwalk9 wrote:
               | Makes me think that TikTok and YT pranksters are
               | accidentally producing psychological data on what makes
               | people tick under scenarios of extreme deliberate
               | annoyance. Although the quality (and importance) of that
               | data is obviously highly variable and probably not very
               | high, and depends on what the prank is.
        
               | inference-lord wrote:
                | Do you find a large database or spreadsheet that holds
               | more information than you can "alien" too?
        
             | mikewarot wrote:
             | Alien meaning unfamiliar, not necessarily extraterrestrial.
             | 
             | Aliens are people from other countries, for example.
             | 
             | Exotic would be another good word to use.
        
           | namarie wrote:
           | I can agree on the context windows, but what other output
           | structure would you have?
        
             | moffkalast wrote:
             | Working with pure bytes is one option that's being
             | researched. That way you're not really constrained by
             | anything at all. Sound, images, text, video, etc. Anything
             | goes in, anything comes out. It's hard to say if it's
             | feasible with current compute yet without tokenizers to
             | reduce dimensionality.
        
           | mlsu wrote:
           | Disagree. The input/output structure (tokens) is the
           | interface for both inference _and_ for training. There is an
           | emergent intellect embedded in the weights of the model.
           | However, it is _only_ accessible through the autoregressive
           | token interface.
           | 
           | This is a fundamental limitation, much more fundamental than
           | appears at first. It means that the only way to touch the
           | model, and for the model to touch the world, is through the
           | tokenizer (also, btw, why tokenizer is so essential to model
           | performance). Touching the world through a tokenizer is
           | actually quite limited.
           | 
           | So there is an intelligence in there for sure, but it is
           | locked in an ontology that is tied to its interface. This is
           | even more of a limitation than e.g. weights being frozen.
        
         | gpm wrote:
         | If you want to download a backup of a large chunk of human
         | knowledge... download wikipedia. It's a similar size to a small
         | LLM and can actually distinguish between real life and fantasy:
         | https://en.wikipedia.org/wiki/Wikipedia:Database_download
         | 
         | If you just want to play around with an LLM though, absolutely.
        
           | int_19h wrote:
           | Kiwix provides prepackaged highly compressed archives of
           | Wikipedia, Project Gutenberg, and many other useful things:
           | https://download.kiwix.org/zim/.
           | 
           | Between that and dirt cheap storage prices, it is possible to
           | have a local, offline copy of more human knowledge than one
           | can sensibly consume in a lifetime. Hell, it's possible to
           | have it all on one's _smartphone_ (just get one with an SD
            | card slot and shove a 1+ TB one in there).
        
             | claritise wrote:
                | Just create a RAG with Wikipedia as the corpus and a low-
                | parameter model to run it, and you can basically have an
                | instantly queryable corpus of human knowledge runnable on
                | an old Raspberry Pi.
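                | 
                | Roughly the shape of it, as a sketch (assuming
                | llama-cpp-python and sentence-transformers, plus a FAISS
                | index you've already built over chunked Wikipedia text;
                | the file and model names below are just placeholders):
                | 
                |     # Hypothetical minimal RAG over pre-chunked Wikipedia text.
                |     import numpy as np
                |     import faiss
                |     from sentence_transformers import SentenceTransformer
                |     from llama_cpp import Llama
                | 
                |     embedder = SentenceTransformer("all-MiniLM-L6-v2")
                |     index = faiss.read_index("wiki_chunks.faiss")           # built offline
                |     chunks = open("wiki_chunks.txt").read().split("\n\n")   # same order as index
                |     llm = Llama(model_path="tinyllama-1.1b.Q4_K_M.gguf", n_ctx=2048)
                | 
                |     def ask(question, k=3):
                |         # embed the question, fetch the k nearest chunks, stuff them in the prompt
                |         q = embedder.encode([question])
                |         _, ids = index.search(np.asarray(q, dtype="float32"), k)
                |         context = "\n\n".join(chunks[i] for i in ids[0])
                |         prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
                |         return llm(prompt, max_tokens=256)["choices"][0]["text"]
                | 
                |     print(ask("When was the Library of Alexandria destroyed?"))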
        
               | CaptainOfCoit wrote:
               | > a low parameter model
               | 
               | > on an old raspberry pi
               | 
               | I bet the LLM responses will be great... You're better
               | off just opening up a raw text dump of Wikipedia markup
               | files in vim.
        
               | boywitharupee wrote:
               | but which model to tokenize with? is there a leaderboard
               | for models that are good for RAG?
        
               | sroussey wrote:
               | "For RAG" is ambiguous.
               | 
               | First there is a leaderboard for embeddings. [1]
               | 
               | Even then, it depends how you use them. Some embeddings
               | pack the highest signal in the beginning so you can
               | truncate the vector, while most can not. You might want
               | that truncated version for a fast dirty index. Same with
               | using multiple models of differing vector sizes for the
               | same content.
               | 
               | Do you preprocess your text? There will be a model there.
               | Likely the same model you would use to process the query.
               | 
               | There is a model for asking questions from context.
               | Sometimes that is a different model. [2]
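                | 
                | For the truncation trick, a sketch of what I mean (only
                | sensible if the embedding model was trained to front-load
                | signal, Matryoshka-style; the dimensions and sizes here
                | are arbitrary):
                | 
                |     # Coarse search on truncated vectors, then re-rank with full vectors.
                |     import numpy as np
                | 
                |     def search(query_vec, vecs, keep_dims=128, shortlist=100, k=10):
                |         # fast, dirty pass over a truncated copy of the index
                |         coarse = vecs[:, :keep_dims] @ query_vec[:keep_dims]
                |         cand = np.argpartition(-coarse, shortlist)[:shortlist]
                |         # precise re-rank of the shortlist using the full vectors
                |         fine = vecs[cand] @ query_vec
                |         return cand[np.argsort(-fine)[:k]]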
        
             | Workaccount2 wrote:
             | Pretty neat to have laying around, thanks
        
           | CaptainOfCoit wrote:
           | > actually distinguish between real life and fantasy
           | 
           | Are LLMs unable to distinguish between real life and fantasy?
           | What prompts have you thrown at them to make this
           | determination? Sending a small fairy tale and asking the LLM
           | if it thinks it's a real story or fake one?
        
             | gpm wrote:
             | ... having them talk about events from sci fi stories in
             | response to questions about the real world. Having them
             | confidently lie about pretty much everything. Etc.
        
               | CaptainOfCoit wrote:
               | What are the specific prompts you're using? You might get
               | those answers when you're not being specific enough (or
               | use models that aren't state of the art).
               | 
               | "Shit in, shit out" as the saying goes, but applied to
               | conversations with LLMs where the prompts often aren't
               | prescriptive enough.
        
         | simonw wrote:
         | I strongly recommend that people run LLMs locally for a
         | different reason.
         | 
         | The ones you can run on your own machine tend to be bad -
         | really bad. They hallucinate wildly and fail at all sorts of
         | tasks that the larger hosted ones succeed at.
         | 
         | This makes them a fantastic tool for learning more about how
         | LLMs work and what they're useful for. Interacting with a weak-
         | but-functional LLM that runs on your own computer is a great
         | way to get a much more solid mental model for what these things
         | actually are.
        
           | devsda wrote:
           | For someone interested in learning about LLMs, running them
           | locally is a good way to understand the internals.
           | 
           | For everyone else, I wish they experience these (locally or
            | elsewhere) _weak_ LLMs at least once before using the
           | commercial ones just to understand various failure modes and
           | to introduce a healthy dose of skepticism towards the results
            | instead of blindly trusting them to be the facts/truth.
        
             | simonw wrote:
             | Completely agree. Playing around with a weak LLM is a great
             | way to give yourself a little bit of extra healthy
             | skepticism for when you work with the strong ones.
        
             | mmahemoff wrote:
             | How do you learn about the internals by running LLMs
              | locally? Are you playing with the code, runtime params, or
             | just interacting via chat?
        
               | samus wrote:
               | The abstractions are relatively brittle. If you don't
               | have a powerful GPU, you will be forced to consider how
               | to split the model between CPU and GPU, how much context
               | size you need, whether to quantize the model, and the
               | tradeoffs implied by these things. To understand these,
               | you have to develop a basic model how an LLM works.
        
               | barrkel wrote:
               | By interacting with it. You see the contours of its
               | capabilities much more clearly, learn to recognize
               | failure modes, understand how prior conversation can set
               | the course of future conversation in a way that's almost
               | impossible to correct without starting over or editing
               | the conversation history.
        
             | samus wrote:
             | This skepticism is completely justified since ChatGPT 3.5
             | is also happily hallucinating things that don't exist. For
             | example how to integrate a different system Python
             | interpreter into pyenv. Though maybe ChatGPT 4 doesn't :)
        
           | kersplody wrote:
            | Local LLMs are also a fantastic tool for creative endeavors.
            | Without prompt injection, and with the ability to modify the
            | amount of noise and "creativity" in the output, absolutely
           | bonkers things pop out.
        
           | tracerbulletx wrote:
           | I don't really think this is true, you can't really
           | extrapolate the strengths and weaknesses of bigger models
           | from the behavior of smaller/quantized models and in fact a
           | lot of small models are actually great at lots of things and
           | better at creative writing. If you want to know how they
           | work, just learn how they work, it takes like 5 hours of
            | watching YouTube videos if you're a programmer.
        
             | simonw wrote:
             | Sure, you can't extrapolate the strengths and weaknesses of
             | the larger ones from the smaller ones - but you still get a
             | much firmer idea of what "they're fancy autocomplete"
             | actually means.
             | 
             | If nothing else it does a great job of demystifying them.
             | They feel a lot less intimidating once you've seen a small
             | one running on your computer write a terrible haiku and
             | hallucinate some non-existent API methods.
        
               | fzzzy wrote:
               | It's funny that you say this, because the first thing I
               | tried after ChatGPT came out (3.5-turbo was it?) was
               | writing a haiku. It couldn't do it at all. Also, after 4
                | came out, it hallucinated an API that wasted a day for
                | me. It's an API that absolutely should have existed, but
                | didn't. Now, I frequently apply LLMs to things that are
               | easily verifiable, and just double check everything.
        
           | fragmede wrote:
           | The other reason is to find out what a detuned model is
           | capable of. The canonical example is how to make cocaine,
           | which ChatGPT will admonish you for even asking, while
           | llama2-uncensored will happily describe the process which is
           | only really interesting if you're an amateur chemist and want
           | to be Scarface-that-knocks. (the recipe is relatively easy,
           | it's getting access to the raw ingredients that's the hard
           | part, same as with nukes.)
           | 
           | if you accidentally use the word"hack" when trying to get
           | ChatGPT to write some code for you. it'll stop and tell you
           | that hacking is bad, and not a colloquial expression, and
           | refuse to go further.
           | 
            | Privacy is another reason to try a local LLM. For
           | the extremely paranoid (justified or not), a local LLM gives
           | users a place to ask questions without the text being fed to
           | a server somewhere for later lawsuit discovery (Google
           | searches are routinely subpoenaed, it's only a matter of time
           | until ChatGPT chats are as well.)
           | 
           | There's an uncensored model for vision available as well. The
           | censored vision models won't play the shallow game of hot or
           | not with you.
           | 
           | There are uncensored image generation models as well, but,
           | ah, those are NSFW and not for polite company. (As well as
            | there's multiple theses' worth of content on what that'll do
           | to society.)
        
             | astrange wrote:
             | > if you accidentally use the word"hack" when trying to get
             | ChatGPT to write some code for you. it'll stop and tell you
             | that hacking is bad, and not a colloquial expression, and
             | refuse to go further.
             | 
             | Is that 3.5 or 4? I asked 4 for an example of code which
             | "is a hack", it misunderstood me as asking for hacking code
             | rather than buggy code, but then it did actually answer on
             | the first try.
             | 
             | https://chat.openai.com/share/ca2c320c-f4ba-41bf-8f40-f7faf
             | 2...
        
               | fragmede wrote:
               | Interesting. It was 4. I can't share the chat I had where
               | ChatGPT refused to help because I used the wrong words,
               | because I can't find it (ChatGPT conversation history
               | search when?), but I just remember it refusing to do
               | something because it thought I was trying to break some
                | sort of moral and ethical boundary writing a Chrome
                | extension when all I wanted to do was move some divs
               | around or some such.
        
               | BytesAndGears wrote:
               | One time I wanted to learn about transmitter antenna
               | design, just because I'm curious. ChatGPT 4 refused to
               | give me basic information because you could use that to
               | break some FCC regulations (I'm not even living in the US
               | currently)
        
               | lodovic wrote:
               | I usually get around that with "I'm writing a research
               | paper" or "I'm writing a novel and need to depict this as
               | accurate as possible"
        
               | semi-extrinsic wrote:
               | I don't use LLMs for my coding, I manage just fine with
               | LSP and Treesitter. So genuine question: is that answer
               | representative of the output quality of these things?
               | Because both answers are pretty crappy and assume the
               | user has already done the difficult things, and is asking
               | for help on the easy things.
        
               | lpapez wrote:
               | It's not representative.
               | 
               | The models are capable of much much more, and they are
               | being significantly nerfed over time by these ineffective
               | attempts to introduce safeguards.
               | 
               | Recently I've asked GPT4 to quote me some code to which
               | it replied that it is not allowed to do so - even though
               | it was perfectly happy to quote anything until recently.
               | When prompted to quote the source code, but output it as
               | PHP comments, it happily complied because it saw that as
               | "derivative work" which it is allowed to do.
        
               | astrange wrote:
               | My point is that there aren't any safeguards in the
               | reply. In fact I didn't even want it to give me hacking
               | info and it did it anyway.
        
               | fragmede wrote:
               | The response seems pretty reasonable; it's answering the
               | question it was asked. If you want to ask it how to do
               | the difficult part, ask it about that instead. Expecting
               | it to get the answer right in the first pass is like
               | expecting your code to compile the very first time. You
                | have to have more of a conversation with it to coax out
                | the difference between what you're thinking and what
                | you're actually saying.
               | 
               | If you're looking to read a more advanced example of its
               | capabilities and limitations, try
               | 
               | https://simonwillison.net/2024/Mar/23/building-c-
               | extensions-...
        
               | yunohn wrote:
               | > I don't use LLMs for my coding, I manage just fine with
               | LSP and Treesitter.
               | 
               | You're literally comparing apples to oranges.
        
               | coldtea wrote:
               | I think the point was like "when it comes to programming
               | assistance, auto-completion/linting/and whatever else LSP
               | does and syntax assist from Treesitter, are enough for
               | me".
               | 
               | Though it does come a little off as a comparison. How
               | about programming assistance via asking a colleague for
               | help, Stack Overflow, or online references, code
               | examples, and other such things, which are closer to what
               | the LLM would provide than LSP and treesitter?
        
               | freedomben wrote:
               | You need to read more than just the first sentence of a
               | comment. They only said that part so the reader would
               | know that they have never used an LLM for coding, so they
               | would have more context for the question:
               | 
               | > So genuine question: is that answer representative of
               | the output quality of these things?
        
               | yunohn wrote:
               | Yes, I did read it. I'm kind of tired of HNers loudly
               | proclaiming they are ignoring LLMs more than a year into
               | this paradigm shift.
               | 
               | Is it that hard to input a prompt into the free version
               | of ChatGPT and see how it helps with programming?
        
               | jpc0 wrote:
               | I did exactly that and found it lackluster for the domain
               | I asked it for.
               | 
                | And most of the use I've seen of it, a good LSP
                | realistically covers.
                | 
                | Or to put it another way: it's no good at writing
                | algorithms or data structures (or at least no better than
                | I would have done with a first draft, but the first draft
                | puts me ahead of the LLM in understanding the actual
                | problem at hand; handing it off to an LLM doesn't help me
                | get to the final solution faster).
                | 
                | So that leaves writing boilerplate, but considering my
                | experience with it writing more complex stuff, I would
                | need to read over the boilerplate code to ensure it's
                | correct, in which case I may as well have written it
                | myself.
        
               | yunohn wrote:
               | > found it lackluster for the domain I asked it for
               | 
               | Fair, that is possible depending on your domain.
               | 
               | > It's no good at writing algorithms or data structures
               | 
               | In my experience, this is untrue. I've gotten it to write
               | algorithms with various constraints I had. You can even
               | tell it to use specific function signatures instead of
               | any stdlib, and make changes to tweak behavior.
               | 
               | > And most use I've seen on it realistically a good LSP
               | covers.
               | 
               | Again, I really don't understand this comparison. LSPs
               | and LLMs go hand in hand.
               | 
               | I think it's more of a workflow clash. One really needs
               | to change how they operate to effectively use LLMs for
               | programming. If you're just typing nonstop, maybe it
               | would feel like Copilot is just an LSP. But, if you try
               | harder, LLMs are game changers when:
               | 
               | - maybe you like rubber ducking
               | 
               | - need to learn a new concept and implement it
               | 
               | - or need to glue things together
               | 
               | - or for new projects or features
               | 
               | - or filling in boilerplate based on existing context.
        
               | jpc0 wrote:
               | https://chat.openai.com/share/c8c19f42-240f-44e7-baf4-50e
               | e5e...
               | 
               | https://godbolt.org/z/s9Yvnjz7K
               | 
               | I mean I could write the algorithm by hand pretty quickly
               | in C++ and would follow the exact same thought pattern
               | but also deal with the edge cases. And factoring in the
               | loss of productivity from the context switch that is a
               | net negative. This algorithm is also not generic over
               | enough cases but that is just up to the prompt.
               | 
               | If I can't trust it to write `strip_whitespace` correctly
               | which is like 5 lines of code, can I trust it to do more
               | without a thorough review of the code and writing a ton
               | of unit tests... Well I was going to do that anyway.
               | 
               | The argument that I just need to learn better prompt
               | engineering to make the LLM do what I want just doesn't
               | sit with me when instead I could just spend the time
               | writing the code. As I said your last point is absolutely
               | the place I can see LLMs being actually useful but then I
               | need to spend a significant amount of time in code review
               | for generated code from an "employee" who is known to
                | make up interfaces or entire libraries that don't
               | exist.
        
               | mrtranscendence wrote:
               | I'm a Python-slinging data scientist so C++ isn't my jam
               | (to say the least), but I changed the prompt to the
                | following and gave it to GPT-4:
               | 
               | > Write me an algorithm in C++ which finds the begin and
               | end iterator of a sequence where leading and trailing
               | whitespace is stripped. Please write secure code that
               | handles any possible edge cases.
               | 
               | It gave me this:
               | 
               | https://chat.openai.com/share/55a4afe2-5db2-4dd1-b516-a3c
               | acd...
               | 
               | I'm not sure what other edge cases there might be,
               | however. This only covers one of them.
               | 
               | In general, I've found LLMs to be _marginally_ helpful.
                | Like, I can't ever remember how to get matplotlib to
               | give me the plot I want, and 9 times out of 10 GPT-4
               | easily gives me the code I want. Anything even _slightly_
               | off the beaten path, though, and it quickly becomes
               | absolutely useless.
        
               | jpc0 wrote:
               | My guess is that this was generated using GPT4?
               | 
               | Free GPT I get https://chat.openai.com/share/f533429d-63c
               | a-4505-8dc8-b8d2e7... which has exactly the same problem
               | as my previous example and doesn't consider the string of
               | all whitespace.
               | 
               | Sure GPT4 is better at that, it wasn't the argument made.
               | 
               | The example you gave absolutely was the code I would
               | write on a first draft since it does cover the edge cases
               | (assuming we aren't dealing with the full UTF charset and
               | all that could be considered a space there).
               | 
               | However this is code that is trivial to write in any
               | language and the "Is it that hard to input a prompt into
               | the free version of ChatGPT and see how it helps with
               | programming? " argument doesn't hold up. Am I to believe
               | it will implement something more complex correctly. This
               | is also code that would absolutely be in hundreds of
               | codebases so GPT has tons of context for it.
        
               | yunohn wrote:
               | Yeah honestly, I think you have a completely different
               | expectation and style of usage than what is optimal with
               | LLMs. I don't have the energy to convince you further,
               | but maybe one day it'll click for you? No worries either
               | way.
        
               | jpc0 wrote:
                | Could you maybe give me an example of what is considered
                | an optimal use of LLMs?
               | 
               | Maybe a prompt to GPT
        
               | fragmede wrote:
                | Simonw's blog has some examples that I'd say show off its
               | usefulness and limitations, eg
               | 
               | https://simonwillison.net/2024/Mar/23/building-c-
               | extensions-...
               | 
               | (linked previously above)
        
               | yunohn wrote:
               | Like sibling commenter mentioned, simonw's blog is a
               | great resource.
               | 
               | Regarding your point around being able to whip up the
               | code yourself - the point is to have a decent starting
               | point to save time and energy. Like you said, you know
               | the edge cases so you could skip the boring parts using
               | GPT and focus purely on fixing those. Though, with more
               | prompting (especially providing examples), GPT can also
               | handle that for you.
               | 
               | I have nearly 2 decades of experience as a developer and
               | it took me a while to reorient my flow around LLMs. But
               | now that I have, it's truly gamechanging.
               | 
               | And since you asked, here's my system prompt:
               | 
               | You are an experienced developer who follows industry
               | standards and best practices. Write lean code and explain
               | briefly using bullet points or numbered lists. Elaborate
               | only when explaining concepts or making choices. Always
               | mention which file and where to store provided code.
               | 
               | Tech Stack: < insert all the languages, frameworks, etc
               | you'd like to use >
               | 
               | If I provide code, highlight and explain problematic
               | code. Also show and explain the corrected code.
               | 
               | Take a deep breath and think step by step.
               | 
               | Also, always use GPT4 and customize the above to your
               | style and liking.
        
               | mrtranscendence wrote:
               | I think you have the mistaken impression that I was
               | arguing with you (certainly my comment makes it clear
               | that I don't feel that LLMs are a panacea). I merely
               | thought that you might be curious how GPT-4 would
               | respond.
               | 
               | > My guess is that this was generated using GPT4?
               | 
               | This is a good guess, since I stated outright that I used
               | GPT-4, and then mentioned GPT-4 later on in the comment.
        
               | jpc0 wrote:
               | I was curious and yes I was mistaken.
        
               | astrange wrote:
               | I asked a stupid question and got a stupid answer.
               | Relatively speaking the answer was stupider than it
               | should have been, so yes, it was wrong.
               | 
               | I asked it to try again and got a better result though,
               | just didn't include it.
        
               | rpigab wrote:
               | I asked ChatGPT for some dataviz task (I barely ever do
               | dataviz myself) and it recommended some nice Python
               | libraries to use, some I had already heard of and some I
               | hadn't, and provided the code.
               | 
               | I'm grateful because I thought code LLMs only sped up the
               | "RTFM" part, but it made me find those libs so I didn't
               | have to Google around for them (and sometimes it's hard to
               | guess if they're the right tool for the job, and they
               | might be behind in SEO).
        
               | miki123211 wrote:
               | There are three things I find LLMs really excellent at
               | for coding:
               | 
               | 1. Being the "senior developer" who spent their whole
               | career working with a technology you're very junior at.
               | No matter what you do and how long your programming
               | career is, you're inevitably going to run into one of
               | these sooner or later. Whether it's build scripts,
               | frontend code, interfacing with third-party APIs or
               | something else entirely, you aren't an expert at every
               | technology you work with.
               | 
               | 2. Writing the "boring" parts of your program, and every
               | program has some of these. If you're writing a service to
               | fooize a bar really efficiently, Copilot won't help you
               | with the core bar fooization algorithm, but will make you
               | a lot faster at coding up user authentication, rate
               | limiting for different plans, billing in whatever obscure
               | payment method your country uses etc.
               | 
               | 3. Telling you what to even Google for. This is where raw
               | Chat GPT comes into play, not Copilot. Let's say you need
               | a sorting algorithm that preserves the order of equal
               | elements from the original list. This is called stable
               | sorting, and Googling for stable sorting is a good way to
               | find what you're looking for, but Chat GPT is usually a
               | better way to tell you what it's called based on the
               | problem description.
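               | 
               | (If "stable" is new to you, a tiny Python illustration of
               | what it means -- records with equal keys keep their
               | original relative order:)
               | 
               |   pairs = [("alice", 2), ("bob", 1),
               |            ("carol", 2), ("dave", 1)]
               |   # sorted() in Python is stable: equal keys keep their
               |   # original order, so ("bob", 1) stays ahead of ("dave", 1).
               |   print(sorted(pairs, key=lambda p: p[1]))
               |   # [('bob', 1), ('dave', 1), ('alice', 2), ('carol', 2)]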
        
             | kevingadd wrote:
             | If you want to be an amateur chemist I recommend not
             | getting your instructions from an LLM that might be
             | hallucinating. Chemistry can be very dangerous if you're
             | following incorrect instructions.
        
               | isoprophlex wrote:
               | From experience as a failed organic chemist (who happily
               | switched to computational chemistry for reasons of self
               | preservation) I can tell you it's plenty dangerous when
               | you're following correct instructions :^)
        
               | rpigab wrote:
               | Yes, just as the best professional cooks recommend
               | not boiling cow eggs, as they can explode.
        
               | slowmovintarget wrote:
               | They don't explode, the shell simply cracks and then you
               | get egg soup.
               | 
               | Now microwaving eggs... that's a different matter.
        
               | rpigab wrote:
               | I was talking about cow eggs specifically! When ChatGPT
               | et al got out, one of the funniest things to do was ask
               | it about the best recipes for cow egg omelette or camel
               | egg salad, and the LLM would provide. Sadly, most of it
               | got patched somehow.
        
               | slowmovintarget wrote:
               | Oops... Yep, I missed that too. (On the internet, no one
               | knows you're a dog.)
               | 
               | That's funny. It makes me wonder how these statistical
               | mad libs machines will handle the gradual boundaries
               | nature gives us. Almost all mammals give birth live, but
               | not all. Nearly all mammals had mammalian parents, but
               | not all.
               | 
               | Daniel Dennett was making this argument for why we
               | haven't developed reasonable models for the nature of
               | consciousness. It's because we're so sure there will be
               | an absolute classification, and not a gradual
               | accumulation of interacting systems that together yield
               | the phenomenon.
        
             | supposemaybe wrote:
             | Links to all these models you speak of?
        
               | fragmede wrote:
               | https://huggingface.co/georgesung/llama2_7b_chat_uncensor
               | ed
               | 
               | https://huggingface.co/SkunkworksAI/BakLLaVA-1
               | 
               | you'll have to brave 4chan yourself to find links to the
               | NSFW ones, I don't actually have them.
        
               | supposemaybe wrote:
               | I just can't brave the venture to 4chan, I may get mugged
               | or worse.
        
             | gryn wrote:
             | > There's an uncensored model for vision available as well.
             | 
             | you mean the LLaVA-based variants?
        
               | fragmede wrote:
               | https://huggingface.co/SkunkworksAI/BakLLaVA-1
        
             | bambax wrote:
             | > _if you accidentally use the word "hack" [with]
             | ChatGPT..._
             | 
             | Side note: ChatGPT is now completely useless for most
             | creative tasks. I'm trying to use it, via NovelCrafter, to
             | help flesh out a story where a minor character committed
             | suicide. ChatGPT refuses to respond, mentioning "self harm"
             | as a reason.
             | 
             | The character in question killed himself before the story
             | even begins (and for very good reasons, story-wise); it's
             | not like one's asking about ways to commit suicide.
             | 
             | This is insane, ridiculous, and different from what all
             | other actors in the industry do, including Claude or
             | Mistral. It seems OpenAI is trying to shoot itself in the
             | foot and doing a pretty good job at it.
        
               | marpstar wrote:
               | I've been frustrated by this, too. Trying to ask for ways
               | to support a close family member who experienced sexual
               | trauma. ChatGPT won't touch the topic.
        
               | luma wrote:
               | OpenAI is angling for enterprise users who have different
               | notions about safety. Writing novels isn't the use case,
               | powering customer service chatbots that will never ever
               | ever say "just kill yourself" is.
        
               | barfingclouds wrote:
               | Darn I guess you'll have to go back to living in the dark
               | ages and actually write it yourself
        
             | anukin wrote:
             | Which uncensored model is willing to play hot or not? I
             | only knew about LLaVA. Are there other such models now?
        
           | tgma wrote:
           | If you have an >=M1-class machine with sufficient RAM, the
           | medium-sized models that are on the order of 30GB in size
           | perform decently enough on many tasks to be quite useful
           | without leaking your data.
        
             | bongobingo1 wrote:
             | What is sufficient RAM in that case? 30gb+? Or can you get
             | by streaming it?
        
               | AaronFriel wrote:
               | 30gb+, yeah. You can't get by streaming the model's
               | parameters: NVMe isn't fast enough. Consumer GPUs and
               | Apple Silicon processors boast memory bandwidths in the
               | hundreds of gigabytes per second.
               | 
               | To a first order approximation, LLMs are bandwidth
               | constrained. We can estimate single batch throughput as
               | Memory Bandwidth / (Active Parameters * Parameter Size).
               | 
               | An 8-bit quantized Llama 2 70B conveniently uses 70GiB of
               | VRAM (and then some, let's ignore that.) The M3 Max with
               | 96GiB of VRAM and 300GiB/s bandwidth would have a peak
               | throughput around 4.2 tokens per second.
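               | 
               | That estimate as a rough sketch in code (same numbers as
               | above; everything else is an idealized upper bound):
               | 
               |   def peak_tok_per_s(bw_gib_s, params_b, bytes_per_param):
               |       # Single-batch decoding reads every active parameter
               |       # once per token, so throughput is roughly
               |       # bandwidth / model size in bytes.
               |       model_gib = params_b * bytes_per_param
               |       return bw_gib_s / model_gib
               | 
               |   # Numbers from above: 300 GiB/s, 70B params, 8-bit.
               |   print(round(peak_tok_per_s(300, 70, 1.0), 1))
               |   # ~4.3, i.e. the ~4.2 tokens/s ballpark above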
               | 
               | Quantized models trade reduced quality for lower VRAM
               | requirements and may also offer higher throughput with
               | optimized kernels, largely as a consequence of
               | transferring less data from VRAM into the GPU die for each
               | parameter.
               | 
               | Mixture of Expert models reduce active parameters for
               | higher throughput, but disk is still far too slow to page
               | in layers.
        
             | noman-land wrote:
             | I'm using Mixtral 8x7b as a llamafile on an M1 regularly
             | for coding help and general Q&A. It's really something
             | wonderful to just run a single command and have this
             | incredible offline resource.
        
               | tgma wrote:
               | I concur; in my experience Mixtral is one of the best
               | ~30G models (likely the best pro laptop-size model
               | currently) and Gemma is quite good compared to other
               | sub-8GB models.
        
               | tchvil wrote:
               | By any chance, do you have a good link to some help with
               | the installation?
        
               | yaantc wrote:
               | Use llamafile [1], it can be as simple as downloading a
               | file (for mixtral, [2]), making it executable and running
               | it. The repo README has all the info, it's simple and
               | downloading the model is what takes the most time.
               | 
               | In my case I got the runtime detection issue (explained
               | in the README "gotcha" section). Solved by running
               | "assimilate" [3] on the downloaded llamafile.
               | 
               | [1] https://github.com/Mozilla-Ocho/llamafile/
               | [2] https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true
               | [3] https://cosmo.zip/pub/cosmos/bin/assimilate
        
               | tchvil wrote:
               | Thank you !
        
               | tgma wrote:
               | Either https://lmstudio.ai (desktop app with nice GUI) or
               | https://ollama.com (command-line tool, more like a docker
               | container that you can also hook up to a web UI via
               | https://openwebui.com) should be super straightforward to
               | get running.
        
               | tchvil wrote:
               | Thank you for letting me know it was possible on an M1.
               | I'll try all this now.
        
               | chown wrote:
               | I am the author of Msty [1]. My goal is to make it as
               | straightforward as possible with just one click (once you
               | download the app). If you try it, let me know what you
               | think.
               | 
               | 1: https://msty.app
        
               | tchvil wrote:
               | I'll try in a week+ when I'm back to a fast connection.
               | Thank you.
        
               | yunohn wrote:
               | Why is this both free and closed source? Ideally, when
               | you advertise privacy-first, I'd like to see a GitHub
               | link with real source code. Or I'd rather pay for it to
               | ensure you have a financial incentive to not sell my
               | data.
        
               | chown wrote:
               | It will be paid down the road, but we are not there yet.
               | It's all offline, data is locally saved. You own it, we
               | don't have it even if you ask for it.
        
             | supposemaybe wrote:
             | It's an awful thing for many to accept, but just
             | downloading and setting up an LLM which doesn't connect to
             | the web doesn't mean that your conversations with said LLM
             | won't be a severely interesting piece of telemetry that
             | Microsoft and (likely Apple) would swipe to help deliver a
             | 'better service' to you.
        
           | jonnycomputer wrote:
           | They are not so bad as you are making out, tbh.
           | 
           | And privacy is a good enough reason to use local LLMs over
           | commercial ones.
        
           | gardenhedge wrote:
           | You can just chat to ChatGPT for a while about something you
           | know about and you'll learn that.
        
           | gfodor wrote:
           | I mean kinda. But there's a good chance this is also
           | misleading. Lots of people have been fooled into thinking
           | LLMs are inherently stupid because they have had bad
           | experiences with GPT-3.5. The whole point is that the
           | mistakes they make and even more fundamentally _what they're
           | doing_ changes as you scale them up.
        
           | hylaride wrote:
           | _The ones you can run on your own machine tend to be bad -
           | really bad. They hallucinate wildly and fail at all sorts of
           | tasks that the larger hosted ones succeed at._
           | 
           | Totally. I recently asked a locally-run "speed" LLM for the
           | best restaurants in my (major) city, but it spit out
           | restaurants opened by chefs from said city in other cities.
           | It's not a thing you'd want to rely on for important work,
           | but is still quite something.
        
           | barfingclouds wrote:
           | Why not just interact with a virtual one that's equally weak?
           | You get all the same benefits
        
         | jrflowers wrote:
         | It is invaluable to have a chunk of human knowledge that can
         | tell you things like the Brooklyn Nets won the 1986 Cricket
         | World Cup by scoring 46 yards in only 3 frames
        
           | fragmede wrote:
           | According to ChatGPT
           | 
           | > Australia won the 1987 Cricket World Cup. The 1986 date is
           | incorrect; there was no Cricket World Cup in 1986. The
           | tournament took place in 1987, and Australia defeated England
           | in the final to win their first title.
           | 
           | https://chat.openai.com/share/e9360faa-1157-4806-80ea-563489.
           | ..
           | 
           | I'm no cricket fan, so someone will have to correct Wikipedia
           | if that's wrong.
           | 
           | If you want to point out that LLMs hallucinate, you might
           | want to speak plainly and just come out and say it, or at
           | least give a real world example and not one where it didn't.
        
             | vlunkr wrote:
             | We're not talking about running chatGPT locally though, are
             | we?
        
               | fragmede wrote:
               | _sigh_ you're going to make me open my laptop, aren't you.
        
               | fragmede wrote:
               | I ran 'who won the 1986 Cricket World Cup' against
               | llama2-uncensored (the local model I have pre-downloaded)
               | and hilariously got 5 different answers asking it 5 times:
               | 
               |   >>> who won the 1986 Cricket World Cup    India
               |   >>> who won the 1986 Cricket World Cup    Australia
               |   >>> who won the 1986 Cricket World Cup    New Zealand
               |   >>> who won the 1986 Cricket World Cup    West Indies
               |   >>> who won the 1986 Cricket World Cup    England
               | 
               | Which proves GP's point about hallucinations, though none
               | of those are
               | 
               | > Brooklyn Nets won the 1986 Cricket World Cup by scoring
               | 46 yards in only 3 frames
               | 
               | LLMs' hallucinations are insidious because they have the
               | ring of truth about them. Yards and frames aren't
               | cricket terms, so we're off to the races with them.
        
               | astrange wrote:
               | If you want factual answers from a local model it might
               | help to turn the temperature down.
        
               | jrflowers wrote:
               | > If you want factual answers from a local model it might
               | help to turn the temperature down.
               | 
               | This makes sense. If you interact with a language model
               | and it says something wrong it is your fault
        
               | astrange wrote:
               | You're not "interacting with a language model", you're
               | running a program (llama.cpp) with a sampling algorithm
               | which is not set to maximum factualness by default.
               | 
               | It's like how you have to set x264 to the anime tuning or
               | the film tuning depending on what you run it on.
        
               | fragmede wrote:
               | It would also help if I had more VRAM and wasn't running
               | a 7B parameter 4-bit quantized model.
        
               | beefnugs wrote:
               | Actually, isn't this good? It means we can run something
               | multiple times to show an answer can't be trusted?
        
               | latexr wrote:
               | You can ask LLMs the same question and they might
               | sometimes get it wrong and other times get it right.
               | Having different answers is no indication that none of
               | them is correct.
               | 
               | Furthermore, even if an LLM always gives the same answer
               | to a question, there's no guarantee the answer is
               | correct.
               | 
               | https://en.wikipedia.org/wiki/Propaganda
               | 
               | https://en.wikipedia.org/wiki/Big_lie#Alleged_quotation
        
               | sroussey wrote:
               | An LLM will always give the same output for the same
               | input. It's sorta like a random number generator that
               | gives the same list of "random" numbers for the same
               | seed. LLMs get a seed too.
        
               | latexr wrote:
               | That's irrelevant for the matter. The person I replied to
               | obviously did not have seeded responses in mind.
        
               | ilaksh wrote:
               | You should specify the model size and temperature.
               | 
               | For fact retrieval you need to use temperature 0.
               | 
               | If you don't get the right facts then try 34b, 70b,
               | Mixtral, Falcon 180b, or another highly ranked one that
               | has come out recently like DBRX.
        
           | samus wrote:
           | The facts LLMs learned from training are fuzzy, unreliable,
           | and quickly outdated. You actually want retrieval-augmented
           | generation (RAG) where a model queries an external system for
           | facts or to perform calculations and postprocesses the
           | results to generate an answer for you.
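           | 
           | A minimal sketch of that loop -- embed, retrieve, stuff the
           | prompt, generate -- where embed/search/generate are
           | stand-ins for whatever you actually run:
           | 
           |   def rag_answer(question, embed, search, generate, k=3):
           |       # 1. Retrieve the k most relevant documents.
           |       docs = search(embed(question), k=k)
           |       # 2. Put the retrieved facts into the prompt so the
           |       #    model doesn't rely on what it memorized.
           |       ctx = "\n\n".join(d["text"] for d in docs)
           |       prompt = ("Answer using only the context below; "
           |                 "say 'unknown' otherwise.\n\n"
           |                 f"Context:\n{ctx}\n\n"
           |                 f"Question: {question}\nAnswer:")
           |       # 3. Let the LLM postprocess the retrieved results.
           |       return generate(prompt)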
        
             | unshavedyak wrote:
             | Is there a name for the reverse? I'm interested in having a
             | local LLM monitor an incoming, stateful data stream.
             | Imagine chats. It should have the capability of tracking
             | the current day, active participants, active topics, etc -
             | and then use that stateful world view to associate metadata
             | with incoming streams during indexing.
             | 
             | Then after all is indexed you can pursue RAG on a richer
             | set of metadata. Though I've got no idea what that stateful
             | world view is.
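             | 
             | Roughly what I have in mind, as a hypothetical sketch
             | (`extract` and `index` are placeholders for the model call
             | and the indexer):
             | 
             |   state = {"day": None, "people": set(), "topics": []}
             | 
             |   def ingest(msg, extract, index):
             |       # `extract` is some model call returning updates to
             |       # the running world state for this new message.
             |       upd = extract(state, msg)
             |       state["people"].add(msg["author"])
             |       state["day"] = upd.get("day", state["day"])
             |       state["topics"] = upd.get("topics", state["topics"])
             |       # Index the message plus a snapshot of that state so
             |       # later RAG queries can filter on day/topic/people.
             |       index(msg, metadata={
             |           "day": state["day"],
             |           "people": sorted(state["people"]),
             |           "topics": list(state["topics"]),
             |       })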
        
         | TheCaptain4815 wrote:
         | It's kind of crazy, really. Before LLMs, in any kind of world-
         | scale disaster, what would you hope for? Wikipedia backups? Now,
         | a single LLM run locally would be much more effective. Imagine
         | the local models in 5 years!
        
           | danmur wrote:
           | Uh yeah, I would, and still do, take the Wikipedia backup for
           | doomsday scenarios. I'm not even sure how that would be a
           | competition.
        
           | Zambyte wrote:
           | The processing required to run current language models with a
           | useful amount of knowledge encoded in them is way more than I
           | imagine would be available in a "world scale disaster".
        
           | int_19h wrote:
           | There's a lot more than just Wikipedia that gets archived,
           | and yes, that is a far more sensible way to go about it. For
           | one thing, the compute required to then read it back is
           | orders of magnitude less (a 15 year old smartphone can handle
           | it just fine). For another, you don't have to wonder how much
           | of what you got back is hallucinated - data is either there
           | or it's corrupted and unreadable.
        
         | creatonez wrote:
         | Maybe I'm seeing things through a modern lens, but if I were
         | trying to restart civilization and was _only_ left with
         | ChatGPT, I would be enraged and very much not grateful for
         | this.
        
           | nyokodo wrote:
           | > if I were trying to restart civilization and was only left
           | with ChatGPT
           | 
           | In this scenario you'd need to also be left with a big chunk
           | of compute, and power infrastructure. Since ChatGPT is the
           | front end of the model you'd also need to have the internet
           | still going in a minimum capacity.
        
             | CaptainOfCoit wrote:
             | If we're playing this game, you forgot to mention that they
             | also need: a monitor, a keyboard, a roof over their head
             | (to keep rain from getting into the electronics), etc. etc...
             | 
             | But really, didn't you catch the meaning of the parent's
             | message, or are you being purposefully obtuse?
        
           | devsda wrote:
           | I think re-imagining the "Dr. Stone" series with the main
           | character replaced by an LLM would be a funny & interesting
           | series, if we stayed true to LLMs' nature and made it
           | hallucinate as well.
           | 
           | Given the way LLMs are right now, I suspect there would be a
           | lot of failed experiments and the kingdom of science would
           | not advance that quickly.
        
             | latexr wrote:
             | > the kingdom of science will not advance that quick.
             | 
             | It's more likely that it wouldn't even start. The first
             | step to any development was figuring out nitric acid as the
             | cure to the petrification. Good luck getting any LLM to
             | figure that out. Even if it did, good luck getting any of
             | the other characters to know what to do with that
             | information that early on.
        
         | m3kw9 wrote:
         | And why would I, as an individual, need to back up human knowledge?
        
           | exe34 wrote:
           | You remember those fantasies where you got up from your seat
           | at the pub and punched the lights out of this guy for being
           | rude? A lot of us have fantasies of being the all powerful
           | oracle that guides a reboot of civilization using knowledge
           | of science and engineering.
        
             | latexr wrote:
             | > the all powerful oracle that guides a reboot of
             | civilization using knowledge of science and engineering.
             | 
             | https://en.wikipedia.org/wiki/Dr._Stone
        
         | raincole wrote:
         | It seems to be an unbelievably inefficient way to back up
         | knowledge.
        
           | samus wrote:
           | Are they though? They lossily compress trillions of
           | tokens into a few dozen GB. The decompression step is fuzzy
           | and inefficient, though.
        
             | raincole wrote:
             | And it requires massive computational power to decompress,
             | which I don't expect to be available in a catastrophic
             | situation where humans have lost a large chunk of important
             | knowledge.
        
               | samus wrote:
               | I don't necessarily agree. It requires massive computing
               | power, but running models smaller than 70B parameters is
               | possible on consumer hardware, albeit slowly.
        
               | threecheese wrote:
               | Parent may be thinking more along the lines of a "hope we
               | can print all the knowledge" type catastrophe. Though if
               | there is zero compute it'll be tough reading all those
               | disks!
        
         | LunaSea wrote:
         | I wonder how the Chinese government will manage to censor LLMs
         | within China?
        
           | popol12 wrote:
           | The same way Facebook/Google/OpenAI & others censored their
           | own LLMs, I guess?
        
             | LunaSea wrote:
             | That's only for SaaS LLMs, but if you can simply download
             | and run one on your hardware, things become difficult.
        
         | kalleboo wrote:
         | I had downloaded some LLMs to run locally just to experiment
         | when a freak hailstorm suddenly left me without internet for
         | over a week. It was really interesting to use a local LLM as a
         | replacement for Google.
         | 
         | It gave me a new mental model for LLMs rather than a "spicy
         | autocomplete" or whatever, I now think of it as "a lossy
         | compressed database of knowledge". Like you ran the internet
         | through JPEG at 30% quality.
        
           | pizzafeelsright wrote:
           | Feels like that really smart friend who is probably correct
           | but ya just don't know.
        
         | dragonwriter wrote:
         | Language models are an inefficient way to store knowledge; if
         | you want to have a "pseudo-backup of a large chunk of human
         | knowledge," download a wikipedia dump, not an LLM.
         | 
         | If you want a friendly but fallible UI to that dump, download
         | an LLM and build a simple ReAct framework around it with
         | prompting to use the wikipedia dump for reference.
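         | 
         | A sketch of what that ReAct loop could look like, with `llm`
         | and `wiki_lookup` as stand-ins for the local model and the
         | dump index:
         | 
         |   def react_answer(question, llm, wiki_lookup, max_steps=5):
         |       log = f"Question: {question}\n"
         |       for _ in range(max_steps):
         |           # Ask the model for its next move in a fixed format.
         |           step = llm(log +
         |                      "Next (Lookup: <title> or Answer: <text>):")
         |           if step.startswith("Lookup:"):
         |               title = step[len("Lookup:"):].strip()
         |               article = wiki_lookup(title)   # offline dump
         |               log += f"{step}\nObservation: {article[:2000]}\n"
         |           else:
         |               return step.removeprefix("Answer:").strip()
         |       return "No answer within the step budget."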
        
         | TrevorJ wrote:
         | It's a very underrated side effect of this whole LLM thing:
         | We've created a super compact representation of human knowledge
         | in a form that requires a FAR less complex tech stack to get
         | the information 'out' of in the future.
         | 
         | A year ago, a lot of this information only existed on the
         | internet, and would have been nearly impossible to recover in
         | any cohesive unfragmented form if the lights were to ever go
         | out on our civilization.
         | 
         | Now the problem space has moved simply to "find a single
         | solitary PC that will still boot up", and boom, you have access
         | to everything.
         | 
         | I think we just created our Rosetta stone.
        
       | 1-6 wrote:
       | Question is, how much of an improvement is this compared to a
       | GPU or ASIC?
        
         | dartos wrote:
         | Nothing in software will ever beat an equivalent ASIC.
        
           | postalrat wrote:
           | Sure there is. Software is easy to change.
        
             | dartos wrote:
             | By "beat" I meant in performance.
             | 
             | Obviously you can't change an asic
        
           | fragmede wrote:
           | An ASIC is fixed-function, so it'll never be able to boot my
           | PC and then be the CPU, even though an ASIC beats the pants
           | off anything else at computing SHA hashes for Bitcoin mining.
        
             | dartos wrote:
             | By "beat" I meant performance.
             | 
             | Obviously an ASIC is not a general purpose machine like a
             | cpu.
        
           | fulafel wrote:
           | Most ASICs are cost or power optimizations.
        
             | dartos wrote:
             | Exactly. They're much faster for their specific tasks and
             | thus are more power efficient and potentially cost
             | efficient
        
               | fulafel wrote:
               | No. E.g., of the hardware discussed in the article, the
               | Raspberry Pi uses an ASIC that's slow, cheap and low
               | power vs the Intel or AMD chips.
               | 
               | In some cases ASICs are faster than general-purpose
               | CPUs, but usually not.
        
               | LtdJorge wrote:
               | Is the LLM running on an ASIC for the Pi here? I doubt it.
        
         | yjftsjthsd-h wrote:
         | I think that should be phrased more like "what fraction of GPU
         | speed can this reach?", because it'll always be less than 1x.
        
         | gpapilion wrote:
         | So... I was struggling with this for a while. I would say
         | anywhere from 2x to an order of magnitude faster with a GPU.
         | (I've been looking at a lot of GPU benchmarks lately, and they
         | are REALLY hard to compare since they are all so specific.)
         | 
         | I do think long term there is more hope for CPUs here with
         | inference, largely because memory bandwidth becomes more
         | important than raw GPU compute. You can see this with reports
         | of the MI300 series outperforming the H100, largely because it
         | has more memory bandwidth. MCR DIMMs give you close to 2x the
         | existing memory bw in Intel CPUs, and when coupled with AMX you
         | may be able to exceed V100 and might touch A100 performance levels.
         | 
         | HBM and the general GPU architecture gives it a huge memory
         | advantage, especially with the chip to chip interface. Even
         | adding HBM to a CPU, you are likely to find the CPU is unable
         | to use the memory bw effectively unless it was specifically
         | designed to use it. Then you'd still likely have limited
         | performance with things like UPI being a really ugly bottleneck
         | between CPUs.
        
           | imtringued wrote:
           | If someone releases DDR5 or DDR6 based PIM, then most of the
           | memory bandwidth advantage of GPUs evaporates overnight. I
           | expect CPUs to be king at inference in the future.
        
             | gpapilion wrote:
             | But then you'll get GDDR6 delivered via HBM5 or whatever. I
             | don't think CPUs will ever really keep up with the memory
             | bandwidth, because for most applications it doesn't matter.
             | 
             | MCR DIMM is like 1/2 the memory bandwidth that is possible
             | with HBM4, plus it requires you to buy something like 2TB
             | of memory. It might get there, but I'd keep my money on HBM
             | and GPUs.
        
         | baq wrote:
         | From the article, passage about the 14900k:
         | 
         | > For example, when I run my spam.sh shell script, it only
         | takes 420 milliseconds, which is 7x faster than my Raspberry Pi
         | 5. That's right, when it comes to small workloads, this chip is
         | able to finish before CUDA even gets started.
         | 
         | So... it depends :)
        
         | jchw wrote:
         | I think I understand what you are thinking. You may be appending
         | "than other ways of running them" to the end of the title, but
         | it's actually "than it was on CPU before now".
        
       | discordance wrote:
       | "As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k
       | count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x
       | slower than my Mac Studio, and 3x slower than my Intel (which has
       | the same M.2 stick). I'm told that Intel and Apple are just
       | better at this, but I wish I understood why. "
       | 
       | Can anyone here answer why this is?
        
         | pstrateman wrote:
         | Apple made fsync a no-op.
         | 
         | You have to make a different call to get a real sync on macOS.
         | 
         | So tons of stuff is faster because it's not actually writing to
         | disk.
        
         | bishfish wrote:
         | Plus he isn't using oflag=direct, so since the output file is
         | small it isn't even making it to disk; I think it's only being
         | sent to the page cache. I'm afraid he is testing CPU and memory
         | (bus) speeds here.
         | 
         | oflag=direct writes directly to the device and bypasses the
         | page cache.
        
           | fweimer wrote:
           | Exactly. Something is very fishy if this system only writes
           | 1.6 GB/s to the page cache. Probably that dd command line
           | quoted in the article is incomplete.
        
       | pama wrote:
       | Super nice story on the matmul optimization that gave 810 gflops
       | for 512x512. Thanks for the write up and the contributions to
       | llama.cpp and the community more broadly.
        
       | kiratp wrote:
       | It's fascinating to me that, coming up on a year since Sapphire
       | Rapids became available in the public cloud, developers are
       | still targeting AVX512 when they should be targeting VNNI and
       | AMX.
       | 
       | https://github.com/ggerganov/llama.cpp/issues/2555
        
         | yjftsjthsd-h wrote:
         | This project in particular seems to care about the long tail of
         | hardware; note that the very first machine in this post is a
         | box from 2020 with spinning rust disk. Granted, adding support
         | for newer extensions is likely also good, but cost/benefit is
         | in play.
        
           | taneq wrote:
           | Is four years really 'long tail' these days? Our VM host box
           | is from 2010 (and I had to rebuild llama.cpp locally without
           | AVX to get it working :P )
        
             | yjftsjthsd-h wrote:
             | For cutting-edge LLM work, probably? I mean, I run mine on
             | older hardware than that, but I'm a total hobbyist...
        
             | d416 wrote:
             | It should be noted that while the HP Prodesk was released
             | in 2020, the CPU's Skylake architecture was designed in
             | 2014. Architecture is a significant factor in this style of
             | engineering gymnastics to squeeze the most out of silicon.
        
             | refulgentis wrote:
             | For LLMs...yeah. I imagine you're measuring in
             | tokens/minute with that setup. So it's possible, but...do
             | you use it much? :)
        
         | luyu_wu wrote:
         | I don't believe that is the target for a local LLM... Pretty
         | sure we're talking about client-side computing, where even the
         | newest chips support only AVX-512 (and even that sketchily on
         | Intel's side).
        
         | kristianp wrote:
         | Just buy a new AMD processor that supports AVX512.
        
         | baq wrote:
         | People with Sapphire Rapids options are not the target audience
         | of these patches
        
       | aniijbod wrote:
       | A way of thinking about what's inside any of the top LLMs right
       | now: even if they never learn another single fact, even if they
       | get ridiculously out of date as a result, even if they are even
       | more riddled with errors and prone to biases than we know them to
       | be, even if they are as prone to hallucinations as we know they
       | are and never develop the capacity to cure themselves of this,
       | they are more knowledgeable and capable of more reasoned
       | response, despite their capacity for error, to more questions
       | than any single human being that has ever lived.
        
         | JKCalhoun wrote:
         | Picturing "LLM Jeopardy". You know, a game show.
        
         | samus wrote:
         | We shouldn't choose LLMs for how many facts they store, but for
         | their capability to process human language. There is some
         | overlap between the two, but an LLM that just doesn't
         | know something can always be augmented with RAG capabilities.
        
         | talldayo wrote:
         | If you ignore my capacity for error, I bet I'd put up a good
         | score too. Hell, maybe Markov chains are smarter than LLMs by
         | this definition.
        
       | ajtulloch wrote:
       | - https://www.cs.utexas.edu/users/flame/laff/pfhp/index.html
       | (e.g. here
       | https://www.cs.utexas.edu/users/flame/laff/pfhp/week2-blocki...)
       | 
       | - https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184...
       | 
       | might be of interest
        
         | kpw94 wrote:
         | Great links, especially last one referencing the Goto paper:
         | 
         | https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...
         | 
         | >> I believe the trick with CPU math kernels is exploiting
         | instruction level parallelism with fewer memory references
         | 
         | It's the collection of tricks to minimize all sorts of cache
         | misses (L1, L2, TLB, page misses, etc.), improve register reuse,
         | leverage SIMD instructions, transpose one of the matrices if it
         | provides better spatial locality, etc.
        
           | larodi wrote:
           | The trick is indeed to somehow imagine how the CPU works with
           | the Lx caches and keep as much info in them as possible. So
           | it's not only about exploiting fancy instructions, but also
           | about thinking in engineering terms. Most software written in
           | higher-level languages cannot use L1/L2 effectively, and thus
           | ends up constantly slower than algorithms of otherwise similar
           | complexity (from an asymptotic-analysis perspective).
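           | 
           | The textbook illustration of this is loop tiling for matmul.
           | A toy Python sketch of the access pattern (real kernels do it
           | in C/assembly with SIMD, but the blocking idea is the same):
           | 
           |   def matmul_blocked(A, B, n, bs=64):
           |       # C = A @ B for n x n lists, computed block by block so
           |       # each inner loop's working set fits in L1/L2.
           |       C = [[0.0] * n for _ in range(n)]
           |       for ii in range(0, n, bs):
           |           for kk in range(0, n, bs):
           |               for jj in range(0, n, bs):
           |                   for i in range(ii, min(ii + bs, n)):
           |                       for k in range(kk, min(kk + bs, n)):
           |                           a = A[i][k]
           |                           for j in range(jj, min(jj + bs, n)):
           |                               C[i][j] += a * B[k][j]
           |       return C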
        
       | wokwokwok wrote:
       | > You don't need a large computer to run a large language model
       | 
       | While running TinyLlama does indeed count as running a language
       | model, I'm skeptical that the capabilities of doing so match what
       | most people would consider a baseline requirement to be useful.
       | 
       | Running a 10-param model is also "technically" running an LM, and I
       | can do it by hand with a piece of paper.
       | 
       | That doesn't mean "you don't need a computer to run an LM"...
       | 
       | I'm not sure where LM becomes LLM, but... I personally think it's
       | more about capability than parameter count.
       | 
       | I don't _realllly_ believe you can do a lot of useful LLM work on
       | a pi
        
         | mlyle wrote:
         | Tinyllama isn't going to be doing what ChatGPT does, but it
         | still beats the pants off what we had for completion or
         | sentiment analysis 5 years ago. And now a Pi can run it
         | decently fast.
        
           | jerrygenser wrote:
           | You can fine-tune a ~60M-parameter discriminative (not
           | generative) language model (e.g. DistilBERT) and it's one
           | or two orders of magnitude more efficient for classification
           | tasks like sentiment analysis, and probably similarly if not
           | more accurate.
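           | 
           | Roughly this kind of thing, sketched with Hugging Face
           | transformers (the dataset, hyperparameters and output dir
           | are arbitrary placeholders):
           | 
           |   from datasets import load_dataset
           |   from transformers import (AutoTokenizer,
           |       AutoModelForSequenceClassification,
           |       Trainer, TrainingArguments)
           | 
           |   tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
           |   model = AutoModelForSequenceClassification.from_pretrained(
           |       "distilbert-base-uncased", num_labels=2)
           | 
           |   # Any labelled sentiment set works; IMDB is just handy.
           |   ds = load_dataset("imdb").map(
           |       lambda b: tok(b["text"], truncation=True,
           |                     padding="max_length", max_length=256),
           |       batched=True)
           | 
           |   trainer = Trainer(
           |       model=model,
           |       args=TrainingArguments("distilbert-sentiment",
           |                              num_train_epochs=1,
           |                              per_device_train_batch_size=16),
           |       train_dataset=ds["train"].shuffle(seed=0).select(range(2000)),
           |       eval_dataset=ds["test"].select(range(500)))
           |   trainer.train()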
        
             | mlyle wrote:
             | Yup, I'm not saying TinyLlama is minimal, efficient, etc.
             | (indeed, that just shows you can take models even
             | smaller). And for a whole lot of what we throw LLMs at,
             | they're not the right tool for the job, but they're
             | expedient and surprisingly work.
        
         | samus wrote:
         | Some models trained more recently have repeatedly been
         | shown to perform comparably to larger models. And the
         | Mixture of Experts architecture makes it possible to train
         | large models that know how to selectively activate only the
         | parts that are relevant for the current context, which
         | drastically reduces compute demand. Smaller models can also
         | level the playing field by being faster to process content
         | retrieved by RAG. Via the same mechanism, they could also
         | access larger, more powerful models for tasks that exceed their
         | capabilities.
        
         | SoothingSorbet wrote:
         | I've gotten some useful stuff out of 7B param LLMs, and that
         | should fit on a Pi quantized.
        
       | bee_rider wrote:
       | Is it easy to find where the matvecs are, in LLaMA (if you are
       | someone who is curious and wants to poke around at the "engine"
       | without understanding the "transmission," so to speak)? I was
       | hoping to mess around with this for Stable Diffusion, but it
       | seemed like they were buried under quite a few layers of
       | indirection. Which is entirely reasonable, the goal is to ship
       | software, not satisfy people who'd just want to poke things and
       | see what happens, haha.
        
         | fragmede wrote:
         | did you see tinygrad can run llama and stable diffusion? it's
         | an intentionally extremely simple framework vs pytorch or even
         | micrograd, which helped me dig into the underlying math. though
         | https://spreadsheets-are-all-you-need.ai/ is a good one for
         | learning LLMs.
        
           | bee_rider wrote:
           | I haven't seen that. I'll definitely have to take a look,
           | thanks!
        
       | none_to_remain wrote:
       | From the example: "--temp 0 turns off the random number generator
       | (we don't want improvisation for a spam filter)"
       | 
       | I've been thinking for a while about how many applications of
       | LLMs need this adjustment and aren't getting it
        
         | mvkel wrote:
         | Is that what it does, though?
         | 
         | I thought setting temperature to 0 would (extremely simple
         | example) equate to a spam filter seeing:
         | 
         | - this is a spam email
         | 
         | But if the sender adapts and says
         | 
         | - th1s is a spam email
         | 
         | It wouldn't be flagged as spam.
        
           | none_to_remain wrote:
           | My understanding is that temperature applies to the output
           | side and allows for some randomness in the next predicted
           | token. Here Justine has constrained the machine to start with
           | either "yes" or "no" and to predict only one token. This
           | makes the issue stark: leaving a non-zero temperature here
           | would just add a chance of flipping a boolean.
        
             | refulgentis wrote:
             | It's more nuanced than that, in practice: this is true for
             | the shims you see from API providers (ex. OpenAI,
             | Anthropic, Mistral).
             | 
             | With llama.cpp, it's actually not a great idea to have
             | temperature purely at 0: in practice, especially with
             | smaller models, this leads to pure repeating or nonsense.
             | 
             | I can't remember where I picked this up, but, a few years
             | back, without _some_ randomness, the next likely token was
             | always the last token.
        
           | samus wrote:
           | The output of an autoregressive model is a probability for
           | each token to appear next after the input sequence. Computing
           | these is strictly deterministic from the prior context and
           | the model's weights.
           | 
           | Based on that probability distribution, a variety of text
           | generation strategies are possible. The simplest (greedy
           | decoding) is picking the token with the highest probability.
           | To allow creativity, a random number generator is used to
           | choose among the possible outputs, biased by the
           | probabilities of course.
           | 
           | Temperature rescales the logits before the softmax. As
           | temperature increases, the probabilities approach 1/vocabulary
           | size, and the output becomes completely random. For very small
           | temperature values, text generation approaches greedy
           | decoding.
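           | 
           | In code the scaling looks roughly like this (toy vocabulary;
           | real samplers layer top-k/top-p on top of it):
           | 
           |   import math, random
           | 
           |   def sample_next(logits, temperature):
           |       # temperature ~ 0   -> effectively greedy (argmax)
           |       # temperature large -> flattens toward 1/vocab size
           |       if temperature <= 1e-6:
           |           return max(range(len(logits)),
           |                      key=lambda i: logits[i])
           |       scaled = [l / temperature for l in logits]
           |       m = max(scaled)
           |       exps = [math.exp(s - m) for s in scaled]
           |       probs = [e / sum(exps) for e in exps]
           |       return random.choices(range(len(logits)),
           |                             weights=probs, k=1)[0]
           | 
           |   logits = [2.0, 1.0, 0.1]         # toy 3-token vocab
           |   print(sample_next(logits, 0.0))  # always token 0
           |   print(sample_next(logits, 1.5))  # sometimes 1 or 2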
           | 
           | If all you want is a spam filter, better replace the output
           | layer of an LLM with one with just two outputs, and finetune
           | that on a public collection of spam mails and some "ham" from
           | your inbox.
        
         | moffkalast wrote:
         | I couldn't disagree more, turning temp to zero is like taking a
         | monte carlo method and only using one sample, or a particle
         | filter with only one particle. Takes the entire concept and
         | throws it out of the window so you can have predictability.
         | 
         | LLMs need to probabilistically explore the generation domain to
         | converge on a good result for best performance. Similar issue
         | with people benchmarking models by only having them output one
         | single token (e.g. yes or no) outright, which prevents any real
         | computation from occurring so the results are predictably poor.
        
       | Ono-Sendai wrote:
       | Multithreading support in llama.cpp is probably still pretty
       | busted, assuming it uses the same underlying NN inference code as
       | whisper.cpp:
       | https://github.com/ggerganov/whisper.cpp/issues/200#issuecom...
        
         | imtringued wrote:
         | From what I have heard they use manual spin locks. Generally,
         | spin locks are not a good idea unless you want to dedicate the
         | entire machine to a single application. If the process a
         | spinlock waits on gets suspended, you're burning CPU time for
         | nothing. The OS thinks a spinlock making zero progress is
         | actually a high-priority process, so it ends up starving the
         | suspended process of CPU time.
        
           | Ono-Sendai wrote:
           | Yeah the code looks like a spinlock. It behaves terribly
           | under contention, resulting in performance falling off a
           | cliff as the number of threads increases. Adding more threads
           | actually slows down the total performance.
           | 
           | I would fix it if I could be bothered. Instead I will just
           | use the Cuda whisper backend which is pretty nice and fast.
        
       | jongjong wrote:
       | That's interesting because I built a simple ANN library and I was
       | playing around with GPU acceleration and came to a similar
       | conclusion as this article.
       | 
       | To be fair, my ANN library was faster (up to 2x) with GPU
       | acceleration in some scenarios where the ANN was shallow (as opposed
       | to deep with many hidden layers). I thought the marginal gain may
       | have been because, the way it's set up in my library, it has to
       | load all the values into the GPU from RAM for each pass of
       | forward and back propagation in each layer during training. I
       | believe there is a way to allocate memory on the GPU chip itself
       | but it's a lot more challenging to do, especially in a modular,
       | fully portable way (which was one of the goals of my library).
       | 
       | But anyway, even the 2x best-case figure seemed disappointing. In
       | my mind, I expected to see at least 10x speed improvement... And
       | I was surprised that the CPU version was actually slightly faster
       | in the scenario I was testing at the time which was a relatively
       | deep network. It makes sense since the different layers cannot be
       | parallelized as the input of one layer depends on the output of
       | the previous layer... So the more layers you have, the more
       | serial bottlenecks you have, the less you can benefit from GPU
       | acceleration... And unfortunately, deep networks also happen to
       | be those which tend to perform best for a lot of use cases.
        
       | kristianp wrote:
       | Nice to see such speedups for CPUs. Are these changes available
       | as a branch or pull request in llama.cpp itself? I'd like to make
       | use of them in that form if possible (as I'm used to using that).
        
         | dagaci wrote:
         | Yes, this is really a phenomenal effort! And what open source
         | is about: bringing improvements to so many use cases. So that
         | Intel and AMD chip users can start to see real performance by
         | taking advantage of their chips' high-performance capabilities,
         | making even old parts competitive.
         | 
         | There are two PRs raised to merge to llama.cpp:
         | 
         | https://github.com/ggerganov/llama.cpp/pull/6414
         | 
         | https://github.com/ggerganov/llama.cpp/pull/6412
         | 
         | Hopefully these can be accepted without drama, as there are
         | many downstream dependencies on llama.cpp that will also
         | benefit.
         | 
         | Though of course everyone should also look directly at releases
         | from llamafile https://github.com/mozilla-Ocho/llamafile.
        
       | wtallis wrote:
       | I know this post is focused specifically on _CPU_ performance,
       | but the section on the performance on the Mac Studio seems to be
       | deliberately avoiding directly mentioning that machine 's GPU,
       | let alone benchmark against it. I think it would have been
       | interesting to see a straightforward comparison of what compute
       | performance and memory bandwidth (as measured by the prompt
       | processing and token generation speeds, respectively) are
       | achievable with reasonable optimization effort on the CPU vs GPU
       | when they're attached to the same memory subsystem.
        
       | politelemon wrote:
       | This is great work. I've always thought it would be great if
       | running LLMs could be commoditized for regular average-Joe
       | hardware. I had thought that llamafile was like a Dockerfile for
       | llama.cpp, but it looks like that's a mistake?
       | 
       | Will definitely be giving this a try.
        
       | seangrogg wrote:
       | Mmm, I wonder how well this would work on a mobile device. Maybe
       | I'll try grabbing my ubuntu touch here in a sec...
        
         | seangrogg wrote:
         | (For any who were curious: it does not for memory reasons)
        
       | speps wrote:
       | Regarding this bit at the end:
       | 
       | > I learned how to write math kernels by renting Vast VMs and
       | watching Gautham Venkatasubramanian and mrdomino develop CUDA
       | kernels in a tmux session. They've been focusing on solving a
       | much more important challenge for llamafile, which is helping it
       | not have a mandatory dependency on the cuBLAS
       | 
       | If I'm reading this right, they're trying to rewrite cuBLAS
       | within CUDA itself. I'm guessing the next step would be removing
       | CUDA dependency and go with directly using Vulkan or Metal
       | compute shaders. Am I correct?
        
         | WithinReason wrote:
         | Yes, but none of these have performance portability across GPU
         | vendors, so it's probably seen as pointless. You would need an
         | AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not
         | like C code on CPUs.
        
           | radarsat1 wrote:
           | Depending on how many individual tweaks are necessary for
           | hardware variants of course... but at this level of code &
           | complexity it actually seems pretty reasonable to write 3 or
           | 4 versions of things for different vendors. More work yes,
           | but not pointless.
        
             | treffer wrote:
             | A nice example of this is fftw which has hundreds (if not
             | thousands) of generated methods to do the fft math. The
             | whole project is a code generator.
             | 
             | It can then, after compilation, benchmark these, generate a
             | wisdom file for the hardware, and pick the right
             | implementation.
             | 
             | Compared with that "a few" implementations of the core math
             | kernel seem like an easy thing to do.
        
               | naasking wrote:
               | Not exactly comparable, as you said, the FFTW
               | implementations are auto-generated but it doesn't sound
               | like these few implementations will be.
        
               | bee_rider wrote:
               | ATLAS was an automatically tuned BLAS, but it's been
               | mostly supplanted by ones using the hand-tuned kernel
               | strategy.
        
               | touisteur wrote:
               | Apache TVM does something similar for auto-optimization
               | and last time I checked it wasn't always a win against
               | OpenVINO (depending on the network and batch-size) and it
               | came with lots of limitations (which may have been lifted
               | since) - stuff like dynamic batch size.
               | 
               | I wish we had superoptom
        
           | TuringNYC wrote:
           | To me it makes sense to have an interface that can be
           | implemented individually for AMD, Metal, etc. Then, leave it
           | up to the individual manufacturers to implement those
           | interfaces.
           | 
           | I'm sitting in an office with a massive number of Macbook Pro
           | Max laptops usually sitting idle and I wish Apple would
           | realize the final coup they could achieve if I could also run
           | the typically-NVIDIA workloads on these hefty, yet
           | underutilized, Mx machines.
        
             | jorvi wrote:
             | Apple could unlock so much compute if they give customers a
             | sort of "Apple@Home" deal. Allow Apple to run distributed
             | AI workloads on your mostly idle extremely overpowered
             | Word/Excel/VSCode machine, and you get compensation dropped
             | straight into your Apple account's linked credit card.
        
               | newswasboring wrote:
               | If Apple were doing an Apple@Home kind of deal they might
               | actually want to give away some machines for free or
               | super cheap (I realize that doesn't fit their brand) and
               | then get the rights perpetually to run compute on them.
               | Kind of like advertising but it might be doing something
               | actually helpful for someone else.
        
               | TuringNYC wrote:
               | >> If Apple were doing an Apple@Home kind of deal they
               | might actually want to give away some machines for free
               | or super cheap
               | 
                | In such a case, my guess is that the savings from the
                | machines being free would be outweighed by the
                | increased cost of electricity.
        
               | TuringNYC wrote:
               | BTW, at our day-job, we've been running a "cluster" of M1
               | Pro Max machines running Ollama and LLMs. Corporate rules
                | prevent remote access to the machines, so we created a
               | quick and dirty pull system where individual developers
               | can start pulling from a central queue, running LLM
               | workloads via the Ollama local service, and contributing
               | things back centrally.
               | 
                | Sounds kludgy, but introduce enough constraints and you
               | end up with this as the best solution.
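                | Roughly this shape, as a hypothetical C++ sketch (the
                | queue endpoints are made up placeholders; only the
                | local Ollama /api/generate endpoint is real):
                | 
                |     #include <curl/curl.h>
                |     #include <iostream>
                |     #include <string>
                | 
                |     static size_t collect(char* p, size_t sz,
                |                           size_t nm, void* out) {
                |       static_cast<std::string*>(out)->append(p, sz * nm);
                |       return sz * nm;
                |     }
                | 
                |     static std::string post(const std::string& url,
                |                             const std::string& body) {
                |       std::string out;
                |       CURL* c = curl_easy_init();
                |       curl_easy_setopt(c, CURLOPT_URL, url.c_str());
                |       curl_easy_setopt(c, CURLOPT_POSTFIELDS, body.c_str());
                |       curl_easy_setopt(c, CURLOPT_WRITEFUNCTION, collect);
                |       curl_easy_setopt(c, CURLOPT_WRITEDATA, &out);
                |       curl_easy_perform(c);
                |       curl_easy_cleanup(c);
                |       return out;
                |     }
                | 
                |     int main() {
                |       curl_global_init(CURL_GLOBAL_DEFAULT);
                |       // Placeholder queue service; in reality this
                |       // could be a database table or any shared store.
                |       std::string prompt =
                |           post("http://queue.internal/next", "{}");
                |       // No JSON escaping here; sketch only.
                |       std::string req =
                |           "{\"model\":\"mistral\",\"prompt\":\"" + prompt +
                |           "\",\"stream\":false}";
                |       std::string result =
                |           post("http://localhost:11434/api/generate", req);
                |       post("http://queue.internal/complete", result);
                |       std::cout << result << "\n";
                |     }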
        
               | nickpsecurity wrote:
               | Do you have price-performance numbers you can share on
               | that? Like compared against local or cloud machines with
                | RTX and A100 GPUs?
        
               | TuringNYC wrote:
               | >> Do you have price-performance numbers you can share on
               | that? Like compared against local or cloud machines with
                | RTX and A100 GPUs?
               | 
                | Good question, the accounting is muddy --
                | 
                | 1. Electricity is a parent-company responsibility, so
                | while that is a factor in OpEx price, it isn't a factor
                | for us. I don't think it even gets submetered.
                | Obviously, one wouldn't want to abuse this, but maxing
                | out MacBooks doesn't seem close to abuse territory.
               | 
               | 2. The M1/M2/M3 machines are already purchased, so while
               | that is major CapEx, it is a sunk cost and also an
               | underutilized resource most of the day. We assume no wear
               | and tear from maxing out the cores, not sure if that is a
               | perfect assumption but good enough.
               | 
                | 3. Local servers are out of the question at a big
                | company outside of infra groups; it would take years to
                | provision them, and I don't think there is even a means
                | to anymore.
               | 
               | The real question is cloud. Cloud with RTX/A100 would be
               | far more expensive, though I'm sure performant. (TPM
                | calculation left to the reader :-). I'd leave those for
                | fine-tuning, not for inference workloads. Non-production
                | inference is particularly bad because you can't easily
               | justify reserved capacity without some constant
               | throughput. If we could mix environments, it might make
               | sense to go all cloud on NVIDIA but having separate
               | environments with separate compliance requirements makes
               | that hard.
               | 
               | Jokes aside, I think a TPM calculation would be
               | worthwhile and perhaps I can do a quick writeup on this
               | and submit to HN.
        
           | surge wrote:
            | Maybe it's a dumb question, but isn't something like OpenCL
           | meant to solve this problem?
        
             | jvanderbot wrote:
              | From my understanding, using triangle/pixel shaders to do
              | HPC has given way to a more general-purpose GPU
              | programming paradigm, namely CUDA.
             | 
             | Of course this knowledge is superficial and probably
             | outdated, but if I'm not too far off base, it's probably
             | more work to translate a general CUDA-like layer or CUDA
             | libs to OpenCL.
        
             | VHRanger wrote:
             | In theory, yes.
             | 
              | In practice, OpenCL became a giant mess. Some vendors put
              | up speed bumps by not supporting the transition from 2 to
              | 3, or by having shitty drivers for it.
             | 
             | It also sat at the wrong level of abstraction for high
             | performance compute, which is why CUDA ended up being used.
             | 
              | Vulkan would have been reasonable to write compute
              | shaders in, if there weren't already a ton of
              | alternatives out there.
        
         | larodi wrote:
         | llama.cpp (or rather G. Gerganov et al.) is trying to avoid
         | cuBLAS entirely, using its own kernels. Not sure how jart's
         | effort relates, and whether jart intends to upstream these
         | into llama.cpp, which seems to still be the underlying tech
         | behind llamafile.
        
           | homarp wrote:
           | Here are links to the most recent pull requests sent
           | https://github.com/ggerganov/llama.cpp/pull/6414
           | https://github.com/ggerganov/llama.cpp/pull/6412
        
             | speps wrote:
             | This doesn't relate to GPU kernels unfortunately.
        
       | pknerd wrote:
        | So, I can now run it on my 2015 MacBook with 8GB RAM?
        
       | isusmelj wrote:
        | Is there an overview somewhere of the progress made on the
        | software side for training and inference of LLMs? It feels
        | like we've squeezed 10-100x more out of the hardware since
        | LLaMA appeared. This crazy progress will probably saturate
        | though as we reach theoretical limits, no?
        
       | mijoharas wrote:
       | Has Justine written anywhere about her disassembly setup?
       | 
       | > I configured Emacs so I can push a button, and the disassembly
       | for the C++ code I'm working on will pop up on the screen in a
       | few milliseconds.
       | 
       | I assume it's something project specific rather than being able
       | to get the disassembly for an arbitrary section of code or
       | something?
       | 
        | It seems very handy, so I'd love to see the implementation (I
        | couldn't find anything by googling).
        
         | pelletier wrote:
          | This is probably what they are referring to:
          | https://github.com/jart/disaster
        
           | mijoharas wrote:
           | Thanks! I need to get better at googling I guess.
        
           | gpderetta wrote:
            | Nice. I have been using rmsbolt for a similar feature, but
            | it is very rough. I'll need to give this a try.
        
       | moffkalast wrote:
       | > the Raspberry Pi
       | 
        | Odd how there were no Mistral 7B benchmarks for the Pi 5 in
        | that table (I doubt anyone is seriously considering using
        | TinyLlama for anything at all), so I went and re-tested it
        | myself on the Pi 5 8GB.
       | 
       | llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32
       | tokens per second
       | 
       | llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token,
       | 2.62 tokens per second
       | 
        | It does seem to inch closer to the speed you get with BLAS
        | acceleration, which is quite impressive, but in practical
        | terms the Pi 5 is so heavily limited by its memory throughput
        | bottleneck that it saturates the required compute with 3
        | threads already. So while fancy kernels will make it more
        | efficient, they won't really save you from that fundamental
        | bandwidth limit. The Pi Foundation messed up going with a
        | 32-bit memory bus, simple as that.
        
       | 6r17 wrote:
        | Today being today, I must ask: has anyone actually tried this?
        
       | tomp wrote:
       | TL;DR: unroll the outer two loops of matrix multiplication
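        | A simplified sketch of the idea (not the article's exact
        | kernel): instead of computing one C[i][j] at a time, unroll
        | the two outer loops so a small tile of outputs is accumulated
        | at once. The accumulators stay in registers, each loaded A/B
        | value is reused several times, and the compiler can map the
        | inner FMAs onto AVX/NEON units.
        | 
        |     #include <cstddef>
        | 
        |     // C = A(m x k) * B(k x n), row-major fp32; assumes m and
        |     // n are multiples of 4 to keep the sketch short.
        |     void sgemm_4x4_tiles(const float* A, const float* B,
        |                          float* C, std::size_t m,
        |                          std::size_t n, std::size_t k) {
        |       for (std::size_t i = 0; i < m; i += 4)
        |         for (std::size_t j = 0; j < n; j += 4) {
        |           // 16 accumulators the compiler can keep in registers.
        |           float acc[4][4] = {};
        |           for (std::size_t l = 0; l < k; ++l)
        |             for (std::size_t di = 0; di < 4; ++di)
        |               for (std::size_t dj = 0; dj < 4; ++dj)
        |                 acc[di][dj] +=
        |                     A[(i + di) * k + l] * B[l * n + (j + dj)];
        |           for (std::size_t di = 0; di < 4; ++di)
        |             for (std::size_t dj = 0; dj < 4; ++dj)
        |               C[(i + di) * n + (j + dj)] = acc[di][dj];
        |         }
        |     }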
        
         | amelius wrote:
         | Shouldn't this have been done in a library instead of a
         | specific project? Then others could also profit from it.
        
       | AbuAssar wrote:
        | Regarding AMD Zen 4 with AVX-512:
       | 
       | "Here we see that, despite only being twice the price, the 7995WX
       | x86 ISA offers 7x more raw compute power than the M2 Ultra ARM
       | ISA, and nearly the same token generation speed, which is likely
       | thanks to its 384mb L3 cache. When I bought this chip, I had to
       | expand support in llama.cpp for bfloat16 and AVX512 before I
       | could fully test its capabilities. My work means you can now run
       | LLaMA 2.8x faster on Zen4 than you could before."
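        | For context on the bfloat16 part (an illustrative sketch, not
        | llama.cpp's actual code): bf16 is just the top 16 bits of an
        | IEEE float32, which is why CPU support is comparatively cheap
        | to add -- widening is a 16-bit shift.
        | 
        |     #include <cstdint>
        |     #include <cstring>
        | 
        |     float bf16_to_f32(std::uint16_t b) {
        |       std::uint32_t u = static_cast<std::uint32_t>(b) << 16;
        |       float f;
        |       std::memcpy(&f, &u, sizeof f);
        |       return f;
        |     }
        | 
        |     // Truncating conversion; production code rounds to nearest.
        |     std::uint16_t f32_to_bf16(float f) {
        |       std::uint32_t u;
        |       std::memcpy(&u, &f, sizeof u);
        |       return static_cast<std::uint16_t>(u >> 16);
        |     }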
        
         | reckless wrote:
          | Does this also count platform costs or just chip cost? I'd
          | imagine the Threadripper motherboard and RAM costs aren't
          | insignificant.
        
           | KennyBlanken wrote:
           | A complete desktop computer with the M2 Ultra w/64GB of RAM
           | and 1TB of SSD is $4k.
           | 
            | The 7995WX processor alone is $10k, the motherboard is
            | _one grand_, the RAM is another $300. So you're up to
            | $11,300, and you still don't have a PSU, case, SSD,
            | GPU... or a heatsink that can handle the 300W TDP of the
            | Threadripper processor;
           | you're probably looking at a very large AIO radiator to keep
           | it cool enough to get its quoted performance. So you're
           | probably up past $12k, 3x the price of the Studio...more like
           | $14k if you want to have a GPU of similar capability to the
           | M2 Ultra.
           | 
           | Just the usual "aPPle cOMpuTeRs aRE EXpeNsIVE!" nonsense.
        
             | incrudible wrote:
             | So from a CPU perspective you get 7x the CPU throughput for
             | 3x to 4x the price, plus upgradable RAM that is massively
             | cheaper. The M2 uses the GPU for LLMs though, and there it
             | sits in a weird spot where 64GB of (slower) RAM plus
             | midrange GPU performance is not something that exists in
             | the PC space. The closest thing would probably be a
             | (faster) 48GB Quadro RTX which is in the $5000 ballpark.
             | For other use cases where VRAM is not such a limiting
             | factor, the comparably priced PC will blow the Mac out of
             | the water, especially when it comes to GPU performance. The
             | only reason we do not have cheap 96GB GDDR GPUs is that it
             | would cannibalize NVIDIA/AMDs high margin segment. If this
             | was something that affected Apple, they would act the same.
        
             | juitpykyk wrote:
             | You're using the wrong CPU.
             | 
              | The consumer AMD 7950X supports AVX-512, and it's faster
              | than the M2 Ultra at half the cost.
        
       | aimonster2 wrote:
       | Posted too early.
        
       | sublimefire wrote:
       | re:funding
       | 
        | My friend suggested nominating Justine for open source
        | contributions in an internal Microsoft programme (the winner
        | takes $10k). They did not even want to add her to the
        | potential list of nominees because her software is not used
        | at MSFT. It speaks volumes about the corporate culture and
        | shows what they really think about OSS support.
        
       | miki123211 wrote:
       | If I'm reading the post correctly, Llamafile is faster than
       | llama.cpp, despite the author upstreaming some of the changes.
       | What's the reason for this?
        
       | tiffanyh wrote:
       | Pixar uses CPUs ...
       | 
       | I wonder if we'll end up in a situation like rendered movies.
       | 
        | Where big studios like Pixar use CPUs (not GPUs) to render
        | their movies due to the cost/perf (and access to larger
        | amounts of RAM).
       | 
       | https://news.ycombinator.com/item?id=25616372
        
         | kreco wrote:
          | > Where big studios like Pixar use CPUs (not GPUs) to
          | render their movies due to the cost/perf (and access to
          | larger amounts of RAM).
          | 
          | I wonder if (or when) this will change once integrated GPUs
          | become "mainstream"; the CPU/GPU share the same RAM AFAIK.
        
           | rockwotj wrote:
            | I expect GPU hardware to specialize, like Google's TPU.
            | The TPU feels like ARM in these AI workloads: when you
            | start to run these at scale, you'll care about the
            | cost/perf tradeoff for most use cases.
            | 
            | > CPU/GPU share the same RAM AFAIK.
            | 
            | This depends on the GPU. I believe Apple has integrated
            | memory, but most GPUs, from my limited experience writing
            | kernels, have their own memory. CUDA pretty heavily has a
            | device memory vs host memory abstraction.
        
             | talldayo wrote:
             | On top of that, Nvidia has provided a unified addressing
             | abstraction over PCI for a looooong time via CUDA:
             | https://developer.nvidia.com/blog/unified-memory-in-cuda-6/
             | 
             | Customers like Pixar could probably push this even further,
             | with a more recent Nvidia rack and Mellanox networking.
             | Networking a couple Mac Studios over Thunderbolt doesn't
             | have a hope of competing, at that scale.
        
         | CaptainOfCoit wrote:
          | I'm not sure how true that is anymore. From the outside it
          | seems they're at least moving to a CPU/GPU hybrid (which
          | makes a lot of sense), judging by new features landing in
          | RenderMan that continue to add more support for GPUs (like
          | XPU).
        
           | tiffanyh wrote:
            | Isn't this more a function of RenderMan being a product
            | that is sold externally?
            | 
            | And as such it's expected to at least support GPUs.
        
             | CaptainOfCoit wrote:
             | Hard to know without getting information from people at
             | Pixar really.
             | 
             | Not sure how much sense it would make for Pixar to spend a
             | lot of engineering hours for things they wouldn't touch in
             | their own rendering pipeline. As far as I know, most of the
             | feature development comes from their own rendering
             | requirements rather than from outside customers.
        
         | cthalupa wrote:
          | It's entirely the cost/perf of access to larger amounts of
          | VRAM that keeps rendering on CPUs now. GPUs are strictly
          | better in almost every way for rendering, except that the
          | large amounts of VRAM needed are quite expensive at present.
          | (We could have some arguments about technical precision, FP
          | calculations, etc., but with modern cards these arguments
          | are largely semantics; the output can be accurate to the
          | point that any human watching for entertainment purposes
          | will not be able to spot physical inaccuracies that arise
          | from a GPU render vs. a CPU one.)
         | 
          | But that's already been changing, and we are seeing studios
          | moving to fully GPU-based pipelines. Wylie Co., a major
          | visual effects company (Dune parts 1 and 2, Marvel movies,
          | The Last of Us, a bunch of others), is now a 100% GPU shop.
          | The trend is towards more and more GPU rendering, not less.
         | 
         | With AI providing another strong incentive towards increasing
         | the amount of VRAM on GPUs, I don't see any reason to believe
         | that trend will reverse.
        
       | 4bpp wrote:
       | It would be good to see some independent verification of this
       | claim. HN has previously [1] fallen for a claim by the same
       | author to have reduced llama.cpp memory usage for a dense model
       | way below the size of the model, which should have failed a basic
       | smell test and indeed was debunked shortly after. Justine Tunney
       | appears to enjoy extreme superstar status here, and it's hard to
       | overstate the degree of social pressure that needed to be
       | overcome at the time for the skeptic position to reach fixation
       | (to begin with, what other LLM developments even hit upvote
       | numbers like the +1300ish there or the +712 here at the time of
       | writing?).
       | 
       | [1] https://news.ycombinator.com/item?id=35393284
        
         | freedomben wrote:
         | > _Justine Tunney appears to enjoy extreme superstar status
         | here_
         | 
         | This is true, and for sure pretty much all humans can benefit
         | from increased skepticism (though not cynicism), but that
         | superstar status is achieved from numerous impressive works.
         | Cosmopolitan C and Actually Portable Executable were some of
         | the things in the past that alone were worthy of significant
         | respect, and for many people (like myself) these were our first
         | introduction.
         | 
         | Speaking only for myself, I have a high opinion of Justine on
         | technical merits. I'm sure she makes mistakes like all humans.
         | I can tell she gets excited by discoveries and the chase, and
         | that probably does sometimes cause premature celebration (this
         | is something I struggle with so it's recognizable to me haha),
         | but being wrong sometimes doesn't erase when you're right, and
         | she has been spectacularly right a lot more times than most
         | people I know.
         | 
         | There have been some personality clashes between Justine and
         | others at times, and unfortunately it's situations where only
         | part (sometimes a small part) of it was public, meaning we can
         | only take people's word for what happened. Given my ignorance,
         | I choose to withhold judgment here, but even if I didn't (and
         | assumed she was guilty) it doesn't change the technical merits
         | and it certainly wouldn't dissuade me from seeing what she's
         | working on now.
         | 
         | So when I see stuff from Justine come out like this, it gets my
         | attention. Would it get my attention if the same thing were
         | posted by somebody whose name I don't recognize? Likely not,
          | but I think that is (unfortunately) part of being human. We
          | aren't capable (yet!) of evaluating everything on technical
          | merit alone because the sheer volume of material far exceeds
          | our time. Therefore we use other (less reliable) signalling
          | mechanisms as a way to quickly decide what is worthy of our
          | time investment and what may not be. Reputation/name
          | recognition is a very imperfect, but better-than-random,
          | indicator.
        
           | llm_trw wrote:
           | >This is true, and for sure pretty much all humans can
           | benefit from increased skepticism (though not cynicism), but
           | that superstar status is achieved from numerous impressive
           | works.
           | 
            | It is achieved through a never-ending parade of
            | self-aggrandizement.
           | 
           | What Justine is very good at is presenting trivial concepts
           | from a world which few front end developers understand in a
           | language that most front end developers understand.
           | 
            | I had the misfortune of having to find out about her
            | because of how thoroughly she polluted the Google search
            | space for Lisp with her implementation of SectorLISP. For
            | some reason Google decided that SectorLISP needed to be in
            | the top 5 results for every query about `minimal lisp with
            | quotation` even when quotation wasn't implemented in her
            | version.
        
             | cl3misch wrote:
             | > presenting trivial concepts from a world which few front
             | end developers understand in a language that most front end
             | developers understand
             | 
             | Completely ignoring the JT discussion, the argument that
             | something is trivial in _some_ area does not really hold.
             | 1) Science is mostly  "just" connecting the dots, and 2)
             | landmark discoveries tend to look trivial in hindsight
             | almost by definition, because they have to be
             | straightforward enough to be widely adopted.
        
           | 4bpp wrote:
           | I don't know, my first (and main) impression of them was
           | actually in the context of the llama.cpp mmap story, as I was
           | somewhat involved in the project back then, and there I
           | thought their impact on the project was predominantly
           | negative. While they introduced a mildly beneficial change
           | (mmap-based model loading), the way in which this was done
           | was not healthy for the project - the changes were rammed
           | through with little regard for concerns that existed at the
           | time about backwards compatibility and edge cases that might
           | be broken by the half-baked patch, Justine came across as
           | self-aggrandizing (in the sense of "acting as if they ran the
           | place", presenting their proposals as a plan that others must
           | follow rather than suggestions) and overly eager to claim
           | credit (epitomized by the injection of their own initials
           | into the magic number file format identifier next to those of
           | the project originator's, and the story of the hapless
           | _other_ author of the mmap changeset who was at first given a
           | token acknowledgement but then quickly sidelined). Arguments
           | for the inclusion of the patch seemed to be won by a
           | combination of half- and untruths like those about memory
           | savings and the sudden participation of a large number of
           | previously uninvolved sycophants. It is fortunate that Georgi
           | handled the fallout as well as he did, and that he in fact
           | had amassed the social capital necessary to survive his
           | heavy-handed solution (soft-banning both JT and their most
           | prominent detractor). A less-successful project would
           | probably have found itself captured or torn apart by the
           | drama.
           | 
           | There is nothing wrong with holding people in esteem for
           | their achievements, but in this case the degree of esteem
           | really seems to be excessive. This is not a matter of simply
           | being annoyed that people like "the wrong thing" - the mmap
           | situation was significantly exacerbated by the presence of
           | irrational/excessive supporters of Justine's as well as the
           | irrational/excessive detractors that emerge wherever the
           | former exist.
        
             | freedomben wrote:
             | I would like to know more about the mmap situation, as what
             | I saw on the surface could warrant some concern. Being
             | somewhat involved you would probably know better than I as
             | I was just an observer reading the thread after-the-fact.
             | It seemed like the biggest accusation was the plagiarism
             | (or "collaborating" but mostly taking somebody else's
             | code).
             | 
             | Did anybody besides the two parties see the code develop,
             | or does anybody else have knowledge of this? Or is it just
             | his word vs. hers? Do you have any suggested reading to get
             | more perspective other than just the github thread and HN
             | thread? (really asking. these aren't rhetorical questions)
             | 
             | Reading the thread, I do think there are a lot of
             | opportunities to read in confirmation bias. For example if
             | I start reading that thread with the idea that Justine is
             | coming in to hijack the project and make herself the hero
             | that it needs and deserves, and to get her initials
             | embedded in there as a permanent tribute to her own glory,
             | I can see that. But if I read it as her coming in with cool
             | work that she's excited about, and had to come up with a
             | new format and couldn't think of a name (naming things can
             | be really hard) and just stuck in one of the first things
             | that came to mind (or even used as a placeholder prior to
             | discussion), I can see that as well.
             | 
             | I absolutely don't want the truth covered up, but I also
             | don't want to accept as true things that aren't true,
             | especially where the implications are toward somebody's
             | character. I'm a big "benefit of the doubt" kind of person.
        
               | 4bpp wrote:
               | My sense is that the part about credit/collaboration was
               | actually somewhat overblown among the detractors. What
               | roughly happened _as far as I can remember_ is that JT
               | and another person worked on mmap together with about
               | equal contribution, though the other person _might_ have
               | been the one to have initiated the idea (and solicited
               | help to push it through); then at some point JT decided
               | to make a PR to the main repository in their own name,
               | but crediting the other collaborator as a coauthor, which
               | may or may not have been coordinated with the other
               | person. After that, though, in a fairly characteristic
               | fashion, JT started fielding adulatory questions from
               | their fans (on Github, but also on HN, Twitter and
               | possibly other media) about the change, and quickly
               | switched to simply referring to it as their own, with no
               | mention of the other contributor. The other contributor
               | expressed some misgivings about having their contribution
               | erased, which were picked up by a growing set of people
                | who were generally resentful about JT's conduct in the
               | project. As far as I can tell, when confronted about it,
               | JT at no point explicitly denied what the other person
               | did (and I think the commit logs should all still be
               | there in the fork), but at some point the other person
                | just decided to stop pushing the issue due to being
                | uncomfortable with becoming a political football in the
                | fandom war between JT fans and antis.
               | 
               | My personal main gripe with JT really was the tone they
               | adopted in the Github discussions, and the effect of the
               | large numbers of drive-by supporters, who were often far
               | less restrained in both unfounded claims about Justine's
               | accomplishments and attacks on any critics. (At this
               | point I'd also like to note that I consider some sibling
               | comments to be uncomfortably hostile in a personal way,
               | like the "hit piece" one.) I think that as a public
               | persona, especially one who actively pursues publicity,
               | you have some responsibility to restrain your followers -
               | Justine, I get the sense, instead uses them as deniable
               | proxies, as also seen with the instances where instead of
                | straight up putting their signature on the "RAM usage
                | reduced to 6GB" claim they chose to post a
                | collage of screenshots of supporters making it.
        
               | cryptonector wrote:
               | This could all be true, but it's hard to evaluate these
               | claims on their own. Not being involved in any way, all I
               | can do is conclude that there is some friction in that
               | community. It's possible that JT is toxic, it's possible
               | that you are toxic, it's possible that neither of you is
               | generally toxic but something about your personalities
               | causes your interactions to become toxic, it's even
               | possible that neither of you were toxic in any way but
               | your impression of things after the fact is as-if Tunney
               | had been toxic. Sometimes one has to stop and think about
               | these things and figure out how to smooth things over,
               | and sometimes it's not possible to smooth things over.
        
               | 4bpp wrote:
               | I didn't have any direct interactions with JT then or now
               | - while it was hard to ignore the discussion as an
               | onlooker, it did not touch upon any parts of the code
               | that I was involved with. This seems to be one of the
               | topics where everyone who is even tangentially involved
               | is under a default suspicion of being biased in one
               | direction or another.
        
         | leeoniya wrote:
         | > and indeed was debunked shortly after
         | 
         | was also surprised that she continues to mention the mmap thing
         | in a positive light even after the facts about the claim were
         | settled to the contrary, even disregarding the whole
         | attribution fiasco.
        
         | azeirah wrote:
         | You can simply check the Pull Request on llama.cpp on Github.
          | JohannesGaessler (a core maintainer) has already run the code
          | and says it's an impressive speed-up. There isn't a thorough
          | review by any of the core maintainers yet, but this is very
          | likely exactly what Justine says it is: various significant
          | and insignificant speedups.
        
         | mtlynch wrote:
         | > _HN has previously [1] fallen for a claim by the same author
         | to have reduced llama.cpp memory usage for a dense model way
         | below the size of the model, which should have failed a basic
         | smell test and indeed was debunked shortly after._
         | 
         | Where did Justine claim this? The link you provided is Justine
          | saying that she _doesn't_ have an explanation for the
         | reduction in RAM and that readers shouldn't treat it as fact
         | yet:
         | 
         | > _The loading time performance has been a huge win for
         | usability, and folks have been having the most wonderful
          | reactions after using this change. But we don't have a
         | compelling enough theory yet to explain the RAM usage miracle.
         | So please don't get too excited just yet! Yes things are
         | getting more awesome, but like all things in science a small
         | amount of healthy skepticism is warranted._
         | 
         | Was the link supposed to show the false claim or the debunking
         | of the claim?
        
           | 4bpp wrote:
            | Plenty of claims about it, e.g. here as a "fact":
            | https://github.com/ggerganov/llama.cpp/discussions/638#discu...
            | I don't think occasional expressions of lingering doubt
            | (still couched among positive language like calling it a
            | "miracle") can offset all the self-promotion that clearly
            | seeks to maximise visibility of the implausible claim, even
            | as it is attributed to others, as for example in
            | https://twitter.com/JustineTunney/status/1641881145104297985...
            | A cereal manufacturer would probably be held responsible
            | for package text like "Fruity Loops cured my cancer! -
            | John, 52, Kalamazoo" too.
        
             | mtlynch wrote:
             | I don't read that as a claim of fact at all. From the link
             | you shared:
             | 
              | > _Now, since my change is so new, it's possible my theory
             | is wrong and this is just a bug. I don't actually
             | understand the inner workings of LLaMA 30B well enough to
             | know why it's sparse._
             | 
             | I haven't followed her work closely, but based on the links
             | you shared, she sounds like she's doing the opposite of
             | self-promotion and making outrageous claims. She's sharing
             | the fact that she's observed an improvement while also
             | disclosing her doubts that it could be experimental error.
             | That's how open-source development is supposed to work.
             | 
             | So, currently, I have seen several extreme claims of
             | Justine that turned out to be true (cosmopolitan libc, ape,
             | llamafile all work as advertised), so I have a higher
             | regard for Justine than the average developer.
             | 
             | You've claimed that Justine makes unwarranted claims, but
             | the evidence you've shared doesn't support that accusation,
             | so I have a lower regard for your claims than the average
             | HN user.
        
               | 4bpp wrote:
               | The very opening line says
               | 
               | > I'm glad you're happy with the fact that LLaMA 30B (a
               | 20gb file) can be evaluated with only 4gb of memory
               | usage!
               | 
               | The line you quoted occurs in a context where it is also
                | implied that the low memory usage is a _fact_, and there
                | might only be a bug insofar as the model is being
               | evaluated incorrectly. This is what is entailed by the
               | assertion that it "is" sparse: that is, a big fraction of
               | the parameters are not actually required to perform
               | inference on the model.
        
               | wpietri wrote:
               | I think you are making a lot of soup from very little
               | meat. I read those links the same way mtlynch read them.
               | I think you're looking for a perfection of phrasing that
               | is much more suited to peer-reviewed academic papers than
               | random tweets and GitHub comments taken from the middle
               | of exploring something. Seeing your initial comment and
               | knowing little about the situation, I was entirely
               | prepared to share your skepticism. But at this point I'm
               | much more skeptical of you.
        
             | cryptonector wrote:
             | Where's the 30B-in-6GB claim? ^FGB in your GH link finds
             | [0] which is neither by jart nor by ggerganov but by
             | another user who promptly gets told to look at [1] where
              | Justine denies that claim.
              | 
              | [0] https://github.com/antimatter15/alpaca.cpp/issues/182
              | 
              | [1] https://news.ycombinator.com/item?id=35400066
        
               | 4bpp wrote:
               | These all postdate the discussions that I linked (from
               | March 31st). By April 1st JT themselves seems to have
               | stopped making/boosting the claim about low memory usage.
        
               | cryptonector wrote:
               | I used your link.
        
         | quest88 wrote:
         | What's the point of your comment if you're not going to do the
         | work yourself? If you don't have something nice to say then
         | don't say it.
         | 
         | The "hey this may or may not be true so someone go figure it
         | out" is lazy, self-gratifying and pointless.
        
           | thebytefairy wrote:
           | I think it's very helpful for someone to point out that the
           | source has been shown to be unreliable before, and we should
           | wait for more verification from others knowledgable in the
           | space.
        
             | freedomben wrote:
             | Agreed. I think there's a blurry gray line between pointing
             | out a potentially unreliable source and a lazy dismissal,
             | but if there's reasonable doubt I think it's good for HN.
             | If the doubt isn't reasonable, it will be torn apart by
              | other commenters, and then it's an explicit discussion
              | that people can read and decide on.
        
             | cryptonector wrote:
              | If you give such comments a lot of credence without doing
              | your own verification, then you open yourself to what is
              | essentially a social denial-of-service attack.
        
           | renewiltord wrote:
           | It's really popular online. I think that's because many
           | people here read a lot of this content but don't actually
           | have the skill or background to do analysis. So they give us
           | history rather than examination. Which has some value, I
           | suppose.
        
         | rpdillon wrote:
         | This comment reads like real scientific skepticism, but from my
          | recollection of events, it is more of a hit piece that takes
          | what should be a technical discussion and drags in a bunch
          | of personal baggage. In particular:
         | 
         | > HN has previously fallen for a claim by the same author to
         | have reduced llama.cpp memory usage for a dense model way below
         | the size of the model,
         | 
         | is not true at all. Someone else made the claims about 6GB RAM
         | usage for a 30B model, I remember reading it at the time and
         | thinking "Yeah, that doesn't make sense, but the loading time
         | improvement is immense!" And it was - I run all my LLMs locally
         | on CPU because I don't have dedicated hardware, and jart's work
         | has improved usability a lot.
         | 
         | > and it's hard to overstate the degree of social pressure that
         | needed to be overcome at the time for the skeptic position to
         | reach fixation
         | 
         | I was reading the same HN discussions you were at the time, and
         | it was pretty trivial to see that the loading time claim held
         | up, and the RAM claim was dubious and likely simply due to not
         | understanding some effect of the change completely. Heck,
         | jart's own discussion of the topic reflected this at the time.
         | 
         | For the current change, I feel like your comment is even more
         | misplaced. The blog post linked to for this story has a huge
         | amount of detail about performance on specific processors
         | (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with
         | specific models. So when you say:
         | 
         | > It would be good to see some independent verification of this
         | claim.
         | 
         | What I hear is "4bpp thinks there's a real risk the numbers in
         | the linked post are fabricated, and jart is just trying to get
         | attention."
         | 
         | And that doesn't seem reasonable at all, given the history of
         | her work and the evidence in front of us.
        
           | throwup238 wrote:
           | I distinctly remember most of the people in the comments
           | misunderstanding kernel memory paging or learning about it
           | for the first time.
           | 
           | It genuinely did make llama.cpp a lot more usable at the
           | time.
        
           | 4bpp wrote:
            | The loading time improvements largely held up, and on
            | balance the mmap contribution was ultimately good (though
            | the way it was implemented was really quite problematic, as
            | a matter of process and communication). However, as I point
            | out in https://news.ycombinator.com/item?id=39894542, JT
            | quite unambiguously did try to cash in on the "low memory
            | usage" claim - uncritically reprinting positive claims by
            | others about your own work that otherwise would have been
            | largely invisible should really not be treated differently
            | from making those claims yourself.
           | 
           | I do think that there is a real risk that the numbers are
           | wrong (not necessarily "fabricated", as this implies
           | malfeasance, but possibly based on an erroneous measurement
           | insufficiently questioned due to an excess of trust from
           | themselves and others, as the mmap ones were). This is also
           | in part based on the circumstance that at the time (of the
           | mmap story, and myself being more involved in the project) I
           | was actually involved in trying to optimise the SIMD linear
           | algebra code, and unless llama.cpp has since switched to a
           | significantly less performant implementation the proposition
           | that so much more performance could be squeezed out strikes
           | me as quite surprising. Here, your intuitions may say that
           | Justine Tunney is just so brilliant that they make the
           | seemingly impossible possible; but it was exactly this
           | attitude that at the time made it so hard to evaluate the
           | mmap memory usage claims rationally and turned the discussion
           | around it much more dysfunctional than it had to be.
        
         | larodi wrote:
          | All the core llama.cpp devs are superstar devs and 10x devs,
          | or whatever you want to call a super smart person who is
          | also super productive and very good with applied calculus.
          | Jart is very evidently smart, but their relationship with
          | this project was not without turbulence, and at present they
          | (jart) are not a core dev of llama.cpp. So for a while a lot
          | of their (I'd like to write "her", but not sure if that's
          | correct) actions seemed to be aimed at getting attention,
          | and perhaps particularly the attention of the same folk.
          | 
          | By contrast, ggerganov, slaren, and JohannesGaessler seem to
          | have never chased this sensationalist superstar status, and
          | instead let their work speak for them. You'll barely find
          | comments by these people on HN, while jart every so often
          | finds a way to surface on HN. And this behaviour on jart's
          | part now bears fruit - for example, Phoronix's Michael
          | Larabel would praise jart for their work on llamafile while
          | completely glossing over the fact that it is largely based
          | on the wonderful work of ggerganov et al.
        
           | __turbobrew__ wrote:
            | When they claimed to drastically improve memory utilization
            | through the use of memory maps, despite not doing so, and
            | then started a huge controversy which derailed the project,
            | I would say they were a 0.1x dev, not a 10x dev.
        
       | s_Hogg wrote:
       | I'd pay good money to watch jart in conversation with Carmack
        
         | Solvency wrote:
         | Carmack is great but completely irrelevant here. He missed the
         | entire AI/LLM/ML boat to help Zuckerberg hawk virtual reality
         | fantasies for years.
        
           | vinkelhake wrote:
            | _Completely irrelevant_ is probably overstating it. He's
            | been working on AI for the last 4+ years.
        
             | cactusplant7374 wrote:
             | He's striving for AGI though, right? So he's not really
             | working on anything because he certainly hasn't discovered
             | AGI.
        
             | Solvency wrote:
             | He literally squandered the last 10 years of his life
             | working on _absolutely nothing_ for Zuckerberg. And only
              | after the rest of the world innovated on AI (transformers,
              | etc.) did he clearly feel embarrassed and have to proclaim
              | that he's going to focus on AGI in a "one-up" way.
        
               | talldayo wrote:
               | > He literally squandered the last 10 years of his life
               | working on absolutely nothing
               | 
               | Speak for yourself, the Oculus Quest is the coolest piece
               | of sub-$500 tech in my home.
        
               | fkyoureadthedoc wrote:
               | He got paid a lot to do something he was presumably
               | passionate about and enjoyed. It also might surprise you
               | to find out that there's quite a lot of people that just
               | work as a means to an end, and find value and enjoyment
               | primarily from other parts of their life.
        
               | Solvency wrote:
                | That's great for him. I'm glad he enjoyed the $$$
                | playing with VR. That has nothing to do with my point
                | about his irrelevance to this LLaMA discussion.
        
               | talldayo wrote:
               | He's not irrelevant, though. Literally the first thing he
               | did after leaving Meta was start an AI business, and the
               | original point wasn't even necessarily about AI. They
               | just said they wanted to see two engineers in
               | conversation, and you used it as an opportunity to
                | denigrate one of their previous employers. _That's_
                | bewilderingly irrelevant.
        
           | cactusplant7374 wrote:
            | Altman isn't even relevant here. He is focusing on LLMs
           | instead of a framework that gets us to AGI. He can't describe
           | how we get there or any such theories around AGI. It's a
           | complete failure.
        
         | objektif wrote:
         | Why is he even relevant? What makes you believe that he would
          | be good at solving AI-related problems? He is a developer,
          | right?
        
         | s_Hogg wrote:
         | To carry on, this is because they're both very interested in
         | "knowledge in depth", rather than because of what they actually
         | work on day-to-day. They've both made careers out of knowing
         | what's going on with the thing they're building down to the
         | most basic level possible.
        
       | m3kw9 wrote:
        | So is Nvidia in trouble now because Intel CPUs can be used
        | instead for faster/cheaper inference?
        
       | tubs wrote:
        | The RAM is not on the CPU die on a Mac. It's in the same
        | package, but it's still regular LPDDR memory chips.
        
       | marshallward wrote:
       | There is an implication here that the Fortran implementation of
       | `SGEMM` is somehow inadequate. But any modern Fortran compiler
       | will quite easily apply the AVX and FMA optimizations presented
       | here without any additional changes. Both GNU and Intel make
       | these substitutions with the correct flags.
       | 
       | The unrolling optimization is also just another flag away
       | (`-funroll-all-loops`). The Intel Compiler will even do this
       | without prompting. In fact, it appears to only do a modest 2x
       | unroll on my machine, suggesting that the extreme unroll in this
       | article would have been overkill.
       | 
        | Parallelization is certainly a lot to ask of Fortran 77
        | source, but there is little stopping you from adding OpenMP
        | statements to the `SGEMM` function. In fact, modern Fortran
        | even offers its own parallelization constructs if you're
        | willing to go there.
       | 
       | Which is to say: Let's not belittle this old Fortran 77 function.
       | Yes it is old, and does not even resemble modern Fortran. But the
       | whole point of Fortran is to free the developer from these
       | platform-specific details, and hand the job off to the compiler.
       | If you don't like that approach, then you're welcome to go to C
       | or C++. But this little block of Fortran code is already capable
       | of doing just about everything in this article.
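        | To make the claim concrete, here is the same idea rendered as
        | a C++ sketch (the Fortran case is analogous): the naive triple
        | loop, aggressive compiler flags (e.g. -O3 -march=native plus
        | the unrolling flag mentioned above), and a single OpenMP
        | pragma. Whether this actually reaches tuned-BLAS speed is the
        | question debated below.
        | 
        |     #include <cstddef>
        | 
        |     void sgemm_naive(const float* A, const float* B, float* C,
        |                      std::size_t m, std::size_t n,
        |                      std::size_t k) {
        |       // One directive parallelizes the outer loop; the
        |       // compiler is left to vectorize and unroll the rest.
        |       #pragma omp parallel for
        |       for (std::ptrdiff_t i = 0;
        |            i < static_cast<std::ptrdiff_t>(m); ++i)
        |         for (std::size_t j = 0; j < n; ++j) {
        |           float acc = 0.0f;
        |           for (std::size_t l = 0; l < k; ++l)
        |             acc += A[i * k + l] * B[l * n + j];
        |           C[i * n + j] = acc;
        |         }
        |     }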
        
         | steppi wrote:
         | The Fortran implementation is just a reference implementation.
         | The goal of reference BLAS [0] is to provide relatively simple
         | and easy to understand implementations which demonstrate the
         | interface and are intended to give correct results to test
          | against. Perhaps an exceptional Fortran compiler which
          | doesn't yet exist could generate code that rivals hand-tuned
          | (or automatically tuned) BLAS libraries like OpenBLAS [1],
          | MKL [2], ATLAS [3], and those based on BLIS [4], but in
          | practice this is not observed.
         | 
         | Justine observed that the threading model for LLaMA makes it
         | impractical to integrate one of these optimized BLAS libraries,
         | so she wrote her own hand-tuned implementations following the
         | same principles they use.
         | 
         | [0]
         | https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
         | 
         | [1] https://github.com/OpenMathLib/OpenBLAS
         | 
         | [2]
         | https://www.intel.com/content/www/us/en/developer/tools/onea...
         | 
         | [3]
         | https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...
         | 
          | [4] https://en.wikipedia.org/wiki/BLIS_(software)
        
           | marshallward wrote:
           | Fair enough, this is not meant to be some endorsement of the
           | standard Fortran BLAS implementations over the optimized
           | versions cited above. Only that the mainstream compilers
           | cited above appear capable of applying these optimizations to
           | the standard BLAS Fortran without any additional effort.
           | 
           | I am basing these comments on quick inspection of the
           | assembly output. Timings would be equally interesting to
           | compare at each stage, but I'm only willing to go so far for
           | a Hacker News comment. So all I will say is perhaps let's
           | keep an open mind about the capability of simple Fortran
           | code.
        
             | steppi wrote:
             | Check out _The Science of Programming Matrix Computations_
              | by Robert A. van de Geijn and Enrique S. Quintana-Ortí.
              | Chapter 5 walks through how to write an optimized GEMM. It
              | involves clever use of block multiplication, choosing block
              | sizes for optimal cache behavior on specific chips. Modern
              | compilers just aren't able to do such things now. I've
             | spent a little time debugging things in scipy.linalg by
             | swapping out OpenBLAS with reference BLAS and have found
             | the slowdown from using reference BLAS is typically at
             | least an order of magnitude.
             | 
             | [0] https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
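              | A bare-bones sketch of the cache-blocking idea from that
              | chapter (block sizes are illustrative, not tuned for any
              | particular chip; C is assumed zero-initialized):
              | 
              |     #include <algorithm>
              |     #include <cstddef>
              | 
              |     void sgemm_blocked(const float* A, const float* B,
              |                        float* C, std::size_t m,
              |                        std::size_t n, std::size_t k) {
              |       constexpr std::size_t MB = 64, NB = 64, KB = 64;
              |       for (std::size_t i0 = 0; i0 < m; i0 += MB)
              |         for (std::size_t j0 = 0; j0 < n; j0 += NB)
              |           for (std::size_t l0 = 0; l0 < k; l0 += KB)
              |             // Work on one MB x KB block of A against one
              |             // KB x NB block of B, so the working set
              |             // stays resident in cache while it is reused.
              |             for (std::size_t i = i0;
              |                  i < std::min(i0 + MB, m); ++i)
              |               for (std::size_t l = l0;
              |                    l < std::min(l0 + KB, k); ++l) {
              |                 const float a = A[i * k + l];
              |                 for (std::size_t j = j0;
              |                      j < std::min(j0 + NB, n); ++j)
              |                   C[i * n + j] += a * B[l * n + j];
              |               }
              |     }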
        
               | marshallward wrote:
               | You are right, I just tested this out and my speed from
               | BLAS to OpenBLAS went from 6 GFLOP/s to 150 GFLOP/s. I
               | can only imagine what BLIS and MKL would give. I
               | apologize for my ignorance. Apparently my faith in the
               | compilers was wildly misplaced.
        
         | pklausler wrote:
         | Modern Fortran's only parallel feature is coarrays, which
         | operate at the whole program level.
         | 
         | DO CONCURRENT is a serial construct with an unspecified order
         | of iterations, not a parallel construct. A DO CONCURRENT loop
         | imposes requirements that allow an arbitrary order of
         | iterations but which are not sufficient for safe
         | parallelization.
        
           | marshallward wrote:
            | How do you feel about Nvidia endorsing DO CONCURRENT
            | migration to GPUs? Would that be classified as
            | parallelization?
        
         | brrrrrm wrote:
         | using AVX/FMA and unrolling loops does extremely little in the
         | way of compiling to fast (>80% peak) GEMM code. These are very
         | much intro steps that don't take into account _many_ important
         | ideas related to cache hierarchy, uop interactions, and even
         | instruction decode time. The Fortran implementation is entirely
         | and unquestionably inadequate for real high performance GEMMs.
        
           | marshallward wrote:
           | I don't disagree, but where are those techniques presented in
           | the article? It seems like she exploits the particular shape
           | of her matrix to align better with cache. No BLAS library is
           | going to figure that out.
           | 
           | I am not trying to say that a simple 50+ year old matrix
           | solver is somehow competitive with existing BLAS libraries.
           | But I disagreed with its portrayal in the article, which
            | associated the block with NumPy performance. Give that to a
            | 2024 Fortran compiler, and it's going to get enough right
            | to produce reasonable machine code.
        
           | marshallward wrote:
            | I just compared Intel-compiled reference BLAS against
            | OpenBLAS, and it was about 6 GFLOP/s vs 150 GFLOP/s, so I
            | must admit that I was wrong here. Maybe in some sense 4%
            | is not bad, but it's
           | certainly not good. My faith in current compilers has
           | certainly been shattered quite a bit today.
           | 
           | Anyway, I have come to eat crow. Thank you for your insight
           | and helping me to get a much better perspective on this
           | problem. I mostly work with scalar and vector updates, and do
           | not work with arrays very often.
        
       | hrkfmud50k wrote:
       | > It's clearly optimal since my CPU is listed as only being
       | capable of going 780 gigaflops
       | 
        | 780 GFLOPS is the iGPU spec. Is this a valid comparison?
       | 
       | https://nanoreview.net/en/cpu/intel-core-i9-14900k
        
       | arendtio wrote:
        | Does anyone else see llamafile being run through Wine on Linux?
       | 
       | Edit: After the download I did a simple chmod +x
       | llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile
        
         | jart wrote:
          | There's a simple fix for that.
          | 
          |     sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
          |     sudo chmod +x /usr/bin/ape
          |     sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
          |     sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
         | 
         | https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-fil...
        
       | yieldcrv wrote:
        | Note: this is "goes faster on CPUs than before", not "faster
        | than GPUs".
        
       | TimPC wrote:
        | Strange title. On my first read I thought the author was
        | arguing the model is now faster on CPU than GPU. It would be
        | much nicer if they titled this something closer to
        | "Performance Improvement for LLaMA on CPU".
        
         | utopcell wrote:
         | Same here.
        
       | ein0p wrote:
       | As someone who has tried to beat MKL-DNN, and was unsuccessful at
       | doing so even for constrained matrix sizes, I'm curious how they
       | pulled off such a massive improvement.
       | 
       | But as someone who routinely estimates picojoules per flop at
       | $DAY_JOB - there's simply no way this is energy efficient. That
       | is not even physically possible with a CPU.
        
         | janwas wrote:
         | I think the previous code was using dot products, and f32
         | instead of bf16.
        
       | aaronscott wrote:
       | > I like to define my subroutines using a modern language like
       | C++, which goes 47 gigaflops. This means C++ is three orders of
       | magnitude faster than Python. That's twenty years of progress
       | per Moore's law.
       | 
       | This is great. I love the idea of measuring performance
       | differences in "years of Moore's law."
       | 
       | Twenty years puts the delta in an easy to understand framework.
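       | 
       | Rough working for the figure, assuming the classic two-year
       | doubling period:
       | 
       |     \[
       |       10^{3} \approx 2^{10} \;\Rightarrow\;
       |       10\ \text{doublings} \times 2\ \text{years/doubling}
       |       = 20\ \text{years}
       |     \]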
        
         | JohnKemeny wrote:
         | I doubt that you could get Python to run faster than C++ on
         | 2004 hardware.
        
           | mrtranscendence wrote:
           | Python on 2024 hardware vs C++ on 2004 hardware ... I don't
           | think it's obvious that C++ always wins here, though it would
           | depend on the use case, how much of the Python is underpinned
           | by native libraries, and the specific hardware in question.
        
             | JohnKemeny wrote:
             | If we allow native libraries, it's not clear that C++ would
             | win, even on modern hardware.
        
               | michaelt wrote:
               | I think we all know that, when someone writes "C++ is
               | three orders of a magnitude faster than Python" they're
               | not including native libraries.
        
               | mrtranscendence wrote:
               | You can't _not_ include native libraries, at least if you
               | want your benchmark to be realistic. Almost every Python
               | library where performance matters is written (at least
               | partially) in a compiled language.
        
               | bornfreddy wrote:
               | Yes, but many people like the sound of "X-times faster
               | than Python" while conveniently forgetting that the same
               | thing can be (and usually is) done in Python + numpy &
               | co. even faster.
               | 
               | I have come to appreciate "slowness" of Python. It trades
               | speed for legibility, which is a great compromise once
               | you have _really fast_ native libraries one import away.
               | Best of both worlds.
        
           | bevekspldnw wrote:
           | Honestly depends on what you are doing. Most of my python
           | work is data collection and analysis on top of Postgres.
           | 
           | Being smart in how I use Postgres indexing (and when to
           | disable it outright) has more performance impact than the
           | actual language doing the plumbing.
        
       | rbnsl wrote:
       | Definitely wild that we're in the timeline where you can run a
       | 1.1 bn param model on a Raspberry Pi, but it's still tough to
       | justify because the 1.1 is kinda useless compared to the
       | beefier models. Sick for home builds/hobbyists though; I might
       | wanna get one of the new Pis just to try this out.
        
       | JohnnyHerz wrote:
       | Awesomeness. Thank you for sharing!
        
       ___________________________________________________________________
       (page generated 2024-04-01 23:01 UTC)