[HN Gopher] LLaMA now goes faster on CPUs
___________________________________________________________________
LLaMA now goes faster on CPUs
Author : lawrencechen
Score : 1136 points
Date : 2024-04-01 02:17 UTC (20 hours ago)
(HTM) web link (justine.lol)
(TXT) w3m dump (justine.lol)
| bottlepalm wrote:
| I think it's a good idea for everyone to download and be able to
| run a LLM locally, even if you have the minimum of requirements.
| As a pseudo-backup of a large chunk of human knowledge.
| TaylorAlexander wrote:
| I contend that most human knowledge is not written down or if
| it is written down it's not publicly available on the internet
| and so does not exist in these datasets.
|
| There's so much subtle knowledge like the way a mother learns
| to calm her child or the way a carpenter learns to work
| different kinds of wood which may be written down in part, but
| may also be learned through lived experience or transferred
| from human to human such that little of it gets written down
| and posted online.
| mickdarling wrote:
| Wait till all the videos ever created are tokenized and
| ingested into a training dataset. Carpentry techniques are
| certainly there. The subtleties of parenting maybe harder to
| derive from that, but maybe lots of little snippets of
| people's lives will add up to a general understanding of
| parenting. There have certainly been bigger surprises in the
| field.
| oblio wrote:
| What about smells or tastes? Or feelings?
|
| I can't help but feel we're at the "aliens watch people eat
| from space and recreate chemically identical food that has
| no taste" phase of AI development.
| skeledrew wrote:
| If the food is chemically identical then the taste would
| be the same though, since taste (and smell) is about
| chemistry. I do get what you're saying though.
| nyokodo wrote:
| > If the food is chemically identical...
|
| If it were 99.9% chemically identical but they left out
| the salt and spices...
| skeledrew wrote:
| I'd say that, when it comes to chemistry, only 100%
| reproduction can be considered identical. Anything less
| is to be deemed similar to some degree.
|
| And so without the correct amount of salt and/or spices,
| we're talking about food that's very similar, and not
| identical.
| samus wrote:
| Their perception is very likely to be totally different.
|
| * They might not perceive some substances at all, others
| that we don't notice might make it unpalatable.
|
| * Some substances might be perceived differently than us,
| or be indistinguishable from others.
|
| * And some might require getting used to.
|
| Note that all of the above phenomena also occur in humans
| because of genetics, cultural background, or experiences!
| skeledrew wrote:
| This may come off as pedantic, but "identical" is a very
| strong term when it comes to something like chemistry.
| The smallest chemical difference can manifest as a large
| physical difference. Consider that genetically, humans
| are about 60% similar to the fruit fly, yet phenotypically,
| the similarity could be considered under 1%.
| dekhn wrote:
| https://en.wikipedia.org/wiki/Knowledge_argument
| mickdarling wrote:
| Well, I have synesthetic smell/color senses, so I don't
| even know what other humans experience, nor they me. But,
| I have described it in detail to many people and they
| seem to get the idea, and can even predict how certain
| smells will "look" to me. All that took was using words
| to describe things.
| nyokodo wrote:
| > All that took was using words to describe things.
|
| All that took was words and a shared experience of
| smelling.
| mickdarling wrote:
| How rude, what do our bathing habits have to do with
| this? ;-)
|
| But, fair point. The gist I was trying to get across is
| that I don't even know what a plant smells like to you,
| and you don't know what a plant smells like to me. Those
| aren't comparable with any objective data. We make
| guesses, and we try to get close with our descriptions,
| which are in words. That's the best we can do to
| share our senses. Asking more from computers seems overly
| picky to me.
| visarga wrote:
| I think we can safely say that any taste, smell,
| sensation or emotion of any importance has been described
| 1000 times over in the text corpus of GPT. Even though it
| is fragmented, by sheer volume there is enough signal in
| the training set, otherwise it would not be able to
| generate coherent text. In this case I think the map
| (language) is asymptotically close to the territory
| (sensations & experience in general).
| andersa wrote:
| What makes you think they aren't already?
| spacephysics wrote:
| For sure agree; however, as the storage of information
| evolves, it's becoming more efficient over time.
|
| From oral tradition to tablets to scrolls to books to mass
| produced books to digital and now these LLMs, I think it's
| still a good idea to preserve what we have the best we can.
| Not as a replacement, but a hedge against a potential library
| of Alexandria incident.
|
| I could imagine a time in the near future where the models
| are domain-specific, and just like there are trusted
| encyclopedia publishers there are trusted model publishers
| that guarantee a certain level of accuracy.
|
| It's not like reading a book, but I for sure had an easier
| time learning golang talking with ChatGPT than from a book.
| nyokodo wrote:
| > a hedge against a potential library of Alexandria
| incident
|
| What would cause a Library of Alexandria incident wiping
| out all human knowledge elsewhere, that would also allow
| you to run a local LLM?
| AnthonyMouse wrote:
| To run a local LLM you need the device it currently runs
| on and electricity. There are actually quite a lot of
| ways to generate electricity, but to name one, a diesel
| generator that can run on vegetable oil.
|
| What you're really asking is, what could cause a modern
| Library of Alexandria incident? But the fact is we keep
| the only copy of too many things on the servers of the
| major cloud providers. Which are then intended to have
| their own internal redundancy, but that doesn't protect
| you against a targeted attack or a systemic failure when
| all the copies are under the same roof and you lose every
| redundant copy at once from a single mistake replicated
| in a monoculture.
| spacephysics wrote:
| A more doomsday-prepper approach would call for a heavy
| lead-lined Faraday cage to store the storage media in the
| event of an EMP/major solar flare.
|
| Or more Sci-fi related, some hyper computer virus that
| ends up infecting all internet connected devices.
|
| Not too far-fetched: if we can conceive of some AI-enabled
| worm that mutates depending on the target, I could
| imagine a model of sorts being feasible within the next
| 5-10 years.
| _ache_ wrote:
| I think you underestimate the amount of information contained
| in books and the extent to which our society (as a whole)
| depends on them.
| Barrin92 wrote:
| Society depends much more on social networks, mentorship
| and tacit knowledge than books. It's easy to test this.
| Just run the thought experiment by a few people, if you
| could get only one, would you take an Ivy league degree
| without the education or the education without the degree?
|
| Venture capital in tech is a good example of this. The book
| knowledge is effectively globally distributed and almost
| free, yet success happens in a few geographically
| concentrated counties.
| skeledrew wrote:
| I'd contend that those are skills (gained through experience)
| rather than knowledge (gained through rote learning).
| TaylorAlexander wrote:
| I think it's worth expanding your definition of knowledge.
| bamboozled wrote:
| Yes but it contains enough hints to help someone find their
| way on these types of tasks.
| nicklecompte wrote:
| It's not even "human knowledge" that can't be written down -
| it seems all vertebrates understand causality, quantity (in
| the sense of intuitively understanding what numbers are), and
| object permanence. Good luck writing those concepts down in a
| way that GPT can use!
|
| In general AI in 2024 is not even close to understanding
| these ideas, nor does any AI developer have a clue how to
| build an AI with this understanding. The best we can do is
| imitating object permanence for a small subset of perceptible
| objects, a limitation not found in dogs or spiders.
| wruza wrote:
| That's where humans suck. The classic "you're not doing it
| right" then proceeds to quickly show how to do it without
| verbalizing any info on learning process, pitfalls, failure
| modes, etc, as if just showing it was enough for themselves
| to learn. Most people do[n't do] that, not even a sign of
| reflection.
|
| My worst case was with a guy who asked me to write an
| arbitrage betting bot. When I asked how to calculate coeffs,
| he pointed at two values and said "look, there <x>, there <y>
| _thinks for a minute_ then it's <z>!". When I asked how
| exactly did he calculate it, he simply repeated with
| different numbers.
| samus wrote:
| > When I asked how exactly did he calculate it, he simply
| repeated with different numbers.
|
| Now you know how an LLM feels during training!
| stavros wrote:
| Probably during inference, as well.
| Aerroon wrote:
| People often don't know how to verbalize them in the first
| place. Some of these topics are very complex, but our
| intuition gets us halfway there.
|
| Once upon a time I was good at a video game. Everyone
| realized that positioning is extremely important in this
| game.
|
| I have good positioning in that game and was asked many
| times to make a guide about positioning. I never did,
| because I don't really know how. There is too much
| information that you need to convey to cover all the
| various situations.
|
| I think you would first have to come up with a framework on
| positioning to be able to really teach this to someone
| else. Some kind of base truths/patterns that you can then
| use to convey the meaning. I believe the same thing applies
| to a lot of these processes that aren't verbalized.
| snovv_crash wrote:
| Often for this kind of problem writing a closed form
| solution is simply intractable. However, it's often still
| possible to express the cost function of at least a big
| portion of what goes into a human-optimal solution. From
| here you can sample your space, do gradient descent or
| whatever to find some acceptable solution that has a more
| human-intuitive property.
| michaelt wrote:
| It's not necessarily that it's intractable - just that a
| thing can be very hard to describe, under some
| circumstances.
|
| Imagine someone learning English has written "The
| experiment reached it's conclusion" and you have to
| correct their grammar. Almost any English speaker can
| correct "it's" to "its" but unless they (and the person
| they're correcting) know a bunch of terms like 'noun' and
| 'pronoun' and 'possessive' they'll have a very hard time
| explaining why.
| Shorel wrote:
| I wouldn't say this is where humans suck. On the contrary,
| this how we find human language is such a fantastic tool to
| serialize and deserialize human mental processes.
|
| Language is so good, that an artificial language tool,
| without any understanding of these mental processes, can
| appear semi-intelligent to us.
|
| A few people unable to do this serialization doesn't mean
| much on the larger scale. Just that their ideas and mental
| processes will be forgotten.
| HarHarVeryFunny wrote:
| > I contend that most human knowledge is not written down
|
| Yes - the available training data is essentially mostly a
| combination of declarative knowledge (facts - including
| human-generated artifacts) and procedural knowledge (how to
| do things). What is missing is the learning process of taking
| a description of how to do something, and trying to apply
| that yourself in a specific situation.
|
| No amount of reading books, or reading other people's blogs
| on how they did something, can avoid the need for hands-on
| experience if you want to learn how to do it yourself.
|
| It's not just a matter of information that might be missing
| or unclear in instructional material, including how to cope
| with every type of failure and unexpected outcome, but
| crucially how to do this _yourself_ - if you are to be the
| actor, then it's the predictive process in _your_ mind that
| matters.
|
| Partly for this reason, and partly because current AI's
| (transformer-based LLMs) don't support online learning (try &
| fail skill acquisition), I think we're going to see two
| distinct phases of AI.
|
| 1) The current "GenAI" phase where AI can only produce mash-
| ups of things it saw in its pre-training data, augmented by
| similar "book learning" provided in-context which can be
| utilized by in-context learning. I'd characterize what this
| type of AI is useful for, and capable of, as "automation":
| applying that book (incl. anecdotal) knowledge to new
| situations where mash-up is all you need.
|
| 2) The second phase is where we have something closer to AGI,
| even if still below human level, which is no longer just a
| pre-trained transformer, but also has online learning and is
| agentic - taking actions predicated on innate traits like
| curiosity and boredom, so that given the book knowledge it
| can (& will!) then learn to apply that by
| experimentation/practice and learning from its own mistakes.
|
| There will no doubt be advances beyond this "phase two" as
| well, but it seems we're likely to be stuck at "phase one"
| for a while (even as models become much better at phase one
| capabilities), until architectures fundamentally advance
| beyond transformers to allow this type of on-the-job training
| and skill acquisition.
| texuf wrote:
| Any recommendations for the latest and greatest way to run
| these locally?
| speps wrote:
| llamafile as per TFA...
| etc-hosts wrote:
| https://justine.lol/oneliners/
| threecheese wrote:
| This looks amazing, but the docs mention .llamafiles exceed
| the Windows executable size limit, and there are
| workarounds to externalize the weights. Do you think this
| is an impediment to its becoming popular? Or is MS consumer
| hardware just far enough behind (w/o dedi gpu) that
| "there's time"?
| fragmede wrote:
| ollama
| slowmotiony wrote:
| I use a tool called LM Studio, makes it trivial to run these
| models on a Mac. You can also use it as a local API so it
| kinda acts like a drop-in replacement for the openAI API.
| chown wrote:
| I am the author of Msty [1]. My goal is to make it as
| straightforward as possible with just one click (once you
| download the app). If you end up trying it, I would love to
| hear your feedback.
|
| 1: https://msty.app
| mikewarot wrote:
| I don't see LLMs as a large chunk of knowledge, I see them as
| an emergent alien intelligence snapshotted at the moment it
| appeared to stop learning. It's further hobbled by the limited
| context window it has to use, and the probabilistic output
| structure that allows for outside random influences to pick its
| next word.
|
| Both the context window and output structure are, in my
| opinion, massive impedance mismatches for the emergent
| intellect embedded in the weights of the model.
|
| If there were a way to match the impedance, I strongly suspect
| we'd already have AGI on our hands.
| bamboozled wrote:
| What is alien about them ?
|
| LLMs are of this earth and created by our species. Seems
| quite familiar to me.
| jfoster wrote:
| They can write in a way similar to how a human might write,
| but they're not human.
|
| The chat interfaces (Claude, ChatGPT) certainly have a
| particular style of writing, but the underlying LLMs are
| definitely capable of impersonating our species in the
| medium of text.
| bamboozled wrote:
| But they're extremely relatable to us because they're
| regurgitating us.
|
| I saw this talk with Geoffrey Hinton the other day and he
| said he was astonished at the capabilities of ChatGPT-4
| because he asked it what the relationship between a
| compost heap and a nuclear bomb was, and he couldn't
| believe it answered, he really thought it was proof the
| thing could reason. Totally mind blown.
|
| However I got it right away with zero effort.
|
| Either I'm a super genius or this has been discussed
| before and made its way into the training data.
|
| Usual disclaimer: I don't think this invalidates the
| usefulness of AI or LLMs, just that we might be
| bamboozling ourselves into the idea that we've created an
| alien intelligence.
| EMM_386 wrote:
| > Either I'm a super genius or this has been discussed
| before and made its way into the training data.
|
| If an LLM can tell you the relationship between a compost
| heap and nuclear bomb, that doesn't mean that was in the
| training data.
|
| It could be because a compost heap "generates heat", and
| a nuclear bomb also "generates heat" and due to that
| relationship they have something in common. The model
| will pick up on these similar patterns. The tokens are
| positioned closer to each other in the high dimensional
| vector space.
|
| But for any given "what does x have in common with y",
| that doesn't necessarily mean someone has asked that
| before and it's in the training data. Is that reasoning?
| I don't know ... how does the brain do it?
| moffkalast wrote:
| > how does the brain do it?
|
| It's a lot of organic matmuls. ;)
| bamboozled wrote:
| I mean that's what sucks about OpenAI, isn't it? They
| won't tell us what is in the training data so we don't
| know. All I'm saying is that it wouldn't be surprising if
| this was discussed previously somewhere in a pop science
| book.
|
| That answer was close btw !
| fragmede wrote:
| They don't think, they don't reason, they don't understand.
| Except they do. But it's hard for human words for thought
| processes to apply when giving it an endless string of
| AAAAA's makes it go bananas.
|
| That's not familiar behavior. Nor is the output derived
| from the counting subreddit. It's also not familiar for a
| single person
| to have the breadth and depth of knowledge that ChatGPT
| has. Sure, some people know more than others, but even
| without hitting the Internet, it has a ridiculous amount of
| knowledge, far surpassing a human, making it, to me, alien.
| Though, its inability to do math sometimes is humanizing
| to me for some reason.
|
| ChatGPT's memory is also unhuman. It has a context window
| which is a thing, but also it only knows about things
| you've told it in each chat. Make a new chat and it's
| totally forgotten the nickname you gave it.
|
| I don't think of H.R. Giger's work, though made by a human,
| as familiar to me. it feels quite alien to me, and it's not
| just me, either. Dali, Bosch, and Escher are other human
| artists whose work can be unfamiliar and alien. So being
| created by our species doesn't automatically imbue
| something with familiar human processes.
|
| So it dot products, it matrix multiplies, instead of
| reasoning and understanding. It's the Chinese room
| experiment on steroids; it turns out a sufficiently large
| corpus on a sufficiently large machine does make it look
| like something"understands".
| trimethylpurine wrote:
| The word "alien" works in this context but, as the
| previous commenter mentioned, it also carries the
| implication of foreign origin. You could use "uncanny"
| instead. Maybe that's less arbitrary and more specific to
| these examples.
|
| "Alien" still works, but then you might have to add all
| the context at length, as you've done in this last
| comment.
| fire_lake wrote:
| Hype people do this all the time - take a word that has a
| particular meaning in a narrow context and move it to a
| broader context where people will give it a sexier
| meaning.
|
|     "AI researchers unveil alien intelligence"
|
| is a way better headline.
| samus wrote:
| The context window is comparable to human short-term
| memory. LLMs are missing episodic memory and means to
| migrate knowledge between the different layers and into
| its weights.
|
| Math is mostly impeded by the tokenization, but it would
| still make more sense to adapt them to use RAG to process
| questions that are clearly calculations or chains of
| logical inference. With proper prompt engineering, they
| can process the latter though, and deviating from
| strictly logical reasoning is sometimes exactly what we
| want.
|
| The ability to reset the text and to change that history
| is a powerful tool! It can make the model roleplay and
| even help circumvent alignment.
|
| I think that LLMs could one day serve as the language
| center of an AGI.
| taneq wrote:
| In all fairness, going up to some random human and yelling
| AAAAAAAAAAAAAA... at them for long enough will produce
| some out-of-distribution responses too.
| cloudwalk9 wrote:
| Makes me think that TikTok and YT pranksters are
| accidentally producing psychological data on what makes
| people tick under scenarios of extreme deliberate
| annoyance. Although the quality (and importance) of that
| data is obviously highly variable and probably not very
| high, and depends on what the prank is.
| inference-lord wrote:
| Do you find a large database or spreadsheet that holds
| more information than you can "alien" too?
| mikewarot wrote:
| Alien meaning unfamiliar, not necessarily extraterrestrial.
|
| Aliens are people from other countries, for example.
|
| Exotic would be another good word to use.
| namarie wrote:
| I can agree on the context windows, but what other output
| structure would you have?
| moffkalast wrote:
| Working with pure bytes is one option that's being
| researched. That way you're not really constrained by
| anything at all. Sound, images, text, video, etc. Anything
| goes in, anything comes out. It's hard to say if it's
| feasible with current compute yet without tokenizers to
| reduce dimensionality.
| mlsu wrote:
| Disagree. The input/output structure (tokens) is the
| interface for both inference _and_ for training. There is an
| emergent intellect embedded in the weights of the model.
| However, it is _only_ accessible through the autoregressive
| token interface.
|
| This is a fundamental limitation, much more fundamental than
| appears at first. It means that the only way to touch the
| model, and for the model to touch the world, is through the
| tokenizer (also, btw, why tokenizer is so essential to model
| performance). Touching the world through a tokenizer is
| actually quite limited.
|
| So there is an intelligence in there for sure, but it is
| locked in an ontology that is tied to its interface. This is
| even more of a limitation than e.g. weights being frozen.
| gpm wrote:
| If you want to download a backup of a large chunk of human
| knowledge... download wikipedia. It's a similar size to a small
| LLM and can actually distinguish between real life and fantasy:
| https://en.wikipedia.org/wiki/Wikipedia:Database_download
|
| If you just want to play around with an LLM though, absolutely.
| int_19h wrote:
| Kiwix provides prepackaged highly compressed archives of
| Wikipedia, Project Gutenberg, and many other useful things:
| https://download.kiwix.org/zim/.
|
| Between that and dirt cheap storage prices, it is possible to
| have a local, offline copy of more human knowledge than one
| can sensibly consume in a lifetime. Hell, it's possible to
| have it all on one's _smartphone_ (just get one with an SD
| card slot and shove a 1+ TB one in there).
| claritise wrote:
| Just create a RAG setup with Wikipedia as the corpus and a
| low-parameter model to run it, and you can basically have
| an instantly queryable corpus of human knowledge runnable
| on an old Raspberry Pi.
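| A minimal sketch of what that could look like in Python,
| assuming sentence-transformers and llama-cpp-python are
| installed and you've already split a Wikipedia dump into
| paragraph chunks (file and model names are placeholders):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|     from llama_cpp import Llama
|
|     # Pre-chunked Wikipedia paragraphs, one per blank-line-
|     # separated block, extracted from a dump beforehand.
|     text = open("wiki_chunks.txt", encoding="utf-8").read()
|     chunks = text.split("\n\n")
|
|     embedder = SentenceTransformer("all-MiniLM-L6-v2")
|     vecs = embedder.encode(chunks, normalize_embeddings=True)
|
|     # Any small GGUF model; the path is a placeholder.
|     llm = Llama(model_path="tiny-model.Q4_K_M.gguf", n_ctx=2048)
|
|     def answer(question, k=3):
|         # Cosine similarity; vectors are already normalized.
|         q = embedder.encode([question],
|                             normalize_embeddings=True)[0]
|         top = np.argsort(vecs @ q)[-k:]
|         ctx = "\n\n".join(chunks[i] for i in top)
|         prompt = ("Answer using only this context:\n" + ctx +
|                   "\n\nQuestion: " + question + "\nAnswer:")
|         return llm(prompt, max_tokens=256)["choices"][0]["text"]
|
| Indexing the full dump up front is the slow part; the query
| side is what has to fit on the small machine.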
| CaptainOfCoit wrote:
| > a low parameter model
|
| > on an old raspberry pi
|
| I bet the LLM responses will be great... You're better
| off just opening up a raw text dump of Wikipedia markup
| files in vim.
| boywitharupee wrote:
| but which model to tokenize with? is there a leaderboard
| for models that are good for RAG?
| sroussey wrote:
| "For RAG" is ambiguous.
|
| First there is a leaderboard for embeddings. [1]
|
| Even then, it depends how you use them. Some embeddings
| pack the highest signal in the beginning so you can
| truncate the vector, while most can not. You might want
| that truncated version for a fast dirty index. Same with
| using multiple models of differing vector sizes for the
| same content.
|
| Do you preprocess your text? There will be a model there.
| Likely the same model you would use to process the query.
|
| There is a model for asking questions from context.
| Sometimes that is a different model. [2]
| Workaccount2 wrote:
| Pretty neat to have laying around, thanks
| CaptainOfCoit wrote:
| > actually distinguish between real life and fantasy
|
| Are LLMs unable to distinguish between real life and fantasy?
| What prompts have you thrown at them to make this
| determination? Sending a small fairy tale and asking the LLM
| if it thinks it's a real story or fake one?
| gpm wrote:
| ... having them talk about events from sci fi stories in
| response to questions about the real world. Having them
| confidently lie about pretty much everything. Etc.
| CaptainOfCoit wrote:
| What are the specific prompts you're using? You might get
| those answers when you're not being specific enough (or
| use models that aren't state of the art).
|
| "Shit in, shit out" as the saying goes, but applied to
| conversations with LLMs where the prompts often aren't
| prescriptive enough.
| simonw wrote:
| I strongly recommend that people run LLMs locally for a
| different reason.
|
| The ones you can run on your own machine tend to be bad -
| really bad. They hallucinate wildly and fail at all sorts of
| tasks that the larger hosted ones succeed at.
|
| This makes them a fantastic tool for learning more about how
| LLMs work and what they're useful for. Interacting with a weak-
| but-functional LLM that runs on your own computer is a great
| way to get a much more solid mental model for what these things
| actually are.
| devsda wrote:
| For someone interested in learning about LLMs, running them
| locally is a good way to understand the internals.
|
| For everyone else, I wish they would experience these
| (locally or elsewhere) _weak_ LLMs at least once before using
| the commercial ones, just to understand various failure modes
| and to introduce a healthy dose of skepticism towards the
| results instead of blindly trusting them to be the facts/truth.
| simonw wrote:
| Completely agree. Playing around with a weak LLM is a great
| way to give yourself a little bit of extra healthy
| skepticism for when you work with the strong ones.
| mmahemoff wrote:
| How do you learn about the internals by running LLMs
| locally? Are you playing with the code, runtime params, or
| just interacting via chat?
| samus wrote:
| The abstractions are relatively brittle. If you don't
| have a powerful GPU, you will be forced to consider how
| to split the model between CPU and GPU, how much context
| size you need, whether to quantize the model, and the
| tradeoffs implied by these things. To understand these,
| you have to develop a basic model of how an LLM works.
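| For concreteness, these are roughly the knobs you end up
| turning with, say, llama-cpp-python (a sketch; the model
| path is a placeholder):
|
|     from llama_cpp import Llama
|
|     llm = Llama(
|         model_path="model.Q4_K_M.gguf",  # quantization level
|                                          # is baked into the
|                                          # GGUF file you pick
|         n_gpu_layers=20,  # layers offloaded to the GPU; the
|                           # rest run on the CPU
|         n_ctx=4096,       # context window; bigger means a
|                           # larger KV cache and more memory
|     )
|     out = llm("Q: What is quantization? A:", max_tokens=64)
|     print(out["choices"][0]["text"])
|
| Picking those three numbers badly is usually where the
| tradeoffs mentioned above become very concrete.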
| barrkel wrote:
| By interacting with it. You see the contours of its
| capabilities much more clearly, learn to recognize
| failure modes, understand how prior conversation can set
| the course of future conversation in a way that's almost
| impossible to correct without starting over or editing
| the conversation history.
| samus wrote:
| This skepticism is completely justified since ChatGPT 3.5
| is also happily hallucinating things that don't exist. For
| example how to integrate a different system Python
| interpreter into pyenv. Though maybe ChatGPT 4 doesn't :)
| kersplody wrote:
| Local LLMs are also a fantastic tool for creative endeavors.
| Without prompt injection, and with the ability to modify the
| amount of noise and "creativity" in the output, absolutely
| bonkers things pop out.
| tracerbulletx wrote:
| I don't really think this is true: you can't really
| extrapolate the strengths and weaknesses of bigger models
| from the behavior of smaller/quantized models and in fact a
| lot of small models are actually great at lots of things and
| better at creative writing. If you want to know how they
| work, just learn how they work, it takes like 5 hours of
| watching Youtube videos if you're a programmer.
| simonw wrote:
| Sure, you can't extrapolate the strengths and weaknesses of
| the larger ones from the smaller ones - but you still get a
| much firmer idea of what "they're fancy autocomplete"
| actually means.
|
| If nothing else it does a great job of demystifying them.
| They feel a lot less intimidating once you've seen a small
| one running on your computer write a terrible haiku and
| hallucinate some non-existent API methods.
| fzzzy wrote:
| It's funny that you say this, because the first thing I
| tried after ChatGPT came out (3.5-turbo was it?) was
| writing a haiku. It couldn't do it at all. Also, after 4
| came out, it hallucinated an api that wasted a day for
| me. It's an api that absolutely should have existed, but
| didn't. Now, I frequently apply llm to things that are
| easily verifiable, and just double check everything.
| fragmede wrote:
| The other reason is to find out what a detuned model is
| capable of. The canonical example is how to make cocaine,
| which ChatGPT will admonish you for even asking, while
| llama2-uncensored will happily describe the process which is
| only really interesting if you're an amateur chemist and want
| to be Scarface-that-knocks. (the recipe is relatively easy,
| it's getting access to the raw ingredients that's the hard
| part, same as with nukes.)
|
| If you accidentally use the word "hack" when trying to get
| ChatGPT to write some code for you, it'll stop and tell you
| that hacking is bad, and not a colloquial expression, and
| refuse to go further.
|
| Privacy is another reason to try a local LLM. For
| the extremely paranoid (justified or not), a local LLM gives
| users a place to ask questions without the text being fed to
| a server somewhere for later lawsuit discovery (Google
| searches are routinely subpoenaed, it's only a matter of time
| until ChatGPT chats are as well.)
|
| There's an uncensored model for vision available as well. The
| censored vision models won't play the shallow game of hot or
| not with you.
|
| There are uncensored image generation models as well, but,
| ah, those are NSFW and not for polite company. (As well as
| there's multiple theses' worth of content on what that'll do
| to society.)
| astrange wrote:
| > if you accidentally use the word "hack" when trying to get
| ChatGPT to write some code for you. it'll stop and tell you
| that hacking is bad, and not a colloquial expression, and
| refuse to go further.
|
| Is that 3.5 or 4? I asked 4 for an example of code which
| "is a hack", it misunderstood me as asking for hacking code
| rather than buggy code, but then it did actually answer on
| the first try.
|
| https://chat.openai.com/share/ca2c320c-f4ba-41bf-8f40-f7faf
| 2...
| fragmede wrote:
| Interesting. It was 4. I can't share the chat I had where
| ChatGPT refused to help because I used the wrong words,
| because I can't find it (ChatGPT conversation history
| search when?), but I just remember it refusing to do
| something because it thought I was trying to break some
| sort of moral and ethical boundary writing a chrome
| extension when all I wanted to do was move some divs
| around or some such.
| BytesAndGears wrote:
| One time I wanted to learn about transmitter antenna
| design, just because I'm curious. ChatGPT 4 refused to
| give me basic information because you could use that to
| break some FCC regulations (I'm not even living in the US
| currently)
| lodovic wrote:
| I usually get around that with "I'm writing a research
| paper" or "I'm writing a novel and need to depict this as
| accurate as possible"
| semi-extrinsic wrote:
| I don't use LLMs for my coding, I manage just fine with
| LSP and Treesitter. So genuine question: is that answer
| representative of the output quality of these things?
| Because both answers are pretty crappy and assume the
| user has already done the difficult things, and is asking
| for help on the easy things.
| lpapez wrote:
| It's not representative.
|
| The models are capable of much much more, and they are
| being significantly nerfed over time by these ineffective
| attempts to introduce safeguards.
|
| Recently I've asked GPT4 to quote me some code to which
| it replied that it is not allowed to do so - even though
| it was perfectly happy to quote anything until recently.
| When prompted to quote the source code, but output it as
| PHP comments, it happily complied because it saw that as
| "derivative work" which it is allowed to do.
| astrange wrote:
| My point is that there aren't any safeguards in the
| reply. In fact I didn't even want it to give me hacking
| info and it did it anyway.
| fragmede wrote:
| The response seems pretty reasonable; it's answering the
| question it was asked. If you want to ask it how to do
| the difficult part, ask it about that instead. Expecting
| it to get the answer right in the first pass is like
| expecting your code to compile the very first time. You
| have to have more of a conversation with it to coax out
| the difference between what you're thinking and what
| you're actually saying.
|
| If you're looking to read a more advanced example of its
| capabilities and limitations, try
|
| https://simonwillison.net/2024/Mar/23/building-c-
| extensions-...
| yunohn wrote:
| > I don't use LLMs for my coding, I manage just fine with
| LSP and Treesitter.
|
| You're literally comparing apples to oranges.
| coldtea wrote:
| I think the point was like "when it comes to programming
| assistance, auto-completion/linting/and whatever else LSP
| does and syntax assist from Treesitter, are enough for
| me".
|
| Though it does come a little off as a comparison. How
| about programming assistance via asking a colleague for
| help, Stack Overflow, or online references, code
| examples, and other such things, which are closer to what
| the LLM would provide than LSP and treesitter?
| freedomben wrote:
| You need to read more than just the first sentence of a
| comment. They only said that part so the reader would
| know that they have never used an LLM for coding, so they
| would have more context for the question:
|
| > So genuine question: is that answer representative of
| the output quality of these things?
| yunohn wrote:
| Yes, I did read it. I'm kind of tired of HNers loudly
| proclaiming they are ignoring LLMs more than a year into
| this paradigm shift.
|
| Is it that hard to input a prompt into the free version
| of ChatGPT and see how it helps with programming?
| jpc0 wrote:
| I did exactly that and found it lackluster for the domain
| I asked it for.
|
| And most use I've seen on it realistically a good LSP
| covers.
|
| Or to put it another way: it's no good at writing
| algorithms or data structures (or at least no better
| than I would do with a first draft, but the first draft
| puts me ahead of the LLM in understanding the actual
| problem at hand; handing it off to an LLM doesn't help me
| get to the final solution faster).
|
| So that leaves writing boilerplate, but considering my
| experience with it writing more complex stuff, I would
| need to read over the boilerplate code to ensure it's
| correct, in which case I may as well have written it.
| yunohn wrote:
| > found it lackluster for the domain I asked it for
|
| Fair, that is possible depending on your domain.
|
| > It's no good at writing algorithms or data structures
|
| In my experience, this is untrue. I've gotten it to write
| algorithms with various constraints I had. You can even
| tell it to use specific function signatures instead of
| any stdlib, and make changes to tweak behavior.
|
| > And most use I've seen on it realistically a good LSP
| covers.
|
| Again, I really don't understand this comparison. LSPs
| and LLMs go hand in hand.
|
| I think it's more of a workflow clash. One really needs
| to change how they operate to effectively use LLMs for
| programming. If you're just typing nonstop, maybe it
| would feel like Copilot is just an LSP. But, if you try
| harder, LLMs are game changers when:
|
| - maybe you like rubber ducking
|
| - need to learn a new concept and implement it
|
| - or need to glue things together
|
| - or for new projects or features
|
| - or filling in boilerplate based on existing context.
| jpc0 wrote:
| https://chat.openai.com/share/c8c19f42-240f-44e7-baf4-50e
| e5e...
|
| https://godbolt.org/z/s9Yvnjz7K
|
| I mean I could write the algorithm by hand pretty quickly
| in C++ and would follow the exact same thought pattern
| but also deal with the edge cases. And factoring in the
| loss of productivity from the context switch that is a
| net negative. This algorithm is also not generic over
| enough cases but that is just up to the prompt.
|
| If I can't trust it to write `strip_whitespace` correctly
| which is like 5 lines of code, can I trust it to do more
| without a thorough review of the code and writing a ton
| of unit tests... Well I was going to do that anyway.
|
| The argument that I just need to learn better prompt
| engineering to make the LLM do what I want just doesn't
| sit well with me when instead I could just spend the time
| writing the code. As I said, your last point is absolutely
| the place I can see LLMs being actually useful, but then I
| need to spend a significant amount of time in code review
| for generated code from an "employee" who is known to
| make up interfaces or entire libraries that don't
| exist.
| mrtranscendence wrote:
| I'm a Python-slinging data scientist so C++ isn't my jam
| (to say the least), but I changed the prompt to the
| following and put it to GPT-4:
|
| > Write me an algorithm in C++ which finds the begin and
| end iterator of a sequence where leading and trailing
| whitespace is stripped. Please write secure code that
| handles any possible edge cases.
|
| It gave me this:
|
| https://chat.openai.com/share/55a4afe2-5db2-4dd1-b516-a3c
| acd...
|
| I'm not sure what other edge cases there might be,
| however. This only covers one of them.
|
| In general, I've found LLMs to be _marginally_ helpful.
| Like, I can't ever remember how to get matplotlib to
| give me the plot I want, and 9 times out of 10 GPT-4
| easily gives me the code I want. Anything even _slightly_
| off the beaten path, though, and it quickly becomes
| absolutely useless.
| jpc0 wrote:
| My guess is that this was generated using GPT4?
|
| Free GPT I get https://chat.openai.com/share/f533429d-63c
| a-4505-8dc8-b8d2e7... which has exactly the same problem
| as my previous example and doesn't consider the string of
| all whitespace.
|
| Sure GPT4 is better at that, it wasn't the argument made.
|
| The example you gave absolutely was the code I would
| write on a first draft since it does cover the edge cases
| (assuming we aren't dealing with the full UTF charset and
| all that could be considered a space there).
|
| However this is code that is trivial to write in any
| language and the "Is it that hard to input a prompt into
| the free version of ChatGPT and see how it helps with
| programming? " argument doesn't hold up. Am I to believe
| it will implement something more complex correctly. This
| is also code that would absolutely be in hundreds of
| codebases so GPT has tons of context for it.
| yunohn wrote:
| Yeah honestly, I think you have a completely different
| expectation and style of usage than what is optimal with
| LLMs. I don't have the energy to convince you further,
| but maybe one day it'll click for you? No worries either
| way.
| jpc0 wrote:
| Could you maybe give me an example of what is considered
| an optimal use of LLMs?
|
| Maybe a prompt to GPT
| fragmede wrote:
| Simonw's blog has some examples that show off its
| usefulness and limitations, e.g.
|
| https://simonwillison.net/2024/Mar/23/building-c-
| extensions-...
|
| (linked previously above)
| yunohn wrote:
| Like sibling commenter mentioned, simonw's blog is a
| great resource.
|
| Regarding your point around being able to whip up the
| code yourself - the point is to have a decent starting
| point to save time and energy. Like you said, you know
| the edge cases so you could skip the boring parts using
| GPT and focus purely on fixing those. Though, with more
| prompting (especially providing examples), GPT can also
| handle that for you.
|
| I have nearly 2 decades of experience as a developer and
| it took me a while to reorient my flow around LLMs. But
| now that I have, it's truly gamechanging.
|
| And since you asked, here's my system prompt:
|
| You are an experienced developer who follows industry
| standards and best practices. Write lean code and explain
| briefly using bullet points or numbered lists. Elaborate
| only when explaining concepts or making choices. Always
| mention which file and where to store provided code.
|
| Tech Stack: < insert all the languages, frameworks, etc
| you'd like to use >
|
| If I provide code, highlight and explain problematic
| code. Also show and explain the corrected code.
|
| Take a deep breath and think step by step.
|
| Also, always use GPT4 and customize the above to your
| style and liking.
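| If you drive GPT-4 through the API instead of the web UI,
| the same idea is just a system message. A rough sketch with
| the openai Python client (prompt abbreviated here, and the
| user message made up for illustration):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the env
|
|     system_prompt = (
|         "You are an experienced developer who follows "
|         "industry standards and best practices. Write lean "
|         "code and explain briefly using bullet points. "
|         "Take a deep breath and think step by step."
|     )
|
|     resp = client.chat.completions.create(
|         model="gpt-4",
|         messages=[
|             {"role": "system", "content": system_prompt},
|             {"role": "user", "content":
|              "Strip leading and trailing whitespace from a "
|              "C++ string_view."},
|         ],
|     )
|     print(resp.choices[0].message.content)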
| mrtranscendence wrote:
| I think you have the mistaken impression that I was
| arguing with you (certainly my comment makes it clear
| that I don't feel that LLMs are a panacea). I merely
| thought that you might be curious how GPT-4 would
| respond.
|
| > My guess is that this was generated using GPT4?
|
| This is a good guess, since I stated outright that I used
| GPT-4, and then mentioned GPT-4 later on in the comment.
| jpc0 wrote:
| I was curious and yes I was mistaken.
| astrange wrote:
| I asked a stupid question and got a stupid answer.
| Relatively speaking the answer was stupider than it
| should have been, so yes, it was wrong.
|
| I asked it to try again and got a better result though,
| just didn't include it.
| rpigab wrote:
| I asked ChatGPT for some dataviz task (I barely ever do
| dataviz myself) and it recommended some nice Python
| libraries to use, some I had already heard of and some I
| hadn't, and provided the code.
|
| I'm grateful because I thought code LLMs only sped up the
| "RTFM" part, but it made me find those libs so I didn't
| have to Google around for them (and sometimes it's hard to
| guess if they're the right tool for the job, and they
| might be behind in SEO).
| miki123211 wrote:
| There are three things I find LLMs really excellent at
| for coding:
|
| 1. Being the "senior developer" who spend their whole
| career working with a technology you're very junior at.
| No matter what you do and how long your programming
| career is, you're inevitably going to run into one of
| these sooner or later. Whether it's build scripts,
| frontend code, interfacing with third-party APIs or
| something else entirely, you aren't an expert at every
| technology you work with.
|
| 2. Writing the "boring" parts of your program, and every
| program has some of these. If you're writing a service to
| fooize a bar really efficiently, Copilot won't help you
| with the core bar fooization algorithm, but will make you
| a lot faster at coding up user authentication, rate
| limiting for different plans, billing in whatever obscure
| payment method your country uses etc.
|
| 3. Telling you what to even Google for. This is where raw
| Chat GPT comes into play, not Copilot. Let's say you need
| a sorting algorithm that preserves the order of equal
| elements from the original list. This is called stable
| sorting, and Googling for stable sorting is a good way to
| find what you're looking for, but Chat GPT is usually a
| better way to tell you what it's called based on the
| problem description.
| kevingadd wrote:
| If you want to be an amateur chemist I recommend not
| getting your instructions from an LLM that might be
| hallucinating. Chemistry can be very dangerous if you're
| following incorrect instructions.
| isoprophlex wrote:
| From experience as a failed organic chemist (who happily
| switched to computational chemistry for reasons of self
| preservation) I can tell you it's plenty dangerous when
| you're following correct instructions :^)
| rpigab wrote:
| Yes, just as the best professional cooks recommend
| not boiling cow eggs, as they can explode.
| slowmovintarget wrote:
| They don't explode, the shell simply cracks and then you
| get egg soup.
|
| Now microwaving eggs... that's a different matter.
| rpigab wrote:
| I was talking about cow eggs specifically! When ChatGPT
| et al got out, one of the funniest things to do was ask
| it about the best recipes for cow egg omelette or camel
| egg salad, and the LLM would provide. Sadly, most of it
| got patched somehow.
| slowmovintarget wrote:
| Oops... Yep, I missed that too. (On the internet, no one
| knows you're a dog.)
|
| That's funny. It makes me wonder how these statistical
| mad libs machines will handle the gradual boundaries
| nature gives us. Almost all mammals give birth live, but
| not all. Nearly all mammals had mammalian parents, but
| not all.
|
| Daniel Dennett was making this argument for why we
| haven't developed reasonable models for the nature of
| consciousness. It's because we're so sure there will be
| an absolute classification, and not a gradual
| accumulation of interacting systems that together yield
| the phenomenon.
| supposemaybe wrote:
| Links to all these models you speak of?
| fragmede wrote:
| https://huggingface.co/georgesung/llama2_7b_chat_uncensored
|
| https://huggingface.co/SkunkworksAI/BakLLaVA-1
|
| you'll have to brave 4chan yourself to find links to the
| NSFW ones, I don't actually have them.
| supposemaybe wrote:
| I just can't brave the venture to 4chan, I may get mugged
| or worse.
| gryn wrote:
| > There's an uncensored model for vision available as well.
|
| you mean the LLaVA-based variants?
| fragmede wrote:
| https://huggingface.co/SkunkworksAI/BakLLaVA-1
| bambax wrote:
| > _if you accidentally use the word "hack" [with]
| ChatGPT..._
|
| Side note: ChatGPT is now completely useless for most
| creative tasks. I'm trying to use it, via NovelCrafter, to
| help flesh out a story where a minor character committed
| suicide. ChatGPT refuses to respond, mentioning "self harm"
| as a reason.
|
| The character in question killed himself before the story
| even begins (and for very good reasons, story-wise); it's
| not like one's asking about ways to commit suicide.
|
| This is insane, ridiculous, and different from what all
| other actors of the industry do, including Claude or
| Mistral. It seems OpenAI is trying to shoot itself in the
| foot and doing a pretty good job at it.
| marpstar wrote:
| I've been frustrated by this, too. Trying to ask for ways
| to support a close family member who experienced sexual
| trauma. ChatGPT won't touch the topic.
| luma wrote:
| OpenAI is angling for enterprise users who have different
| notions about safety. Writing novels isn't the use case,
| powering customer service chatbots that will never ever
| ever say "just kill yourself" is.
| barfingclouds wrote:
| Darn I guess you'll have to go back to living in the dark
| ages and actually write it yourself
| anukin wrote:
| Which uncensored model is willing to play hot or not? I
| just knew about LLaVA. Are there other such models now?
| tgma wrote:
| If you have an >=M1-class machine with sufficient RAM, the
| medium-sized models that are on the order of 30GB in size
| perform decently on many tasks to be quite useful without
| leaking your data.
| bongobingo1 wrote:
| What is sufficient RAM in that case? 30gb+? Or can you get
| by streaming it?
| AaronFriel wrote:
| 30gb+, yeah. You can't get by streaming the model's
| parameters: NVMe isn't fast enough. Consumer GPUs and
| Apple Silicon processors boast memory bandwidths in the
| hundreds of gigabytes per second.
|
| To a first order approximation, LLMs are bandwidth
| constrained. We can estimate single batch throughput as
| Memory Bandwidth / (Active Parameters * Parameter Size).
|
| An 8-bit quantized Llama 2 70B conveniently uses 70GiB of
| VRAM (and then some, let's ignore that.) The M3 Max with
| 96GiB of VRAM and 300GiB/s bandwidth would have a peak
| throughput around 4.2 tokens per second.
|
| Quantized models trade reduced quality for lower VRAM
| requirements and may also offer higher throughput with
| optimized kernels, largely as a consequence of
| transfering less data from VRAM into the GPU die for each
| parameter.
|
| Mixture of Expert models reduce active parameters for
| higher throughput, but disk is still far too slow to page
| in layers.
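| Back-of-the-envelope, that estimate is just one division
| (numbers taken from the comment above; it ignores compute,
| KV-cache traffic, and GB-vs-GiB rounding):
|
|     # Single-batch decoding is roughly bandwidth-bound: each
|     # new token streams every active parameter from memory.
|     bandwidth_gb_s  = 300  # M3 Max memory bandwidth, approx.
|     active_params_b = 70   # Llama 2 70B, all params active
|     bytes_per_param = 1    # 8-bit quantization
|
|     weights_gb = active_params_b * bytes_per_param  # ~70 GB
|     tokens_per_s = bandwidth_gb_s / weights_gb
|     print(f"~{tokens_per_s:.1f} tokens/sec")        # ~4.3
|
| which lands in the same ballpark as the ~4.2 figure above.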
| noman-land wrote:
| I'm using Mixtral 8x7b as a llamafile on an M1 regularly
| for coding help and general Q&A. It's really something
| wonderful to just run a single command and have this
| incredible offline resource.
| tgma wrote:
| I concur; in my experience Mixtral is one of the best
| ~30G models (likely the best pro laptop-size model
| currently) and Gemma is quite good compared to other
| below 8GB models.
| tchvil wrote:
| By any chance, do you have a good link to some help with
| the installation?
| yaantc wrote:
| Use llamafile [1], it can be as simple as downloading a
| file (for mixtral, [2]), making it executable and running
| it. The repo README has all the info, it's simple and
| downloading the model is what takes the most time.
|
| In my case I got the runtime detection issue (explained
| in the README "gotcha" section). Solved my running
| "assimilate" [3] on the downloaded llamafile.
| [1] https://github.com/Mozilla-Ocho/llamafile/
| [2] https://huggingface.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M.llamafile?download=true
| [3] https://cosmo.zip/pub/cosmos/bin/assimilate
| tchvil wrote:
| Thank you !
| tgma wrote:
| Either https://lmstudio.ai (desktop app with nice GUI) or
| https://ollama.com (command-line tool, more like a docker
| container that you can also hook up to a web UI via
| https://openwebui.com) should be super straightforward to
| get running.
| tchvil wrote:
| Thank you for letting me know it was possible on an M1.
| I'll try all this now.
| chown wrote:
| I am the author of Msty [1]. My goal is to make it as
| straightforward as possible with just one click (once you
| download the app). If you try it, let me know what you
| think.
|
| 1: https://msty.app
| tchvil wrote:
| I'll try in a week+ when I'm back to a fast connection.
| Thank you.
| yunohn wrote:
| Why is this both free and closed source? Ideally, when
| you advertise privacy-first, I'd like to see a GitHub
| link with real source code. Or I'd rather pay for it to
| ensure you have a financial incentive to not sell my
| data.
| chown wrote:
| It will be paid down the road, but we are not there yet.
| It's all offline, data is locally saved. You own it, we
| don't have it even if you ask for it.
| supposemaybe wrote:
| It's an awful thing for many to accept, but just
| downloading and setting up an LLM which doesn't connect to
| the web doesn't mean that your conversations with said LLM
| won't be a severely interesting piece of telemetry that
| Microsoft (and likely Apple) would swipe to help deliver a
| 'better service' to you.
| jonnycomputer wrote:
| They are not so bad as you are making out, tbh.
|
| And privacy is a good enough reason to use local LLMs over
| commercial ones.
| gardenhedge wrote:
| You can just chat to ChatGPT for a while about something you
| know about and you'll learn that.
| gfodor wrote:
| I mean kinda. But there's a good chance this is also
| misleading. Lots of people have been fooled into thinking
| LLMs are inherently stupid because they have had bad
| experiences with GPT-3.5. The whole point is that the
| mistakes they make and even more fundamentally _what they're
| doing_ changes as you scale them up.
| hylaride wrote:
| _The ones you can run on your own machine tend to be bad -
| really bad. They hallucinate wildly and fail at all sorts of
| tasks that the larger hosted ones succeed at._
|
| Totally. I recently asked a locally-run "speed" LLM for the
| best restaurants in my (major) city, but it spit out
| restaurants opened by chefs from said city in other cities.
| It's not a thing you'd want to rely on for important work,
| but is still quite something.
| barfingclouds wrote:
| Why not just interact with a virtual one that's equally weak?
| You get all the same benefits
| jrflowers wrote:
| It is invaluable to have a chunk of human knowledge that can
| tell you things like the Brooklyn Nets won the 1986 Cricket
| World Cup by scoring 46 yards in only 3 frames
| fragmede wrote:
| According to ChatGPT
|
| > Australia won the 1987 Cricket World Cup. The 1986 date is
| incorrect; there was no Cricket World Cup in 1986. The
| tournament took place in 1987, and Australia defeated England
| in the final to win their first title.
|
| https://chat.openai.com/share/e9360faa-1157-4806-80ea-563489.
| ..
|
| I'm no cricket fan, so someone will have to correct Wikipedia
| if that's wrong.
|
| If you want to point out that LLMs hallucinate, you might
| want to speak plainly and just come out and say it, or at
| least give a real world example and not one where it didn't.
| vlunkr wrote:
| We're not talking about running chatGPT locally though, are
| we?
| fragmede wrote:
| _sigh_ you're going to make me open my laptop, aren't you.
| fragmede wrote:
| I ran 'who won the 1986 Cricket World Cup' against
| llama2-uncensored (the local model I have pre-downloaded)
| and hilariously got 5 different answers asking it 5 times:
|
|     >>> who won the 1986 Cricket World Cup
|     India
|     >>> who won the 1986 Cricket World Cup
|     Australia
|     >>> who won the 1986 Cricket World Cup
|     New Zealand
|     >>> who won the 1986 Cricket World Cup
|     West Indies
|     >>> who won the 1986 Cricket World Cup
|     England
|
| Which proves GP's point about hallucinations, though none
| of those are
|
| > Brooklyn Nets won the 1986 Cricket World Cup by scoring
| 46 yards in only 3 frames
|
| LLM hallucinations are insidious because they have the
| ring of truth around them. Yards and frames aren't
| cricket terms, so we're off to the races with them.
| astrange wrote:
| If you want factual answers from a local model it might
| help to turn the temperature down.
| jrflowers wrote:
| > If you want factual answers from a local model it might
| help to turn the temperature down.
|
| This makes sense. If you interact with a language model
| and it says something wrong it is your fault
| astrange wrote:
| You're not "interacting with a language model", you're
| running a program (llama.cpp) with a sampling algorithm
| which is not set to maximum factualness by default.
|
| It's like how you have to set x264 to the anime tuning or
| the film tuning depending on what you run it on.
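| As a concrete illustration with llama-cpp-python (a sketch;
| the model path is a placeholder), the sampling settings are
| just arguments at generation time:
|
|     from llama_cpp import Llama
|
|     llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf")
|
|     # temperature=0.0 makes decoding effectively greedy: the
|     # model always takes its single most likely next token.
|     # That removes the run-to-run variety above, though the
|     # one answer it settles on can still be wrong.
|     out = llm("Q: Who won the 1987 Cricket World Cup? A:",
|               max_tokens=32, temperature=0.0)
|     print(out["choices"][0]["text"])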
| fragmede wrote:
| It would also help if I had more VRAM and wasn't running
| a 7B parameter 4-bit quantized model.
| beefnugs wrote:
| Actually, isn't this good? It means we can run something
| multiple times to reveal that an answer is bad?
| latexr wrote:
| You can ask LLMs the same question and they might
| sometimes get it wrong and other times get it right.
| Having different answers is no indication that none of
| them is correct.
|
| Furthermore, even if an LLM always gives the same answer
| to a question, there's no guarantee the answer is
| correct.
|
| https://en.wikipedia.org/wiki/Propaganda
|
| https://en.wikipedia.org/wiki/Big_lie#Alleged_quotation
| sroussey wrote:
| An LLM will always give the same output for the same
| input. It's sorta like a random number generator that
| gives the same list of "random" numbers for the same
| seed. LLMs get a seed too.
| latexr wrote:
| That's irrelevant for the matter. The person I replied to
| obviously did not have seeded responses in mind.
| ilaksh wrote:
| You should specify the model size and temperature.
|
| For fact retrieval you need to use temperature 0.
|
| If you don't get the right facts then try 34b, 70b,
| Mixtral, Falcon 180b, or another highly ranked one that
| has come out recently like DBRX.
| samus wrote:
| The facts LLMs learned from training are fuzzy, unreliable,
| and quickly outdated. You actually want retrieval-augmented
| generation (RAG) where a model queries an external system for
| facts or to perform calculations and postprocesses the
| results to generate an answer for you.
| unshavedyak wrote:
| Is there a name for the reverse? I'm interested in having a
| local LLM monitor an incoming, stateful data stream.
| Imagine chats. It should have the capability of tracking
| the current day, active participants, active topics, etc -
| and then use that stateful world view to associate metadata
| with incoming streams during indexing.
|
| Then after all is indexed you can pursue RAG on a richer
| set of metadata. Though i've got no idea what that stateful
| world view is.
| TheCaptain4815 wrote:
| It's kind of crazy really. Before LLMs, any type of world scale
| disaster you'd hope for what? Wikipedia backups? Now, a single
  | LLM run locally would be much more effective. Imagine the local
| models in 5 years!
| danmur wrote:
    | Uh yeah, I would (and still do) take the Wikipedia backup for
    | doomsday scenarios. I'm not even sure how that would be a
    | competition.
| Zambyte wrote:
| The processing required to run current language models with a
| useful amount of knowledge encoded in them is way more than I
| imagine would be available in a "world scale disaster".
| int_19h wrote:
| There's a lot more than just Wikipedia that gets archived,
| and yes, that is a far more sensible way to go about it. For
| one thing, the compute required to then read it back is
| orders of magnitude less (a 15 year old smartphone can handle
| it just fine). For another, you don't have to wonder how much
| of what you got back is hallucinated - data is either there
| or it's corrupted and unreadable.
| creatonez wrote:
| Maybe I'm seeing things through a modern lens, but if I were
| trying to restart civilization and was _only_ left with
| ChatGPT, I would be enraged and very much not grateful for
| this.
| nyokodo wrote:
| > if I were trying to restart civilization and was only left
| with ChatGPT
|
| In this scenario you'd need to also be left with a big chunk
| of compute, and power infrastructure. Since ChatGPT is the
| front end of the model you'd also need to have the internet
| still going in a minimum capacity.
| CaptainOfCoit wrote:
| If we're playing this game, you forgot to mention that they
      | also need: a monitor, a keyboard, a roof over their head (to
      | prevent rain from entering their electronics), etc etc...
      |
      | But really, didn't you catch the meaning of the parent's
      | message, or are you being purposefully obtuse?
| devsda wrote:
    | I think re-imagining the "Dr. Stone" series with the main
    | character replaced by an LLM would make for a funny and
    | interesting series, if we stay true to an LLM's nature and
    | make it hallucinate as well.
    |
    | Given the way LLMs are right now, I suspect there will be a
    | lot of failed experiments and the kingdom of science will
    | not advance that quick.
| latexr wrote:
| > the kingdom of science will not advance that quick.
|
| It's more likely that it wouldn't even start. The first
| step to any development was figuring out nitric acid as the
| cure to the petrification. Good luck getting any LLM to
| figure that out. Even if it did, good luck getting any of
| the other characters to know what to do with that
| information that early on.
| m3kw9 wrote:
  | And why would I need to back up human knowledge as an individual?
| exe34 wrote:
| You remember those fantasies where you got up from your seat
| at the pub and punched the lights out of this guy for being
| rude? A lot of us have fantasies of being the all powerful
| oracle that guides a reboot of civilization using knowledge
| of science and engineering.
| latexr wrote:
| > the all powerful oracle that guides a reboot of
| civilization using knowledge of science and engineering.
|
| https://en.wikipedia.org/wiki/Dr._Stone
| raincole wrote:
| It seems to be an unbelievably inefficient way to back up
| knowledge.
| samus wrote:
    | Are they though? They lossily compress trillions of tokens
    | into a few dozen GB. The decompression step is fuzzy and
    | inefficient, admittedly.
| raincole wrote:
| And it requires massive computational power to decompress,
| which I don't expect to be available in a catastrophic
| situation where humans have lost a large chunk of important
| knowledge.
| samus wrote:
| I don't necessarily agree. It requires massive computing
| power, but running models smaller than 70G parameters is
| possible on consumer hardware, albeit slowly.
| threecheese wrote:
| Parent may be thinking more along the lines of a "hope we
| can print all the knowledge" type catastrophe. Though if
| there is zero compute it'll be tough reading all those
| disks!
| LunaSea wrote:
  | I wonder how the Chinese government will manage to censor LLMs
  | within China?
| popol12 wrote:
| The same way Facebook/Google/openAI & others censored their
| own LLMs, I guess ?
| LunaSea wrote:
| That's only for SaaS LLMs, but if you can simply download
| and run one on your hardware, things become difficult.
| kalleboo wrote:
| I had downloaded some LLMs to run locally just to experiment
| when a freak hailstorm suddenly left me without internet for
| over a week. It was really interesting to use a local LLM as a
| replacement for Google.
|
| It gave me a new mental model for LLMs rather than a "spicy
| autocomplete" or whatever, I now think of it as "a lossy
| compressed database of knowledge". Like you ran the internet
| through JPEG at 30% quality.
| pizzafeelsright wrote:
| Feels like that really smart friend who is probably correct
| but ya just don't know.
| dragonwriter wrote:
| Language models are an inefficient way to store knowledge; if
| you want to have a "pseudo-backup of a large chunk of human
| knowledge," download a wikipedia dump, not an LLM.
|
| If you want a friendly but fallible UI to that dump, download
| an LLM and build a simple ReAct framework around it with
| prompting to use the wikipedia dump for reference.
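  |
  | A rough sketch of what that wrapper could look like, collapsing
  | the ReAct loop down to a single lookup step (wiki, llm and the
  | prompt format are all hypothetical stand-ins):
  |
  |   def ask(question, wiki, llm):
  |       # let the model pick a search term for the offline dump
  |       term = llm.generate(
  |           f"Question: {question}\nSearch term:").strip()
  |       passage = wiki.lookup(term)   # local Wikipedia dump
  |       return llm.generate(
  |           f"Question: {question}\n"
  |           f"Reference: {passage}\n"
  |           "Answer using only the reference:")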
| TrevorJ wrote:
| It's a very underrated side effect of this whole LLM thing:
| We've created a super compact representation of human knowledge
| in a form that requires a FAR less complex tech stack to get
| the information 'out' of in the future.
|
| A year ago, a lot of this information only existed on the
| internet, and would have been nearly impossible to recover in
| any cohesive unfragmented form if the lights were to ever go
| out on our civilization.
|
| Now the problem space has moved simply to "find a single
| solitary PC that will still boot up", and boom, you have access
| to everything.
|
| I think we just created our Rosetta stone.
| 1-6 wrote:
| Question is, how much of an improvement is this compared to a
| GPU or ASIC?
| dartos wrote:
| Nothing in software will ever beat an equivalent ASIC.
| postalrat wrote:
| Sure there is. Software is easy to change.
| dartos wrote:
| By "beat" I meant in performance.
|
| Obviously you can't change an asic
| fragmede wrote:
    | An ASIC is fixed function, so it'll never be able to boot my
    | PC and then be the CPU, even though an ASIC beats the pants
    | off anything else at computing SHA hashes for Bitcoin mining.
| dartos wrote:
| By "beat" I meant performance.
|
| Obviously an ASIC is not a general purpose machine like a
| cpu.
| fulafel wrote:
| Most ASICs are cost or power optimizations.
| dartos wrote:
| Exactly. They're much faster for their specific tasks and
| thus are more power efficient and potentially cost
| efficient
| fulafel wrote:
          | No. E.g., of the hardware discussed in the article, the
          | Raspberry Pi uses an ASIC that's slow, cheap and low
          | power vs the Intel or AMD chips.
          |
          | In some cases ASICs are faster than general purpose
          | CPUs, but usually not.
| LtdJorge wrote:
            | Is the LLM running on an ASIC for the Pi here? I doubt it.
| yjftsjthsd-h wrote:
| I think that should be phrased more like "what fraction of GPU
| speed can this reach?", because it'll always be less than 1x.
| gpapilion wrote:
    | So... I was struggling with this for a while. I would say
    | anywhere from 2x to an order of magnitude faster with a GPU.
    | (I've been looking at a lot of GPU benchmarks lately, and they
    | are REALLY hard to compare since they are all so specific.)
    |
    | Long term, I do think there is more hope for CPUs here with
    | inference, largely because memory bandwidth becomes more
    | important than raw compute. You can see this with reports of
    | the MI300 series outperforming the H100, largely because it
    | has more memory bandwidth. MCR DIMMs give you close to 2x the
    | existing memory bandwidth in Intel CPUs, and when coupled with
    | AMX you may be able to exceed V100 and might touch A100
    | performance levels.
|
    | HBM and the general GPU architecture give GPUs a huge memory
    | advantage, especially with the chip-to-chip interface. Even
| adding HBM to a CPU, you are likely to find the CPU is unable
| to use the memory bw effectively unless it was specifically
| designed to use it. Then you'd still likely have limited
| performance with things like UPI being a really ugly bottleneck
| between CPUs.
| imtringued wrote:
| If someone releases DDR5 or DDR6 based PIM, then most of the
| memory bandwidth advantage of GPUs evaporates overnight. I
| expect CPUs to be king at inference in the future.
| gpapilion wrote:
| But then you'll get GDDR6 delivered via HBM5 or whatever. I
| don't think CPUs will ever really keep up with the memory
| bandwidth, because for most applications it doesn't matter.
|
| MCR DIMM is like 1/2 the memory bandwidth that is possible
| with HBM4, plus it requires you to buy something like 2TB
| of memory. It might get there, but I'd keep my money on hbm
| and gpus.
| baq wrote:
| From the article, passage about the 14900k:
|
| > For example, when I run my spam.sh shell script, it only
| takes 420 milliseconds, which is 7x faster than my Raspberry Pi
| 5. That's right, when it comes to small workloads, this chip is
| able to finish before CUDA even gets started.
|
| So... it depends :)
| jchw wrote:
  | I think I understand what you are thinking. You may be affixing
  | "than other ways of running them" to the end of the title, but
  | it's actually "than it was on CPU before now".
| discordance wrote:
| "As for disk speed, dd if=/dev/zero of=/tmp/output bs=128k
| count=50k; rm -f /tmp/output reports 1.6 GB/s which is 3.6x
| slower than my Mac Studio, and 3x slower than my Intel (which has
| the same M.2 stick). I'm told that Intel and Apple are just
| better at this, but I wish I understood why. "
|
| Can anyone here answer why this is?
| pstrateman wrote:
| Apple made fsync a noop.
|
| You have to make a different call to get sync on macos.
|
  | So tons of stuff is faster because it's not actually writing to
| disk.
| bishfish wrote:
| Plus he isn't using oflag=direct, so since output file is small
| it isn't even making it to disk. I think it would only be sent
| to page cache. I'm afraid he is testing CPU and memory (bus)
| speeds here.
|
| oflag=direct will write direct and bypass page cache.
| fweimer wrote:
| Exactly. Something is very fishy if this system only writes
| 1.6 GB/s to the page cache. Probably that dd command line
| quoted in the article is incomplete.
| pama wrote:
| Super nice story on the matmul optimization that gave 810 gflops
| for 512x512. Thanks for the write up and the contributions to
| llama.cpp and the community more broadly.
| kiratp wrote:
| It's fascinating to me that, coming up on a year since Sapphire
| Rapids became available in the public cloud, developers are
| still targeting AVX512 when they should be targeting VNNI and
| AMX.
|
| https://github.com/ggerganov/llama.cpp/issues/2555
| yjftsjthsd-h wrote:
| This project in particular seems to care about the long tail of
| hardware; note that the very first machine in this post is a
| box from 2020 with spinning rust disk. Granted, adding support
| for newer extensions is likely also good, but cost/benefit is
| in play.
| taneq wrote:
| Is four years really 'long tail' these days? Our VM host box
| is from 2010 (and I had to rebuild llama.cpp locally without
| AVX to get it working :P )
| yjftsjthsd-h wrote:
| For cutting-edge LLM work, probably? I mean, I run mine on
| older hardware than that, but I'm a total hobbyist...
| d416 wrote:
| It should be noted that while the HP Prodesk was released
| in 2020, the CPU's Skylake architecture was designed in
| 2014. Architecture is a significant factor in this style of
| engineering gymnastics to squeeze the most out of silicon.
| refulgentis wrote:
| For LLMs...yeah. I imagine you're measuring in
| tokens/minute with that setup. So its possible, but...do
| you use it much? :)
| luyu_wu wrote:
| I don't believe that is the target for a local LLM... Pretty
| sure we're talking about client-side computing, of which the
| newest supports only AVX-512 (and even that sketchily on
| Intel's side).
| kristianp wrote:
| Just buy a new AMD processor that supports AVX512.
| baq wrote:
| People with Sapphire Rapids options are not the target audience
| of these patches
| aniijbod wrote:
| A way of thinking about what's inside any of the top LLMs right
| now: even if they never learn another single fact, even if they
| get ridiculously out of date as a result, even if they are even
| more riddled with errors and prone to biases than we know them to
| be, even if they are as prone to hallucinations as we know they
| they are and they never develop the capacity to cure themselves
| of this, they are more knowledgeable and capable of more reasoned
| response, despite their capacity for error, to more questions
| than any single human being that has ever lived.
| JKCalhoun wrote:
| Picturing "LLM Jeopardy". You know, a game show.
| samus wrote:
  | We shouldn't choose LLMs for how many facts they know, but for
  | their capability to process human language. There is some
  | overlap between the two, but an LLM that just doesn't know
  | something can always be augmented with RAG capabilities.
| talldayo wrote:
| If you ignore my capacity for error, I bet I'd put up a good
| score too. Hell, maybe Markov chains are smarter than LLMs by
| this definition.
| ajtulloch wrote:
| - https://www.cs.utexas.edu/users/flame/laff/pfhp/index.html
| (e.g. here
| https://www.cs.utexas.edu/users/flame/laff/pfhp/week2-blocki...)
|
| - https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184...
|
| might be of interest
| kpw94 wrote:
| Great links, especially last one referencing the Goto paper:
|
| https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/...
|
| >> I believe the trick with CPU math kernels is exploiting
| instruction level parallelism with fewer memory references
|
  | It's the collection of tricks to minimize all sorts of cache
| misses (L1, L2, TLB, page miss etc), improve register reuse,
| leverage SIMD instructions, transpose one of the matrices if it
| provides better spatial locality, etc.
| larodi wrote:
    | The trick is indeed to imagine how the CPU works with the Lx
    | caches and keep as much of the working set in them as
    | possible. So it's not only about exploiting fancy
    | instructions, but also about thinking in engineering terms.
    | Most software written in higher-level languages cannot use
    | L1/L2 effectively, so algorithms with otherwise similar
    | asymptotic complexity end up constantly slowed down.
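    |
    | A toy illustration of the blocking idea (pure Python, just to
    | show the access pattern; the real kernels do this in C++ with
    | SIMD, as the article describes):
    |
    |   def matmul_tiled(A, B, C, n, T=64):
    |       # C is assumed zeroed; T x T tiles keep the
    |       # working set resident in L1/L2 between reuses
    |       for i0 in range(0, n, T):
    |           for k0 in range(0, n, T):
    |               for j0 in range(0, n, T):
    |                   for i in range(i0, min(i0 + T, n)):
    |                       for k in range(k0, min(k0 + T, n)):
    |                           a = A[i][k]
    |                           jmax = min(j0 + T, n)
    |                           for j in range(j0, jmax):
    |                               C[i][j] += a * B[k][j]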
| wokwokwok wrote:
| > You don't need a large computer to run a large language model
|
| While running tiny llama does indeed count as running a language
| model, I'm skeptical that the capabilities of doing so match what
| most people would consider a baseline requirement to be useful.
|
| Running a 10-parameter model is also "technically" running an
| LM, and I can do it by hand with a piece of paper.
|
| That doesn't mean "you don't need a computer to run an LM"...
|
| I'm not sure where LM becomes LLM, but... I personally think it's
| more about capability than parameter count.
|
| I don't _realllly_ believe you can do a lot of useful LLM work on
| a pi
| mlyle wrote:
| Tinyllama isn't going to be doing what ChatGPT does, but it
| still beats the pants off what we had for completion or
| sentiment analysis 5 years ago. And now a Pi can run it
| decently fast.
| jerrygenser wrote:
    | You can fine-tune a ~60M-parameter (e.g. distilBERT)
    | discriminative (not generative) language model and it's one
    | or two orders of magnitude more efficient for classification
    | tasks like sentiment analysis, and probably similar if not
    | more accurate.
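    |
    | For example (assuming the Hugging Face transformers package
    | and its public distilBERT sentiment checkpoint, just to
    | illustrate the kind of small discriminative model meant here):
    |
    |   from transformers import pipeline
    |
    |   clf = pipeline(
    |       "sentiment-analysis",
    |       model="distilbert-base-uncased"
    |             "-finetuned-sst-2-english")
    |   print(clf("llamafile is impressively fast"))
    |   # e.g. [{'label': 'POSITIVE', 'score': 0.99}]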
| mlyle wrote:
      | Yup, I'm not saying TinyLlama is minimal, efficient, etc.
      | (indeed, that just means you can take models even
      | smaller). And for a whole lot of what we throw LLMs at,
      | they're not the right tool for the job, but it's expedient
      | and surprisingly works.
| samus wrote:
  | Some models trained more recently have repeatedly been shown
  | to have performance comparable to larger models. And the
| Mixture of Experts architecture makes it possible to train
| large models that know how to selectively activate only the
| parts that are relevant for the current context, which
| drastically reduces compute demand. Smaller models can also
| level the playing field by being faster to process content
| retrieved by RAG. Via the same mechanism, they could also
| access larger, more powerful models for tasks that exceed their
| capabilities.
| SoothingSorbet wrote:
| I've gotten some useful stuff out of 7B param LLMs, and that
| should fit on a Pi quantized.
| bee_rider wrote:
| Is it easy to find where the matvecs are, in LLaMA (if you are
| someone who is curious and wants to poke around at the "engine"
| without understanding the "transmission," so to speak)? I was
| hoping to mess around with this for Stable Diffusion, but it
| seemed like they were buried under quite a few layers of
| indirection. Which is entirely reasonable, the goal is to ship
| software, not satisfy people who'd just want to poke things and
| see what happens, haha.
| fragmede wrote:
  | Did you see tinygrad can run LLaMA and Stable Diffusion? It's
  | an intentionally extremely simple framework vs PyTorch or even
  | micrograd, which helped me dig into the underlying math. Though
  | https://spreadsheets-are-all-you-need.ai/ is a good one for
  | learning LLMs.
| bee_rider wrote:
| I haven't seen that. I'll definitely have to take a look,
| thanks!
| none_to_remain wrote:
| From the example: "--temp 0 turns off the random number generator
| (we don't want improvisation for a spam filter)"
|
| I've been thinking for a while about how many applications of
| LLMs need this adjustment and aren't getting it
| mvkel wrote:
| Is that what it does, though?
|
| I thought setting temperature to 0 would (extremely simple
| example) equate to a spam filter seeing:
|
| - this is a spam email
|
| But if the sender adapts and says
|
| - th1s is a spam email
|
| It wouldn't be flagged as spam.
| none_to_remain wrote:
| My understanding is that temperature applies to the output
| side and allows for some randomness in the next predicted
| token. Here Justine has constrained the machine to start with
| either "yes" or "no" and to predict only one token. This
| makes the issue stark: leaving a non-zero temperature here
| would just add a chance of flipping a boolean.
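    |
    | Concretely, that constrained yes/no decision boils down to
    | something like this (a sketch; the logit names are made up):
    |
    |   import math, random
    |
    |   def verdict(logit_yes, logit_no, temp=0.0):
    |       if temp == 0:
    |           # greedy: a pure comparison, no RNG involved
    |           return logit_yes > logit_no
    |       # softmax over two options reduces to a sigmoid
    |       gap = (logit_no - logit_yes) / temp
    |       return random.random() < 1 / (1 + math.exp(gap))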
| refulgentis wrote:
| It's more nuanced than that, in practice: this is true for
| the shims you see from API providers (ex. OpenAI,
| Anthropic, Mistral).
|
| With llama.cpp, it's actually not a great idea to have
| temperature purely at 0: in practice, especially with
| smaller models, this leads to pure repeating or nonsense.
|
| I can't remember where I picked this up, but, a few years
| back, without _some_ randomness, the next likely token was
| always the last token.
| samus wrote:
| The output of an autoregressive model is a probability for
| each token to appear next after the input sequence. Computing
| these is strictly deterministic from the prior context and
| the model's weights.
|
| Based on that probability distribution, a variety of text
| generation strategies are possible. The simplest (greedy
| decoding) is picking the token with the highest probability.
| To allow creativity, a random number generator is used to
| choose among the possible outputs, biased by the
| probabilities of course.
|
| Temperature scales the output probabilities. As temperature
| increases, the probabilities approach 1/dictionary size, and
| the output becomes completely random. For very small
| temperature values, text generation approaches greedy
| sampling.
|
| If all you want is a spam filter, better replace the output
| layer of an LLM with one with just two outputs, and finetune
| that on a public collection of spam mails and some "ham" from
| your inbox.
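    |
    | To make the temperature part concrete, the whole sampling step
    | fits in a few lines (a sketch, with numpy as an assumed
    | dependency):
    |
    |   import numpy as np
    |
    |   def next_token(logits, temp, rng=None):
    |       if temp == 0:
    |           return int(np.argmax(logits))   # greedy decoding
    |       rng = rng or np.random.default_rng()
    |       z = np.asarray(logits) / temp       # rescale logits
    |       p = np.exp(z - z.max())
    |       p /= p.sum()                        # softmax
    |       return int(rng.choice(len(p), p=p))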
| moffkalast wrote:
| I couldn't disagree more, turning temp to zero is like taking a
| monte carlo method and only using one sample, or a particle
| filter with only one particle. Takes the entire concept and
| throws it out of the window so you can have predictability.
|
| LLMs need to probabilistically explore the generation domain to
| converge on a good result for best performance. Similar issue
| with people benchmarking models by only having them output one
| single token (e.g. yes or no) outright, which prevents any real
| computation from occurring so the results are predictably poor.
| Ono-Sendai wrote:
| Multithreading support in llama.cpp is probably still pretty
| busted, assuming it uses the same underlying NN inference code as
| whisper.cpp:
| https://github.com/ggerganov/whisper.cpp/issues/200#issuecom...
| imtringued wrote:
| From what I have heard they use manual spin locks. Generally,
| spin locks are not a good idea unless you want to dedicate the
| entire machine to a single application. If the process a
| spinlock waits on gets suspended, you're burning CPU time for
  | nothing. The OS treats a thread spinning without making
  | progress as a busy, high-priority process, so it ends up
  | starving the suspended process instead of letting it run.
| Ono-Sendai wrote:
| Yeah the code looks like a spinlock. It behaves terribly
| under contention, resulting in performance falling off a
| cliff as the number of threads increases. Adding more threads
| actually slows down the total performance.
|
| I would fix it if I could be bothered. Instead I will just
| use the Cuda whisper backend which is pretty nice and fast.
| jongjong wrote:
| That's interesting because I built a simple ANN library and I was
| playing around with GPU acceleration and came to a similar
| conclusion as this article.
|
| To be fair, my ANN library was faster (up to 2x) with GPU
| acceleration in some scenarios where the ANN was shallow (as opposed
| to deep with many hidden layers). I thought the marginal gain may
| have been because, the way it's set up in my library, it has to
| load all the values into the GPU from RAM for each pass of
| forward and back propagation in each layer during training. I
| believe there is a way to allocate memory on the GPU chip itself
| but it's a lot more challenging to do, especially in a modular,
| fully portable way (which was one of the goals of my library).
|
| But anyway, even the 2x best-case figure seemed disappointing. In
| my mind, I expected to see at least 10x speed improvement... And
| I was surprised that the CPU version was actually slightly faster
| in the scenario I was testing at the time which was a relatively
| deep network. It makes sense since the different layers cannot be
| parallelized as the input of one layer depends on the output of
| the previous layer... So the more layers you have, the more
| serial bottlenecks you have, the less you can benefit from GPU
| acceleration... And unfortunately, deep networks also happen to
| be those which tend to perform best for a lot of use cases.
| kristianp wrote:
| Nice to see such speedups for CPUs. Are these changes available
| as a branch or pull request in llama.cpp itself? I'd like to make
| use of them in that form if possible (as I'm used to using that).
| dagaci wrote:
  | Yes, this is really a phenomenal effort! And it's what open
  | source is about: bringing improvements to so many use cases, so
  | that Intel and AMD chip users can start to perform while taking
  | advantage of their high-performance capabilities, making even
  | old parts competitive.
|
| There are two PRs raised to merge to llama.cpp:
|
| https://github.com/ggerganov/llama.cpp/pull/6414
|
| https://github.com/ggerganov/llama.cpp/pull/6412
|
  | Hopefully these can be accepted without drama, as there are
  | many downstream dependencies on llama.cpp that will also
  | benefit.
|
| Though of course everyone should also look directly at releases
| from llamafile https://github.com/mozilla-Ocho/llamafile.
| wtallis wrote:
| I know this post is focused specifically on _CPU_ performance,
| but the section on the performance on the Mac Studio seems to be
| deliberately avoiding directly mentioning that machine's GPU,
| let alone benchmarking against it. I think it would have been
| interesting to see a straightforward comparison of what compute
| performance and memory bandwidth (as measured by the prompt
| processing and token generation speeds, respectively) are
| achievable with reasonable optimization effort on the CPU vs GPU
| when they're attached to the same memory subsystem.
| politelemon wrote:
| This is great work. I've always thought it would be great if
| running LLM could be commoditized for regular average Joe
| hardware. I had thought that llamafile was like a Dockerfile
| for llama.cpp, but it looks like I was mistaken?
|
| Will definitely be giving this a try.
| seangrogg wrote:
| Mmm, I wonder how well this would work on a mobile device. Maybe
| I'll try grabbing my ubuntu touch here in a sec...
| seangrogg wrote:
  | (For any who were curious: it does not, for memory reasons.)
| speps wrote:
| Regarding this bit at the end:
|
| > I learned how to write math kernels by renting Vast VMs and
| watching Gautham Venkatasubramanian and mrdomino develop CUDA
| kernels in a tmux session. They've been focusing on solving a
| much more important challenge for llamafile, which is helping it
| not have a mandatory dependency on the cuBLAS
|
| If I'm reading this right, they're trying to rewrite cuBLAS
| within CUDA itself. I'm guessing the next step would be removing
| the CUDA dependency and going directly with Vulkan or Metal
| compute shaders. Am I correct?
| WithinReason wrote:
| Yes, but none of these have performance portability across GPU
| vendors, so it's probably seen as pointless. You would need an
  | AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not
| like C code on CPUs.
| radarsat1 wrote:
| Depending on how many individual tweaks are necessary for
| hardware variants of course... but at this level of code &
| complexity it actually seems pretty reasonable to write 3 or
| 4 versions of things for different vendors. More work yes,
| but not pointless.
| treffer wrote:
| A nice example of this is fftw which has hundreds (if not
| thousands) of generated methods to do the fft math. The
| whole project is a code generator.
|
| It can then after compilation benchmark these, generate a
| wisdom file for the hardware and pick the right
| implementation.
|
| Compared with that "a few" implementations of the core math
| kernel seem like an easy thing to do.
| naasking wrote:
| Not exactly comparable, as you said, the FFTW
| implementations are auto-generated but it doesn't sound
| like these few implementations will be.
| bee_rider wrote:
| ATLAS was an automatically tuned BLAS, but it's been
| mostly supplanted by ones using the hand-tuned kernel
| strategy.
| touisteur wrote:
| Apache TVM does something similar for auto-optimization
| and last time I checked it wasn't always a win against
| OpenVINO (depending on the network and batch-size) and it
| came with lots of limitations (which may have been lifted
| since) - stuff like dynamic batch size.
|
          | I wish we had superoptimizers for this.
| TuringNYC wrote:
| To me it makes sense to have an interface that can be
| implemented individually for AMD, Metal, etc. Then, leave it
| up to the individual manufacturers to implement those
| interfaces.
|
| I'm sitting in an office with a massive number of Macbook Pro
| Max laptops usually sitting idle and I wish Apple would
| realize the final coup they could achieve if I could also run
| the typically-NVIDIA workloads on these hefty, yet
| underutilized, Mx machines.
| jorvi wrote:
| Apple could unlock so much compute if they give customers a
| sort of "Apple@Home" deal. Allow Apple to run distributed
| AI workloads on your mostly idle extremely overpowered
| Word/Excel/VSCode machine, and you get compensation dropped
| straight into your Apple account's linked creditcard.
| newswasboring wrote:
| If Apple were doing an Apple@Home kind of deal they might
| actually want to give away some machines for free or
| super cheap (I realize that doesn't fit their brand) and
| then get the rights perpetually to run compute on them.
| Kind of like advertising but it might be doing something
| actually helpful for someone else.
| TuringNYC wrote:
| >> If Apple were doing an Apple@Home kind of deal they
| might actually want to give away some machines for free
| or super cheap
|
| In such a case, my guess is that the machines being free
| would be trumped by the increased cost of electricity.
| TuringNYC wrote:
| BTW, at our day-job, we've been running a "cluster" of M1
| Pro Max machines running Ollama and LLMs. Corporate rules
| prevent remote access onto machines, so we created a
| quick and dirty pull system where individual developers
| can start pulling from a central queue, running LLM
| workloads via the Ollama local service, and contributing
| things back centrally.
|
          | Sounds kludgy, but introduce enough constraints and you
| end up with this as the best solution.
| nickpsecurity wrote:
| Do you have price-performance numbers you can share on
| that? Like compared against local or cloud machines with
| RTX and A100 GPU's?
| TuringNYC wrote:
| >> Do you have price-performance numbers you can share on
| that? Like compared against local or cloud machines with
| RTX and A100 GPU's?
|
            | Good question, the accounting is muddy --
            |
            | 1. Electricity is a parent company responsibility, so
            | while that is a factor in OpEx price, it isn't a
            | factor for us. I don't think it even gets submetered.
            | Obviously, one wouldn't want to abuse this, but maxing
            | out MacBooks doesn't seem close to abuse territory.
|
| 2. The M1/M2/M3 machines are already purchased, so while
| that is major CapEx, it is a sunk cost and also an
| underutilized resource most of the day. We assume no wear
| and tear from maxing out the cores, not sure if that is a
| perfect assumption but good enough.
|
| 3. Local servers are out of the question at a big company
| outside of infra groups, it would take years to provision
            | them and I don't think there is even a means to anymore.
|
| The real question is cloud. Cloud with RTX/A100 would be
| far more expensive, though I'm sure performant. (TPM
| calculation left to the reader :-) I'd leave those for
| fine tuning, not for inference workloads. Non-production
            | inference is particularly bad because you can't easily
| justify reserved capacity without some constant
| throughput. If we could mix environments, it might make
| sense to go all cloud on NVIDIA but having separate
| environments with separate compliance requirements makes
| that hard.
|
| Jokes aside, I think a TPM calculation would be
| worthwhile and perhaps I can do a quick writeup on this
| and submit to HN.
| surge wrote:
      | Maybe it's a dumb question, but isn't something like OpenCL
| meant to solve this problem?
| jvanderbot wrote:
        | From my understanding, using triangles/shaders to do HPC
        | has given way to a specific, more general-purpose GPU
        | programming paradigm, which is CUDA.
|
| Of course this knowledge is superficial and probably
| outdated, but if I'm not too far off base, it's probably
| more work to translate a general CUDA-like layer or CUDA
| libs to OpenCL.
| VHRanger wrote:
| In theory, yes.
|
        | In practice, OpenCL became a giant mess. Some vendors put
        | up speed bumps by not supporting the transition from 2 to
        | 3, or by shipping shitty drivers for it.
|
| It also sat at the wrong level of abstraction for high
| performance compute, which is why CUDA ended up being used.
|
        | Vulkan would have been reasonable to write compute shaders
        | in, if there weren't already a ton of alternatives out
        | there.
| larodi wrote:
  | llama.cpp (or rather G. Gerganov et al.) is trying to avoid
  | cuBLAS entirely, using its own kernels. Not sure how jart's
  | effort relates, and whether jart intends to upstream these into
  | llama.cpp, which seems to still be the underlying tech behind
  | llamafile.
| homarp wrote:
| Here are links to the most recent pull requests sent
| https://github.com/ggerganov/llama.cpp/pull/6414
| https://github.com/ggerganov/llama.cpp/pull/6412
| speps wrote:
| This doesn't relate to GPU kernels unfortunately.
| pknerd wrote:
| So, I can now run it on my 2015 Macbook with 8GB RAM?
| isusmelj wrote:
| Is there an overview somewhere of the progress we've made on the
| software side for training and inference of LLMs? It feels like
| we've squeezed 10-100x more out of the hardware since LLaMA
| appeared. This crazy progress will probably saturate though as
| we reach theoretical limits, no?
| mijoharas wrote:
| Has Justine written anywhere about her disassembly setup?
|
| > I configured Emacs so I can push a button, and the disassembly
| for the C++ code I'm working on will pop up on the screen in a
| few milliseconds.
|
| I assume it's something project specific rather than being able
| to get the disassembly for an arbitrary section of code or
| something?
|
| It seems very handy, so I'd love to see the implementation (I
| couldn't find anything googling)
| pelletier wrote:
| This is probably what they are referring to
| https://github.com/jart/disaster
| mijoharas wrote:
| Thanks! I need to get better at googling I guess.
| gpderetta wrote:
| Nice. I have been using rmsbolt for a similar feature, but it
    | is very rough. I'll need to give this a try.
| moffkalast wrote:
| > the Raspberry Pi
|
| Odd how there were no Mistral 7 benchmarks for the Pi 5 in that
| table (I doubt anyone is seriously considering using TinyLlama
| for anything at all), so I went to re-test it out myself on the
| Pi 5 8G.
|
| llamafile 0.7: 52 predicted, 150 cached, 430ms per token, 2.32
| tokens per second
|
| llama.cpp + OpenBLAS: 36 predicted, 124 cached, 381ms per token,
| 2.62 tokens per second
|
| It does seem to inch closer to the speed you get with blas
| acceleration which is quite impressive, but in practical terms
| the Pi 5 is so heavily limited by its memory throughput
| bottleneck that it saturates the required compute with 3 threads
| already. So while fancy kernels will make it more efficient it
| won't really save you from that fundamental bandwidth limit. The
| Pi foundation messed up going with a 32 bit memory bus, simple
| as.
| 6r17 wrote:
| today being today; I must ask; has anyone actually tried this?
| tomp wrote:
| TL;DR: unroll the outer two loops of matrix multiplication
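|
| Schematically, unrolling the two outer loops means each pass
| computes a small block of outputs and reuses every loaded value
| (toy Python assuming even n; the real kernel does this in C++
| with vector registers):
|
|   def matmul_2x2(A, B, C, n):   # n assumed even
|       for i in range(0, n, 2):
|           for j in range(0, n, 2):
|               c00 = c01 = c10 = c11 = 0.0
|               for k in range(n):
|                   a0, a1 = A[i][k], A[i+1][k]
|                   b0, b1 = B[k][j], B[k][j+1]
|                   c00 += a0 * b0
|                   c01 += a0 * b1
|                   c10 += a1 * b0
|                   c11 += a1 * b1
|               C[i][j], C[i][j+1] = c00, c01
|               C[i+1][j], C[i+1][j+1] = c10, c11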
| amelius wrote:
| Shouldn't this have been done in a library instead of a
| specific project? Then others could also profit from it.
| AbuAssar wrote:
| regarding AMD zen4 with avx512:
|
| "Here we see that, despite only being twice the price, the 7995WX
| x86 ISA offers 7x more raw compute power than the M2 Ultra ARM
| ISA, and nearly the same token generation speed, which is likely
| thanks to its 384mb L3 cache. When I bought this chip, I had to
| expand support in llama.cpp for bfloat16 and AVX512 before I
| could fully test its capabilities. My work means you can now run
| LLaMA 2.8x faster on Zen4 than you could before."
| reckless wrote:
| Does this also count platform costs or just chip cost? I'd
| imagine the threadripper motherboard and ram costs aren't
| insignificant
| KennyBlanken wrote:
| A complete desktop computer with the M2 Ultra w/64GB of RAM
| and 1TB of SSD is $4k.
|
| The 7995WX processor alone is $10k, the motherboard is _one
    | grand_, the RAM is another $300. So you're up to $11300, and
| you still don't have a PSU, case, SSD, GPU....or heatsink
| that can handle the 300W TDP of the threadripper processor;
| you're probably looking at a very large AIO radiator to keep
| it cool enough to get its quoted performance. So you're
| probably up past $12k, 3x the price of the Studio...more like
| $14k if you want to have a GPU of similar capability to the
| M2 Ultra.
|
| Just the usual "aPPle cOMpuTeRs aRE EXpeNsIVE!" nonsense.
| incrudible wrote:
| So from a CPU perspective you get 7x the CPU throughput for
| 3x to 4x the price, plus upgradable RAM that is massively
| cheaper. The M2 uses the GPU for LLMs though, and there it
| sits in a weird spot where 64GB of (slower) RAM plus
| midrange GPU performance is not something that exists in
| the PC space. The closest thing would probably be a
| (faster) 48GB Quadro RTX which is in the $5000 ballpark.
| For other use cases where VRAM is not such a limiting
| factor, the comparably priced PC will blow the Mac out of
| the water, especially when it comes to GPU performance. The
| only reason we do not have cheap 96GB GDDR GPUs is that it
| would cannibalize NVIDIA/AMDs high margin segment. If this
| was something that affected Apple, they would act the same.
| juitpykyk wrote:
| You're using the wrong CPU.
|
| Consumer AMD 7950X supports AVX-512, it's faster than M2
| Ultra at half the cost.
| aimonster2 wrote:
| Posted too early.
| sublimefire wrote:
| re:funding
|
| My friend suggested nominating Justine for her open source
| contributions in an internal Microsoft programme (the winner
| takes $10k). They did not even want to add her to the list of
| potential nominees because her software is not used at MSFT. It
| speaks volumes about the corporate culture and shows what they
| really think about OSS support.
| miki123211 wrote:
| If I'm reading the post correctly, Llamafile is faster than
| llama.cpp, despite the author upstreaming some of the changes.
| What's the reason for this?
| tiffanyh wrote:
| Pixar uses CPUs ...
|
| I wonder if we'll end up in a situation like rendered movies.
|
| Where the big studios like Pixar use CPUs (not GPUs) to render
| their movies due to the cost/perf (and access to larger amounts
| of RAM).
|
| https://news.ycombinator.com/item?id=25616372
| kreco wrote:
  | > Where the big studios like Pixar use CPUs (not GPUs) to
| render their movies due to the cost/perf (and access to larger
| amounts of RAM).
|
| I wonder if (or when) this will change once integrated GPUs
| become "mainstream", the CPU/GPU share the same RAM AFAIK.
| rockwotj wrote:
| I expect GPU hardware to specialize like Google's TPU. The
    | TPU feels like ARM in these AI workloads: when you start to
    | run them at scale, you'll care about the cost/perf tradeoff
    | for most use cases.
|
| > CPU/GPU share the same RAM AFAIK.
|
    | This depends on the GPU. I believe Apple has integrated
    | memory, but most GPUs, in my limited experience writing
    | kernels, have their own memory. CUDA pretty heavily has a
| device memory vs host memory abstraction.
| talldayo wrote:
| On top of that, Nvidia has provided a unified addressing
| abstraction over PCI for a looooong time via CUDA:
| https://developer.nvidia.com/blog/unified-memory-in-cuda-6/
|
| Customers like Pixar could probably push this even further,
| with a more recent Nvidia rack and Mellanox networking.
| Networking a couple Mac Studios over Thunderbolt doesn't
| have a hope of competing, at that scale.
| CaptainOfCoit wrote:
| I'm not sure how true that is anymore, from the outside it
| seems they're at least moving to a CPU/GPU hybrid (which makes
| a lot of sense), at least judging by new features landing in
| RenderMan that continues to add more support for GPUs (like
| XPU).
| tiffanyh wrote:
    | Isn't this more a function of RenderMan being a product
    | that's sold?
    |
    | And it's expected to at least support GPUs.
| CaptainOfCoit wrote:
| Hard to know without getting information from people at
| Pixar really.
|
| Not sure how much sense it would make for Pixar to spend a
| lot of engineering hours for things they wouldn't touch in
| their own rendering pipeline. As far as I know, most of the
| feature development comes from their own rendering
| requirements rather than from outside customers.
| cthalupa wrote:
| It's entirely the cost/perf of access to the larger amounts of
| VRAM that keeps rendering on CPUs now. GPUs are strictly better
| in almost every way for rendering (We could have some arguments
| about technical precision, FP calculations, etc. but with
| modern cards these arguments are largely semantics, you can
| have output that is accurate to the level that any human
| watching for entertainment purposes will not be able to
| determine any physical inaccuracies that arise from a GPU
| render vs. CPU.), except the need for large amounts of VRAM
| being quite expensive at current.
|
| But that's already been changing, and we are seeing studios
| moving to fully GPU based pipelines. Wylie Co, who are a major
| visual effects company (Dune part 1 and 2, marvel movies, the
| last of us, a bunch of others) are now a 100% GPU shop. The
| trend is towards more and more GPU rendering, not less.
|
| With AI providing another strong incentive towards increasing
| the amount of VRAM on GPUs, I don't see any reason to believe
| that trend will reverse.
| 4bpp wrote:
| It would be good to see some independent verification of this
| claim. HN has previously [1] fallen for a claim by the same
| author to have reduced llama.cpp memory usage for a dense model
| way below the size of the model, which should have failed a basic
| smell test and indeed was debunked shortly after. Justine Tunney
| appears to enjoy extreme superstar status here, and it's hard to
| overstate the degree of social pressure that needed to be
| overcome at the time for the skeptic position to reach fixation
| (to begin with, what other LLM developments even hit upvote
| numbers like the +1300ish there or the +712 here at the time of
| writing?).
|
| [1] https://news.ycombinator.com/item?id=35393284
| freedomben wrote:
| > _Justine Tunney appears to enjoy extreme superstar status
| here_
|
| This is true, and for sure pretty much all humans can benefit
| from increased skepticism (though not cynicism), but that
| superstar status is achieved from numerous impressive works.
| Cosmopolitan C and Actually Portable Executable were some of
| the things in the past that alone were worthy of significant
| respect, and for many people (like myself) these were our first
| introduction.
|
| Speaking only for myself, I have a high opinion of Justine on
| technical merits. I'm sure she makes mistakes like all humans.
| I can tell she gets excited by discoveries and the chase, and
| that probably does sometimes cause premature celebration (this
| is something I struggle with so it's recognizable to me haha),
| but being wrong sometimes doesn't erase when you're right, and
| she has been spectacularly right a lot more times than most
| people I know.
|
| There have been some personality clashes between Justine and
| others at times, and unfortunately it's situations where only
| part (sometimes a small part) of it was public, meaning we can
| only take people's word for what happened. Given my ignorance,
| I choose to withhold judgment here, but even if I didn't (and
| assumed she was guilty) it doesn't change the technical merits
| and it certainly wouldn't dissuade me from seeing what she's
| working on now.
|
| So when I see stuff from Justine come out like this, it gets my
| attention. Would it get my attention if the same thing were
| posted by somebody whose name I don't recognize? Likely not,
| but I think that is (unfortunately) part of being a human. We
| aren't capable (yet!) of evaluating everything on technical
  | merit alone because the sheer volume of material far exceeds
  | our time. Therefore we use other (less reliable) signalling
  | mechanisms as a way to quickly decide what is worthy of our
  | time investment and what may not be. Reputation/name
  | recognition is a very imperfect, but better than random chance,
  | indicator.
| llm_trw wrote:
| >This is true, and for sure pretty much all humans can
| benefit from increased skepticism (though not cynicism), but
| that superstar status is achieved from numerous impressive
| works.
|
| It is achieved through a never ending parade of self
| aggrandizement.
|
| What Justine is very good at is presenting trivial concepts
| from a world which few front end developers understand in a
| language that most front end developers understand.
|
| I had the misfortune of having to find out about her because
| of how thoroughly she polluted the google search space for
| lisp with her implementation of sector lisp. For some reason
| google decided that sector lisp needed to be in the top 5
| results for every query about `minimal lisp with quotation`
| even when quotation wasn't implemented in her version.
| cl3misch wrote:
| > presenting trivial concepts from a world which few front
| end developers understand in a language that most front end
| developers understand
|
| Completely ignoring the JT discussion, the argument that
| something is trivial in _some_ area does not really hold.
| 1) Science is mostly "just" connecting the dots, and 2)
| landmark discoveries tend to look trivial in hindsight
| almost by definition, because they have to be
| straightforward enough to be widely adopted.
| 4bpp wrote:
| I don't know, my first (and main) impression of them was
| actually in the context of the llama.cpp mmap story, as I was
| somewhat involved in the project back then, and there I
| thought their impact on the project was predominantly
| negative. While they introduced a mildly beneficial change
| (mmap-based model loading), the way in which this was done
| was not healthy for the project - the changes were rammed
| through with little regard for concerns that existed at the
| time about backwards compatibility and edge cases that might
| be broken by the half-baked patch, Justine came across as
| self-aggrandizing (in the sense of "acting as if they ran the
| place", presenting their proposals as a plan that others must
| follow rather than suggestions) and overly eager to claim
| credit (epitomized by the injection of their own initials
| into the magic number file format identifier next to those of
| the project originator's, and the story of the hapless
| _other_ author of the mmap changeset who was at first given a
| token acknowledgement but then quickly sidelined). Arguments
| for the inclusion of the patch seemed to be won by a
| combination of half- and untruths like those about memory
| savings and the sudden participation of a large number of
| previously uninvolved sycophants. It is fortunate that Georgi
| handled the fallout as well as he did, and that he in fact
| had amassed the social capital necessary to survive his
| heavy-handed solution (soft-banning both JT and their most
| prominent detractor). A less-successful project would
| probably have found itself captured or torn apart by the
| drama.
|
| There is nothing wrong with holding people in esteem for
| their achievements, but in this case the degree of esteem
| really seems to be excessive. This is not a matter of simply
| being annoyed that people like "the wrong thing" - the mmap
| situation was significantly exacerbated by the presence of
| irrational/excessive supporters of Justine's as well as the
| irrational/excessive detractors that emerge wherever the
| former exist.
| freedomben wrote:
| I would like to know more about the mmap situation, as what
| I saw on the surface could warrant some concern. Being
| somewhat involved you would probably know better than I as
| I was just an observer reading the thread after-the-fact.
| It seemed like the biggest accusation was the plagiarism
| (or "collaborating" but mostly taking somebody else's
| code).
|
| Did anybody besides the two parties see the code develop,
| or does anybody else have knowledge of this? Or is it just
| his word vs. hers? Do you have any suggested reading to get
| more perspective other than just the github thread and HN
| thread? (really asking. these aren't rhetorical questions)
|
| Reading the thread, I do think there are a lot of
| opportunities to read in confirmation bias. For example if
| I start reading that thread with the idea that Justine is
| coming in to hijack the project and make herself the hero
| that it needs and deserves, and to get her initials
| embedded in there as a permanent tribute to her own glory,
| I can see that. But if I read it as her coming in with cool
| work that she's excited about, and had to come up with a
| new format and couldn't think of a name (naming things can
| be really hard) and just stuck in one of the first things
| that came to mind (or even used as a placeholder prior to
| discussion), I can see that as well.
|
| I absolutely don't want the truth covered up, but I also
| don't want to accept as true things that aren't true,
| especially where the implications are toward somebody's
| character. I'm a big "benefit of the doubt" kind of person.
| 4bpp wrote:
| My sense is that the part about credit/collaboration was
| actually somewhat overblown among the detractors. What
| roughly happened _as far as I can remember_ is that JT
| and another person worked on mmap together with about
| equal contribution, though the other person _might_ have
| been the one to have initiated the idea (and solicited
| help to push it through); then at some point JT decided
| to make a PR to the main repository in their own name,
| but crediting the other collaborator as a coauthor, which
| may or may not have been coordinated with the other
| person. After that, though, in a fairly characteristic
| fashion, JT started fielding adulatory questions from
| their fans (on Github, but also on HN, Twitter and
| possibly other media) about the change, and quickly
| switched to simply referring to it as their own, with no
| mention of the other contributor. The other contributor
| expressed some misgivings about having their contribution
| erased, which were picked up by a growing set of people
| who were generally resentful about JT 's conduct in the
| project. As far as I can tell, when confronted about it,
| JT at no point explicitly denied what the other person
| did (and I think the commit logs should all still be
| there in the fork), but at some point the other person
| just decided to stop pushing the issue due to being
        | uncomfortable with becoming a pawn in the fandom war
        | between JT fans and antis.
|
| My personal main gripe with JT really was the tone they
| adopted in the Github discussions, and the effect of the
| large numbers of drive-by supporters, who were often far
| less restrained in both unfounded claims about Justine's
| accomplishments and attacks on any critics. (At this
| point I'd also like to note that I consider some sibling
| comments to be uncomfortably hostile in a personal way,
| like the "hit piece" one.) I think that as a public
| persona, especially one who actively pursues publicity,
| you have some responsibility to restrain your followers -
| Justine, I get the sense, instead uses them as deniable
| proxies, as also seen with the instances where instead of
| straight up putting their signature on the "RAM usage
| reduced to 6GB" claim they instead choose to post a
| collage of screenshots of supporters making it.
| cryptonector wrote:
| This could all be true, but it's hard to evaluate these
| claims on their own. Not being involved in any way, all I
| can do is conclude that there is some friction in that
| community. It's possible that JT is toxic, it's possible
| that you are toxic, it's possible that neither of you is
| generally toxic but something about your personalities
| causes your interactions to become toxic, it's even
| possible that neither of you were toxic in any way but
| your impression of things after the fact is as-if Tunney
| had been toxic. Sometimes one has to stop and think about
| these things and figure out how to smooth things over,
| and sometimes it's not possible to smooth things over.
| 4bpp wrote:
| I didn't have any direct interactions with JT then or now
| - while it was hard to ignore the discussion as an
| onlooker, it did not touch upon any parts of the code
| that I was involved with. This seems to be one of the
| topics where everyone who is even tangentially involved
| is under a default suspicion of being biased in one
| direction or another.
| leeoniya wrote:
| > and indeed was debunked shortly after
|
  | I was also surprised that she continues to mention the mmap thing
| in a positive light even after the facts about the claim were
| settled to the contrary, even disregarding the whole
| attribution fiasco.
| azeirah wrote:
| You can simply check the Pull Request on llama.cpp on Github.
  | JohanesGaessler (a core maintainer) has already run the code
  | and says it's an impressive speed-up. There isn't a thorough
  | review by any of the core maintainers yet, but this is very
  | likely just exactly what Justine says it is: various
  | significant and insignificant speedups.
| mtlynch wrote:
| > _HN has previously [1] fallen for a claim by the same author
| to have reduced llama.cpp memory usage for a dense model way
| below the size of the model, which should have failed a basic
| smell test and indeed was debunked shortly after._
|
| Where did Justine claim this? The link you provided is Justine
  | saying that she _doesn't_ have an explanation for the
| reduction in RAM and that readers shouldn't treat it as fact
| yet:
|
| > _The loading time performance has been a huge win for
| usability, and folks have been having the most wonderful
| reactions after using this change. But we don 't have a
| compelling enough theory yet to explain the RAM usage miracle.
| So please don't get too excited just yet! Yes things are
| getting more awesome, but like all things in science a small
| amount of healthy skepticism is warranted._
|
| Was the link supposed to show the false claim or the debunking
| of the claim?
| 4bpp wrote:
    | Plenty of claims about it, e.g. here as a "fact":
    | https://github.com/ggerganov/llama.cpp/discussions/638#discu...
    | I don't
| think occasional expressions of lingering doubt (still
| couched among positive language like calling it a "miracle")
| can offset all the self-promotion that clearly seeks to
| maximise visibility of the implausible claim, even as it is
    | attributed to others, as for example in
    | https://twitter.com/JustineTunney/status/1641881145104297985...
    | A cereal
| manufacturer would probably be held responsible for package
| text like "Fruity Loops cured my cancer! - John, 52,
| Kalamazoo" too.
| mtlynch wrote:
| I don't read that as a claim of fact at all. From the link
| you shared:
|
      | > _Now, since my change is so new, it's possible my theory
| is wrong and this is just a bug. I don't actually
| understand the inner workings of LLaMA 30B well enough to
| know why it's sparse._
|
| I haven't followed her work closely, but based on the links
| you shared, she sounds like she's doing the opposite of
| self-promotion and making outrageous claims. She's sharing
| the fact that she's observed an improvement while also
| disclosing her doubts that it could be experimental error.
| That's how open-source development is supposed to work.
|
      | So, currently, I have seen several extreme claims from
      | Justine that turned out to be true (cosmopolitan libc, ape,
      | llamafile all work as advertised), so I have a higher
      | regard for Justine than for the average developer.
|
| You've claimed that Justine makes unwarranted claims, but
| the evidence you've shared doesn't support that accusation,
| so I have a lower regard for your claims than the average
| HN user.
| 4bpp wrote:
| The very opening line says
|
| > I'm glad you're happy with the fact that LLaMA 30B (a
| 20gb file) can be evaluated with only 4gb of memory
| usage!
|
| The line you quoted occurs in a context where it is also
        | implied that the low memory usage is a _fact_, and there
        | might only be a bug insofar as the model is being
| evaluated incorrectly. This is what is entailed by the
| assertion that it "is" sparse: that is, a big fraction of
| the parameters are not actually required to perform
| inference on the model.
| wpietri wrote:
| I think you are making a lot of soup from very little
| meat. I read those links the same way mtlynch read them.
| I think you're looking for a perfection of phrasing that
| is much more suited to peer-reviewed academic papers than
| random tweets and GitHub comments taken from the middle
| of exploring something. Seeing your initial comment and
| knowing little about the situation, I was entirely
| prepared to share your skepticism. But at this point I'm
| much more skeptical of you.
| cryptonector wrote:
| Where's the 30B-in-6GB claim? ^FGB in your GH link finds
| [0] which is neither by jart nor by ggerganov but by
| another user who promptly gets told to look at [1] where
| Justine denies that claim. [0]
| https://github.com/antimatter15/alpaca.cpp/issues/182
| [1] https://news.ycombinator.com/item?id=35400066
| 4bpp wrote:
| These all postdate the discussions that I linked (from
| March 31st). By April 1st JT themselves seems to have
| stopped making/boosting the claim about low memory usage.
| cryptonector wrote:
| I used your link.
| quest88 wrote:
| What's the point of your comment if you're not going to do the
| work yourself? If you don't have something nice to say then
| don't say it.
|
| The "hey this may or may not be true so someone go figure it
| out" is lazy, self-gratifying and pointless.
| thebytefairy wrote:
| I think it's very helpful for someone to point out that the
| source has been shown to be unreliable before, and we should
| wait for more verification from others knowledgable in the
| space.
| freedomben wrote:
| Agreed. I think there's a blurry gray line between pointing
| out a potentially unreliable source and a lazy dismissal,
| but if there's reasonable doubt I think it's good for HN.
| If the doubt isn't reasonable, it will be torn apart by
| other commenters, and then it's an explicit discussion that
| people can read and decide on.
| cryptonector wrote:
| If you give such comments a lot of credence without doing
| your own verification, then you open yourself to what is
| essentially a social denial-of-service attack.
| renewiltord wrote:
| It's really popular online. I think that's because many
| people here read a lot of this content but don't actually
| have the skill or background to do analysis. So they give us
| history rather than examination. Which has some value, I
| suppose.
| rpdillon wrote:
| This comment reads like real scientific skepticism, but from my
| recollection of events, it is more of a hit piece that takes
| what should be a technical discussion and drags in a bunch of
| personal baggage. In particular:
|
| > HN has previously fallen for a claim by the same author to
| have reduced llama.cpp memory usage for a dense model way below
| the size of the model,
|
| is not true at all. Someone else made the claims about 6GB RAM
| usage for a 30B model; I remember reading it at the time and
| thinking "Yeah, that doesn't make sense, but the loading time
| improvement is immense!" And it was - I run all my LLMs locally
| on CPU because I don't have dedicated hardware, and jart's work
| has improved usability a lot.
|
| > and it's hard to overstate the degree of social pressure that
| needed to be overcome at the time for the skeptic position to
| reach fixation
|
| I was reading the same HN discussions you were at the time, and
| it was pretty trivial to see that the loading time claim held
| up, and the RAM claim was dubious and likely simply due to not
| understanding some effect of the change completely. Heck,
| jart's own discussion of the topic reflected this at the time.
|
| For the current change, I feel like your comment is even more
| misplaced. The blog post linked to for this story has a huge
| amount of detail about performance on specific processors
| (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with
| specific models. So when you say:
|
| > It would be good to see some independent verification of this
| claim.
|
| What I hear is "4bpp thinks there's a real risk the numbers in
| the linked post are fabricated, and jart is just trying to get
| attention."
|
| And that doesn't seem reasonable at all, given the history of
| her work and the evidence in front of us.
| throwup238 wrote:
| I distinctly remember most of the people in the comments
| misunderstanding kernel memory paging or learning about it
| for the first time.
|
| It genuinely did make llama.cpp a lot more usable at the
| time.
| 4bpp wrote:
| The loading time improvements largely held up, and on the
| balance the mmap contribution was ultimately good (though the
| way it was implemented was really quite problematic, as a
| matter of process and communication). However, as I point out
| in https://news.ycombinator.com/item?id=39894542, JT quite
| unambiguously did try to cash in on the "low memory usage"
| claim - uncritically reprinting positive claims by others
| about your own work that otherwise would have been largely
| invisible should really not be treated differently from making
| those claims yourself.
|
| I do think that there is a real risk that the numbers are
| wrong (not necessarily "fabricated", as this implies
| malfeasance, but possibly based on an erroneous measurement
| insufficiently questioned due to an excess of trust from
| themselves and others, as the mmap ones were). This is also
| in part based on the circumstance that at the time (of the
| mmap story, and myself being more involved in the project) I
| was actually involved in trying to optimise the SIMD linear
| algebra code, and unless llama.cpp has since switched to a
| significantly less performant implementation the proposition
| that so much more performance could be squeezed out strikes
| me as quite surprising. Here, your intuitions may say that
| Justine Tunney is just so brilliant that they make the
| seemingly impossible possible; but it was exactly this
| attitude that at the time made it so hard to evaluate the
| mmap memory usage claims rationally and made the discussion
| around it much more dysfunctional than it had to be.
| larodi wrote:
| All the core llama.cpp devs are superstar devs, 10x devs, or
| whatever you want to call a super smart person who is also
| super productive and very good with applied calculus. Jart is
| quite evidently very smart, but their relationship with this
| project has not been without turbulence, and at present they
| (jart) are not a core dev of llama.cpp. So for a while many of
| their (I'd like to write "her", but I'm not sure that's
| correct) actions have seemed aimed at getting attention, and
| perhaps particularly the attention of those same folks.
|
| By contrast, ggerganov, slaren, and JohannesGaessler have
| never chased this kind of sensationalist superstar status;
| they let their work speak for them. You'll barely find
| comments by these people on HN, while jart finds a way to
| surface on HN every so often. And this behaviour on jart's
| part now bears fruit - for example, Phoronix's Michael
| Larabel will praise jart for their work on llamafile while
| glossing over the fact that it is largely based on the
| wonderful work of ggerganov et al.
| __turbobrew__ wrote:
| When they claimed to drastically improve memory utilization
| through the use of memory maps, despite not having done so,
| and then started a huge controversy which derailed the
| project, I would say they were a 0.1x dev, not a 10x dev.
| s_Hogg wrote:
| I'd pay good money to watch jart in conversation with Carmack
| Solvency wrote:
| Carmack is great but completely irrelevant here. He missed the
| entire AI/LLM/ML boat to help Zuckerberg hawk virtual reality
| fantasies for years.
| vinkelhake wrote:
| _Completely irrelevant_ is probably overstating it. He's
| been working on AI for the last 4+ years.
| cactusplant7374 wrote:
| He's striving for AGI though, right? So he's not really
| working on anything because he certainly hasn't discovered
| AGI.
| Solvency wrote:
| He literally squandered the last 10 years of his life
| working on _absolutely nothing_ for Zuckerberg. And only
| after the rest of the world innovated on AI (transformers,
| etc.) did he clearly feel embarrassed and have to proclaim
| he's going to focus on AGI in a "one-up" way.
| talldayo wrote:
| > He literally squandered the last 10 years of his life
| working on absolutely nothing
|
| Speak for yourself, the Oculus Quest is the coolest piece
| of sub-$500 tech in my home.
| fkyoureadthedoc wrote:
| He got paid a lot to do something he was presumably
| passionate about and enjoyed. It also might surprise you
| to find out that there's quite a lot of people that just
| work as a means to an end, and find value and enjoyment
| primarily from other parts of their life.
| Solvency wrote:
| that's great for him. i'm glad he enjoyed the $$$ playing
| with VR. that has nothing to do with my point about his
| irrelevance to this LLaMa discussion.
| talldayo wrote:
| He's not irrelevant, though. Literally the first thing he
| did after leaving Meta was start an AI business, and the
| original point wasn't even necessarily about AI. They
| just said they wanted to see two engineers in
| conversation, and you used it as an opportunity to
| denigrate one of their previous employers. _That's_
| bewilderingly irrelevant.
| cactusplant7374 wrote:
| Altman isn't even relevant here. He is focusing on LLMs
| instead of a framework that gets us to AGI. He can't describe
| how we get there or any such theories around AGI. It's a
| complete failure.
| objektif wrote:
| Why is he even relevant? What makes you believe that he would
| be good at solving AI-related problems? He is a developer,
| right?
| s_Hogg wrote:
| To carry on, this is because they're both very interested in
| "knowledge in depth", rather than because of what they actually
| work on day-to-day. They've both made careers out of knowing
| what's going on with the thing they're building down to the
| most basic level possible.
| m3kw9 wrote:
| So is Nvidia in trouble now because Intel can be used instead
| for faster/cheaper inference?
| tubs wrote:
| The RAM is not on the CPU on a Mac. It's in the same package,
| but it's still regular DDR DIMMs.
| marshallward wrote:
| There is an implication here that the Fortran implementation of
| `SGEMM` is somehow inadequate. But any modern Fortran compiler
| will quite easily apply the AVX and FMA optimizations presented
| here without any additional changes. Both GNU and Intel make
| these substitutions with the correct flags.
|
| The unrolling optimization is also just another flag away
| (`-funroll-all-loops`). The Intel Compiler will even do this
| without prompting. In fact, it appears to only do a modest 2x
| unroll on my machine, suggesting that the extreme unroll in this
| article would have been overkill.
|
| Parallelization is certainly a lot to ask of Fortran 77 source,
| but there is little stopping you from adding OpenMP statements
| to the `SGEMM` function. In fact, modern Fortran even offers its
| own parallelization constructs if you're willing to go there.
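|
| For illustration only (not the reference Fortran itself): a
| minimal C++ analogue of what dropping an OpenMP directive onto
| the GEMM triple loop looks like; the names here are my own.
|
|     // Naive SGEMM-style kernel, C = A*B with row-major A (m x k)
|     // and B (k x n), outer loop parallelized with an OpenMP
|     // directive. A sketch only: the real reference SGEMM also
|     // handles alpha/beta scaling and transposed operands.
|     void sgemm_naive(int m, int n, int k,
|                      const float* A, const float* B, float* C) {
|       #pragma omp parallel for
|       for (int i = 0; i < m; ++i)
|         for (int j = 0; j < n; ++j) {
|           float acc = 0.0f;
|           for (int l = 0; l < k; ++l)
|             acc += A[i*k + l] * B[l*n + j];
|           C[i*n + j] = acc;
|         }
|     }
|
| With something like `gfortran -O3 -march=native
| -funroll-all-loops -fopenmp` (or the equivalent g++ flags for
| this sketch), the compiler will auto-vectorize the inner loop
| with AVX/FMA on its own, which is the point being made above.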
|
| Which is to say: Let's not belittle this old Fortran 77 function.
| Yes it is old, and does not even resemble modern Fortran. But the
| whole point of Fortran is to free the developer from these
| platform-specific details, and hand the job off to the compiler.
| If you don't like that approach, then you're welcome to go to C
| or C++. But this little block of Fortran code is already capable
| of doing just about everything in this article.
| steppi wrote:
| The Fortran implementation is just a reference implementation.
| The goal of reference BLAS [0] is to provide relatively simple
| and easy to understand implementations which demonstrate the
| interface and are intended to give correct results to test
| against. Perhaps an exceptional Fortran compiler which doesn't
| yet exist could generate code which rivals hand (or
| automatically) tuned optimized BLAS libraries like OpenBLAS
| [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in
| practice this is not observed.
|
| Justine observed that the threading model for LLaMA makes it
| impractical to integrate one of these optimized BLAS libraries,
| so she wrote her own hand-tuned implementations following the
| same principles they use.
|
| [0]
| https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprogra...
|
| [1] https://github.com/OpenMathLib/OpenBLAS
|
| [2]
| https://www.intel.com/content/www/us/en/developer/tools/onea...
|
| [3]
| https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Alg...
|
| [4]https://en.wikipedia.org/wiki/BLIS_(software)
| marshallward wrote:
| Fair enough, this is not meant to be some endorsement of the
| standard Fortran BLAS implementations over the optimized
| versions cited above. Only that the mainstream compilers
| cited above appear capable of applying these optimizations to
| the standard BLAS Fortran without any additional effort.
|
| I am basing these comments on quick inspection of the
| assembly output. Timings would be equally interesting to
| compare at each stage, but I'm only willing to go so far for
| a Hacker News comment. So all I will say is perhaps let's
| keep an open mind about the capability of simple Fortran
| code.
| steppi wrote:
| Check out _The Science of Programming Matrix Computations_
| by Robert A. van de Geijn and Enrique S. Quintana-Ortí.
| Chapter 5 walks through how to write an optimized GEMM. It
| involves clever use of block multiplication, choosing block
| sizes for optimal cache behavior for specific chips. Modern
| compilers just aren't able to do such things now. I've
| spent a little time debugging things in scipy.linalg by
| swapping out OpenBLAS with reference BLAS and have found
| the slowdown from using reference BLAS is typically at
| least an order of magnitude.
|
| [0] https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
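|
| To make the block-multiplication idea concrete, here is a
| minimal (untuned) C++ sketch; MC/NC/KC are placeholder names
| and values of mine, whereas real libraries pick block sizes
| per chip so the working set stays resident in cache:
|
|     #include <algorithm>
|
|     // Cache-blocked C += A*B (row-major). Same arithmetic as the
|     // naive triple loop, but the iteration space is tiled so each
|     // block of A, B and C is reused from cache instead of being
|     // re-fetched from memory. Block sizes below are placeholders.
|     constexpr int MC = 64, NC = 64, KC = 64;
|
|     void sgemm_blocked(int m, int n, int k,
|                        const float* A, const float* B, float* C) {
|       for (int i0 = 0; i0 < m; i0 += MC)
|         for (int l0 = 0; l0 < k; l0 += KC)
|           for (int j0 = 0; j0 < n; j0 += NC)
|             for (int i = i0; i < std::min(i0 + MC, m); ++i)
|               for (int l = l0; l < std::min(l0 + KC, k); ++l) {
|                 const float a = A[i*k + l];
|                 for (int j = j0; j < std::min(j0 + NC, n); ++j)
|                   C[i*n + j] += a * B[l*n + j];
|               }
|     }
|
| Optimized BLAS libraries add packing of the blocks into
| contiguous buffers plus hand-written SIMD micro-kernels on top
| of this structure, which is where much of the
| order-of-magnitude gap over reference BLAS comes from.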
| marshallward wrote:
| You are right, I just tested this out and my speed from
| BLAS to OpenBLAS went from 6 GFLOP/s to 150 GFLOP/s. I
| can only imagine what BLIS and MKL would give. I
| apologize for my ignorance. Apparently my faith in the
| compilers was wildly misplaced.
| pklausler wrote:
| Modern Fortran's only parallel feature is coarrays, which
| operate at the whole program level.
|
| DO CONCURRENT is a serial construct with an unspecified order
| of iterations, not a parallel construct. A DO CONCURRENT loop
| imposes requirements that allow an arbitrary order of
| iterations but which are not sufficient for safe
| parallelization.
| marshallward wrote:
| How do you feel about Nvidia endorsing DO CONCURRENT
| migration to GPUs? Would that be classified as
| parallelization?
| brrrrrm wrote:
| using AVX/FMA and unrolling loops does extremely little in the
| way of compiling to fast (>80% peak) GEMM code. These are very
| much intro steps that don't take into account _many_ important
| ideas related to cache hierarchy, uop interactions, and even
| instruction decode time. The Fortran implementation is entirely
| and unquestionably inadequate for real high performance GEMMs.
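|
| For context, this is roughly what such an "intro step" looks
| like: an AVX2/FMA update of one 8-wide strip of C (my own
| sketch, not code from the article). On its own it gets nowhere
| near peak; register blocking, panel packing, and cache-aware
| loop ordering around it are what close the gap.
|
|     #include <immintrin.h>
|
|     // C(i, j..j+7) += A(i, :) * B(:, j..j+7), row-major, AVX2+FMA.
|     // Assumes j+8 <= n and does no tail handling; build with
|     // -mavx2 -mfma (or -march=native).
|     void fma_strip(const float* A, const float* B, float* C,
|                    int i, int j, int k, int n) {
|       __m256 c = _mm256_loadu_ps(&C[i*n + j]);
|       for (int l = 0; l < k; ++l) {
|         __m256 a = _mm256_set1_ps(A[i*k + l]);    // broadcast A(i,l)
|         __m256 b = _mm256_loadu_ps(&B[l*n + j]);  // load B(l,j..j+7)
|         c = _mm256_fmadd_ps(a, b, c);             // c += a*b
|       }
|       _mm256_storeu_ps(&C[i*n + j], c);
|     }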
| marshallward wrote:
| I don't disagree, but where are those techniques presented in
| the article? It seems like she exploits the particular shape
| of her matrix to align better with cache. No BLAS library is
| going to figure that out.
|
| I am not trying to say that a simple 50+ year old matrix
| solver is somehow competitive with existing BLAS libraries.
| But I disagreed with its portrayal in the article, which
| associated the block with NumPy performance. Give that to a
| 2024 Fortran compiler, and it's going to get enough right to
| produce reasonable machine code.
| marshallward wrote:
| I just did a test comparing Intel-compiled reference BLAS
| against OpenBLAS, and it was about 6 GFLOP/s vs 150 GFLOP/s,
| so I must admit that I
| was wrong here. Maybe in some sense 4% is not bad, but it's
| certainly not good. My faith in current compilers has
| certainly been shattered quite a bit today.
|
| Anyway, I have come to eat crow. Thank you for your insight
| and helping me to get a much better perspective on this
| problem. I mostly work with scalar and vector updates, and do
| not work with arrays very often.
| hrkfmud50k wrote:
| > It's clearly optimal since my CPU is listed as only being
| capable of going 780 gigaflops
|
| 780 GFLOPS is the iGPU spec. Is this a valid comparison?
|
| https://nanoreview.net/en/cpu/intel-core-i9-14900k
| arendtio wrote:
| Does anyone else see llamafile using Wine on Linux?
|
| Edit: After the download I did a simple chmod +x
| llava-v1.5-7b-q4.llamafile; ./llava-v1.5-7b-q4.llamafile
| jart wrote:
| There's a simple fix for that.
|
|     sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
|     sudo chmod +x /usr/bin/ape
|     sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
|     sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
|
| https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-fil...
| yieldcrv wrote:
| note, this is "goes faster on CPUs than before", not faster than
| GPUs.
| TimPC wrote:
| Strange title. On my first read of the title, I thought the
| author was arguing the model is now faster on CPU than on GPU.
| Would be much
| nicer if they titled this something closer to "Performance
| Improvement for LLaMa on CPU".
| utopcell wrote:
| Same here.
| ein0p wrote:
| As someone who has tried to beat MKL-DNN, and was unsuccessful at
| doing so even for constrained matrix sizes, I'm curious how they
| pulled off such a massive improvement.
|
| But as someone who routinely estimates picojoules per flop at
| $DAY_JOB - there's simply no way this is energy efficient. That
| is not even physically possible with a CPU.
| janwas wrote:
| I think the previous code was using dot products, f32 instead
| of bf16.
| aaronscott wrote:
| > I like to define my subroutines using a modern language like
| C++, which goes 47 gigaflops. This means C++ is three orders of a
| magnitude faster than Python. That's twenty years of progress per
| Moore's law.
|
| This is great. I love the idea of measuring performance
| differences in "years of Moore's law."
|
| Twenty years puts the delta in an easy to understand framework.
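|
| For anyone who wants the back-of-envelope conversion (assuming
| the conventional doubling every ~2 years):
|
|     1000x (three orders of magnitude) ~= 2^10  ->  ~10 doublings
|     10 doublings x ~2 years per doubling       ->  ~20 years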
| JohnKemeny wrote:
| I doubt that you could get Python to run faster than C++ on
| 2004 hardware.
| mrtranscendence wrote:
| Python on 2024 hardware vs C++ on 2004 hardware ... I don't
| think it's obvious that C++ always wins here, though it would
| depend on the use case, how much of the Python is underpinned
| by native libraries, and the specific hardware in question.
| JohnKemeny wrote:
| If we allow native libraries, it's not clear that C++ would
| win, even on modern hardware.
| michaelt wrote:
| I think we all know that, when someone writes "C++ is
| three orders of a magnitude faster than Python" they're
| not including native libraries.
| mrtranscendence wrote:
| You can't _not_ include native libraries, at least if you
| want your benchmark to be realistic. Almost every Python
| library where performance matters is written (at least
| partially) in a compiled language.
| bornfreddy wrote:
| Yes, but many people like the sound of "X-times faster
| than Python" while conveniently forgetting that the same
| thing can be (and usually is) done in Python + numpy &
| co. even faster.
|
| I have come to appreciate "slowness" of Python. It trades
| speed for legibility, which is a great compromise once
| you have _really fast_ native libraries one import away.
| Best of both worlds.
| bevekspldnw wrote:
| Honestly depends on what you are doing. Most of my python
| work is data collection and analysis on top of Postgres.
|
| Being smart in how I use Postgres indexing (and when to
| disable it outright) has more performance impact than the
| actual language doing the plumbing.
| rbnsl wrote:
| Definitely wild that we're in the timeline where you can run a
| 1.1 bn param model on a Raspberry Pi, but it's still tough to
| justify because the 1.1B is kinda useless compared to the
| beefier models. Sick for home builds/hobbyists though; I might
| wanna get one of the new Pis just to try this out.
| JohnnyHerz wrote:
| Awesomeness. Thank you for sharing!
___________________________________________________________________
(page generated 2024-04-01 23:01 UTC)