[HN Gopher] GGML - AI at the Edge
___________________________________________________________________
GGML - AI at the Edge
Author : georgehill
Score : 466 points
Date : 2023-06-06 16:50 UTC (6 hours ago)
(HTM) web link (ggml.ai)
(TXT) w3m dump (ggml.ai)
| [deleted]
| zkmlintern wrote:
| [dead]
| huevosabio wrote:
| Very exciting!
|
| Now, we just need a post that benchmarks the different options
| (ggml, TVM, AITemplate, HippoML) and helps decide which route
| to take.
| Havoc wrote:
| How common is AVX on edge platforms?
| binarymax wrote:
| svantana is correct that PCs are edge, but if you meant
| "mobile", then ARM in iOS and Android typically have NEON
| instructions for SIMD, not AVX:
| https://developer.arm.com/Architectures/Neon
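|
| To make that concrete, here's a rough sketch (not taken from
| ggml; just an illustration) of the same dot product written with
| NEON intrinsics, with the AVX2 equivalent in comments. It assumes
| n is a multiple of the vector width:
|
|     #include <arm_neon.h>
|
|     // NEON: 4 floats per 128-bit register
|     float dot_neon(const float *a, const float *b, int n) {
|         float32x4_t acc = vdupq_n_f32(0.0f);
|         for (int i = 0; i < n; i += 4)
|             acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
|         return vaddvq_f32(acc);  // horizontal sum (AArch64)
|     }
|
|     // AVX2/FMA: same idea, 8 floats per 256-bit register
|     // #include <immintrin.h>
|     //   __m256 acc = _mm256_setzero_ps();
|     //   for (int i = 0; i < n; i += 8)
|     //       acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
|     //                             _mm256_loadu_ps(b + i), acc);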
| Havoc wrote:
| I was thinking more edge in the distributed serverless sense,
| but I guess for this type of use the compute part is slow, not
| the network latency, so the question doesn't make much sense in
| hindsight.
| binarymax wrote:
| Compute _is_ the latency for LLMs :)
|
| And in general, your inference code will be compiled to a
| CPU/Architecture target - so you can know ahead of time
| what instructions you'll have access to when writing your
| code for that target.
|
| For example, in the case of AWS Lambda you can choose
| Graviton2 (ARM with NEON) or x86_64 (AVX). The trick is
| that some processors, such as newer Xeons, support AVX-512,
| while on others you will top out at 256-bit AVX2. You might
| be able to figure out exactly which instruction set your
| serverless target supports.
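|
| As a rough illustration (GCC/Clang builtins, x86 only, and not
| specific to Lambda - just an assumed way to probe a host), you
| can check at runtime what the CPU actually supports:
|
|     #include <cstdio>
|
|     int main() {
|         __builtin_cpu_init();  // populate feature flags via CPUID
|         std::printf("AVX2:     %d\n", __builtin_cpu_supports("avx2"));
|         std::printf("AVX-512F: %d\n", __builtin_cpu_supports("avx512f"));
|         std::printf("FMA:      %d\n", __builtin_cpu_supports("fma"));
|         return 0;
|     }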
| svantana wrote:
| Edge just means that the computing is done close to the I/O
| data, so that includes PCs and such.
| Dwedit wrote:
| There was a big stink one time when the file format changed,
| causing older model files to become unusable on newer versions of
| llama.cpp.
| pawelduda wrote:
| I happen to have RPi 4B with HomeAssistant. Is this something I
| could set up on it and integrate with HA to control it with
| speech, or is it overkill?
| boppo1 wrote:
| I doubt it. I'm running 4-bit 30B and 65B models with 64GB RAM,
| a 4080 and a 7900X. The 7B models are less demanding, but even
| so, you'll need more than an RPi. Even then, it would be a
| _project_ to get these to control something. This is more
| 'first baby steps' toward the edge.
| pawelduda wrote:
| The article shows an example running on an RPi that recognizes
| colour names. I could just come up with keywords that would
| invoke certain commands and feed them to HA, which would
| match them to an automation (e.g. "turn off kitchen", or just
| "kitchen"). I think a PoC is doable, but I'm aware I could
| run into limitations quickly. Idk, might give it a try when
| I'm bored.
|
| Would love a voice assistant running locally, but there are
| probably solutions out there already - didn't get to do the
| research yet.
| nivekney wrote:
| On a similar thread, how does it compare to Hippoml?
|
| Context: https://news.ycombinator.com/item?id=36168666
| brucethemoose2 wrote:
| We don't necessarily know... Hippo is closed source for now.
|
| It's comparable in speed to Apache TVM's Vulkan backend on CUDA
| hardware; see https://github.com/mlc-ai/mlc-llm
|
| But honestly, the biggest advantage of llama.cpp for me is
| being able to split a model so performantly. My puny 16GB
| laptop can _just barely_ , but very practically, run LLaMA 30B
| at almost 3 tokens/s, and do it right now. That is crazy!
| smiley1437 wrote:
| >> run LLaMA 30B at almost 3 tokens/s
|
| Please tell me your config! I have an i9-10900 with 32GB of
| ram that only gets .7 tokens/s on a 30B model
| LoganDark wrote:
| > Please tell me your config! I have an i9-10900 with 32GB
| of ram that only gets .7 tokens/s on a 30B model
|
| Have you quantized it?
| smiley1437 wrote:
| The model I have is q4_0, which I think is 4-bit quantized
|
| I'm running in Windows using koboldcpp, maybe it's faster
| in Linux?
| LoganDark wrote:
| > The model I have is q4_0, which I think is 4-bit quantized
|
| That's correct, yeah. Q4_0 should be the smallest and
| fastest quantized model.
|
| > I'm running in Windows using koboldcpp, maybe it's
| faster in Linux?
|
| Possibly. You could try using WSL to test--I think both
| WSL1 and WSL2 are faster than Windows (but WSL1 should be
| faster than WSL2).
| brucethemoose2 wrote:
| I am running Linux with cuBLAS offload, and I am using
| the new 3-bit quant that was just pulled in a day or two
| ago.
| brucethemoose2 wrote:
| I'm on a Ryzen 4900HS laptop with an RTX 2060.
|
| Like I said, very modest.
| oceanplexian wrote:
| With a single NVIDIA 3090 and the fastest inference branch
| of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-
| LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per
| second on the 30B models. IMO GGML is great (And I totally
| use it) but it's still not as fast as running the models on
| GPU for now.
| LoganDark wrote:
| > IMO GGML is great (And I totally use it) but it's still
| not as fast as running the models on GPU for now.
|
| I think it was originally designed to be easily
| embeddable--and most importantly, _native code_ (i.e. not
| Python)--rather than competitive with GPUs.
|
| I think it's just starting to get into GPU support now,
| but carefully.
| brucethemoose2 wrote:
| Have you tried the most recent cuda offload? A dev claims
| they are getting 26.2ms/token (38 tokens per second) on
| 13B with a 4080.
| yukIttEft wrote:
| Its graph execution is still full of busyloops, e.g.:
|
| https://github.com/ggerganov/llama.cpp/blob/44f906e8537fcec9...
|
| I wonder how much more efficient it would be if the Taskflow
| library were used instead, or even Intel TBB.
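|
| For comparison, a minimal Taskflow-style sketch (hypothetical
| tasks, not llama.cpp's actual graph) where dependencies are
| declared up front and the executor parks idle workers instead
| of spinning:
|
|     #include <taskflow/taskflow.hpp>
|
|     int main() {
|         tf::Executor executor;
|         tf::Taskflow taskflow;
|
|         // three dummy nodes of a compute graph
|         auto A = taskflow.emplace([] { /* e.g. a matmul */ });
|         auto B = taskflow.emplace([] { /* another op    */ });
|         auto C = taskflow.emplace([] { /* final op      */ });
|
|         A.precede(B, C);  // B and C run after A, possibly in parallel
|         executor.run(taskflow).wait();
|         return 0;
|     }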
| mhh__ wrote:
| It's not a very good library IMO.
| moffkalast wrote:
| Someone ought to be along with a PR eventually.
| boywitharupee wrote:
| is graph execution used for training only or inference also?
| LoganDark wrote:
| Inference. It's a big bottleneck for RWKV.cpp, second only to
| the matrix multiplies.
| make3 wrote:
| does tbb work with apple Silicon?
| yukIttEft wrote:
| I guess https://formulae.brew.sh/formula/tbb
| [deleted]
| renewiltord wrote:
| This guy is damned good. I sponsored him on Github because his
| software is dope. I also like how when some controversy erupted
| on the project he just ejected the controversial people and moved
| on. Good stewardship. Great code.
|
| I recall that when he first ported it, it worked on my M1 Max
| even though he hadn't yet tested it on Apple Silicon, since he
| didn't have the hardware.
|
| Honestly, with this and whisper, I am a huge fan. Good luck to
| him and the new company.
| killthebuddha wrote:
| Another important detail about the ejections that I think is
| particularly classy is that the people he ejected are broadly
| considered to have world-class technical skills. In other
| words, he was very explicitly prioritizing collaborative
| potential > technical skill. Maybe a future BDFL[1]!
|
| [1] https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
| jart wrote:
| Gerganov was prioritizing collaboration with 4chan who raided
| his GitHub to demand a change written by a transgender woman
| be reverted. There was so much hate speech and immaturity
| thrown around (words like tranny troon cucking muh model)
| that it's a real embarrassment (to those of us who deeply want
| to see local models succeed) that one of the smartest guys
| working on the problem was taken in by all that. You can't
| run a collaborative environment that's open when you pander
| to hate, because hate subverts communities; it's impossible
| to compromise with anonymous trolls who harass a public
| figure over physical traits about her body she can't change.
|
| You don't have to take my word on it. Here are some archives
| of the 4chan threads where they coordinated the raid. It went
| on for like a month. https://archive.is/EX7Fq
| https://archive.is/enjpf https://archive.is/Kbjtt
| https://archive.is/HGwZm https://archive.is/pijMv
| https://archive.is/M7hLJ https://archive.is/4UxKP
| https://archive.is/IB9bv https://archive.is/p6Q2q
| https://archive.is/phCGN https://archive.is/M6AF1
| https://archive.is/mXoBs https://archive.is/68Ayg
| https://archive.is/DamPp https://archive.is/DiQC2
| https://archive.is/DeX8Z https://archive.is/gStQ1
|
| If you read these threads and see how nasty these little
| monsters are, you can probably imagine how Gerganov must have
| felt. He was probably scared they'd harass him too, since
| 4chan acts like he's their boy. Plus it was weak leadership
| on his part to disappear for days, suddenly show up again to
| neutral-knight the conflict
| (https://justine.lol/neutral-knight.png), tell his team members
| they're no longer welcome, and then go back and delete his
| comment later. Just goes to show you can be really brilliant
| at the hard technical skills, but totally clueless when it
| comes to people.
| zo1 wrote:
| Really curious why you tried to rename the file format
| magic string to have your initials? Going from GGML (see
| Title of this post) to GGJT with JT being Justine Tunney?
| Seems quite unnecessary and bound to have rubbed a lot of
| people the wrong way.
|
| Here is the official commit undoing the change:
|
| https://github.com/ggerganov/llama.cpp/pull/711/files#diff-
| 7...
| killthebuddha wrote:
| I didn't want to not reply but I also didn't want to be
| swept into a potentially fraught internet argument. So, I
| tried to edit my comment as a middle ground, but it looks
| like I can't, I guess there must be a timeout. If I could
| edit it, I'd add the following:
|
| "I should point out that I wasn't personally involved,
| haven't looked into it in detail, and that there are many
| different perspectives that should be considered."
| evanwise wrote:
| What was the controversy?
| kgwgk wrote:
| https://news.ycombinator.com/item?id=35411909
| pubby wrote:
| https://github.com/ggerganov/llama.cpp/pull/711
| nchudleigh wrote:
| he has been amazing to watch and has even helped me out with my
| app that uses his whisper.cpp project
| (https://superwhisper.com)
|
| Excited to see how his venture goes!
| PrimeMcFly wrote:
| > I also like how when some controversy erupted on the project
| he just ejected the controversial people and moved on. Good
| stewardship
|
| Do you have more info on the controversy? I'm not sure ejecting
| developers just because of controversy is honestly good
| stewardship.
| freedomben wrote:
| Right. More details needed to know if this is good
| stewardship (ejecting two toxic individuals) or laziness
| (ejecting a villain and a hero to get rid of the "problem"
| easily). TikTok was using this method for a while by ejecting
| both bullies and victims, and it "solved" the problem but
| most people see the injustice there.
|
| I'm not saying it was bad stewardship, I honestly don't know.
| I just agree that we shouldn't make a judgment without more
| information.
| jstarfish wrote:
| > More details needed to know if this is good stewardship
| (ejecting two toxic individuals) or laziness (ejecting a
| villain and a hero to get rid of the "problem" easily).
| TikTok was using this method for a while by ejecting both
| bullies and victims,
|
| This is SOP for American schools. It's laziness there,
| since education is supposed to be compulsory. They can't be
| bothered to investigate (and with today's hostile climate,
| I don't blame them) so they consign both parties to
| independent-study programs.
|
| For volunteer projects, throwing both overboard is
| unfortunate but necessary stewardship. The drama destabilizes
| the entire project, which only exists as long as it remains
| _fun_ for the maintainer. It's tragic, but victims who can't
| recover gracefully are as toxic as their abusers.
| boppo1 wrote:
| >justice
|
| For an individual running a small open source project,
| there's time enough for coding or detailed justice, but not
| both. When two parties start pointing fingers and raising
| hell and it's not immediately clear who is in the right, ban
| both and let them fork it.
| csmpltn wrote:
| > More details needed to know if this is good stewardship
| (ejecting two toxic individuals) or laziness (ejecting a
| villain and a hero to get rid of the "problem" easily).
|
| Man, nobody has time for this shit. Leave the games and the
| drama for the social justice warriors and the furries.
| People building shit ain't got time for this - ejecting
| trouble makers is the right way to go regardless of which
| "side" they're on.
| LoganDark wrote:
| > and the furries
|
| Um, what?
| camdenlock wrote:
| If you know, you know
| freedomben wrote:
| I would agree that there needs to be a balance because
| wasting time babysitting adults is dumb, but what if one
| person is a good and loved contributor, and the other is
| a social justice warrior new to the project who is
| picking fights with the contributor? Your philosophy
| makes for not only bad stewardship but an injustice. I'm
| not suggesting this is the only scenario, merely a
| hypothetical that I think illustrates my position.
| wmf wrote:
| And what do you do when every contributor to the project,
| including the founder, has been labeled a troublemaker?
| boppo1 wrote:
| Pick the fork that has devs who are focused on
| contributing code and not pursuing drama.
| infamouscow wrote:
| The code is MIT licensed. If you don't agree with the
| direction the project is taking you can fork it and add
| whatever you want.
|
| I don't understand why this is so difficult for software
| developers with GitHub accounts to understand.
| PrimeMcFly wrote:
| You've missed the point here more than I've seen anyone
| miss the point in a long time.
| infamouscow wrote:
| Software stewardship is cringe.
|
| The idea that software licensed under a free software
| license can have a steward doesn't even make sense.
|
| How exactly does someone supervise or take care of
| intellectual property (read: code) when the author and
| original copyright holder explicitly licensed their work
| under the MIT license, granting anyone the following:
|
| > [T]o deal in the software without restriction,
| including without limitation the rights to use, copy,
| modify, merge, publish, distribute, sublicense, and/or
| sell copies of the software, and to permit persons to
| whom the software is furnished to do so, subject to the
| following conditions
|
| The author was certainly a steward when they were working
| on it in private, or heck, even in public since copyright
| is implicit, but certainly not after adding the MIT
| license.
|
| So when I think of software stewardship, all I see are
| self-appointed thought-leaders and corporate vampires
| like Oracle chest-beating to the public about how
| important they are.
|
| Simply a way for those in positions of power/status to
| remain in their positions elevated above everyone else.
| Depending on the situation and context that might be good
| or bad. What's important is it's not for these so-called
| "stewards" to decide.
| iamflimflam1 wrote:
| I've always thought of the edge as IoT-type stuff, i.e. running
| on embedded devices. But maybe that's not the case?
| Y_Y wrote:
| Like any new term, (mis)usage broadens the meaning over time
| until either it's widely known, it's unfashionable, or, most
| likely, it becomes so broad as to be meaningless and hence
| achieves buzzword apotheosis.
|
| My old job title had "edge" in it, and I still don't know what
| it's supposed to mean, although "not cloud" is a good
| approximation.
| b33j0r wrote:
| Sounds like your job had a lot of velocity with lateral
| tragmorphicity in Q1, just in time for staff engineer
| optimization!
|
| Nicely done. Here is ~$50 worth of stock.
| timerol wrote:
| "Edge computing" is a pretty vague term, and can encompass
| anything from an 8MHz ARM core that can barely talk compliant
| BLE, all the way to a multi-thousand dollar setup on something
| like a self-checkout machine, which may have more compute
| available than your average laptop. In that range are home
| assistants, which normally have some basic ML for wake word
| detection, and then send the next bit of audio to the cloud
| with a more advanced model for full speech-to-text (and
| response)
| conjecTech wrote:
| Congratulations! How do you plan to make money?
| ggerganov wrote:
| I'm planning to write code and have fun!
| az226 wrote:
| Have you thought about what your path looks like to get to
| the next phase? Are you taking on any more investors pre-seed?
| beardog wrote:
| >ggml.ai is a company founded by Georgi Gerganov to support
| the development of ggml. Nat Friedman and Daniel Gross
| provided the pre-seed funding.
|
| Did you give them a different answer? It is okay if you can't
| or don't want to share, but I doubt the company is only
| planning to have fun. Regardless, best of luck to you and
| thank you for your efforts so far.
| jgrahamc wrote:
| This is a good plan.
| TechBro8615 wrote:
| I believe ggml is the basis of llama.cpp (the OP says it's "used
| by llama.cpp")? I don't know much about either, but when I read
| the llama.cpp code to see how it was created so quickly, I got
| the sense that the original project was ggml, given the amount of
| pasted code I saw. It seemed like quite an impressive library.
| make3 wrote:
| it's the library used for tensor operations inside of
| llama.cpp, yes
| kgwgk wrote:
| https://news.ycombinator.com/item?id=33877893
|
| "OpenAI recently released a model for automatic speech
| recognition called Whisper. I decided to reimplement the
| inference of the model from scratch using C/C++. To achieve
| this I implemented a minimalistic tensor library in C and
| ported the high-level architecture of the model in C++."
|
| That "minimalistic tensor library" was ggml.
| world2vec wrote:
| Might be a silly question but is GGML a similar/competing library
| to George Hotz's tinygrad [0]?
|
| [0] https://github.com/geohot/tinygrad
| qeternity wrote:
| No, GGML is a CPU optimized library and quantized weight format
| that is closely linked to his other project llama.cpp
| stri8ed wrote:
| How does the quantization happen? Are the weights
| preprocessed before loading the model?
| ggerganov wrote:
| The weights are preprocessed into integer quants combined
| with scaling factors in various configurations (4, 5,
| 8-bits and recently more exotic 2, 3 and 6-bit quants). At
| runtime, we use efficient SIMD implementations to perform
| the matrix multiplication at integer level, carefully
| optimizing for both compute and memory bandwidth. Similar
| strategies are applied when running GPU inference - using
| custom kernels for fast matrix-vector multiplications.
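|
| A minimal sketch of the idea (an illustrative block layout, not
| the exact q4_0/q4_1 formats, which differ in details such as
| block size and how the scale is stored):
|
|     #include <math.h>
|     #include <stdint.h>
|
|     // One block: 32 weights stored as one float scale + 4-bit ints.
|     struct block_q4 {
|         float   d;       // scaling factor for the block
|         uint8_t qs[16];  // 32 quants, two 4-bit values per byte
|     };
|
|     void quantize_block(const float *x, struct block_q4 *out) {
|         float amax = 0.0f;
|         for (int i = 0; i < 32; i++) amax = fmaxf(amax, fabsf(x[i]));
|         const float d  = amax / 7.0f;  // map [-amax, amax] to [-7, 7]
|         const float id = d != 0.0f ? 1.0f/d : 0.0f;
|         out->d = d;
|         for (int i = 0; i < 16; i++) {
|             const int q0 = (int)roundf(x[2*i + 0] * id) + 8;  // 1..15
|             const int q1 = (int)roundf(x[2*i + 1] * id) + 8;
|             out->qs[i] = (uint8_t)(q0 | (q1 << 4));
|         }
|     }
|
|     // What a matmul kernel undoes on the fly (in practice with SIMD,
|     // without ever materializing the full float weights).
|     void dequantize_block(const struct block_q4 *in, float *y) {
|         for (int i = 0; i < 16; i++) {
|             y[2*i + 0] = ((in->qs[i] & 0x0F) - 8) * in->d;
|             y[2*i + 1] = ((in->qs[i] >>   4) - 8) * in->d;
|         }
|     }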
| sebzim4500 wrote:
| Yes, but to my knowledge it doesn't do any of the
| complicated optimization stuff that SOTA quantisation
| methods use. It basically is just doing a bunch of
| rounding.
|
| There are advantages to simplicity, after all.
| brucethemoose2 wrote:
| It's not so simple anymore, see
| https://github.com/ggerganov/llama.cpp/pull/1684
| ggerganov wrote:
| ggml started with focus on CPU inference, but lately we have
| been augmenting it with GPU support. Although still in
| development, it already has partial CUDA, OpenCL and Metal
| backend support
| qeternity wrote:
| Hi Georgi - thanks for all the work, have been following
| and using since the availability of Llama base layers!
|
| Wasn't implying it's CPU only, just that it started as a
| CPU optimized library.
| ignoramous wrote:
| (a novice here who knows a couple of fancy terms)
|
| > _...lately we have been augmenting it with GPU support._
|
| Would you say you'd then be building an equivalent to
| Google's JAX?
|
| Someone even asked if anyone would build a C++ to JAX
| transpiler [0]... I am wondering if that's something you
| may implement? Thanks.
|
| [0] https://news.ycombinator.com/item?id=35475675
| freedomben wrote:
| As a person burned by nvidia, I can't thank you enough for
| the OpenCL support
| xiphias2 wrote:
| They are competing (although they are very different, tinygrad
| is full stack Python, ggml is focusing on a few very important
| models), but in my opinion George Hotz lost focus a bit by not
| working more on getting the low level optimizations perfect.
| georgehotz wrote:
| Which low level optimizations specifically are you referring
| to?
|
| I'm happy with most of the abstractions. We are pushing to
| assembly codegen. And if you meant things like matrix
| accelerators, that's my next priority.
|
| We are taking more of a breadth-first approach. I think ggml
| is more depth-first and application-focused. (And I think
| Mojo is even more breadth-first.)
| edfletcher_t137 wrote:
| This is a bang-up idea, you absolutely love to see capital
| investment on this type of open, commodity-hardware-focused
| foundational technology. Rock on GGMLers & thank you!
| boringuser2 wrote:
| Looking at the source of this kind of underlines the difference
| between machine learning scientist types and actual computer
| scientists.
| rvz wrote:
| > Nat Friedman and Daniel Gross provided the pre-seed funding.
|
| Why? Why should VCs get involved again?
|
| They are just going to look for an exit and end up getting
| acquired by Apple Inc.
|
| Not again.
| sroussey wrote:
| Daniel Gross is a good guy, and yes, his company did get acquired
| by Apple a while back, but he loves to foster really dope stuff
| by amazing people, and ggml certainly fits the bill. And this
| looks like an Angel investment, not a VC one if that makes any
| difference to you.
| renewiltord wrote:
| It's possible to do whatever you want without VCs. The code is
| open source so you can start where he's starting from and run a
| purely different enterprise if you desire.
| okhuman wrote:
| +1. VC involvement in projects like these always pivots the team
| away from the core competency of what you'd expect them to
| deliver - into some commercialization aspect that converts only
| a tiny fraction of the community yet takes up 60%+ of the core
| developer team's time.
|
| I don't know why project founders head this way...as the track
| records of leaders who do this end up disappointing the
| involved community at some point. Look to matt klein + cloud
| native computing foundation at envoy for a somewhat decent
| model of how to do this better.
|
| We continue down the Open Core model yet it continues to fail
| communities.
| wmf wrote:
| Developers shouldn't be unpaid slaves to the community.
| okhuman wrote:
| You're right. I just wish this decision had been taken to the
| community; we could have all come together to help and offer
| support during these difficult/transitional times. :( Maybe
| this decision was rushed or is money related; who knows the
| actual circumstances.
|
| Here's the Matt K article
| https://mattklein123.dev/2021/09/14/5-years-envoy-oss/
| jart wrote:
| Whenever a community project goes commercial, its interests
| are usually no longer aligned with the community. For
| example, llama.cpp makes frequent backwards-incompatible
| changes to its file format. I maintain a fork of ggml in the
| cosmopolitan monorepo which maintains support for old file
| formats. You can build and use it as follows:
|       git clone https://github.com/jart/cosmopolitan
|       cd cosmopolitan
|
|       # cross-compile on x86-64-linux for
|       # x86-64 linux+windows+macos+freebsd+openbsd+netbsd
|       make -j8 o//third_party/ggml/llama.com
|       o//third_party/ggml/llama.com --help
|
|       # cross-compile on x86-64-linux for aarch64-linux
|       make -j8 m=aarch64 o/aarch64/third_party/ggml/llama.com
|       # note: creates .elf file that runs on RasPi, etc.
|
|       # compile loader shim to run on arm64 macos
|       cc -o ape ape/ape-m1.c   # use xcode
|       ./ape ./llama.com --help # use elf aarch64 binary above
|
| It goes the same speed as upstream for CPU inference. This is
| useful if you can't/won't recreate your weights files, or
| want to download old GGML weights off HuggingFace, since
| llama.com has support for every generation of the ggjt file
| format.
| halyconWays wrote:
| [dead]
| throw74775 wrote:
| Do you have pre-seed funding to give him?
| jgrahamc wrote:
| I do.
| samwillis wrote:
| ggml and llama.cpp are such a good platform for local LLMs,
| having some financial backing to support development is
| brilliant. We should be concentrating as much as possible on
| doing local inference (and training) based on private data.
|
| I want a _local_ ChatGPT fine tuned on my personal data running
| on my own device, not in the cloud. Ideally open source too,
| llama.cpp is looking like the best bet to achieve that!
| SparkyMcUnicorn wrote:
| Maybe I'm wrong, but I don't think you want it fine-tuned on
| your data.
|
| Pretty sure you might be looking for this:
| https://github.com/SamurAIGPT/privateGPT
|
| Fine-tuning is good for teaching it how to act, but not great
| for reciting/recalling data.
| dr_dshiv wrote:
| How does this work?
| deet wrote:
| The parent is saying that "fine tuning", which has a
| specific meaning related to actually retraining the model
| itself (or layers at its surface) on a specialized set of
| data, is not what the GP is actually looking for.
|
| An alternative method is to index content in a database and
| then insert contextual hints into the LLM's prompt that
| give it extra information and detail with which to respond
| with an answer on-the-fly.
|
| That database can use semantic similarity (ie via a vector
| database), keyword search, or other ranking methods to
| decide what context to inject into the prompt.
|
| PrivateGPT is doing this method, reading files, extracting
| their content, splitting the documents into small-enough-
| to-fit-into-prompt bits, and then indexing into a database.
| Then, at query time, it inserts context into the LLM prompt
|
| The repo uses LangChain as boilerplate but it's pretty
| easy to do manually or with other frameworks.
|
| (PS if anyone wants this type of local LLM + document Q/A
| and agents, it's something I'm working on as supported
| product integrated into macOS, and using ggml; see profile)
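|
| A toy sketch of that flow (embed() here is a crude stand-in for
| a real embedding model, and all names are made up; LangChain /
| privateGPT do a much more careful version of the same thing):
|
|     #include <algorithm>
|     #include <cmath>
|     #include <string>
|     #include <utility>
|     #include <vector>
|
|     // Stand-in for a real embedding model: a byte-histogram "vector".
|     std::vector<float> embed(const std::string &text) {
|         std::vector<float> v(256, 0.0f);
|         for (unsigned char c : text) v[c] += 1.0f;
|         return v;
|     }
|
|     float cosine(const std::vector<float> &a,
|                  const std::vector<float> &b) {
|         float dot = 0, na = 0, nb = 0;
|         for (size_t i = 0; i < a.size(); i++) {
|             dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i];
|         }
|         return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
|     }
|
|     // Rank pre-chunked documents against the question and stuff the
|     // best k chunks into the prompt, instead of retraining on them.
|     std::string build_prompt(const std::string &question,
|                              const std::vector<std::string> &chunks,
|                              int k) {
|         const auto q = embed(question);
|         std::vector<std::pair<float, size_t>> scored;
|         for (size_t i = 0; i < chunks.size(); i++)
|             scored.push_back({cosine(q, embed(chunks[i])), i});
|         std::sort(scored.rbegin(), scored.rend());  // best first
|
|         std::string prompt = "Answer using only the context below.\n\n";
|         for (int i = 0; i < k && i < (int)scored.size(); i++)
|             prompt += "Context: " + chunks[scored[i].second] + "\n";
|         return prompt + "\nQuestion: " + question + "\nAnswer:";
|     }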
| brucethemoose2 wrote:
| If MeZO gets implemented, we are basically there:
| https://github.com/princeton-nlp/MeZO
| moffkalast wrote:
| Basically there, with what kind of VRAM and processing
| requirements? I doubt anyone running on a CPU can fine tune
| in a time frame that doesn't give them an obsolete model when
| they're done.
| nl wrote:
| According to the paper it fine tunes at the speed of
| inference (!!)
|
| This would make fine-tuning a quantized 13B model achievable
| in ~0.3 seconds per training example on a CPU.
| f_devd wrote:
| MeZO assumes a smooth parameter space, so you probably
| won't be able to do it with INT4/8 quantization, probably
| needs fp8 or smoother.
| gliptic wrote:
| I cannot find any such numbers in the paper. What the
| paper says is that MeZO converges much slower than SGD,
| and each step needs two forward passes.
|
| "As a limitation, MeZO takes many steps in order to
| achieve strong performance."
| moffkalast wrote:
| Wow if that's true then it's genuinely a complete
| gamechanger for LLMs as a whole. You probably mean more
| like 0.3s per token, not per example, but that's still
| more like 1 or two minutes per training case, not like a
| day for 4 cases like it is now.
| sp332 wrote:
| It's the same _memory footprint_ as inference. It's not
| that fast, and the paper mentions some optimizations that
| could still be done.
| isoprophlex wrote:
| If you go through the drudgery of integrating with all
| the existing channels (mail, Teams, discord, slack,
| traditional social media, texts, ...), such rapid
| finetuning speeds could enable an always up to date
| personality construct, modeled on you.
|
| Which is my personal holy grail towards making myself
| unnecessary; it'd be amazing to be doing some light
| gardening while the bot handles my coworkers ;)
| [deleted]
| valval wrote:
| I think more importantly, what would the fine tuning
| routine look like? It's a non-trivial task to dump all of
| your personal data into any LLM architecture.
| rvz wrote:
| > ggml and llama.cpp are such a good platform for local LLMs,
| having some financial backing to support development is
| brilliant
|
| The problem is, this financial backing and support is via VCs,
| who will steer the project to close it all up again.
|
| > I want a local ChatGPT fine tuned on my personal data running
| on my own device, not in the cloud. Ideally open source too,
| llama.cpp is looking like the best bet to achieve that!
|
| I think you are setting yourself up for disappointment in the
| future.
| ulchar wrote:
| > The problem is, this financial backing and support is via
| VCs, who will steer the project to close it all up again.
|
| How exactly could they meaningfully do that? Genuine
| question. The issue with the OpenAI business model is that
| the collaboration within academia and open source circles is
| creating innovations that are on track to out-pace the closed
| source approach. Does OpenAI have the pockets to buy the open
| source collaborators and researchers?
|
| I'm truly cynical about many aspects of the tech industry but
| this is one of those fights that open source could win for
| the betterment of everybody.
| maxilevi wrote:
| I agree with the spirit but saying that open source is on
| track to outpace OpenAI in innovation is just not true.
| Open source models are being compared to GPT-3.5; none yet
| even get close to GPT-4 quality, and they finished that last
| year.
| jart wrote:
| We're basically surviving off the scraps companies like
| Facebook have been tossing off the table, like LLaMA. The
| fact that we're even allowed and able to use these things
| ourselves, at all, is a tremendous victory.
| maxilevi wrote:
| I agree
| yyyk wrote:
| I've been going on and on about this in HN: Open source can
| win this fight, but I think OSS is overconfident. We need
| to be clear there are serious challenges ahead - ClosedAI
| and other corporations also have a plan, a plan that has
| good chances unless properly countered:
|
| A) Embed OpenAI (etc.) API everywhere. Make embedding easy
| and trivial. First to gain a small API/install moat
| (user/dev: 'why install OSS model when OpenAI is already
| available with an OS API?'). If it's easy to use OpenAI but
| not open source they have an advantage. Second to gain
| brand. But more importantly:
|
| B) Gain a technical moat by having a permanent data
| advantage using the existing install base (see above).
| Retune constantly to keep it.
|
| C) Combine with existing proprietary data stores to increase
| local data advantage (e.g. easy access for all your Office
| 365/GSuite documents, while OSS gets the scary permission
| prompts).
|
| D) Combine with existing proprietary moats to mutually
| reinforce.
|
| E) Use selective copyright enforcement to increase data
| advantage.
|
| F) Lobby legislators for limits that make competition (open
| or closed source) way harder.
|
| TL;DR: OSS is probably catching up on algorithms. When it
| comes to good data and good integrations OSS is far behind
| and not yet catching up. It's been argued that OpenAI's
| entire performance advantage is due to having better data
| alone, and they intend to keep that advantage.
| ljlolel wrote:
| Don't forget chip shortages. That's all centralized up
| through Nvidia, TSMC, and ASML
| ignoramous wrote:
| > _The problem is, this financial backing and support is via
| VCs, who will steer the project to close it all up again._
|
| A matter of _when_, not _if_. I mean, the website itself
| makes that much clear:
|
|       The ggml way
|
|       ... Open Core
|       The library and related projects are freely available
|       under the MIT license... In the future we may choose
|       to develop extensions that are licensed for commercial
|       use
|
|       Explore and have fun! ...
|       Contributors are encouraged to try crazy ideas, build
|       wild demos, and push the edge of what's possible
|
| So, like many other "open core" devtools out there, they'd
| like to have their cake and eat it too. And they might just
| as well, like others before them.
|
| Won't blame anyone here though; because clearly, if you're as
| good as Georgi Gerganov, why do it for free?
| jdonaldson wrote:
| > I think you are setting yourself up for disappointment in
| the future.
|
| Why would you say that?
| behnamoh wrote:
| I wonder if ClosedAI and other companies use the findings of
| the open source community in their products. For example, do
| they use QLORA to reduce the costs of training and inference?
| Do they quantize their models to serve non-subscribing
| consumers?
| jmoss20 wrote:
| Quantization is hardly a "finding of the open source
| community". (IIRC the first TPU was int8! Though the
| tradition is much older than that.)
| danielbln wrote:
| Not disagreeing with your points, but saying "ClosedAI" is
| about as clever as writing M$ for Microsoft back in the day,
| which is to say not very.
| rafark wrote:
| I think it's ironic that M$ made ClosedAI.
| replygirl wrote:
| Pedantic but that's not irony
| rafark wrote:
| Why do you think so? According to the dictionary, ironic
| could be something paradoxical or weird.
| Miraste wrote:
| M$ is a silly way to call Microsoft greedy. ClosedAI is
| somewhat better because OpenAI's very name is a bald-faced
| lie, and they should be called on it. Are there more
| elegant ways to do that? Sure, but every time I see Altman
| in the news crying crocodile tears about the "dangers" of
| open anything I think we need all the forms of opposition
| we can find.
| tanseydavid wrote:
| It is a colloquial spelling and they earned it, a long
| time ago.
| loa_in_ wrote:
| I'd say saying M$ makes it harder for M$ to find out I'm
| talking about them in the indexed web because it's more
| ambiguous, and that's all I need to know.
| coolspot wrote:
| If we are talking about indexing, writing M$ is easier to
| find in an index because it is a such unique token. MS
| can mean many things (e.g. Miss), M$ is less ambiguous.
| smoldesu wrote:
| Yeah, I think it feigns meaningful criticism. The "Sleepy
| Joe"-tier insults are ad-hominem enough that I don't try to
| respond.
| ignoramous wrote:
| Can LLaMA be used for commercial purposes though (might limit
| external contributors)? I believe FOSS alternatives like
| Databricks _Dolly_ / Together _RedPajama_ / EleutherAI
| _GPT-NeoX_ (et al.) are where the most progress is likely to be.
| samwillis wrote:
| Although llama.cpp started with the LLaMA model, it now
| supports many others.
| okhuman wrote:
| This is a very good question; it will be interesting to see
| how this develops. Thanks for posting the alternatives list.
| detrites wrote:
| May also be worth mentioning - UAE's Falcon, which apparently
| performs well (leads?). Falcon recently had its royalty-based
| commercial license modified to be fully open for free private
| and commercial use, via Apache 2.0: https://falconllm.tii.ae/
| chaxor wrote:
| Why is commercial necessary to run local models?
| ignoramous wrote:
| It isn't, but such models may eventually lag behind the
| FOSS ones.
| digitallyfree wrote:
| OpenLLaMA will be released soon and it's 100% compatible with
| the original LLaMA.
|
| https://github.com/openlm-research/open_llama
| sva_ wrote:
| Really impressive work and I've asked this before, but is it
| really a good thing to have basically the whole library in a
| single 16k line file?
| CamperBob2 wrote:
| Yes. Next question
| regularfry wrote:
| It makes syncing between llama.cpp, whisper.cpp, and ggml
| itself quite straightforward.
|
| I think the lesson here is that this setup has enabled some
| very high-speed project evolution or, at least, not got in its
| way. If that is surprising and you were expecting downsides, a)
| why; and b) where did they go?
| graycat wrote:
| WOW! They are using BFGS! Haven't heard of that in decades! Had
| to think a little: Yup, the full name is Broyden-Fletcher-
| Goldfarb-Shanno for iterative unconstrained non-linear
| optimization!
|
| Some of the earlier descriptions of the optimization being used
| in AI _learning_ were about steepest descent, that is, just find
| the gradient of the function you are trying to minimize and move
| some distance in that direction. Just using the gradient was
| concerning since that method tends to _zig zag_: after, say, 100
| iterations, the distance moved over those 100 iterations might
| be several times farther than the distance from the starting
| point of the iterations to the final one. You can visualize this
| _zig zag_ in just two dimensions, say, following a river that
| curves down a valley it cut over a million years or so, that is,
| a valley with steep sides. Then gradient descent may keep
| crossing the river and go maybe 10 feet for each foot
| downstream!
|
| Right, if just trying to go downhill on a tilted flat plane, then
| the gradient will point in the steepest descent on the plane and
| gradient descent will go all way downhill in just one iteration.
|
| In even moderately challenging problems, BFGS can be a big
| improvement.
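|
| For anyone who wants to see the _zig zag_ concretely, here is a
| tiny made-up example: plain steepest descent with a fixed step
| on a long, steep-sided quadratic valley. A quasi-Newton method
| like BFGS uses curvature estimates to rescale the step and
| largely avoids this behavior.
|
|     #include <cstdio>
|
|     // f(x, y) = 0.5*(x*x + 100*y*y); gradient is (x, 100*y).
|     // The valley floor is the x-axis; its walls are steep in y.
|     int main() {
|         double x = 10.0, y = 1.0;
|         const double step = 0.019;  // near the stability limit in y
|         for (int i = 0; i < 15; i++) {
|             std::printf("iter %2d: x = %8.4f  y = %+8.4f\n", i, x, y);
|             x -= step * x;          // crawls slowly along the valley
|             y -= step * 100.0 * y;  // overshoots back and forth
|         }
|         return 0;
|     }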
| doxeddaily wrote:
| This scratches my itch for no dependencies.
| s1k3s wrote:
| I'm out of the loop on this entire thing so call me an idiot if I
| get it wrong. Isn't this whole movement based on a model leak
| from Meta? Aren't licenses involved that prevent it from going
| commercial?
| detrites wrote:
| GGML is essentially a library of Lego pieces that can be put
| together to work with many LLMs or other types of ML models.
|
| Meta's leaked model is one to which GGML has been applied
| for fast, local inference.
| dimfeld wrote:
| Only the weights themselves. There have been other models since
| then built on the same Llama architecture, but trained from
| scratch so they're safe for commercial use. The GGML code and
| related projects (llama.cpp and so on) also support some other
| model types now such as Mosaic's MPT series.
| okhuman wrote:
| The establishment of ggml.ai, a company focused on ggml and
| llama.cpp, the most innovative and exciting platform around for
| local LLMs, on an Open Core model is just laziness.
|
| Just because you can (and have the connections) doesn't mean you
| should. It's a sad state of OSS when the best and brightest
| developers/founders reach for antiquated models.
|
| Maybe we should adopt a new rule in OSS communities that says
| you must release your CORE software as MIT at the same time you
| plan to go Open Core (and no sooner).
|
| Why should OSS communities take on your product market fit?!
| wmf wrote:
| This looks off-topic since GGML has not announced anything
| about open core and their software is already MIT.
|
| More generally, if you want to take away somebody's business
| model you need to provide one that works. It isn't easy.
| okhuman wrote:
| Agreed with you 100% - its not easy. Sometimes I just wish
| someone as talented as Georgi would innovate not just on the
| core tech side but bring that same tenacity to the licensing
| side, in a way that aligns incentives better and tries out
| something new. And that the community would have his back if
| some new approach failed, no matter what.
| aryamaan wrote:
| Could someone talk at a high level about how one starts
| contributing to these kinds of problems?
|
| For the people who build solutions for data handling - ranging
| from CRUD to highly scalable systems - these things are alien
| concepts. (Or maybe I am just speaking for myself.)
| danieljanes wrote:
| Does GGML support training on the edge? We're especially
| interested in training support for Android+iOS
| [deleted]
| svantana wrote:
| Yes - look at the file tests/test-opt.c. Unfortunately there's
| almost no documentation about its training/autodiff.
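|
| For the curious, the general shape looks roughly like the
| examples shipped with ggml around this time (function names and
| signatures here are from memory and may have drifted, so treat
| this as a sketch rather than gospel):
|
|     #include "ggml.h"
|
|     int main() {
|         struct ggml_init_params params = { 16*1024*1024, nullptr, false };
|         struct ggml_context * ctx = ggml_init(params);
|
|         struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
|         ggml_set_param(ctx, x);                        // x is trainable
|         struct ggml_tensor * f = ggml_mul(ctx, x, x);  // f(x) = x^2
|
|         struct ggml_cgraph gf = ggml_build_forward(f);
|         struct ggml_cgraph gb = ggml_build_backward(ctx, &gf, false);
|
|         ggml_set_f32(x, 2.0f);
|         ggml_graph_compute(ctx, &gf);  // forward: f = 4
|         ggml_set_f32(f->grad, 1.0f);   // seed df/df = 1
|         ggml_graph_compute(ctx, &gb);  // backward: x->grad = 2x = 4
|
|         ggml_free(ctx);
|         return 0;
|     }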
| KronisLV wrote:
| Just today, I finished a blog post (also my latest submission;
| felt like it could be useful to some) about how to get something
| like this working as a bundle of a model runner plus a web UI
| for easier interaction - in my case that was koboldcpp, which
| can run GGML models both on the CPU (with OpenBLAS) and on the
| GPU (with CLBlast). Thanks to Hugging Face, getting Metharme,
| WizardLM or other models is also extremely easy, and the 4-bit
| quantized ones provide decent performance even on commodity
| hardware!
|
| I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41
| instance (8 AMD cores, 16 GB of RAM, no GPU), the latter of which
| costs about 25 EUR per month and still can generate decent
| responses in less than half a minute, my local machine needing
| approx. double that time. While not quite as good as one might
| expect (decent response times mean maxing out CPU for the single
| request, if you don't have a compatible GPU with enough VRAM),
| the technology is definitely at a point where it's possible for
| it to make people's lives easier in select use cases with some
| supervision (e.g. customer support).
|
| What an interesting time to be alive, I wonder where we'll be in
| a decade.
| b33j0r wrote:
| I wish everyone in tech had your perspective. That is what I
| see, as well.
|
| There is a lull right now between GPT-4 and GPT-5 (literally and
| metaphorically). Consumer models are plateauing around 40B for
| a barely-reasonable RTX 3090 (ggml made this possible).
|
| Now is the time to launch your ideas, all!
| digitallyfree wrote:
| The fact that this is _commodity hardware_ makes ggml extremely
| impressive and puts the tech in the hands of everyone. I
| recently reported my experience running 7B llama.cpp on a 15
| year old Core 2 Quad [1] - when that machine came out it was a
| completely different world and I certainly never imagined what
| AI would look like today. This was around when the first iPhone
| was released and everyone began talking about how smartphones
| would become the next big thing. We saw what happened 15 years
| later...
|
| Today with the new k-quants users are reporting that 30B models
| are working with 2-bit quantization on 16GB CPUs and GPUs [2].
| That's enabling access to millions of consumers and the
| optimizations will only improve from there.
|
| [1]
| https://old.reddit.com/r/LocalLLaMA/comments/13q6hu8/7b_perf...
|
| [2] https://github.com/ggerganov/llama.cpp/pull/1684,
| https://old.reddit.com/r/LocalLLaMA/comments/141bdll/moneros...
| c_o_n_v_e_x wrote:
| What do you mean by commodity hardware? Single server single
| CPU socket x86/ARM boxes? Anything that does not have a GPU?
| [deleted]
| SparkyMcUnicorn wrote:
| Seems like serverless is the way to go for fast output while
| remaining inexpensive.
|
| e.g.
|
| https://replicate.com/stability-ai/stablelm-tuned-alpha-7b
|
| https://github.com/runpod/serverless-workers/tree/main/worke...
|
| https://modal.com/docs/guide/ex/falcon_gptq
| tikkun wrote:
| I think that's true if you're doing minimal usage / low
| utilization, otherwise a dedicated instance will be cheaper.
| mliker wrote:
| congrats! I was just listening to your changelog interview from
| months ago in which you said you were going to move on from this
| after you brush up the code a bit, but it seems the momentum is
| too great. Glad to see you carrying this amazing project(s)
| forward!
| FailMore wrote:
| Remember
| kretaceous wrote:
| Georgi's Twitter announcement:
| https://twitter.com/ggerganov/status/1666120568993730561
| jgrahamc wrote:
| Cool. I've just started sponsoring him on GitHub.
| FailMore wrote:
| Commenting to remember. Looks good
___________________________________________________________________
(page generated 2023-06-06 23:00 UTC)