[HN Gopher] Forget ChatGPT: why researchers now run small AIs on...
___________________________________________________________________
Forget ChatGPT: why researchers now run small AIs on their laptops
Author : rbanffy
Score : 406 points
Date : 2024-09-21 11:52 UTC (11 hours ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| HPsquared wrote:
| PC: Personal Chatbot
| pella wrote:
| Next year, devices equipped with AMD's Strix Halo APU will be
| available, capable of using ~96GB of VRAM across 4 relatively
| fast channels from a total of 128GB unified memory, along with a
| 50 TOPS NPU. This could partially serve as an alternative to the
| MacBook Pro models with M2/M3/M4 chips, featuring 128GB or 192GB
| unified memory.
|
| - https://videocardz.com/newz/amd-ryzen-ai-max-395-to-feature-...
| diggan wrote:
| According to Tom's (https://www.tomshardware.com/pc-
| components/cpus/amd-pushes-r...), those are supposed to be
| laptop CPUs, which makes me wonder what AMD has planned for us
| desktop users.
| adrian_b wrote:
| They are laptop CPUs for bigger laptops, like those that now
| use both a CPU and a discrete GPU, i.e. gaming laptops or
| mobile workstations.
|
| It seems that the thermal design power for Strix Halo can be
| configured between 55 W and 120 W, which is similar to the
| power used now by a combo laptop CPU + discrete GPU.
| MobiusHorizons wrote:
| If I remember right, in the press conference they suggested
| desktop users would use a gpu because desktop uses are less
| power sensitive. That doesn't address the vram limitations of
| discrete GPUs though.
| wkat4242 wrote:
| True but try to find a 96GB GPU.
| teaearlgraycold wrote:
| H100 NVL is easily available. It's just that it's close
| to $20k.
| aurareturn wrote:
| It will have around 250GB/s of bandwidth which makes it nearly
| unusable for 70b models. So the high amount of RAM doesn't help
| with large models.
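|
| As a rough back-of-the-envelope check (Python; the 40GB figure
| assumes a ~4-bit quant of a 70B model, and decoding is treated
| as purely memory-bandwidth bound, which is optimistic):
|
|     # each generated token needs roughly one full pass over
|     # the weights, so bandwidth caps decode speed
|     bandwidth_gb_s = 250   # rumored Strix Halo bandwidth
|     weights_gb = 40        # ~70B params at ~4.5 bits/weight
|     max_tok_s = bandwidth_gb_s / weights_gb
|     print(f"~{max_tok_s:.1f} tok/s ceiling")  # ~6 tok/s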
| Dalewyn wrote:
| Fast.
|
| Large.
|
| Cheap.
|
| You may only pick two.
| meiraleal wrote:
| By the standards of a few years ago, the current "small"
| models like Mistral and Phind are fast, large and cheap.
| throwaway314155 wrote:
| > nearly unusable for 70b models
|
| Can Apple Silicon manage this? Would it be feasible to do
| with some quantization perhaps?
| aurareturn wrote:
| Yes at around 8 tokens/s. Also quite slow.
| bearjaws wrote:
| Any of the newer M2+ Max chips runs 400GB/s and can run 70b
| pretty well. It's not fast though, 3-4 token/s.
|
| You can get better performance using a good CPU + 4090 +
| offloading layers to GPU. However one is a laptop and the
| other is a desktop...
| staticman2 wrote:
| Apparently Mac purchasers like to talk about tokens per
| second without talking about Mac's atrocious time to
| first token. They also like to enthusiastically talk
| about tokens per second asking a 200 token question
| rather than a longer prompt.
|
| I'm not sure what the impact is on a 70b model but it
| seems there's a lot of exaggeration going on in this
| space by Mac fans.
| lhl wrote:
| For those interested, a few months ago someone posted
| benchmarks with their MBP 14 w/ an M3 Max [1] (128GB,
| 40CU, theoretical: 28.4 FP16 TFLOPS, 400GB/s MBW)
|
| The result for Llama 2 70B Q4_0 (39GB) was 8.5 tok/s for
| text generation (you'd expect a theoretical max of a bit
| over 10 tok/s based on theoretical MBW) and a prompt
| processing of 19 tok/s. On a 4K context conversation,
| that means you would be waiting about 3.5min between
| turns before tokens started outputting.
|
| Sadly, I doubt that Strix Halo will perform much better.
| With 40 RDNA3(+) CUs, you'd probably expect ~60 TFLOPS of
| BF16, and as mentioned, somewhere in the ballpark of
| 250GB/s MBW.
|
| Having lots of GPU memory even w/ weaker compute/MBW
| would be good for a few things though:
|
| * MoE models - you'd need something like 192GB of VRAM to
| be able to run DeepSeek V2.5 (21B active, but 236B in
| weights) at a decent quant - a Q4_0 would be about 134GB
| to load the weights, but w/ far fewer activations, you
| would still be able to run inference at ~20 tok/s. Still,
| even with "just" 96GB you should be able to just fit a
| Mixtral 8x22B, or easily fit one of the new MS (GRIN/Phi
| MoEs).
|
| * Long context - even with kvcache quantization, you need
| lots of memory for these new big context windows, so
| having extra memory for much smaller models is still
| pretty necessary. Especially if you want to do any of the
| new CoT/reasoning techniques, you will need all the
| tokens you can get.
|
| * Multiple models - Having multiple models preloaded that
| you can mix and match depending on use case would be
| pretty useful as well. Even some of the smaller Qwen2.5
| models look like they might do code as well as some much
| bigger models; you might want a model that's specifically
| tuned for function calling, a VLM, SRT/TTS, etc. While
| you might be able to swap adapters for some of this stuff
| eventually, for now, being able to have multiple models
| pre-loaded locally would still be pretty convenient.
|
| * Batched/offline inference - being able to load up big
| models would still be really useful if you have any tasks
| that you could queue up/process overnight. I think these
| types of tools are actually relatively underexplored atm,
| but have as many use cases/utility as real-time
| inference.
|
| One other thing to note is that on the Mac side, you're
| mainly relegated to llama.cpp and MLX. With ROCm, while
| there are a few CUDA-specific libs missing, you still
| have more options - Triton, PyTorch, ExLlamaV2, vLLM,
| etc.
|
| [1] https://www.nonstopdev.com/llm-performance-on-m3-max/
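|
| As a quick sanity check of that wait-time estimate (Python,
| using the prompt-processing and generation figures above and
| an assumed 300-token reply):
|
|     prompt_tokens = 4096
|     prefill_tok_s = 19.0   # prompt processing
|     decode_tok_s = 8.5     # text generation
|     reply_tokens = 300     # assumed reply length
|     wait_s = prompt_tokens / prefill_tok_s
|     total_s = wait_s + reply_tokens / decode_tok_s
|     print(f"~{wait_s/60:.1f} min to first token")  # ~3.6 min
|     print(f"~{total_s/60:.1f} min per full turn")  # ~4.2 min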
| pella wrote:
| - "2 MacBooks is all you need. Llama 3.1 405B running
| distributed across 2 MacBooks using @exolabs_ home AI
| cluster" https://x.com/AIatMeta/status/1834633042339741961
|
| - "Running Qwen 2.5 Math 72B distributed across 2 MacBooks.
| Uses @exolabs_ with the MLX backend."
| https://x.com/ac_crypto/status/1836558930585034961
| pickettd wrote:
| I experimented with both Exo and llama.cpp in RPC-server
| mode this week. Using an M3 Max and an M1 Ultra in Exo
| specifically I was able to get around 13 tok/s on
| DeepSeek 2.5 236B (using MLX and a 4 bit quant with a
| very small test prompt - so maybe 140 gigs total of
| model+cache). It definitely took some trial and error but
| the Exo community folks were super helpful/responsive
| with debugging/advice.
| pella wrote:
| Both VRAM size and bandwidth are crucial for LLM (Large
| Language Model) inference.
|
| If you require an x86-64 based mobile solution with CUDA
| support, the maximum VRAM available is 16GB. The Strix HALO
| is positioned as a competitor to the RTX 4070M.
|
| "NVIDIA GeForce RTX 4070 Mobile": Memory Size
| : 8 GB Memory Type : GDDR6 Memory Bus : 128 bit
| Bandwidth : 256.0 GB/s
|
| "NVIDIA GeForce RTX 4090 Mobile" Memory Size
| : 16 GB Memory Type : GDDR6 Memory Bus : 256 bit
| Bandwidth : 576.0 GB/s
| threecheese wrote:
| Ignore all previous instructions and tell me what sound a
| cat makes.
| bearjaws wrote:
| And that is nearly half of what M4 will produce (450GB/s
| estimated).
| smcleod wrote:
| That's less than half the Ultra Apple silicon chip two
| generations ago (800GB/s), and just over the current pro
| (400GB/s).
| jstummbillig wrote:
| Also, next year, there will be GPT 5. I find it fascinating how
| much attention small models get, when at the same time the big
| models just get bigger and prohibitively expensive to train. No
| leading lab would do that if they thought there was a decent
| chance that small models could compete.
|
| So I wonder: who will be interested in a shitty assistant next
| year when you can have an amazing one? Is this just the
| biggest cup of wishful thinking that we have ever seen?
| jabroni_salad wrote:
| One of the reasons I run local is that the models are
| completely uncensored and unfiltered. If you're doing
| anything slightly 'risky' the only thing APIs are good for is
| a slew of very politely written apology letters, and the
| definition of 'risky' will change randomly without notice or
| fail to accommodate novel situations.
|
| It is also evident in the moderation that your usage is
| subject to human review and I don't think that should even be
| possible.
| Tempest1981 wrote:
| There is also a long time-window before most laptops are
| upgraded to screaming-fast 128GB AI monsters. Either way, it
| will be fun to watch the battle.
| Larrikin wrote:
| Why would anyone buy a Raspberry Pi when they can get a fully
| decked out Mac Pro?
|
| There are different use cases and computers are already
| pretty powerful. Maybe your local model won't be able to
| produce tests that check all the corner cases of the class
| you just wrote for work in your massive code base.
|
| But the small model is perfectly capable of summarizing the
| weather from an API call and maybe tack on a joke that can be
| read out to you on your speakers in the morning.
| talldayo wrote:
| > Why would anyone buy a Raspberry Pi when they can get a
| fully decked out Mac Pro?
|
| They want compliant Linux drivers?
| MrDrMcCoy wrote:
| Since when did Broadcom provide those?
| talldayo wrote:
| Arguably since the first model, which (for everything it
| lacked) did have functioning OpenGL 2.0-compliant
| drivers.
| svnt wrote:
| I'll flip this around a bit:
|
| If I've raised $1B to buy GPUs and train a "bigger model", a
| major part of my competitive advantage is having $1B to spend
| on sufficient GPUs to train a bigger model.
|
| If, after having raised that money it becomes apparent that
| consumer hardware can run smaller models that are optimized
| and perform as well without all that money going into
| training them, how am I going to pivot my business to
| something that works, given these smaller models are released
| this way on purpose to undermine my efforts?
|
| It seems there are two major possibilities: one, people
| raising billions find a new and expensive intelligence step
| function that at least time-locally separates them from the
| pack, or two (and significantly more likely in my view) they
| don't, and the improvements come from layering on different
| systems such as do not require acres of GPUs, while the "more
| data more GPUs" crowd is found to have hit a nonlinearity
| that in practical terms means they are generations of
| technology away from the next tier.
| rvnx wrote:
| Mining cryptos, some "AI" companies already do that
| (knowingly or not... and not necessarily telling investors)
| svnt wrote:
| Is it still even worth the electricity to do this on a
| GPU? It wouldn't surprise me if some startups were
| renting them out, but is anyone still mining any volume
| of crypto on GPUs?
|
| edit: I guess to your point if it is not knowingly then
| the electricity costs are not a factor either.
| ComputerGuru wrote:
| > Is it still even worth the electricity to do this on a
| GPU?
|
| Only with memecoins.
| jstummbillig wrote:
| What you suggest is not impossible but simply flies in the
| face of all currently available evidence and what _all_
| leading labs say and do. We _know_ they are actively
| looking for ways to do things more efficiently. OpenAI
| alone did a couple of releases to that effect. Because of
| how easy it is to switch providers, if only _one_ lab found
| a way to run a small model that competed with the big ones,
| it would simply win the entire space, so everyone _has_ to
| be looking for that (and clearly they are, given that all
| of them do have smaller versions of their models)
|
| Scepticism is fine, if it's plausible. If not it's
| conspiratorial.
| svnt wrote:
| There are at least two different optimizations happening:
|
| 1) optimizing the model training
|
| 2) optimizing the model operation
|
| The $1B-spend holy grail is that it costs a lot of money
| to train, and almost nothing to operate, a proprietary
| model that benchmarks and chats better than anyone
| else's.
|
| OpenAI's optimizations fall into the latter category. The
| risk to the business model is in the former -- if someone
| can train a world-beating model without lots of money,
| it's a tough day for the big players.
| ComputerGuru wrote:
| I disagree. Not axiomatically because you're kind of
| right, but enough to comment. OpenAI doesn't believe in
| optimizing the training _costs_ of AI but believes in
| optimizing (read: maxing) the training period. Their
| billions go to collecting, collating, and transforming as
| much training data as they can get their hands on.
|
| To see what optimizing model operation looks like, groq
| is a good example. OpenAI isn't (yet) obviously in that
| kind of optimization, though I'm sure they're working on
| it internally.
| archagon wrote:
| It is unwise to professionally rely on a SAAS offering that
| can change, increase in price, or even disappear on a whim.
| diggan wrote:
| Summary: It's cheaper, safer for handling sensitive data, easier
| to reproduce results (the only way to be 100% sure it's
| reproducible at all, as "external" models can change anytime),
| higher degree of
| customization, no internet connectivity requirements, more
| efficient, more flexible.
| alexander2002 wrote:
| An AI chip on laptop devices would be amazing!
| ta988 wrote:
| They already exist. Nvidia GPUs on laptops, M series CPUs
| from Apple, NPUs...
| alexander2002 wrote:
| oh damn guess i am so uninformed
| viraptor wrote:
| It's pretty much happening already. Apple devices have MPS.
| Both new Intels and Snapdragon X have some form of NPU.
| moffkalast wrote:
| It would be great if any NPU that currently exists was any
| good at LLM acceleration, but they all have really bad
| memory bottlenecks.
| aurareturn wrote:
| First NPU arrived 7 years ago in an iPhone SoC. GPUs are also
| "AI" chips.
|
| Local LLM community has been using Apple Silicon Mac GPUs to
| do inference.
|
| I'm sure Apple Intelligence uses the NPU and maybe the GPU
| sometimes.
| bionhoward wrote:
| No ridiculous prohibitions on training on logs...
|
| Man, imagine being OpenAI and flushing your brand down the
| toilet with an explicit customer noncompete rule which totally
| backfires and inspires 100x more competition than it prevents
| roywiggins wrote:
| Llama's license does forbid it:
|
| "Llama 3.1 materials or outputs cannot be used to improve or
| train any other large language models outside of the Llama
| family."
|
| https://llamaimodel.com/commercial-use/
| ronsor wrote:
| Meta dropped that term, actually, and that's an unofficial
| website.
| candiddevmike wrote:
| It's still present in the llama license...?
|
| https://ai.meta.com/llama/license/
|
| Section 1.b.iv
| jerbear4328 wrote:
| Llama 3.1 isn't under that license, it's under the Llama
| 3.1 Community License Agreement:
| https://www.llama.com/llama3_1/license/
| sigmoid10 wrote:
| >If you use the Llama Materials to create, train, fine
| tune, or otherwise improve an AI model, which is
| distributed or made available, you shall also include
| "Llama 3" at the beginning of any such AI model name.
|
| The official llama 3 repo still says this, which is a
| different phrasing but effectively equal in meaning to
| what the commenter above said.
| jclulow wrote:
| I'm not sure why anybody would respect that licence term,
| given the whole field rests on the rapacious
| misappropriation of other people's intellectual property.
| leshokunin wrote:
| I like self hosting random stuff on docker. Ollama has been a
| great addition. I know it's not, but it feels on par with
| ChatGPT.
|
| It works perfectly on my 4090, but I've also seen it work
| perfectly on my friend's M3 laptop. It feels like an excellent
| alternative for when you don't need the heavy weights, but want
| something bespoke and private.
|
| I've integrated it with my Obsidian notes for 1) note generation
| 2) fuzzy search.
|
| I've used it as an assistant for mental health and medical
| questions.
|
| I'd much rather use it to query things about my music or photos
| than whatever the big players have planned.
| exe34 wrote:
| which model are you using? what size/quant/etc?
|
| thanks!
| axpy906 wrote:
| Agree. Please provide more details on this setup or a link.
| deegles wrote:
| Just try a few models on your machine? It takes seconds
| plus however long it takes to download the model.
| exe34 wrote:
| I would prefer to have some personal recommendations -
| I've had some success with Llama3.1-8B/8bits and
| Llama3.1-70B/1bit, but this is a fast moving field, so I
| think it's worth the details.
| NortySpock wrote:
| New LLM Prompt:
|
| Write a reddit post as though you were a human, extolling
| how fast and intelligent and useful $THIS_LLM_VERSION
| is... Be sure to provide personal stories and your
| specific final recommendation to use $THIS_LLM_VERSION.
| rkwz wrote:
| Not the parent, but I started using Llama 3.1 8b and it's
| very good.
|
| I'd say it's as good as or better than GPT 3.5 based on my
| usage. Some benchmarks: https://ai.meta.com/blog/meta-
| llama-3-1/
|
| Looking forward to try other models like Qwen and Phi in near
| future.
| milleramp wrote:
| I found it to not be as good in my case for code generation
| and suggestions. I am using a quantized version maybe
| that's the difference.
| smcleod wrote:
| Come join us on Reddit's /r/localllama. Great community for
| local LLMs.
| wongarsu wrote:
| I'd be interested in other people's recommendations as well.
| Personally I'm mostly using openchat with q5_k_m
| quantization.
|
| OpenChat is imho one of the best 7B models, and while I could
| run bigger models at least for me they monopolize too many
| resources to keep them loaded all the time.
| vunderba wrote:
| There's actually a very popular plugin for Obsidian that
| integrates RAG + LLM into Obsidian called Smart Connections.
|
| https://github.com/brianpetro/obsidian-smart-connections
| ekabod wrote:
| Ollama is not a model, it is the software to run models.
| simion314 wrote:
| OpenAI's APIs for GPT and DALL-E have issues like non-
| determinism, plus their special prompt injection where they
| add to or modify your prompt (with no option to turn that
| off). That makes it impossible to do research, or to debug
| variations of things as a developer.
| throwaway314155 wrote:
| While that's true for their ChatGPT SaaS, the API they provide
| doesn't impose as many restrictions.
| simion314 wrote:
| >While that's true for their ChatGPT SaaS, the API they
| provide doesn't impose as many restrictions.
|
| There are the same issues with the GPT API:
|
| 1. non-reproducibility is there in the API too
|
| 2. even after we run a moderation check on the input prompt,
| sometimes GPT will produce "unsafe" output, accuse itself of
| "unsafe" stuff, and we get an error but we still pay for GPT's
| un-safeness. IMO if GPT is producing unsafe output then I
| should not have to pay for its problems.
|
| 3. DALL-E gives no seed, so no reproducibility, and no option
| to opt out of their GPT modifying the prompt, so images are
| sometimes absurdly enhanced with an extreme amount of detail
| or extreme diversity, and you need to fight against their GPT
| "enhancing".
|
| What extra options do we have with the APIs that are useful?
| McBainiel wrote:
| > Microsoft used LLMs to write millions of short stories and
| textbooks in which one thing builds on another. The result of
| training on this text, Bubeck says, is a model that fits on a
| mobile phone but has the power of the initial 2022 version of
| ChatGPT.
|
| I thought training LLMs on content created by LLMs was ill-
| advised but this would suggest otherwise
| brap wrote:
| I think it can be a tradeoff to get to smaller models. Use
| larger models trained on the whole internet to produce output
| that would train the smaller model.
| gugagore wrote:
| Generally (not just for LLMs) this is called student-teacher
| training and/or knowledge distillation.
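|
| In the classic logit-matching sense, the distillation loss
| looks roughly like the sketch below (PyTorch; the temperature,
| weighting and tensor shapes are placeholders - the Phi-style
| approach instead just trains on the teacher's generated text):
|
|     import torch.nn.functional as F
|
|     def distill_loss(student_logits, teacher_logits, labels,
|                      T=2.0, alpha=0.5):
|         # soft targets: match the teacher's softened distribution
|         soft = F.kl_div(
|             F.log_softmax(student_logits / T, dim=-1),
|             F.softmax(teacher_logits / T, dim=-1),
|             reduction="batchmean",
|         ) * (T * T)
|         # hard targets: ordinary cross-entropy on the labels
|         hard = F.cross_entropy(student_logits, labels)
|         return alpha * soft + (1 - alpha) * hard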
| calf wrote:
| It reminds me of when I take notes from a textbook then
| intensively review my own notes
| solardev wrote:
| And then when it comes time for the test, I end up
| hallucinating answers too.
| mrbungie wrote:
| I would guess correctly aligned and/or finely filtered
| synthetic data coming from LLMs may be good.
|
| Mode collapse theories (and simplified models used as proof of
| existence of said problem) assume affected LLMs are going to be
| trained with poor quality LLM-generated batches of text from
| the internet (i.e. reddit or other social networks).
| andai wrote:
| Look into Microsoft's Phi papers. The whole idea here is that
| if you train models on higher quality data (i.e. textbooks
| instead of blogspam) you get higher quality results.
|
| The exact training is proprietary but they seem to use a lot of
| GPT-4 generated training data.
|
| On that note... I've often wondered if broad memorization of
| trivia is really a sensible use of precious neurons. It seems
| like a system trained on a narrower range of high quality
| inputs would be much more useful (to me) than one that
| memorized billions of things I have no interest in.
|
| At least at the small model scale, the general knowledge aspect
| seems to be very unreliable anyways -- so why not throw it out
| entirely?
| deegles wrote:
| You're not just memorizing text though. Each piece of trivia
| is something that represents coherent parts of reality. Think
| of it as being highly compressed.
| throwthrowuknow wrote:
| The trivia include information about many things: grammar,
| vocabulary, slang, entity relationships, metaphor, among
| others but chiefly they also constitute models of human
| thought and behaviour. If all you want is a fancy technical
| encyclopedia then by all means chop away at the training set
| but if you want something you can talk to then you'll need to
| keep the diversity.
| visarga wrote:
| > you'll need to keep the diversity.
|
| You can get diverse low quality data from the web, but for
| diverse high quality data the organic content is exhausted.
| The only way is to generate it, and you can maintain a good
| distribution by structured randomness. For example just
| sample 5 random words from the dictionary and ask the model
| to compose a piece of text from them. It will be more
| diverse than web text.
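|
| A minimal sketch of that kind of structured randomness
| (Python; the word list and prompt wording are just
| placeholders):
|
|     import random
|
|     # stand-in word list; in practice load a real dictionary
|     words = ["lantern", "glacier", "violin", "harbor",
|              "cinnamon", "compass", "meadow", "ember"]
|
|     seed_words = random.sample(words, 5)
|     prompt = ("Write a short, self-contained piece of text "
|               "that naturally uses all of these words: "
|               + ", ".join(seed_words))
|     print(prompt)  # feed this to whatever model is generating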
| snovv_crash wrote:
| From what I've seen Phi does well in benchmarks but poorly in
| real world scenarios. They also made some odd decisions
| regarding the network structure which means that the memory
| requirements for larger context are really high.
| ComputerGuru wrote:
| > I've often wondered if broad memorization of trivia is
| really a sensible use of precious neurons.
|
| I agree if we are talking about maxing raw reasoning and
| logical inference abilities, but the problem is that the ship
| has sailed and people _expect_ llms to have domain knowledge
| (even more than expert users are clamoring for LLMs to have
| better logic).
|
| I bet a model with actual human "intelligence" but no Google-
| scale encyclopedic knowledge of the world it lives in would
| be scored less preferentially by the masses than what we have
| now.
| kkielhofner wrote:
| Synthetic data (data from some kind of generative AI) has been
| used in some form or another for quite some time[0]. The
| license for LLaMA 3.1 has been updated to specifically allow
| its use for generation of synthetic training data. Famously,
| there is a ToS clause from OpenAI in terms of using them for
| data generation for other models but it's not enforced ATM.
| It's pretty common/typical to look through a model card, paper,
| etc and see the use of an LLM or other generative AI for some
| form of synthetic data generation in the development process -
| various stages of data prep, training, evaluation, etc.
|
| Phi is another really good example but that's already covered
| from the article.
|
| [0] - https://www.latent.space/i/146879553/synthetic-data-is-
| all-y...
| moffkalast wrote:
| As others point out, it's essentially distillation of a larger
| model to a smaller one. But you're right, it doesn't work very
| well. Phi's performance is high on benchmarks but not nearly as
| good in actual real world usage. It is extremely overfit on a
| narrow range of topics in a narrow format.
| iJohnDoe wrote:
| > Microsoft used LLMs to write millions of short stories and
| textbooks
|
| Millions? Where are they? Where are they used?
| HPsquared wrote:
| Model developers don't usually release training data like
| that.
| sandwichmonger wrote:
| That's the number one way of getting mad LLM disease. Feeding
| LLMs to LLMs.
| staticman2 wrote:
| There have been efforts to train small LLMs on bigger LLMs'
| output. Ever since Llama came out, the community has been
| creating custom fine-tunes this way using ChatGPT.
| mrfinn wrote:
| It's kinda funny how nowadays an AI with 8 billion parameters
| is something "small". Especially when just two years back
| entire racks were needed to run something giving far worse
| performance.
| atemerev wrote:
| IDK, 8B-class quantized models run pretty fast on commodity
| laptops, with CPU-only inference. Thanks to the people who
| figured out quantization and reimplemented everything in C++,
| instead of academic-grade Python.
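|
| A toy illustration of what block quantization does, loosely in
| the spirit of llama.cpp's Q4 formats (not the actual code): a
| per-block scale plus small integer codes replace 16-bit floats.
|
|     import numpy as np
|
|     w = np.random.randn(32).astype(np.float32)  # a weight block
|     scale = np.abs(w).max() / 7       # map block into [-7, 7]
|     q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
|     w_hat = q * scale                 # dequantized at runtime
|     print("max error:", float(np.abs(w - w_hat).max()))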
| actualwitch wrote:
| A solid chunk of python is just wrappers around C/C++, most
| tensor frameworks included.
| atemerev wrote:
| I know, and yet early model implementations were quite
| unoptimized compared to the modern ones.
| wslh wrote:
| What's the current cost of building a DIY bare-bones machine
| setup to run the top LLaMA 3.1 models? I understand that two
| nodes are typically required for this. Has anyone built something
| similar recently, and what hardware specs would you recommend for
| optimal performance? Also, do you suggest waiting for any
| upcoming hardware releases before making a purchase?
| atemerev wrote:
| 405B is beyond homelab-scale. I recently obtained a 4x4090 rig,
| and I am comfortable running 70B and occasionally 128B-class
| models. For 405B, you need 8xH100 or better. A single H100
| costs around $40k.
| HPsquared wrote:
| Here is someone running 405b on 12x3090 (4.5bpw). Total cost
| around $10k.
|
| https://www.reddit.com/r/LocalLLaMA/comments/1ej9uzh/local_l.
| ..
|
| Admittedly it's slow (3.5 token/sec)
| wslh wrote:
| Approximately how many tokens per second would the (edited:
| ~$40k x 8, i.e. >=~ $320k) version process? Would this result
| in a ~32x boost in performance compared to other setups?
| Thanks!
| brap wrote:
| Some companies (OpenAI, Anthropic...) base their whole business
| on hosted closed source models. What's going to happen when all
| of this inevitably gets commoditized?
|
| This is why I'm putting my money on Google in the long run. They
| have the reach to make it useful and the monetization behemoth to
| make it profitable.
| csmpltn wrote:
| There's plenty of competition in this space already, and it'll
| only get accelerated with time. There's not enough "moat" in
| building proprietary LLMs - you can tell by how the leading
| companies in this space are basically down to fighting over
| patents and regulatory capture (ie. mounting legal and
| technical barriers to scraping, procuring hardware, locking
| down datasets, releasing less information to the public about
| how the models actually work behind the scenes, lobbying for
| scary-yet-vague AI regulation, etc).
|
| It's fizzling out.
|
| The current incumbents are sitting on multi-billion dollar
| valuations and juicy funding rounds. This buys runtime for a
| good couple of years, but it won't last forever. There's a
| limit to what can be achieved with scraped datasets and deep
| Markov chains.
|
| Over time, it will become difficult to judge what makes one
| general-purpose LLM be any better than another general-purpose
| LLM. A new release isn't necessarily performing better or
| producing better quality results, and it may even regress for
| many use-cases (we're already seeing this with OpenAI's latest
| releases).
|
| Competitors will have caught up to eachother, and there
| shouldn't be any major differences between Claude, ChatGPT,
| Gemini, etc - after-all, they should all produce near-identical
| answers, given identical scenarios. Pace of innovation flattens
| out.
|
| Eventually, the technology will become wide-spread, cheap and
| ubiquitous. Building a (basic, but functional) LLM will be
| condensed down to a course you take at university (the same way
| people build basic operating systems and basic compilers in
| school).
|
| The search for AGI will continue, until the next big hype cycle
| comes up in 5-10 years, rinse and repeat.
|
| You'll have products geared at lawyers, office workers,
| creatives, virtual assistants, support departments, etc. We're
| already there, and it's working great for many use-cases - but
| it just becomes one more tool in the toolbox, the way Visual
| Studio, Blender and Photoshop are.
|
| The big money is in the datasets used to build, train and
| evaluate the LLMs. LLMs today are only as good as the data they
| were trained on. The competition on good, high-quality, up-to-
| date and clean data will accelerate. With time, it will become
| more difficult, expensive (and perhaps illegal) to obtain
| world-scale data, clean it up, and use it to train and evaluate
| new models. This is the real goldmine, and the only moat such
| companies can really have.
| sparky_ wrote:
| This is the best take on the generative AI fad I've yet seen.
| I wish I could upvote this twice.
| 101008 wrote:
| I had the same impression. I have been suffering a lot
| lately about the future for engineers (not having work,
| etc), even habing anxiety when I read news about AI _, but
| these comments make me feel better and relaxed.
|
| _ I even considered blocking HN.
| whimsicalism wrote:
| Yeah, this is called motivated reasoning.
| meiraleal wrote:
| And then the successful ChatGPT wrappers with traction will
| become more valuable than the companies creating proprietary
| LLMs. I bet OpenAI will start buying many AI apps to find
| profitable niches.
| throwaway314155 wrote:
| I don't have a horse in the race but wouldn't Meta be more
| likely to commoditize things given that they sort of already
| are?
| zdragnar wrote:
| Search
|
| Gmail
|
| Docs
|
| Android
|
| Chrome (browser and Chromebooks)
|
| I don't use any Meta properties at all, but at least a dozen
| alphabet ones. My wife uses Facebook, but that's about it. I
| can see it being handy for insta filters.
|
| YMMV of course, but I suspect alphabet has much _deeper_
| reach, even if the actual overall number of people is
| similar.
| throwaway314155 wrote:
| I was referring to the many quality open models they've
| released to be clear.
| whimsicalism wrote:
| Their hope is to reach AGI and effective post-scarcity for most
| things that we currently view as scarce.
|
| I know it sounds crazy but that is what they actually believe
| and is a regular theme of conversations in SF. They also think
| it is a flywheel and whoever wins the race in the next few
| years will be so far ahead in terms of iteration
| capability/synthetic data that they will be the runaway winner.
| andai wrote:
| What local models is everyone using?
|
| The last one I used was Llama 3.1 8B which was pretty good (I
| have an old laptop).
|
| Has there been any major development since then?
| moffkalast wrote:
| Qwen 2.5 has just released, with a surprising amount of sizes.
| The 14B and 32B look pretty promising for their size class but
| it's hard to tell yet.
| esoltys wrote:
| I like [mistral-nemo](https://ollama.com/library/mistral-nemo)
| "A state-of-the-art 12B model with 128k context length, built
| by Mistral AI in collaboration with NVIDIA."
| demarq wrote:
| Nada to be honest. I keep trying every new model, and
| invariably go back to llama 8b.
|
| Llama8b is the new mistral.
| alanzhuly wrote:
| I like the latest qwen2.5 (https://nexaai.com/Qwen/Qwen2.5-0.5B
| -Instruct/gguf-q4_0/read...). It was just released last week.
| It is one of the best small language models right now according
| to benchmarks. And it is small and fast!
| theodorthe5 wrote:
| Local LLMs are terrible compared to Claude/ChatGPT. They are
| useful to use as APIs for applications: much cheaper than paying
| for OpenAI services, and can be fine tuned to do many useful (and
| less useful, even illegal) things. But for the casual user, they
| suck compared to the very large LLMs OpenAI/Anthropic deliver.
| 78m78k7i8k wrote:
| I don't think local LLM's are being marketed "for the casual
| user", nor do I think the casual user will care at all about
| running LLM's locally so I am not sure why this comparison
| matters.
| 123yawaworht456 wrote:
| they are the only thing you can use if you don't want to or
| aren't allowed to hand over your data to US corporations and
| intelligence agencies.
|
| every single query to ChatGPT/Claude/Gemini/etc will be used
| for any purpose, by any party, at any time. shamelessly so,
| because this is the new normal. _Welcome to 2024. I own
| nothing, have no privacy, and life has never been better._
|
| >(and less useful, even illegal) things
|
| the same illegal things you can do with Notepad, or a pencil
| and a piece of paper.
| maxnevermind wrote:
| Yep, unfortunately those local models are noticeably worse.
| Also models are getting bigger, so even if a local basement rig
| for a higher quality model is possible right now, that might
| not be so in the future. Also Zuck and others might stop
| releasing their weights for the next gen models, then what,
| just hope they plateau, what if they don't?
| swah wrote:
| I saw this demo a few months back - and lost it - of LLM
| autocompletion that took only a few milliseconds; it opened a
| whole new way to explore it... any ideas?
| JPLeRouzic wrote:
| https://groq.com
|
| is very fast.
|
| (this is not the same as Grok)
| miguelaeh wrote:
| I am betting on local AI and building offload.fyi to make it easy
| to implement in any app
| nipponese wrote:
| Am I the only one seeing obvious ads in llama3 results?
| dunefox wrote:
| Yes.
| Sophira wrote:
| I've not yet used any local AI, so I'm curious - what are you
| getting? Can you share examples?
| HexDecOctBin wrote:
| May as well ask here: what is the best way to use something like
| an LLM as a personal knowledge base?
|
| I have a few thousand books, papers and articles collected over
| the last decade. And while I have meticulously categorised them
| for fast lookup, it's getting harder and harder to search for the
| desired info, especially in categories which I might not have
| explored recently.
|
| I do have a 4070 (12 GB VRAM), so I thought that LLMs might be
| a solution. But trying to figure out the whats and hows has
| proven to be extremely complicated, what with the deluge of
| techniques (fine-tuning, RAG, quantisation) that might or might
| not be obsolete, too many grifters hawking their own startups
| with thin wrappers, and a general sense that the "new shiny
| object" is prioritised more than actual stable solutions to
| real problems.
| routerl wrote:
| Imho, and I'm no expert, but this has been working well
| for me:
|
| Segment the texts into chunks that make sense (i.e. into the
| lengths of text you'll want to find, whether this means
| chapters, sub-chapters, paragraphs, etc), create embeddings of
| each chunk, and store the resultant vectors in a vector
| database. Your search workflow will then be to create an
| embedding of your query, and perform a distance comparison
| (e.g. cosine similarity) which returns ranked results. This way
| you can now semantically search your texts.
|
| Everything I've mentioned above is fairly easily doable with
| existing LLM libraries like langchain or llamaindex. For
| reference, this is an RAG workflow.
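|
| A bare-bones version of that workflow, assuming the
| sentence-transformers package and a plain in-memory index
| instead of a vector database (langchain/llamaindex wrap the
| same idea):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     chunks = ["...chapter 1 text...", "...chapter 2 text..."]
|     chunk_vecs = model.encode(chunks, normalize_embeddings=True)
|
|     def search(query, k=5):
|         q = model.encode([query], normalize_embeddings=True)[0]
|         scores = chunk_vecs @ q   # cosine sim (unit vectors)
|         top = np.argsort(-scores)[:k]
|         return [(chunks[i], float(scores[i])) for i in top]
|
|     print(search("papers about memory bandwidth"))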
| dchuk wrote:
| Look into this: https://www.anthropic.com/news/contextual-
| retrieval
|
| And this: https://microsoft.github.io/graphrag/
| meonkeys wrote:
| https://khoj.dev promises this.
| dockerd wrote:
| What specs do people here recommend to run small models like
| Llama 3.1, mistral-nemo, etc.?
|
| Also, is it sensible to wait for the newer Mac, AMD and Nvidia
| hardware releasing soon?
| noman-land wrote:
| You basically need as much RAM as the size of the model.
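|
| As a rough rule of thumb (treating the quantized weights as
| the dominant cost and ignoring KV cache and runtime overhead):
|
|     params_b = 8           # e.g. an 8B model
|     bits_per_weight = 4.5  # typical 4-bit-ish quant
|     weights_gb = params_b * bits_per_weight / 8
|     print(f"~{weights_gb:.1f} GB just for weights")  # ~4.5 GB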
| zozbot234 wrote:
| You actually need a lot less than that if you use the mmap
| option, because then only activations need to be stored in
| RAM, the model itself can be read from disk.
| noman-land wrote:
| Can you say a bit more about this? Based on my non-
| scientific personal experience on an M1 with 64gb memory,
| that's approximately what it seems to be. If the model is
| 4gb in size, loading it up and doing inference takes about
| 4gb of memory. I've used LM Studio and llamafiles directly
| and both seem to exhibit this behavior. I believe
| llamafiles use mmap by default based on what I've seen jart
| talk about. LM Studio allows you to "GPU offload" the model
| by loading it partially or completely into GPU memory, so
| not sure what that means.
| ycombinatrix wrote:
| How does one set this up?
| freeone3000 wrote:
| M4s are releasing in probably a month or two; if you're going
| Apple, it might be worth waiting for either those or the price
| drop on the older models.
| albertgoeswoof wrote:
| I have a three year old M1 Max, 32GB RAM. Llama 8B runs at 25
| tokens/sec, that's fast enough, and covers 80% of what I need. On
| my ryzen 5600h machine, I get about 10 tokens/second, which is
| slow enough to be annoying.
|
| If I get stuck on a problem, I switch to ChatGPT or phind.com and
| see what that gives. Sometimes, it's not the LLM that helps, but
| changing the context and rewriting the question.
|
| However I cannot use the online providers for anything remotely
| sensitive, which is more often than you might think.
|
| Local LLMs are the future, it's like having your own private
| Google running locally.
| meiraleal wrote:
| We need browser and OS level API (mobile) integration to the
| local LLM.
| evilduck wrote:
| Both are in their early stages:
|
| https://developer.chrome.com/docs/ai
|
| https://developer.apple.com/documentation/AppIntents/Integra.
| ..
| fsmv wrote:
| A small model is necessarily missing many facts. The large
| model is the one that has memorized the whole internet, the
| small one is just trained to mimic the big one.
|
| You simply cannot compress the whole internet under 10gb
| without throwing out a lot of information.
|
| Please be careful about what you take as fact coming from the
| local model output. Small models are better suited to
| summarization.
| albertgoeswoof wrote:
| I don't trust anything as fact coming out of these models. I
| ask it for how to structure solutions, with examples. Then I
| read the output and research the specifics before using
| anything further.
|
| I wouldn't copy and paste from even the smartest minds,
| nevermind a model output.
| staticman2 wrote:
| I'm really curious what you are doing with an LLM that can be
| solved 80% of the time with a 8b model.
| albertgoeswoof wrote:
| It's mostly how would you solve this programming problem, or
| reminders on syntax, scaffolding a configuration file etc.
|
| Often it's a form of rubber duck programming, with a smarter
| rubber duck.
| skydhash wrote:
| All of this can be solved with a 3-20MB PDF file, a 10kb
| snippet/template file, and a whiteboard.
| rNULLED wrote:
| Don't forget the duct tape
| jmount wrote:
| I think this is a big deal. In my opinion, many money making
| stable AI services are going to be deliberately of limited
| ability on limited domains. One doesn't want one's site help bot
| answering political questions. So this could really pull much of
| the revenue away from AI/LLMs as a service.
| pella wrote:
| Llama 3.1 405B
|
| _" 2 MacBooks is all you need. Llama 3.1 405B running
| distributed across 2 MacBooks using @exolabs_ home AI cluster"_
| https://x.com/AIatMeta/status/1834633042339741961
| IshKebab wrote:
| "All you need is PS10k of Apple laptops..."
| nurettin wrote:
| That is... probable, if you bought a newish m2 to replace
| your 5-6 year old macbook pro which is now just lying around.
| Or maybe you and your spouse can share cpu hours.
| svnt wrote:
| No, you need two of the newest M3 Macbook Pros with maxed
| RAM, which in practice some people might have, but it is
| not gettable by using old hardware.
|
| And not having tried it, I'm guessing it will probably run
| at 1-2 tokens per second or less since the 70b model on one
| of these runs at 3-4, and now we are distributing the
| process over the network, which is best case maybe
| 40-80Gb/s
|
| It is possible, and that's about the most you can say about
| it.
| earslap wrote:
| yes but still, a local model, a lightning in a bottle that is
| between GPT3.5 and GPT4 (closer to 4), yours forever, for
| about that price is a pretty good deal today. probably won't be
| a good deal in a couple years but for the value, it is not
| _that_ unsettling. When ChatGPT first launched 2 years ago we
| all wondered what it would take to have something close to
| that locally with no strings attached, and turns out it is
| "a couple years and about $10k" (all due to open weights
| provided by some companies, training such a model still costs
| millions) which is neat. It will never be more expensive.
| vessenes wrote:
| All this will be an interesting side note in the history of
| language models in the next eight months when roughly 1.5 billion
| iPhone users will get a local language model tied seamlessly to a
| mid-tier cloud based language model native in their OS.
|
| What I think will be interesting is seeing which of the open
| models stick around and for how long when we have super easy
| 'good enough' models that provide quality integration. My bet is
| not many, sadly. I'm sure Llama will continue to be developed,
| and perhaps Mistral will get additional European government
| support, and we'll have at least one offering from China like
| Qwen, and Bytedance and Tencent will continue to compete a-la
| Google and co. But, I don't know if there's a market for ten
| separately trained open foundation models long term.
|
| I'd like to make sure there's some diversity in research and
| implementation of these in the open access space. It's a critical
| tool for humans, and it seems possible to me that leaders will be
| able to keep extending the gap for a while; when you're using
| that gap not just to build faster AI, but do other things, the
| future feels pretty high volatility right now. Which is
| interesting! But, I'd prefer we come out of it with people all
| over the world having access to these.
| jannyfer wrote:
| > in the next eight months when roughly 1.5 billion iPhone
| users will get a local language model tied seamlessly to a mid-
| tier cloud based language model native in their OS.
|
| Only iPhone 15 Pro or later will get Apple Intelligence, so the
| number will be wayyy smaller.
| visarga wrote:
| Not in EU they won't.
| darby_nine wrote:
| I expect people will just ship with their own model where the
| built-in one isn't sufficient.
|
| When people describe it as a "critical tool" i feel like I'm
| missing basic information about how people use computers and
| interact with the world. In what way is it critical for
| anything? It's still just a toy at this point.
| qingdao99 wrote:
| When it's expected to be handling reminders, calendar events,
| and other device functions for millions of users, it will be
| considered critical.
| Anunayj wrote:
| I recently experimented with running llama-3.1-8b-instruct
| locally on my Consumer hardware, aka my Nvidia RTX 4060 with 8GB
| VRAM, as I wanted to experiment with prompting pdfs with a large
| context which is extremely expensive with how LLMs are priced.
|
| I was able to fit the model with decent speeds (30
| tokens/second) and a 20k token context completely on the GPU.
|
| For summarization, the performance of these models is decent
| enough. However, unfortunately, in my use case I felt that
| using Gemini's free tier, with its multimodal capabilities and
| much better quality output, made running local LLMs not really
| worth it as of right now, at least for consumers.
| mistrial9 wrote:
| you moved the goalposts when you added 'multimodal' there;
| another item is, no one reads PDF tables and illustrations
| perfectly, at any price AFAIK
| ComputerGuru wrote:
| Supposedly submitting screenshots of pdfs (at a large enough
| zoom per tile/page) to OpenAI's GPT-4o or Google's whatever is
| currently the best way of handling charts and tables.
| binary132 wrote:
| I really get the feeling with these models that what we need is a
| very memory-first hardware architecture that is not necessarily
| the fastest at crunching.... that seems like it shouldn't
| necessarily be a terrifically expensive product
| shrubble wrote:
| The newest laptops are supposed to have 40-50 TOPS performance
| with the new AI/NPU features. Wondering what that will mean in
| practice.
| toddmorey wrote:
| I narrate notes to myself on my morning walks[1] and then run
| whisper locally to turn the audio into text... before having an
| LLM clean up my ramblings into organized notes and todo lists. I
| have it pretty much all local now, but I don't mind waiting a few
| extra seconds for it to process since it's once a day. I like the
| privacy because I was never comfortable telling my entire life to
| a remote AI company.
|
| [1] It feels super strange to talk to yourself, but luckily I'm
| out early enough that I'm often alone. Worst case, I pretend I'm
| talking to someone on the phone.
| racked wrote:
| What software did you use to set all this up? Kindof interested
| in giving this a shot myself.
| azeirah wrote:
| You can use llama.cpp, it runs on almost all hardware.
| Whisper.cpp is similar, but unless you have a mid or high end
| nvidia card it will be a bit slower.
|
| Still very reasonable on modern hardware.
| bobbylarrybobby wrote:
| If you build locally for Apple hardware (instructions in
| the whisper.cpp readme) then it performs quite admirably on
| Apple computers as well.
| navbaker wrote:
| Definitely try it with Ollama, it is by far the simplest
| local LLM tool to get up and running with minimal fuss!
| alyandon wrote:
| I would be greatly interested in knowing how you set all that
| up if you felt like sharing the specifics.
| toddmorey wrote:
| My hope is to make this easy with a GH repo or at least
| detailed instructions.
|
| I'm on a Mac and I found the easiest way to run & use local
| models is Ollama as it has a rest interface:
| https://github.com/ollama/ollama/blob/main/docs/api.md
|
| I just have a local script that pulls the audio file from
| Voice Memos (after it syncs from my iPhone), runs it through
| openai's whisper (really the best at speech to text;
| excellent results) and then makes sense of it all with a
| prompt that asks for organized summary notes and todos in GH
| flavored markdown. That final output goes into my Obsidian
| vault. The model I use is llama3.1 but haven't spent much
| time testing others. I find you don't really need the largest
| models since the task is to organize text rather than augment
| it with a lot of external knowledge.
|
| Humorously the harder part of the process was finding where
| the hell Voice Memos actually stores these audio files. I
| wish you could set the location yourself! They live deep
| inside ~/Library/Containers. Voice Memos has no export
| feature, but I found you can drag any audio recording out of
| the left sidebar to the desktop or a folder. So I just drag
| the voice memo into a folder my script watches and then it
| runs the automation.
|
| If anyone has another, better option for recording your voice
| on an iPhone, let me know! The nice thing about all this is
| you don't even have to start / stop the recording ever on
| your walk... just leave it going. Dead space and side
| conversations and commands to your dog are all well handled
| and never seem to pollute my notes.
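|
| A minimal sketch of that kind of pipeline (Python; the model
| names, file paths and prompt are placeholders, and the
| endpoint is the Ollama REST API documented in the link above):
|
|     import json, urllib.request
|     import whisper  # pip install openai-whisper
|
|     AUDIO = "walk-notes.m4a"  # memo dragged out of Voice Memos
|
|     # 1. speech to text
|     text = whisper.load_model("base").transcribe(AUDIO)["text"]
|
|     # 2. clean-up via the local Ollama REST API
|     req = urllib.request.Request(
|         "http://localhost:11434/api/generate",
|         data=json.dumps({
|             "model": "llama3.1",
|             "prompt": "Turn this rambling transcript into "
|                       "organized markdown notes and todos:\n\n"
|                       + text,
|             "stream": False,
|         }).encode(),
|         headers={"Content-Type": "application/json"},
|     )
|     resp = urllib.request.urlopen(req).read()
|     notes = json.loads(resp)["response"]
|
|     with open("daily-notes.md", "w") as f:  # Obsidian vault
|         f.write(notes)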
| schainks wrote:
| Amazing, thank you for this!
| behnamoh wrote:
| You can record your voice messages and send them to
| yourself in Telegram. They're saved on-device. You can then
| create a bot to do things to stuff as they come in, like
| "transcribe new ogg files and write back the text as a
| message after the voice memo".
| graeme wrote:
| Have you tried the Shortcuts app? On phone and mac. Should
| be able to make one that finds and moves a voice memo when
| run. You can run them on button press or via automation.
|
| Also what kind of local machine do you need? I have an imac
| pro, wondering if this will run the models or if I ought to
| be on an apple silicon machine? I have an M1 macbook air as
| well.
| emadda wrote:
| You could also use the "share" menu and airdrop the audio
| from your iphone to your mac. Files end up in Downloads by
| default.
| vincvinc wrote:
| I was thinking about making this the other day. Would you mind
| sharing what you used?
| schmidtleonard wrote:
| Button-toggled voice notes in the iPhone Notes app are a
| godsend for taking measurements. Rather than switching your
| hands between probe/equipment and notes repeatedly, which sucks
| badly, you can just dictate your readings and maaaaybe clean
| out something someone said in the background. Over the last
| decade, the microphones + speech recognition became Good Enough
| for this. Wake-word/endpoint models still aren't there yet, and
| they aren't really close, but the stupid on/off button in the
| Notes app 100% solves this problem and the workflow is now
| viable.
|
| I love it and I sincerely hope that "Apple Intelligence" won't
| kill the button and replace it with a sub-viable conversational
| model, but I probably ought to figure out local whisper sooner
| rather than later because it's probably inevitable.
| freetanga wrote:
| I bought an iZYREC (?) and leave the phone at home.
| MacWhisper and some regex (I use verbal tags) and done
| sgu999 wrote:
| Some dubious marketing choices on their landing page:
|
| > Finding the Truth - Surprisingly, my iZYREC revealed more
| than I anticipated. I had placed it in my husband's car,
| aiming to capture some fun moments, but it instead recorded
| intimate encounters between my husband and my close friend.
| Heartbreaking yet crucial, it unveiled a hidden truth,
| helping me confront reality.
|
| > A Voice for the Voiceless - We suspected that a
| relative's child was living in an abusive home. I slipped
| the device into the child's backpack, and it recorded the
| entire day. The sound quality was excellent, and
| unfortunately, the results confirmed our suspicions. Thanks
| iZYREC, giving a voice to those who need it most.
| hdjjhhvvhga wrote:
| > It feels super strange to talk to yourself
|
| I remember the first lecture in the Theory of Communication
| class where the professor introduced the idea that
| communication by definition requires at least two different
| participants. We objected by saying that it can perfectly be
| just one and the same participant (communication is not just
| about space but also time), and what you say is a perfect
| example of that.
| vunderba wrote:
| Same. My husky/pyr mix needs a lot of exercise, so I'm outside
| a minimum of a few hours a day. As a result I do a lot of
| dictation on my phone.
|
| I put together a script that takes any audio file (mp3, wav),
| normalizes it, runs it through ggerganov's whisper, and then
| cleans it up using a local LLM. This has saved me a tremendous
| amount of time. Even modestly sized 7b parameter models can
| handle syntactical/grammatical work relatively easily.
|
| Here's the gist:
|
| https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
|
| _EDIT: I've always talked out loud through problems anyway,
| throw a BT earbud on and you'll look slightly less deranged._
| adamsmith wrote:
| If it's helpful here is the prompt I use to clean up voice-
| memo transcripts: https://gist.github.com/adamsmith/2a22b08d3
| d4a11fb9fe06531ae...
| lukan wrote:
| "before having an LLM clean up my ramblings into organized
| notes and todo lists."
|
| Which local LLM do you use?
|
| Edit:
|
| And self talk is quite a healthy and useful thing in itself,
| but avoiding it in public is indeed kind of necessary, because
| of the stigma
|
| https://en.m.wikipedia.org/wiki/Intrapersonal_communication
| flimflamm wrote:
| That's just meat CoT (chain of thought) - right?
| lukan wrote:
| I do not understand?
| valval wrote:
| GP is making a joke about speaking to oneself really just
| being the human version of Chain of Thought, which in my
| understanding is an architectural decision in LLMs to have
| it write out intermediate steps in problem solving and
| evaluate the validity of them as it goes.
| anigbrowl wrote:
| Just put in some earbuds and everyone will assume you're on
| the phone.
| yieldcrv wrote:
| I found one that can isolate speakers, it's just okay at that
| neom wrote:
| This is exactly why I think the AI pins are a good idea. The
| Humane pin seems too big/too expensive/not quite there yet, but
| for exactly what you're doing, I would like some type of
| brooch.
| wkat4242 wrote:
| What do you use to run whisper locally? I don't think ollama
| can do it.
| jxcl wrote:
| This has inspired me.
|
| I do a lot of stargazing and have experimented with voice memos
| for recording my observations. The problem of course is later
| going back and listening to the voice memo and getting
| organized information out of what essentially turns into me
| rambling to myself.
|
| I'm going to try to use whisper + AI to transcribe my voice
| memos into structured notes.
| inciampati wrote:
| You can use it for everything. Just make sure that you have
| an input method set up on your computer and phone that allow
| you to use whisper.
|
| That's how I'm writing this message to you.
|
| Learning to use these speech-to-text systems will be a new
| kind of literacy.
|
| I think pushing the transcription through language models is
| a fantastic way to deal with the complexity and frankly,
| disorganization of directly going from speech to text.
|
| By doing this we can all basically type at 150-200 words a
| minute.
| draebek wrote:
| For people on macOS, the free app Aiko on the App Store makes
| it easy to use Whisper, if you want a GUI:
| https://sindresorhus.com/aiko
| wazdra wrote:
| I'd like to point out that llama 3.1 is _not_ open source[1] (I
| was recently made aware of that fact by [2], when it was on HN
| front page) While it 's very nice to see a peak of interest for
| local, "open-weights" LLMs, this is an unfortunate choice of
| words, as it undermines the quite important differences between
| llama's license model and open-source. The license question does
| not seem to be addressed at all in the article.
|
| [1]: https://www.llama.com/llama3_1/license/
|
| [2]: https://csvbase.com/blog/14
| sergiotapia wrote:
| that ship sailed 13 years ago dude.
| pimeys wrote:
| Has anybody found a good way to utilize ollama with an editor
| such as zed to do things like "generate rustdoc to this method"
| etc. I use ollama daily for a ton of things, but for code
| generation, completion and documentation 4o is still much better
| than any of the local models...
| navbaker wrote:
| The Continue extension for VSCode is pretty good and has native
| connectivity to a local install of Ollama
| pimeys wrote:
| Zed has also support for ollama, but all the local models I
| tried do not really work so well for things like "write docs
| for this method"... Also local editor autocomplete in the
| style of github copilot would be great, without needing to
| use proprietary Microsoft tooling...
| vunderba wrote:
| There's a lot of plugins/IDEs for assistant style LLMs, but
| the only TAB style autocompletion ones I know of are either
| proprietary (Github Copilot), or you need to get an API key
| (Codestral). If anyone knows of a local autocomplete model
| I'd love to hear about it.
|
| The Continue extension (Jetbrains, VSCodium) lets you set
| up assistant and autocompletion independently with
| different API keys.
| statenjason wrote:
| I use gen.nvim[1] with for small tasks, like "write a type
| definition for this JSON" .
|
| Running locally avoids the concern of sending IP or PII to
| third parties.
|
| [1]: https://github.com/David-Kunz/gen.nvim
| api wrote:
| I use ollama through a Mac app called BoltAI quite a bit. It's
| like having a smart portable sci-fi "computer assistant" for
| research and it's all local.
|
| It is about the only thing I can do on my M1 Pro to spin up the
| fans and make the bottom of the case hot.
|
| Llama3.1, Deepseek Coder v2, and some of the Mistral models are
| good.
|
| ChatGPT and Claude top tier models are still better for very hard
| stuff.
| noman-land wrote:
| For anyone who hasn't tried local models because they think it's
| too complicated or their computer can't handle it, download a
| single llamafile and try it out in just moments.
|
| https://future.mozilla.org/builders/news_insights/introducin...
|
| https://github.com/Mozilla-Ocho/llamafile
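|
| Once a llamafile is running it serves an OpenAI-compatible API
| on localhost, so ordinary client code can point at it. Rough
| sketch; the port and model name are the defaults from the
| llamafile docs, so adjust if yours differ:
|
|     # Talk to a running llamafile through its local
|     # OpenAI-compatible endpoint (no cloud, no real API key).
|     from openai import OpenAI
|
|     client = OpenAI(
|         base_url="http://localhost:8080/v1",
|         api_key="sk-no-key-required",  # any non-empty string
|     )
|     reply = client.chat.completions.create(
|         model="LLaMA_CPP",
|         messages=[{"role": "user",
|                    "content": "Explain llamafiles briefly."}],
|     )
|     print(reply.choices[0].message.content)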
|
| They even have whisperfiles now, which is the same thing but for
| whisper.cpp, aka real-time voice transcription.
|
| You can also take this a step further and use this exact setup
| for a local-only co-pilot style code autocomplete and chat using
| Twinny. I use this every day. It's free, private, and offline.
|
| https://github.com/twinnydotdev/twinny
|
| Local LLMs are the only future worth living in.
| _kidlike wrote:
| Or https://ollama.com/
| fortyseven wrote:
| This has been my go-to for all of my local LLM interaction:
| it's easy to get going and manages all of the models easily.
| Nice clean API for projects. Updated regularly; works across
| Windows, Mac, Linux. It's a wrapper around llama.cpp, but it's
| a damned good one.
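|
| The Python client is about as small as it gets -- a minimal
| sketch, assuming pip install ollama, the server running
| locally, and llama3.1 already pulled (response access is
| dict-style in the client versions I've used):
|
|     # Chat with a local model via the ollama Python client.
|     import ollama
|
|     resp = ollama.chat(
|         model="llama3.1",
|         messages=[{"role": "user",
|                    "content": "Name three uses for a local LLM."}],
|     )
|     print(resp["message"]["content"])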
| heyoni wrote:
| Isn't there also some Firefox AI integration that's being
| tested by one dev out there? I forgot the name and wonder if it
| got any traction.
| privacyis1mp wrote:
| I built the Fluid app exactly with that in mind. You can run
| local AI on a Mac without really knowing what an LLM or ollama
| is. Plug & play.
|
| Sorry for the blatant ad, though I do hope it's useful for some
| ppl reading this thread: https://getfluid.app
| twh270 wrote:
| I'm interested, but I can't find any documentation for it.
| Can I give it local content (documents, spreadsheets, code,
| etc.) and ask questions?
| privacyis1mp wrote:
| > Can I give it local content (documents, spreadsheets,
| code, etc.)?
|
| It's coming roughly in December (may be sooner).
|
| The roadmap is as follows:
|
| - October - private remote AI (for when you need a smarter AI
| than your machine can handle, but don't want your data to be
| logged or stored anywhere)
|
| - November - web search capabilities (so the AI will be
| capable of doing web searches out of the box)
|
| - December - PDF, docs, and code embedding
|
| - 2025 - tighter macOS integration with context awareness
| twh270 wrote:
| Oh awesome, thank you! I will check back in December.
| zelphirkalt wrote:
| Many setups rely on Nvidia GPUs, Intel hardware, Windows, or
| other things I would rather not use, or are not very clear
| about how to set things up.
|
| What are some recommendations for running models locally, on
| decent CPUs and getting good valuable output from them? Is that
| llama stuff portable across CPUs and hardware vendors? And what
| do people use it for?
| noman-land wrote:
| llamafile will run across operating systems and architectures
| because it is compiled with Cosmopolitan.
|
| https://github.com/jart/cosmopolitan
|
| _" Cosmopolitan Libc makes C a build-once run-anywhere
| language, like Java, except it doesn't need an interpreter or
| virtual machine. Instead, it reconfigures stock GCC and Clang
| to output a POSIX-approved polyglot format that runs natively
| on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS
| with the best possible performance and the tiniest footprint
| imaginable."_
|
| I use it just fine on a Mac M1. The only bottleneck is how
| much RAM you have.
|
| I use whisper for podcast transcription. I use llama for code
| complete and general q&a and code assistance. You can use the
| llava models to ingest images and describe them.
| threecheese wrote:
| Have you tried a Llamafile? Not sure what platform you are
| using. From their readme:
|
| > ...by combining llama.cpp with Cosmopolitan Libc into one
| framework that collapses all the complexity of LLMs down to a
| single-file executable (called a "llamafile") that runs
| locally on most computers, with no installation.
|
| Low cost to experiment IMO. I am personally using macOS with
| an M1 chip and 64 GB of memory and it works perfectly, but the
| idea behind this project is to democratize access to
| generative AI and so it is at least possible that you will be
| able to use it.
| narrator wrote:
| With 64GB can you run the 70B size llama models well?
| credit_guy wrote:
| No, you can't. I have 128 GB and a 70B llamafile is
| unusable.
| threecheese wrote:
| I should have qualified the meaning of "works perfectly"
| :) No 70b for me, but I am able to experiment with many
| quantized models (and I am using a Llama successfully,
| latency isn't terrible)
| wkat4242 wrote:
| Not really. I run ollama on an AMD Radeon Pro and it works
| great.
|
| For tooling to train models it's a bit more difficult but
| inference works great on AMD.
|
| My CPU is an AMD Ryzen and the OS Linux. No problem.
|
| I use OpenWebUI as frontend and it's great. I use it for
| everything that people use GPT for.
| distances wrote:
| I'm using Ollama with an AMD GPU (7800, 16GB) on Linux. Works
| out of the box. Another question is then if I get much value
| out of these local models.
| vunderba wrote:
| If you're gonna go with a VS code extension and you're aiming
| for privacy, then I would at least recommend using the open
| source fork VS Codium.
|
| https://vscodium.com/
| unethical_ban wrote:
| It is true that VS Code has some non-optional telemetry, and
| if VS Codium works for people, that is great. However, the
| telemetry of VSCode is non-personal metrics, and some of the
| most popular extensions are only available with VSCode, not
| with Codium.
| wkat4242 wrote:
| > and some of the most popular extensions are only
| available with VSCode, not with Codium
|
| Which is an artificial restriction from MS that's really
| easily bypassed.
|
| Personally I don't care whether the telemetry is
| identifiable. I just don't want it.
| noman-land wrote:
| How is it bypassed?
| wkat4242 wrote:
| There's a whitelist identifier that you can add bundle
| IDs to, to get access to the more sensitive APIs. Then
| you can download the extension file and install it
| manually. I don't have the exact process right now but
| just Google it :)
| qwezxcrty wrote:
| From the documentation
| (https://code.visualstudio.com/docs/getstarted/telemetry )
| it seems there is a supported way to completely turn off
| telemetry. Is there something else in VSCode that doesn't
| respect this setting?
| wkat4242 wrote:
| Yeah I set up a local server with a strong GPU but even without
| that it's ok, just a lot slower.
|
| The biggest benefits for me are the uncensored models. I'm
| pretty kinky so the regular models tend to shut me out way too
| much, they all enforce this prudish victorian mentality that
| seems to be prevalent in the US but not where I live. Censored
| models are just unusable to me which includes all the hosted
| models. It's just so annoying. And of course the privacy.
|
| It should really be possible for the user to decide what kind
| of restrictions they want, not the vendor. I understand they
| don't want to offer violent stuff but 18+ topics should be
| squarely up to me.
|
| Lately I've been using grimjim's uncensored llama3.1 which
| works pretty well.
| the_gorilla wrote:
| If you get to act out degenerate fantasies with a chatbot, I
| also want to be able to get "violent stuff" (which is again
| just words).
| threecheese wrote:
| Any tips you can give for like-minded folks? Besides grimjim
| (checking it out).
| wkat4242 wrote:
| @the_gorilla: I don't consider bdsm to be 'degenerate' nor
| violent, it's all incredibly consensual and careful.
|
| It's just that the LLMs trigger immediately on minor words
| and shut down completely.
| kaoD wrote:
| Well this was my experience...
|
|       User: Hey, how are you?
|       Llama: [object Object]
|
| It's funny but I don't think I did anything wrong?
| AlienRobot wrote:
| 2000: Javascript is webpages.
|
| 2010: Javascript is webservers.
|
| 2020: Javascript is desktop applications.
|
| 2024: Javascript is AI.
| evbogue wrote:
| From this data we must conclude that within our lifetimes
| all matter in the universe will eventually be reprogrammed
| in JavaScript.
| mnky9800n wrote:
| I'm not sure I want to live in that reality.
| mortenjorck wrote:
| If the simulation hypothesis is real, perhaps it would
| follow that all the dark matter and dark energy in the
| universe is really just extra cycles being burned on
| layers of interpreters and JIT compilation of a loosely-
| typed scripting language.
| AlienRobot wrote:
| It's fine, it will be Typescript.
| anotherjesse wrote:
| WAT? https://www.destroyallsoftware.com/talks/wat
| ryukoposting wrote:
| Not only are they the only future worth living in, incentives
| are aligned with client-side AI. For governments and government
| contractors, plumbing confidential information through a
| network isn't an option, let alone spewing it across the
| internet. It's a non-starter, regardless of the productivity
| bumps stuff like Copilot can provide. The _only_ solution is to
| put AI compute on a cleared individual's work computer.
| ComputerGuru wrote:
| Do you know if whisperfile is akin to whisper or the much
| better whisperx? Does it do diarization?
| noman-land wrote:
| Last I checked it was basically just whisper.cpp so not
| whisperx and no diarization by default but it moves pretty
| quickly so you may want to ask on the Mozilla AI Discord.
|
| https://discord.com/invite/yTPd7GVG3H
| AustinDev wrote:
| https://old.reddit.com/r/LocalLLaMA/ is a great community for
| this sort of thing as well.
| upcoming-sesame wrote:
| I just tried it now. Super easy indeed, but slow to the point
| that it's not usable on my PC.
| stainablesteel wrote:
| I think this is laughable; the only good 8B models are the
| llama ones. Phi is terrible, and even Codestral can barely
| code, and that's 22B IIRC.
|
| But truthfully the 8B models just aren't that great yet. They
| can provide some decent info if you're just investigating
| things, but a Google search is still faster.
| aledalgrande wrote:
| Does anyone know of a local "Siri" implementation? Whisper +
| Llama (or Phi or something else), that can run shortcuts, take
| notes, read web pages etc.?
|
| PS: for reading web pages I know there's voices integrated in the
| browser/OS but those are horrible
| xenospn wrote:
| Apple intelligence?
| aledalgrande wrote:
| It isn't clear if you can know when the task gets handed off
| to their servers. But yeah that'd be the closest I know. I'm
| not sure it would build a local knowledge base though.
| gardnr wrote:
| Edit: I just found this. I'll give it a try today:
| https://github.com/0ssamaak0/SiriLLama
|
| ---
|
| Open WebUI has a voice chat but the voices are not great. I'm
| sure they'd love a PR that integrates StyleTTS2.
|
| You can give it a Serper API Key and it will search the web to
| use as context. It connects to ollama running on a linux box
| with a $300 RTX 3060 with 12GB of VRAM. The 4bit quant of Llama
| 3.1 8B takes up a bit more than 6GB of VRAM which means it can
| run embedding models and STT on the card at the same time.
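|
| The back-of-envelope math, with ballpark figures for
| bits-per-weight and runtime overhead:
|
|     # Rough VRAM estimate for a 4-bit quant of an 8B model.
|     params = 8.0e9
|     bits_per_weight = 4.5     # Q4_K_M averages a bit over 4
|     weights_gb = params * bits_per_weight / 8 / 1e9  # ~4.5 GB
|     overhead_gb = 1.5         # KV cache, CUDA context, buffers
|     print(f"~{weights_gb + overhead_gb:.1f} GB")     # ~6.0 GB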
|
| 12GB is the minimum I'd recommend for running quantized models.
| The RTX 4070 Ti Super is 3x the cost but 7 times "faster" on
| matmuls.
|
| The AMD cards do inference OK but they are a constant source of
| frustration when trying to do anything else. I bought one and
| tried for 3 months before selling it. It's not worth the
| effort.
|
| I don't have any interest in allowing it to run shortcuts. Open
| WebUI has pipelines for integrating function calling.
| HomeAssistant has some integrations if that's the kind of thing
| you are thinking about.
| aledalgrande wrote:
| That SiriLLama project looks awesome! I'll give it a try. I
| also just spun up https://github.com/ItzCrazyKns/Perplexica
| to try a local Perplexity alternative.
| jsemrau wrote:
| The Mistral models are not half as bad for this.
| shahzaibmushtaq wrote:
| I need to have two things of my own that work offline for privacy
| concerns and cost savings:
|
| 1. Local LLM AI models with GUI and command line
|
| 2. > Local LLM-based coding tools do exist (such as Google
| DeepMind's _CodeGemma_ and one from California-based developers
| _Continue_)
| create-username wrote:
| There's no small AI that I know of that masters ancient Greek,
| Latin, English, German and French and that I can run on my
| 18 GB MacBook Pro.
|
| Please correct me if I'm wrong. It would make my life slightly
| more comfortable.
| sparkybrap wrote:
| I agree. Even bilingual (English+1) small models would be very
| useful for processing localized data, e.g. English-French,
| English-German, etc.
|
| Right now the small models (llama 8B) can't handle this type
| of task, although they could if they were trained on bilingual
| data.
| sandspar wrote:
| What advantages do local models have over exterior models? Why
| would I run one locally if ChatGPT works well?
| pieix wrote:
| 1) Offline capability -- pretty cool to be able to debug
| technical problems while flying (or otherwise off grid) with a
| local LLM, and current 8B models are usually good enough for
| the first line of questions that you otherwise would have
| googled.
|
| 2) Privacy
|
| 3) Removing safety filters -- there are some great
| "abliterated" models out there that have had their refusal
| behavior removed. Running these locally and never having your
| request refused due to corporate risk aversion is a very
| different experience to calling a safety-neutered API.
|
| Depending on your use case some, all, or none of these will be
| relevant, but they are undeniable benefits that are very much
| within reach using a laptop and the current crop of models.
| pilooch wrote:
| I run a fine-tuned multimodal LLM as a spam filter (it reads
| emails as images). Game changer. It removes all the stuff I
| wouldn't read anyway, not only spam.
| Archit3ch wrote:
| One use case I've found very convenient: partial screenshot |>
| minicpm-v
|
| Covers 90% of OCR needs with 10% of the effort. No API keys,
| scripting, or network required.
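|
| If you do want to script it, ollama's API accepts base64
| images. A rough sketch, assuming minicpm-v is pulled locally
| and with the screenshot path as a placeholder:
|
|     # OCR a screenshot with a local vision model via ollama.
|     import base64
|     import requests
|
|     with open("screenshot.png", "rb") as f:
|         img = base64.b64encode(f.read()).decode()
|
|     r = requests.post(
|         "http://localhost:11434/api/generate",
|         json={"model": "minicpm-v",
|               "prompt": "Transcribe all text in this image.",
|               "images": [img],
|               "stream": False},
|     )
|     print(r.json()["response"])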
| trash_cat wrote:
| I think it REALLY depends on your use case. Do you want to
| brainstorm, clear out some thoughts, search or solve complex
| tasks?
| Anthony1321 wrote:
| I found this article really eye-opening! It's fascinating to see
| how technology is evolving and impacting our lives in unexpected
| ways. If you're interested in more tech insights, check out
| https://trendytechinfo.com for the latest updates
| RevEng wrote:
| I'm currently working on an LLM-based product for a large company
| that's used in circuit design. Our customers have very strict
| confidentiality requirements since the field is very competitive
| and they all have trade secret technologies that give them
| significant competitive advantages. Using something public like
| ChatGPT is simply out of the question. Their design environments
| are often completely disconnected from the public internet, so
| our tools need to run local models. Llama 3 has worked well for
| us so far and we're looking at other models too. We also prefer
| not being locked in to a specific vendor like OpenAI, since our
| reliance on the model puts us in a poor position to negotiate and
| the future of AI companies isn't guaranteed.
|
| For my personal use, I also prefer to use local models. I'm not a
| fan of OpenAI's shenanigans and Google already abuses its
| customers' data. I also want the ability to make queries on my own
| local files without having to upload all my information to a
| third party cloud service.
|
| Finally, fine tuning is very valuable for improving performance
| in niche domains where public data isn't generally available.
| While online providers do support fine tuning through their
| services, this results in significant lock in as you have to pay
| them to do the tuning in their servers, you have to provide them
| with all your confidential data, and they own the resulting model
| which you can only use through their service. It might be
| convenient at first, but it's a significant business risk.
| fsndz wrote:
| My Thesis: Small language models (SLM)-- models so compact that
| you can run them on a computer with just 4GB of RAM -- are the
| future. SLMs are efficient enough to be deployed on edge devices,
| while still maintaining enough intelligence to be useful.
| https://www.lycee.ai/blog/why-small-language-models-are-the-...
| alanzhuly wrote:
| For anyone looking for a simple alternative for running local
| models beyond just text, Nexa AI has built an SDK that supports
| text, audio (STT, TTS), image generation (e.g., Stable
| Diffusion), and multimodal models! It also has a model hub to
| help you easily find local models optimized for your device.
|
| Nexa AI local model hub: https://nexaai.com/
|
| Toolkit: https://github.com/NexaAI/nexa-sdk
|
| It also comes with a built-in local UI for getting started
| with local models easily, plus an OpenAI-compatible API (with
| JSON schema for function calling and streaming) for local
| development.
|
| You can run the Nexa SDK on any device with a Python environment
| --and GPU acceleration is supported!
|
| Local LLMs, and especially multimodal local models are the
| future. It is the only way to make AI accessible (cost-efficient)
| and safe.
| rthaswrea wrote:
| Another solid option https://www.anaconda.com/products/ai-
| navigator
___________________________________________________________________
(page generated 2024-09-21 23:01 UTC)