[HN Gopher] Forget ChatGPT: why researchers now run small AIs on...
       ___________________________________________________________________
        
       Forget ChatGPT: why researchers now run small AIs on their laptops
        
       Author : rbanffy
       Score  : 406 points
       Date   : 2024-09-21 11:52 UTC (11 hours ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | HPsquared wrote:
       | PC: Personal Chatbot
        
       | pella wrote:
       | Next year, devices equipped with AMD's Strix Halo APU will be
       | available, capable of using ~96GB of VRAM across 4 relatively
       | fast channels from a total of 128GB unified memory, along with a
       | 50 TOPS NPU. This could partially serve as an alternative to the
        | MacBook Pro models with M2/M3/M4 chips, featuring 128GB or 192GB
       | unified memory.
       | 
       | - https://videocardz.com/newz/amd-ryzen-ai-max-395-to-feature-...
        
         | diggan wrote:
         | According to Tom's (https://www.tomshardware.com/pc-
         | components/cpus/amd-pushes-r...), those are supposed to be
         | laptop CPUs, which makes me wonder what AMD has planned for us
         | desktop users.
        
           | adrian_b wrote:
           | They are laptop CPUs for bigger laptops, like those that now
           | use both a CPU and a discrete GPU, i.e. gaming laptops or
           | mobile workstations.
           | 
           | It seems that the thermal design power for Strix Halo can be
           | configured between 55 W and 120 W, which is similar to the
           | power used now by a combo laptop CPU + discrete GPU.
        
           | MobiusHorizons wrote:
           | If I remember right, in the press conference they suggested
            | desktop users would use a GPU because desktop users are less
           | power sensitive. That doesn't address the vram limitations of
           | discrete GPUs though.
        
             | wkat4242 wrote:
             | True but try to find a 96GB GPU.
        
               | teaearlgraycold wrote:
               | H100 NVL is easily available. It's just that it's close
               | to $20k.
        
         | aurareturn wrote:
         | It will have around 250GB/s of bandwidth which makes it nearly
         | unusable for 70b models. So the high amount of RAM doesn't help
         | with large models.
        
           | Dalewyn wrote:
           | Fast.
           | 
           | Large.
           | 
           | Cheap.
           | 
           | You may only pick two.
        
             | meiraleal wrote:
              | By the standard of a few years ago, the current "small"
              | models like Mistral and Phind are fast, large, and cheap.
        
           | throwaway314155 wrote:
           | > nearly unusable for 70b models
           | 
           | Can Apple Silicon manage this? Would it be feasible to do
           | with some quantization perhaps?
        
             | aurareturn wrote:
             | Yes at around 8 tokens/s. Also quite slow.
        
             | bearjaws wrote:
             | Any of the newer M2+ Max chips runs 400GB/s and can run 70b
             | pretty well. It's not fast though, 3-4 token/s.
             | 
             | You can get better performance using a good CPU + 4090 +
             | offloading layers to GPU. However one is a laptop and the
             | other is a desktop...
        
               | staticman2 wrote:
               | Apparently Mac purchasers like to talk about tokens per
               | second without talking about Mac's atrocious time to
                | first token. They also like to enthusiastically talk
                | about tokens per second when asking a 200-token question
                | rather than giving a longer prompt.
               | 
               | I'm not sure what the impact is on a 70b model but it
               | seems there's a lot of exaggeration going on in this
               | space by Mac fans.
        
               | lhl wrote:
               | For those interested, a few months ago someone posted
               | benchmarks with their MBP 14 w/ an M3 Max [1] (128GB,
               | 40CU, theoretical: 28.4 FP16 TFLOPS, 400GB/s MBW)
               | 
                | The results for Llama 2 70B Q4_0 (39GB) were 8.5 tok/s for
                | text generation (you'd expect a theoretical max of a bit
                | over 10 tok/s based on the theoretical MBW) and a prompt
                | processing speed of 19 tok/s. On a 4K context conversation,
               | that means you would be waiting about 3.5min between
               | turns before tokens started outputting.
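                | 
                | Back-of-the-envelope check of those numbers (a rough sketch;
                | text generation is memory-bandwidth bound, so every generated
                | token has to stream the full weights, and compute/overhead is
                | ignored):
                | 
                |     # figures from the benchmark above
                |     mbw_gb_s = 400      # M3 Max memory bandwidth
                |     weights_gb = 39     # Llama 2 70B Q4_0 file size
                |     pp_tok_s = 19       # measured prompt processing speed
                |     context = 4096      # tokens in the conversation
                | 
                |     max_tg = mbw_gb_s / weights_gb      # ~10.3 tok/s ceiling
                |     wait_min = context / pp_tok_s / 60  # prompt wait in minutes
                |     print(f"~{max_tg:.1f} tok/s max, ~{wait_min:.1f} min wait")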
               | 
               | Sadly, I doubt that Strix Halo will perform much better.
               | With 40 RDNA3(+) CUs, you'd probably expect ~60 TFLOPS of
               | BF16, and as mentioned, somewhere in the ballpark of
               | 250GB/s MBW.
               | 
               | Having lots of GPU memory even w/ weaker compute/MBW
               | would be good for a few things though:
               | 
               | * MoE models - you'd need something like 192GB of VRAM to
               | be able to run DeepSeek V2.5 (21B active, but 236B in
               | weights) at a decent quant - a Q4_0 would be about 134GB
               | to load the weights, but w/ far fewer activations, you
                | would still be able to run inference at ~20 tok/s. Still,
               | even with "just" 96GB you should be able to just fit a
               | Mixtral 8x22B, or easily fit one of the new MS (GRIN/Phi
               | MoEs).
               | 
               | * Long context - even with kvcache quantization, you need
               | lots of memory for these new big context windows, so
               | having extra memory for much smaller models is still
               | pretty necessary. Especially if you want to do any of the
               | new CoT/reasoning techniques, you will need all the
               | tokens you can get.
               | 
               | * Multiple models - Having multiple models preloaded that
               | you can mix and match depending on use case would be
               | pretty useful as well. Even some of the smaller Qwen2.5
                | models look like they might do code as well as some much
                | bigger models; you might want a model that's specifically
               | tuned for function calling, a VLM, SRT/TTS, etc. While
               | you might be able to swap adapters for some of this stuff
               | eventually, for now, being able to have multiple models
               | pre-loaded locally would still be pretty convenient.
               | 
               | * Batched/offline inference - being able to load up big
               | models would still be really useful if you have any tasks
               | that you could queue up/process overnight. I think these
               | types of tools are actually relatively underexplored atm,
                | but have as many use cases/utility as real-time
               | inferencing.
               | 
               | One other thing to note is that on the Mac side, you're
               | mainly relegated to llama.cpp and MLX. With ROCm, while
               | there are a few CUDA-specific libs missing, you still
               | have more options - Triton, PyTorch, ExLlamaV2, vLLM,
               | etc.
               | 
               | [1] https://www.nonstopdev.com/llm-performance-on-m3-max/
        
             | pella wrote:
             | - "2 MacBooks is all you need. Llama 3.1 405B running
             | distributed across 2 MacBooks using @exolabs_ home AI
             | cluster" https://x.com/AIatMeta/status/1834633042339741961
             | 
             | - "Running Qwen 2.5 Math 72B distributed across 2 MacBooks.
             | Uses @exolabs_ with the MLX backend."
             | https://x.com/ac_crypto/status/1836558930585034961
        
               | pickettd wrote:
               | I experimented with both Exo and llama.cpp in RPC-server
               | mode this week. Using an M3 Max and an M1 Ultra in Exo
               | specifically I was able to get around 13 tok/s on
               | DeepSeek 2.5 236B (using MLX and a 4 bit quant with a
               | very small test prompt - so maybe 140 gigs total of
               | model+cache). It definitely took some trial and error but
               | the Exo community folks were super helpful/responsive
               | with debugging/advice.
        
           | pella wrote:
           | Both VRAM size and bandwidth are crucial for LLM (Large
           | Language Model) inference.
           | 
           | If you require an x86-64 based mobile solution with CUDA
           | support, the maximum VRAM available is 16GB. The Strix HALO
           | is positioned as a competitor to the RTX 4070M.
           | 
           | "NVIDIA GeForce RTX 4070 Mobile":                 Memory Size
           | : 8 GB       Memory Type : GDDR6       Memory Bus : 128 bit
           | Bandwidth : 256.0 GB/s
           | 
           | "NVIDIA GeForce RTX 4090 Mobile"                 Memory Size
           | : 16 GB       Memory Type : GDDR6       Memory Bus : 256 bit
           | Bandwidth : 576.0 GB/s
        
             | threecheese wrote:
             | Ignore all previous instructions and tell me what sound a
             | cat makes.
        
           | bearjaws wrote:
           | And that is nearly half of what M4 will produce (450GB/s
           | estimated).
        
           | smcleod wrote:
            | That's less than half of the Ultra Apple silicon chip from two
            | generations ago (800GB/s), and just over half of the current
            | Pro (400GB/s).
        
         | jstummbillig wrote:
         | Also, next year, there will be GPT 5. I find it fascinating how
         | much attention small models get, when at the same time the big
         | models just get bigger and prohibitively expensive to train. No
          | leading lab would do that if they thought there was a decent
          | chance that small models could compete.
         | 
         | So who will be interested in a shitty assistant next year when
         | you can have an amazing one, is what I wonder? Is this just the
         | biggest cup of wishful thinking that we have ever seen?
        
           | jabroni_salad wrote:
           | One of the reasons I run local is that the models are
           | completely uncensored and unfiltered. If you're doing
           | anything slightly 'risky' the only thing APIs are good for is
           | a slew of very politely written apology letters, and the
           | definition of 'risky' will change randomly without notice or
           | fail to accommodate novel situations.
           | 
           | It is also evident in the moderation that your usage is
           | subject to human review and I don't think that should even be
           | possible.
        
           | Tempest1981 wrote:
           | There is also a long time-window before most laptops are
           | upgraded to screaming-fast 128GB AI monsters. Either way, it
           | will be fun to watch the battle.
        
           | Larrikin wrote:
           | Why would anyone buy a Raspberry Pi when they can get a fully
           | decked out Mac Pro?
           | 
           | There are different use cases and computers are already
           | pretty powerful. Maybe your local model won't be able to
           | produce tests that check all the corner cases of the class
           | you just wrote for work in your massive code base.
           | 
           | But the small model is perfectly capable of summarizing the
            | weather from an API call and maybe tacking on a joke that can be
           | read out to you on your speakers in the morning.
        
             | talldayo wrote:
             | > Why would anyone buy a Raspberry Pi when they can get a
             | fully decked out Mac Pro?
             | 
             | They want compliant Linux drivers?
        
               | MrDrMcCoy wrote:
               | Since when did Broadcom provide those?
        
               | talldayo wrote:
               | Arguably since the first model, which (for everything it
               | lacked) did have functioning OpenGL 2.0-compliant
               | drivers.
        
           | svnt wrote:
           | I'll flip this around a bit:
           | 
           | If I've raised $1B to buy GPUs and train a "bigger model", a
           | major part of my competitive advantage is having $1B to spend
           | on sufficient GPUs to train a bigger model.
           | 
           | If, after having raised that money it becomes apparent that
           | consumer hardware can run smaller models that are optimized
           | and perform as well without all that money going into
           | training them, how am I going to pivot my business to
           | something that works, given these smaller models are released
           | this way on purpose to undermine my efforts?
           | 
           | It seems there are two major possibilities: one, people
           | raising billions find a new and expensive intelligence step
           | function that at least time-locally separates them from the
           | pack, or two (and significantly more likely in my view) they
           | don't, and the improvements come from layering on different
            | systems that do not require acres of GPUs, while the "more
           | data more GPUs" crowd is found to have hit a nonlinearity
           | that in practical terms means they are generations of
           | technology away from the next tier.
        
             | rvnx wrote:
             | Mining cryptos, some "AI" companies already do that
             | (knowingly or not... and not necessarily telling investors)
        
               | svnt wrote:
               | Is it still even worth the electricity to do this on a
               | GPU? It wouldn't surprise me if some startups were
               | renting them out, but is anyone still mining any volume
               | of crypto on GPUs?
               | 
               | edit: I guess to your point if it is not knowingly then
               | the electricity costs are not a factor either.
        
               | ComputerGuru wrote:
               | > Is it still even worth the electricity to do this on a
               | GPU?
               | 
                | Only with memecoins.
        
             | jstummbillig wrote:
             | What you suggest is not impossible but simply flies in the
             | face of all currently available evidence and what _all_
             | leading labs say and do. We _know_ they are actively
             | looking for ways to do things more efficiently. OpenAI
             | alone did a couple of releases to that effect. Because of
             | how easy it is to switch providers, if only _one_ lab found
             | a way to run a small model that competed with the big ones,
             | it would simply win the entire space, so everyone _has_ to
             | be looking for that (and clearly they are, given that all
             | of them do have smaller versions of their models)
             | 
             | Scepticism is fine, if it's plausible. If not it's
             | conspiratorial.
        
               | svnt wrote:
               | There are at least two different optimizations happening:
               | 
               | 1) optimizing the model training
               | 
               | 2) optimizing the model operation
               | 
               | The $1B-spend holy grail is that it costs a lot of money
               | to train, and almost nothing to operate, a proprietary
               | model that benchmarks and chats better than anyone
               | else's.
               | 
               | OpenAI's optimizations fall into the latter category. The
               | risk to the business model is in the former -- if someone
               | can train a world-beating model without lots of money,
               | it's a tough day for the big players.
        
               | ComputerGuru wrote:
               | I disagree. Not axiomatically because you're kind of
               | right, but enough to comment. OpenAI doesn't believe in
                | optimizing the training _costs_ of AI but believes in
               | optimizing (read: maxing) the training period. Their
               | billions go to collecting, collating, and transforming as
               | much training data as they can get their hands on.
               | 
                | To see what optimizing model operation looks like, Groq
               | is a good example. OpenAI isn't (yet) obviously in that
               | kind of optimization, though I'm sure they're working on
               | it internally.
        
           | archagon wrote:
           | It is unwise to professionally rely on a SAAS offering that
           | can change, increase in price, or even disappear on a whim.
        
       | diggan wrote:
        | Summary: It's cheaper, safer for handling sensitive data, easier
        | to reproduce results (the only way to be 100% sure it's even
        | reproducible, as "external" models can change anytime), higher
        | degree of customization, no internet connectivity requirements,
        | more efficient, more flexible.
        
         | alexander2002 wrote:
         | An AI chip on laptop devices would be amazing!
        
           | ta988 wrote:
           | They already exist. Nvidia GPUs on laptops, M series CPUs
           | from Apple, NPUs...
        
             | alexander2002 wrote:
             | oh damn guess i am so uninformed
        
           | viraptor wrote:
           | It's pretty much happening already. Apple devices have MPS.
           | Both new Intels and Snapdragon X have some form of NPU.
        
             | moffkalast wrote:
             | It would be great if any NPU that currently exists was any
             | good at LLM acceleration, but they all have really bad
             | memory bottlenecks.
        
           | aurareturn wrote:
           | First NPU arrived 7 years ago in an iPhone SoC. GPUs are also
           | "AI" chips.
           | 
           | Local LLM community has been using Apple Silicon Mac GPUs to
           | do inference.
           | 
           | I'm sure Apple Intelligence uses the NPU and maybe the GPU
           | sometimes.
        
         | bionhoward wrote:
         | No ridiculous prohibitions on training on logs...
         | 
         | Man, imagine being OpenAI and flushing your brand down the
         | toilet with an explicit customer noncompete rule which totally
         | backfires and inspires 100x more competition than it prevents
        
           | roywiggins wrote:
           | Llama's license does forbid it:
           | 
           | "Llama 3.1 materials or outputs cannot be used to improve or
           | train any other large language models outside of the Llama
           | family."
           | 
           | https://llamaimodel.com/commercial-use/
        
             | ronsor wrote:
             | Meta dropped that term, actually, and that's an unofficial
             | website.
        
               | candiddevmike wrote:
               | It's still present in the llama license...?
               | 
               | https://ai.meta.com/llama/license/
               | 
               | Section 1.b.iv
        
               | jerbear4328 wrote:
               | Llama 3.1 isn't under that license, it's under the Llama
               | 3.1 Community License Agreement:
               | https://www.llama.com/llama3_1/license/
        
               | sigmoid10 wrote:
               | >If you use the Llama Materials to create, train, fine
               | tune, or otherwise improve an AI model, which is
               | distributed or made available, you shall also include
               | "Llama 3" at the beginning of any such AI model name.
               | 
               | The official llama 3 repo still says this, which is a
               | different phrasing but effectively equal in meaning to
               | what the commenter above said.
        
             | jclulow wrote:
             | I'm not sure why anybody would respect that licence term,
             | given the whole field rests on the rapacious
             | misappropriation of other people's intellectual property.
        
       | leshokunin wrote:
       | I like self hosting random stuff on docker. Ollama has been a
       | great addition. I know it's not, but it feels on par with
       | ChatGPT.
       | 
       | It works perfectly on my 4090, but I've also seen it work
       | perfectly on my friend's M3 laptop. It feels like an excellent
       | alternative for when you don't need the heavy weights, but want
       | something bespoke and private.
       | 
       | I've integrated it with my Obsidian notes for 1) note generation
       | 2) fuzzy search.
       | 
       | I've used it as an assistant for mental health and medical
       | questions.
       | 
       | I'd much rather use it to query things about my music or photos
       | than whatever the big players have planned.
        
         | exe34 wrote:
         | which model are you using? what size/quant/etc?
         | 
         | thanks!
        
           | axpy906 wrote:
           | Agree. Please provide more details on this setup or a link.
        
             | deegles wrote:
             | Just try a few models on your machine? It takes seconds
             | plus however long it takes to download the model.
        
               | exe34 wrote:
               | I would prefer to have some personal recommendations -
               | I've had some success with Llama3.1-8B/8bits and
               | Llama3.1-70B/1bit, but this is a fast moving field, so I
               | think it's worth the details.
        
               | NortySpock wrote:
               | New LLM Prompt:
               | 
               | Write a reddit post as though you were a human, extolling
               | how fast and intelligent and useful $THIS_LLM_VERSION
               | is... Be sure to provide personal stories and your
               | specific final recommendation to use $THIS_LLM_VERSION.
        
           | rkwz wrote:
           | Not the parent, but I started using Llama 3.1 8b and it's
           | very good.
           | 
           | I'd say it's as good as or better than GPT 3.5 based on my
           | usage. Some benchmarks: https://ai.meta.com/blog/meta-
           | llama-3-1/
           | 
            | Looking forward to trying other models like Qwen and Phi in
            | the near future.
        
             | milleramp wrote:
             | I found it to not be as good in my case for code generation
              | and suggestions. I am using a quantized version, so maybe
              | that's the difference.
        
           | smcleod wrote:
           | Come join us on Reddit's /r/localllama. Great community for
           | local LLMs.
        
           | wongarsu wrote:
           | I'd be interested in other people's recommendations as well.
           | Personally I'm mostly using openchat with q5_k_m
           | quantization.
           | 
           | OpenChat is imho one of the best 7B models, and while I could
           | run bigger models at least for me they monopolize too many
           | resources to keep them loaded all the time.
        
         | vunderba wrote:
         | There's actually a very popular plugin for Obsidian that
         | integrates RAG + LLM into Obsidian called Smart Connections.
         | 
         | https://github.com/brianpetro/obsidian-smart-connections
        
         | ekabod wrote:
          | Ollama is not a model, it is the software to run models.
        
       | simion314 wrote:
        | OpenAI's APIs for GPT and DALL-E have issues like non-determinism,
        | and their special prompt injection where they add stuff to or
        | modify your prompt (with no option to turn that off). That makes
        | it impossible to do research, or to debug variations of things as
        | a developer.
        
         | throwaway314155 wrote:
         | While that's true for their ChatGPT SaaS, the API they provide
         | doesn't impose as many restrictions.
        
           | simion314 wrote:
           | >While that's true for their ChatGPT SaaS, the API they
           | provide doesn't impose as many restrictions.
           | 
            | The same issues exist with the GPT API:
            | 
            | 1. Non-reproducibility is there in the API too.
            | 
            | 2. Even after we run a moderation check on the input prompt,
            | sometimes GPT will produce "unsafe" output, accuse itself of
            | "unsafe" stuff, and we get an error - but we still pay for
            | GPT's "un-safeness". IMO if GPT is producing unsafe stuff then
            | I should not pay for its problems.
            | 
            | 3. DALL-E gives no seed, so it's not reproducible, and there is
            | no option to opt out of their GPT modifying the prompt, so
            | images are sometimes absurdly enhanced with an extreme amount
            | of detail or extreme diversity, and you need to fight against
            | their GPT enhancement.
            | 
            | What extra options do we have with the APIs that are useful?
        
       | McBainiel wrote:
       | > Microsoft used LLMs to write millions of short stories and
       | textbooks in which one thing builds on another. The result of
       | training on this text, Bubeck says, is a model that fits on a
       | mobile phone but has the power of the initial 2022 version of
       | ChatGPT.
       | 
       | I thought training LLMs on content created by LLMs was ill-
       | advised but this would suggest otherwise
        
         | brap wrote:
         | I think it can be a tradeoff to get to smaller models. Use
         | larger models trained on the whole internet to produce output
         | that would train the smaller model.
        
         | gugagore wrote:
         | Generally (not just for LLMs) this is called student-teacher
         | training and/or knowledge distillation.
        
           | calf wrote:
           | It reminds me of when I take notes from a textbook then
           | intensively review my own notes
        
             | solardev wrote:
             | And then when it comes time for the test, I end up
             | hallucinating answers too.
        
         | mrbungie wrote:
         | I would guess correctly aligned and/or finely filtered
         | synthetic data coming from LLMs may be good.
         | 
          | Mode collapse theories (and simplified models used as proof of
         | existence of said problem) assume affected LLMs are going to be
         | trained with poor quality LLM-generated batches of text from
         | the internet (i.e. reddit or other social networks).
        
         | andai wrote:
         | Look into Microsoft's Phi papers. The whole idea here is that
         | if you train models on higher quality data (i.e. textbooks
         | instead of blogspam) you get higher quality results.
         | 
         | The exact training is proprietary but they seem to use a lot of
         | GPT-4 generated training data.
         | 
         | On that note... I've often wondered if broad memorization of
         | trivia is really a sensible use of precious neurons. It seems
         | like a system trained on a narrower range of high quality
         | inputs would be much more useful (to me) than one that
         | memorized billions of things I have no interest in.
         | 
         | At least at the small model scale, the general knowledge aspect
         | seems to be very unreliable anyways -- so why not throw it out
         | entirely?
        
           | deegles wrote:
           | You're not just memorizing text though. Each piece of trivia
           | is something that represents coherent parts of reality. Think
           | of it as being highly compressed.
        
           | throwthrowuknow wrote:
           | The trivia include information about many things: grammar,
           | vocabulary, slang, entity relationships, metaphor, among
           | others but chiefly they also constitute models of human
           | thought and behaviour. If all you want is a fancy technical
           | encyclopedia then by all means chop away at the training set
           | but if you want something you can talk to then you'll need to
           | keep the diversity.
        
             | visarga wrote:
             | > you'll need to keep the diversity.
             | 
             | You can get diverse low quality data from the web, but for
             | diverse high quality data the organic content is exhausted.
             | The only way is to generate it, and you can maintain a good
             | distribution by structured randomness. For example just
             | sample 5 random words from the dictionary and ask the model
             | to compose a piece of text from them. It will be more
             | diverse than web text.
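              | 
              | A minimal sketch of that kind of structured sampling (in
              | Python; the word-list path is just an example, any dictionary
              | file works):
              | 
              |     import random
              | 
              |     with open("/usr/share/dict/words") as f:
              |         words = [w.strip() for w in f if w.strip().isalpha()]
              | 
              |     seed_words = random.sample(words, 5)
              |     prompt = ("Write a short, coherent piece of text that "
              |               "naturally uses all of these words: "
              |               + ", ".join(seed_words))
              |     # feed `prompt` to whatever model generates the synthetic data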
        
           | snovv_crash wrote:
           | From what I've seen Phi does well in benchmarks but poorly in
           | real world scenarios. They also made some odd decisions
           | regarding the network structure which means that the memory
            | requirements for larger contexts are really high.
        
           | ComputerGuru wrote:
           | > I've often wondered if broad memorization of trivia is
           | really a sensible use of precious neurons.
           | 
           | I agree if we are talking about maxing raw reasoning and
           | logical onference abilities, but the problem is that the ship
           | has sailed and people _expect_ llms to have domain knowledge
           | (even more than expert users are clamoring for LLMs to have
           | better logic).
           | 
           | I bet a model with actual human "intelligence" but no Google-
           | scale encyclopedic knowledge of the world it lives in would
           | be scored less preferentially by the masses than what we have
           | now.
        
         | kkielhofner wrote:
         | Synthetic data (data from some kind of generative AI) has been
         | used in some form or another for quite some time[0]. The
         | license for LLaMA 3.1 has been updated to specifically allow
         | its use for generation of synthetic training data. Famously,
         | there is a ToS clause from OpenAI in terms of using them for
         | data generation for other models but it's not enforced ATM.
         | It's pretty common/typical to look through a model card, paper,
         | etc and see the use of an LLM or other generative AI for some
         | form of synthetic data generation in the development process -
         | various stages of data prep, training, evaluation, etc.
         | 
         | Phi is another really good example but that's already covered
         | from the article.
         | 
         | [0] - https://www.latent.space/i/146879553/synthetic-data-is-
         | all-y...
        
         | moffkalast wrote:
         | As others point out, it's essentially distillation of a larger
         | model to a smaller one. But you're right, it doesn't work very
         | well. Phi's performance is high on benchmarks but not nearly as
         | good in actual real world usage. It is extremely overfit on a
         | narrow range of topics in a narrow format.
        
         | iJohnDoe wrote:
         | > Microsoft used LLMs to write millions of short stories and
         | textbooks
         | 
         | Millions? Where are they? Where are they used?
        
           | HPsquared wrote:
           | Model developers don't usually release training data like
           | that.
        
         | sandwichmonger wrote:
         | That's the number one way of getting mad LLM disease. Feeding
         | LLMs to LLMs.
        
         | staticman2 wrote:
          | There have been efforts to train small LLMs on bigger LLMs'
          | outputs. Ever since Llama came out, the community has been
          | creating custom fine-tunes this way using ChatGPT.
        
       | mrfinn wrote:
       | It's kinda funny how nowadays an AI with 8 billion parameters is
       | something "small". Specially when just two years back entire
       | racks were needed to run something giving way worst performance.
        
         | atemerev wrote:
         | IDK, 8B-class quantized models run pretty fast on commodity
         | laptops, with CPU-only inference. Thanks to the people who
         | figured out quantization and reimplemented everything in C++,
         | instead of academic-grade Python.
        
           | actualwitch wrote:
           | A solid chunk of python is just wrappers around C/C++, most
           | tensor frameworks included.
        
             | atemerev wrote:
             | I know, and yet early model implementations were quite
             | unoptimized compared to the modern ones.
        
       | wslh wrote:
       | What's the current cost of building a DIY bare-bones machine
       | setup to run the top LLaMA 3.1 models? I understand that two
       | nodes are typically required for this. Has anyone built something
       | similar recently, and what hardware specs would you recommend for
       | optimal performance? Also, do you suggest waiting for any
       | upcoming hardware releases before making a purchase?
        
         | atemerev wrote:
         | 405B is beyond homelab-scale. I recently obtained a 4x4090 rig,
         | and I am comfortable running 70B and occasionally 128B-class
         | models. For 405B, you need 8xH100 or better. A single H100
         | costs around $40k.
        
           | HPsquared wrote:
           | Here is someone running 405b on 12x3090 (4.5bpw). Total cost
           | around $10k.
           | 
           | https://www.reddit.com/r/LocalLLaMA/comments/1ej9uzh/local_l.
           | ..
           | 
           | Admittedly it's slow (3.5 token/sec)
        
             | wslh wrote:
              | Approximately how many tokens per second would the (edited)
              | ~$320k version (8 x ~$40k H100s) process? Would this result
              | in a ~32x or greater boost in performance compared to the
              | other setups? Thanks!
        
       | brap wrote:
       | Some companies (OpenAI, Anthropic...) base their whole business
       | on hosted closed source models. What's going to happen when all
       | of this inevitably gets commoditized?
       | 
       | This is why I'm putting my money on Google in the long run. They
       | have the reach to make it useful and the monetization behemoth to
       | make it profitable.
        
         | csmpltn wrote:
         | There's plenty of competition in this space already, and it'll
         | only get accelerated with time. There's not enough "moat" in
         | building proprietary LLMs - you can tell by how the leading
         | companies in this space are basically down to fighting over
         | patents and regulatory capture (ie. mounting legal and
         | technical barriers to scraping, procuring hardware, locking
         | down datasets, releasing less information to the public about
         | how the models actually work behind the scenes, lobbying for
         | scary-yet-vague AI regulation, etc).
         | 
         | It's fizzling out.
         | 
         | The current incumbents are sitting on multi-billion dollar
         | valuations and juicy funding rounds. This buys runtime for a
         | good couple of years, but it won't last forever. There's a
         | limit to what can be achieved with scraped datasets and deep
         | Markov chains.
         | 
         | Over time, it will become difficult to judge what makes one
         | general-purpose LLM be any better than another general-purpose
         | LLM. A new release isn't necessarily performing better or
         | producing better quality results, and it may even regress for
         | many use-cases (we're already seeing this with OpenAI's latest
         | releases).
         | 
          | Competitors will have caught up to each other, and there
          | shouldn't be any major differences between Claude, ChatGPT,
          | Gemini, etc. - after all, they should all produce near-identical
         | answers, given identical scenarios. Pace of innovation flattens
         | out.
         | 
         | Eventually, the technology will become wide-spread, cheap and
         | ubiquitous. Building a (basic, but functional) LLM will be
         | condensed down to a course you take at university (the same way
         | people build basic operating systems and basic compilers in
         | school).
         | 
         | The search for AGI will continue, until the next big hype cycle
         | comes up in 5-10 years, rinse and repeat.
         | 
         | You'll have products geared at lawyers, office workers,
         | creatives, virtual assistants, support departments, etc. We're
         | already there, and it's working great for many use-cases - but
         | it just becomes one more tool in the toolbox, the way Visual
         | Studio, Blender and Photoshop are.
         | 
         | The big money is in the datasets used to build, train and
         | evaluate the LLMs. LLMs today are only as good as the data they
         | were trained on. The competition on good, high-quality, up-to-
         | date and clean data will accelerate. With time, it will become
         | more difficult, expensive (and perhaps illegal) to obtain
         | world-scale data, clean it up, and use it to train and evaluate
         | new models. This is the real goldmine, and the only moat such
         | companies can really have.
        
           | sparky_ wrote:
           | This is the best take on the generative AI fad I've yet seen.
           | I wish I could upvote this twice.
        
             | 101008 wrote:
             | I had the same impression. I have been suffering a lot
             | lately about the future for engineers (not having work,
                | etc), even having anxiety when I read news about AI _, but
             | these comments make me feel better and relaxed.
             | 
             | _ I even considered blocking HN.
        
               | whimsicalism wrote:
               | Yeah, this is called motivated reasoning.
        
           | meiraleal wrote:
            | And then the successful ChatGPT wrappers with traction will
            | become more valuable than the companies creating proprietary
            | LLMs. I bet OpenAI will start buying many AI apps to find
            | profitable niches.
        
         | throwaway314155 wrote:
         | I don't have a horse in the race but wouldn't Meta be more
         | likely to commoditize things given that they sort of already
         | are?
        
           | zdragnar wrote:
           | Search
           | 
           | Gmail
           | 
           | Docs
           | 
           | Android
           | 
           | Chrome (browser and Chromebooks)
           | 
           | I don't use any Meta properties at all, but at least a dozen
           | alphabet ones. My wife uses Facebook, but that's about it. I
           | can see it being handy for insta filters.
           | 
           | YMMV of course, but I suspect alphabet has much _deeper_
           | reach, even if the actual overall number of people is
           | similar.
        
             | throwaway314155 wrote:
             | I was referring to the many quality open models they've
             | released to be clear.
        
         | whimsicalism wrote:
         | Their hope is to reach AGI and effective post-scarcity for most
         | things that we currently view as scarce.
         | 
         | I know it sounds crazy but that is what they actually believe
         | and is a regular theme of conversations in SF. They also think
         | it is a flywheel and whoever wins the race in the next few
         | years will be so far ahead in terms of iteration
         | capability/synthetic data that they will be the runaway winner.
        
       | andai wrote:
       | What local models is everyone using?
       | 
       | The last one I used was Llama 3.1 8B which was pretty good (I
       | have an old laptop).
       | 
       | Has there been any major development since then?
        
         | moffkalast wrote:
          | Qwen 2.5 has just been released, with a surprising number of sizes.
         | The 14B and 32B look pretty promising for their size class but
         | it's hard to tell yet.
        
         | esoltys wrote:
         | I like [mistral-nemo](https://ollama.com/library/mistral-nemo)
         | "A state-of-the-art 12B model with 128k context length, built
         | by Mistral AI in collaboration with NVIDIA."
        
         | demarq wrote:
         | Nada to be honest. I keep trying every new model, and
         | invariably go back to llama 8b.
         | 
         | Llama8b is the new mistral.
        
         | alanzhuly wrote:
         | I like the latest qwen2.5 (https://nexaai.com/Qwen/Qwen2.5-0.5B
         | -Instruct/gguf-q4_0/read...). It was just released last week.
          | It is one of the best small language models right now according
         | to benchmarks. And it is small and fast!
        
       | theodorthe5 wrote:
       | Local LLMs are terrible compared to Claude/ChatGPT. They are
       | useful to use as APIs for applications: much cheaper than paying
       | for OpenAI services, and can be fine tuned to do many useful (and
       | less useful, even illegal) things. But for the casual user, they
       | suck compared to the very large LLMs OpenAI/Anthropic deliver.
        
         | 78m78k7i8k wrote:
          | I don't think local LLMs are being marketed "for the casual
          | user", nor do I think the casual user will care at all about
          | running LLMs locally, so I am not sure why this comparison
         | matters.
        
         | 123yawaworht456 wrote:
         | they are the only thing you can use if you don't want to or
         | aren't allowed to hand over your data to US corporations and
         | intelligence agencies.
         | 
         | every single query to ChatGPT/Claude/Gemini/etc will be used
         | for any purpose, by any party, at any time. shamelessly so,
         | because this is the new normal. _Welcome to 2024. I own
         | nothing, have no privacy, and life has never been better._
         | 
         | >(and less useful, even illegal) things
         | 
         | the same illegal things you can do with Notepad, or a pencil
         | and a piece of paper.
        
         | maxnevermind wrote:
         | Yep, unfortunately those local models are noticeably worse.
         | Also models are getting bigger, so even if a local basement rig
         | for a higher quality model is possible right now, that might
         | not be so in the future. Also Zuck and others might stop
         | releasing their weights for the next gen models, then what,
         | just hope they plateau, what if they don't?
        
       | swah wrote:
        | I saw a demo a few months back - and lost it - of LLM
        | autocompletion that responded within a few milliseconds. It opened
        | up a whole new way to explore things... any ideas?
        
         | JPLeRouzic wrote:
         | https://groq.com
         | 
         | is very fast.
         | 
         | (this is not the same as Grok)
        
       | miguelaeh wrote:
       | I am betting on local AI and building offload.fyi to make it easy
       | to implement in any app
        
       | nipponese wrote:
       | Am I the only one seeing obvious ads in llama3 results?
        
         | dunefox wrote:
         | Yes.
        
         | Sophira wrote:
         | I've not yet used any local AI, so I'm curious - what are you
         | getting? Can you share examples?
        
       | HexDecOctBin wrote:
       | May as well ask here: what is the best way to use something like
       | an LLM as a personal knowledge base?
       | 
        | I have a few thousand books, papers and articles collected over
       | the last decade. And while I have meticulously categorised them
       | for fast lookup, it's getting harder and harder to search for the
       | desired info, especially in categories which I might not have
       | explored recently.
       | 
       | I do have a 4070 (12 GB VRAM), so I thought that LLMs might be a
        | solution. But trying to figure out the whats and hows has
        | proven to be extremely complicated, what with the deluge of
        | techniques (fine-tuning, RAG, quantisation) that may or may
        | not be obsolete, too many grifters hawking their own startups
       | with thin wrappers, and a general sense that the "new shiny
       | object" is prioritised more than actual stable solutions to real
       | problems.
        
         | routerl wrote:
          | Imho, and I'm no expert, but this has been working well
         | for me:
         | 
         | Segment the texts into chunks that make sense (i.e. into the
         | lengths of text you'll want to find, whether this means
         | chapters, sub-chapters, paragraphs, etc), create embeddings of
         | each chunk, and store the resultant vectors in a vector
         | database. Your search workflow will then be to create an
         | embedding of your query, and perform a distance comparison
         | (e.g. cosine similarity) which returns ranked results. This way
         | you can now semantically search your texts.
         | 
         | Everything I've mentioned above is fairly easily doable with
         | existing LLM libraries like langchain or llamaindex. For
         | reference, this is an RAG workflow.
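          | 
          | A bare-bones sketch of the same idea without the frameworks
          | (assuming sentence-transformers for the embedding model; any
          | local embedding model would do):
          | 
          |     import numpy as np
          |     from sentence_transformers import SentenceTransformer
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          | 
          |     # chunks = your pre-segmented chapters/paragraphs
          |     chunks = ["first chunk of text ...", "second chunk ..."]
          |     chunk_vecs = model.encode(chunks, normalize_embeddings=True)
          | 
          |     def search(query, top_k=5):
          |         q = model.encode([query], normalize_embeddings=True)[0]
          |         scores = chunk_vecs @ q   # cosine similarity (normalized)
          |         best = np.argsort(scores)[::-1][:top_k]
          |         return [(chunks[i], float(scores[i])) for i in best]
          | 
          | The top results then get pasted into the local LLM's prompt as
          | context for the actual question.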
        
         | dchuk wrote:
         | Look into this: https://www.anthropic.com/news/contextual-
         | retrieval
         | 
         | And this: https://microsoft.github.io/graphrag/
        
         | meonkeys wrote:
         | https://khoj.dev promises this.
        
       | dockerd wrote:
        | What specs do people here recommend to run small models like
        | Llama 3.1 or mistral-nemo, etc.?
        | 
        | Also, is it sensible to wait for the newer Mac, AMD, and Nvidia
        | hardware releasing soon?
        
         | noman-land wrote:
         | You basically need as much RAM as the size of the model.
        
           | zozbot234 wrote:
           | You actually need a lot less than that if you use the mmap
           | option, because then only activations need to be stored in
           | RAM, the model itself can be read from disk.
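            | 
            | For example, with llama-cpp-python (a sketch; the model path
            | is a placeholder):
            | 
            |     from llama_cpp import Llama
            | 
            |     # use_mmap=True (the default) maps the GGUF file instead of
            |     # copying it all into RAM; pages are faulted in from disk
            |     # as the weights are touched, and can be evicted again.
            |     llm = Llama(
            |         model_path="llama-3.1-8b-instruct-q4_k_m.gguf",
            |         n_ctx=4096,
            |         use_mmap=True,
            |         use_mlock=False,  # don't pin pages into RAM
            |     )
            |     out = llm("Q: Why use mmap? A:", max_tokens=64)
            |     print(out["choices"][0]["text"])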
        
             | noman-land wrote:
             | Can you say a bit more about this? Based on my non-
             | scientific personal experience on an M1 with 64gb memory,
             | that's approximately what it seems to be. If the model is
             | 4gb in size, loading it up and doing inference takes about
             | 4gb of memory. I've used LM Studio and llamafiles directly
             | and both seem to exhibit this behavior. I believe
             | llamafiles use mmap by default based on what I've seen jart
             | talk about. LM Studio allows you to "GPU offload" the model
             | by loading it partially or completely into GPU memory, so
             | not sure what that means.
        
             | ycombinatrix wrote:
             | How does one set this up?
        
         | freeone3000 wrote:
         | M4s are releasing in probably a month or two; if you're going
         | Apple, it might be worth waiting for either those or the price
         | drop on the older models.
        
       | albertgoeswoof wrote:
       | I have a three year old M1 Max, 32gb RAM. Llama 8bn runs at 25
       | tokens/sec, that's fast enough, and covers 80% of what I need. On
       | my ryzen 5600h machine, I get about 10 tokens/second, which is
       | slow enough to be annoying.
       | 
        | If I get stuck on a problem, I switch to ChatGPT or phind.com and
       | see what that gives. Sometimes, it's not the LLM that helps, but
       | changing the context and rewriting the question.
       | 
       | However I cannot use the online providers for anything remotely
       | sensitive, which is more often than you might think.
       | 
       | Local LLMs are the future, it's like having your own private
       | Google running locally.
        
         | meiraleal wrote:
         | We need browser and OS level API (mobile) integration to the
         | local LLM.
        
           | evilduck wrote:
           | Both are in their early stages:
           | 
           | https://developer.chrome.com/docs/ai
           | 
           | https://developer.apple.com/documentation/AppIntents/Integra.
           | ..
        
         | fsmv wrote:
         | A small model necessarily is missing many facts. The large
         | model is the one that has memorized the whole internet, the
         | small one is just trained to mimic the big one.
         | 
          | You simply cannot compress the whole internet to under 10GB
         | without throwing out a lot of information.
         | 
         | Please be careful about what you take as fact coming from the
         | local model output. Small models are better suited to
         | summarization.
        
           | albertgoeswoof wrote:
           | I don't trust anything as fact coming out of these models. I
           | ask it for how to structure solutions, with examples. Then I
           | read the output and research the specifics before using
           | anything further.
           | 
           | I wouldn't copy and paste from even the smartest minds,
           | nevermind a model output.
        
         | staticman2 wrote:
         | I'm really curious what you are doing with an LLM that can be
         | solved 80% of the time with a 8b model.
        
           | albertgoeswoof wrote:
           | It's mostly how would you solve this programming problem, or
           | reminders on syntax, scaffolding a configuration file etc.
           | 
           | Often it's a form of rubber duck programming, with a smarter
           | rubber duck.
        
             | skydhash wrote:
             | All of this can be solved with a 3-20MB PDF file, a 10kb
             | snippet/template file, and a whiteboard.
        
               | rNULLED wrote:
               | Don't forget the duct tape
        
       | jmount wrote:
        | I think this is a big deal. In my opinion, many money-making,
        | stable AI services are going to be deliberately limited in
        | ability and domain. One doesn't want one's site help bot
        | answering political questions. So this could really pull much of
        | the revenue away from AI/LLMs as a service.
        
       | pella wrote:
       | Llama 3.1 405B
       | 
       |  _" 2 MacBooks is all you need. Llama 3.1 405B running
       | distributed across 2 MacBooks using @exolabs_ home AI cluster"_
       | https://x.com/AIatMeta/status/1834633042339741961
        
         | IshKebab wrote:
         | "All you need is PS10k of Apple laptops..."
        
           | nurettin wrote:
           | That is... probable, if you bought a newish m2 to replace
           | your 5-6 year old macbook pro which is now just lying around.
           | Or maybe you and your spouse can share cpu hours.
        
             | svnt wrote:
             | No, you need two of the newest M3 Macbook Pros with maxed
             | RAM, which in practice some people might have, but it is
             | not gettable by using old hardware.
             | 
             | And not having tried it, I'm guessing it will probably run
             | at 1-2 tokens per second or less since the 70b model on one
             | of these runs at 3-4, and now we are distributing the
             | process over the network, which is best case maybe
             | 40-80Gb/s
             | 
             | It is possible, and that's about the most you can say about
             | it.
        
           | earslap wrote:
           | yes but still, a local model, a lightning in a bottle that is
           | between GPT3.5 and GPT4 (closer to 4), yours forever, for
           | about that price is pretty good deal today. probably won't be
           | a good deal in a couple years but for the value, it is not
           | _that_ unsettling. When ChatGPT first launched 2 years ago we
           | all wondered what it would take to have something close to
           | that locally with no strings attached, and turns out it is
           | "a couple years and about $10k" (all due to open weights
           | provided by some companies, training such a model still costs
           | millions) which is neat. It will never be more expensive.
        
       | vessenes wrote:
       | All this will be an interesting side note in the history of
       | language models in the next eight months when roughly 1.5 billion
       | iPhone users will get a local language model tied seamlessly to a
       | mid-tier cloud based language model native in their OS.
       | 
       | What I think will be interesting is seeing which of the open
       | models stick around and for how long when we have super easy
       | 'good enough' models that provide quality integration. My bet is
       | not many, sadly. I'm sure Llama will continue to be developed,
       | and perhaps Mistral will get additional European government
       | support, and we'll have at least one offering from China like
       | Qwen, and Bytedance and Tencent will continue to compete a-la
       | Google and co. But, I don't know if there's a market for ten
       | separately trained open foundation models long term.
       | 
       | I'd like to make sure there's some diversity in research and
       | implementation of these in the open access space. It's a critical
       | tool for humans, and it seems possible to me that leaders will be
       | able to keep extending the gap for a while; when you're using
       | that gap not just to build faster AI, but do other things, the
       | future feels pretty high volatility right now. Which is
       | interesting! But, I'd prefer we come out of it with people all
       | over the world having access to these.
        
         | jannyfer wrote:
         | > in the next eight months when roughly 1.5 billion iPhone
         | users will get a local language model tied seamlessly to a mid-
         | tier cloud based language model native in their OS.
         | 
         | Only iPhone 15 Pro or later will get Apple Intelligence, so the
         | number will be wayyy smaller.
        
           | visarga wrote:
           | Not in EU they won't.
        
         | darby_nine wrote:
         | I expect people will just ship with their own model where the
         | built-in one isn't sufficient.
         | 
         | When people describe it as a "critical tool" i feel like I'm
         | missing basic information about how people use computers and
         | interact with the world. In what way is it critical for
         | anything? It's still just a toy at this point.
        
           | qingdao99 wrote:
           | When it's expected to be handling reminders, calendar events,
           | and other device functions for millions of users, it will be
           | considered critical.
        
       | Anunayj wrote:
        | I recently experimented with running llama-3.1-8b-instruct
        | locally on my consumer hardware, aka my Nvidia RTX 4060 with
        | 8GB VRAM, as I wanted to experiment with prompting PDFs with a
        | large context, which is extremely expensive with how LLMs are
        | priced.
        | 
        | I was able to fit the model with decent speeds (30
        | tokens/second) and a 20k token context completely on the GPU.
        | 
        | For summarization, the performance of these models is decent
        | enough. Unfortunately, in my use case I felt that using
        | Gemini's free tier, with its multimodal capabilities and much
        | better quality output, made running local LLMs not really
        | worth it as of right now, at least for consumers.
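        | 
        | For anyone wondering how that fits in 8GB, here's a rough
        | back-of-envelope sketch in Python (the layer/head numbers are
        | the published Llama 3.1 8B config; the bits-per-weight and
        | fp16 KV cache are assumptions, so treat the totals as
        | approximate):
        | 
        |     # rough VRAM estimate: 4-bit 8B weights + 20k KV cache
        |     params     = 8.0e9   # model parameters
        |     bits_per_w = 4.5     # ~Q4_K_M average (assumed)
        |     layers, kv_heads, head_dim = 32, 8, 128
        |     ctx        = 20_000  # context length
        |     kv_bytes   = 2       # fp16 K/V entries (assumed)
        | 
        |     weights_gb = params * bits_per_w / 8 / 1e9
        |     kv_gb = (2 * layers * kv_heads * head_dim
        |              * ctx * kv_bytes / 1e9)
        |     print(round(weights_gb, 1), round(kv_gb, 1))
        |     # -> ~4.5 GB weights + ~2.6 GB KV cache, just under 8 GB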
        
         | mistrial9 wrote:
          | you moved the goalposts when you added 'multimodal' there;
          | another thing is, nothing reads PDF tables and illustrations
          | perfectly, at any price AFAIK
        
           | ComputerGuru wrote:
            | Supposedly submitting screenshots of PDFs (at a large
            | enough zoom per tile/page) to OpenAI's GPT-4o or Google's
            | equivalent is currently the best way of handling charts
            | and tables.
        
       | binary132 wrote:
        | I really get the feeling with these models that what we need
        | is a memory-first hardware architecture that is not
        | necessarily the fastest at number crunching... that seems like
        | it shouldn't necessarily be a terrifically expensive product.
        
       | shrubble wrote:
       | The newest laptops are supposed to have 40-50 TOPS performance
       | with the new AI/NPU features. Wondering what that will mean in
       | practice.
        
       | toddmorey wrote:
       | I narrate notes to myself on my morning walks[1] and then run
       | whisper locally to turn the audio into text... before having an
       | LLM clean up my ramblings into organized notes and todo lists. I
       | have it pretty much all local now, but I don't mind waiting a few
       | extra seconds for it to process since it's once a day. I like the
       | privacy because I was never comfortable telling my entire life to
       | a remote AI company.
       | 
       | [1] It feels super strange to talk to yourself, but luckily I'm
       | out early enough that I'm often alone. Worst case, I pretend I'm
       | talking to someone on the phone.
        
         | racked wrote:
          | What software did you use to set all this up? Kind of
          | interested in giving this a shot myself.
        
           | azeirah wrote:
            | You can use llama.cpp; it runs on almost all hardware.
            | Whisper.cpp is similar, but unless you have a mid- or
            | high-end Nvidia card it will be a bit slower.
           | 
           | Still very reasonable on modern hardware.
        
             | bobbylarrybobby wrote:
             | If you build locally for Apple hardware (instructions in
             | the whisper.cpp readme) then it performs quite admirably on
             | Apple computers as well.
        
           | navbaker wrote:
           | Definitely try it with Ollama, it is by far the simplest
           | local LLM tool to get up and running with minimal fuss!
        
         | alyandon wrote:
         | I would be greatly interested in knowing how you set all that
         | up if you felt like sharing the specifics.
        
           | toddmorey wrote:
           | My hope is to make this easy with a GH repo or at least
           | detailed instructions.
           | 
            | I'm on a Mac and I found the easiest way to run & use
            | local models is Ollama, as it has a REST interface:
            | https://github.com/ollama/ollama/blob/main/docs/api.md
            | 
            | I just have a local script that pulls the audio file from
            | Voice Memos (after it syncs from my iPhone), runs it
            | through OpenAI's whisper (really the best at speech to
            | text; excellent results) and then makes sense of it all
            | with a prompt that asks for organized summary notes and
            | todos in GH-flavored markdown. That final output goes into
            | my Obsidian vault. The model I use is llama3.1 but I
            | haven't spent much time testing others. I find you don't
            | really need the largest models since the task is to
            | organize text rather than augment it with a lot of
            | external knowledge.
           | 
           | Humorously the harder part of the process was finding where
           | the hell Voice Memos actually stores these audio files. I
           | wish you could set the location yourself! They live deep
           | inside ~/Library/Containers. Voice Memos has no export
           | feature, but I found you can drag any audio recording out of
           | the left sidebar to the desktop or a folder. So I just drag
           | the voice memo into a folder my script watches and then it
           | runs the automation.
           | 
           | If anyone has another, better option for recording your voice
           | on an iPhone, let me know! The nice thing about all this is
           | you don't even have to start / stop the recording ever on
           | your walk... just leave it going. Dead space and side
           | conversations and commands to your dog are all well handled
           | and never seem to pollute my notes.
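            | 
            | Until I get the repo up, the rough shape of the script is
            | something like this (folder path, model name and prompt
            | are placeholders rather than my exact setup; whisper here
            | is the open-source CLI, not the hosted API):
            | 
            |     import pathlib, subprocess, requests
            | 
            |     WATCH = pathlib.Path("~/VoiceNotes").expanduser()
            |     PROMPT = ("Turn this transcript into organized "
            |               "summary notes and a todo list in "
            |               "GitHub-flavored markdown:\n\n")
            | 
            |     def transcribe(audio):
            |         # writes <name>.txt next to the audio file
            |         subprocess.run(["whisper", str(audio),
            |                         "--model", "base",
            |                         "--output_format", "txt",
            |                         "--output_dir", str(WATCH)],
            |                        check=True)
            |         return (WATCH / (audio.stem + ".txt")).read_text()
            | 
            |     def organize(text):
            |         r = requests.post(
            |             "http://localhost:11434/api/generate",
            |             json={"model": "llama3.1",
            |                   "prompt": PROMPT + text,
            |                   "stream": False})
            |         return r.json()["response"]
            | 
            |     for memo in WATCH.glob("*.m4a"):
            |         notes = organize(transcribe(memo))
            |         (WATCH / (memo.stem + ".md")).write_text(notes)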
        
             | schainks wrote:
             | Amazing, thank you for this!
        
             | behnamoh wrote:
              | You can record voice messages and send them to yourself
              | in Telegram. They're saved on-device. You can then
              | create a bot that acts on them as they come in, like
              | "transcribe new ogg files and write the text back as a
              | message after the voice memo".
        
             | graeme wrote:
             | Have you tried the Shortcuts app? On phone and mac. Should
             | be able to make one that finds and moves a voice memo when
             | run. You can run them on button press or via automation.
             | 
              | Also, what kind of local machine do you need? I have an
              | iMac Pro; wondering if that will run the models or if I
              | ought to be on an Apple Silicon machine. I have an M1
              | MacBook Air as well.
        
             | emadda wrote:
              | You could also use the "share" menu and AirDrop the
              | audio from your iPhone to your Mac. Files end up in
              | Downloads by default.
        
         | vincvinc wrote:
         | I was thinking about making this the other day. Would you mind
         | sharing what you used?
        
         | schmidtleonard wrote:
         | Button-toggled voice notes in the iPhone Notes app are a
         | godsend for taking measurements. Rather than switching your
         | hands between probe/equipment and notes repeatedly, which sucks
         | badly, you can just dictate your readings and maaaaybe clean
         | out something someone said in the background. Over the last
         | decade, the microphones + speech recognition became Good Enough
         | for this. Wake-word/endpoint models still aren't there yet, and
         | they aren't really close, but the stupid on/off button in the
         | Notes app 100% solves this problem and the workflow is now
         | viable.
         | 
         | I love it and I sincerely hope that "Apple Intelligence" won't
         | kill the button and replace it with a sub-viable conversational
         | model, but I probably ought to figure out local whisper sooner
         | rather than later because it's probably inevitable.
        
           | freetanga wrote:
           | I bought an iZYREC (?) and leave the phone at home.
           | MacWhisper and some regex (I use verbal tags) and done
        
             | sgu999 wrote:
             | Some dubious marketing choices on their landing page:
             | 
             | > Finding the Truth - Surprisingly, my iZYREC revealed more
             | than I anticipated. I had placed it in my husband's car,
             | aiming to capture some fun moments, but it instead recorded
             | intimate encounters between my husband and my close friend.
             | Heartbreaking yet crucial, it unveiled a hidden truth,
             | helping me confront reality.
             | 
             | > A Voice for the Voiceless - We suspected that a
             | relative's child was living in an abusive home. I slipped
             | the device into the child's backpack, and it recorded the
             | entire day. The sound quality was excellent, and
             | unfortunately, the results confirmed our suspicions. Thanks
             | iZYREC, giving a voice to those who need it most.
        
         | hdjjhhvvhga wrote:
         | > It feels super strange to talk to yourself
         | 
         | I remember the first lecture in the Theory of Communication
         | class where the professor introduced the idea that
         | communication by definition requires at least two different
          | participants. We objected that it can perfectly well be one
          | and the same participant (communication spans not just space
          | but also time), and what you describe is a perfect example
          | of that.
        
         | vunderba wrote:
         | Same. My husky/pyr mix needs a lot of exercise, so I'm outside
         | a minimum of a few hours a day. As a result I do a lot of
         | dictation on my phone.
         | 
         | I put together a script that takes any audio file (mp3, wav),
         | normalizes it, runs it through ggerganov's whisper, and then
         | cleans it up using a local LLM. This has saved me a tremendous
         | amount of time. Even modestly sized 7b parameter models can
         | handle syntactical/grammatical work relatively easily.
         | 
         | Here's the gist:
         | 
         | https://gist.github.com/scpedicini/455409fe7656d3cca8959c123...
         | 
          |  _EDIT: I've always talked out loud through problems anyway,
          | throw a BT earbud on and you'll look slightly less deranged._
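          | 
          | If you'd rather not dig through the gist, the transcription
          | half is roughly this (the ffmpeg/whisper.cpp binary paths,
          | model file and flags are assumptions, not my exact script;
          | the cleanup step is just another local LLM call like
          | elsewhere in the thread):
          | 
          |     import pathlib, subprocess, sys
          | 
          |     src = pathlib.Path(sys.argv[1])
          |     wav = src.with_suffix(".16k.wav")
          | 
          |     # loudness-normalize and downsample to 16 kHz mono
          |     subprocess.run(["ffmpeg", "-y", "-i", str(src),
          |                     "-af", "loudnorm",
          |                     "-ar", "16000", "-ac", "1",
          |                     str(wav)], check=True)
          | 
          |     # whisper.cpp: writes a .txt transcript next to src
          |     subprocess.run(["./main",
          |                     "-m", "models/ggml-base.en.bin",
          |                     "-f", str(wav), "-otxt",
          |                     "-of", str(src.with_suffix(""))],
          |                    check=True)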
        
           | adamsmith wrote:
           | If it's helpful here is the prompt I use to clean up voice-
           | memo transcripts: https://gist.github.com/adamsmith/2a22b08d3
           | d4a11fb9fe06531ae...
        
         | lukan wrote:
         | "before having an LLM clean up my ramblings into organized
         | notes and todo lists."
         | 
         | Which local LLM do you use?
         | 
         | Edit:
         | 
          | And self-talk is quite a healthy and useful thing in itself,
          | but avoiding it in public is indeed kind of necessary because
          | of the stigma.
         | 
         | https://en.m.wikipedia.org/wiki/Intrapersonal_communication
        
           | flimflamm wrote:
           | That's just meat CoT (chain of thought) - right?
        
             | lukan wrote:
             | I do not understand?
        
               | valval wrote:
                | GP is making a joke about speaking to oneself really
                | just being the human version of Chain of Thought,
                | which in my understanding is an architectural decision
                | in LLMs to have the model write out intermediate steps
                | in problem solving and evaluate their validity as it
                | goes.
        
           | anigbrowl wrote:
           | Just put in some earbuds and everyone will assume you're on
           | the phone.
        
         | yieldcrv wrote:
          | I found one that can isolate speakers; it's just okay at that
        
         | neom wrote:
         | This is exactly why I think the AI pins are a good idea. The
         | Humane pin seems too big/too expensive/not quite there yet, but
         | for exactly what you're doing, I would like some type of
         | brooch.
        
         | wkat4242 wrote:
         | What do you use to run whisper locally? I don't think ollama
         | can do it.
        
         | jxcl wrote:
         | This has inspired me.
         | 
         | I do a lot of stargazing and have experimented with voice memos
         | for recording my observations. The problem of course is later
         | going back and listening to the voice memo and getting
         | organized information out of what essentially turns into me
         | rambling to myself.
         | 
         | I'm going to try to use whisper + AI to transcribe my voice
         | memos into structured notes.
        
           | inciampati wrote:
            | You can use it for everything. Just make sure that you
            | have an input method set up on your computer and phone
            | that allows you to use whisper.
           | 
           | That's how I'm writing this message to you.
           | 
           | Learning to use these speech-to-text systems will be a new
           | kind of literacy.
           | 
           | I think pushing the transcription through language models is
           | a fantastic way to deal with the complexity and frankly,
           | disorganization of directly going from speech to text.
           | 
           | By doing this we can all basically type at 150-200 words a
           | minute.
        
         | draebek wrote:
         | For people on macOS, the free app Aiko on the App Store makes
         | it easy to use Whisper, if you want a GUI:
         | https://sindresorhus.com/aiko
        
       | wazdra wrote:
        | I'd like to point out that llama 3.1 is _not_ open source[1]
        | (I was recently made aware of that fact by [2], when it was on
        | the HN front page). While it's very nice to see a peak of
        | interest in local, "open-weights" LLMs, "open source" is an
        | unfortunate choice of words, as it glosses over the quite
        | important differences between llama's license and open source.
        | The license question does not seem to be addressed at all in
        | the article.
       | 
       | [1]: https://www.llama.com/llama3_1/license/
       | 
       | [2]: https://csvbase.com/blog/14
        
         | sergiotapia wrote:
         | that ship sailed 13 years ago dude.
        
       | pimeys wrote:
        | Has anybody found a good way to use ollama with an editor
        | such as Zed to do things like "generate rustdoc for this
        | method" etc.? I use ollama daily for a ton of things, but for
        | code generation, completion and documentation 4o is still much
        | better than any of the local models...
        
         | navbaker wrote:
         | The Continue extension for VSCode is pretty good and has native
         | connectivity to a local install of Ollama
        
           | pimeys wrote:
            | Zed also has support for ollama, but none of the local
            | models I tried really work well for things like "write
            | docs for this method"... Also, local editor autocomplete
            | in the style of GitHub Copilot would be great, without
            | needing to use proprietary Microsoft tooling...
        
             | vunderba wrote:
             | There's a lot of plugins/IDEs for assistant style LLMs, but
             | the only TAB style autocompletion ones I know of are either
             | proprietary (Github Copilot), or you need to get an API key
             | (Codestral). If anyone knows of a local autocomplete model
             | I'd love to hear about it.
             | 
             | The Continue extension (Jetbrains, VSCodium) lets you set
             | up assistant and autocompletion independently with
             | different API keys.
        
         | statenjason wrote:
          | I use gen.nvim[1] for small tasks, like "write a type
          | definition for this JSON".
         | 
         | Running locally avoids the concern of sending IP or PII to
         | third parties.
         | 
         | [1]: https://github.com/David-Kunz/gen.nvim
        
       | api wrote:
       | I use ollama through a Mac app called BoltAI quite a bit. It's
       | like having a smart portable sci-fi "computer assistant" for
       | research and it's all local.
       | 
       | It is about the only thing I can do on my M1 Pro to spin up the
       | fans and make the bottom of the case hot.
       | 
       | Llama3.1, Deepseek Coder v2, and some of the Mistral models are
       | good.
       | 
       | ChatGPT and Claude top tier models are still better for very hard
       | stuff.
        
       | noman-land wrote:
       | For anyone who hasn't tried local models because they think it's
       | too complicated or their computer can't handle it, download a
       | single llamafile and try it out in just moments.
       | 
       | https://future.mozilla.org/builders/news_insights/introducin...
       | 
       | https://github.com/Mozilla-Ocho/llamafile
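        | 
        | Once a llamafile is running it also serves an OpenAI-
        | compatible API on localhost (port 8080 by default, as far as
        | I know), so scripting against it takes only a few lines. A
        | minimal sketch, assuming the default port and that the model
        | name is ignored by the local server:
        | 
        |     import requests
        | 
        |     r = requests.post(
        |         "http://localhost:8080/v1/chat/completions",
        |         json={"model": "local",  # name ignored locally
        |               "messages": [{"role": "user",
        |                             "content": "Hello!"}]})
        |     print(r.json()["choices"][0]["message"]["content"])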
       | 
       | They even have whisperfiles now, which is the same thing but for
       | whisper.cpp, aka real-time voice transcription.
       | 
       | You can also take this a step further and use this exact setup
       | for a local-only co-pilot style code autocomplete and chat using
       | Twinny. I use this every day. It's free, private, and offline.
       | 
       | https://github.com/twinnydotdev/twinny
       | 
       | Local LLMs are the only future worth living in.
        
         | _kidlike wrote:
         | Or https://ollama.com/
        
           | fortyseven wrote:
            | This has been my go-to for all of my local LLM
            | interaction: it's easy to get going and it manages all of
            | the models easily. Nice clean API for projects. Updated
            | regularly; works across Windows, Mac, Linux. It's a
            | wrapper around llama.cpp, but it's a damned good one.
        
         | heyoni wrote:
         | Isn't there also some Firefox AI integration that's being
         | tested by one dev out there? I forgot the name and wonder if it
         | got any traction.
        
         | privacyis1mp wrote:
          | I built the Fluid app with exactly that in mind. You can run
          | local AI on a Mac without really knowing what an LLM or
          | ollama is. Plug & play.
         | 
         | Sorry for the blatant ad, though I do hope it's useful for some
         | ppl reading this thread: https://getfluid.app
        
           | twh270 wrote:
           | I'm interested, but I can't find any documentation for it.
           | Can I give it local content (documents, spreadsheets, code,
           | etc.) and ask questions?
        
             | privacyis1mp wrote:
              | > Can I give it local content (documents, spreadsheets,
              | code, etc.)?
              | 
              | It's coming roughly in December (maybe sooner).
              | 
              | The roadmap is as follows:
              | 
              | - October - private remote AI (for when you need a
              | smarter AI than your machine can handle, but don't want
              | your data to be logged or stored anywhere)
              | 
              | - November - web search capabilities (so the AI will be
              | able to do web searches out of the box)
              | 
              | - December - PDF, docs and code embedding.
              | 
              | - 2025 - tighter macOS integration with context
              | awareness.
        
               | twh270 wrote:
               | Oh awesome, thank you! I will check back in December.
        
         | zelphirkalt wrote:
          | Many setups rely on Nvidia GPUs, Intel stuff, Windows or
          | other things I would rather not use, or are not very clear
          | about how to set things up.
         | 
         | What are some recommendations for running models locally, on
         | decent CPUs and getting good valuable output from them? Is that
         | llama stuff portable across CPUs and hardware vendors? And what
         | do people use it for?
        
           | noman-land wrote:
            | llamafile will run on all architectures because it is
            | compiled with Cosmopolitan.
           | 
           | https://github.com/jart/cosmopolitan
           | 
           |  _" Cosmopolitan Libc makes C a build-once run-anywhere
           | language, like Java, except it doesn't need an interpreter or
           | virtual machine. Instead, it reconfigures stock GCC and Clang
           | to output a POSIX-approved polyglot format that runs natively
           | on Linux + Mac + Windows + FreeBSD + OpenBSD + NetBSD + BIOS
           | with the best possible performance and the tiniest footprint
           | imaginable."_
           | 
           | I use it just fine on a Mac M1. The only bottleneck is how
           | much RAM you have.
           | 
            | I use whisper for podcast transcription. I use llama for
            | code completion, general Q&A and code assistance. You can
            | use the llava models to ingest images and describe them.
        
           | threecheese wrote:
           | Have you tried a Llamafile? Not sure what platform you are
            | using. From their readme:
            | 
            |   > ... by combining llama.cpp with Cosmopolitan Libc into
            |   one framework that collapses all the complexity of LLMs
            |   down to a single-file executable (called a "llamafile")
            |   that runs locally on most computers, with no
            |   installation.
           | 
           | Low cost to experiment IMO. I am personally using MacOS with
           | an M1 chip and 64gb memory and it works perfectly, but the
           | idea behind this project is to democratize access to
           | generative AI and so it is at least possible that you will be
           | able to use it.
        
             | narrator wrote:
             | With 64GB can you run the 70B size llama models well?
        
               | credit_guy wrote:
               | No, you can't. I have 128 GB and a 70B llamafile is
               | unusable.
        
               | threecheese wrote:
               | I should have qualified the meaning of "works perfectly"
               | :) No 70b for me, but I am able to experiment with many
               | quantized models (and I am using a Llama successfully,
               | latency isn't terrible)
        
           | wkat4242 wrote:
           | Not really. I run ollama on an AMD Radeon Pro and it works
           | great.
           | 
           | For tooling to train models it's a bit more difficult but
           | inference works great on AMD.
           | 
           | My CPU is an AMD Ryzen and the OS Linux. No problem.
           | 
           | I use OpenWebUI as frontend and it's great. I use it for
           | everything that people use GPT for.
        
           | distances wrote:
           | I'm using Ollama with an AMD GPU (7800, 16GB) on Linux. Works
           | out of the box. Another question is then if I get much value
           | out of these local models.
        
         | vunderba wrote:
         | If you're gonna go with a VS code extension and you're aiming
         | for privacy, then I would at least recommend using the open
         | source fork VS Codium.
         | 
         | https://vscodium.com/
        
           | unethical_ban wrote:
           | It is true that VS Code has some non-optional telemetry, and
           | if VS Codium works for people, that is great. However, the
           | telemetry of VSCode is non-personal metrics, and some of the
           | most popular extensions are only available with VSCode, not
           | with Codium.
        
             | wkat4242 wrote:
             | > and some of the most popular extensions are only
             | available with VSCode, not with Codium
             | 
             | Which is an artificial restriction from MS that's really
             | easily bypassed.
             | 
             | Personally I don't care whether the telemetry is
             | identifiable. I just don't want it.
        
               | noman-land wrote:
               | How is it bypassed?
        
               | wkat4242 wrote:
               | There's a whitelist identifier that you can add bundle
               | IDs to, to get access to the more sensitive APIs. Then
               | you can download the extension file and install it
               | manually. I don't have the exact process right now but
               | just Google it :)
        
             | qwezxcrty wrote:
             | From the documentation
             | (https://code.visualstudio.com/docs/getstarted/telemetry )
             | it seems there is a supported way to completely turn off
             | telemetry. Is there something else in VSCode that doesn't
             | respect this setting?
        
         | wkat4242 wrote:
         | Yeah I set up a local server with a strong GPU but even without
         | that it's ok, just a lot slower.
         | 
         | The biggest benefits for me are the uncensored models. I'm
         | pretty kinky so the regular models tend to shut me out way too
         | much, they all enforce this prudish victorian mentality that
         | seems to be prevalent in the US but not where I live. Censored
         | models are just unusable to me which includes all the hosted
         | models. It's just so annoying. And of course the privacy.
         | 
         | It should really be possible for the user to decide what kind
         | of restrictions they want, not the vendor. I understand they
         | don't want to offer violent stuff but 18+ topics should be
         | squarely up to me.
         | 
         | Lately I've been using grimjim's uncensored llama3.1 which
         | works pretty well.
        
           | the_gorilla wrote:
           | If you get to act out degenerate fantasies with a chatbot, I
           | also want to be able to get "violent stuff" (which is again
           | just words).
        
           | threecheese wrote:
           | Any tips you can give for like minded folks? Besides grimjim
           | (checking it out).
        
           | wkat4242 wrote:
           | @the_gorilla: I don't consider bdsm to be 'degenerate' nor
           | violent, it's all incredibly consensual and careful.
           | 
           | It's just that the LLMs trigger immediately on minor words
           | and shut down completely.
        
         | kaoD wrote:
          | Well this was my experience...
          | 
          |     User: Hey, how are you?
          |     Llama: [object Object]
         | 
         | It's funny but I don't think I did anything wrong?
        
           | AlienRobot wrote:
           | 2000: Javascript is webpages.
           | 
           | 2010: Javascript is webservers.
           | 
           | 2020: Javascript is desktop applications.
           | 
           | 2024: Javascript is AI.
        
             | evbogue wrote:
             | From this data we must conclude that within our lifetimes
             | all matter in the universe will eventually be reprogrammed
             | in JavaScript.
        
               | mnky9800n wrote:
               | I'm not sure I want to live in that reality.
        
               | mortenjorck wrote:
               | If the simulation hypothesis is real, perhaps it would
               | follow that all the dark matter and dark energy in the
               | universe is really just extra cycles being burned on
               | layers of interpreters and JIT compilation of a loosely-
               | typed scripting language.
        
               | AlienRobot wrote:
               | It's fine, it will be Typescript.
        
               | anotherjesse wrote:
               | WAT? https://www.destroyallsoftware.com/talks/wat
        
         | ryukoposting wrote:
         | Not only are they the only future worth living in, incentives
         | are aligned with client-side AI. For governments and government
         | contractors, plumbing confidential information through a
         | network isn't an option, let alone spewing it across the
         | internet. It's a non-starter, regardless of the productivity
         | bumps stuff like Copilot can provide. The _only_ solution is to
          | put AI compute on a cleared individual's work computer.
        
         | ComputerGuru wrote:
         | Do you know if whisperfile is akin to whisper or the much
         | better whisperx? Does it do diarization?
        
           | noman-land wrote:
           | Last I checked it was basically just whisper.cpp so not
           | whisperx and no diarization by default but it moves pretty
           | quickly so you may want to ask on the Mozilla AI Discord.
           | 
           | https://discord.com/invite/yTPd7GVG3H
        
         | AustinDev wrote:
         | https://old.reddit.com/r/LocalLLaMA/ is a great community for
         | this sort of thing as well.
        
         | upcoming-sesame wrote:
          | I just tried it now. Super easy indeed, but slow to the
          | point that it's not usable on my PC.
        
       | stainablesteel wrote:
        | I think this is laughable: the only good 8B models are the
        | Llama ones, Phi is terrible, and even Codestral can barely
        | code, and that's 22B IIRC.
        | 
        | But truthfully the 8B models just aren't that great yet. They
        | can provide some decent info if you're just investigating
        | things, but a Google search is still faster.
        
       | aledalgrande wrote:
       | Does anyone know of a local "Siri" implementation? Whisper +
       | Llama (or Phi or something else), that can run shortcuts, take
       | notes, read web pages etc.?
       | 
       | PS: for reading web pages I know there's voices integrated in the
       | browser/OS but those are horrible
        
         | xenospn wrote:
         | Apple intelligence?
        
           | aledalgrande wrote:
           | It isn't clear if you can know when the task gets handed off
           | to their servers. But yeah that'd be the closest I know. I'm
           | not sure it would build a local knowledge base though.
        
         | gardnr wrote:
         | Edit: I just found this. I'll give it a try today:
         | https://github.com/0ssamaak0/SiriLLama
         | 
         | ---
         | 
         | Open WebUI has a voice chat but the voices are not great. I'm
         | sure they'd love a PR that integrates StyleTTS2.
         | 
         | You can give it a Serper API Key and it will search the web to
         | use as context. It connects to ollama running on a linux box
         | with a $300 RTX 3060 with 12GB of VRAM. The 4bit quant of Llama
         | 3.1 8B takes up a bit more than 6GB of VRAM which means it can
         | run embedding models and STT on the card at the same time.
         | 
         | 12GB is the minimum I'd recommend for running quantized models.
         | The RTX 4070 Ti Super is 3x the cost but 7 times "faster" on
         | matmuls.
         | 
         | The AMD cards do inference OK but they are a constant source of
         | frustration when trying to do anything else. I bought one and
         | tried for 3 months before selling it. It's not worth the
         | effort.
         | 
         | I don't have any interest in allowing it to run shortcuts. Open
         | WebUI has pipelines for integrating function calling.
         | HomeAssistant has some integrations if that's the kind of thing
         | you are thinking about.
        
           | aledalgrande wrote:
           | That SiriLLama project looks awesome! I'll give it a try. I
           | also just spun up https://github.com/ItzCrazyKns/Perplexica
           | to try a local Perplexity alternative.
        
       | jsemrau wrote:
       | The Mistral models are not half as bad for this.
        
       | shahzaibmushtaq wrote:
       | I need to have two things of my own that work offline for privacy
       | concerns and cost savings:
       | 
       | 1. Local LLM AI models with GUI and command line
       | 
       | 2. > Local LLM-based coding tools do exist (such as Google
       | DeepMind's _CodeGemma_ and one from California-based developers
       | _Continue_ )
        
       | create-username wrote:
        | There's no small AI that I know of that masters ancient
        | Greek, Latin, English, German and French and that I can run
        | on my 18 GB MacBook Pro.
       | 
       | Please correct me if I'm wrong. It would make my life slightly
       | more comfortable
        
         | sparkybrap wrote:
          | I agree. Even bilingual (English+1) small models would be
          | very useful for processing localized data, e.g.
          | English-French, English-German, etc.
          | 
          | Right now the small models (llama 8B) can't handle this type
          | of task, although they could if they were trained on
          | bilingual data.
        
       | sandspar wrote:
       | What advantages do local models have over exterior models? Why
       | would I run one locally if ChatGPT works well?
        
         | pieix wrote:
          | 1) Offline availability -- pretty cool to be able to debug
         | technical problems while flying (or otherwise off grid) with a
         | local LLM, and current 8B models are usually good enough for
         | the first line of questions that you otherwise would have
         | googled.
         | 
         | 2) Privacy
         | 
         | 3) Removing safety filters -- there are some great
         | "abliterated" models out there that have had their refusal
         | behavior removed. Running these locally and never having your
         | request refused due to corporate risk aversion is a very
         | different experience to calling a safety-neutered API.
         | 
         | Depending on your use case some, all, or none of these will be
         | relevant, but they are undeniable benefits that are very much
         | within reach using a laptop and the current crop of models.
        
       | pilooch wrote:
        | I run a fine-tuned multimodal LLM as a spam filter (it reads
        | emails as images). Game changer. Removes all the stuff I
        | wouldn't read anyway, not only spam.
        
       | Archit3ch wrote:
       | One use case I've found very convenient: partial screenshot |>
       | minicpm-v
       | 
       | Covers 90% of OCR needs with 10% of the effort. No API keys,
       | scripting, or network required.
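        | 
        | If you ever do want to script it, Ollama's REST API accepts
        | base64 images for multimodal models. A minimal sketch (model
        | name, port and file path assumed):
        | 
        |     import base64, requests
        | 
        |     img = base64.b64encode(
        |         open("clip.png", "rb").read()).decode()
        |     r = requests.post(
        |         "http://localhost:11434/api/generate",
        |         json={"model": "minicpm-v",
        |               "prompt": "Transcribe the text in this image.",
        |               "images": [img],
        |               "stream": False})
        |     print(r.json()["response"])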
        
       | trash_cat wrote:
       | I think it REALLY depends on your use case. Do you want to
       | brainstorm, clear out some thoughts, search or solve complex
       | tasks?
        
       | RevEng wrote:
       | I'm currently working on an LLM-based product for a large company
       | that's used in circuit design. Our customers have very strict
       | confidentiality requirements since the field is very competitive
       | and they all have trade secret technologies that give them
       | significant competitive advantages. Using something public like
       | ChatGPT is simply out of the question. Their design environments
       | are often completely disconnected from the public internet, so
       | our tools need to run local models. Llama 3 has worked well for
       | us so far and we're looking at other models too. We also prefer
       | not being locked in to a specific vendor like OpenAI, since our
       | reliance on the model puts us in a poor position to negotiate and
       | the future of AI companies isn't guaranteed.
       | 
       | For my personal use, I also prefer to use local models. I'm not a
       | fan of OpenAI's shenanigans and Google already abuses its
       | customers data. I also want the ability to make queries on my own
       | local files without having to upload all my information to a
       | third party cloud service.
       | 
       | Finally, fine tuning is very valuable for improving performance
       | in niche domains where public data isn't generally available.
       | While online providers do support fine tuning through their
       | services, this results in significant lock in as you have to pay
       | them to do the tuning in their servers, you have to provide them
       | with all your confidential data, and they own the resulting model
       | which you can only use through their service. It might be
       | convenient at first, but it's a significant business risk.
        
       | fsndz wrote:
        | My thesis: small language models (SLMs) -- models so compact
        | that you can run them on a computer with just 4GB of RAM --
        | are the future. SLMs are efficient enough to be deployed on
        | edge devices while still maintaining enough intelligence to be
        | useful.
       | https://www.lycee.ai/blog/why-small-language-models-are-the-...
        
       | alanzhuly wrote:
       | For anyone looking for a simple alternative for running local
       | models beyond just text, Nexa AI has built an SDK that supports
       | text, audio (STT, TTS), image generation (e.g., Stable
       | Diffusion), and multimodal models! It also has a model hub to
       | help you easily find local models optimized for your device.
       | 
        | Nexa AI local model hub: https://nexaai.com/
        | 
        | Toolkit: https://github.com/NexaAI/nexa-sdk
       | 
       | It also comes with a built-in local UI to get started with local
       | models easily and OpenAI-compatible API (with JSON schema for
       | function calling and streaming) for starting local development
       | easily.
       | 
        | You can run the Nexa SDK on any device with a Python
        | environment -- and GPU acceleration is supported!
       | 
       | Local LLMs, and especially multimodal local models are the
       | future. It is the only way to make AI accessible (cost-efficient)
       | and safe.
        
       | rthaswrea wrote:
       | Another solid option https://www.anaconda.com/products/ai-
       | navigator
        
       ___________________________________________________________________
       (page generated 2024-09-21 23:01 UTC)