[HN Gopher] Llama 3.2: Revolutionizing edge AI and vision with o...
___________________________________________________________________
Llama 3.2: Revolutionizing edge AI and vision with open,
customizable models
Author : nmwnmw
Score : 172 points
Date : 2024-09-25 17:29 UTC (5 hours ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| TheAceOfHearts wrote:
| I still can't access the hosted model at meta.ai from Puerto
| Rico, despite us being U.S. citizens. I don't know what Meta has
| against us.
|
| Could someone try giving the 90b model this word search problem
| [0] and tell me how it performs? So far with every model I've
| tried, none has ever managed to find a single word correctly.
|
| [0] https://imgur.com/i9Ps1v6
| Workaccount2 wrote:
| This is likely because the models run OCR on images that
| contain text, and once the text is parsed the word search
| doesn't make sense anymore.
|
| It would be interesting to see a model work on the raw input,
| though.
| simonw wrote:
| Image models such as Llama 3.2 11B and 90B (and the Claude 3
| series, and Microsoft Phi-3.5-vision-instruct, and PaliGemma,
| and GPT-4o) don't run OCR as a separate step. Everything they
| do is from that raw vision model.
| paxys wrote:
| Non US citizens can access the model just fine, if that's what
| you are implying.
| TheAceOfHearts wrote:
| I'm not implying anything. It's just frustrating that despite
| being a US territory with US citizens, PR isn't allowed to
| use this service without any explanation.
| paxys wrote:
| Just because you cannot access the model doesn't mean all
| of Puerto Rico is blocked.
| TheAceOfHearts wrote:
| When I visit meta.ai it says:
|
| > Meta AI isn't available yet in your country
|
| Maybe it's just my ISP, I'll ask some friends if they can
| access the service.
| paxys wrote:
| meta.ai is their AI service (similar to ChatGPT). The
| model source itself is hosted on llama.com.
| TheAceOfHearts wrote:
| I'm aware. I wanted to try out their hosted version of
| the model because I'm GPU poor.
| elcomet wrote:
| You can try it on hugging face
| nmwnmw wrote:
| - Llama 3.2 introduces small and medium-sized vision LLMs (11B
| and 90B parameters) and lightweight text-only models (1B and
| 3B) for edge/mobile devices, with the smaller models supporting
| a 128K token context.
|
| - The 11B and 90B vision models are competitive with leading
| closed models like Claude 3 Haiku on image understanding tasks,
| while being open and customizable.
|
| - Llama 3.2 comes with official Llama Stack distributions to
| simplify deployment across environments (cloud, on-prem, edge),
| including support for RAG and safety features.
|
| - The lightweight 1B and 3B models are optimized for on-device
| use cases like summarization and instruction following.
| opdahl wrote:
| I'm blown away by just how open the Llama team at Meta is. It
| is nice to see that they are not only giving access to the
| models, but are also open about how they built them. I don't
| know how the future is going to go in terms of models, but I
| sure am grateful that Meta has taken this position and is
| pushing for more openness.
| nickpsecurity wrote:
| Do they tell you what training data they use for alignment? As
| in, what biases they intentionally put in the system they're
| widely deploying?
| warkdarrior wrote:
| Do you have some concrete example of biases in their models?
| Or are you just fishing for something to complain about?
| resters wrote:
| This is great! Does anyone know if the Llama models are trained
| to do function calling like the OpenAI models are? And/or are
| there any function-calling training datasets?
| refulgentis wrote:
| Yes (rationale: 3.1 was, and it would be strange to roll that
| back).
|
| In general, you'll get much further by constraining token
| generation to valid JSON - I've seen models as small as 800M
| handle JSON with that. It's ~impossible to train constraining
| into a model with remotely the same reliability -- you'd have
| to erase a ton of conversational training that makes it say
| e.g. "Sure! Here's the JSON you requested:"
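A minimal, self-contained sketch of the constrained-decoding idea
discussed above: at each step, mask the logits so only tokens that
keep the output a valid prefix of the target format can be sampled.
The toy vocabulary, scores, and allowed outputs below are invented
for illustration; real setups rely on llama.cpp grammars or similar
guided-decoding features.

  # Toy constrained decoding: no fine-tuning, just logit masking.
  # Everything here (vocab, scores, allowed outputs) is hypothetical.
  import math

  VOCAB = ['{"answer": "', 'yes', 'no', '"}', 'Sure! ', 'Here is the JSON: ']
  ALLOWED_OUTPUTS = ['{"answer": "yes"}', '{"answer": "no"}']

  def toy_logits(prefix):
      """Stand-in for a language model: chatty tokens score highest."""
      return [1.0, 0.5, 0.4, 0.3, 3.0, 2.5]

  def allowed(prefix, token):
      """Allowed if prefix+token is still a prefix of a legal output."""
      return any(out.startswith(prefix + token) for out in ALLOWED_OUTPUTS)

  def constrained_decode(max_steps=10):
      out = ""
      for _ in range(max_steps):
          scores = toy_logits(out)
          # Mask tokens that would break the format, however likely they are.
          masked = [s if allowed(out, t) else -math.inf
                    for s, t in zip(scores, VOCAB)]
          out += VOCAB[max(range(len(VOCAB)), key=lambda i: masked[i])]
          if out in ALLOWED_OUTPUTS:
              break
      return out

  print(constrained_decode())  # -> {"answer": "yes"}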
| Closi wrote:
| What about OpenAI Structured Outputs? This seems to do
| exactly this.
| refulgentis wrote:
| Correct, I think so too - that update seems to be doing exactly
| this. tl;dr: for Llama fn-calling reliability, you don't need
| to reach for training; in fact, you can train and still have
| the same problem.
| zackangelo wrote:
| I'm building this type of functionality on top of Llama
| models if you're interested:
| https://docs.mixlayer.com/examples/json-output
| TmpstsTrrctta wrote:
| They mention tool calling for the smaller models in the link,
| and compare them to 8B-level function calling in the benchmarks
| here:
|
| https://news.ycombinator.com/item?id=41651126
| ushakov wrote:
| yes, but only the text-only models!
|
| https://www.llama.com/docs/model-cards-and-prompt-formats/ll...
| zackangelo wrote:
| This is incorrect:
|
| > With text-only inputs, the Llama 3.2 Vision Models can do
| tool-calling exactly like their Llama 3.1 Text Model
| counterparts. You can use either the system or user prompts
| to provide the function definitions.
|
| > Currently the vision models don't support tool-calling with
| text+image inputs.
|
| They support it, but not when an image is submitted in the
| prompt. I'd be curious to see what the model does. Meta
| typically sets conservative expectations around this type of
| behavior (e.g., they say that the 3.1 8b model won't do
| multiple tool calls, but in my experience it does so just
| fine).
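As a rough illustration of what the quoted docs describe (function
definitions supplied as plain text in the system prompt, text-only
input), here is a hedged sketch using the ollama Python client. The
prompt wording, the model tag, and the expected JSON reply shape are
assumptions for the example, not Meta's official tool-calling format.

  # Sketch: pass one tool definition via the system prompt and try to
  # parse a JSON tool call out of the reply. Format is an assumption.
  import json
  import ollama  # pip install ollama; assumes a local Ollama server

  tool = {"name": "get_weather",
          "description": "Look up current weather for a city",
          "parameters": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]}}

  system = ('To call the function, reply only with JSON of the form '
            '{"name": ..., "arguments": {...}}.\n'
            "Available function: " + json.dumps(tool))

  resp = ollama.chat(
      model="llama3.2",  # local tag for the 3B instruct model (assumed)
      messages=[{"role": "system", "content": system},
                {"role": "user", "content": "What's the weather in Oslo?"}])

  try:
      call = json.loads(resp["message"]["content"])
      print("tool call:", call["name"], call["arguments"])
  except (json.JSONDecodeError, KeyError):
      print("model replied in prose:", resp["message"]["content"])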
| winddude wrote:
| the vision models can also do tool calling according to the
| docs, but with text-only inputs, maybe that's what you meant
| ~ <https://www.llama.com/docs/model-cards-and-prompt-
| formats/ll...>
| moffkalast wrote:
| I've just tested the 1B and 3B at Q8; some interesting bits:
|
| - The 1B is extremely coherent (feels something like Mistral 7B
| at 4 bits), and with flash attention and a 4-bit KV cache it
| only uses about 4.2 GB of VRAM for 128k context
|
| - A Pi 5 runs the 1B at 8.4 tok/s; I haven't tested the 3B yet,
| but it might need a lower quant to fit, and with 9T training
| tokens it'll probably degrade pretty badly when quantized that
| far
|
| - The 3B is a certified Gemma-2-2B killer
|
| Given that llama.cpp doesn't support any multimodality (they
| removed the old implementation), it might be a while before the
| 11B and 90B become runnable. They don't seem to outperform
| Qwen2-VL on vision benchmarks, though.
| Patrick_Devine wrote:
| Hoping to get this out soon w/ Ollama. Just working out a
| couple of last kinks. The 11b model is legit good though,
| particularly for tasks like OCR. It can actually read my
| cursive handwriting.
| gdiamos wrote:
| Llama 3.2 includes a 1B parameter model, which should give
| roughly 8x higher throughput in data pipelines. In our
| experience, smaller models are just fine for simple tasks like
| reading paragraphs from PDF documents.
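A sketch of that kind of pipeline, assuming pypdf for text
extraction and a local Ollama install serving the 1B model; the file
name, model tag, prompt, and the crude paragraph split are all
illustrative choices, not anything from the announcement.

  # Sketch: run the small model over paragraphs pulled out of a PDF.
  # Assumes `pip install pypdf ollama` and an Ollama server with the
  # model pulled.
  import ollama
  from pypdf import PdfReader

  reader = PdfReader("report.pdf")  # hypothetical input file
  for page in reader.pages:
      text = page.extract_text() or ""
      # Crude paragraph split; real pipelines would segment more carefully.
      for para in (p for p in text.split("\n\n") if len(p) > 200):
          resp = ollama.chat(
              model="llama3.2:1b",  # assumed tag for the 1B instruct model
              messages=[{"role": "user",
                         "content": "Summarize in one sentence:\n\n" + para}])
          print(resp["message"]["content"])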
| gdiamos wrote:
| Do inference frameworks like vllm support vision?
| woodson wrote:
| Yes, vLLM does (though marked experimental):
| https://docs.vllm.ai/en/latest/models/vlm.html
| minimaxir wrote:
| Off topic/meta, but the Llama 3.2 news topic received many, many
| HN submissions and upvotes but never made it to the front page:
| the fact that it's on the front page now indicates that
| moderators intervened to rescue it:
| https://news.ycombinator.com/from?site=meta.com (showdead on)
|
| If there's an algorithmic penalty against the news for whatever
| reason, that may be a flaw in the HN ranking algorithm.
| makin wrote:
| The main issue was that Meta quickly took down the first
| announcement, and the only remaining working submission was the
| information-sparse HuggingFace link. By the time the other
| links were back up, it was too late. Perfect opportunity for a
| rescue.
| dhbradshaw wrote:
| Tried out 3B on ollama, asking questions in optics, bio, and
| rust.
|
| It's super fast with a lot of knowledge, a large context and
| great understanding. Really impressive model.
| tomComb wrote:
| I question whether a 3B model can have "a lot of knowledge".
| foxhop wrote:
| My guess is it uses the same vocabulary size as Llama 3.1,
| which is about 128,000 distinct tokens (subword pieces) to
| support many languages. Parameter count is less of an indicator
| of fitness than previously thought.
| sva_ wrote:
| Curious about the multimodal model's architecture. But alas, when
| I try to request access
|
| > Llama 3.2 Multimodal is not available in your region.
|
| It sounds like they feed the continuous output of an image
| encoder into the transformer, similar to Transfusion [0]? Does
| anyone know where to find more details?
|
| Edit:
|
| _> Regarding the licensing terms, Llama 3.2 comes with a very
| similar license to Llama 3.1, with one key difference in the
| acceptable use policy: any individual domiciled in, or a company
| with a principal place of business in, the European Union is not
| being granted the license rights to use multimodal models
| included in Llama 3.2._ [1]
|
| What a bummer.
|
| 0. https://www.arxiv.org/abs/2408.11039
|
| 1. https://huggingface.co/blog/llama32#llama-32-license-
| changes...
| _ink_ wrote:
| Oh. That's sad indeed. What might be the reason for excluding
| Europe?
| Arubis wrote:
| Glibly, Europe has the gall to even consider writing
| regulations without asking the regulated parties for
| permission.
| pocketarc wrote:
| Between this and Apple's policies, big tech corporations
| really seem to be putting the screws to the EU as much as
| they can.
|
| "See, consumers? Look at how bad your regulation is, that
| you're missing out on all these cool things we're working
| on. Talk to your politicians!"
|
| Regardless of your political opinion on the subject, you've
| got to admit, at the very least, it will be educational to
| see how this develops over the next 5-10 years of tech
| progress, as the EU gets excluded from more and more
| things.
| DannyBee wrote:
| Or, again, they are just deciding the economy isn't worth
| the cost. (or not worth prioritizing upfront or ....)
|
| When we had numerous discussions on HN as these rules were
| implemented, this is precisely what the Europeans said should
| happen.
|
| So why does it now have to be some concerted effort to "put
| the screws to the EU"?
|
| I otherwise agree it will be interesting, but mostly in the
| sense that I watched people swear up and down that this was
| just about protecting EU citizens, and that they were fine
| with these companies doing nothing in the EU, or not
| prioritizing the EU, if they decided it wasn't worth the cost.
|
| We'll see if that's true or not, I guess, or if they really
| wanted it to be "you have to do it, but on our terms" or
| whatever.
| imiric wrote:
| > Between this and Apple's policies, big tech
| corporations really seem to be putting the screws to the
| EU as much as they can.
|
| Funny, I see that the other way around, actually. The EU
| is forcing Big Tech to be transparent and not exploit
| their users. It's the companies that must choose to
| comply, or take their business elsewhere. Let's not
| forget that Apple users in the EU can use 3rd-party
| stores, and it was EU regulations that forced Apple to
| switch to USB-C. All of these are a win for consumers.
|
| The reason Meta is not making their models available in
| the EU is because they can't or won't comply with the
| recent AI regulations. This only means that the law is
| working as intended.
|
| > it will be educational to see how this develops over
| the next 5-10 years of tech progress, as the EU gets
| excluded from more and more things.
|
| I don't think we're missing much that Big Tech has to
| offer, and we'll probably be better off for it. I'm
| actually in favor of even stricter regulations,
| particularly around AI, but what was recently enacted is
| a good start.
| aftbit wrote:
| This makes it sound like some kind of retaliation, instead
| of Meta attempting to comply with the very regulations
| you're talking about. Maybe llama3.2 would violate the
| existing face recognition database policies?
| DannyBee wrote:
| Why is it that and not just cost/benefit for them?
|
| They've decided it's not worth their time/energy to do it
| right now in a way that complies with regulation (or
| whatever)
|
| Isn't that precisely the choice the EU wants them to make?
|
| Either do it within the bounds of what we want, or leave us
| out of it?
| paxys wrote:
| Punishment. "Your government passes laws we don't like, so we
| aren't going to let you have our latest toys".
| GaggiX wrote:
| Fortunately, Qwen2-VL exists; it is pretty good and under an
| actual open source license, Apache 2.0.
|
| Edit: the larger 72B model is not under Apache 2.0 but under
| the license here:
| https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct/blob/main/...
|
| Qwen2-VL-72B seems to perform better than Llama-3.2-90B on
| visual tasks.
| mrfinn wrote:
| Pity, it's over. We'll never ever be able to download those
| ten-gigabyte files from the other side of the fence.
| Y_Y wrote:
| I hereby grant license to anyone in the EU to do whatever they
| want with this.
| lawlessone wrote:
| Cheers :)
| moffkalast wrote:
| Well you said hereby so it must be law.
| btdmaster wrote:
| Full text:
|
| https://github.com/meta-llama/llama-models/blob/main/models/...
|
| https://github.com/meta-llama/llama-models/blob/main/models/...
|
| > With respect to any multimodal models included in Llama 3.2,
| the rights granted under Section 1(a) of the Llama 3.2
| Community License Agreement are not being granted to you if you
| are an individual domiciled in, or a company with a principal
| place of business in, the European Union. This restriction does
| not apply to end users of a product or service that
| incorporates any such multimodal models.
| ankit219 wrote:
| If you are still curious about the architecture, from the blog:
|
| > To add image input support, we trained a set of adapter
| weights that integrate the pre-trained image encoder into the
| pre-trained language model. The adapter consists of a series of
| cross-attention layers that feed image encoder representations
| into the language model. We trained the adapter on text-image
| pairs to align the image representations with the language
| representations. During adapter training, we also updated the
| parameters of the image encoder, but intentionally did not
| update the language-model parameters. By doing that, we keep
| all the text-only capabilities intact, providing developers a
| drop-in replacement for Llama 3.1 models.
|
| What this crudely means is that they extended the base Llama
| 3.1 to include image-based weights and inference. You can do
| that if you freeze the existing weights and add new ones, which
| are then updated during training runs (adapter training). Then
| they did SFT and RLHF runs on the composite model (for lack of
| a better word). This is a little-known technique, and very
| effective. I just had a paper accepted about a similar
| technique; I'll share a blog post once it is published, if you
| are interested (though it's not on this scale, and probably not
| as effective). Side note: that is also why the param sizes are
| 11B and 90B - the adapter weights are added on top of the
| text-only models.
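A toy PyTorch sketch of the pattern the quoted blog text describes:
a trainable cross-attention block feeding image-encoder states into
an otherwise frozen language-model layer. The dimensions and module
layout are invented for illustration and are not Meta's actual
implementation.

  # Toy adapter: freeze the LM block, train only the cross-attention
  # layer that attends over image-encoder outputs. Sizes are made up.
  import torch
  import torch.nn as nn

  D_TEXT, D_IMG, N_HEADS = 512, 768, 8

  class CrossAttnAdapter(nn.Module):
      def __init__(self):
          super().__init__()
          self.img_proj = nn.Linear(D_IMG, D_TEXT)              # trainable
          self.xattn = nn.MultiheadAttention(D_TEXT, N_HEADS,
                                             batch_first=True)  # trainable
          self.norm = nn.LayerNorm(D_TEXT)

      def forward(self, text_h, img_h):
          img = self.img_proj(img_h)
          attended, _ = self.xattn(query=text_h, key=img, value=img)
          return self.norm(text_h + attended)  # residual keeps the text path

  lm_block = nn.TransformerEncoderLayer(D_TEXT, N_HEADS, batch_first=True)
  for p in lm_block.parameters():
      p.requires_grad = False  # frozen, so text-only behaviour is preserved

  adapter = CrossAttnAdapter()
  text_h = torch.randn(2, 16, D_TEXT)  # fake hidden states, 16 text tokens
  img_h = torch.randn(2, 64, D_IMG)    # fake image-encoder outputs, 64 patches
  print(lm_block(adapter(text_h, img_h)).shape)  # torch.Size([2, 16, 512])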
| IAdkH wrote:
| Again, we see that Llama is totally open source! Practically
| BSD licensed!
|
| So the issue is privacy:
|
| https://www.itpro.com/technology/artificial-intelligence/met...
|
| "Meta aims to use the models in its platforms, as well as on
| its Ray-Ban smart glasses, according to a report from Axios."
|
| I suppose that means the Ray-Ban smart glasses surveil the
| environment and upload the victims' identities to Meta,
| presumably for further training of models. Good that the EU
| protects us from such schemes.
| getcrunk wrote:
| Still no 14/30B parameter models since Llama 2. That seriously
| kills real usability for power users/DIY.
|
| The 7/8B models are great for PoCs and for moving to the edge
| for minor use cases... but there's a big, empty gap up to 70B,
| which most people can't run.
|
| The tin-foil hat in me says this is the compromise the powers
| that be have agreed to: being "open", but practically gimped
| for the average joe techie. Basically arms control.
| swader999 wrote:
| You don't need an F-15 to play, at least - a decent sniper
| rifle will do, and you can still practise even with a pellet
| gun. I'm running 70B models on my M2 Max with 96 GB of RAM.
| Even larger models sort of work, although I haven't really put
| much time into anything above 70B.
| foxhop wrote:
| The 4090 has 24G.
|
| So we really need a ~40B model (split across two cards), or
| something like a ~20B with some room left for the context
| window.
|
| The 5090 has ??G - still unreleased.
| kingkongjaffa wrote:
| llama3.2:3b-instruct-q8_0 is performing better than 3.1 8b-q4
| on my MacBook Pro M1. It's faster and the results are better:
| it answered a few riddles and thought experiments better
| despite being 3B vs 8B.
|
| I just removed my install of 3.1-8b.
|
| my ollama list is currently:
|
| $ ollama list
| NAME                            ID            SIZE    MODIFIED
| llama3.2:3b-instruct-q8_0       e410b836fe61  3.4 GB  2 hours ago
| gemma2:9b-instruct-q4_1         5bfc4cf059e2  6.0 GB  3 days ago
| phi3.5:3.8b-mini-instruct-q8_0  8b50e8e1e216  4.1 GB  3 days ago
| mxbai-embed-large:latest        468836162de7  669 MB  3 months ago
| taneq wrote:
| For a second I read that as " _it_ just removed my install of
| 3.1-8b" :D
| sk11001 wrote:
| Can one of these models be run on a single machine? What specs
| do you need?
| Y_Y wrote:
| Absolutely! They have a billion-parameter model that would run
| on my first computer if we quantized it to 1.5 bits. But
| realistically, yes: if you can fit it in system RAM you can run
| it slowly, and if you can fit it in GPU RAM you can probably
| run it fast enough to chat.
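A rough back-of-the-envelope for "can I fit it": weight memory is
roughly parameter count times bytes per weight, before adding the KV
cache and runtime overhead. The parameter counts below are
approximate and the numbers are estimates, not measurements.

  # Rough weight-memory estimate: params * bits/8, ignoring KV cache
  # and runtime overhead. Parameter counts are approximate.
  def approx_weight_gb(params_billion, bits_per_weight):
      return params_billion * bits_per_weight / 8  # billions of bytes ~ GB

  for name, params in [("1B", 1.2), ("3B", 3.2), ("11B", 10.7), ("90B", 88.0)]:
      print(name, {bits: round(approx_weight_gb(params, bits), 1)
                   for bits in (16, 8, 4)}, "GB")
  # e.g. a ~3.2B model is ~6.4 GB at fp16, ~3.2 GB at 8-bit, ~1.6 GB
  # at 4-bit, before adding memory for the context (KV cache).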
| GaggiX wrote:
| The 90B seems to perform pretty weakly on visual tasks compared
| to Qwen2-VL-72B: https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct,
| or am I missing something?
| kombine wrote:
| Are these models suitable for Code assistance - as an alternative
| to Cursor or Copilot?
| a_wild_dandan wrote:
| "The Llama jumped over the ______!" (Fence? River? Wall?
| Synagogue?)
|
| With 1-hot encoding, the answer is "wall", with 100% probability.
| Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE
| PENALTY, SCRUB!
|
| I believe this unforgiving dynamic is why model distillation
| works well. The original teacher model had to learn via the "hot
| or cold" game on _text_ answers. But when the child instead
| imitates the teacher's predictions, it learns _semantically
| rich_ answers. That strikes me as vastly more compute-efficient.
| So to me, it makes sense why these Llama 3.2 edge models punch so
| far above their weight(s). But it still blows my mind thinking
| how far models have advanced from a year or two ago. Kudos to
| Meta for these releases.
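A small PyTorch sketch of that contrast: hard-label cross-entropy
against a one-hot "wall" target versus distillation against the
teacher's full distribution over the candidates. The vocabulary and
probabilities are made up to illustrate the point.

  # Hard-label training vs. distillation on soft teacher probabilities.
  import torch
  import torch.nn.functional as F

  vocab = ["wall", "fence", "river", "synagogue"]
  student_logits = torch.tensor([[1.2, 1.0, 0.3, -2.0]])

  # Hard target: "wall" is right, every other guess is equally wrong.
  hard_loss = F.cross_entropy(student_logits, torch.tensor([0]))

  # Soft target: the teacher also gives "fence" real plausibility.
  teacher_probs = torch.tensor([[0.55, 0.35, 0.09, 0.01]])
  soft_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       teacher_probs, reduction="batchmean")

  print(f"one-hot CE: {hard_loss.item():.3f}  distill KL: {soft_loss.item():.3f}")
  # The KL term rewards the student for spreading probability the way
  # the teacher does, instead of punishing every non-"wall" answer equally.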
| bottlepalm wrote:
| What mobile devices can the smaller models run on? iPhone,
| Android?
| simonw wrote:
| I'm absolutely amazed at how capable the new 1B model is,
| considering it's just a 1.3GB download (for the Ollama GGUF
| version).
|
| I tried running a full codebase through it (since it can handle
| 128,000 tokens) and asking it to summarize the code - it did a
| surprisingly decent job, incomplete but still unbelievable for a
| model that tiny:
| https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...
|
| More of my notes here:
| https://simonwillison.net/2024/Sep/25/llama-32/
|
| I've been trying out the larger image models using the versions
| hosted on https://lmarena.ai/ - navigate to "Direct Chat" and
| you can select them from the dropdown and upload images to run
| prompts.
| GaggiX wrote:
| Llama 3.2 vision models don't seem that great if they have to
| be compared to Claude 3 Haiku or GPT-4o-mini. For an open
| alternative I would use the Qwen2-VL-72B model: it's smaller
| than the 90B and seems to perform quite a bit better. There's
| also Qwen2-VL-7B as an alternative to Llama-3.2-11B: smaller,
| better on visual benchmarks, and also Apache 2.0.
| foxhop wrote:
| Llama 3.0, 3.1, and 3.2 all use a tokenizer built on TikToken,
| OpenAI's open source tokenizer library.
| JohnHammersley wrote:
| Ollama post: https://ollama.com/blog/llama3.2
| gunalx wrote:
| The 3B was pretty good multilingually (Norwegian) - still a lot
| of gibberish at times, and way more sensitive than the 8B, but
| more usable than Gemma 2 2B for multilingual work, and fine at
| my standard "Python list sorter with args" question. But 90B
| vision just refuses all my actually useful tasks, like helping
| recreate images in HTML or doing anything useful with the image
| data other than describing it. I haven't gotten this stuck with
| 70B or OpenAI before. An insane amount of refusals, all the
| time.
| thimabi wrote:
| Does anyone know how these models fare in terms of multilingual
| real-world usage? I've used previous iterations of llama models
| and they all seemed to be lacking in that regard.
___________________________________________________________________
(page generated 2024-09-25 23:00 UTC)