[HN Gopher] Llama 2 on ONNX runs locally
___________________________________________________________________
Llama 2 on ONNX runs locally
Author : tmoneyy
Score : 51 points
Date : 2023-08-10 21:37 UTC (1 hour ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| hashtag-til wrote:
| This is very cool! I really hope the ONNX project gains much
| more adoption in the coming months and years and helps reduce
| the fragmentation in the ML ecosystem.
| brucethemoose2 wrote:
| Eh... I have seen ONNX demos for years, and they tend to stay
| barebones and slow, kinda like this.
|
| NCNN, MLIR and TVM based ports have been far more impressive.
| claytonjy wrote:
| I'm not sure there's much chance of that happening. ONNX seems
| to be the broadest in coverage, but for basically any model
| ONNX supports, there's a faster alternative.
|
| For the latest generative/transformer stuff (whisper, llama,
| etc.) it's often specialized C(++) code, but torch 2.0
| compilation keeps getting better, along with BetterTransformer,
| TensorRT, etc.
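|
| As a minimal sketch of what "torch 2.0 compilation" means in
| practice (the model name and GPU setup here are illustrative,
| not from the thread):
|
|     import torch
|     from transformers import AutoModelForCausalLM
|
|     # load an fp16 causal LM onto the GPU (illustrative model id)
|     model = AutoModelForCausalLM.from_pretrained(
|         "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
|     ).to("cuda")
|
|     # torch.compile JIT-compiles the forward pass on first call;
|     # the API is unchanged, only the kernels are
|     model = torch.compile(model)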
| turnsout wrote:
| Does anyone know the feasibility of converting the ONNX model to
| CoreML for accelerated inference on Apple devices?
| kiratp wrote:
| If you're working with LLMs, just use this -
| https://github.com/ggerganov/llama.cpp
|
| It has Metal support.
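|
| A minimal sketch of driving it from the llama-cpp-python
| bindings (the model path is illustrative; with a Metal build,
| n_gpu_layers > 0 offloads work to the GPU):
|
|     from llama_cpp import Llama
|
|     # assumes a local GGML model file, e.g. a q4_0 quant
|     llm = Llama(model_path="./llama-2-7b.ggmlv3.q4_0.bin",
|                 n_gpu_layers=1)
|     out = llm("Q: What is ONNX? A:", max_tokens=64)
|     print(out["choices"][0]["text"])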
| brucethemoose2 wrote:
| MLC's Apache TVM implementation can also compile to Metal.
|
| Not sure if they made an autotuning profile for it yet.
| mchiang wrote:
| They used to have this: https://github.com/onnx/onnx-coreml
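|
| Its documented usage was a one-call conversion (file names here
| are illustrative; the repo is archived, and ONNX support was
| later dropped from coremltools):
|
|     from onnx_coreml import convert
|
|     # convert an ONNX graph to a CoreML model
|     mlmodel = convert(model="model.onnx",
|                       minimum_ios_deployment_target="13")
|     mlmodel.save("model.mlmodel")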
| rrherr wrote:
| How does this compare to using
| https://github.com/ggerganov/llama.cpp with
| https://huggingface.co/models?search=thebloke/llama-2-ggml ?
| version_five wrote:
| Ggml / llama.cpp has a lot of hardware optimizations built in
| now, CPU, GPU and specific instruction sets like for apple
| silicon (I'm not familiar with the names). I would want to know
| how many of those are also present in onnx and available to
| this model.
|
| There are currently also more quantization options available,
| as mentioned elsewhere in the thread. Though those incur a
| quality loss (they make the model faster but worse), so it
| depends on what you're optimizing for.
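|
| A back-of-the-envelope sketch of what those options trade off
| in memory (bits-per-weight figures are approximate for the
| ggml formats of the time):
|
|     PARAMS = 7e9  # a 7B-parameter model
|     bits_per_weight = {"f16": 16, "q8_0": 8.5,
|                        "q5_1": 6.0, "q4_0": 4.5}
|     for name, bits in bits_per_weight.items():
|         gib = PARAMS * bits / 8 / 2**30
|         print(f"{name}: ~{gib:.1f} GiB")
|     # f16 ~13.0 GiB vs q4_0 ~3.7 GiB of weight storage,
|     # roughly the 3-4x gap mentioned elsewhere in the thread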
| brucethemoose2 wrote:
| ONNX is a format. There are different runtimes for different
| devices... But I can't speak for any of them.
|
| > specific instruction sets like for apple silicon
|
| You are thinking of the Accelerate framework support, which
| is basically Apple's ARM CPU SIMD library.
|
| But Llama.cpp also has a Metal GPU backend, which is the de
| facto backend for Apple devices now.
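|
| In practice, "different runtimes" often comes down to picking
| an execution provider in ONNX Runtime; a minimal sketch
| (whether the CoreML provider is available depends on how
| onnxruntime was built):
|
|     import onnxruntime as ort
|
|     sess = ort.InferenceSession(
|         "model.onnx",  # illustrative file name
|         providers=["CoreMLExecutionProvider",
|                    "CPUExecutionProvider"],
|     )
|     # falls back to CPU if the CoreML provider is unavailable
|     print(sess.get_providers())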
| brucethemoose2 wrote:
| Very unfavorably. Mostly because the ONNX models are FP32/FP16
| (so ~3-4x the RAM use), but also because llama.cpp is well
| optimized with many features (like prompt caching, grammar,
| device splitting, context extending, cfg...)
|
| MLC's Apache TVM implementation is also excellent. The
| autotuning in particular is like black magic.
| skeletoncrew wrote:
| I tried quite a few of these and the ONNX one seems the most
| elegantly put together of all. I'm impressed.
|
| Speed can be improved. Quick and dirty/hype solutions, not
| sure.
|
| I really hope ONNX gets the traction it deserves.
| brucethemoose2 wrote:
| > ONNX one seems the most elegantly put together of all.
|
| What do you mean by this? The demo UI? Code quality?
| version_five wrote:
| > Quick and dirty/hype solutions, not sure.
|
| Curious what you mean by this
| moffkalast wrote:
| These are still FP16/32 models, almost certainly a few times
| slower and larger than the latest N-bit quantized GGMLs.
| glitchc wrote:
| How was this allowed? I was under the impression that companies
| the size of Microsoft needed to contact Meta to negotiate a
| license.
|
| Excerpt from the license:
|
| _Additional Commercial Terms. If, on the Llama 2 version release
| date, the monthly active users of the products or services made
| available by or for Licensee, or Licensee's affiliates, is
| greater than 700 million monthly active users in the preceding
| calendar month, you must request a license from Meta, which Meta
| may grant to you in its sole discretion, and you are not
| authorized to exercise any of the rights under this Agreement
| unless or until Meta otherwise expressly grants you such rights._
| amelius wrote:
| > To get access permissions to the Llama 2 model, please fill
| out the Llama 2 access request form. If allowable, you will
| receive GitHub access in the next 48 hours, but usually much
| sooner.
|
| I guess they send the form to Meta?
|
| Anyway, I hope this is not what Open Source will be like from
| now on.
| thadk wrote:
| > Meta and Microsoft have been longtime partners on AI,
| starting with a collaboration to integrate ONNX Runtime with
| PyTorch to create a great developer experience for PyTorch on
| Azure, and Meta's choice of Azure as a strategic cloud
| provider. (sic)
|
| https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-me...
| stu2b50 wrote:
| So they negotiated a license? Meta partnered with Azure for the
| Llama 2 launch, there's no reason to think that they're
| antagonistic towards each other.
___________________________________________________________________
(page generated 2023-08-10 23:00 UTC)