[HN Gopher] MK-1
___________________________________________________________________
MK-1
Author : ejz
Score : 70 points
Date   : 2023-08-05 21:20 UTC (1 hour ago)
(HTM) web link (mkone.ai)
(TXT) w3m dump (mkone.ai)
| xianshou wrote:
| Not a single mention of existing quantization techniques? Ten
| bucks says this is just a wrapper around bitsandbytes or ggml.
| Philpax wrote:
| ...isn't this just quantization?
| bhouston wrote:
| Whatever it is, it will likely be copied into open source tooling
| like llama.cpp soonish, or something similar will arrive there
| anyway. It doesn't seem like a defensible advantage. It seems like
| a feature, and one fighting against fast-moving open source
| alternatives.
| atlas_hugged wrote:
| Exactly what I was thinking. Everyone already does this. Unless
| they're doing something else, they'll have to show why it's
| better than just quickly quantizing to 8 bits or 4 bits or
| whatever.
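|
| For reference, that existing route is already roughly "a few lines
| of Python" via Hugging Face + bitsandbytes (rough sketch, assuming
| a Llama 2 checkpoint you can access; the model id is just an
| example):
|
|     # pip install transformers accelerate bitsandbytes
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     model_id = "meta-llama/Llama-2-13b-hf"  # any causal LM on the Hub
|     tok = AutoTokenizer.from_pretrained(model_id)
|     # load_in_8bit=True swaps Linear layers for 8-bit bitsandbytes ones
|     model = AutoModelForCausalLM.from_pretrained(
|         model_id, device_map="auto", load_in_8bit=True)
|
|     inputs = tok("The capital of France is", return_tensors="pt")
|     out = model.generate(**inputs.to(model.device), max_new_tokens=16)
|     print(tok.decode(out[0], skip_special_tokens=True))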
| amelius wrote:
| If you look at the demo video, the output is exactly the same
| for both cases, so I doubt it uses quantization.
| metadat wrote:
| Too bad it's not an open source effort.
|
| I'm not a fan of proprietary dependencies in my stack, full stop.
| lolinder wrote:
| I seriously doubt this will go anywhere. The open source
| community has already achieved basically the same performance
| improvements via quantization. This feels like someone has
| repackaged those libraries and is going to try to sell them to
| unwary and uninformed AI startups.
| drtournier wrote:
| MKML == abstractions and wrappers for GGML?
| pestatije wrote:
| > Today, we're announcing our first product, MKML. MKML is a
| software package that can reduce LLM inference costs on GPUs by
| 2x with just a few lines of Python code. And it is plug and play
| with popular ecosystems like Hugging Face and PyTorch
| cududa wrote:
| No judgement, but I'm genuinely curious why you saw the need to
| comment with a random sentence from their post?
| qup wrote:
| It's not a random sentence, it's the main sentence everyone
| wants to read. They posted it to be helpful.
| Scene_Cast2 wrote:
| I've worked on ML model quantization. The open source 4-bit or
| 8-bit quantization isn't as good as one can get - there are much
| fancier techniques to keep predictive performance while squeezing
| size.
|
| Some techniques (like quantization-aware training) involve
| changes to training.
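|
| For anyone unfamiliar, the naive post-training baseline those
| fancier techniques improve on is basically per-tensor
| round-to-nearest (illustrative sketch, not any particular
| library's implementation):
|
|     import torch
|
|     def quantize_int8(w: torch.Tensor):
|         # symmetric per-tensor scale: map max |w| onto the int8 range
|         scale = w.abs().max() / 127.0
|         q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
|         return q, scale
|
|     def dequantize(q: torch.Tensor, scale: torch.Tensor):
|         return q.float() * scale
|
|     w = torch.randn(4096, 4096)
|     q, s = quantize_int8(w)
|     print((w - dequantize(q, s)).abs().mean())  # mean quantization error
|
| Per-channel scales, outlier handling, GPTQ-style error
| compensation, and quantization-aware training all push that error
| down further.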
| lolinder wrote:
| I'm sure there are better methods! But in this case, MKML's
| numbers just don't look impressive when placed alongside the
| prominent quantization techniques already in use. According to
| this chart [0] it's most similar in size to a Q6_K
| quantization, and if anything has slightly worse perplexity.
|
| If their technique _were_ better, I imagine that the company
| would acknowledge the existence of the open source techniques
| and show them in their comparisons, instead of pretending the
| only other option is the raw fp16 model.
|
| [0]
| https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...
| KRAKRISMOTT wrote:
| What about Unum's quantization methods?
|
| https://github.com/unum-cloud/usearch
| ipsum2 wrote:
| Aren't FasterTransformer (NVidia, OSS) and text-generation-
| inference (Huggingface, not OSS) faster than this?
| lolinder wrote:
| It's weird that not once do they mention or compare their results
| to the already-available quantization methods. I normally try to
| give benefit of the doubt, but there's really no way they're not
| aware that there are already widely used techniques for
| accomplishing this same thing, so the comparison benchmarks
| _really_ should be there.
|
| To fill in the gap, here's llama.cpp's comparison chart[0] for
| the different quantizations available for Llama 1. We can't
| compare directly with their Llama 2 metrics, but just comparing
| the percent change in speed and perplexity, MK-1 looks very
| similar to Q5_1. There's a small but not insignificant hit to
| perplexity, and just over a 2x speedup.
|
| If these numbers are accurate, you can download pre-quantized
| Llama 2 models from Hugging Face that will perform essentially
| the same as what MK-1 is offering, with the Q5 files here:
| https://huggingface.co/TheBloke/Llama-2-13B-GGML/tree/main
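|
| Running one of those files locally is also just a few lines, e.g.
| with llama-cpp-python (sketch; assumes you've downloaded one of the
| q5_1 GGML files from that repo and the filename matches):
|
|     # pip install llama-cpp-python
|     from llama_cpp import Llama
|
|     llm = Llama(
|         model_path="llama-2-13b.ggmlv3.q5_1.bin",  # file from the repo above
|         n_ctx=2048,         # context window
|         n_gpu_layers=40,    # offload to GPU if built with CUDA support
|     )
|     out = llm("Q: Why quantize an LLM? A:", max_tokens=64)
|     print(out["choices"][0]["text"])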
|
| [0] https://github.com/ggerganov/llama.cpp#quantization
| andy_xor_andrew wrote:
| Also, using the word "codecs" kind of leaves a bad taste in my
| mouth. It's like they're trying to sound like they invented an
| entirely new paradigm, with their own fancy name that reminds
| people of video compression.
| moffkalast wrote:
| Q5_1 is already old news too, K quants are faster and more
| space efficient for the same perplexity loss.
|
| https://old.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...
| lolinder wrote:
| For sure, but I couldn't find numbers for the K quants that
| included inference speeds, so I settled on the older one. If
| MK-1 were trying to be honest they'd definitely want to
| benchmark against the newest methods!
___________________________________________________________________
(page generated 2023-08-05 23:00 UTC)