       COMMENT PAGE FOR:
 (HTM)   TurboQuant: Redefining AI efficiency with extreme compression
       
       
        antiresonant wrote 3 hours 28 min ago:
        At this rate, the current AI era is going to clear the queue of all
        mathematics that's ever been created but not yet applied.
       
        antoniuschan99 wrote 15 hours 55 min ago:
        It could turn a 1M context system to a 4M context system.
        TurboQuant-style KV-cache compression makes longer context windows
        cheaper to serve. I'm not exactly sure how much the context size would
        increase, though.
       
        htrp wrote 22 hours 46 min ago:
        The actual paper from April 2025
        
        TurboQuant: Online Vector Quantization with Near-optimal Distortion
        Rate
        
 (HTM)  [1]: https://arxiv.org/abs/2504.19874
       
        wbsun wrote 23 hours 21 min ago:
        The blog is new but the paper was submitted almost one year ago: [1].
        Does anyone know whether this is already implemented in many models
        (at least Gemini, I guess)? If that's the case, can I expect cheaper
        RAM for my computer? :D
        
 (HTM)  [1]: https://arxiv.org/abs/2504.19874
       
        mesuvash wrote 23 hours 24 min ago:
        TurboQuant explained with an easy to understand (no-math) animation
        
 (HTM)  [1]: https://mesuvash.github.io/blog/2026/turboquant-interactive/
       
          fc417fc802 wrote 18 hours 47 min ago:
          Someone else linked that elsewhere in the comments and while it's
          certainly a nice visual it seems like it's not accurately portraying
          the paper. Isn't the grid supposed to have a weird alignment that
          depends on the bit depth? And there's supposed to be a second
          quantization step involving the residual.
       
            mesuvash wrote 15 hours 49 min ago:
            Fair point. I've updated the animation to address this. The grid
            now uses the correct non-uniform centroids (optimal for the arcsine
            distribution in 2D), so you'll see grid lines cluster near the
            edges where unit-circle coordinates actually concentrate, rather
            than being evenly spaced. The spacing does change with bit depth.
            
            On the second quantization step: the paper's inner-product variant
            uses (b-1) bits for the MSE quantizer shown here, then applies a
            1-bit QJL (Quantized Johnson-Lindenstrauss) encoding of the
            residual to make dot-product estimates unbiased. I chose to omit
            QJL from the animation to keep it digestible as a visual, but I've
            added a note calling this out explicitly.
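            
            A small sketch of the edge-clustering claim above, assuming a point
            drawn uniformly from the unit circle so that each coordinate is
            x = cos(theta). The name edge_mass is made up for this
            illustration; it is not taken from the paper or the animation.

```python
import math

def edge_mass(threshold):
    """P(|x| > threshold) when x = cos(theta) and theta is uniform on [0, 2*pi)."""
    return 2.0 * math.acos(threshold) / math.pi

arcsine_edge = edge_mass(0.9)   # mass in the outer tenth of [-1, 1], both sides
uniform_edge = 1.0 - 0.9        # same region if x were spread evenly on [-1, 1]

print(f"arcsine: {arcsine_edge:.3f}  uniform: {uniform_edge:.3f}")
```

            Roughly 29% of the coordinates land in the outer tenth of the
            range, against 10% for evenly spread values, which is why evenly
            spaced grid lines would waste resolution in the middle.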
       
              fc417fc802 wrote 6 hours 26 min ago:
              It looks nice! Fair enough about QJL - it seems to be nothing
              more than an unbiasing measure anyway.
              
              I'm not sure if it's my own misunderstanding or if the paper [0]
              has something of an error. Section 3.1 starts out to the effect
              "let x be on the unit hypersphere" (but I'm fairly certain it's
              actually not). Neither algorithm 1 nor algorithm 2 show a
              normalization step prior to rotating x. Algorithm 2 line 8 shows
              that the scalar returned is actually the magnitude of the
              residual without accounting for QJL.
              
              Anyway I'm pretty sure the authors inadvertently omitted that
              detail which really had me confused for a while there.
              
              [0]
              
 (HTM)        [1]: https://arxiv.org/abs/2504.19874
       
                mesuvash wrote 4 hours 51 min ago:
                IIUC, the paper's notation S^(d-1) means the unit sphere in R^d
                (e.g., the familiar unit circle is S^1 living in R^2). So, I
                think, x in the algorithm is already a unit vector.
                
                Reference:
                Section 2: Preliminaries
                ...
                We use the notation S^(d-1) to denote the hypersphere in R^d of
                radius 1.
                
                Section 3.1
                Let x ∈ S^(d-1) be a (worst-case) vector on the unit sphere
                in dimension d.
       
        bdcs wrote 23 hours 28 min ago:
        Here's my attempt at an undergrad-level summary (corrections welcome!):
        
        The core idea is to quantize KV cache, but do so in a way that destroys
        minimal information. In this case, it's similarity scores between
        vectors. The simplest way to do this is to change all the elements from
        16 bits of precision to, say, 4 bits (Scalar Quant.). These papers
        improve on it by realizing: almost all the energy (concentration of
        measure) is towards the equator of the hypersphere (normally
        distributed as 1/d; d=vector dimensionality). (The curse/blessing of
        hyper dimensionality strikes again.) So when we quantize the elements
        (think "latitudes", e.g. to the nearest degree) we destroy a lot of
        information because basically all the vectors were around the equator
        (so some latitudes have a lot of vectors and some have very few). The
        idea is to rotate the vectors away from the equator so they're more
        consistently distributed (to better preserve the entropy during
        quantization, which I guess was amitport's DRIVE idea). PolarQuant does
        a hyperpolar coordinate transform which superficially seems neat for
        preserving entropy because of this equator/polar framing (and
        ultimately unnecessary as shown by TurboQuant). They also realized
        there's a bias to the resulting vectors during similarity, so they
        wrote the QJL paper to fix the bias. And then the TurboQuant paper took
        PolarQuant + QJL, removed the hyperpolar coords, and added in some
        gross / highly-pragmatic extra bits for important channels (i.e.,
        elements of the vectors) which is sort of a pathology of LLMs these
        days but it is what it is. Et voila, highly compressed KV Cache. If
        you're curious why you can randomly rotate the input, it's because all
        the vectors are rotated the same, so similarity works out. You could
        always un-rotate to get the original, but there's no need because the
        similarity on rotated/unrotated is the same if you compare apples to
        apples (with the QJL debiasing). Why was PolarQuant even published?
        Insu Han is solely on that paper and demanded/deserved
        credit/promotion, would be my guess. The blog post is chock-full of
        errors and confusions.
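        
        A minimal sketch of why rotating everything the same way leaves
        similarity scores alone. It assumes a random orthogonal matrix built
        by Gram-Schmidt on Gaussian vectors (which may include a reflection);
        the helper names are invented here and no quantization is applied.

```python
import math
import random

def random_rotation(d, rng):
    """Return a d x d orthogonal matrix (list of rows) via Gram-Schmidt."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the rows chosen so far.
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip a (vanishingly rare) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def rotate(matrix, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

rng = random.Random(0)
d = 16
R = random_rotation(d, rng)
q = [rng.gauss(0.0, 1.0) for _ in range(d)]   # stand-in for a query vector
k = [rng.gauss(0.0, 1.0) for _ in range(d)]   # stand-in for a key vector

before = dot(q, k)
after = dot(rotate(R, q), rotate(R, k))
print(abs(before - after))  # difference is floating-point noise only
```

        The two dot products agree up to rounding error, so any bias seen in
        practice comes from the quantization step, not from the rotation.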
       
          bdcs wrote 23 hours 18 min ago:
          Some corrections: the vectors are un-rotated in practice for future
          query vectors. This could be removed with a slightly different LLM
          arch.
          
          PolarQuant does live on in TurboQuant's codebooks for quantization
          which borrows from the hyperpolar coords
       
            fc417fc802 wrote 5 hours 53 min ago:
            > added in some gross / highly-pragmatic extra bits for important
            channels
            
            I'm curious what you meant by that. I understood it to only have
            the MSE quantization vector, a 1-bit QJL vector, and a scalar
            magnitude.
            
            > PolarQuant does live on in TurboQuant's codebooks for
            quantization which borrows from the hyperpolar coords
            
            Isn't the turbo codebook the irregularly spaced centroid grid?
       
        parsimo2010 wrote 1 day ago:
        This blog post sucks. It does not make me want to read the papers.
        
        Look at this figure: [1] The speedup labels on the vertical axis are 0,
        2, 2, 4, 6, 8... Why is 2 repeated? Did they just have nano-banana make
        them some charts? Can they not be bothered to use matplotlib or bokeh
        and directly render a graph? I don't know, maybe there is some
        legitimate reason that I don't know about for making a single value
        occur multiple times on a graph axis, but if that is the case, then
        they probably need to explain it in the figure caption. So it's either
        a "GenAI special" or it's poor communication about how to read the
        graph...
        
        Look at this video visualization: [2] Do you have literally any clue
        what Polar Quantization is? Would this make me think, "I kind of have a
        high level understanding of that, let me go get the details from the
        paper."
        
        Look at this figure: [3] The left hand side of the graph, which is
        normally assumed to start at 0, starts at 48. Those MASSIVE differences
        you see in the figure? Only a few percent. And that's a deception but
        only if the figure is even accurate, because we saw earlier they can't
        even get figure axes correct.
        
 (HTM)  [1]: https://storage.googleapis.com/gweb-research2023-media/images/...
 (HTM)  [2]: https://storage.googleapis.com/gweb-research2023-media/media/Q...
 (HTM)  [3]: https://storage.googleapis.com/gweb-research2023-media/images/...
       
          davesque wrote 19 hours 36 min ago:
          Yeah, the viz for polar quantization is straight up nonsensical. 
          Okay, so some colors are converted into clocks and then into a bigger
          box with a pink box inside of it.  Got it.  Even understanding what
          polar coordinates are doesn't help you make sense out of it.
       
        Serhii-Set wrote 1 day ago:
        Compression research keeps producing surprisingly practical results.
        The interesting parallel in image formats — AVIF and JPEG XL both
        came from video codec research (AV1 and JPEG committee respectively),
        and the compression gains translated almost directly. Makes me wonder
        how much of the current AI quantization work will eventually land in
        production inference the same way.
       
          computerbuster wrote 1 day ago:
          JPEG XL is mainly based on unique image-specific research, but you're
          right to say a lot of the techniques are compatible with videos in
          theory (the XYB color space comes to mind). AVIF is an AV1 OBU in an
          image-specific container, and required a lot of image-specific
          engineering to make AV1's tools useful for images; see libaom's tune
          "iq", and the same in SVT-AV1. The compression gains translated when
          engineering effort went into creating bespoke implementations, and
          the same may happen for LLMs if I had to guess.
       
            Serhii-Set wrote 1 day ago:
            The XYB color space detail is really interesting — I wasn't aware
            of how much image-specific engineering went into making AV1 tools
            work for stills. The libaom 'iq' tuning makes sense in retrospect.
            So the compression gains in AVIF weren't just inherited from AV1
            video work but required significant additional optimization. That
            makes the JXL comparison more nuanced too — JXL was designed
            image-first from the start, which might explain why it encodes
            faster despite similar or better compression ratios.
       
        naasking wrote 1 day ago:
        This sounds great! TurboQuant does KV cache compression using
        quantization via rotations, and ParoQuant [1] does weight compression
        using quantization via rotations! So we can get 4-bit weights that
        match bf16 precision, the KV cache goes down to 3 bits per key. This
        brings larger models and long contexts into the range of "possibly
        runnable" on beefy consumer hardware.
        
 (HTM)  [1]: https://github.com/z-lab/paroquant
       
        mmastrac wrote 1 day ago:
        Is this a tradeoff between GPU-computation-expense vs accuracy? ie: you
        could quantize into segments or grids on the unit circle/sphere/etc,
        but that's too expensive so it's better to just quantize to a Cartesian
        grid because the GPU can decompress cheaper?
       
        gavinray wrote 1 day ago:
        Can someone ELI5 these two concepts please, which make no sense to me:
        
          > "TurboQuant starts by randomly rotating the data vectors. This
        clever step simplifies the data's geometry"
        
        I don't understand how taking a series of data and applying a random
        rotation could mathematically lead every time to "simpler" geometry.
        
        If I throw a bunch of shapes on the ground, tightly packed and touching
        each other, then rotate all of them, you can't guarantee that the new
        conglomerate shape is any more/less "simple" than before, right?
        
          > "Johnson-Lindenstrauss Transform to shrink complex,
        high-dimensional data while preserving the essential distances and
        relationships between data points. It reduces each resulting vector
        number to a single sign bit (+1 or -1)."
        
        How can a boolean value preserve all of the relational and positional
        information between data points?
       
          elif wrote 18 hours 10 min ago:
          i could be mistaken but from my read, the 'rotation' aspect is
          nothing new and not dissimilar from normal spin quant, where the
          importance matrix is rotated during  calibration such that the local
          minima/maxima are more evenly smoothed and excessive/redundant
          quantization of parameters is avoided.
          
          as for the J-L transformation, it is way above my head so i'm almost
          certainly mistaken but it seems to be some clever way to use a bit as
          a sort of pointer in order to reuse existing chunks of parameter
          weight data like in a jpeg or zip compression algorithm.
       
          gopalv wrote 1 day ago:
          > I don't understand how taking a series of data and applying a
          random rotation could mathematically lead every time to "simpler"
          geometry.
          
          Let's pick a simpler compression problem where changing the frame of
          reference improves packing.
          
          There's a neat trick in the context of floating point numbers.
          
          The values do not always compress when they are stored exactly as
          given.
          
          [0.1, 0.2, 0.3, 0.4, 0.5]
          
          Maybe I can encode them in 15 bytes instead of 20 as float32.
          
          Up the frame of reference to be decibels instead of bels and we can
          encode them as sequential values without storing exponent or sign
          again.
          
          Changing the frame of reference, makes the numbers "more alike" than
          they were originally.
          
          But how you pick a good frame of reference is all heuristics and
          optimization gradients.
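          
          A small sketch of the unit-change trick described above, under the
          assumption that "decibels instead of bels" simply means scaling by
          10 and that one unsigned byte per value is an acceptable target. The
          byte layout is an illustration, not a real compression format.

```python
import struct

values = [0.1, 0.2, 0.3, 0.4, 0.5]

# Stored as given: five float32 values, 4 bytes each.
as_float32 = struct.pack("<5f", *values)

# Change the unit (tenths instead of ones): the values become small integers.
scale = 10
tenths = [round(v * scale) for v in values]
as_uint8 = struct.pack("<5B", *tenths)

# Decoding divides by the same scale factor.
decoded = [t / scale for t in struct.unpack("<5B", as_uint8)]

print(len(as_float32), len(as_uint8), decoded)
```

          Nothing was lost here only because the inputs happened to sit on a
          grid of tenths; the hard part, as noted, is finding a frame in which
          real data looks this regular.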
       
          kingstnap wrote 1 day ago:
          Other people have answered here but the real answer is that deep
          neural networks don't learn isotropic distributions of activations.
          
          What happens is that you get very spiky activations; there are
          so-called "outlier" activations. An easy-to-read paper that tells
          you about this is SmoothQuant [0]. Another source, from Anthropic and
          the Mechanistic Interpretability people, calls these a "privileged
          basis" [1].
          
          Now based on the weight symmetries of a typical transformer, these
          actually don't need to exist. Weight symmetries means the ways you
          can change the weights without actually affecting the mathematical
          function, there are a broad class of these because the linear algebra
          has a lot of redundancies in it.
          
          But the behaviour of the Adam optimizer is such that you do end up w/
          these things because it sort of more quickly optimizes to produce
          them. This comes from the fact it is an elementwise dynamic learning
          rate (and probably partly to do with the epsilon).
          
          [0] [1]
          
 (HTM)    [1]: https://arxiv.org/pdf/2211.10438
 (HTM)    [2]: https://transformer-circuits.pub/2023/privileged-basis/index...
       
            gavinray wrote 1 day ago:
            From your second paper:
            
              > In particular, we can generate fixed random rotation matrices
            at initialization, and multiply them into the activations any time
            we read from or write to the residual stream. 
            
            I guess I was mistaken in assuming this part was part of the
            TurboQuant-specific innovations. Still an interesting concept
            though
       
            Bolwin wrote 1 day ago:
            Do you know if this also applies to the muon optimizer? It seems to
            be replacing adamw
       
              kingstnap wrote 1 day ago:
              My guess is that probably not for Muon. What I said about ADAM
              was partly based on this blogpost I read some time ago, should
              have cited it as well [0].
              
              The thing about Muon is that it doesn't have this specific
              feature of ADAM that causes it to "move along the diagonal".
              Basically if you flatten weights as a huge vector of a few
              billion elements. SGD moves along the gradient, which isn't
              biased. ADAM normalizes everything elementwise, so it sort of
              moves along a vector of +-1.
              
              This isn't a proof or anything, but what you can imagine might be
              happening is that if you move along +-1, then you find spikey
              solutions somehow. Not sure how to prove that. Muon doesn't
              really do this, but it has its own sort of funky reshaping of the
              update (it moves along low rank directions).
              
              [0]
              
 (HTM)        [1]: https://www.lesswrong.com/posts/yrhu6MeFddnGRSLtQ/adam-o...
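              
              A minimal sketch of the "moves along a vector of +-1" remark,
              assuming the textbook Adam update with its usual default
              hyperparameters and looking only at the very first step. The
              function name and the gradient values are made up.

```python
import math

def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter update produced by Adam on its very first step (t = 1)."""
    update = []
    for g in grad:
        m = (1 - beta1) * g            # first-moment estimate
        v = (1 - beta2) * g * g        # second-moment estimate
        m_hat = m / (1 - beta1 ** 1)   # bias correction at t = 1
        v_hat = v / (1 - beta2 ** 1)
        update.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return update

grad = [100.0, -0.01, 3.0, -250.0]        # wildly different magnitudes
sgd_update = [-1e-3 * g for g in grad]    # plain SGD keeps the magnitudes
adam_update = adam_first_step(grad)       # Adam keeps (almost) only the signs

print(sgd_update)
print(adam_update)
```

              On later steps the running averages blur this, so the update is
              only roughly sign-like, but every coordinate still moves at a
              similar scale regardless of its gradient magnitude.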
       
          photon_lines wrote 1 day ago:
          The whole goal of quantisation is to put the data into 'bins' so that
          it can easily be 'packed' and represented using fewer bits (less
          information). You can think of it as rounding, essentially
          (3.14159 -> 3).
          
          Now, sometimes the distribution of the data is non-ideal for
          separating it into bins. Let's say that our rounding rule is simple:
          we use a floor function, so 2.45 maps to 2, 6.4543 maps to 6, and so
          on, and our bins simply map to the floor. If we had a set of numbers
          like [3.11, 4.43, 5.78, 12.33, 34.32], they would map to
          [3, 4, 5, 12, 34]. To create bins for this set we would need 6 bits
          of information (2 to the power of 6 = 64), but this is mostly due to
          the fact that we have one huge outlier (34.32).
          
          To get rid of this, the algorithm applies a random rotation matrix
          which 'distorts' the original data so that it is more evenly
          distributed among the possible bins assigned to the data set. In
          linear algebra, a rotation matrix is an orthogonal matrix. When you
          multiply your vector by this matrix, you aren't changing the
          "amount" of data (the length of the vector remains the same), but
          you are recalculating every single number in that vector as a
          weighted sum of the originals. According to the Central Limit
          Theorem, when you sum up many random things, the result tends to
          look like a bell curve. This is the magic TurboQuant relies on: they
          don't know what your data looks like, but they know that after the
          rotation, the data must look like a Beta Distribution, and they use
          this fact to transform the original data into a more 'tightly
          packed' distribution which allows them to pack (or quantise) the
          information more efficiently. If most of the transformed data is
          huddled together in a predictable bell-curve shape, you can pack
          your bins tightly around that shape, leading to much higher
          precision with fewer bits needed to store it. For example, after
          applying a rotation matrix, our original vector
          [3.11, 4.43, 5.78, 12.33, 34.32] might get mapped to something like
          [8.12, 8.65, 9.25, 10.53, 12.86] (illustrative numbers, not an
          actual rotation), and we can create bins which are both more
          accurate and need fewer bits to hold our original data set.
          
          To create the most optimal bins, the Lloyd-Max algorithm is used.
          This algorithm is the gold standard for 1D quantisation. Its goal is
          to find the best places to put your "boundaries" (where you cut the
          data) and your "reconstruction values" (the number you store) to
          minimise the Mean Squared Error (MSE).
          
          After applying this, you have your 'rounded' values (or quantized
          data), but there is still an error value which is missing from our
          data set, and this is where the residual bit comes in. That bit
          doesn't represent the original data (or vector); it simply
          represents our 'bias' after we apply the above algorithms. It's
          basically like a '1-bit note' which allows you to cancel out the
          bias terms which our above quantisation algorithm produces, making
          the 'interactions' (or inner products) when we multiply our values
          together extremely accurate again even after transforming our
          original data. Does this make sense?
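          
          A rough sketch of the Lloyd-Max step mentioned above, run on
          Gaussian samples as a stand-in for rotated coordinates. This is the
          sample-based Lloyd iteration, which is an assumption on my part; the
          paper may well derive its codebook from the known distribution
          instead. All names and the 3-bit setting are chosen for illustration.

```python
import random

def nearest(levels, x):
    """Index of the reconstruction level closest to x."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))

def mse(samples, levels):
    return sum((x - levels[nearest(levels, x)]) ** 2 for x in samples) / len(samples)

def lloyd_max(samples, levels, iterations=25):
    """Alternate nearest-level assignment and centroid updates."""
    levels = list(levels)
    for _ in range(iterations):
        buckets = [[] for _ in levels]
        for x in samples:
            buckets[nearest(levels, x)].append(x)
        # Move each level to the mean of its bucket; leave empty buckets alone.
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(buckets, levels)]
    return levels

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]

bits = 3
n_levels = 2 ** bits
lo, hi = min(samples), max(samples)
# Baseline: evenly spaced levels across the observed range.
uniform = [lo + (hi - lo) * (i + 0.5) / n_levels for i in range(n_levels)]
tuned = lloyd_max(samples, uniform)

uniform_mse = mse(samples, uniform)
tuned_mse = mse(samples, tuned)
print(uniform_mse, tuned_mse)
```

          Each iteration can only lower (or keep) the squared error, so the
          tuned levels are never worse than the evenly spaced starting grid
          on these samples.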
       
            rtrgrd wrote 19 hours 13 min ago:
            Added to my non-llm username list :)
            
            Thanks so much for the explanation
       
            gavinray wrote 1 day ago:
            I had to read this over a few times to piece it together, thanks
            for the thorough and digestible explanation!
       
            nico wrote 1 day ago:
            Amazing explanation! Thank you so much for taking the time to put
            it together. It makes a lot of sense. I’m not the one who asked
            the question, but I was impressed by such an eloquent and clearly
            explained answer.
       
              photon_lines wrote 5 hours 24 min ago:
              Thank you! I'm glad you found it helpful (and that others did
              too)!!
       
            rohansood15 wrote 1 day ago:
            Thank you.
       
          lumost wrote 1 day ago:
          They are saying that models should be invariant to data's orientation
          - and only sensitive to the distance between vectors. This has a
          pretty significant effect on reducing the set of possible models, and
          may stabilize the optimization.
          
          In simple terms, large ML models like LLMs often learn trivial rules
          such as "if the 21st decimal place of the 5th dimension in the
          embedding vector is 5 - then the image is of a cat." Learning such a
          memorization function is usually not what we are trying to do, and
          there are a variety of techniques to avoid these trivial solutions
          and "smooth" the optimization geometry.
       
          wordpad wrote 1 day ago:
          They are not doing random rotation, simplification here means they
          are aligning the outliers. If you threw a bunch of shapes on the
          ground they are picking up one that rolled away and putting it with
          the others.
          
          >How can a boolean value preserve all of the relational and
          positional information between data points?
          
          They aren't reducing the entire vector to a boolean, only each of its
          dimensions.
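          
          A sketch of how one sign bit per projected dimension can still carry
          geometric information. This is plain sign random projection
          (SimHash-style), used here as a stand-in; it is not the QJL
          estimator from the paper, and the dimensions and bit count are
          arbitrary choices.

```python
import math
import random

def sign_bits(x, directions):
    """One bit per random direction: is the projection of x non-negative?"""
    return [sum(a * b for a, b in zip(x, g)) >= 0 for g in directions]

rng = random.Random(0)
d, m = 8, 2000   # vector dimension, number of sign bits kept
directions = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

true_angle = 1.0  # radians between the two test vectors
x = [1.0] + [0.0] * (d - 1)
y = [math.cos(true_angle), math.sin(true_angle)] + [0.0] * (d - 2)

bx, by = sign_bits(x, directions), sign_bits(y, directions)
# For Gaussian directions, two bits disagree with probability angle / pi.
mismatch = sum(a != b for a, b in zip(bx, by)) / m
estimated_angle = math.pi * mismatch
print(true_angle, estimated_angle)
```

          No single bit says much, but the fraction of disagreeing bits tracks
          the angle between the vectors, and the estimate tightens as more
          bits are kept.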
       
        iddan wrote 1 day ago:
        I am guessing as Google is vertically integrated and "actually pays"
        for AI infra (compared to OpenAI & Anthropic, which receive hardware
        through partnerships) they have a more urgent incentive to reduce model
        sizes. Also, Google and Apple will be the first to gain from running
        models on-device.
       
          skybrian wrote 1 day ago:
          This seems to be an inference-time optimization and they are putting
          AI on every search result page. That seems like plenty of incentive
          to optimize.
       
          mrcwinn wrote 1 day ago:
          I can assure you OpenAI and Anthropic pay for hardware. They don’t
          receive it for free.
       
        _s_a_m_ wrote 1 day ago:
        has the word "advanced", gotta be good
       
        akhenakh wrote 1 day ago:
        Someone is already implementing it in llama.cpp:
        
 (HTM)  [1]: https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c91f...
       
          vibe42 wrote 1 day ago:
          The pace of development in llama.cpp is really high, could see an
          implementation being merged in 4-6 weeks.
       
          GistNoesis wrote 1 day ago:
          He even attempts to improve on the paper by replacing the random
          rotation operation which is O(d^2), by a Subsampled Randomized
          Hadamard Transform which can be computed in O(d*log d).
          
          Hopefully the Johnson–Lindenstrauss lemma applies in the same way to
          SRHT-transformed vectors as it does to randomly rotated vectors, the
          independence of the distribution laws of the coordinates still
          holds, and quantizing each coordinate independently is therefore
          still theoretically sound.
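          
          A minimal sketch of the O(d*log d) idea, assuming the simplest
          variant: random sign flips followed by a normalized fast
          Walsh-Hadamard transform, with no subsampling. It is not the code
          from the linked commit, and the function names are invented.

```python
import math
import random

def fwht(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly pass: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def randomized_hadamard(x, signs):
    """Flip signs with a fixed random +/-1 vector, then apply the transform."""
    return fwht([s * v for s, v in zip(signs, x)])

rng = random.Random(0)
d = 8
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

x = [0.0] * d
x[3] = 4.0  # a single large "outlier" coordinate
y = randomized_hadamard(x, signs)
print(y)
```

          The lone outlier is spread evenly over all eight coordinates while
          the vector's length is unchanged, using log2(d) butterfly passes
          instead of a dense d x d matrix multiply.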
       
          cpburns2009 wrote 1 day ago:
          For some reason I thought the implementation would be way more
          complicated than that. I obviously lack the domain knowledge to
          tackle something like this, but it looks straightforward.
       
            qingcharles wrote 1 day ago:
            Agreed. Actual LOC is tiny. Very impressive PR.
       
        lwhi wrote 1 day ago:
        Will this help us run models locally?
       
        bilsbie wrote 1 day ago:
        It seems like most breakthroughs I see are for efficiency?  What are
        the most important breakthroughs from the past two or three years for
        intelligence?
       
          cubefox wrote 1 day ago:
          >  What are the most important breakthroughs from the past two or
          three years for intelligence?
          
          The most important one in that timeframe was clearly reasoning/RLVR
          (reinforcement learning with verifiable rewards), which was pioneered
          by OpenAI's Q* aka Strawberry aka o1.
       
          Lerc wrote 1 day ago:
          If you think of it from the point of view of the universal
          approximation theorem, it's all efficiency optimisation. We know that
          it works if we do it incredibly inefficiently.
          
          Every architecture improvement is essentially a way to achieve the
          capability of a single fully-connected hidden layer network n wide,
          with fewer parameters.
          
          Given these architectures usually still contain fully connected
          layers, unless they've done something really wrong, they should still
          be able to do anything if you make the entire thing large enough.
          
          That means a large enough [insert model architecture] will be able to
          approximate any function to arbitrary precision.  As long as the
          efficiency gains with the architecture are retained as the scale
          increases they should be able to get there quicker.
       
          ertgbnm wrote 1 day ago:
          Most breakthroughs that are published are for efficiency because most
          breakthroughs that are published are for open source.
          
          All the foundation model breakthroughs are hoarded by the labs doing
          the pretraining. That being said, RL reasoning training is the
          obvious and largest breakthrough for intelligence in recent years.
       
            WarmWash wrote 1 day ago:
            With all the floating around of AI researchers though, I kind of
            wonder how "secret" all these secrets are. I'm sure they have
            internal siloing, but even still, big players seem to regularly
            defect to other labs. On top of this, all the labs seem to be
            pretty neck and neck, with no one clearly pulling ahead across the
            board.
       
          irthomasthomas wrote 1 day ago:
          Efficiency gains can be used to make existing models more profitable,
          or to make new larger and more intelligent models.
       
            cubefox wrote 1 day ago:
            Some yes, others no. Distillation and quantization can't be used to
            make new base models since they require a preexisting one.
       
              irthomasthomas wrote 1 day ago:
              it enables models larger than was previously possible.
       
                cubefox wrote 1 day ago:
                No because the base model from which the distilled or quantized
                models are derived is larger.
       
        pstoll wrote 1 day ago:
        And a group has published an independent working implementation today,
        nice to see:
        
 (HTM)  [1]: https://github.com/tonbistudio/turboquant-pytorch
       
          ilija139 wrote 1 day ago:
          It has a much clearer explanation of the method than Google's own
          post.
       
            ramon156 wrote 1 day ago:
            Well, yeah. Claude simplified it. That doesn't mean it's a better
            explanation.
       
              adi_kurian wrote 1 day ago:
              Did it lose important detail?
       
        ssijak wrote 1 day ago:
        For my grug brain can somebody translate this to ELIgrug terms?
        
        Does this mean I would be able to run 500b model on my 48gb macbook
        without losing quality?
       
          prabal97 wrote 2 hours 2 min ago:
          I wrote this more intuitive explanation. I think you might find it
          helpful!
          
 (HTM)    [1]: https://prabal.ca/posts/google-long-context-cheaper/
       
          x_may wrote 1 day ago:
          KV cache compression, so how much memory the model needs to use for
          extending its context. Does not affect the weight size.
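          
          A back-of-the-envelope sketch of what that means in bytes. The model
          shape below is hypothetical, picked only to make the arithmetic
          round, and the formula ignores any per-vector overhead such as
          stored norms or scales.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size: keys and values for every layer, head and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)

fp16 = kv_cache_bytes(bits_per_value=16, **shape)
q3 = kv_cache_bytes(bits_per_value=3, **shape)

gib = 2 ** 30
print(f"16-bit: {fp16 / gib:.1f} GiB, 3-bit: {q3 / gib:.1f} GiB")
```

          For this made-up shape the cache drops from 16 GiB to 3 GiB at a
          128K-token context. The weights themselves are untouched, so this
          does not make a 500B model fit in 48 GB.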
       
        macleginn wrote 1 day ago:
        "TurboQuant proved it can quantize the key-value cache to just 3 bits
        without requiring training or fine-tuning and causing any compromise in
        model accuracy" -- what does each group of 3 bits correspond to? Hardly
        individual keys or values, since that would limit each of them to 8
        different vectors.
       
          carlosvega wrote 1 day ago:
          It's the number of bits per coordinate. So in 2D, 1 bit gives a 2x2
          grid and 3 bits give a 64-cell grid (2^3 x 2^3). Here is a demo:
          
 (HTM)    [1]: https://mesuvash.github.io/blog/2026/turboquant-interactive/
       
          jbellis wrote 1 day ago:
          The explanation is terrible, but it's clear that it's not actually
          lossless.
       
        zeeshana07x wrote 1 day ago:
        The gap between how this is described in the paper vs the blog post is
        pretty wide. Would be nice to see more accessible writing from research
        teams — not everyone reading is a ML engineer
       
          dev_tools_lab wrote 1 day ago:
          Agreed. The practical implications are often 
          more interesting than the math anyway — smaller 
          models running locally means you can afford to 
          run multiple models in parallel for cross-validation, 
          which changes how you approach tasks like code 
          analysis or bug detection.
       
          om8 wrote 1 day ago:
          These are very different media types with very different goals.
       
        mskkm wrote 1 day ago:
        Pied Piper vibes. As far as I can tell, this algorithm is hardly
        compatible with modern GPU architectures. My guess is that’s why the
        paper reports accuracy-vs-space, but conveniently avoids reporting
        inference wall-clock time. The baseline numbers also look seriously
        underreported. “several orders of magnitude” speedups for vector
        search? Really? Has anyone actually reproduced these results?
       
          fc417fc802 wrote 1 day ago:
          Efficient execution on the GPU appears to have been one of the
          specific aims of the authors. Table 2 of their paper shows real world
          performance that would appear at a glance to be compatible with
          inference.
       
            mskkm wrote 1 day ago:
            This is not an LLM inference result. Table 2 is the part I find
            most questionable. Claiming orders-of-magnitude improvements in
            vector search over standard methods is an extraordinary claim. If
            it actually held up in practice, I would have expected to see
            independent reproductions or real-world adoption by now. It’s
            been about a year since the paper came out, and I haven’t seen
            much of either. That doesn’t prove the claim is false, but it
            certainly doesn’t inspire confidence.
       
          NitpickLawyer wrote 1 day ago:
          Apparently MLX confirmed it -
          
 (HTM)    [1]: https://x.com/prince_canuma/status/2036611007523512397
       
            mskkm wrote 1 day ago:
            They confirmed the accuracy on NIAH but didn't reproduce the
            claimed 8x efficiency.
       
          veunes wrote 1 day ago:
          Classic academic move. If the authors show accuracy-vs-space charts
          but hide end-to-end latency, it usually means their code is slower in
          practice than vanilla fp16 without any compression. Polar coordinates
          are absolute poison for parallel GPU compute
       
            fc417fc802 wrote 1 day ago:
            I don't think they're using polar coordinates? They're quantizing
            to grid centroids.
       
        lucrbvi wrote 1 day ago:
        Sounds like Multi-Head Latent Attention (MLA) from DeepSeek
       
          veunes wrote 1 day ago:
          Nah, those are completely different beasts. DeepSeek's MLA solves the
          KV cache issue via low-rank projection - they literally squeeze the
          matrix through a latent vector at train time. TurboQuant is just
          Post-Training Quantization where they mathematically compress
          existing weights and activations using polar coordinates
       
            esafak wrote 1 day ago:
            No, it is about compressing the KV cache; see How TurboQuant works.
       
        amitport wrote 1 day ago:
        This is a great development for KV cache compression. I did notice a
        missing citation in the related works regarding the core mathematical
        mechanism, though. The foundational technique of applying a geometric
        rotation prior to extreme quantization, specifically for managing the
        high-dimensional geometry and enabling proper bias correction, was
        introduced in our NeurIPS 2021 paper, "DRIVE" ( [1] ). We used this
        exact rotational approach and a similar bias correction mechanism to
        achieve optimal distributed mean estimation. I also presented this work
        and subsequent papers in a private invited talk at Google shortly after
        publication. Given the strong theoretical overlap with the mechanisms
        in TurboQuant and PolarQuant, I hope to see this prior art acknowledged
        in the upcoming camera-ready versions.
        
 (HTM)  [1]: https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c1b...
       
          jjssmith wrote 22 hours 11 min ago:
          LOL. This is a classical technique, Johnson-Lindenstrauss etc. In
          this context, rediscovered every few years (recently months), e.g.
          here's 2017:
          
 (HTM)    [1]: https://proceedings.mlr.press/v70/suresh17a
       
            amitport wrote 9 hours 17 min ago:
            We do mention the paper you shared. Please read our paper to
            see how the rotation-aware bias correction we introduced
            efficiently fixes the bias and provides a better worst-case error.
       
          jmalicki wrote 1 day ago:
          If they didn't cite your paper that's bullshit.
          
          But if they read your paper enough that they invited you to a talk,
          that probably means they were far enough along in independently
          inventing it that they were going to do so anyway, and wanted to
          chat with someone who was also doing the thing they were already
          doing. Good ideas tend to reveal themselves to anyone who is aware
          of the problem.
       
            CyberDildonics wrote 1 day ago:
            That's rationalizing like crazy. If they knew about it they should
            have cited it.
       
              jmalicki wrote 19 hours 18 min ago:
              That's what I'm saying - not citing is total bullshit.
              
              But if they invited a talk, and published a paper and cited it,
              it might be a little off, but not horrible.
       
            amitport wrote 1 day ago:
            To be clear, I am not claiming they stole an idea. They have made
            significant independent research. However, a specific part
            regarding the treatment of rotation with bias correction relates to
            prior work, and it would be appropriate to have that recognized.
       
              jmalicki wrote 19 hours 17 min ago:
              If they didn't at least cite it, it is complete bullshit.
              
              If they cited it, but you feel you deserved more credit than
              that... I feel you, but it's less clear cut.
       
            efavdb wrote 1 day ago:
            The earlier paper was from 2021!
       
            cubefox wrote 1 day ago:
            > But if they read your paper enough that they invited you to a
            talk, that probably means they were far enough along in
            independently inventing it
            
            That's more than a stretch. They likely invited them because
            someone thought the abstract sounded interesting, or something like
            that.
       
            ekjhgkejhgk wrote 1 day ago:
            Doesn't matter, you should still cite. It's basic manners in
            science.
       
              kleiba wrote 1 day ago:
              Exactly, that's why the section is called "Related Work".
       
          sva_ wrote 1 day ago:
          Schmidhuber'd
       
          eecc wrote 1 day ago:
          Pardon my simplistic question, but when you say rotation you’re
          essentially talking about diagonalization aren’t you?
          
          So storing the diagonal as a matrix and the new bases is more
          compact?
       
            tripplyons wrote 1 day ago:
            There are papers that try to quantize angles associated with
            weights because angles have a more uniform distribution. I haven't
            read this specific paper, but it looks like it uses a similar trick
            at a glance.
       
            amitport wrote 1 day ago:
            In this context, the rotation is for spreading energy and ensuring
            predictable coordinate distributions rather than diagonalization;
            it makes coordinate-wise quantization much more computationally
            efficient, though it throws away learnable structure.
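            
            A minimal sketch of the "spreading energy" effect, using a
            randomized Hadamard transform as a cheap stand-in for a random
            rotation. That choice is an assumption made for illustration; the
            thread does not say which rotation TurboQuant uses, and the helper
            names are hypothetical.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    h = 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x


def make_signs(dim, seed):
    """One shared vector of random +/-1 signs, generated once from a seed."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def rotate(x, signs):
    """Flip signs, apply the Hadamard transform, rescale so the norm is preserved."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in fwht([s * v for s, v in zip(signs, x)])]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


if __name__ == "__main__":
    dim = 64
    signs = make_signs(dim, seed=0)
    spike = [0.0] * dim
    spike[5] = 1.0                        # all energy in a single coordinate
    rotated = rotate(spike, signs)
    print(max(abs(v) for v in spike))     # 1.0
    print(max(abs(v) for v in rotated))   # 0.125, i.e. 1/sqrt(64)
    print(norm(rotated))                  # 1.0
```

            Before the rotation, a coordinate-wise quantizer has to cope with
            one huge value and 63 zeros; afterwards every coordinate has the
            same small magnitude, so a single fixed per-coordinate grid fits.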
       
              eecc wrote 1 day ago:
              ah ok, so intuitively it's like minimizing the error when
              replacing the values with a well-known distribution. So all you
              need to carry along is the rotation and the assumption that there
              is some amount of loss.
       
          busfahrer wrote 1 day ago:
          I just today learned about Multi-Head Latent Attention, which is also
          sort of a way of compressing the KV cache. Can someone explain how
          this new development relates to MHLA?
       
            tripplyons wrote 1 day ago:
            MLA makes it so the keys and values used are a function of a
            smaller latent vector you cache instead of a key and a value for
            each token. KV cache quantization reduces the size of the values in
            the cache by using fewer bits to store each value. These two
            approaches operate on different parts of the process so they can be
            used in combination. For example, you can quantize the latents that
            are stored for MLA.
       
            yorwba wrote 1 day ago:
            Multi-Head Latent attention is a redesigned attention mechanism
            that produces lower-dimensional KV-cache entries. Vector
            quantization can store KV-cache entries using a small number of
            bits per dimension while ensuring that the resulting attention
            scores don't change too much. So MLA needs to be part of the model
            from the beginning of training, whereas VQ can be retrofitted
            afterwards, and you could also combine the two.
       
        maurelius2 wrote 1 day ago:
        I'm somewhat at a loss here other than understanding the fundamentals.
        Can someone tell me how the compression impacts performance?
       
          prabal97 wrote 2 hours 1 min ago:
          Reposting it here ... I wrote this more intuitive explanation. I
          think you might find it helpful too!
          
 (HTM)    [1]: https://prabal.ca/posts/google-long-context-cheaper/
       
          valine wrote 1 day ago:
          So let’s start with a really simple decoder transformer with a
          single layer and single attention head, and train it to predict the
          next token in a sequence of text. To predict the next token you need
          a few things: a query for the very last token in the sequence, and a
          key and value for every prior token. You take your query and compute
          a dot product with every prior key (two large vectors in, scalar
          attention score out). That scalar attention score first goes through
          softmax, and then becomes the weight you use to compute a weighted
          average of your values. The new value goes through the MLP, and the
          MLP output is projected into the logits from which you sample your
          next token (that’s the general idea at least; I skipped a few steps).
          
          The last query in the sequence will be new for every new token you
          predict, but the set of prior keys and values stay the same, ie keys
          and values are reusable. The key value cache gets bigger and bigger
          for each new token you add to the sequence, and that’s where
          compression comes in. You have to store the keys and values in vram,
          and you’d like to keep the size down by not storing the raw
          uncompressed tensors. To make this work well your compression needs
          two things: it needs to be fast so that you can compress and
          decompress on the fly, and it needs to play well with softmax
          attention. Prior attempts at compression usually suck at one or the
          other, either the speed to decompress is too slow and your token/s
          takes a hit, or you lose important precision and the model output
          quality suffers. The claim in the paper is that they’ve made
          progress on both.
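          
          The decoding step described above can be sketched in a few lines.
          This is a toy single-head version with an uncompressed cache; the
          class and function names are made up for illustration, and real
          implementations operate on batched tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class KVCache:
    """Stores one key and one value per token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def attend(query, cache):
    """Single-head attention of one query against every cached key/value."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    weights = softmax(scores)
    dim = len(cache.values[0])
    output = [sum(w * v[i] for w, v in zip(weights, cache.values)) for i in range(dim)]
    return output, weights


if __name__ == "__main__":
    cache = KVCache()
    cache.append([1.0, 0.0], [2.0, 0.0])   # token 1
    cache.append([1.0, 0.0], [4.0, 0.0])   # token 2, same key as token 1
    output, weights = attend([1.0, 0.0], cache)
    print(weights)           # [0.5, 0.5]
    print(output)            # [3.0, 0.0]
    print(len(cache.keys))   # 2
```

          The cache grows by one key and one value per token, and every new
          query reads all of it. A KV-cache quantizer would sit inside
          `append` (compress) and `attend` (decompress or score directly).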
       
            edg5000 wrote 1 day ago:
            So limiting max context length also reduces VRAM needs a bit? If
            cache is 20% of total, 1/10th of context as a limit would mean 18%
            total memory reduction.
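            
            The arithmetic, as a quick sketch. It assumes the cache size
            scales linearly with both context length and bits per entry; the
            16-bit baseline in the second call is a hypothetical figure, not
            something stated in the thread.

```python
def total_memory_saving(cache_share: float, cache_scale: float) -> float:
    """Fraction of total memory saved when the KV cache shrinks.

    cache_share: fraction of total memory currently used by the KV cache.
    cache_scale: new cache size divided by old cache size.
    """
    return cache_share * (1.0 - cache_scale)


if __name__ == "__main__":
    # Cache is 20% of memory and the context is cut to 1/10.
    print(round(total_memory_saving(0.20, 1 / 10), 4))   # 0.18
    # Hypothetical alternative: same context, 3 bits per value instead of 16.
    print(round(total_memory_saving(0.20, 3 / 16), 4))   # 0.1625
```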
       
              valine wrote 1 day ago:
              Yup exactly, in principle it helps with both inference speed by
              reducing memory bandwidth usage and also reduces the memory
              footprint of your kvcache.
       
          dryarzeg wrote 1 day ago:
          If in short, for many inference tasks the bottleneck is memory
          bandwidth. Suppose you have a machine with a memory bandwidth of 256
          GB/s, and let's say you want to do inference for 4B model (model with
          4 billion parameters). If you will load the model in BF16 format (16
          bits), each forward pass (i.e. each token generated) will require
          roughly ~8 GB of memory bandwidth. So, 256/8 = 32 t/s, and that's the
          generation speed you will be strictly capped at even if your
          processing power is measured in exaFLOPS. But let's say now that you
          have decided to instead quantize the model and then run the quantized
          version. Suppose you have made a Q4_K_M version (4 bits + some
          weights will take more). Now each of your forward passes will take
          roughly 2-3 GB (rough approximations, reality is different) of memory
          bandwith (actually, it will be around 2 GB), and even in the worst
          case 256/3 = 85.3, while 256/2 = 128 t/s. Quants can reduce quality
          of the model and lower it's performance, but in most modern
          quantization methods those losses are usually negligible (although,
          of course, they're still present). So, as you can see, it can be
          concluded that quantization "widens" (it's not removing it fully)
          memory bottleneck while still preserving (not always though)
          acceptable quality.
          
          (Sorry for my terrible English, it's not my native language)
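          
          The same back-of-the-envelope calculation as code. It assumes every
          weight is read exactly once per generated token and ignores the KV
          cache and activations; the function name is made up for this
          example.

```python
def tokens_per_second(bandwidth_gb_s: float, params_billions: float, bits_per_weight: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    gb_per_token = params_billions * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_per_token


if __name__ == "__main__":
    print(tokens_per_second(256, 4, 16))            # 32.0  (BF16, 8 GB per token)
    print(tokens_per_second(256, 4, 4))             # 128.0 (2 GB per token)
    print(round(tokens_per_second(256, 4, 6), 1))   # 85.3  (3 GB per token)
```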
       
            rohansood15 wrote 1 day ago:
            The paper is about vector quantization, which affects KV cache not
            model weights/sizes.
       
        moktonar wrote 1 day ago:
        Aren’t polar coordinates still n-1 + 1 for radius for n-dim vector?
        If so I understand that angles can be quantized better but when radius
        r is big the error is large for highly quantized angles right? What am
        I missing?
       
          amitport wrote 1 day ago:
          r is a single value per vector. You don't have to quantize it, you
          can keep it and quantize the billion+ other coordinates of the
          vector.
       
            mungoman2 wrote 1 day ago:
            What they're saying is that the error for a vector increases with
            r, which is true.
            
            Trivially, with r=0, the error is 0, regardless of how heavily the 
             direction is quantized. Larger r means larger absolute error in
            the reconstructed vector.
       
              amitport wrote 1 day ago:
              Yes, the important part is that the normalized error does not
              increase with the dimension of the vector (which does happen when
              using biased quantizers)
              
              It is expected that bigger vectors have proportionally bigger
              error, nothing can be done by the quantizer about that.
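              
              A toy check of that scaling. The uniform grid below is only a
              stand-in for a real direction quantizer, and the example vector
              is arbitrary; the point is just that the absolute error grows
              with r while the error relative to r stays put.

```python
import math


def quantize_direction(unit, levels=4):
    """Snap each coordinate to the nearest of `levels` evenly spaced values in [-1, 1]."""
    step = 2.0 / (levels - 1)
    return [round((u + 1.0) / step) * step - 1.0 for u in unit]


def l2(x):
    return math.sqrt(sum(v * v for v in x))


def errors(r, unit, levels=4):
    """Absolute and relative error when r is kept exact and only the direction is quantized."""
    original = [r * u for u in unit]
    approx = [r * q for q in quantize_direction(unit, levels)]
    absolute = l2([a - b for a, b in zip(original, approx)])
    relative = absolute / r if r > 0 else 0.0
    return absolute, relative


if __name__ == "__main__":
    unit = [0.6, 0.8]
    for r in (0.0, 1.0, 10.0):
        absolute, relative = errors(r, unit)
        print(r, round(absolute, 4), round(relative, 4))
    # 0.0 0.0 0.0
    # 1.0 0.3333 0.3333
    # 10.0 3.3333 0.3333
```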
       
                moktonar wrote 13 hours 16 min ago:
                Except maybe storing another smaller vector for the difference
                with the original data and also quantize that, maybe recursively
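                
                A minimal sketch of that idea (usually called residual
                quantization), using plain rounding at each stage. The step
                sizes and the example vector are arbitrary assumptions, and
                this is not the two-stage scheme from the paper.

```python
def quantize(x, step):
    """Round each coordinate to the nearest multiple of `step`."""
    return [round(v / step) * step for v in x]


def residual_quantize(x, steps):
    """Quantize x, then quantize what is left over, once per entry in `steps`.

    Returns the per-stage codes; their sum is the reconstruction.
    """
    residual = list(x)
    stages = []
    for step in steps:
        code = quantize(residual, step)
        stages.append(code)
        residual = [r - c for r, c in zip(residual, code)]
    return stages


def reconstruct(stages):
    """Add the stages back together coordinate by coordinate."""
    return [sum(parts) for parts in zip(*stages)]


def max_error(x, stages):
    return max(abs(a - b) for a, b in zip(x, reconstruct(stages)))


if __name__ == "__main__":
    x = [0.3, -1.7, 2.45]
    one_stage = residual_quantize(x, [1.0])
    two_stages = residual_quantize(x, [1.0, 0.25])
    print(round(max_error(x, one_stage), 3))    # 0.45
    print(round(max_error(x, two_stages), 3))   # 0.05
```

                Each extra stage costs more bits per vector, so this trades
                storage back for accuracy.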
       
        benob wrote 1 day ago:
        This is the worst lay-people explanation of an AI component I have seen
        in a long time. It doesn't even seem AI generated.
       
          davesque wrote 1 day ago:
          Yeah, and some parts of the article are just bizarre:
          
          > Instead of looking at a memory vector using standard coordinates
          (i.e., X, Y, Z) that indicate the distance along each axis,
          PolarQuant converts the vector into polar coordinates using a
          Cartesian coordinate system. This is comparable to replacing "Go 3
          blocks East, 4 blocks North" with "Go 5 blocks total at a 37-degree
          angle”
          
          Why bother explaining this? Were they targeting the high school and
          middle school student reader base??
       
          BenoitP wrote 1 day ago:
          It is AI generated. Or was written by someone a bit far from the
          technical advances IMHO. The Johnson-Lindenstrauss Lemma is a very
          specific and powerful concept, when in the article the QJL
          explanation is vacuous. A knowledgeable human would not have left the
          reader wanting for how that relates to the Lemma.
       
            hrmtst93837 wrote 1 day ago:
            Honestly, the bigger miss is people treating JL as some silver
            bullet for "extreme" compression, as if preserving pairwise
            distances for a fixed point set somehow means you still keep the
            task-relevant structure once you're dealing with modern models.
            
            Try projecting embeddings this way and watch your recall crater the
            moment you need downstream task performance instead of
            nearest-neighbor retrieval demos. If you're optimizing for blog
            post vibes instead of anything measurable sure, call it a
            breakthrough.
       
          spencerflem wrote 1 day ago:
          I think it is though-
          
          “ TurboQuant, QJL, and PolarQuant are more than just practical
          engineering solutions; they’re fundamental algorithmic
          contributions backed by strong theoretical proofs. These methods
          don't just work well in real-world applications; they are provably
          efficient and operate near theoretical lower bounds.”
       
            zarzavat wrote 1 day ago:
            I read "this clever step" and immediately came to the comments to
            see if anyone picked up on it.
            
            It reads like a pop science article while at the same time being
            way too technical to be a pop science article.
            
            Turing test ain't dead yet.
       
              TeMPOraL wrote 1 day ago:
              > Turing test ain't dead yet.
              
              Only because people are lazy, and don't bother with a simple
              post-processing step: attach a bunch of documents or text
              snippets written by a human (whether yourself or, say, some
              respected but stylistically boring author), and ask the LLM to
              match style/tone.
       
            NoahZuniga wrote 1 day ago:
            Genius new idea: replace the em-dashes with semicolons so it looks
            less like AI.
       
              Quarrel wrote 1 day ago:
              Damnit.
              
              There goes another bit of my writing style that will get mistaken
              for an LLM.
       
              tux3 wrote 1 day ago:
              You're absolutely right. That's not just a genius idea; it's a
              radical new paradigm.
       
            integralid wrote 1 day ago:
            I also instinctively reacted to that fragment, but at this point I
            think this is overreacting to a single expression. It's not just a
            normal thing to say in English, it's something people have been
            saying for a long time before LLMs existed.
       
              g-mork wrote 1 day ago:
              Another instinctual reaction here. This specific formulation pops
              out of AI all the time, there might as well have been an emdash
              in the title
       
              nvme0n1p1 wrote 1 day ago:
              There are tells all over the page:
              
              > Redefining AI efficiency with extreme compression
              
              "Redefine" is a favorite word of AI. Honestly no need to read
              further.
              
              > the key-value cache, a high-speed "digital cheat sheet" that
              stores frequently used information under simple labels
              
              No competent engineer would describe a cache as a "cheat sheet".
              Cheat sheets are static, but caches dynamically update during
              execution. Students don't rewrite their cheat sheets during the
              test, do they? LLMs love their inaccurate metaphors.
              
              > QJL: The zero-overhead, 1-bit trick
              
              > It reduces each resulting vector number to a single sign bit
              (+1 or -1). This algorithm essentially creates a high-speed
              shorthand that requires zero memory overhead.
              
              Why does it keep emphasizing zero overhead? Why is storing a
              single bit a "trick?" Either there's currently an epidemic of
              algorithms that use more than one bit to store a bit, or the AI
              is shoving in extra plausible-sounding words to pad things out.
              You decide which is more likely.
              
              It's 1:30am and I can't sleep, and I still regret wasting my time
              on this slop.
       
                radarsat1 wrote 1 day ago:
                > "Redefine" is a favorite word of AI. Honestly no need to read
                further.
                
                You're not wrong, but it certainly is an annoying outcome of AI
                that we're not allowed to use.. words.. anymore.
       
                TeMPOraL wrote 1 day ago:
                I say you're fixating on the wrong signal here. "Redefine" and
                "cheat sheet" are normal words people frequently use, and I see
                worse metaphors in human-written text routinely.
                
                It's the structure and rhythm at the sentence and paragraph
                levels that's the current tell, as SOTA LLMs all seem to
                overuse clarification constructs like "it's not X, it's Y" and
                "it's X, a Y and a Z", and "it's X, it's essentially doing Y".
                
                Thing is, I actually struggle to find what's so off-putting
                about these, given that they're usually used correctly. So far,
                the best hypothesis I have for what makes AI text stand out is
                that LLM output is too good. Most text written by real humans
                (including my own) is shit, with the best of us caring about
                communicating clearly, and most people not even that; nobody
                spends time refining the style and rhythm, unless they're
                writing a poem. You don't expect a blog post or a random
                Internet article (much less a HN comment) to be written in the
                same style as a NYT bestseller book for general audience - but
                LLMs do that naturally, they write text better at paragraph
                level than most people ever could, which stands out as jarring.
                
                > Either there's currently an epidemic of algorithms that use
                more than one bit to store a bit, or the AI is shoving in extra
                plausible-sounding words to pad things out. You decide which is
                more likely.
                
                Or, those things matter to authors and possibly the audience.
                Which is reasonable, because LLMs made the world suddenly hit
                hard against global capacity constraints in compute, memory,
                and power; between that and edge devices/local use, everyone
                who pays attention is interested in LLM efficiency.
       
                  spencerflem wrote 1 day ago:
                  Because it’s a lot of fluff to convey things in a way
                  that’s not very accurate.
       
                  snovv_crash wrote 1 day ago:
                  LLM prose is very bland and smooth, in the same way that
                  bland white factory bread is bland and smooth. It also
                  typically uses a lot of words to convey very simple ideas,
                  simply because the data is typically based on a small prompt
                  that it tries to decompress. LLMs are capable of very good
                  data transformation and good writing, but not when they are
                  asked to write an article based on a single sentence.
       
                    TeMPOraL wrote 1 day ago:
                    That's true. I.e. it's not that they're not capable of
                    doing better, it's just whoever's prompting them is
                    typically too lazy to add an extra sentence or three (or a
                    link) to steer it to a different region of the latent
                    space. There's easily a couple dozen dimensions almost
                    always left at their default values; it doesn't take much
                    to alter them and nudge the model to sample from a more
                    interesting subspace style-wise.
                    
                    (Still, it makes sense to do it as a post-processing style
                    transfer step, as verbosity is a feature while the model
                    is still processing the "main" request - each token
                    produced is a unit of computation; the more terse the
                    answer, the dumber it gets (these days it's somewhat
                    mitigated by "thinking" and agentic loops)).
       
                roywiggins wrote 1 day ago:
                "The X Trick" or "The Y Dilemma" or similar snowclones in a
                header is also a big AI thing. Humans use this construction
                too, but LLMs love it out of all proportion. I call it The
                Ludlum Delusion (since that's how every Robert Ludlum book is
                titled).
       
                veunes wrote 1 day ago:
                Looks like Google canned all their tech writers just to pivot
                the budget into H100s for training these very same writers
       
                  snovv_crash wrote 1 day ago:
                  Capex vs. opex
       
                pqs wrote 1 day ago:
                There is also the possibility that the article went through the
                hands of the company's communication department which has
                writers that probably write at LLM level.
       
            benob wrote 1 day ago:
            Maybe they quantized the model parameters a bit too much...
       
        bluequbit wrote 1 day ago:
        I did not understand what polarQuant is.
        
        Is it something like pattern based compression where the algorithm
        finds repeating patterns and creates an index of those common symbols
        or numbers?
       
          Rapzid wrote 1 day ago:
          That overview is frustratingly high-level. I know what a vector is, a
          bit, and yet that compression description is crazy uninformative. And
          that PolarQuant visualization is.. Very abstract.
       
          viktorcode wrote 1 day ago:
          The way I understand it, it's a way of compressing vectors by
          switching from their per-component representation to polar
          coordinates representation, where the nearby vectors are clumped
          together to a single line, allowing to describe them by different
          lengths
       
          Maxious wrote 1 day ago:
           [1] has a little visualisation
          
 (HTM)    [1]: https://mesuvash.github.io/blog/2026/turboquant-interactive/
       
            Geee wrote 1 day ago:
            Is there an error in the visualization? It shows that every vector
            is rotated the same amount. My understanding was that they are
            randomized with different values, which results in a predictable
            distribution, which is easier to quantize.
       
              mesuvash wrote 19 hours 42 min ago:
              That's actually correct and intentional. TurboQuant applies the
              same rotation matrix to every vector. The key insight is that any
              unit vector, when multiplied by a random orthogonal matrix,
              produces coordinates with a known distribution (Beta/arcsine in
              2D, near-Gaussian in high-d). The randomness is in the matrix
              itself (generated once from a seed), not per-vector. Since the
              distribution is the same regardless of the input vector, a single
              precomputed quantization grid works for everything. I've updated
              the description to make this clearer.
       
                Geee wrote 18 hours 56 min ago:
                Thanks. However, from this visualization it's not clear how the
                random rotation is beneficial. I guess it makes more sense on
                higher dimensional vectors.
       
                  mesuvash wrote 16 hours 17 min ago:
                  Yes, this is important in high dimension. But sadly, very
                  hard to visualize. In 2D it looks unnecessary.
       
              fc417fc802 wrote 19 hours 51 min ago:
              I believe they are all rotated by the same random matrix, the
              purpose being (IIUC) to distribute the signal evenly across all
              dimensions. So effectively it drowns any structure that might be
              present in noise. That's essential for data efficiency in
              addition to avoiding bias related issues during the initial
              quantization step. However there are still some other issues due
              to bias that are addressed by a second quantization step
              involving the residual.
              
              That said, I don't believe the visualization is correct. The grid
              for one doesn't seem to match what's described in the paper.
              
              Also it's entirely possible I've misunderstood or neglected to
              notice key details.
       
            Rapzid wrote 1 day ago:
            Awesome! So it nudges the vectors into stepped polar rays.. It's
            effectively angle snapping? Plus a sort of magnitude clustering.
       
            pstoll wrote 1 day ago:
            Good post but link at the end is broken.
            
            “””
            For the full technical explanation with equations, proofs, and
            PyTorch pseudocode, see the companion post: TurboQuant:
            Near-Optimal Vector Quantization Without Looking at Your Data.“
       
              mesuvash wrote 19 hours 46 min ago:
              Author here. Sorry still working on refining the post. Will share
              once the post is ready.
       
            spencerflem wrote 1 day ago:
            I like the visualization, but I don’t understand the grid
            quantization. If every point is on the unit circle, aren’t all
            the center grid coordinates unused?
       
              mesuvash wrote 19 hours 43 min ago:
              Yes. Great catch. I simplified the grid purely for
              visualization purposes.
              
              I've updated the visualization. The grid is actually not
              uniformly spaced. Each coordinate is quantized independently
              using optimal centroids for the known coordinate distribution. In
              2D, unit-circle coordinates follow the arcsine distribution
              (concentrating near ±1), so the centroids cluster at the edges,
              not the center.
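
              A rough sketch of why the centroids move outward, under stated
              assumptions: the coordinate is sampled as the cosine of a
              uniform angle (which gives the arcsine distribution), and the
              centroids are fitted with a plain 1-D Lloyd iteration as a
              generic stand-in for whatever procedure the paper or the
              visualization actually uses. The names lloyd_1d and quantize,
              the 4 levels and the sample count are illustrative.

```python
import numpy as np

def lloyd_1d(samples, levels, iters=100):
    """Plain 1-D Lloyd iteration (k-means on scalars); returns sorted centroids."""
    # Start from evenly spaced quantiles so every centroid owns some samples.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        # Nearest-centroid assignment via the midpoints between neighbours.
        edges = (centroids[:-1] + centroids[1:]) / 2
        cells = np.searchsorted(edges, samples)
        centroids = np.array([samples[cells == i].mean() for i in range(levels)])
    return centroids

def quantize(samples, centroids):
    """Map every sample to its nearest centroid (centroids must be sorted)."""
    edges = (centroids[:-1] + centroids[1:]) / 2
    return centroids[np.searchsorted(edges, samples)]

rng = np.random.default_rng(0)
# One coordinate of a uniformly random point on the unit circle.
x = np.cos(rng.uniform(0.0, 2.0 * np.pi, size=200_000))

uniform_grid = np.array([-0.75, -0.25, 0.25, 0.75])   # evenly spaced, 2 bits
fitted = lloyd_1d(x, levels=4)

mse_uniform = np.mean((x - quantize(x, uniform_grid)) ** 2)
mse_fitted = np.mean((x - quantize(x, fitted)) ** 2)
print(fitted)
print(mse_uniform, mse_fitted)
```

              The fitted outer centroids land closer to ±1 than the evenly
              spaced ones, and the mean squared error drops accordingly.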
       
                spencerflem wrote 6 hours 52 min ago:
                Cool! Thank you
       
              fc417fc802 wrote 1 day ago:
              Yeah that's odd. It seems like you'd want an n-1 dimensional grid
              on the surface of the unit sphere rather than an n dimensional
              grid within which the sphere resides.
              
              Looking at the paper ( [1] ) they cite earlier work that does
              exactly that. They object that grid projection and binary search
              perform exceptionally poorly on the GPU.
              
              I don't think they're using a regular grid as depicted on the
              linked page. Equation 4 from the paper is how they compute
              centroids for the MSE optimal quantizer.
              
              Why specify MSE optimal, you ask? It turns out there are
              actually two quantization steps, a detail also omitted from the
              linked page. They apply QJL quantization to the residual of the
              grid-quantized data.
              
              My description is almost certainly missing key details; I'm not
              great at math and this is sufficiently dense to be a slog.
              
 (HTM)        [1]: https://arxiv.org/abs/2504.19874
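
              A hedged sketch of the two-step idea, not the paper's exact
              construction. Stage 1 here is plain rounding to a coarse grid,
              standing in for the MSE-optimal quantizer. Stage 2 keeps only
              the signs of a Gaussian random projection of the residual plus
              the residual norm, in the spirit of QJL; the scaling factor
              and all names and sizes below are assumptions. The projection
              is made far wider than any realistic bit budget so that the
              effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, proj_dim, num_queries = 64, 4096, 200

key = rng.standard_normal(dim)
queries = rng.standard_normal((num_queries, dim))

# Stage 1 (stand-in for the MSE quantizer): round each coordinate to a grid.
step = 1.0
coarse = np.round(key / step) * step
residual = key - coarse

# Stage 2: store only the signs of a random projection of the residual,
# together with the residual's norm.
projection = rng.standard_normal((proj_dim, dim))
residual_signs = np.sign(projection @ residual)
residual_norm = np.linalg.norm(residual)

# Sign-based estimate of <query, residual>. The sqrt(pi/2) factor undoes the
# shrinkage caused by replacing Gaussian projections with their signs.
scale = np.sqrt(np.pi / 2) / proj_dim * residual_norm
correction = scale * ((queries @ projection.T) @ residual_signs)

exact = queries @ key
stage1_only = queries @ coarse
two_stage = stage1_only + correction

rms_stage1 = np.sqrt(np.mean((stage1_only - exact) ** 2))
rms_two_stage = np.sqrt(np.mean((two_stage - exact) ** 2))
print(rms_stage1, rms_two_stage)
```

              Over the batch of queries, the inner-product error with the
              residual correction is noticeably smaller than with stage 1
              alone.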
       
              vincnetas wrote 1 day ago:
              I think the grid can be on the surface of the unit sphere.
       
          mrugge wrote 1 day ago:
          1. Efficient recursive transform of KV embeddings into polar
          coordinates.
          2. Quantize the resulting angles without the need for explicit
          normalization. This saves memory thanks to a key insight: the
          angles follow a known distribution with an analytical form.
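
          One possible reading of the two steps above, sketched in NumPy
          under loud assumptions: coordinates are paired into (radius,
          angle), the radii are paired again, and so on until one radius
          remains. The function names are hypothetical, the vector length
          must be a power of 2, and the uniform angle grid is only a
          placeholder for a codebook matched to the angle distribution.

```python
import numpy as np

def to_polar(vec):
    """Recursively pair values into (radius, angle); len(vec) must be a power of 2."""
    radii = np.asarray(vec, dtype=float)
    angle_levels = []
    while radii.size > 1:
        x, y = radii[0::2], radii[1::2]
        angle_levels.append(np.arctan2(y, x))
        radii = np.hypot(x, y)          # these radii are paired again next round
    return radii[0], angle_levels

def from_polar(radius, angle_levels):
    """Invert to_polar by unfolding the levels from the last one back to the first."""
    radii = np.array([radius])
    for angles in reversed(angle_levels):
        out = np.empty(2 * radii.size)
        out[0::2] = radii * np.cos(angles)
        out[1::2] = radii * np.sin(angles)
        radii = out
    return radii

def quantize_angles(angle_levels, bits=4):
    """Snap every angle to a uniform grid; a placeholder codebook."""
    step = 2 * np.pi / (2 ** bits)
    return [np.round(a / step) * step for a in angle_levels]

rng = np.random.default_rng(0)
v = rng.standard_normal(8)
radius, angles = to_polar(v)
approx = from_polar(radius, quantize_angles(angles))
print(np.linalg.norm(v), radius)        # the single stored radius is the norm
print(np.linalg.norm(v - approx))
```

          Because only angles are quantized, the reconstructed vector keeps
          the original norm, which is one way to see why no separate
          normalization step is needed in this sketch.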
       
            quotemstr wrote 1 day ago:
            Reminds me vaguely of the Burrows-Wheeler transform in bzip2.
       
       