[HN Gopher] FFmpeg devs boast of another 100x leap thanks to handwritten assembly code
       ___________________________________________________________________
        
       FFmpeg devs boast of another 100x leap thanks to handwritten
       assembly code
        
       Author : harambae
       Score  : 83 points
       Date   : 2025-07-20 20:51 UTC (2 hours ago)
        
 (HTM) web link (www.tomshardware.com)
 (TXT) w3m dump (www.tomshardware.com)
        
       | shmerl wrote:
        | Still waiting for Pipewire + xdg desktop portal screen / window
        | capture support in the ffmpeg CLI. They've been dragging their
        | feet on it forever.
        
       | askvictor wrote:
        | I wonder how many optimisations like this could be created by
        | LLMs. Obviously we should not be incorporating LLMs into
        | compilers, but I suspect that's eventually what will happen.
        
         | pizlonator wrote:
         | The hardest part of optimizations like this is verifying that
         | they are correct.
         | 
          | We don't have a reliable, general-purpose way of verifying
          | whether any given code transformation is correct.
         | 
         | LLMs definitely can't do this (they will lie and say that
         | something is correct even if it isn't).
        
           | viraptor wrote:
            | But we do! For LLVM there's
            | https://github.com/AliveToolkit/alive2, there are papers
            | like
            | https://people.cs.rutgers.edu/~sn349/papers/cgo19-casmverify...,
            | there's https://github.com/google/souper, there's
            | https://cr.yp.to/papers/symexemu-20250505.pdf, and probably
            | other things I'm not aware of. If you're limiting the scope
            | to a few blocks at a time, symbolic execution will do fine.
        
             | pizlonator wrote:
             | > limiting the scope to a few blocks
             | 
              | Yeah, so that doesn't scale.
              | 
              | The interesting optimizations involve reasoning across
              | thousands of blocks.
              | 
              | And my point is that there is no reliable general-purpose
              | solution here. "Only works for a few blocks at a time" is
              | neither reliable nor general purpose.
        
         | LtWorf wrote:
         | > I wonder how many optimisations like this could be created by
         | LLMs
         | 
          | Zero. There's no huge corpus of Stack Overflow questions on
          | highly specific assembly optimisations, so...
        
           | astrange wrote:
           | You can run an agent in a loop, but for something this small
           | you can already use a SAT solver or superoptimizer if you
           | want to get out of the business of thinking about things
           | yourself.
           | 
           | I've never seen anyone actually do it, mostly because
           | modeling the problem is more work than just doing it.
        
             | ksclk wrote:
             | > you can already use a SAT solver
             | 
             | Could you elaborate please? How would you approach this
             | problem, using a SAT solver? All I know is that a SAT
             | solver tells you whether a certain formula of ANDs and ORs
             | is true. I don't know how it could be useful in this case.
        
               | dragontamer wrote:
               | Pretty much all instructions at the assembly level are
               | sequences of AND/OR/XOR operations.
               | 
                | SAT solvers can prove that some (shorter) sequences are
                | equivalent to other (longer) sequences. But it takes a
                | brute-force search.
                | 
                | IIRC, these superoptimizing SAT solvers can see patterns
                | and pick 'Multiply' instructions as part of their search.
                | So it's more than traditional SAT. But it's still, at
                | the end of the day, a SAT equivalence problem.
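                | 
                | As a toy illustration (the identity and names here are
                | illustrative, not from any particular solver), the
                | brute-force version of such an equivalence proof checks
                | every input; a SAT solver does the same thing
                | symbolically over bitvectors instead of enumerating:
                | 
                |   /* Prove (a & b) + ((a ^ b) >> 1) computes the same
                |    * byte average as the widening reference, for all
                |    * 2^16 input pairs. */
                |   #include <stdint.h>
                |   #include <stdio.h>
                |   
                |   static uint8_t avg_ref(uint8_t a, uint8_t b) {
                |       return (uint8_t)(((uint16_t)a + b) / 2);
                |   }
                |   
                |   static uint8_t avg_opt(uint8_t a, uint8_t b) {
                |       /* shorter sequence: no widening needed */
                |       return (a & b) + ((a ^ b) >> 1);
                |   }
                |   
                |   int main(void) {
                |       for (int a = 0; a < 256; a++)
                |           for (int b = 0; b < 256; b++)
                |               if (avg_ref(a, b) != avg_opt(a, b)) {
                |                   printf("counterexample: %d %d\n", a, b);
                |                   return 1;
                |               }
                |       puts("equivalent on all 65536 inputs");
                |       return 0;
                |   }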
        
               | agent327 wrote:
               | A short look at any compiled code on godbolt will very
               | quickly inform you that pretty much all instructions at
               | the assembly level are, in fact, NOT sequences of
               | AND/OR/XOR operations.
        
               | dragontamer wrote:
                | All instructions are implemented with logic gates. In
                | fact, all instructions today are likely built entirely
                | from NAND gates.
                | 
                | Have you ever seen a Wallace tree multiplier? It's a
                | good demonstration of how XOR and AND gates can
                | implement multiplication.
                | 
                | Now, if multiply + XOR gets you the function you want,
                | it's likely better than whatever the original compiler
                | output was.
        
               | viraptor wrote:
               | You're missing the point. All instructions can be
               | simplified to short integer operations, then all integer
               | operations are just networks of gates, then all gates can
               | be replaced with AND/OR/NOT, or even just NAND. That's
                | why you can SAT-solve program equivalence. See SMT2
                | programs using the bitvector (BV) theory, for example.
               | 
               | Also of course all instructions are MOV anyway.
               | https://github.com/xoreaxeaxeax/movfuscator
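                | 
                | To make the gate-level claim concrete, here's a toy C
                | sketch (purely illustrative, not from any real
                | toolchain) that builds a 4-bit ADD out of nothing but
                | NAND and checks it against the CPU's + on every input:
                | 
                |   #include <stdio.h>
                |   
                |   static unsigned nand(unsigned a, unsigned b) {
                |       return !(a & b);
                |   }
                |   
                |   /* 1-bit full adder from 9 NAND gates */
                |   static void fa(unsigned a, unsigned b, unsigned cin,
                |                  unsigned *s, unsigned *cout) {
                |       unsigned n1 = nand(a, b);
                |       unsigned x  = nand(nand(a, n1), nand(b, n1)); /* a^b */
                |       unsigned n5 = nand(x, cin);
                |       *s    = nand(nand(x, n5), nand(cin, n5));  /* x^cin */
                |       *cout = nand(n5, n1);       /* majority(a, b, cin) */
                |   }
                |   
                |   /* 4-bit ripple-carry ADD, pure NAND */
                |   static unsigned add4(unsigned a, unsigned b) {
                |       unsigned sum = 0, c = 0, s;
                |       for (int i = 0; i < 4; i++) {
                |           fa((a >> i) & 1, (b >> i) & 1, c, &s, &c);
                |           sum |= s << i;
                |       }
                |       return sum;
                |   }
                |   
                |   int main(void) {
                |       for (unsigned a = 0; a < 16; a++)
                |           for (unsigned b = 0; b < 16; b++)
                |               if (add4(a, b) != ((a + b) & 0xF)) {
                |                   printf("mismatch: %u %u\n", a, b);
                |                   return 1;
                |               }
                |       puts("NAND-only ADD matches + on all inputs");
                |       return 0;
                |   }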
        
           | gametorch wrote:
            | There are literally textbooks on optimization, with tons
            | of examples. I'm sure there are models out there trained
            | on them.
        
         | viraptor wrote:
          | We're getting there: https://arxiv.org/html/2505.11480v1.
          | This is for general-purpose code, which is going to be
          | easier than heavy SIMD, where you have to watch out for very
          | specific timing, pipeline, and architecture details. But
          | it's a step in that direction.
        
         | smj-edison wrote:
          | I'd be worried about compile times, lol. Final binaries are
          | quite often tens to hundreds of megabytes, and I'm pretty
          | sure an LLM processes tokens _much_ slower than a compiler
          | completes passes.
         | 
         | EDIT: another thought: non-deterministic compilation would also
         | be an issue unless you were tracking the input seed, and it
         | would still cause spooky action at a distance unless you had
         | some sort of recursive seed. Compilers are supposed to be
         | reliable and deterministic, though to be fair advanced
         | optimizations can be surprising because of the various
         | heuristics.
        
           | viraptor wrote:
           | There's no reason to run the optimisation discovery at
           | compile time. Anything that changes the structure can be run
           | to change the source ahead of time. Anything that doesn't can
           | be generalised into a typical optimisation step in the
            | existing compiler pipeline. The same applies to Souper,
            | for example: you really don't want everyone to run it.
        
         | Arubis wrote:
         | This intrinsically feels like the opposite of a good use case
         | for an LLM for code gen. This isn't boilerplate code by any
         | means, nor would established common patterns be helpful. A lot
         | of what ffmpeg devs are doing at the assembly level is
         | downright novel.
        
         | hashishen wrote:
          | It doesn't matter. This is inherently better because the dev
          | knows exactly what is being done. LLMs could cripple entire
          | systems with assembly access.
        
       | Aardwolf wrote:
        | The article sometimes says 100x, other times 100% speed
        | boost. E.g. it says it "boosts the app's
        | 'rangedetect8_avx512' performance by 100.73%", but the
        | screenshot shows 100.73x.
       | 
       | 100x would be a 9900% speed boost, while a 100% speed boost would
       | mean it's 2x as fast.
       | 
       | Which one is it?
        
         | pizlonator wrote:
          | The ffmpeg folks are claiming 100x, not 100%. The article
          | probably has a typo.
        
         | MadnessASAP wrote:
          | 100x for the single function, 100% (2x) for the whole
          | filter.
        
         | torginus wrote:
          | I'd guess the function operates on 8-bit values, judging
          | from the name. If the previous implementation was scalar, a
          | double-pumped AVX512 implementation can process 128 elements
          | at a time, making the 100x speedup plausible.
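          | 
          | As a rough sketch of why the arithmetic works out (an
          | illustrative intrinsics version, not FFmpeg's actual
          | hand-written rangedetect8_avx512 assembly), a byte min/max
          | scan touches 64 lanes per instruction with AVX-512BW,
          | versus one byte per iteration for scalar code:
          | 
          |   /* compile with -mavx512bw; assumes n is a multiple of
          |    * 64 to keep the sketch short */
          |   #include <immintrin.h>
          |   #include <stdint.h>
          |   #include <stddef.h>
          |   
          |   static void range8(const uint8_t *p, size_t n,
          |                      uint8_t *lo, uint8_t *hi) {
          |       __m512i vmn = _mm512_set1_epi8((char)0xFF);
          |       __m512i vmx = _mm512_setzero_si512();
          |       for (size_t i = 0; i < n; i += 64) {
          |           __m512i v = _mm512_loadu_si512(p + i);
          |           vmn = _mm512_min_epu8(vmn, v); /* 64 mins at once */
          |           vmx = _mm512_max_epu8(vmx, v); /* 64 maxes at once */
          |       }
          |       uint8_t t[64], mn = 0xFF, mx = 0;
          |       _mm512_storeu_si512(t, vmn);
          |       for (int j = 0; j < 64; j++) if (t[j] < mn) mn = t[j];
          |       _mm512_storeu_si512(t, vmx);
          |       for (int j = 0; j < 64; j++) if (t[j] > mx) mx = t[j];
          |       *lo = mn; *hi = mx;
          |   }
          | 
          | Two independent accumulator pairs per loop iteration would
          | give the 128-elements-at-a-time "double-pumped" variant.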
        
         | ErrorNoBrain wrote:
          | it doesn't matter
          | 
          | it's faster
          | 
          | that's what matters
        
       | pavlov wrote:
       | Only for x86 / x86-64 architectures (AVX2 and AVX512).
       | 
       | It's a bit ironic that for over a decade everybody was on x86 so
       | SIMD optimizations could have a very wide reach in theory, but
       | the extension architectures were pretty terrible (or you couldn't
       | count on the newer ones being available). And now that you
       | finally can use the new and better x86 SIMD, you can't depend on
       | x86 ubiquity anymore.
        
         | Aurornis wrote:
         | AVX512 is a set of extensions. You can't even count on an
         | AVX512 CPU implementing all of the AVX512 instructions you want
         | to use, unless you stick to the foundation instructions.
         | 
          | Modern encoders also scale better across threads, though not
          | infinitely. I was on an embedded project a few years ago
          | where we spent a lot of time trying to get the SoC's video
          | encoder working reliably, until someone ran ffmpeg and we
          | realized we could just use several of the CPU cores for a
          | better result anyway.
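          | 
          | That's why code shipping AVX-512 paths typically
          | feature-detects at runtime and falls back. A minimal sketch
          | with the GCC/Clang builtins (the kernel names are made up):
          | 
          |   #include <stdio.h>
          |   
          |   static void process_avx512(void) { puts("AVX-512 path"); }
          |   static void process_avx2(void)   { puts("AVX2 path"); }
          |   static void process_scalar(void) { puts("scalar path"); }
          |   
          |   static void (*select_kernel(void))(void) {
          |       __builtin_cpu_init();
          |       /* check every subset the kernel needs, not just F */
          |       if (__builtin_cpu_supports("avx512f") &&
          |           __builtin_cpu_supports("avx512bw"))
          |           return process_avx512;
          |       if (__builtin_cpu_supports("avx2"))
          |           return process_avx2;
          |       return process_scalar;  /* portable fallback */
          |   }
          |   
          |   int main(void) {
          |       select_kernel()();
          |       return 0;
          |   }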
        
       | AaronAPU wrote:
        | When I spent a decade doing SIMD optimizations for HEVC (among
        | other things), it was sort of a joke to compare the assembly
        | versions to plain C, because you'd get some ridiculous
        | multipliers like 100x. That's pretty misleading; what it
        | really means is that the C was extremely inefficient to begin
        | with.
        | 
        | The devil is in the details: microbenchmarks typically call
        | the same function a million times in a loop, so everything
        | gets cached, reducing the overhead to sheer CPU cycles.
        | 
        | But that's not how the code is actually used in the wild. It
        | might be called once in a sea of many, many other things.
        | 
        | You can at least go out of your way to create a massive test
        | region of memory to keep the cache from being so hot, but I
        | doubt they do that.
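        | 
        | A rough sketch of that (sizes and the dummy kernel are
        | placeholders): walk a working set much larger than the
        | last-level cache, so each call pays realistic memory costs
        | instead of hitting hot lines:
        | 
        |   #include <stdint.h>
        |   #include <stdio.h>
        |   #include <stdlib.h>
        |   #include <time.h>
        |   
        |   #define WSET  (512u << 20)  /* 512 MiB, >> typical LLC */
        |   #define CHUNK (64u << 10)   /* 64 KiB per call */
        |   
        |   /* stand-in for the routine under test */
        |   static uint64_t kernel(const uint8_t *p, size_t n) {
        |       uint64_t s = 0;
        |       for (size_t i = 0; i < n; i++) s += p[i];
        |       return s;
        |   }
        |   
        |   int main(void) {
        |       uint8_t *buf = malloc(WSET);
        |       if (!buf) return 1;
        |       for (size_t i = 0; i < WSET; i++) buf[i] = (uint8_t)i;
        |   
        |       struct timespec t0, t1;
        |       uint64_t sink = 0;
        |       clock_gettime(CLOCK_MONOTONIC, &t0);
        |       /* rotate through the whole set so it can't stay cached */
        |       for (size_t off = 0; off + CHUNK <= WSET; off += CHUNK)
        |           sink += kernel(buf + off, CHUNK);
        |       clock_gettime(CLOCK_MONOTONIC, &t1);
        |   
        |       double ns = (t1.tv_sec - t0.tv_sec) * 1e9
        |                 + (t1.tv_nsec - t0.tv_nsec);
        |       printf("%.3f ns/byte (sink=%llu)\n", ns / WSET,
        |              (unsigned long long)sink);
        |       free(buf);
        |       return 0;
        |   }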
        
         | torginus wrote:
         | Sorry for the derail, but it sounds like you have a ton of
         | experience with SIMD.
         | 
         | Have you used ISPC, and what are your thoughts on it?
         | 
          | I feel it's a bit ridiculous that in this day and age you
          | have to write SIMD code by hand because regular compilers
          | suck at auto-vectorizing, especially since this has never
          | been an issue with GPU kernels.
        
       ___________________________________________________________________
       (page generated 2025-07-20 23:00 UTC)