[HN Gopher] FFmpeg devs boast of another 100x leap thanks to han...
___________________________________________________________________
FFmpeg devs boast of another 100x leap thanks to handwritten
assembly code
Author : harambae
Score : 83 points
Date : 2025-07-20 20:51 UTC (2 hours ago)
(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)
| shmerl wrote:
| Still waiting for PipeWire + xdg-desktop-portal screen/window
| capture support in the ffmpeg CLI. It's been dragging its feet
| on that forever.
| askvictor wrote:
| I wonder how many optimisations like this could be created by
| LLMs. Obviously we should not be incorporating LLMs into
| compilers, but I suspect that that's eventually what will happen.
| pizlonator wrote:
| The hardest part of optimizations like this is verifying that
| they are correct.
|
| We don't have a reliable, general-purpose way of verifying that
| an arbitrary code transformation is correct.
|
| LLMs definitely can't do this (they will lie and say that
| something is correct even if it isn't).
| viraptor wrote:
| But we do! For llvm there's
| https://github.com/AliveToolkit/alive2 There are papers like
| https://people.cs.rutgers.edu/~sn349/papers/cgo19-casmverify.
| .. There's https://github.com/google/souper There's
| https://cr.yp.to/papers/symexemu-20250505.pdf And probably
| other things I'm not aware of. If you're limiting the scope
| to a few blocks at a time, symbolic execution will do fine.
| pizlonator wrote:
| > limiting the scope to a few blocks
|
| Yeah so like that doesn't scale.
|
| The interesting optimizations involve reasoning across
| thousands of blocks
|
| And my point is that there is no reliable, general-purpose
| solution here. "Only works for a few blocks at a time" is
| neither reliable nor general purpose.
| LtWorf wrote:
| > I wonder how many optimisations like this could be created by
| LLMs
|
| Zero. There's no huge corpus of Stack Overflow questions on
| highly specific assembly optimisations, so...
| astrange wrote:
| You can run an agent in a loop, but for something this small
| you can already use a SAT solver or superoptimizer if you
| want to get out of the business of thinking about things
| yourself.
|
| I've never seen anyone actually do it, mostly because
| modeling the problem is more work than just doing it.
| ksclk wrote:
| > you can already use a SAT solver
|
| Could you elaborate please? How would you approach this
| problem, using a SAT solver? All I know is that a SAT
| solver tells you whether a certain formula of ANDs and ORs
| is true. I don't know how it could be useful in this case.
| dragontamer wrote:
| Pretty much all instructions at the assembly level are
| sequences of AND/OR/XOR operations.
|
| SAT solvers can prove that some (shorter) sequences are
| equivalent to other (longer) sequences. But it takes a
| brute force search.
|
| IIRC, these superoptimizing SAT solvers can see patterns and
| pick 'multiply' instructions as part of their search, so it's
| more than traditional SAT. But at the end of the day it's
| still a SAT equivalence problem.
| agent327 wrote:
| A short look at any compiled code on godbolt will very
| quickly inform you that pretty much all instructions at
| the assembly level are, in fact, NOT sequences of
| AND/OR/XOR operations.
| dragontamer wrote:
| All instructions are implemented with logic gates; in fact,
| today they are likely all NAND gates underneath.
|
| Have you ever seen a Wallace tree multiplier? It's a good
| demonstration of how XOR and AND gates can implement multiply.
|
| Now, if multiply + XOR yields the function you want, that's
| likely better than whatever the compiler originally output.
| viraptor wrote:
| You're missing the point. All instructions can be
| simplified to short integer operations, then all integer
| operations are just networks of gates, then all gates can
| be replaced with AND/OR/NOT, or even just NAND. That's
| why you can SAT solve program equivalence. See SMT2
| programs using BV theory for example.
|
| Also of course all instructions are MOV anyway.
| https://github.com/xoreaxeaxeax/movfuscator
| gametorch wrote:
| There are literally textbooks on optimization, with tons of
| examples. I'm sure there are models out there trained on them.
| viraptor wrote:
| https://arxiv.org/html/2505.11480v1 we're getting there. This
| is for general purpose code, which is going to be easier than
| heavy SIMD where you have to watch out for very specific
| timing, pipelines and architecture details. But it's a step in
| that direction.
| smj-edison wrote:
| I'd be worried about compile times, lol. Final binaries are
| quite often tens to hundreds of megabytes, and an LLM surely
| processes tokens _much_ slower than a compiler completes
| passes.
|
| EDIT: another thought: non-deterministic compilation would also
| be an issue unless you were tracking the input seed, and it
| would still cause spooky action at a distance unless you had
| some sort of recursive seed. Compilers are supposed to be
| reliable and deterministic, though to be fair advanced
| optimizations can be surprising because of the various
| heuristics.
| viraptor wrote:
| There's no reason to run the optimisation discovery at
| compile time. Anything that changes the structure can be run
| to change the source ahead of time. Anything that doesn't can
| be generalised into a typical optimisation step in the
| existing compiler pipeline. Same applies to Souper for
| example - you really don't want everyone to run it.
| Arubis wrote:
| This intrinsically feels like the opposite of a good use case
| for an LLM for code gen. This isn't boilerplate code by any
| means, nor would established common patterns be helpful. A lot
| of what ffmpeg devs are doing at the assembly level is
| downright novel.
| hashishen wrote:
| It doesn't matter. This is inherently better because the dev
| knows exactly what is being done. Llms could cripple entire
| systems with assembly access
| Aardwolf wrote:
| The article sometimes says 100x, other times 100% speed boost.
| E.g. it says it "boosts the app's 'rangedetect8_avx512'
| performance by 100.73%," but the screenshot shows 100.73x.
|
| 100x would be a 9900% speed boost, while a 100% speed boost would
| mean it's 2x as fast.
|
| Which one is it?
| pizlonator wrote:
| The ffmpeg folks are claiming 100x not 100%. Article probably
| has a typo
| MadnessASAP wrote:
| 100x for the single function; 100% (2x) for the whole filter.
| torginus wrote:
| I'd guess the function operates on 8-bit values, judging from
| the name. If the previous implementation was scalar, a double-
| pumped AVX512 implementation can process 128 elements at a
| time, making the 100x speedup plausible.
| ErrorNoBrain wrote:
| It doesn't matter.
|
| It's faster.
|
| That's what matters.
| pavlov wrote:
| Only for x86 / x86-64 architectures (AVX2 and AVX512).
|
| It's a bit ironic that for over a decade everybody was on x86 so
| SIMD optimizations could have a very wide reach in theory, but
| the extension architectures were pretty terrible (or you couldn't
| count on the newer ones being available). And now that you
| finally can use the new and better x86 SIMD, you can't depend on
| x86 ubiquity anymore.
| Aurornis wrote:
| AVX512 is a set of extensions. You can't even count on an
| AVX512 CPU implementing all of the AVX512 instructions you want
| to use, unless you stick to the foundation instructions.
|
| Modern encoders also have better scaling across threads, though
| not infinite. I was on an embedded project a few years ago
| where we spent a lot of time trying to get the SoC's video
| encoder working reliably, until someone ran ffmpeg and we
| realized we could just use several of the CPU cores for a
| better result anyway.
| AaronAPU wrote:
| When I spent a decade doing SIMD optimizations for HEVC (among
| other things), it was sort of a joke to compare the assembly
| versions to plain C, because you'd get ridiculous multipliers
| like 100x. That's pretty misleading: what it really means is
| that the original code was extremely inefficient to begin with.
|
| The devil is in the details: microbenchmarks typically call the
| same function a million times in a loop, so everything gets
| cached, reducing the overhead to sheer CPU cycles.
|
| But that's not how it's actually used in the wild. It might be
| called once in a sea of many many other things.
|
| You can at least go out of your way to create a massive test
| region of memory to prevent the cache from being so hot, but I
| doubt they do that.
| torginus wrote:
| Sorry for the derail, but it sounds like you have a ton of
| experience with SIMD.
|
| Have you used ISPC, and what are your thoughts on it?
|
| I feel it's a bit ridiculous that in this day and age you have
| to write SIMD code by hand because regular compilers suck at
| auto-vectorizing, especially since this has never been the
| case with GPU kernels.
___________________________________________________________________
(page generated 2025-07-20 23:00 UTC)