https://www.tomshardware.com/software/the-biggest-speedup-ive-seen-so-far-ffmpeg-devs-boast-of-another-100x-leap-thanks-to-handwritten-assembly-code

'The biggest speedup I've seen so far' -- FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

News (Software). By Mark Tyson, published 17 July 2025

But admit this boost is only seen in 'an obscure filter'.

FFmpeg update can make some operations 100x faster (Image credit: FFmpeg)

The developers behind the FFmpeg project are again claiming major performance uplifts delivered by wielding the art of handwritten assembly code. With the latest patch applied, users should see a "100x speedup" in the cross-platform open-source media transcoding application. However, the developers were quick to clarify that the 100x claim applies to just a single function, "not the whole of FFmpeg."

BREAKING: FFmpeg 100x speedup from handwritten assembly
13:55:30 <*haasn> rangedetect8_avx512: 121.2 (100.18x)
that may be the biggest speedup I've seen so far
July 16, 2025

"The biggest speedup I've seen so far"

Last November, we reported on an FFmpeg performance boost that could speed certain operations by up to 94x. The latest handwritten assembly patch boosts the app's 'rangedetect8_avx512' performance by 100.73x. If your modern processor doesn't support AVX512, you should still see a 65.63x uplift with the rangedetect8_avx2 code path.

Where will you feel these speed increases? In some follow-up tweets, the FFmpeg developers admit that "It's a single function that's now 100x faster, not the whole of FFmpeg." They would later go on to elaborate that the functionality, which might enjoy a 100x speed boost, depending upon your system, was "an obscure filter."

The obscurity of the function means it hadn't been prioritized by the devs until now.
But we also gather that the filter code was rewritten using SIMD (Single Instruction, Multiple Data) instructions for vastly improved parallel processing on today's powerful chips. Evidently, compilers - programs that take higher-level language code and produce assembly or machine code - are still not competitive with handwritten assembly. Or you could say, "register allocator sucks on compilers," as FFmpeg tweeted today.

Assembly language evangelicals

In the golden age of home computing in the 1980s and 1990s, when fixed-spec systems had lifecycles measured in half-decades - and strictly limited processing resources - handwritten assembly code optimizations played a larger part in the business of speeding up computers, games, and other software. FFmpeg is perhaps one of the few 'assembly evangelists' remaining. The dev team even runs a 'school.'

FFmpeg tools and libraries run across Linux, Mac OS X, Microsoft Windows, the BSDs, Solaris, and more. One of the most popular video player software utilities, VLC, uses the libavcodec and libavformat libraries from the FFmpeg project.

Mark Tyson, News Editor
Mark Tyson is a news editor at Tom's Hardware.
He enjoys covering the full breadth of PC tech, from business and semiconductor design to products approaching the edge of reason.
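
The speedup figures quoted in this article are multipliers ("100x"), which are easy to confuse with percentage uplifts ("100%"). The following is a minimal C sketch of the arithmetic relating the two. The function names are hypothetical, chosen only for illustration, and have nothing to do with FFmpeg's code.

```c
#include <assert.h>

/* Hypothetical helper: convert a speedup factor (2.0 means "2x as fast")
 * into a percentage uplift over the baseline. */
double factor_to_uplift_percent(double factor)
{
    assert(factor > 0.0); /* a speedup factor must be positive */
    return (factor - 1.0) * 100.0;
}

/* Hypothetical helper: convert a percentage uplift (100.0 means "+100%")
 * into a speedup factor relative to the baseline. */
double uplift_percent_to_factor(double percent)
{
    assert(percent > -100.0); /* -100% would mean zero performance */
    return 1.0 + percent / 100.0;
}
```

Under these definitions, a 100% uplift corresponds to a factor of 2, while a 100x factor corresponds to a 9,900% uplift, so the two ways of stating a result differ by roughly two orders of magnitude at this scale.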

Comments (9)

DS426: Reminds me of the Voxel Space engine used circa 1993 for the original Comanche PC game that was entirely written in Assembly as it needed to perform well even without GPU acceleration.

ex_bubblehead: There's still a lot to be said for hand optimised machine language code.

bit_user: The article said: "Last November, we reported on an FFmpeg performance boost that could speed certain operations by up to 94x."
That speedup was reproducible only if you compiled the totally unoptimized, generic C implementation in debug mode. When I tried compiling it in release mode and using clang instead of gcc, I got over 50% as fast as the hand-written assembly, without any changes to the generic C sources. Upon reading the sources, it's clear that the C could've been written more optimally, likely yielding further improvements - and I'm not even talking about using any AVX512 intrinsics! I will be taking a look at this latest patch, when I have a chance.
P.S. thanks for actually linking the patch, this time. Last time, the patch wasn't in ffmpeg, but rather someone posted a slide from a DAV1D presentation on the ffmpeg Twitter account. Took me a while to figure that out.
Edit: I just learned that someone looked into that previous 94x speedup even deeper than I did! Apparently, the SIMD code implemented a 6-tap convolution, whereas the generic C version implemented 8-tap. I'll bet that accounts for a lot of the difference clang couldn't close. https://news.ycombinator.com/item?id=42042706
Furthermore, they actually reached the original author and got an admission that the comparison was made using C code compiled with optimizations disabled!

bit_user: ex_bubblehead said: "There's still a lot to be said for hand optimised machine language code."
Oh, but they never even tried using C with intrinsics. Whenever they optimize something, they go straight to assembly. So, we don't even know how well-optimized C compares. In the last thread, someone claimed compilers wouldn't be smart enough to fuse two separate operations into a single AVX-512 instruction, which I subsequently demonstrated clang/LLVM doing. I've been quite impressed by its autovectorization. It won't restructure your code to be vector-friendly, but it seems to do a good job of more straightforward vectorization tasks. If I get a chance to fiddle with this patch, I'll post my findings here.

bit_user: DS426 said: "Reminds me of the Voxel Space engine used circa 1993 for the original Comanche PC game that was entirely written in Assembly as it needed to perform well even without GPU acceleration."
In 1993, compilers weren't nearly as sophisticated and the first PC 3D graphics accelerator cards didn't yet exist. Nvidia's NV1 didn't launch until May, 1995. Cards based on 3D Labs' Glint 300SX also showed up about the same time. BTW, I'm sure plenty of other 3D games from that time used assembly language. It wouldn't surprise me to learn that Wolfenstein and Doom both did. Quake was worked on by the author of the book Zen of Assembly Language. I read his columns in Dr. Dobb's Journal, back in the day, and still have a copy of his book Zen of Code Optimization floating around, somewhere.

DS426: bit_user said: "In 1993, compilers weren't nearly as sophisticated and the first PC 3D graphics accelerator cards didn't yet exist. Nvidia's NV1 didn't launch until May, 1995. Cards based on 3D Labs' Glint 300SX also showed up about the same time. BTW, I'm sure plenty of other 3D games from that time used assembly language. It wouldn't surprise me to learn that Wolfenstein and Doom both did. Quake was worked on by the author of the book Zen of Assembly Language. I read his columns in Dr. Dobb's Journal, back in the day, and still have a copy of his book Zen of Code Optimization floating around, somewhere."
Great point! It sounds wild today but probably wasn't very rare back then. As for graphics, yep, that design decision makes complete sense (I would say not even a decision as I don't think there was another feasible option) given the goings-on of that time. Even 3dfx' Voodoo accelerator didn't come out commercially until 1996.

terabite: The writer seems to use 100% and 100x interchangeably. They are not.
A 100% uplift is twice the performance, a 100x uplift is 100 times the performance.

bit_user: terabite said: "The writer seems to use 100% and 100x interchangeably. They are not. A 100% uplift is twice the performance, a 100x uplift is 100 times the performance"
Yeah, I noticed that as well. At least the headline matches what the developer claimed. Not my biggest issue with the article, though.
I'm trying to put myself in the headspace of an author or a reader who knows little or nothing about software development or code optimization. There are so many questions that should come up, like: whether such performance gains are just waiting to be had in any piece of software written in C? Wouldn't that seem like a reasonable question to ask, because it'd save everyone having to upgrade their CPUs, if true! And yet, why doesn't the author pursue that angle? And if not, what's so special about this code that makes the compiler so horrible at optimizing it? It just strikes me as a rather incurious article.
The only thing more confounding than that is the question of why the developer (if they actually believed these numbers), didn't apparently deem the C code in need of any reworking, since ffmpeg is a cross-platform codebase and doesn't have assembly language versions of every function for every supported CPU ISA. So, the C version should at least be made decently fast. And if it's slower by more than an order of magnitude, it should be obvious something is really wrong with it, because an order of magnitude is usually about what you'd get from vectorizing something like this.

bit_user: I noticed this patch still hadn't been merged. A quick look at the linked mailing list thread shows why:
The developer (Niklas Haas) said: "Upon further testing, I realized that this logic (both C and SIMD) overflows for 16-bit inputs. Will fix and resubmit."
Source: https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346727.html
He goes on to say: "I also found that the C versions can be made slightly faster by returning out of the inner loop, which generates a shorter scalar version that is faster than the auto-vectorized abomination that was generated before."
So, that also suggests why the generic C version performed quite so poorly.
He iterated on these patches two more times, so far, yet it still hasn't been merged. Here's the latest patch, featuring a speedup of 55x over generic C*: https://ffmpeg.org/pipermail/ffmpeg-devel/2025-July/346786.html
* I've learned that ffmpeg doesn't enable compiler autovectorization, by default. So, this is comparing AVX-512 (processing up to 64 bytes at a time) vs. serial code that processes only one byte per loop iteration. Considering that, a speedup of 55x actually makes sense.
Once his patch is finally merged, I'll probably pull the tree and see what sort of performance it gets vs. autovectorized SSE2 and AVX-512 via both gcc and clang.
P.S. I think this highlights the risks of reporting on patches before they've even been merged. Also, the article's author probably could've seen the first reply I quoted from, above, given it was sent less than an hour after the one cited in the article. While it might be somewhat rare for a developer to send such a retraction, the whole purpose of sending patches to such mailing lists is to solicit feedback from other list members (thus leading to fixes & further iteration), which is quite typical. Anything that's submitted, especially in its first draft, should be considered nothing more than a work-in-progress. So, I'd say the article author (Mark Tyson) deserves an extra demerit for failing to look at all the messages in that mailing list thread.
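
The "one byte per loop iteration" baseline and the "returning out of the inner loop" change discussed in the thread can be pictured with a minimal C sketch. This is not FFmpeg's code: the function names, signatures, and the idea of passing explicit lower and upper bounds are assumptions made purely for illustration. The first function exits as soon as it finds an out-of-range sample. The second computes a running minimum and maximum over the whole buffer, which is the kind of branch-free reduction that SIMD code can apply to many bytes at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar check (not FFmpeg's implementation): returns 1 as soon
 * as any 8-bit sample lies outside [lo, hi], otherwise 0.
 * It examines one byte per loop iteration and exits early on a hit. */
int any_out_of_range8(const uint8_t *data, size_t len, uint8_t lo, uint8_t hi)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (data[i] < lo || data[i] > hi)
            return 1; /* early return out of the loop */
    }
    return 0;
}

/* Hypothetical reduction form of the same check: track the minimum and
 * maximum of the whole buffer, then compare once at the end.
 * There is no early exit, so every byte is always visited. */
int any_out_of_range8_minmax(const uint8_t *data, size_t len,
                             uint8_t lo, uint8_t hi)
{
    assert(data != NULL || len == 0);
    uint8_t min = UINT8_MAX; /* identity value for a running minimum */
    uint8_t max = 0;         /* identity value for a running maximum */
    for (size_t i = 0; i < len; i++) {
        if (data[i] < min)
            min = data[i];
        if (data[i] > max)
            max = data[i];
    }
    return min < lo || max > hi;
}
```

As a usage example, 8-bit limited-range luma nominally spans 16 to 235, so calling either function with `lo = 16` and `hi = 235` reports whether a buffer contains any sample outside that range. Both functions return the same answer for the same input; they differ only in how much work they do and in how amenable the loop is to vectorization.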