https://pngquant.org/rust.html
Improved portability and performance
libimagequant is a library for generating high-quality palettes,
useful for compression of transparent PNG images (~75% smaller!) and
making nice GIF animations.
libimagequant is now a pure Rust library. The new version is a
drop-in replacement (ABI-compatible), so C projects can continue
using it. The C version will be supported for a while to give library
users and distros time to switch.
Why?
In short: There's no standard for building C programs. It's a
non-portable mess, and a time sink. Cross-compilation of OpenMP was
the last straw. Rust/Cargo is much more dependable, and enables me to
support more features on more platforms.
I didn't plan to rewrite the library yet. The C codebase is stable
and gets the job done. However, I've switched to an M1 Mac, and this
made x86-64 a cross-compilation target for me. Building
redistributable executables with OpenMP was already an annoyance, but
cross-compilation added a whole new level of brokenness to deal
with. It felt silly to get a new fast CPU only to emulate an old one,
or rely on someone else's CPU in the cloud instead. Over the holiday
break I tested whether it would be easier to rewrite the library in
Rust instead, and indeed it was.
Why not just drop multi-threading, or OpenMP?
That's what I used to do: OpenMP was optional, but being limited to 1
/8th or 1/16th of a CPU sucked. I use libimagequant in gif.ski, which
for sad GIFfy reasons is CPU-intensive, with quantization in its
hot path, so it really matters.
OpenMP was actually quite nice for basic for loops -- when it worked.
Unfortunately, OpenMP is not a library, but a compiler extension, so
it's at the mercy of built-in compiler support. It can be outdated,
incomplete, or completely absent.
I could have kept the library in C and replaced OpenMP with some
other solution instead, but I wasn't keen on reinventing this wheel.
Even basic things like spawning a thread run into C's portability
woes. The platonic ideal portable C exists only as a hypothetical
construct in the C standard. The C that exists in the real world is
whatever Microsoft, Apple, and others have shipped. That C is a mess
of vendor-specific toolchains, each with its own way of doing things,
missing features, broken headers, and leaky abstractions. Shouting
"it's not C's fault, screw !" doesn't solve the
problem, but switching to Rust does.
How?
Some parts were simply rewritten from scratch. Some parts were
converted with Citrus, which is my C-to-Rust "transpiler". Unlike
c2rust, it's not semantically accurate, but it generates readable
Rust source code that is easier to clean up. I did the first
quick'n'dirty (but fully functional) conversion at a rate of about
1000 lines per day. Later rewriting, refactoring and polishing took
the average down to under 200 lines per day. rust-analyzer, cargo
clippy, and cargo fix have been very helpful.
Results
I'm quite happy with the result. 3425 lines of mostly C have been
replaced with 3413 lines of Rust. The new version exports the same
ABI, so for C programs and FFI users (Python, Java, etc.) it's a
drop-in replacement. The new code uses the same algorithm, but there
are minor differences from adaptation to Rust's idioms.
It's a coincidence that the number of lines ended up being so close,
because different parts of the code translated to significantly
different numbers of lines. For example, I was able to delete
pngquant's entire
DIY hash table implementation, and use one from Rust's standard
library instead. OTOH I've written an abstraction for iterating over
images, which has replaced a few lines of pointer arithmetic with a
bunch of objects, methods, and interfaces.
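For a flavor of both extremes, here's a minimal sketch (illustrative
names and types, not the library's actual API): the histogram side
leans on std's HashMap, and the row iterator replaces pointer
arithmetic with slices.
    use std::collections::HashMap;
    // The DIY C hash table becomes std's HashMap: counting colors for
    // a histogram is one entry() call per pixel.
    fn histogram(pixels: &[[u8; 4]]) -> HashMap<[u8; 4], u32> {
        let mut counts = HashMap::new();
        for px in pixels {
            *counts.entry(*px).or_insert(0) += 1;
        }
        counts
    }
    // Row iteration without pointer arithmetic: each item is one row
    // of `width` pixels, borrowed from the flat buffer.
    fn rows<T>(pixels: &[T], width: usize) -> impl Iterator<Item = &[T]> {
        debug_assert_eq!(pixels.len() % width, 0);
        pixels.chunks_exact(width)
    }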
Executable size
The Rust executable statically links Rust's standard library
(unlike C executables, which usually use the OS's libc), and this
costs ~260KB of executable size. I'm not happy about this, but
fortunately
it's only a one-time cost that doesn't grow with code size. Rust's
standard library can be disabled, so I have an option of trimming it
down if necessary, but I'm not that bothered by it.
I've compared sizes of executables (aarch64, with LTO, panic=abort,
and stripped). Apart from the libstd overhead, the byte size of the
single-threaded Rust version of the library has grown by about 11%
compared to the C version without OpenMP. The Rust version using
rayon was actually 4% smaller than the C version with OpenMP linked
statically. I've
liberally used high-level Rust abstractions, generic libstd
containers, and a few dependencies, so I'm very happy that the Rust
version turned out to be reasonably small. rayon is a substantial
library with thread pools, work-stealing queues, parallel iterators,
and many generic interfaces. It being smaller than the
"compiler-native" OpenMP was a nice surprise too.
Performance
Performance of the Rust version is similar, with a couple of things
noticeably faster:
* The hash table implementation in Rust's standard library uses a
state-of-the-art algorithm, and my C didn't. In theory, nothing
stopped me from using the same algorithm in C. In practice, lack
of a useful standard library, lack of templates/generics, and the
dreadful state of C dependency management did.
* The median cut algorithm used to represent ranges as int start,
length; and indexed with them into a shared array. Equivalent
int-indexing code in Rust had similar performance. However, when
I changed the ranges to &mut [] slices, it got a speed boost (see
the sketch after this list)! Could it be thanks to the mythical
no-alias guarantee of slices?
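Here's the sketch mentioned in the list above; the function names
are made up, and it's only meant to show the shape of the two styles:
    // Index-based, like the C version: a box is (start, len) into one
    // shared array, and every access is a bounds-checked index.
    fn box_weight(weights: &[f32], start: usize, len: usize) -> f32 {
        weights[start..start + len].iter().sum()
    }
    // Slice-based: the box *is* the slice, and splitting it yields two
    // non-overlapping &mut slices the optimizer knows don't alias.
    fn split_box(weights: &mut [f32], at: usize)
        -> (&mut [f32], &mut [f32]) {
        weights.split_at_mut(at)
    }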
Rust has also added a bunch of tiny overheads all over the place, like
glue code for panics and destructors, extra bounds checks, use of
64-bit usize instead of 32-bit int, but these things did not show up
in profiling, and other wins outweigh them.
I think it's an interesting comparison, because it wasn't a rewrite
of an old legacy codebase with a shiny new design. The C code was as
good as I could make it, and the Rust version uses the same algorithm
(with different bugs ;)
Nice things
I've tried to be careful about implicit numeric type conversions in C
(including mixing of float and double), but there were a few that
I've missed. They became clear after conversion to Rust. Rust is
strict about numeric types to the point of being annoying, but at
least I know the only type conversions are the ones that I wrote.
Rust's enums worked well. The library has various ways of
representing images in memory (e.g. they can be dynamically streamed
without buffering). In C this needed a few inter-dependent nullable
pointers and boolean flags, which were hard to keep track of. In Rust
it translated nicely to different sets of enum fields.
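Roughly like this (an illustrative enum, not the library's real
type):
    // Each way of holding pixel data gets its own variant with exactly
    // the fields it needs -- no nullable pointers, no flag combinations
    // that don't make sense together.
    enum PixelSource<'a> {
        // Caller's buffer, borrowed for the lifetime of the image.
        Borrowed(&'a [u8]),
        // Buffer owned by the library.
        Owned(Vec<u8>),
        // Rows produced on demand by a callback, never fully buffered.
        Streamed(Box<dyn Fn(usize, &mut [u8]) + 'a>),
    }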
Lifetime annotations helped me catch a bug. There's an image
convolution pass that needs to look at pixels from the current,
previous, and next line at the same time. However, the
statefully-complex image may have had only a 1-line buffer, so all
three line reads would end up overwriting each other in that buffer.
In Rust, such a loop didn't compile due to a borrow-checking error
(can't mutably borrow the internal temp buffer more than once).
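The version that does compile has to look something like this
simplified sketch: three separate buffers, three disjoint borrows.
Handing out one shared 1-line buffer three times is exactly what the
borrow checker refuses.
    struct RowWindow {
        prev: Vec<f32>,
        curr: Vec<f32>,
        next: Vec<f32>,
    }
    impl RowWindow {
        // Three disjoint &mut borrows are fine; mutably borrowing one
        // shared buffer three times would not compile.
        fn rows_mut(&mut self) -> (&mut [f32], &mut [f32], &mut [f32]) {
            (self.prev.as_mut_slice(),
             self.curr.as_mut_slice(),
             self.next.as_mut_slice())
        }
    }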
When users pass invalid pointers to the library (freed memory or
other garbage), it makes my code crash. Users report it to me as my
bug, and I end up debugging their bugs. The good news is that it
doesn't happen in Rust, but I still expose a C API, so I've ported my
defence from C: I first dereference all user-supplied C pointers in a
specific function named its_your_fault_not_mine (or a more
professional version of it). This triggers the crashes immediately,
right there, rather than sometime later deep in my code, so the crash
stacktrace gets a built-in disclaimer. In C, there's no standard
feature for preventing a function from being inlined or optimized
out. There's no standard for making a "useless" memory read happen
without causing a compiler warning. In Rust, there are explicit
features for all of this. BTW, volatile in Rust is a function call.
That makes so much more sense!
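The Rust version of that trick boils down to an attribute and one
std function; a sketch of the idea, not the library's exact code:
    // Never inlined, so this name shows up in the crash backtrace.
    #[inline(never)]
    pub unsafe fn its_your_fault_not_mine(ptr: *const u8) -> bool {
        if ptr.is_null() {
            return true;
        }
        // A volatile read is a plain function call and can't be
        // optimized out, so a garbage pointer crashes right here.
        let _ = std::ptr::read_volatile(ptr);
        false
    }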
Rust has x86 SIMD intrinsics and CPU detection built-in, so I was
able to port my SSE code easily. When I first considered a Rust
rewrite a few years ago they weren't there yet, so that's a reminder
that Rust is improving.
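The porting pattern looks roughly like this (an illustrative kernel,
not the library's actual SSE code):
    #[cfg(target_arch = "x86_64")]
    fn sum_pixels(pixels: &[f32]) -> f32 {
        assert_eq!(pixels.len() % 4, 0);
        // Runtime CPU detection is part of std.
        if is_x86_feature_detected!("sse4.1") {
            unsafe { sum_pixels_sse(pixels) }
        } else {
            pixels.iter().sum()
        }
    }
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "sse4.1")]
    unsafe fn sum_pixels_sse(pixels: &[f32]) -> f32 {
        use std::arch::x86_64::*;
        let mut acc = _mm_setzero_ps();
        for px in pixels.chunks_exact(4) {
            acc = _mm_add_ps(acc, _mm_loadu_ps(px.as_ptr()));
        }
        let mut out = [0.0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), acc);
        out.iter().sum()
    }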
Corners cut
The C API had support for custom allocators. I've dropped it. Rust
has its own mechanism for configuring a global allocator, which is
hopefully enough. Custom allocators per object are possible, but I
don't think it's worth the hassle.
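For reference, this is the whole mechanism on the binary's side; the
sketch below just re-selects the default System allocator, and
allocator crates plug in the same way:
    use std::alloc::System;
    // One allocator per program, chosen by the final binary rather
    // than per library object.
    #[global_allocator]
    static GLOBAL: System = System;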
In C I've used a "flexible array member" trick to allocate structs
with a head and variable-length data in one malloc. The same
technique is doable in Rust with some gymnastics, but also turned out
to be entirely unnecessary. Rust prefers to return objects "by
value", but with calling convention magic or inlining avoid actually
copying them. This usually eliminates heap allocations of structs
entirely, or at least makes structs with variable-length data still
cost one heap allocation, not two. It was a microoptimization anyway.
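To illustrate with made-up types (not the library's API): the header
travels by value, and the variable-length part is the Vec's single
allocation.
    // The header lives wherever the caller puts it (stack, another
    // struct); the entries are one heap allocation owned by the Vec.
    struct Palette {
        gamma: f64,
        entries: Vec<[u8; 4]>,
    }
    fn make_palette(colors: &[[u8; 4]], gamma: f64) -> Palette {
        // Returned "by value"; in practice the copy is optimized away.
        Palette { gamma, entries: colors.to_vec() }
    }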
The C library used its own memory pool to avoid small allocations. It
was probably overkill even for C, and didn't improve performance
in Rust (I've tested it), so I've dropped it.
Quirks
The C API promised to hold image data allocated with malloc, and even
dynamically toggle whether it should free it or not. From an
idiomatic Rust perspective, that's weird and sloppy. I've managed to
fully
reimplement it, but it was 130 lines of not-so-pretty code that
shouldn't need to exist in a Rust library.
Neither C nor Rust does a good job of autovectorizing my code that uses
4xf32 pixels. I've tried giving them a 16-byte alignment, reordering
operations to be as close to SSE instructions as possible, enabled
arch=native, and so on, but it's a quartet of single-data
instructions every time, in both languages.
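For the record, the code in question has roughly this shape
(illustrative), and the loop still comes out as scalar instructions:
    // 16-byte aligned 4 x f32 pixel, hoping for a single SSE mul/add.
    #[repr(C, align(16))]
    #[derive(Clone, Copy)]
    struct Pixel4([f32; 4]);
    fn mul_add(a: Pixel4, b: Pixel4, weight: f32) -> Pixel4 {
        let mut out = [0.0f32; 4];
        for i in 0..4 {
            out[i] = a.0[i] + b.0[i] * weight;
        }
        Pixel4(out)
    }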
Rust isn't really an object-oriented language, but it has enough
syntax sugar to look like one. Trying to make the library more
object-oriented exposed a "drum-stick" design issue: does play(drum,
stick) translate to drum.play(stick) or stick.hit(drum)? The C
library only had global functions, and all fields of all structs were
public to the library (opaque structs in C are for the public API/
ABI, not the internals), so "where does this code belong" was never a
question. The Rust-OO conversion ended up with a few odd functions
that don't fit anywhere, and more crate-public fields than I'd like.
By default Rust assumes that any structure containing a C pointer is
not thread-safe, but I've inherited an API that is full of C
pointers! I've been forced to manually override it with "trust me,
it's thread-safe" in a few places. Usually it's Rust telling me
whether my code is thread-safe or not, so I'm uneasy about the
overrides. miri test passes.
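The override itself is a one-liner per type, something like this (the
type is made up; the raw pointer stands in for the inherited C API
baggage):
    use std::ffi::c_void;
    struct Callbacks {
        // User data for a C callback -- a raw pointer, so the struct
        // is neither Send nor Sync by default.
        user_data: *mut c_void,
    }
    // SAFETY: my claim, not the compiler's -- the pointer is never
    // dereferenced by the library, only handed back to the caller.
    unsafe impl Send for Callbacks {}
    unsafe impl Sync for Callbacks {}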
Parallelism
Rust rayon is conceptually similar to OpenMP, but instead of #pragma
omp for there's par_iter(). It has delivered the same linear speedups
where possible.
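For comparison, a typical parallel loop is just this (a sketch:
remap_row stands in for the real per-row work, and rayon is a regular
crate dependency):
    use rayon::prelude::*;
    fn remap_all(rows: &mut [Vec<u8>]) {
        // The moral equivalent of `#pragma omp parallel for`.
        rows.par_iter_mut()
            .enumerate()
            .for_each(|(y, row)| remap_row(y, row));
    }
    fn remap_row(_y: usize, row: &mut [u8]) {
        for px in row.iter_mut() {
            *px = px.wrapping_add(1); // placeholder per-pixel work
        }
    }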
The C version used buffers[omp_get_thread_num()] to have exclusive
per-thread buffers. This was basically a DIY thread-local storage, so
I've used thread-local storage in Rust too. For summing up results
from multiple threads, Rayon supports parallel fold/reduce, but my
data is simple enough that summing it up serially at the end is
slightly faster (parallel reduce adds intermediate steps that didn't
pay off here).
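The Rust shape of that pattern, sketched with std's thread_local!
(names and buffer type are illustrative):
    use std::cell::RefCell;
    thread_local! {
        // Every worker thread lazily gets its own scratch buffer,
        // replacing the C `buffers[omp_get_thread_num()]` lookup.
        static SCRATCH: RefCell<Vec<f32>> = RefCell::new(Vec::new());
    }
    fn with_scratch<R>(len: usize, f: impl FnOnce(&mut [f32]) -> R) -> R {
        SCRATCH.with(|buf| {
            let mut buf = buf.borrow_mut();
            buf.resize(len, 0.0);
            f(buf.as_mut_slice())
        })
    }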
I've parallelized all the simple cases. I'd like to tackle more, but
the rest requires bigger architectural and algorithmic changes, which
is too much for the first release. Rayon is easy to safely experiment
with, and Rust is very nice to refactor, so I hope to do more later.
Conclusion
I'm ecstatic that I can now support all platforms from Android to
Windows, with the same cargo build command. I won't have to worry
again about the ___aarch64_cas8_acq_rel symbol, vcomp.dll,
compiler-private lib dirs, ignored #pragmas, or #ifdefs for
vendor-specific bugs/features. The next gif.ski will be 2.2 times
faster.
Where/when to get it?
I've released the Rust library as a beta on crates.io. The source is
on GitHub. I haven't published it in pngquant.org tarballs yet,
because I still have some things to finish in the pngquant
executable, and need to update all build/release processes, packages,
documentation, etc.
---------------------------------------------------------------------
Jan 2022, by Kornel.