High-Performance PNG Codec
It's been a while since I wrote about the High-Performance QOI Codec,
which joined the other codecs offered by the Blend2D library in 2024.
The development of image codecs has continued, and now I would like to
announce a new high-performance PNG codec, which is much faster than
other available codecs written in C, C++, and other programming
languages.
Introduction
PNG is not new; it's been with us for decades and it's most likely
one of the most deployed image formats ever, which also means that
many others have focused on improving the quality of PNG codecs. And
indeed, there are many libraries that provide speedy PNG decoding,
encoding, or both:
* libspng - written in C, seems faster than the original libpng
* wuffs - written in the Wuffs language, claims to be "The Fastest,
  Safest PNG Decoder in the World"
* zune-png - written in Rust
We can even find pretty interesting statements such as "Memory-safe
PNG decoders now vastly outperform C PNG libraries", etc... However,
PNG decoding performance is fortunately not about the programming
language!
DEFLATE - The Challenger
The biggest challenge in decoding PNG images is the DEFLATE
compression algorithm, which is used to compress pixel data and is
the only algorithm allowed by the PNG specification (at least as of
2025). DEFLATE has aged and is not well suited to modern CPUs, which
can execute multiple instructions in a single clock cycle in addition
to offering various other features such as out-of-order execution,
branch prediction, etc... In general, everybody who has implemented a
DEFLATE decoder has found a ceiling - at some point it's not possible
to improve it further, because the format is just inherently scalar
and it's almost impossible to use any kind of SIMD to process the
bit-stream.
As a reference, I would like to mention Iguana, which clearly shows
that in order for decoders to utilize SIMD hardware, the layout of
the compressed data must be designed for it. And that's something
unfixable when it comes to DEFLATE, because changing the layout would
make it incompatible.
The Way to High Performance Decoding
In general, high-performance decoding of a DEFLATE stream is about
the following:
* Quickly build decode tables
* Quickly decode bit-stream by using these decode tables
Since the maximum DEFLATE code (both litlen and offset) can have at
most 15 bits, it's not practical to use a single lookup table for all
codes, which is why sub-tables are necessary. Conveniently,
compression is all about statistics, so long codes are much less
likely than short codes, which means that a branch misprediction is
not really a problem when a sub-table is hit, as it won't be hit that
often if the decode tables are sized well. However, the challenge is
to find the optimal size of these tables, because big tables are
slower to build, and smaller tables would mean more sub-table
lookups. It seems that the sweet spot for high-performance DEFLATE
decoding is 11 bits for the litlen table and 9 bits for the offset
table (the 11/9 limit now seems to be widely used across many
decoders), but this is just the beginning when optimizing DEFLATE.
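To make the two-level lookup concrete, here is a minimal C++ sketch.
The field names and layout are made up for illustration; real
decoders, Blend2D included, pack all of this into a single 32-bit
entry:

#include <cstdint>

static constexpr uint32_t kLitLenTableBits = 11; // root table has 2^11 entries.

struct DecodeEntry {
  uint16_t payload; // literal value or base length, or sub-table start.
  uint8_t bits;     // bits consumed by the code (or its root part).
  uint8_t subBits;  // non-zero => 'payload' is the start of a sub-table.
};

inline DecodeEntry lookupLitLen(const DecodeEntry* table, uint64_t bitBuf) {
  DecodeEntry e = table[bitBuf & ((1u << kLitLenTableBits) - 1)];
  if (e.subBits != 0) {
    // A long code (12..15 bits) - statistically rare, so this branch
    // mispredicts rarely when the root table is sized well.
    uint64_t hi = bitBuf >> kLitLenTableBits;
    e = table[e.payload + (hi & ((1u << e.subBits) - 1))];
  }
  return e;
}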
Fast Loop
So, what is the most important part of a DEFLATE decoder? It's the
fast loop! A loop that does bounds checking at the beginning so it
doesn't have to do it while decoding litlen and offset codes. And to
perform multiple iterations of the fast loop, it's better to
precalculate the maximum number of possible iterations and only
recalculate it once that number is exhausted. To make the fast loop
fast, it's important to optimize for latency, which means using
various tricks such as reading bits with zero refill latency.
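As a sketch of that trick (my illustration of the well-known
branchless refill, not Blend2D's actual code, assuming a
little-endian target): the refill always loads 8 bytes, tops the
buffer up to at least 56 valid bits, and advances the pointer by
however many whole bytes were consumed:

#include <cstdint>
#include <cstring>

struct BitReader {
  const uint8_t* ptr; // the fast loop bounds this pointer up front, so
                      // the 8-byte load below never reads out of bounds.
  uint64_t bitBuf;    // pending bits, LSB first.
  uint32_t bitCount;  // number of valid bits in bitBuf.

  void refill() {
    uint64_t next;
    std::memcpy(&next, ptr, 8);  // unconditional unaligned 8-byte load.
    bitBuf |= next << bitCount;  // top up to at least 56 valid bits.
    ptr += (63 - bitCount) >> 3; // advance by the whole bytes consumed.
    bitCount |= 56;              // bitCount is now in 56..63.
  }

  uint64_t peek(uint32_t n) const { return bitBuf & ((uint64_t(1) << n) - 1); }
  void consume(uint32_t n) { bitBuf >>= n; bitCount -= n; }
};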
Big Tables Can Become a Bottleneck
Larger decode tables are fine when there are enough bits to decode;
however, if the bit-stream is small, building large decode tables can
take much more time than actually decoding the bit-stream. The
solution is to make the table bits dynamic and to use fewer bits than
the maximum when possible. Some bigger PNG files are composed of
multiple blocks (even dozens), and each block defines its own Huffman
tables, so table building should be pretty quick.
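The sizing decision itself can be as simple as the following sketch
(hypothetical names); the point is that the longest code in a block
is already known from the code-length histogram before any table is
built:

#include <algorithm>
#include <cstdint>

// Pick the root-table size for a block: never more than the decoder's
// maximum (e.g. 11 bits for litlen), and never more than the longest
// code actually present, so blocks with short codes get a small table
// that is cheap to build.
inline uint32_t chooseTableBits(uint32_t longestCodeInBlock, uint32_t maxTableBits) {
  return std::min(longestCodeInBlock, maxTableBits);
}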
The Right Data Corpus
Now I would like to highlight something. Most DEFLATE implementations
compare performance against a corpus that doesn't represent
compressed PNG data. For example, the Silesia Compression Corpus is
widely used to test implementations of compression and decompression
tools, but it doesn't provide data that matches how PNGs are usually
compressed. This means that tuning a DEFLATE decoder to perform best
on the Silesia Corpus may not be best for decoding PNG files. How far
from optimal the result is can of course vary depending on the input
images.
For reference, here is a small visual comparison of the structure of
a bit-stream that you would see in the Silesia Corpus versus regular
PNG images:
Compressed stream of a file from the Silesia Corpus:
[Beginning of Data]
LITERAL
LITERAL
...
[After Literals]
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
...
Compressed stream of a PNG file (from testing images):
[Beginning of Data]
LITERAL
LITERAL
...
[After Literals]
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LITERAL
LEN+OFFSET
LEN+OFFSET
LEN+OFFSET
LITERAL
LITERAL
LITERAL
LITERAL
LEN+OFFSET
...
The beginning of each bit-stream is usually pretty similar - a lot of
literals, which are used to decode the initial sequence of bytes.
However, after the initial sequence, what comes next varies between
compressed files. When decoding files from the Silesia Corpus,
LEN+OFFSET codes usually prevail, so tuning the decoder to quickly
decode & copy matches is generally what speeds it up. However, PNG
images have many more literals even after the initial sequence,
possibly interleaved with LEN+OFFSET codes, so the decoder is
challenged much more. Based on this information I started thinking
differently - in order to improve performance, I had to find
something others are not doing, especially when it comes to decoding
literals!
A Literal Pair
At the moment, the most used DEFLATE decoders use a 32-bit entry in
the decode table. Such an entry can hold a lot of data, and it's
usually used to store additional metadata such as the base offset,
base length, literal value, the number of extra length/offset bits,
etc... When it comes to literals, such an entry is under-utilized. A
literal is 8 bits, and the other common metadata fits in 8 to 16 bits
(depending on the layout), which leaves us with up to 3 literals per
decode entry.
It turned out that a limit of 3 literals per decode entry was too
much - 11 bits usually won't fit 3 literals, and table building
suffered a significant penalty. However, a pair of literals per entry
is achievable, and many pairs actually fit. So I settled on a maximum
of 2 literals per entry, which seems to be the sweet spot - since the
maximum table size is 11 bits, a single lookup can decode 2 literals
that have at most 11 bits combined, which is likely to happen,
especially in PNG images that use filtering before compression.
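Here is a rough sketch of how such an entry can be consumed
branchlessly - the bit layout below is invented for illustration and
is not necessarily the one Blend2D uses:

#include <cstdint>

// Hypothetical 32-bit litlen entry:
//   bits  0..7  : first literal
//   bits  8..15 : second literal (meaningful only when the pair flag is set)
//   bits 16..19 : number of bit-stream bits consumed by this entry
//   bit  20     : pair flag - the entry decodes two literals at once
inline uint8_t* emitLiterals(uint32_t entry, uint8_t* dst, uint32_t* consumedBits) {
  dst[0] = uint8_t(entry);
  dst[1] = uint8_t(entry >> 8);          // unconditional store; the fast
                                         // loop guarantees output slack.
  *consumedBits = (entry >> 16) & 0xFu;
  return dst + 1 + ((entry >> 20) & 1u); // advance by 1 or 2, branchlessly.
}

The consumed-bits field then feeds the bit-reader's consume() as
usual, so a pair costs the same single lookup as a lone literal.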
Streaming Support
The PNG image format uses chunks, where each chunk has a tag and a
payload. There are many tags that can be stored in a PNG image, but
for the purpose of decoding let's talk about IDAT. It's the chunk
that contains compressed data; however, for some reason PNG allows
the compressed data to be split into multiple IDAT chunks, so it's
not possible to use a DEFLATE decoder that needs all input data to be
contiguous (such as libdeflate). This dictates that in order to
quickly decode PNG images, the DEFLATE decoder must support
streaming.
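To make the constraint concrete, this is what streaming
decompression of IDAT payloads looks like with zlib's push-style API
(a trimmed sketch; idatChunks is a hypothetical container holding the
payload of each IDAT chunk in order):

#include <zlib.h>
#include <vector>

// All IDAT payloads together form one continuous zlib/DEFLATE stream,
// so the decoder keeps feeding input chunk by chunk without any reset.
bool inflateIdat(const std::vector<std::vector<unsigned char>>& idatChunks,
                 std::vector<unsigned char>& out) {
  z_stream strm{};
  if (inflateInit(&strm) != Z_OK)
    return false;

  int ret = Z_OK;
  for (const auto& chunk : idatChunks) {
    strm.next_in = const_cast<unsigned char*>(chunk.data());
    strm.avail_in = static_cast<uInt>(chunk.size());
    while (strm.avail_in != 0 && ret != Z_STREAM_END) {
      unsigned char buf[16384];
      strm.next_out = buf;
      strm.avail_out = sizeof(buf);
      ret = inflate(&strm, Z_NO_FLUSH);
      if (ret != Z_OK && ret != Z_STREAM_END) {
        inflateEnd(&strm);
        return false;
      }
      out.insert(out.end(), buf, buf + (sizeof(buf) - strm.avail_out));
    }
  }
  inflateEnd(&strm);
  return ret == Z_STREAM_END;
}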
Repeating Matches
A match in a DEFLATE stream can have a length greater than its
offset, which means that such a match repeats an existing sequence of
bytes, or even a single byte (when the match offset is 1). This is a
nightmare from an implementer's perspective, because if you don't
want to copy data byte by byte, the algorithm must be able to handle
such repeats. This is actually an ideal place for utilizing SIMD (for
example, VPSHUFB can be used to implement the repeat on X86 hardware,
and ARM has a similar instruction that can be used in the same
context).
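As an illustration of the VPSHUFB trick (a simplified SSE-level
sketch of the general technique, not Blend2D's code): for offsets
smaller than 16, a single shuffle builds one register holding the
repeating pattern, which is then stamped over the match:

#include <cstdint>
#include <immintrin.h> // SSSE3 is enough for _mm_shuffle_epi8.

// Repeat the last 'offset' bytes (offset < 16) over 'length' output
// bytes. The stores advance by the largest multiple of 'offset' that
// fits into 16 bytes, so the pattern phase stays correct. Assumes the
// caller tolerates 16-byte overstores (the fast loop reserves slack).
inline void simdRepeat(uint8_t* dst, uint32_t offset, uint32_t length) {
  uint8_t idx[16];
  for (uint32_t i = 0; i < 16; i++)
    idx[i] = uint8_t(i % offset); // 0,1,...,offset-1,0,1,...

  __m128i src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst - offset));
  __m128i pat = _mm_shuffle_epi8(src, _mm_loadu_si128(reinterpret_cast<const __m128i*>(idx)));

  uint32_t step = (16u / offset) * offset; // keeps the repeat in phase.
  for (uint32_t i = 0; i < length; i += step)
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), pat);
}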
Summary
Based on all of the above, we can draft an ideal pipeline for
decoding a DEFLATE stream that carries PNG data:
* The maximum litlen table is 11 bits, the maximum offset table is 9
  bits
* When building tables, use fewer bits when all codes fit, to avoid
  building large decode tables when decoding small blocks
* When there is enough data, add literal pairs to the litlen decode
  table, and use a decode loop that can take advantage of them
* When there is enough space in the output buffer and enough bytes
  in the input buffer, use a fast loop designed for processing a lot
  of data to avoid unnecessary checks in the hot path
* Using SIMD to repeat sequences of bytes improves performance, but
  branch there so cycles aren't wasted on regular, non-repeating
  matches (that would actually decrease performance)
* The rest is about tuning the code - it's always possible to decode
  multiple literals that appear close to each other, to pre-decode
  entries in advance, etc... This requires experimenting, and usually
  most ideas turn into blunders (making the overall performance worse
  than before, or benefiting a single PNG file while making all
  others slower)
PNG Decoding Performance
Before showing the performance of decoding PNG images, let's first
see where the DEFLATE decoder stands when it comes to the Silesia
Corpus. It's not the corpus that I have focused on; however, it's an
important corpus that is used worldwide for compression benchmarks.
For this purpose I have written a tool that, for the moment, uses
Blend2D's internal API (the DEFLATE decoder is currently internal and
not part of the public API). The results are shown below:
./bl-bench-deflate
File dickens of original size 10192446 compressed to 3736325 (level 8)
[zlib ] dickens : Decompress 20.108 [ms] [177.2 compressed MiB/s]
[libdeflate] dickens : Decompress 7.521 [ms] [473.8 compressed MiB/s]
[blend2d ] dickens : Decompress 7.697 [ms] [462.9 compressed MiB/s]
File mozilla of original size 51220480 compressed to 18618185 (level 8)
[zlib ] mozilla : Decompress 98.561 [ms] [180.1 compressed MiB/s]
[libdeflate] mozilla : Decompress 42.278 [ms] [420.0 compressed MiB/s]
[blend2d ] mozilla : Decompress 45.036 [ms] [394.3 compressed MiB/s]
File mr of original size 9970564 compressed to 3562465 (level 8)
[zlib ] mr : Decompress 17.078 [ms] [198.9 compressed MiB/s]
[libdeflate] mr : Decompress 7.202 [ms] [471.7 compressed MiB/s]
[blend2d ] mr : Decompress 7.799 [ms] [435.6 compressed MiB/s]
File nci of original size 33553445 compressed to 3294801 (level 8)
[zlib ] nci : Decompress 25.200 [ms] [124.7 compressed MiB/s]
[libdeflate] nci : Decompress 8.233 [ms] [381.7 compressed MiB/s]
[blend2d ] nci : Decompress 7.916 [ms] [397.0 compressed MiB/s]
File ooffice of original size 6152192 compressed to 3033058 (level 8)
[zlib ] ooffice : Decompress 16.604 [ms] [174.2 compressed MiB/s]
[libdeflate] ooffice : Decompress 7.268 [ms] [398.0 compressed MiB/s]
[blend2d ] ooffice : Decompress 7.399 [ms] [390.9 compressed MiB/s]
File osdb of original size 10085684 compressed to 3653950 (level 8)
[zlib ] osdb : Decompress 15.924 [ms] [218.8 compressed MiB/s]
[libdeflate] osdb : Decompress 6.000 [ms] [580.8 compressed MiB/s]
[blend2d ] osdb : Decompress 6.160 [ms] [565.6 compressed MiB/s]
File reymont of original size 6627202 compressed to 1759380 (level 8)
[zlib ] reymont : Decompress 10.834 [ms] [154.9 compressed MiB/s]
[libdeflate] reymont : Decompress 3.645 [ms] [460.3 compressed MiB/s]
[blend2d ] reymont : Decompress 3.759 [ms] [446.4 compressed MiB/s]
File samba of original size 21606400 compressed to 5330470 (level 8)
[zlib ] samba : Decompress 29.965 [ms] [169.7 compressed MiB/s]
[libdeflate] samba : Decompress 11.220 [ms] [453.1 compressed MiB/s]
[blend2d ] samba : Decompress 12.131 [ms] [419.0 compressed MiB/s]
File sao of original size 7251944 compressed to 5270353 (level 8)
[zlib ] sao : Decompress 15.699 [ms] [320.2 compressed MiB/s]
[libdeflate] sao : Decompress 8.757 [ms] [574.0 compressed MiB/s]
[blend2d ] sao : Decompress 8.951 [ms] [561.5 compressed MiB/s]
File webster of original size 41458703 compressed to 11993158 (level 8)
[zlib ] webster : Decompress 73.062 [ms] [156.5 compressed MiB/s]
[libdeflate] webster : Decompress 25.807 [ms] [443.2 compressed MiB/s]
[blend2d ] webster : Decompress 26.006 [ms] [439.8 compressed MiB/s]
File x-ray of original size 8474240 compressed to 5922522 (level 8)
[zlib ] x-ray : Decompress 22.894 [ms] [246.7 compressed MiB/s]
[libdeflate] x-ray : Decompress 12.116 [ms] [466.2 compressed MiB/s]
[blend2d ] x-ray : Decompress 14.015 [ms] [403.0 compressed MiB/s]
File xml of original size 5345280 compressed to 701018 (level 8)
[zlib ] xml : Decompress 5.082 [ms] [131.5 compressed MiB/s]
[libdeflate] xml : Decompress 1.623 [ms] [412.0 compressed MiB/s]
[blend2d ] xml : Decompress 1.677 [ms] [398.6 compressed MiB/s]
As can be seen, Blend2D's DEFLATE decompressor is very close to
libdeflate in terms of performance, but it's not faster when decoding
compressed data from the Silesia Corpus. However, let's benchmark how
it performs when used to decode PNG images:
[libpng ] png/harvesters.png : Decode 9.665 [ms]
[wuffs ] png/harvesters.png : Decode 7.984 [ms]
[spng ] png/harvesters.png : Decode 8.807 [ms]
[stbimage] png/harvesters.png : Decode 12.236 [ms]
[blend2d ] png/harvesters.png : Decode 5.169 [ms]
[libpng ] png/hibiscus.primitive.png: Decode 0.803 [ms]
[wuffs ] png/hibiscus.primitive.png: Decode 0.690 [ms]
[spng ] png/hibiscus.primitive.png: Decode 0.647 [ms]
[stbimage] png/hibiscus.primitive.png: Decode 0.684 [ms]
[blend2d ] png/hibiscus.primitive.png: Decode 0.225 [ms]
[libpng ] png/hibiscus.regular.png : Decode 1.627 [ms]
[wuffs ] png/hibiscus.regular.png : Decode 1.195 [ms]
[spng ] png/hibiscus.regular.png : Decode 1.493 [ms]
[stbimage] png/hibiscus.regular.png : Decode 1.631 [ms]
[blend2d ] png/hibiscus.regular.png : Decode 0.679 [ms]
[libpng ] png/medium_rgb8.png : Decode 14.313 [ms]
[wuffs ] png/medium_rgb8.png : Decode 11.275 [ms]
[spng ] png/medium_rgb8.png : Decode 12.811 [ms]
[stbimage] png/medium_rgb8.png : Decode 16.698 [ms]
[blend2d ] png/medium_rgb8.png : Decode 6.148 [ms]
[libpng ] png/medium_rgba8.png : Decode 19.539 [ms]
[wuffs ] png/medium_rgba8.png : Decode 19.882 [ms]
[spng ] png/medium_rgba8.png : Decode 18.067 [ms]
[stbimage] png/medium_rgba8.png : Decode 22.840 [ms]
[blend2d ] png/medium_rgba8.png : Decode 9.288 [ms]
[libpng ] png/large_rgb8.png : Decode 92.882 [ms]
[wuffs ] png/large_rgb8.png : Decode 74.817 [ms]
[spng ] png/large_rgb8.png : Decode 83.422 [ms]
[stbimage] png/large_rgb8.png : Decode 95.808 [ms]
[blend2d ] png/large_rgb8.png : Decode 41.997 [ms]
[libpng ] png/large_rgba8.png : Decode 110.803 [ms]
[wuffs ] png/large_rgba8.png : Decode 114.620 [ms]
[spng ] png/large_rgba8.png : Decode 102.576 [ms]
[stbimage] png/large_rgba8.png : Decode 122.232 [ms]
[blend2d ] png/large_rgba8.png : Decode 52.625 [ms]
[libpng ] png/large_palette.png : Decode 10.040 [ms]
[wuffs ] png/large_palette.png : Decode 4.626 [ms]
[spng ] png/large_palette.png : Decode 8.586 [ms]
[stbimage] png/large_palette.png : Decode 5.328 [ms]
[blend2d ] png/large_palette.png : Decode 2.944 [ms]
[libpng ] qoi/dice.png : Decode 3.081 [ms]
[wuffs ] qoi/dice.png : Decode 2.719 [ms]
[spng ] qoi/dice.png : Decode 2.559 [ms]
[stbimage] qoi/dice.png : Decode 3.146 [ms]
[blend2d ] qoi/dice.png : Decode 1.191 [ms]
[libpng ] qoi/edgecase.png : Decode 0.084 [ms] (20x libpng warning: iCCP: known incorrect sRGB profile)
[wuffs ] qoi/edgecase.png : Decode 0.016 [ms]
[spng ] qoi/edgecase.png : Decode 0.069 [ms]
[stbimage] qoi/edgecase.png : Decode 0.023 [ms]
[blend2d ] qoi/edgecase.png : Decode 0.030 [ms]
[libpng ] qoi/kodim10.png : Decode 4.267 [ms]
[wuffs ] qoi/kodim10.png : Decode 2.853 [ms]
[spng ] qoi/kodim10.png : Decode 3.767 [ms]
[stbimage] qoi/kodim10.png : Decode 3.995 [ms]
[blend2d ] qoi/kodim10.png : Decode 1.738 [ms]
[libpng ] qoi/kodim23.png : Decode 3.828 [ms]
[wuffs ] qoi/kodim23.png : Decode 2.650 [ms]
[spng ] qoi/kodim23.png : Decode 3.421 [ms]
[stbimage] qoi/kodim23.png : Decode 3.737 [ms]
[blend2d ] qoi/kodim23.png : Decode 1.775 [ms]
[libpng ] qoi/qoi_logo.png : Decode 0.336 [ms]
[wuffs ] qoi/qoi_logo.png : Decode 0.311 [ms]
[spng ] qoi/qoi_logo.png : Decode 0.201 [ms]
[stbimage] qoi/qoi_logo.png : Decode 0.146 [ms]
[blend2d ] qoi/qoi_logo.png : Decode 0.072 [ms]
[libpng ] qoi/testcard.png : Decode 0.197 [ms]
[wuffs ] qoi/testcard.png : Decode 0.220 [ms]
[spng ] qoi/testcard.png : Decode 0.114 [ms]
[stbimage] qoi/testcard.png : Decode 0.122 [ms]
[blend2d ] qoi/testcard.png : Decode 0.056 [ms]
[libpng ] qoi/testcard_rgba.png : Decode 0.210 [ms]
[wuffs ] qoi/testcard_rgba.png : Decode 0.225 [ms]
[spng ] qoi/testcard_rgba.png : Decode 0.148 [ms]
[stbimage] qoi/testcard_rgba.png : Decode 0.134 [ms]
[blend2d ] qoi/testcard_rgba.png : Decode 0.060 [ms]
[libpng ] qoi/wikipedia_008.png : Decode 9.633 [ms]
[wuffs ] qoi/wikipedia_008.png : Decode 7.591 [ms]
[spng ] qoi/wikipedia_008.png : Decode 8.237 [ms]
[stbimage] qoi/wikipedia_008.png : Decode 9.841 [ms]
[blend2d ] qoi/wikipedia_008.png : Decode 3.220 [ms]
And here is a comparison of the new code against the previous PNG
decoder (Blend2D only):
[bl-old] png/harvesters.png : Decode 10.584 [ms]
[bl-new] png/harvesters.png : Decode 5.169 [ms]
[bl-old] png/hibiscus.primitive.png: Decode 0.517 [ms]
[bl-new] png/hibiscus.primitive.png: Decode 0.225 [ms]
[bl-old] png/hibiscus.regular.png : Decode 1.417 [ms]
[bl-new] png/hibiscus.regular.png : Decode 0.679 [ms]
[bl-old] png/medium_rgb8.png : Decode 12.716 [ms]
[bl-new] png/medium_rgb8.png : Decode 6.148 [ms]
[bl-old] png/medium_rgba8.png : Decode 17.443 [ms]
[bl-new] png/medium_rgba8.png : Decode 9.288 [ms]
[bl-old] png/large_rgb8.png : Decode 78.050 [ms]
[bl-new] png/large_rgb8.png : Decode 41.997 [ms]
[bl-old] png/large_rgba8.png : Decode 93.093 [ms]
[bl-new] png/large_rgba8.png : Decode 52.625 [ms]
[bl-old] png/large_palette.png : Decode 5.550 [ms]
[bl-new] png/large_palette.png : Decode 2.944 [ms]
[bl-old] qoi/dice.png : Decode 2.406 [ms]
[bl-new] qoi/dice.png : Decode 1.191 [ms]
[bl-old] qoi/edgecase.png : Decode 0.026 [ms]
[bl-new] qoi/edgecase.png : Decode 0.030 [ms]
[bl-old] qoi/kodim10.png : Decode 3.262 [ms]
[bl-new] qoi/kodim10.png : Decode 1.738 [ms]
[bl-old] qoi/kodim23.png : Decode 3.236 [ms]
[bl-new] qoi/kodim23.png : Decode 1.775 [ms]
[bl-old] qoi/qoi_logo.png : Decode 0.218 [ms]
[bl-new] qoi/qoi_logo.png : Decode 0.072 [ms]
[bl-old] qoi/testcard.png : Decode 0.152 [ms]
[bl-new] qoi/testcard.png : Decode 0.056 [ms]
[bl-old] qoi/testcard_rgba.png : Decode 0.160 [ms]
[bl-new] qoi/testcard_rgba.png : Decode 0.060 [ms]
[bl-old] qoi/wikipedia_008.png : Decode 9.255 [ms]
[bl-new] qoi/wikipedia_008.png : Decode 3.220 [ms]
As can be seen above, the new PNG decoder is significantly faster
than the previous one, and in this case the DEFLATE decompressor
takes all the credit. We can use the perf tool to compare where most
cycles were spent before and where they are spent now:
Blend2D (old DEFLATE):
73.19% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::DeflateDecoder::_decode()
10.42% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<3u>()
7.88% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<4u>()
2.04% bl-bench-png [kernel.kallsyms] [k] clear_page_erms
1.71% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::blDeflateBuildTable()
1.16% bl-bench-png bl-bench-png [.] bl_convert_premultiply_8888_leading_alpha_shufb_avx2()
0.80% bl-bench-png bl-bench-png [.] bl_convert_rgb32_from_rgb24_shufb_avx2()
0.41% bl-bench-png bl-bench-png [.] bl_convert_any_from_indexed8()
Blend2D (new DEFLATE):
46.47% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::Fast::decode_AVX2()
17.86% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<3u>()
13.44% bl-bench-png bl-bench-png [.] bl::Png::Ops::inverseFilterSimdImpl<4u>()
6.15% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::buildFastTable()
3.79% bl-bench-png [kernel.kallsyms] [k] clear_page_erms
2.54% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::Decoder::decode()
1.97% bl-bench-png bl-bench-png [.] bl::Compression::Deflate::buildDecodeTable()
1.79% bl-bench-png bl-bench-png [.] bl_convert_premultiply_8888_leading_alpha_shufb_avx2()
1.19% bl-bench-png bl-bench-png [.] bl_convert_rgb32_from_rgb24_shufb_avx2()
0.67% bl-bench-png bl-bench-png [.] bl_convert_any_from_indexed8()
DEFLATE decompression still burns around ~57% of cycles overall, but
that's significantly better than the ~75% it used to burn. And since
one part of the whole pipeline is now more optimized, other parts
start showing up in the profile as well (for example, it could be
worth further optimizing the inverse PNG filter, which now burns ~30%
of cycles).
One additional thing to mention is the Deflate::buildFastTable()
function, which enhances the initial decode table with literal pairs,
where applicable. This is actually a nice example of a trade-off:
building such a table means burning cycles, but it can pay off if
there are enough literals to decode. However, if there are not enough
literals, those cycles are simply wasted.
PNG Decoding Performance compared with QOI
The reference implementation of the QOI decoder & encoder comes with
a tool called qoibench. It's easy to compile and easy to add more
codecs to, which is something I was interested in. So I added a
Blend2D codec to it and ran the benchmark, and to my surprise,
Blend2D's PNG decoding is in many cases even faster than the QOI
reference decoder (though never faster than Blend2D's optimized QOI
codec). Here are some grand totals as reported by qoibench when run
on the QOI benchmark suite, which can be downloaded here:
# Grand total for qoi_benchmark_suite/images/icon_64
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 0.0 0.2 137.65 17.46 3 23.6%
blend2d-png: 0.0 0.1 311.06 73.63 4 25.3%
stbi: 0.0 0.2 153.62 19.97 4 27.9%
qoi: 0.0 0.0 482.51 392.37 4 28.7%
blend2d-qoi: 0.0 0.0 868.35 652.13 4 28.7%
# Grand total for qoi_benchmark_suite/images/icon_512
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 1.4 10.1 181.90 26.01 51 5.0%
blend2d-png: 0.4 2.8 693.39 93.27 49 4.9%
stbi: 0.9 7.2 296.64 36.39 69 6.8%
qoi: 0.4 0.7 666.37 398.50 85 8.4%
blend2d-qoi: 0.2 0.3 1331.57 937.25 85 8.4%
# Grand total for qoi_benchmark_suite/images/photo_kodak
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 4.1 72.9 97.08 5.40 648 56.3%
blend2d-png: 1.7 10.5 233.00 37.27 697 60.6%
stbi: 4.5 34.4 86.66 11.43 873 75.8%
qoi: 1.9 2.2 206.47 179.57 671 58.3%
blend2d-qoi: 1.0 1.4 381.22 282.69 671 58.3%
# Grand total for qoi_benchmark_suite/images/textures_photo
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 10.2 165.1 102.41 6.35 1836 59.8%
blend2d-png: 5.3 25.0 197.11 41.94 2168 70.6%
stbi: 12.5 82.6 84.21 12.69 2334 76.0%
qoi: 4.7 5.1 220.91 207.54 1981 64.5%
blend2d-qoi: 2.5 3.5 414.78 301.93 1981 64.5%
# Grand total for qoi_benchmark_suite/images/screenshot_web
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 29.9 326.6 271.32 24.88 2402 7.6%
blend2d-png: 10.6 88.7 767.99 91.65 2619 8.3%
stbi: 23.7 267.2 343.15 30.41 3076 9.7%
qoi: 12.8 21.8 632.90 373.04 2649 8.3%
blend2d-qoi: 6.9 9.8 1169.91 828.74 2649 8.3%
Further Ideas
I honestly think that DEFLATE is not worth optimizing much further,
because it simply wasn't designed for SIMD. The decoding is a
completely serial process: only after a bunch of bits has been
decoded can you decode more. And here comes the latency - table
lookups are scalar and depend on each other, so it's virtually
impossible to be clever about it and implement something
significantly better than what's available today without bringing
completely new ideas to the table.
One idea I had was to use the AVX-512 7-bit table lookup to decode
the bit-stream in parallel (speculatively). Basically, if we only use
8 bits from the decode table, we can decode these 8 bits and the 63
consecutive combinations that follow (each combination shifted by one
bit compared to the initial lookup) with a pair of VPERMI2B/VPERMT2B
instructions (this requires an X86_64 CPU with the AVX512_VBMI
extension, so bye bye Intel!). This could possibly scale up to more
bits (9 would be achievable) at the cost of more lookups - thankfully
these lookups could be done in parallel and combined later to select
the matching ones. However, I didn't proceed with this idea, as I'm
not sure I want to burn my time, which is very scarce now. Still, it
would replace memory loads with permutations that are done completely
within ZMM registers, so theoretically the latency could improve a
lot, provided that the decoding doesn't exceed the limitation of this
lookup (8 or 9 bits) too often.
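For reference, the building block would look like this (a tiny
illustration of the 7-bit two-register permute, not an implementation
of the speculative decoder):

#include <immintrin.h> // requires AVX512_VBMI (compile with -mavx512vbmi).

// 64 independent 7-bit table lookups in a single instruction: each byte
// of 'indices' selects one of the 128 bytes held in tableLo:tableHi.
// A speculative decoder would fill 'indices' with the bit-stream window
// sampled at 64 consecutive bit offsets and, once the previous symbol's
// length is known, select the lookup that starts at the right offset.
inline __m512i lookup7bitX64(__m512i tableLo, __m512i tableHi, __m512i indices) {
  return _mm512_permutex2var_epi8(tableLo, indices, tableHi);
}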
Conclusion
Blend2D's PNG decoder is probably one of the fastest on the market
based on the benchmarks shown in this post, and it's completely free
and open-source! Since both zune-image and image-rs are roughly
comparable to wuffs (at least according to this post), it's also much
faster than decoders written in other programming languages, but of
course independent verification would be great.
Memory safety doesn't play any role in performance in this case,
because we are talking about a completely scalar algorithm, where
lower latency translates into higher throughput. The heart of every
optimized DEFLATE decoder is always a "fast" loop, which ensures that
ALL memory accesses within its body are safe, which again translates
into lower latency and thus higher throughput.
I would love to see these ideas further refined and perfected. I
would be very interested in a working AVX512_VBMI implementation as
mentioned in the Further Ideas section.