[HN Gopher] Computing Adler32 Checksums at 41 GB/s
___________________________________________________________________
Computing Adler32 Checksums at 41 GB/s
Author : wooosh
Score : 66 points
Date : 2022-08-07 16:17 UTC (6 hours ago)
(HTM) web link (wooo.sh)
(TXT) w3m dump (wooo.sh)
| pizza wrote:
| Ooh now that is very interesting. I would really love to see how
| this speeds up the run-time of fpng as a whole, if you have any
| numbers. It looks like fjxl [0] and fpnge [1] (which also uses
| AVX2) are at the Pareto front for lossless image compression
| right now [2], but if this speeds things up significantly then it's
| possible there'll be a huge shakeup!
|
| [0] https://github.com/libjxl/libjxl/tree/main/experimental/fast...
|
| [1] https://github.com/veluca93/fpnge
|
| [2] https://twitter.com/richgel999/status/1485976101692358656
| bob1029 wrote:
| If image encode/decode speed is the _only_ concern,
| libjpeg-turbo is going to be orders of magnitude faster than any
| of these lossless schemes. With JPEG, you could encode 1080p
| bitmaps in <10 milliseconds (per thread) on any consumer PC
| made in the last decade.
|
| The frequency domain is a really powerful place to operate in
| when you are dealing with this amount of data.
| pizza wrote:
| That's not true. libjpeg-turbo is ~50 MB/s last I tried -
| plus it's not lossless. fjxl and fpnge are basically an order
| of magnitude faster than that. libjpeg-turbo isn't even the
| fastest JPEG codec - you should check out the (relatively
| obscure) libmango - roughly 1 gbps decode on a 2020 MacBook
| Pro - or nvJPEG for GPU-based JPEG decoding. Supposedly
| there are even faster GPU-based decoders than nvJPEG, too.
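The two figures quoted so far (1080p in under 10 ms versus roughly 50 MB/s) can be put on the same scale with a little arithmetic. The sketch below assumes an uncompressed 1920x1080 frame at 3 bytes per pixel and 1 MB = 10^6 bytes; neither assumption is stated in the thread, and the helper names are made up for illustration.

```python
# Assumed frame geometry: 1080p, 24-bit RGB (not stated by either commenter).
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 3
FRAME_BYTES = WIDTH * HEIGHT * BYTES_PER_PIXEL


def required_throughput_mb_s(frame_bytes: int, seconds: float) -> float:
    """Throughput (MB/s, 1 MB = 10**6 bytes) needed to process one frame in `seconds`."""
    return frame_bytes / seconds / 1e6


def time_per_frame_ms(frame_bytes: int, mb_per_s: float) -> float:
    """Milliseconds needed to process one frame at a given throughput in MB/s."""
    return frame_bytes / (mb_per_s * 1e6) * 1000


# Encoding one frame in 10 ms implies about 622 MB/s of raw pixel data.
implied_rate = required_throughput_mb_s(FRAME_BYTES, 0.010)

# At 50 MB/s, the same frame would take about 124 ms.
implied_time = time_per_frame_ms(FRAME_BYTES, 50.0)

print(f"{implied_rate:.1f} MB/s, {implied_time:.1f} ms")
```

Under these assumptions the two claims differ by roughly a factor of twelve, which is the gap the commenters are arguing about.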
| bob1029 wrote:
| > GPU-based
|
| How does this impact the overall latency of encoding a
| single image?
| pizza wrote:
| Probably quite a bit, I don't know. The typical use case
| is to load up thousands of JPEGs at once to get good
| throughput despite copy overhead. You can see here the
| benchmark against jpeg-turbo:
| https://developer.nvidia.com/blog/leveraging-hardware-jpeg-d...
| averne_ wrote:
| I've written an open-source driver for the decoding side
| of the nvjpg module found in the Tegra X1 (i.e. an earlier
| hardware revision than the one in the A100).
|
| I did some quick benchmarks against libjpeg-turbo, if
| that can give you an idea. I expect encoding performance
| would be similar.
|
| https://github.com/averne/oss-nvjpg#performance
| wooosh wrote:
| Unfortunately I haven't had the time to do a proper benchmark,
| and the fpng test executable only decodes/encodes a single
| image which produces very noisy/inconclusive results. However,
| I'm under the impression that it doesn't make a large
| difference in terms of overall time.
|
| fpnge (which I wasn't aware of until now) appears to already be
| using a very similar (identical?) algorithm, so I suspect the
| relative performance of fpng and fpnge would not be
| significantly impacted by this change.
| Nyan wrote:
| As someone who has been recently optimising fpnge, Adler32
| computation is pretty much negligible regarding overall
| runtime. The Huffman coding and filter search take up most of
| the time. (IIRC fpng doesn't do any filter search, but
| Huffman encoding isn't vectorized, so I'd expect that to
| dominate fpng's runtime)
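For readers who have not seen the checksum itself, it is small enough to state in a few lines. What follows is a minimal scalar sketch based on the definition in RFC 1950 (two running sums modulo 65521). It is not the AVX2 code from the article or from fpnge, and the name `adler32_scalar` is made up for illustration.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16, as specified in RFC 1950


def adler32_scalar(data: bytes) -> int:
    """Byte-at-a-time Adler-32: a sums the bytes, b sums the running values of a."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    # The checksum packs b into the high 16 bits and a into the low 16 bits.
    return (b << 16) | a


# Cross-check against the standard library's implementation.
sample = b"Computing Adler32 Checksums"
assert adler32_scalar(sample) == zlib.adler32(sample)
```

Fast implementations typically avoid the per-byte modulo by deferring the reduction across a block of input and accumulating several bytes per instruction; the sketch above only pins down what the result has to be.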
| dougall wrote:
| Nice! (I've been meaning to write up this Apple M1 ~60GB/s
| version, which I think is similar:
| https://gist.github.com/dougallj/66151f1c509484a42fe0abd0d84... )
| jiggawatts wrote:
| I hope this brilliant work has been merged into the relevant open
| source libraries.
|
| Something that's unfair about the world is that work like this
| could reach billions of people and save a million dollars worth
| of time and electricity annually but is being done gratis.
|
| It would be amazing if there were charities that rewarded high-
| impact open source contributions like this proportionally to the
| benefits to humanity...
| daniel-cussen wrote:
| I love this kind of writeup. This is my idea of fun: speedups.
| TAForObvReasons wrote:
| While micro-optimizations are interesting, there are two
| questions left unanswered:
|
| - Does this change noticeably affect the total runtime? The
| checksum seems simple enough that the slight difference here
| wouldn't show up in PNG benchmarks.
|
| - The proposed solution uses AVX2, which is not currently used in
| the original codebase. Would any other part of the processing
| benefit from using newer instructions?
| londons_explore wrote:
| If checksum calculation were a substantial portion of image
| decoding, I think that would be a strong case for simply not
| checking the checksum.
|
| If you put corrupted data into a PNG decoder, I don't think
| it's awfully important to most users whether they get a decode
| error or a garbled image out.
| wooosh wrote:
| This was actually considered, and other libraries do ignore
| checksums, or at least have options to:
|
| https://github.com/richgel999/fpng/issues/9
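To make the "option to ignore checksums" idea concrete, here is a hedged sketch of what such a switch could look like for a zlib stream, which per RFC 1950 ends with the big-endian Adler-32 of the uncompressed data. It uses Python's standard `zlib` module rather than fpng's API; the function `inflate_zlib` and its `verify` flag are hypothetical, and the code assumes a stream with the plain 2-byte header (no preset dictionary).

```python
import struct
import zlib


def inflate_zlib(stream: bytes, verify: bool = True) -> bytes:
    """Decompress a zlib stream, optionally skipping the Adler-32 trailer check."""
    # Assumed layout: 2-byte header, raw deflate body, 4-byte big-endian Adler-32.
    body = stream[2:-4]
    (stored,) = struct.unpack(">I", stream[-4:])
    # Negative wbits selects raw deflate, so zlib itself checks no trailer.
    out = zlib.decompress(body, wbits=-15)
    if verify and zlib.adler32(out) != stored:
        raise ValueError("Adler-32 mismatch")
    return out


payload = b"pixel data " * 1000
stream = zlib.compress(payload)

# Flip the last trailer byte to simulate a corrupted checksum.
corrupted = stream[:-1] + bytes([stream[-1] ^ 0xFF])

assert inflate_zlib(stream) == payload
assert inflate_zlib(corrupted, verify=False) == payload
```

With `verify=True` the corrupted stream raises `ValueError`; with `verify=False` the caller gets the decoded bytes and takes on the risk of not noticing damaged input.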
___________________________________________________________________
(page generated 2022-08-07 23:00 UTC)