[HN Gopher] Faster neural networks straight from JPEG (2018)
___________________________________________________________________
Faster neural networks straight from JPEG (2018)
Author : Anon84
Score : 178 points
Date : 2023-07-13 14:51 UTC (8 hours ago)
(HTM) web link (www.uber.com)
(TXT) w3m dump (www.uber.com)
| quantumstar4k wrote:
| [dead]
| wodenokoto wrote:
| That seems just in line with an earlier front-page submission
| that used kNN and gzipped text for categorization.
| watersb wrote:
| IIRC, "neural network" style systems design started in the realm
| of computer vision with "Perceptrons":
|
| https://en.wikipedia.org/wiki/Perceptrons_(book)
|
| Perceptrons: https://g.co/kgs/8Un4eW
|
| Makes sense that image processing would be a good fit in some
| cases.
| mochomocha wrote:
| I was doing the same thing at Netflix around the same time as a
| 20% research project. Training GANs end2end directly in JPEG
| coeffs space (and then rebuilding a JPEG from the generated coeffs
| using libjpeg to get an image). The pitch was that it not only
| worked, but you could get fast training by representing each JPEG
| block as a dense + sparse vector (dense for the low DCT coeffs,
| sparse for the high ones since they're ~all zeros) and using a
| neural network library with fast ops on sparse data.
|
| Training on pixels is inefficient. Why have your first layers of
| CNNs relearn what's already smartly encoded in the JPEG bits in
| the first place before it's blown into a bloated height x width x
| 3 float matrix?
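
A minimal numpy sketch of the dense + sparse block representation described above; the split point of 16 coefficients and the helper names are assumptions for illustration, not from the original post:

    import numpy as np

    # Zigzag ordering of an 8x8 block: low frequencies first, high frequencies last.
    def zigzag_indices(n=8):
        order = sorted(((i, j) for i in range(n) for j in range(n)),
                       key=lambda p: (p[0] + p[1],
                                      p[0] if (p[0] + p[1]) % 2 else p[1]))
        return np.array(order)

    def split_block(dct_block, n_dense=16):
        # Dense vector for the low-frequency coefficients (mostly non-zero after
        # quantization), sparse (index, value) pairs for the rest (mostly zero).
        idx = zigzag_indices()
        coeffs = dct_block[idx[:, 0], idx[:, 1]]
        dense = coeffs[:n_dense]
        rest = coeffs[n_dense:]
        nz = np.nonzero(rest)[0]
        return dense, (nz + n_dense, rest[nz])

    block = np.zeros((8, 8))
    block[:2, :2] = [[-26, -3], [1, 2]]   # typical quantized block: energy in the corner
    dense, (sparse_idx, sparse_val) = split_block(block)
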
| sharemywin wrote:
| I wondered a while back about using EGA or CGA for doing image
| recognition and/or for stable diffusion. Seems like there
| should be more than enough image data for that resolution and
| color depth (16 colors).
| 23B1 wrote:
| This is really interesting. Have you published your research
| anywhere that I could read?
| vladimirralev wrote:
| In naive ML scenarios you are right. You can think of JPEG as
| an input embedding. One of many. The JPEG/spectral embedding is
| useful because it already provides a miniature variational
| encoding that "makes sense" in terms of translation, sharpness,
| color, scale and texture.
|
| But with clever ML you can design better variational
| characteristics such as rotation, or nonlinear things like faces,
| fingers, projections and abstract objects.
|
| Further, JPEG encoding/decoding will be an obstacle for many
| architectures that require gradients going back and forth
| between pixel space and JPEG space in order to do evaluation
| steps and loss functions based on the pixel space (which would
| be superior). Not to mention if you need human feedback in
| generative scenarios to retouch the output and run training
| steps on the changed pixels.
|
| And finally, there are already picture and video embeddings
| that are gradient-friendly and reusable.
| luckystarr wrote:
| I remember hearing somewhere that our retina encodes the
| visuals it receives into a compressed signal before forwarding
| it to the visual cortex. If true, this may actually be how it's
| done "for real". ;)
| johntb86 wrote:
| I would worry that the fixed, non-overlapping block nature of a
| JPEG would reduce translation invariance - shift an image by 4
| pixels and the DCT coefficients may look very different. People
| have been doing a lot of work to try to reduce the dependence
| of the image on the actual pixel coordinates - see for example
| https://research.nvidia.com/publication/2021-12_alias-free-g...
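
A tiny numerical illustration of the block-alignment sensitivity mentioned above; this is just a sketch using SciPy's dctn, and the specific image and crop offsets are assumptions chosen for demonstration:

    import numpy as np
    from scipy.fft import dctn

    # A vertical edge, and the same edge after the 8x8 block grid shifts by 4 px:
    # the content is "the same edge", but the DCT coefficients differ a lot.
    img = np.zeros((8, 64))
    img[:, 10:] = 255.0

    block         = img[:, 4:12]    # edge lands at column 6 of this block
    block_shifted = img[:, 8:16]    # crop 4 px later: the edge now sits at column 2

    c1 = dctn(block, norm='ortho')
    c2 = dctn(block_shifted, norm='ortho')
    print(np.round(np.abs(c1 - c2)[0, :4]))  # even low-frequency coefficients change
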
| andai wrote:
| Does JPEG-2000 fix that? From what I gathered, it doesn't use
| blocks.
| mochomocha wrote:
| JPEG-2000 uses wavelets as a decomposition basis as opposed
| to DCT, which in theory makes it possible to treat the whole
| image as a single block while ensuring high compression. In
| practice, though, tiles are used, I would guess to improve
| memory and compute parallelism.
| DougBTX wrote:
| On the other hand, ViT uses non-overlapping patches anyway,
| so the impact may be minor. Example code:
| https://nn.labml.ai/transformers/vit/index.html
| corysama wrote:
| As an AI armchair quarterback, I've always held the opinion
| that the image ML space has a counter-productive bias towards
| not pre-processing images stemming from a mixture of test
| purity and academic hubris. "Look what this method can learn
| from raw data completely independently!" makes for a nice
| paper. So, they stick with sRGB inputs rather than doing basic
| classic transforms like converting to YUV420. Everyone learns
| from the papers, so that's assumed to be the standard practice.
| CodesInChaos wrote:
| Typical CNNs can learn a linear transformation of the input
| at no cost. Since YUV is such a linear transformation of RGB,
| there is no benefit in converting to it beforehand.
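
To make the "linear transformation" point concrete: JPEG's RGB -> YCbCr (BT.601) conversion is an affine map, so a first layer with a bias can represent or undo it exactly. The coefficients below are the standard ones; framing it as a 1x1 convolution is just an illustration of the comment above:

    import numpy as np

    # RGB -> YCbCr as a fixed 3x3 matrix plus an offset (BT.601, JPEG variant).
    M = np.array([[ 0.299,     0.587,     0.114   ],
                  [-0.168736, -0.331264,  0.5     ],
                  [ 0.5,      -0.418688, -0.081312]])
    offset = np.array([0.0, 128.0, 128.0])

    rgb = np.array([200.0, 120.0, 30.0])   # one pixel
    ycbcr = M @ rgb + offset               # exactly what a 1x1 conv with bias can learn
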
| bob1029 wrote:
| How is there not a cost associated with forcing the machine
| to learn how to do something that we already have a simple,
| deterministic algorithm for? Won't some engineer need to
| double check a few things with regard to the AI's idea of
| color space transform?
| uoaei wrote:
| Your instincts are correct. Training is faster, more
| stable, and more efficient that way. In certain cases it
| "pretty much is irrelevant" but the advantages of the
| strategy of modelling the knowns and training only on the
| unknowns become starkly apparent when doing e.g. sensor
| fusion or other ML tasks on physical systems.
| CodesInChaos wrote:
| You could probably derive some smart initialization for
| the first layer of a NN based on domain knowledge (color
| spaces, Sobel filters, etc.). But since this is such a
| small part of what the NN has to learn, I expect this to
| result in a small improvement in training time and have
| no effect on final performance and accuracy, so it's
| unlikely to be worth the complexity of developing such a
| feature.
| whimsicalism wrote:
| Absolutely this.
|
| Seems like on HN people are still learning 'the bitter
| lesson'.
| potatoman22 wrote:
| That's something that surprises me too, given the
| preprocessing applied to NLP.
| CodesInChaos wrote:
| Short text representations (via good tokenization)
| significantly reduce the computational cost of a
| transformer (need to generate fewer tokens for the same
| output length, and need fewer tokens to represent the same
| window size). I think these combine to n^3 scaling (n^2
| from window size and n from output size).
|
| For images it's not clear to me if there are any
| preprocessing methods that do a lot better than resizing
| the image to a smaller resolution (which is commonly done
| already).
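
A back-of-the-envelope sketch of the n^3 argument above; the sequence lengths are made-up numbers, purely for illustration:

    # Shrinking the token sequence for the same text by a factor k cuts attention
    # cost per step by ~k^2 and the number of generation steps by ~k, so ~k^3 total.
    def relative_generation_cost(seq_len):
        return seq_len ** 2 * seq_len      # (attention per step) x (number of steps)

    chars, bpe_tokens = 4000, 1000         # hypothetical lengths for the same text
    print(relative_generation_cost(chars) / relative_generation_cost(bpe_tokens))  # 64.0
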
| thomasahle wrote:
| We used to do a lot more preprocessing to NLP, like
| stemming, removing stop words, or even adding grammar
| information (NP, VP, etc.). Now we just do basic
| tokenization. The rest turned out to be irrelevant or even
| counterproductive.
| kbelder wrote:
| But also, that basic tokenization is essential; training
| it on a raw ASCII stream would be much less efficient.
| There is a sweet spot of processing & abstraction that
| should be aimed for.
| fxtentacle wrote:
| In my experience, staying close to the storage format is very
| useful because it allows the neural network to correctly deal
| with clipped/saturated values. If your file is saved in sRGB
| and you train in sRGB, then when something turns to 0 or 255,
| the AI can handle it as a special case because most likely it
| was too bright or too dark for your sensor to capture
| accurately. If you first transform to a different color
| space, that clear clip boundary gets lost.
|
| Also, I would prefer sRGB or RGB because it more closely
| matches the human visual system. That said, the RGB to YUV
| transformation is effectively a matrix multiplication, so if
| you use conv features like everyone else then you can merge it
| into your weights for free.
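
A sketch of the "merge it into your weights for free" point in PyTorch; the stem shape and the full-range BT.601 matrix here are assumptions for illustration, not anyone's actual model:

    import torch
    import torch.nn as nn

    # Full-range BT.601 RGB -> YUV as a fixed 3x3 matrix (no offset).
    M = torch.tensor([[ 0.299,    0.587,    0.114  ],
                      [-0.14713, -0.28886,  0.436  ],
                      [ 0.615,   -0.51499, -0.10001]])

    conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)  # e.g. a ResNet-style stem

    # conv(yuv) == folded(rgb) when W'[o, i] = sum_c W[o, c] * M[c, i]:
    folded = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        folded.weight.copy_(torch.einsum('ocxy,ci->oixy', conv.weight, M))

    rgb = torch.rand(1, 3, 32, 32)
    yuv = torch.einsum('ci,nihw->nchw', M, rgb)
    assert torch.allclose(conv(yuv), folded(rgb), atol=1e-4)
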
| whimsicalism wrote:
| In deep ML, people are pretty familiar with the bitter lesson
| and don't want to waste time on this.
| Llamamoe wrote:
| The bitter lesson is about not trying to encode impossible-
| to-formalize conceptual knowledge, not about avoiding data
| efficiency and scaling the model up to ever higher
| parameter counts.
|
| If we followed this logic, we'd be training LLMs on
| character-level UTF-32 and just letting them figure
| everything out by themselves, while needing two orders of
| magnitude bigger contexts and parameter counts.
| whimsicalism wrote:
| Converting from RGB to YUV is absolutely subject to the
| bitter lesson because it is trying to generalize from a
| representation that we have seen works for some classical
| methods and hard code that knowledge in to the AI which
| could easily learn (and will anyways) a more useful
| representation for itself.
|
| > LLMs on character-level UTF-32 and just letting it
| figure everything out by itself, while needing two orders
| of magnitude bigger contexts and parameter counts.
|
| This was tried extensively and honestly it is probably
| _still_ too early to proclaim the demise of this
| approach. It's also completely different - you're
| conflating a representation that literally changes the
| number of forward passes you have to do (i.e. the amount
| of computation - what the bitter lesson is about) vs. one
| that (at most) would just require stacking on a few
| layers or so.
|
| A better example for your point (imo) would be audio
| recognition, where we pre-transform from wave amplitudes
| into log mel spectrogram for ingestion by the model. I
| think this will ultimately fall to the bitter lesson as
| well though.
|
| Also a key difference is that you are proposing to take
| methods that _already work_ and try to inject more
| classical knowledge into them. It is oftentimes the case
| that you'll have an intermediate fusion between deep +
| classical, but not if you already have working fully deep
| methods.
| mhh__ wrote:
| It's also probably because a lot of knowledge about how JPEG
| works is tied up in signal processing books that usually
| front-load a bunch of mathematics, as opposed to ML, which
| often needs a bit of mathematical intuition but in practice
| is usually empirical.
| andai wrote:
| I was trying to learn how the discrete cosine transform
| works, so I looked up some code in an open source program.
| The code said it copied it verbatim from a book from the
| 90s. I looked up the book and the book said it copied it
| verbatim from a paper from the 1970s.
| mhh__ wrote:
| The code is irrelevant, in the eyes of the theoretician
| anyway.
| andai wrote:
| I am theoretically challenged.
| shoulderfake wrote:
| [dead]
| GaggiX wrote:
| Add "(2018)"
| tedunangst wrote:
| I was wondering about "jpeg turned 25 in 2017." Seemed like
| peculiar phrasing.
| readthenotes1 wrote:
| Back when Uber was thinking it could make self-driving taxis
| londons_explore wrote:
| Uber now has a deal with Waymo - so in effect they are still
| trying to do self-driving taxis, but now via a complex
| business relationship.
| capableweb wrote:
| Ah, just like I "make hamburgers" when I go through the
| McDonald's drive-through, although it's via a business
| transaction.
| ianlevesque wrote:
| God this website is cynical. Licensing technology from
| another company to commercialize it is completely
| legitimate.
| wpietri wrote:
| That is legitimate, but that's not the point here. The
| point is Uber's hubris. A hubris very useful in pumping
| up its stock price ahead of an IPO. If they had quietly
| planned to license it from the get-go, nobody would have
| mentioned it.
| ianlevesque wrote:
| Uber literally killed people trying to develop their own
| first. Tesla continues to do so. I wish everyone would
| license Waymo instead.
| vajrabum wrote:
| Any reason to prefer Waymo over Cruise? I saw a
| driverless Cruise taxi in SF just the other day. And are
| there any other competitors still in this space worth
| watching?
| jjk166 wrote:
| More like making hamburgers by buying frozen burger
| patties at the supermarket.
| kevmo314 wrote:
| A lot of generative audio works very similarly these days: it's
| much faster to predict and generate a codebook than a raw
| waveform.
| a-dub wrote:
| i guess an interesting question is: can you coax a network into
| learning a better perceptual compression transform than the dcts
| in jpeg?
| kunalgupta wrote:
| see also:
| https://twitter.com/goodside/status/1679358632431853568?s=46...
| [deleted]
| naillo wrote:
| Finally someone who did this; I've always thought this was low-
| hanging fruit. I wonder if you could make interesting and quickly
| trained diffusion models with this trick.
| liuliu wrote:
| Sort of? Latent diffusion models operate in a compressed
| latent space, which is just a richer / learnable representation
| than DCT.
| waldarbeiter wrote:
| Note that it is from 2018. As someone here already mentioned
| there is a paper that applies the same idea to Vision
| Transformers published this year [1].
|
| [1]
| https://openaccess.thecvf.com/content/CVPR2023/papers/Park_R...
| goldemerald wrote:
| For those interested, a modern version (vision transformers) was
| just published this year at CVPR
| https://openaccess.thecvf.com/content/CVPR2023/html/Park_RGB...
| m3affan wrote:
| Transformers are taking the cake in the Deep Learning
| community
| w-m wrote:
| Ha, I remember the poster from the conference, it was quite
| crowded when I passed by. This one seemed to have a big focus
| on data augmentation in the DCT space. I was asking myself (and
| the author) whether you couldn't eke out a little more
| efficiency by trying to quantize your network similarly to the
| default JPEG quantization table. As I understood, currently all
| weights are quantized uniformly, which does not make sense when
| your inputs are heavily quantized, does it? Maybe I should dive
| a little deeper into the Uber paper; they were focusing a bit
| more on the quantization part. Sorry if I'm talking nonsense,
| this is absolutely not my area, but I found the topic
| captivating.
| buildbot wrote:
| There is some work on using JPEG style DCT blocks for
| quantization:
| https://dl.acm.org/doi/10.1109/ISCA45697.2020.00075
|
| (Disclaimer: not mine, but a friend's work)
| qwertox wrote:
| Thank you. First published 2022-11-29 on arxiv [0] and updated
| one month ago.
|
| Interesting line: "With these two improvements -- ViT and data
| augmentation -- we show that our ViT-Ti model achieves up to
| 39.2% faster training and 17.9% faster inference with no
| accuracy loss compared to the RGB counterpart."
|
| [0] https://arxiv.org/abs/2211.16421v2
| jcjohns wrote:
| I'm one of the authors of this CVPR paper -- cool to see our
| work mentioned on HN!
|
| The Uber paper from 2018 is one that has been floating around
| in the back of my head for a while. Decoding DCT to RGB is
| essentially an 8x8 stride 8 convolution -- it seems wasteful to
| perform this operation on CPU for data loading, then
| immediately pass the resulting decoded RGB into convolution
| layers that probably learn similar filters as those used during
| DCT decoding anyway.
|
| Compared to the earlier Uber paper, our CVPR paper makes two
| big advances:
|
| (1) Cleaner architecture: The Uber paper uses a CNN, while we
| use a ViT. It's kind of awkward to modify an existing CNN
| architecture to accept DCT instead of RGB since the grayscale
| data is 8x lower resolution than RGB, and the color information
| is 16x lower than RGB. With a CNN, you need to add extra layers
| to deal with the downsampled input, and use some kind of fusion
| mechanism to fuse the luma/chroma data of different resolution.
| With a ViT it's very straightforward to accept DCT input; you
| only need to change the patch embedding layer, and the body of
| the network is unchanged.
|
| (2) Data augmentation: The original Uber paper only showed
| speedup during inference. During training they need to perform
| data augmentation, so they convert DCT to RGB, augment in RGB,
| then convert back to DCT to feed the augmented data to the model.
| This means that their approach will be _slower_ during training
| vs an RGB model. In our paper we show how to perform all
| standard image augmentations directly in DCT, so we can get
| speedups during both training and inference.
|
| Happy to answer any questions about the project!
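
To make the "decoding DCT is essentially an 8x8 stride 8 convolution" point above concrete, here is a rough sketch; the channel layout of the coefficients and the PyTorch formulation are assumptions for illustration, not the paper's code:

    import numpy as np
    import torch
    import torch.nn as nn

    def dct_basis_8x8():
        # Orthonormal 2D DCT-II basis; basis[u*8 + v] is the 8x8 pattern of coefficient (u, v).
        alpha = np.array([np.sqrt(1 / 8)] + [np.sqrt(2 / 8)] * 7)
        x = np.arange(8)
        cos = np.cos((2 * x[None, :] + 1) * np.arange(8)[:, None] * np.pi / 16)  # cos[u, x]
        scaled = alpha[:, None] * cos
        return np.einsum('ux,vy->uvxy', scaled, scaled).reshape(64, 8, 8)

    # DCT coefficients laid out as 64 channels per 8x8 block: (N, 64, H/8, W/8).
    # Inverse DCT is then a fixed transposed convolution with kernel 8 and stride 8.
    idct = nn.ConvTranspose2d(64, 1, kernel_size=8, stride=8, bias=False)
    with torch.no_grad():
        idct.weight.copy_(torch.from_numpy(dct_basis_8x8()).float().unsqueeze(1))

    coeffs = torch.randn(1, 64, 4, 4)   # one 32x32 grayscale image's worth of blocks
    pixels = idct(coeffs)               # -> (1, 1, 32, 32)
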
| inoop wrote:
| > Decoding DCT to RGB is essentially an 8x8 stride 8
| convolution -- it seems wasteful to perform this operation on
| CPU for data loading
|
| Then why not do it on the GPU? Feels like exactly the sort of
| thing it was designed to do.
|
| Or alternatively, use nvjpeg?
| jcjohns wrote:
| This makes sense in theory, but is hard to get working in
| practice.
|
| We tried using nvjpeg to do JPEG decoding on GPU as an
| additional baseline, but using it as a drop-in replacement
| to a standard training pipeline gives huge slowdowns for a
| few reasons:
|
| (1) Batching: nvjpeg isn't batched; you need to decode one
| at a time in a loop. This is slow but could in principle be
| improved with a better GPU decoder.
|
| (2) Concurrent data loading / model execution: In a
| standard training pipeline, the CPU is loading and
| augmenting data on CPU for the next batch in parallel with
| the model running forward / backward on the current batch.
| Using the GPU for decoding blocks it from running the model
| concurrently. If you were careful I think you could
| probably find a way to interleave JPEG decoding and model
| execution on the GPU, but it's not straightforward. Just
| naively swapping out to use nvjpeg in a standard PyTorch
| training pipeline gives very bad performance.
|
| (3) Data augmentation: If you do DCT -> RGB decoding on the
| GPU, then you have to think about how and where to do data
| augmentation. You can augment in DCT either on CPU or on
| GPU; however DCT augmentation tends to be more expensive
| than RGB augmentation (especially for resize operations),
| so if you are already going through the trouble of decoding
| to RGB then it's probably much cheaper to augment in RGB.
| If you augment in RGB on GPU, then you are blocking
| parallel model execution for both JPEG decoding and
| augmentation, and problem (2) gets even worse. If you do
| RGB augmentation on CPU, you end up with an extra GPU ->
| CPU -> GPU round trip on every model iteration which again
| reduces performance.
| arketyp wrote:
| I'm just a low tier ML engineer, but I'd say you generally
| want to avoid splitting GPU resources over many libraries,
| to the extent it's even practically possible.
| bob1029 wrote:
| > Accuracy gains are due primarily to the specific use of a DCT
| representation, which turns out to work curiously well for image
| classification.
|
| It would seem quantization is a useful tool for any sort of NN-
| style application.
|
| If the expected output is intended to be human-like, why not feed
| it information that a typical human could not distinguish from a
| lossless representation? Seems like a simple game of expectations
| and information theory.
| kridsdale3 wrote:
| That's kind of the key theory behind why JPEG (and other lossy
| encodings) work at all. A perfect being would see a JPEG next
| to a PNG or TIFF and find the first repugnantly error-ridden.
|
| But we tend to ignore high-frequency data's specifics most of
| the time, so it psychologically works.
|
| I often wonder though, what do my cat and dog hear when I'm
| playing compressed music? Does it sound like a muddy phone
| call to them?
| jdiff wrote:
| No, audio compression doesn't filter out high frequencies,
| that's just what computer audio as a whole does. And I don't
| think there's enough of those high frequency components in
| what humans typically record for a cat or dog to notice the
| difference. As far as compression, the tricks that work on us
| should work on them.
| a-dub wrote:
| the early xing mp3 codec famously cut everything off above
| 18khz, but that was out of spec. :)
|
| instead perceptual audio compression typically filters out
| frequencies that neighbor other frequencies with lots of
| power. deleting these neighbors is called perceptual
| masking and to the best of my knowledge, we do not actually
| know if it works the same way in animal auditory systems.
| LordDragonfang wrote:
| >MP3 compression works by reducing (or approximating) the
| accuracy of certain components of sound that are considered
| (by psychoacoustic analysis) to be beyond the hearing
| capabilities of most humans.
|
| -via Wikipedia
|
| This holds true for most other audio compression as well.
|
| Now, it's true that max recording frequency is bounded by
| sample rate via the Nyquist theorem, but that doesn't mean
| we're incapable of recording at higher fidelity - we just
| don't bother most of the time, because on consumer hardware
| it's going to be filtered out eventually anyway (or just
| not reproduced well enough, due to low-quality physical
| hardware). Recording studios will regularly produce masters
| that far exceed that normal hearing range though.
| bob1029 wrote:
| > Does it sounds like a muddy phone call to them?
|
| Likely no.
|
| Audio is decidedly less "compressible" in human perceptual
| terms. The brain is amazingly skilled at detecting time delay
| and frequency deviations, so this perceptual baseline likely
| extends (mostly) to your pets.
|
| You can fool the eyes a lot more easily. You can take away
| 50% or more of the color information before even a skilled
| artist will start noticing.
| sdenton4 wrote:
| There are real differences in audio perception, though.
| Frequency range and sensitivity to different frequencies is
| a big difference in other animals; I would expect cats (who
| chase rodents, which often have very high pitched or even
| ultrasonic vocalizations) to be more sensitive to high
| frequencies than humans, and thus low-passed / low sample
| rate audio could sound 'bad'.
|
| Another aspect is time resolution. Song birds can have 2-4x
| the time resolution of human hearing, which helps
| distinguish sounds in their very fast, complex calls. This
| may lead to better perception of artifacts in lossy coding
| schemes, but it's hard to say for sure.
|
| Edit: reference on cat hearing:
| https://pubmed.ncbi.nlm.nih.gov/4066516
|
| The hearing range of the cat for sounds of 70 dB SPL
| extends from 48 Hz to 85 kHz, giving it one of the broadest
| hearing ranges among mammals.
| a-dub wrote:
| while work has been done to characterize frequency
| sensitivity across species (which does vary quite a bit,
| especially in the higher ranges (>20khz)), i haven't seen
| any work that has been done to explore frequency domain
| perceptual masking curves in a cross species context.
|
| since some species use their auditory systems for spatial
| localization, i would guess that the perceptual system
| would be totally different in those contexts.
| rullelito wrote:
| I've heard of this long before 2018; was this really seen as
| novel in 2018?
___________________________________________________________________
(page generated 2023-07-13 23:00 UTC)