[HN Gopher] Faster neural networks straight from JPEG (2018)
       ___________________________________________________________________
        
       Faster neural networks straight from JPEG (2018)
        
       Author : Anon84
       Score  : 178 points
       Date   : 2023-07-13 14:51 UTC (8 hours ago)
        
 (HTM) web link (www.uber.com)
 (TXT) w3m dump (www.uber.com)
        
       | wodenokoto wrote:
        | That seems right in line with an earlier front-page submission
        | that used kNN and gzipped text for categorization.
        
       | watersb wrote:
       | IIRC, "neural network" style systems design started in the realm
       | of computer vision with "Perceptrons":
       | 
        | https://en.wikipedia.org/wiki/Perceptrons_(book)
        | 
        | Perceptrons = https://g.co/kgs/8Un4eW
       | 
       | Makes sense that image processing would be a good fit in some
       | cases.
        
       | mochomocha wrote:
        | I was doing the same thing at Netflix around the same time as a
        | 20% research project: training GANs end-to-end directly in JPEG
        | coefficient space (and then rebuilding a JPEG from the generated
        | coefficients using libjpeg to get an image). The pitch was that
        | it not only worked, but you also got fast training by
        | representing each JPEG block as a dense + sparse vector (dense
        | for the low DCT coefficients, sparse for the high ones, since
        | they're almost all zeros) and using a neural network library
        | with fast ops on sparse data.
        | 
        | Training on pixels is inefficient. Why have the first layers of
        | your CNN relearn what's already smartly encoded in the JPEG bits
        | before it's blown up into a bloated height x width x 3 float
        | matrix?
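        | 
        | A rough sketch of that dense + sparse block split (illustrative
        | only, not the Netflix code; the frequency ordering and the
        | cutoff of 16 dense coefficients are assumptions):
        | 
        |     import numpy as np
        | 
        |     # Low-to-high frequency order, grouped by diagonal (close
        |     # enough to JPEG's zigzag for this split).
        |     ORDER = sorted(((r, c) for r in range(8) for c in range(8)),
        |                    key=lambda rc: (rc[0] + rc[1], rc[0]))
        | 
        |     def split_block(dct_block, n_dense=16):
        |         """Dense low-freq vector + sparse high-freq part."""
        |         flat = np.array([dct_block[r, c] for r, c in ORDER])
        |         dense = flat[:n_dense]          # almost always non-zero
        |         hi = flat[n_dense:]
        |         nz = np.nonzero(hi)[0]          # usually a handful of entries
        |         return dense, (nz + n_dense, hi[nz])
        | 
        |     # Typical block: a few low-frequency values, zeros elsewhere.
        |     blk = np.zeros((8, 8)); blk[:2, :2] = [[80, -3], [5, 1]]
        |     dense, (idx, vals) = split_block(blk)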
        
         | sharemywin wrote:
          | I wondered a while back about using EGA or CGA formats for
          | image recognition and/or for Stable Diffusion. Seems like
          | there should be more than enough image data at that resolution
          | and color depth (16 colors).
        
         | 23B1 wrote:
         | This is really interesting. Have you published your research
         | anywhere that I could read?
        
         | vladimirralev wrote:
         | In naive ML scenarios you are right. You can think of JPEG as
         | an input embedding. One of many. The JPEG/spectral embedding is
         | useful because it already provides miniature variational
         | encoding that "makes sense" in terms of translation, sharpness,
         | color, scale and texture.
         | 
          | But with clever ML you can design better variational
          | characteristics, such as rotation or nonlinear things like
          | faces, fingers, projections and abstract objects.
          | 
          | Further, JPEG encoding/decoding will be an obstacle for many
          | architectures that require gradients flowing back and forth
          | between pixel space and JPEG space in order to compute
          | evaluation steps and loss functions in pixel space (which
          | would be superior). Not to mention if you need human feedback
          | in generative scenarios, to retouch the output and run
          | training steps on the changed pixels.
         | 
         | And finally, there are already picture and video embeddings
         | that are gradient-friendly and reusable.
        
         | luckystarr wrote:
         | I remember I've heard somewhere that our retina encodes the
         | visuals it receives into a compressed signal before forwarding
         | it to the visual cortex. If true, this may actually be how it's
         | done "for real". ;)
        
         | johntb86 wrote:
         | I would worry that the fixed, non-overlapping block nature of a
         | JPEG would reduce translation invariance - shift an image by 4
         | pixels and the DCT coefficients may look very different. People
         | have been doing a lot of work to try to reduce the dependence
         | of the image on the actual pixel coordinates - see for example
         | https://research.nvidia.com/publication/2021-12_alias-free-g...
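          | 
          | A quick toy check of that effect (illustrative only, not from
          | the linked paper; random data and scipy stand in for a real
          | image and decoder):
          | 
          |     import numpy as np
          |     from scipy.fft import dctn
          | 
          |     def block_dct(img, b=8):
          |         # Non-overlapping 8x8 DCT-II on a grayscale image,
          |         # JPEG-style.
          |         h, w = (s - s % b for s in img.shape)
          |         blocks = img[:h, :w].reshape(h // b, b, w // b, b)
          |         return dctn(blocks.swapaxes(1, 2), axes=(-2, -1),
          |                     norm='ortho')
          | 
          |     img = np.random.default_rng(0).random((64, 64))
          |     c0 = block_dct(img)
          |     c1 = block_dct(np.roll(img, 4, axis=1))  # 4-pixel shift
          |     # Large relative change despite a tiny spatial shift:
          |     print(np.abs(c0 - c1).mean() / np.abs(c0).mean())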
        
           | andai wrote:
           | Does JPEG-2000 fix that? From what I gathered, it doesn't use
           | blocks.
        
             | mochomocha wrote:
              | JPEG-2000 uses wavelets as a decomposition basis as opposed
              | to DCT, which in theory makes it possible to treat the
              | whole image as a single block while ensuring high
              | compression. In practice, though, tiles are used, I would
              | guess to improve memory usage and compute parallelism.
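              | 
              | For contrast, a whole-image wavelet decomposition (sketched
              | here with PyWavelets and a db2 wavelet, not JPEG-2000's
              | actual CDF 9/7 filter bank) looks like:
              | 
              |     import numpy as np
              |     import pywt
              | 
              |     img = np.random.default_rng(0).random((256, 256))
              |     # One low-pass approximation plus (H, V, D) detail
              |     # bands per level, over the whole image rather than
              |     # 8x8 blocks.
              |     coeffs = pywt.wavedec2(img, wavelet='db2', level=3)
              |     approx, details = coeffs[0], coeffs[1:]
              |     print(approx.shape, [d[0].shape for d in details])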
        
           | DougBTX wrote:
           | On the other hand, ViT uses non-overlapping patches anyway,
           | so the impact may be minor. Example code:
           | https://nn.labml.ai/transformers/vit/index.html
        
         | corysama wrote:
         | As an AI armchair quarterback, I've always held the opinion
         | that the image ML space has a counter-productive bias towards
         | not pre-processing images stemming from a mixture of test
         | purity and academic hubris. "Look what this method can learn
         | from raw data completely independently!" makes for a nice
         | paper. So, they stick with sRGB inputs rather than doing basic
         | classic transforms like converting to YUV420. Everyone learns
         | from the papers, so that's assumed to be the standard practice.
        
           | CodesInChaos wrote:
           | Typical CNNs can learn a linear transformation of the input
           | at no cost. Since YUV is such a linear transformation of RGB,
           | there is no benefit in converting to it beforehand.
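            | 
            | A toy check (using approximate BT.601 constants; the exact
            | matrix doesn't matter for the argument):
            | 
            |     import torch
            | 
            |     # Full-range RGB -> YUV is a plain 3x3 matrix.
            |     M = torch.tensor([[ 0.299,  0.587,  0.114],
            |                       [-0.147, -0.289,  0.436],
            |                       [ 0.615, -0.515, -0.100]])
            | 
            |     rgb = torch.rand(1, 3, 32, 32)
            |     yuv = torch.einsum('oc,bchw->bohw', M, rgb)
            | 
            |     # The same transform as a 1x1 conv, i.e. something the
            |     # first layer of a CNN can absorb into its weights.
            |     conv = torch.nn.Conv2d(3, 3, kernel_size=1, bias=False)
            |     with torch.no_grad():
            |         conv.weight.copy_(M.view(3, 3, 1, 1))
            |     print(torch.allclose(conv(rgb), yuv, atol=1e-5))  # True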
        
             | bob1029 wrote:
             | How is there not a cost associated with forcing the machine
             | to learn how to do something that we already have a simple,
             | deterministic algorithm for? Won't some engineer need to
             | double check a few things with regard to the AI's idea of
             | color space transform?
        
               | uoaei wrote:
                | Your instincts are correct. Training is faster, more
                | stable, and more efficient that way. In certain cases it
                | is pretty much irrelevant, but the advantages of
                | modelling the knowns and training only on the unknowns
                | become starkly apparent when doing e.g. sensor fusion or
                | other ML tasks on physical systems.
        
               | CodesInChaos wrote:
               | You could probably derive some smart initialization for
               | the first layer of a NN based on domain knowledge (color
               | spaces, sobel filters, etc.). But since this is such a
               | small part of what the NN has to learn, I expect this to
               | result in a small improvement in training time and have
               | no effect on final performance and accuracy, so it's
               | unlikely to be worth the complexity of developing such a
               | feature.
        
               | whimsicalism wrote:
               | Absolutely this.
               | 
               | Seems like on HN people are still learning 'the bitter
               | lesson'.
        
           | potatoman22 wrote:
           | That's something that surprises me too, given the
           | preprocessing applied to NLP.
        
             | CodesInChaos wrote:
              | Short text representations (via good tokenization)
              | significantly reduce the computational cost of a
              | transformer (you need to generate fewer tokens for the
              | same output length, and need fewer tokens to represent the
              | same window size). I think these combine to n^3 scaling
              | (n^2 from window size and n from output size).
             | 
             | For images it's not clear to me if there are any
             | preprocessing methods that do a lot better than resizing
             | the image to a smaller resolution (which is commonly done
             | already).
        
             | thomasahle wrote:
              | We used to do a lot more preprocessing for NLP, like
              | stemming, removing stop words, or even adding grammar
              | information (NP, VP, etc.). Now we just do basic
              | tokenization. The rest turned out to be irrelevant or even
              | counterproductive.
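              | 
              | For anyone who never saw that era, the old-style pipeline
              | looked roughly like this (a sketch with NLTK; assumes its
              | punkt and stopwords data are downloaded):
              | 
              |     import nltk
              |     from nltk.corpus import stopwords
              |     from nltk.stem import PorterStemmer
              | 
              |     # nltk.download('punkt'); nltk.download('stopwords')
              |     stop = set(stopwords.words('english'))
              |     stem = PorterStemmer().stem
              | 
              |     text = "Networks were relearning what the codec encoded."
              |     tokens = nltk.word_tokenize(text.lower())
              |     # Drop stop words and punctuation, then stem the rest.
              |     kept = [stem(t) for t in tokens
              |             if t.isalpha() and t not in stop]
              |     print(kept)  # e.g. ['network', 'relearn', 'codec', 'encod']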
        
               | kbelder wrote:
               | But also, that basic tokenization is essential; training
               | it on a raw ascii stream would be much less efficient.
               | There is a sweet spot of processing & abstraction that
               | should be aimed for.
        
           | fxtentacle wrote:
           | In my experience, staying close to the storage format is very
           | useful because it allows the neural network to correctly deal
           | with clipped/saturated values. If your file is saved in sRGB
           | and you train in sRGB, then when something turns to 0 or 255,
           | the AI can handle it as a special case because most likely it
           | was too bright or too dark for your sensor to capture
           | accurately. If you first transform to a different color
           | space, that clear clip boundary gets lost.
           | 
            | Also, I would prefer sRGB or RGB because it more closely
            | matches the human visual system. That said, the RGB to YUV
            | transformation is effectively a matrix multiplication, so if
            | you use conv features like everyone else, you can merge it
            | into your weights for free.
        
           | whimsicalism wrote:
           | In deep ML, people are pretty familiar with the bitter lesson
           | and don't want to waste time on this.
        
             | Llamamoe wrote:
              | The bitter lesson is about not trying to encode
              | impossible-to-formalize conceptual knowledge, not about
              | giving up data efficiency and scaling the model up to ever
              | higher parameter counts.
             | 
             | If we followed this logic, we'd be training LLMs on
             | character-level UTF-32 and just letting it figure
             | everything out by itself, while needing two orders of
             | magnitude bigger contexts and parameter counts.
        
               | whimsicalism wrote:
                | Converting from RGB to YUV is absolutely subject to the
                | bitter lesson, because it takes a representation that we
                | have seen work for some classical methods and hard-codes
                | that knowledge into the AI, which could easily learn
                | (and will anyway) a more useful representation for
                | itself.
               | 
               | > LLMs on character-level UTF-32 and just letting it
               | figure everything out by itself, while needing two orders
               | of magnitude bigger contexts and parameter counts.
               | 
                | This was tried extensively, and honestly it is probably
                | _still_ too early to proclaim the demise of this
                | approach. It's also completely different - you're
                | conflating a representation that literally changes the
                | number of forward passes you have to do (i.e. the amount
                | of computation - what the bitter lesson is about) with
                | one that (at most) would just require stacking on a few
                | extra layers.
               | 
               | A better example for your point (imo) would be audio
               | recognition, where we pre-transform from wave amplitudes
               | into log mel spectrogram for ingestion by the model. I
               | think this will ultimately fall to the bitter lesson as
               | well though.
               | 
                | Also, a key difference is that you are proposing going
                | from methods that _already work_ to injecting more
                | classical knowledge into them. It is often the case that
                | you'll have an intermediate fusion between deep and
                | classical methods, but not once you already have working
                | fully deep methods.
        
           | mhh__ wrote:
            | It's also probably because a lot of knowledge about how JPEG
            | works is tied up in signal processing books that usually
            | front-load a bunch of mathematics, as opposed to ML, which
            | often needs a bit of mathematical intuition but in practice
            | is usually empirical.
        
             | andai wrote:
             | I was trying to learn how the discrete cosine transform
             | works, so I looked up some code in an open source program.
             | The code said it copied it verbatim from a book from the
             | 90s. I looked up the book and the book said it copied it
             | verbatim from a paper from the 1970s.
        
               | mhh__ wrote:
               | The code is irrelevant, in the eyes of the theoretician
               | anyway.
        
               | andai wrote:
               | I am theoretically challenged.
        
       | GaggiX wrote:
       | Add "(2018)"
        
         | tedunangst wrote:
         | I was wondering about "jpeg turned 25 in 2017." Seemed like
         | peculiar phrasing.
        
         | readthenotes1 wrote:
         | Back when Uber was thinking it could make self driving taxis
        
           | londons_explore wrote:
            | Uber now has a deal with Waymo - so in effect they are still
            | trying to do self-driving taxis, but now via a complex
            | business relationship.
        
             | capableweb wrote:
              | Ah, just like I "make hamburgers" when I go through the
              | McDonald's drive-through, although it's via a business
              | transaction.
        
               | ianlevesque wrote:
               | God this website is cynical. Licensing technology from
               | another company to commercialize it is completely
               | legitimate.
        
               | wpietri wrote:
               | That is legitimate, but that's not the point here. The
               | point is Uber's hubris. A hubris very useful in pumping
               | up its stock price ahead of an IPO. If they had quietly
               | planned to license it from the get-go, nobody would have
               | mentioned it.
        
               | ianlevesque wrote:
               | Uber literally killed people trying to develop their own
               | first. Tesla continues to do so. I wish everyone would
               | license Waymo instead.
        
               | vajrabum wrote:
                | Any reason to prefer Waymo over Cruise? I saw a
                | driverless Cruise taxi in SF just the other day. And are
                | there any other competitors still in this space worth
                | watching?
        
               | jjk166 wrote:
               | More like making hamburgers by buying frozen burger
               | patties at the supermarket.
        
       | kevmo314 wrote:
        | A lot of generative audio works very similarly these days: it's
        | much faster to predict and generate a codebook than a raw
        | waveform.
        
       | a-dub wrote:
       | i guess an interesting question is: can you coax a network into
       | learning a better perceptual compression transform than the dcts
       | in jpeg?
        
       | kunalgupta wrote:
       | see also:
       | https://twitter.com/goodside/status/1679358632431853568?s=46...
        
       | naillo wrote:
        | Finally someone did this; I've always thought it was low-hanging
        | fruit. I wonder if you could make interesting and quickly
        | trained diffusion models with this trick.
        
         | liuliu wrote:
          | Sort of? Latent diffusion models operate in a compressed
          | latent space, which is just a richer / learnable representation
          | than DCT.
        
         | waldarbeiter wrote:
          | Note that it is from 2018. As someone here already mentioned,
          | there is a paper published this year that applies the same
          | idea to Vision Transformers [1].
         | 
         | [1]
         | https://openaccess.thecvf.com/content/CVPR2023/papers/Park_R...
        
       | goldemerald wrote:
       | For those interested, a modern version (vision transformers) was
       | just published this year at CVPR
       | https://openaccess.thecvf.com/content/CVPR2023/html/Park_RGB...
        
         | m3affan wrote:
          | Transformers are taking the cake in the Deep Learning
          | community.
        
         | w-m wrote:
         | Ha, I remember the poster from the conference, it was quite
         | crowded when I passed by. This one seemed to have a big focus
         | on data augmentation in the DCT space. I was asking myself (and
         | the author) whether you couldn't eke out a little more
         | efficiency by trying to quantize your network similarly to the
          | default JPEG quantization table. As I understood it, currently
          | all weights are quantized uniformly, which does not make sense
          | when your inputs are heavily quantized, does it? Maybe I
          | should dive a little deeper into the Uber paper; they were
          | focusing a bit more on the quantization part. Sorry if I'm
          | talking nonsense,
         | this is absolutely not my area, but I found the topic
         | captivating.
        
           | buildbot wrote:
           | There is some work on using JPEG style DCT blocks for
           | quantization:
           | https://dl.acm.org/doi/10.1109/ISCA45697.2020.00075
           | 
            | (Disclaimer: not mine, but a friend's work.)
        
         | qwertox wrote:
         | Thank you. First published 2022-11-29 on arxiv [0] and updated
         | one month ago.
         | 
         | Interesting line: "With these two improvements -- ViT and data
         | augmentation -- we show that our ViT-Ti model achieves up to
         | 39.2% faster training and 17.9% faster inference with no
         | accuracy loss compared to the RGB counterpart."
         | 
         | [0] https://arxiv.org/abs/2211.16421v2
        
         | jcjohns wrote:
         | I'm one of the authors of this CVPR paper -- cool to see our
         | work mentioned on HN!
         | 
         | The Uber paper from 2018 is one that has been floating around
         | in the back of my head for a while. Decoding DCT to RGB is
         | essentially an 8x8 stride 8 convolution -- it seems wasteful to
         | perform this operation on CPU for data loading, then
         | immediately pass the resulting decoded RGB into convolution
         | layers that probably learn similar filters as those used during
         | DCT decoding anyway.
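          | 
          | (One way to write that view down, as a toy sketch rather than
          | the paper's code: stack the 64 coefficients of each block as
          | channels at 1/8 resolution and apply a transposed convolution
          | whose weights are the DCT basis images.)
          | 
          |     import numpy as np
          |     import torch
          |     from scipy.fft import idctn
          | 
          |     # basis[k] is the 8x8 pixel pattern of DCT coefficient k.
          |     basis = np.stack([
          |         idctn(np.eye(64)[k].reshape(8, 8), norm='ortho')
          |         for k in range(64)])
          | 
          |     coef = torch.randn(1, 64, 4, 4)  # 64 coeffs per block
          |     idct = torch.nn.ConvTranspose2d(64, 1, kernel_size=8,
          |                                     stride=8, bias=False)
          |     with torch.no_grad():
          |         w = torch.from_numpy(basis).float().view(64, 1, 8, 8)
          |         idct.weight.copy_(w)
          |         pix = idct(coef)  # (1, 1, 32, 32), blocks decoded independently
          | 
          |     # Matches a direct inverse DCT of the first block.
          |     ref = idctn(coef[0, :, 0, 0].numpy().reshape(8, 8),
          |                 norm='ortho')
          |     print(np.allclose(pix[0, 0, :8, :8].numpy(), ref,
          |                       atol=1e-5))  # True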
         | 
         | Compared to the earlier Uber paper, our CVPR paper makes two
         | big advances:
         | 
         | (1) Cleaner architecture: The Uber paper uses a CNN, while we
         | use a ViT. It's kind of awkward to modify an existing CNN
         | architecture to accept DCT instead of RGB since the grayscale
         | data is 8x lower resolution than RGB, and the color information
         | is 16x lower than RGB. With a CNN, you need to add extra layers
         | to deal with the downsampled input, and use some kind of fusion
         | mechanism to fuse the luma/chroma data of different resolution.
         | With a ViT it's very straightforward to accept DCT input; you
         | only need to change the patch embedding layer, and the body of
         | the network is unchanged.
         | 
          | (2) Data augmentation: The original Uber paper only showed a
          | speedup during inference. During training they need to perform
          | data augmentation, so they convert DCT to RGB, augment in RGB,
          | then convert back to DCT to feed the augmented data to the
          | model. This means that their approach will be _slower_ during
          | training than an RGB model. In our paper we show how to
          | perform all standard image augmentations directly in DCT
          | space, so we can get speedups during both training and
          | inference.
         | 
         | Happy to answer any questions about the project!
        
           | inoop wrote:
           | > Decoding DCT to RGB is essentially an 8x8 stride 8
           | convolution -- it seems wasteful to perform this operation on
           | CPU for data loading
           | 
           | Then why not do it on the GPU? Feels like exactly the sort of
           | thing it was designed to do.
           | 
           | Or alternatively, use nvjpeg?
        
             | jcjohns wrote:
             | This makes sense in theory, but is hard to get working in
             | practice.
             | 
              | We tried using nvjpeg to do JPEG decoding on GPU as an
              | additional baseline, but using it as a drop-in replacement
             | to a standard training pipeline gives huge slowdowns for a
             | few reasons:
             | 
             | (1) Batching: nvjpeg isn't batched; you need to decode one
             | at a time in a loop. This is slow but could in principle be
             | improved with a better GPU decoder.
             | 
             | (2) Concurrent data loading / model execution: In a
              | standard training pipeline, the CPU loads and augments
              | data for the next batch in parallel with the model running
              | forward / backward on the current batch.
             | Using the GPU for decoding blocks it from running the model
             | concurrently. If you were careful I think you could
             | probably find a way to interleave JPEG decoding and model
             | execution on the GPU, but it's not straightforward. Just
             | naively swapping out to use nvjpeg in a standard PyTorch
             | training pipeline gives very bad performance.
             | 
             | (3) Data augmentation: If you do DCT -> RGB decoding on the
             | GPU, then you have to think about how and where to do data
             | augmentation. You can augment in DCT either on CPU or on
             | GPU; however DCT augmentation tends to be more expensive
             | than RGB augmentation (especially for resize operations),
             | so if you are already going through the trouble of decoding
             | to RGB then it's probably much cheaper to augment in RGB.
             | If you augment in RGB on GPU, then you are blocking
             | parallel model execution for both JPEG decoding and
             | augmentation, and problem (2) gets even worse. If you do
              | RGB augmentation on CPU, you end up with an extra GPU ->
              | CPU -> GPU round trip on every model iteration, which again
              | reduces performance.
        
             | arketyp wrote:
             | I'm just a low tier ML engineer, but I'd say you generally
             | want to avoid splitting GPU resources over many libraries,
             | to the extent it's even practically possible.
        
       | bob1029 wrote:
       | > Accuracy gains are due primarily to the specific use of a DCT
       | representation, which turns out to work curiously well for image
       | classification.
       | 
       | It would seem quantization is a useful tool for any sort of NN-
       | style application.
       | 
       | If the expected output is intended to be human-like, why not feed
       | it information that a typical human could not distinguish from a
       | lossless representation? Seems like a simple game of expectations
       | and information theory.
        
         | kridsdale3 wrote:
         | That's kind of the key theory behind why JPEG (and other lossy
         | encodings) work at all. A perfect being would see a JPEG next
         | to a PNG or TIFF and find the first repugnantly error-ridden.
         | 
         | But we tend to ignore high-frequency data's specifics most of
         | the time, so it psychologically works.
         | 
          | I often wonder, though: what do my cat and dog hear when I'm
          | playing compressed music? Does it sound like a muddy phone
          | call to them?
        
           | jdiff wrote:
           | No, audio compression doesn't filter out high frequencies,
           | that's just what computer audio as a whole does. And I don't
            | think there are enough of those high-frequency components
            | in what humans typically record for a cat or dog to notice
            | the difference. As far as compression goes, the tricks that
            | work on us
           | should work on them.
        
             | a-dub wrote:
             | the early xing mp3 codec famously cut everything off above
             | 18khz, but that was out of spec. :)
             | 
             | instead perceptual audio compression typically filters out
             | frequencies that neighbor other frequencies with lots of
             | power. deleting these neighbors is called perceptual
             | masking and to the best of my knowledge, we do not actually
             | know if it works the same way in animal auditory systems.
        
             | LordDragonfang wrote:
             | >MP3 compression works by reducing (or approximating) the
             | accuracy of certain components of sound that are considered
             | (by psychoacoustic analysis) to be beyond the hearing
             | capabilities of most humans.
             | 
             | -via Wikipedia
             | 
             | This holds true for most other audio compression as well.
             | 
             | Now, it's true that max recording frequency is bounded by
             | sample rate via the Nyquist theorem, but that doesn't mean
             | we're incapable of recording at higher fidelity - we just
             | don't bother most of the time, because on consumer hardware
             | it's going to be filtered out eventually anyway (or just
             | not reproduced well enough, due to low-quality physical
             | hardware). Recording studios will regularly produce masters
             | that far exceed that normal hearing range though.
        
           | bob1029 wrote:
            | > Does it sound like a muddy phone call to them?
           | 
           | Likely no.
           | 
           | Audio is decidedly less "compressible" in human perceptual
           | terms. The brain is amazingly skilled at detecting time delay
           | and frequency deviations, so this perceptual baseline likely
           | extends (mostly) to your pets.
           | 
            | You can fool the eyes a lot more easily. You can take away
            | 50% or more of the color information before even a skilled
            | artist will start noticing.
        
             | sdenton4 wrote:
             | There are real differences in audio perception, though.
             | Frequency range and sensitivity to different frequencies is
             | a big difference in other animals; I would expect cats (who
             | chase rodents, which often have very high pitched or even
             | ultrasonic vocalizations) to be more sensitive to high
              | frequencies than humans, and thus low-passed / low sample
              | rate audio could sound 'bad.'
             | 
             | Another aspect is time resolution. Song birds can have 2-4x
             | the time resolution of human hearing, which helps
             | distinguish sounds in their very fast, complex calls. This
             | may lead to better perception of artifacts in lossy coding
             | schemes, but it's hard to say for sure.
             | 
             | Edit: reference on cat hearing:
             | https://pubmed.ncbi.nlm.nih.gov/4066516
             | 
             | The hearing range of the cat for sounds of 70 dB SPL
             | extends from 48 Hz to 85 kHz, giving it one of the broadest
             | hearing ranges among mammals.
        
             | a-dub wrote:
             | while work has been done to characterize frequency
             | sensitivity across species (which does vary quite a bit,
             | especially in the higher ranges (>20khz)), i haven't seen
             | any work that has been done to explore frequency domain
             | perceptual masking curves in a cross species context.
             | 
             | since some species use their auditory systems for spatial
             | localization, i would guess that the perceptual system
             | would be totally different in those contexts.
        
       | rullelito wrote:
        | I'd heard of this long before 2018; was this really seen as
        | novel in 2018?
        
       ___________________________________________________________________
       (page generated 2023-07-13 23:00 UTC)