[HN Gopher] YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Per...
       ___________________________________________________________________
        
       YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance
        
       Author : T-A
       Score  : 78 points
       Date   : 2021-09-10 17:04 UTC (5 hours ago)
        
 (HTM) web link (neuralmagic.com)
 (TXT) w3m dump (neuralmagic.com)
        
       | ThouYS wrote:
       | I guess pretty soon we'll wonder how anyone was ever ok with
       | using raw non-sparsified models
        
       | fwsgonzo wrote:
        | I would love to use this in a "special setting", but it has to
        | be compiled, static code. So if there is an implementation for
        | C, C++, Rust or any other language, that would be great! I am
        | more than willing to handle the special build myself. I just
        | need a language that compiles to machine code.
        
         | codetrotter wrote:
         | Is the special setting running it on iOS and distributing it
         | via the Apple App Store? :)
        
         | spullara wrote:
         | I don't know a lot about this but I think you can use this to
         | remove the Python dependency:
         | 
         | https://pytorch.org/docs/stable/jit.html
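          | 
          | A minimal sketch of that route (the model choice is just a
          | placeholder): trace the trained network into a TorchScript
          | archive, which can then be loaded from C++ via libtorch with
          | no Python runtime.
          | 
          |     import torch
          |     import torchvision
          | 
          |     # Trace the model with an example input and save a
          |     # TorchScript archive loadable from libtorch (C++).
          |     model = torchvision.models.resnet18(pretrained=True)
          |     model.eval()
          |     example = torch.rand(1, 3, 224, 224)
          |     traced = torch.jit.trace(model, example)
          |     traced.save("model_traced.pt")
          |     # C++ side: torch::jit::load("model_traced.pt")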
        
       | enricozb wrote:
       | Tangentially related: There is some controversy around YOLOv5's
       | name: https://github.com/ultralytics/yolov5/issues/2
        
       | deepnotderp wrote:
       | I'm pretty sure this isn't using the Tensor cores on the GPU.
       | 
        | If you look at the YOLOv5 README
        | (https://github.com/ultralytics/yolov5/blob/master/README.md),
        | inference on a V100 for YOLOv5s should take about 2 ms per
        | image, or 500 img/s, not the 44.6 img/s being reported here.
       | 
       | This is important as it is more than an order of magnitude off.
        
         | ml_hardware wrote:
         | My guess is they _are_ using tensor cores as they report FP16
         | throughput, but they seem to be measuring at batch size 1,
         | which is hugely unfair to the GPUs.
         | 
         | For inference workloads you usually batch incoming requests
         | together and run once on GPU (though this increases latency). A
         | latency/throughput tradeoff curve at different batch sizes
         | would tell the whole story.
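          | 
          | A rough sweep along those lines (model, dtype and batch sizes
          | are placeholders, not the article's setup):
          | 
          |     import time
          |     import torch, torchvision
          | 
          |     model = torchvision.models.resnet50(pretrained=True)
          |     model = model.eval().cuda().half()
          | 
          |     @torch.no_grad()
          |     def bench(bs, iters=50):
          |         x = torch.rand(bs, 3, 224, 224, device="cuda",
          |                        dtype=torch.float16)
          |         for _ in range(10):          # warmup
          |             model(x)
          |         torch.cuda.synchronize()
          |         t0 = time.time()
          |         for _ in range(iters):
          |             model(x)
          |         torch.cuda.synchronize()
          |         return (time.time() - t0) / iters
          | 
          |     for bs in (1, 2, 4, 8, 16, 32, 64):
          |         dt = bench(bs)
          |         print(f"bs={bs}: {dt*1e3:.1f} ms/batch, "
          |               f"{bs/dt:.0f} img/s")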
         | 
          | Also, they are using INT8 on the CPU but neglect to measure
          | the same on the GPU; INT8 would roughly double all the GPU
          | throughputs.
         | 
         | tl;dr just use GPUs
         | 
         | Edit: to the comments below, I agree low-latency can be
         | important to some workloads, but that's exactly why I think we
         | need to see a latency-throughput tradeoff curve.
         | 
         | Unfortunately, I'm pretty sure that modern GPUs
         | (A10/A30/A40/A100) basically dominate CPUs even when latency is
         | constrained, and the MLPerf results give a good (fair!)
         | comparison of this:
         | 
         | https://developer.nvidia.com/blog/extending-nvidia-performan...
         | 
         | The GPU throughputs are much, much higher than the CPU ones,
         | and I don't think even NM's software can overcome this gap. Not
         | to mention they degrade the model quality...
         | 
         | The last question is whether CPUs are more cost-effective
         | despite being slower, and the answer is still... no. The
         | instances used in this blog post cost:
         | 
          | - C5 CPU (c5.12xlarge): $2.04/hr
          | - T4 GPU (g4dn.2xlarge): $0.752/hr
         | 
          | NM's best result @ batch-size-1 costs more, at lower
          | throughput, lower model quality, and roughly the same latency,
          | than a last-gen GPU operating at half its capacity. A new A10
          | GPU using INT8 would widen the perf/$ gap by another ~4x.
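          | 
          | Back-of-the-envelope images-per-dollar with those instance
          | prices (the throughput figures below are placeholders, not
          | the article's exact numbers):
          | 
          |     # images/$ = imgs_per_sec * 3600 / cost_per_hour
          |     instances = [("c5.12xlarge CPU", 90.0, 2.04),
          |                  ("g4dn.2xlarge T4", 200.0, 0.752)]
          |     for name, ips, cost in instances:
          |         print(f"{name}: {ips * 3600 / cost:,.0f} images/$")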
         | 
         | Also full disclosure I don't work at NVIDIA or anything like
         | that so I'm not trying to shill :) I just like ML hardware a
         | lot and want to help people make fair comparisons.
        
           | bigcorp-slave wrote:
           | It depends on your application. If you are running on a
           | smartphone, or on an AR headset, or on a car, or on a camera,
           | etc, you generally do not have the latency budget to wait for
           | multiple frames and run at high batch size.
        
           | 37ef_ced3 wrote:
           | Batch size 1 improves latency, especially for
           | businesses/services with fewer users. Latency matters.
           | 
            | Also, your CPU cost numbers are way off, because they use
            | an expensive provider like AWS instead of, say, Vultr
            | (https://www.vultr.com).
           | 
           | And many businesses/services can't saturate the hardware you
           | describe. It's just too much compute power. With CPUs you can
           | scale down to fit your actual needs: all the way down to a
           | single AVX-512 core doing maybe 24 inferences per second
           | (costing a few dollars PER MONTH).
        
           | deepnotderp wrote:
            | V100 GPUs have non-tensor-core fp16 operations too, I think.
        
             | woadwarrior01 wrote:
              | Yes. Non-tensor-core fp16 ops are the default. Tensor
              | cores are essentially 4x4 fp16 MAC units, and for them to
              | be used the matrix dimensions need to be multiples of
              | 8 [1].
             | 
             | [1]:
             | https://docs.nvidia.com/deeplearning/performance/mixed-
             | preci...
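              | 
              | Rough illustration of that constraint (layer and batch
              | sizes here are placeholders): keep the GEMM dimensions,
              | including the batch size, at multiples of 8 when running
              | in fp16.
              | 
              |     import torch
              | 
              |     # 1024 and 64 are both multiples of 8, so cuBLAS can
              |     # dispatch this fp16 GEMM to the tensor cores.
              |     layer = torch.nn.Linear(1024, 1024).cuda().half()
              |     x = torch.randn(64, 1024, device="cuda",
              |                     dtype=torch.float16)
              |     with torch.no_grad():
              |         y = layer(x)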
        
             | ml_hardware wrote:
              | That's true... in fact, seeing V100 FP16 < T4 FP16 makes
              | me believe you're right: the V100 should be much faster if
              | the tensor cores were being used.
        
       | baybal2 wrote:
       | I told you! :)
       | 
        | CPUs keep throwing specialised hardware into the garbage bag a
        | few years after mass adoption of something big, as the computer
        | science improves.
        | 
        | People once said that real-time audio was physically impossible
        | to do on a CPU; now you can do it even on a smartphone CPU.
        
         | monocasa wrote:
          | Some of that is figuring out the software overhead too. Alexia
          | Massalin was doing audio DSP on her 68k workstation back in
          | the late 80s with her Synthesis kernel. Overall it seems that
          | multiplexing with the right semantics is the majority of the
          | problem in running new workloads on general-purpose CPUs, and
          | that takes time to understand.
        
         | hrydgard wrote:
         | CPUs haven't replaced GPUs for graphics though, and are very
         | far away from doing it...
        
       | 37ef_ced3 wrote:
       | Even without sparsification, AVX-512 CPUs are far more cost-
       | effective for inference.
       | 
       | To get your money's worth, you must saturate the processor. It's
       | easy, in practice, with an AVX-512 CPU (e.g., see
       | https://NN-512.com or Fabrice Bellard's
       | https://bellard.org/libnc), and almost impossible with a GPU.
       | 
       | A GPU cloud instance costs $1000+ per month (vs. $10 per month
       | for an AVX-512 CPU). A bargain GPU instance (e.g., Linode) costs
       | $1.50 per hour (and far more on AWS) but an AVX-512 CPU costs
       | maybe $0.02 per hour.
       | 
       | That said, GPUs are essential for training.
       | 
       | Note that Neural Magic's engine is completely closed-source, and
       | their python library communicates with the Neural Magic servers.
       | If you use their engine, your project survives only as long as
       | Neural Magic's business survives.
        
         | outlace wrote:
         | "That said, GPUs are essential for training"
         | 
         | http://learningsys.org/neurips19/assets/papers/18_CameraRead...
        
           | zekrioca wrote:
           | Read parent's (emphasis mine):
           | 
           | > Even without sparsification, AVX-512 CPUs are far more
           | cost-effective _for inference_.
        
         | ramzyo wrote:
         | > Note that Neural Magic's engine is completely closed-source,
         | and their python library communicates with the Neural Magic
         | servers. If you use their engine, your project survives only as
         | long as Neural Magic's business survives.
         | 
         | Could be totally off here, but have a feeling this team is
         | going to get scooped up by one of the big co's. Having deja vu
         | of XNOR.ai (purchased by Apple), when partners got the short
         | end of the stick (see Wyze cam saga:
         | https://www.theverge.com/2019/11/27/20985527/wyze-person-
         | det...)
        
         | shock-value wrote:
         | GCP lists a T4 which is suitable for inference for between
         | $0.11/hour and $0.35/hour (depending on commitment duration and
         | preemptibility).
         | 
         | https://cloud.google.com/compute/gpus-pricing
        
           | carbocation wrote:
           | Agreed - I priced this out for a specific distributed
           | inference task a few months ago and the T4 was cheaper and
           | faster than GPU.
           | 
           | On 26 million images using a pytorch model that had 41
           | million parameters, T4 instances were about 32% cheaper than
           | CPU instances, and took about 45% of the time even after
           | accounting for extra GPU startup time.
        
             | [deleted]
        
             | ml_hardware wrote:
             | T4 is a gpu :) NVIDIA Tesla T4: https://www.nvidia.com/en-
             | us/data-center/tesla-t4/
        
         | fxtentacle wrote:
         | I believe that's also why there are so many options for turning
         | fully trained TensorFlow graphs into C++ code.
         | 
         | You use the expensive GPU for building the AI, then dumb it
         | down for mass-deployment on cheap CPUs.
        
       | ramzyo wrote:
       | This is really impressive work. Would be interested to know
       | whether they plan to support ARM CPUs in their runtime engine
       | (currently looks like just a few AMD & Intel CPUs:
       | https://github.com/neuralmagic/deepsparse). Many IoT and embedded
       | devices with application-class processors could benefit from
       | these speedups.
        
         | ddv wrote:
         | "Is there plan to support sparse NN influence on mobile
         | devices, such as Arm CPUs?"
         | 
         | "Based on the amount of work for support, it's on our medium to
         | long-term roadmap currently."
         | 
         | https://github.com/neuralmagic/deepsparse/issues/183
        
           | monocasa wrote:
           | And no community pitching in to help with that either.
           | 
           | > Open sourcing of the backend runtime is something we're not
           | planning for the near to medium term. It is something we're
           | actively reevaluating, though, as the company and the
           | industry as a whole grows to see what works best for the
           | needs of our users.
        
       | c54 wrote:
       | What does sparsification mean here?
        
         | throwawaybanjo1 wrote:
         | https://docs.neuralmagic.com/main/source/getstarted.html#spa...
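          | 
          | Roughly, it means zeroing out a large fraction of the weights
          | so a sparse-aware runtime can skip that work. A minimal
          | sketch of unstructured magnitude pruning (not Neural Magic's
          | exact recipe, which prunes gradually during training):
          | 
          |     import torch
          |     from torch.nn.utils import prune
          | 
          |     layer = torch.nn.Linear(512, 512)
          |     # Zero the 80% of weights with the smallest magnitude.
          |     prune.l1_unstructured(layer, name="weight", amount=0.8)
          |     prune.remove(layer, "weight")  # make the zeros permanent
          |     sparsity = (layer.weight == 0).float().mean().item()
          |     print(f"weight sparsity: {sparsity:.0%}")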
        
       | chessgecko wrote:
        | Super interesting, but it would probably help the PyTorch
        | performance significantly on both the GPU and CPU if they
        | torchscripted the models. It's probably pretty simple given
        | that they already exported to ONNX, and it would be more
        | apples-to-apples.
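        | 
        | For instance (the file name is a placeholder), once the model
        | is exported to ONNX, the same graph can be timed on CPU through
        | onnxruntime as a like-for-like baseline:
        | 
        |     import numpy as np
        |     import onnxruntime as ort
        | 
        |     sess = ort.InferenceSession("yolov5s.onnx")
        |     name = sess.get_inputs()[0].name
        |     x = np.random.rand(1, 3, 640, 640).astype(np.float32)
        |     outputs = sess.run(None, {name: x})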
       | 
        | It'll be exciting to see what they can do with the A10 GPUs when
        | they're available.
        
       ___________________________________________________________________
       (page generated 2021-09-10 23:01 UTC)