[HN Gopher] YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Per...
___________________________________________________________________
YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance
Author : T-A
Score : 78 points
Date : 2021-09-10 17:04 UTC (5 hours ago)
(HTM) web link (neuralmagic.com)
(TXT) w3m dump (neuralmagic.com)
| ThouYS wrote:
| I guess pretty soon we'll wonder how anyone was ever ok with
| using raw non-sparsified models
| fwsgonzo wrote:
| I would love to use this in a "special setting", but it has to
| be compiled, static code. So if there is an implementation for
| C, C++, Rust, or any other language, that would be great! I am
| more than willing to handle the special build myself. I just
| need a language that compiles to machine code.
| codetrotter wrote:
| Is the special setting running it on iOS and distributing it
| via the Apple App Store? :)
| spullara wrote:
| I don't know a lot about this but I think you can use this to
| remove the Python dependency:
|
| https://pytorch.org/docs/stable/jit.html
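|
| Roughly (a minimal sketch with a torchvision model standing in,
| since I haven't tried it on YOLOv5 specifically):
|
|       import torch, torchvision
|
|       model = torchvision.models.resnet18(pretrained=True).eval()
|       example = torch.randn(1, 3, 224, 224)
|       # trace (or torch.jit.script) to get a Python-free artifact
|       scripted = torch.jit.trace(model, example)
|       scripted.save("model.torchscript.pt")
|       # the saved archive can be loaded from C++ with libtorch:
|       # torch::jit::load("model.torchscript.pt")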
| enricozb wrote:
| Tangentially related: There is some controversy around YOLOv5's
| name: https://github.com/ultralytics/yolov5/issues/2
| deepnotderp wrote:
| I'm pretty sure this isn't using the Tensor cores on the GPU.
|
| If you look here
| (https://github.com/ultralytics/yolov5/blob/master/README.md),
| the inference speed on a V100 for YOLOv5s should be 2 ms per
| image, or 500 img/s, not the 44.6 img/s being reported here.
|
| This is important as it is more than an order of magnitude off.
| ml_hardware wrote:
| My guess is they _are_ using tensor cores as they report FP16
| throughput, but they seem to be measuring at batch size 1,
| which is hugely unfair to the GPUs.
|
| For inference workloads you usually batch incoming requests
| together and run once on GPU (though this increases latency). A
| latency/throughput tradeoff curve at different batch sizes
| would tell the whole story.
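|
| Something like this would be the kind of sweep I mean (rough
| sketch, with a torchvision model standing in for YOLOv5s):
|
|       import time, torch, torchvision
|
|       model = torchvision.models.resnet50().cuda().eval().half()
|       for bs in (1, 4, 16, 32):
|           x = torch.randn(bs, 3, 640, 640, device="cuda").half()
|           torch.cuda.synchronize()
|           t0 = time.time()
|           with torch.no_grad():
|               for _ in range(50):
|                   model(x)
|           torch.cuda.synchronize()
|           dt = (time.time() - t0) / 50
|           print(f"bs={bs} latency={dt*1e3:.1f}ms "
|                 f"throughput={bs/dt:.0f} img/s")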
|
| Also, they are using INT8 on the CPU but neglect to measure the
| same on the GPU; INT8 would roughly double all the GPU
| throughputs.
|
| tl;dr just use GPUs
|
| Edit: to the comments below, I agree low latency can be
| important for some workloads, but that's exactly why I think we
| need to see a latency-throughput tradeoff curve.
|
| Unfortunately, I'm pretty sure that modern GPUs
| (A10/A30/A40/A100) basically dominate CPUs even when latency is
| constrained, and the MLPerf results give a good (fair!)
| comparison of this:
|
| https://developer.nvidia.com/blog/extending-nvidia-performan...
|
| The GPU throughputs are much, much higher than the CPU ones,
| and I don't think even NM's software can overcome this gap. Not
| to mention they degrade the model quality...
|
| The last question is whether CPUs are more cost-effective
| despite being slower, and the answer is still... no. The
| instances used in this blog post cost:
|
| - C5 CPU (c5.12xlarge): $2.04/hr
| - T4 GPU (g4dn.2xlarge): $0.752/hr
|
| NM's best result @ batch-size-1 costs more, at lower
| throughput, at lower model quality, at ~same latency, than a
| last-gen GPU operating at half its capacity. A new A10 GPU
| using INT8 will widen the perf/$ gap by another ~4x.
|
| Also full disclosure I don't work at NVIDIA or anything like
| that so I'm not trying to shill :) I just like ML hardware a
| lot and want to help people make fair comparisons.
| bigcorp-slave wrote:
| It depends on your application. If you are running on a
| smartphone, or on an AR headset, or on a car, or on a camera,
| etc, you generally do not have the latency budget to wait for
| multiple frames and run at high batch size.
| 37ef_ced3 wrote:
| Batch size 1 improves latency, especially for
| businesses/services with fewer users. Latency matters.
|
| Also, your CPU cost numbers are way off, using an expensive
| provider like AWS instead of, say, Vultr
| (https://www.vultr.com)
|
| And many businesses/services can't saturate the hardware you
| describe. It's just too much compute power. With CPUs you can
| scale down to fit your actual needs: all the way down to a
| single AVX-512 core doing maybe 24 inferences per second
| (costing a few dollars PER MONTH).
| deepnotderp wrote:
| V100 GPUs have non-tensor-core fp16 operations too, I think.
| woadwarrior01 wrote:
| Yes. Non-tensor-core fp16 ops are the default. Tensor cores
| are essentially 4x4 fp16 MAC units, and for them to be used
| the matrix dimensions need to be multiples of 8 [1].
|
| [1]:
| https://docs.nvidia.com/deeplearning/performance/mixed-
| preci...
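|
| (E.g., a rough illustration: an fp16 matmul like the one below
| should be tensor-core eligible because every dimension is a
| multiple of 8.)
|
|       import torch
|
|       a = torch.randn(64, 512, device="cuda",
|                       dtype=torch.float16)
|       b = torch.randn(512, 1024, device="cuda",
|                       dtype=torch.float16)
|       c = a @ b   # 64/512/1024 are all multiples of 8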
| ml_hardware wrote:
| That's true... in fact, seeing V100 FP16 < T4 FP16 makes me
| believe you're right; the V100 should be much faster if the
| tensor cores were being used.
| baybal2 wrote:
| I told you! :)
|
| CPUs keep throwing specialised hardware into the garbage bag
| every time, a few years after mass adoption of something big,
| as the computer science improves.
|
| People once said that real-time audio was physically impossible
| to do on the CPU; now you can do it even on a smartphone CPU.
| monocasa wrote:
| Some of that is figuring out the software overhead too. Alexia
| Massalin was doing audio DSP on her 68k workstation back in the
| late 80s with her Synthesis kernel. Overall, it seems that
| multiplexing with the right semantics is the majority of the
| problem in running new workloads on general-purpose CPUs, and
| it takes time to understand.
| hrydgard wrote:
| CPUs haven't replaced GPUs for graphics though, and are very
| far away from doing it...
| 37ef_ced3 wrote:
| Even without sparsification, AVX-512 CPUs are far more cost-
| effective for inference.
|
| To get your money's worth, you must saturate the processor. It's
| easy, in practice, with an AVX-512 CPU (e.g., see
| https://NN-512.com or Fabrice Bellard's
| https://bellard.org/libnc), and almost impossible with a GPU.
|
| A GPU cloud instance costs $1000+ per month (vs. $10 per month
| for an AVX-512 CPU). A bargain GPU instance (e.g., Linode) costs
| $1.50 per hour (and far more on AWS) but an AVX-512 CPU costs
| maybe $0.02 per hour.
|
| That said, GPUs are essential for training.
|
| Note that Neural Magic's engine is completely closed-source, and
| their python library communicates with the Neural Magic servers.
| If you use their engine, your project survives only as long as
| Neural Magic's business survives.
| outlace wrote:
| "That said, GPUs are essential for training"
|
| http://learningsys.org/neurips19/assets/papers/18_CameraRead...
| zekrioca wrote:
| Read the parent's comment (emphasis mine):
|
| > Even without sparsification, AVX-512 CPUs are far more
| cost-effective _for inference_.
| ramzyo wrote:
| > Note that Neural Magic's engine is completely closed-source,
| and their python library communicates with the Neural Magic
| servers. If you use their engine, your project survives only as
| long as Neural Magic's business survives.
|
| Could be totally off here, but I have a feeling this team is
| going to get scooped up by one of the big co's. I'm having deja
| vu of XNOR.ai (purchased by Apple), whose partners got the
| short end of the stick (see the Wyze cam saga:
| https://www.theverge.com/2019/11/27/20985527/wyze-person-
| det...)
| shock-value wrote:
| GCP lists a T4 which is suitable for inference for between
| $0.11/hour and $0.35/hour (depending on commitment duration and
| preemptibility).
|
| https://cloud.google.com/compute/gpus-pricing
| carbocation wrote:
| Agreed - I priced this out for a specific distributed
| inference task a few months ago and the T4 was cheaper and
| faster than GPU.
|
| On 26 million images using a pytorch model that had 41
| million parameters, T4 instances were about 32% cheaper than
| CPU instances, and took about 45% of the time even after
| accounting for extra GPU startup time.
| [deleted]
| ml_hardware wrote:
| T4 is a gpu :) NVIDIA Tesla T4: https://www.nvidia.com/en-
| us/data-center/tesla-t4/
| fxtentacle wrote:
| I believe that's also why there are so many options for turning
| fully trained TensorFlow graphs into C++ code.
|
| You use the expensive GPU for building the AI, then dumb it
| down for mass-deployment on cheap CPUs.
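|
| On the PyTorch side, the rough equivalent (sketch, with a
| torchvision model standing in) is exporting to ONNX and running
| it with a CPU runtime:
|
|       import torch, torchvision
|       import onnxruntime as ort
|
|       model = torchvision.models.resnet18(pretrained=True).eval()
|       dummy = torch.randn(1, 3, 224, 224)
|       torch.onnx.export(model, dummy, "model.onnx")
|
|       sess = ort.InferenceSession(
|           "model.onnx", providers=["CPUExecutionProvider"])
|       inp = {sess.get_inputs()[0].name: dummy.numpy()}
|       out = sess.run(None, inp)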
| ramzyo wrote:
| This is really impressive work. Would be interested to know
| whether they plan to support ARM CPUs in their runtime engine
| (currently looks like just a few AMD & Intel CPUs:
| https://github.com/neuralmagic/deepsparse). Many IoT and embedded
| devices with application-class processors could benefit from
| these speedups.
| ddv wrote:
| "Is there plan to support sparse NN influence on mobile
| devices, such as Arm CPUs?"
|
| "Based on the amount of work for support, it's on our medium to
| long-term roadmap currently."
|
| https://github.com/neuralmagic/deepsparse/issues/183
| monocasa wrote:
| And no community pitching in to help with that either.
|
| > Open sourcing of the backend runtime is something we're not
| planning for the near to medium term. It is something we're
| actively reevaluating, though, as the company and the
| industry as a whole grows to see what works best for the
| needs of our users.
| c54 wrote:
| What does sparsification mean here?
| throwawaybanjo1 wrote:
| https://docs.neuralmagic.com/main/source/getstarted.html#spa...
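|
| The short version (toy illustration of magnitude pruning, not
| their actual recipe): zero out the lowest-magnitude weights,
| then fine-tune so accuracy recovers.
|
|       import torch
|
|       w = torch.randn(512, 512)     # a dense weight matrix
|       k = int(0.90 * w.numel())     # target 90% sparsity
|       thresh = w.abs().flatten().kthvalue(k).values
|       mask = w.abs() > thresh
|       w_pruned = w * mask           # ~90% of entries are now zero
|       print(f"sparsity: {(1 - mask.float().mean()).item():.2f}")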
| chessgecko wrote:
| Super interesting, but it would probably help the PyTorch
| performance significantly on both the GPU and CPU if they
| TorchScripted the models. It's probably pretty simple given
| they already exported to ONNX, and it would be more apples to
| apples.
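|
| Something like this (sketch, with a stand-in model) before
| timing the PyTorch baseline:
|
|       import torch, torchvision
|
|       m = torchvision.models.resnet50().eval()
|       x = torch.randn(1, 3, 640, 640)
|       ts = torch.jit.trace(m, x)   # TorchScript module
|       with torch.no_grad():
|           y = ts(x)                # benchmark this instead of m(x)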
|
| It'll be exciting to see what they can do with the A10 GPUs
| when they're available.
___________________________________________________________________
(page generated 2021-09-10 23:01 UTC)