[HN Gopher] Intel's Ponte Vecchio Xe-HPC GPU Boasts 100B Transis...
___________________________________________________________________
Intel's Ponte Vecchio Xe-HPC GPU Boasts 100B Transistors
Author : rbanffy
Score : 59 points
Date : 2021-03-26 09:09 UTC (1 day ago)
(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)
| [deleted]
| barkingcat wrote:
| This will probably be a nightmare for a consumer product.
|
| Too many components from too many different sources, with Intel
| doing the "integration".
|
| Doesn't this remind anyone of the engineering philosophy of the
| Boeing 787 Dreamliner? Have individual manufacturers build the
| component parts, then rely on just-in-time integration to do
| assembly and packaging at the end. If any individual manufacturer
| runs out of chips or components, or de-prioritizes production
| (for example, if Samsung or TSMC is ordered by Korea or Taiwan to
| prioritize chips for their automotive industries), that could
| lead to shortages that ripple down the assembly line for these
| Xe-HPC chips.
|
| Especially in today's world, when companies like Apple are
| constantly moving toward vertical integration and bringing
| external dependencies in-house (or at least have ironclad
| contracts mandating that partners satisfy their duties), this
| move by Intel is a step in the wrong direction for the post-
| COVID chip-shortage era.
| wmf wrote:
| Ponte Vecchio isn't a consumer product. In fact, I've long
| predicted that they'll only manufacture enough to satisfy the
| Aurora contract.
| rincebrain wrote:
| It would be... unfortunate for Intel to do something with
| so little volume again after Xeon Phi.
| marcodiego wrote:
| Nice. But without benchmarks, these numbers mean nothing.
| [deleted]
| rubyn00bie wrote:
| It kind of feels like this is just Intel's marketing machine. The
| chip is less impressive than the article makes it sound: Nvidia
| _shipped_ the A100 in 2020. This Intel chip doesn't even exist
| in a production system... and the A100 is already pretty damn
| close, hitting 624 TF (FP16, with sparsity) according to
| Nvidia's documentation, which is at least as accurate a source
| as an unreleased data center chip from Intel:
|
| https://www.nvidia.com/en-us/data-center/a100/
|
| I'd guess that by the time Intel actually ships anything useful,
| Nvidia will have made it mostly moot.
| Veedrac wrote:
| TFLOPS-equivalent-with-sparsity is not real TFLOPS. The article
| compared against the A100's 312 TFLOPS, which is much more
| reasonable.
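|
| A quick sanity check on those two numbers (the 624 figure
| assumes Nvidia's 2:4 structured sparsity, which doubles the
| dense rate; a rough sketch, not a benchmark):
|
|     # A100 FP16 tensor-core figures from Nvidia's datasheet
|     a100_fp16_dense_tflops = 312        # dense, no sparsity
|     a100_fp16_sparse_tflops = 312 * 2   # 2:4 structured sparsity
|     assert a100_fp16_sparse_tflops == 624
|     # An unqualified "1 PFLOPS" claim is best compared against
|     # the dense number, i.e. 312 TFLOPS.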
| my123 wrote:
| Intel didn't say what the TFLOPS number refers to at all; we
| just don't know anything beyond the headline figure.
| dragontamer wrote:
| We know the performance targets of the Aurora supercomputer
| though.
|
| The only way Intel reaches those performance targets is to
| outdo the current crop of GPUs: MI100 (from AMD) and A100
| (NVidia).
|
| Not that that guarantees we have a winner here, but at least
| we know the goal Intel is shooting for.
| my123 wrote:
| Given the _actual_ performance metrics that they gave for its
| Xe-HP cousin (for which Intel published no indication of FP64
| support at all), I'm inclined to believe that the 1 PF number
| is indeed some very ML-specific figure.
|
| https://cdn.mos.cms.futurecdn.net/BUsZ5EdKUcP8mWRKypTNB4-970...
|
| When excluding ML... (because that's what Intel gave actual
| metrics on for Xe-HP)
|
| 41 TFLOPS FP32 with 4 dies. For comparison, an RTX 3090 (arch
| whitepaper at https://www.nvidia.com/content/dam/en-
| zz/Solutions/geforce/a...) has 35.6 TFLOPS FP32, with a single
| die.
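|
| Dividing that out (a rough per-tile comparison using the numbers
| above; it ignores any multi-die scaling overhead):
|
|     xe_hp_fp32_tflops = 41.0      # 4-tile Xe-HP figure above
|     xe_hp_tiles = 4
|     rtx3090_fp32_tflops = 35.6    # single GA102 die (whitepaper)
|
|     print(xe_hp_fp32_tflops / xe_hp_tiles)  # ~10.25 TFLOPS per tile
|     print(rtx3090_fp32_tflops)              # 35.6 TFLOPS on one die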
| gigatexal wrote:
| "Intel usually considers FP16 to be the optimal precision for AI,
| so when the company says that that its Ponte Vecchio is a
| 'PetaFLOP scale AI computer in the palm of the hand,' this might
| mean that that the GPU features about 1 PFLOPS FP16 performance,
| or 1,000 TFLOPS FP16 performance. To put the number into context,
| Nvidia's A100 compute GPU provides about 312 TFLOPS FP16
| performance. "
|
| wow
| mrDmrTmrJ wrote:
| Manufacturing the chiplets independently appears to be an
| interesting approach to maximizing yields. If any one component
| has a defect, you just assemble using a different chiplet,
| rather than having it scrap the final product.
|
| Anyone know how this affects power, compute, or communications
| metrics compared to monolithic designs?
|
| Or am I off in thinking this approach maximizes yields?
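|
| For intuition, a minimal sketch of the yield argument, assuming
| a simple Poisson defect model (the defect density and die areas
| below are made-up illustrative numbers):
|
|     import math
|
|     def die_yield(area_mm2, defects_per_mm2):
|         """Fraction of defect-free dies under a Poisson model."""
|         return math.exp(-area_mm2 * defects_per_mm2)
|
|     D0 = 0.001        # hypothetical defects per mm^2
|     print(die_yield(800, D0))  # one 800 mm^2 monolithic die: ~0.45
|     print(die_yield(100, D0))  # each 100 mm^2 chiplet:       ~0.90
|
|     # A defective chiplet is discarded on its own, while a defect
|     # on the monolithic die scraps the whole 800 mm^2.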
| zamadatix wrote:
| See/compare AMD's CPUs from the last couple of years to Intel's
| - they use the chiplet approach, with up to 8+1 dies in their
| Epyc CPUs for example.
| baybal2 wrote:
| > Anyone know how this affects power, compute, or
| communications metrics compared to monolithic designs?
|
| It does affect them enormously, but everything is highly design
| specific.
|
| The die size limits are not only yield related.
|
| Power and clock stopped scaling a few generations ago.
|
| New chips have more and more disaggregated, independent blocks
| separated by asynchronous interfaces to accommodate more clock
| and power domains.
|
| If you have to break a chip along such a domain boundary, you
| lose little in terms of speed, unlike if you cut it right
| across registers, logic, and synchronous parallel links.
|
| Caches have stopped scaling too; making them bigger also
| makes them slower.
|
| Instead, more elaborate application-specific cache hierarchies
| are getting popular. L1-L2 get smaller and faster, but L3 can
| be built however one fancies: eDRAM, standalone SRAM, stacked
| memory, etc.
| LegitShady wrote:
| I don't think I've ever cared how many transistors were in
| something I purchased.
| varispeed wrote:
| I can't shake the feeling that buying anything with Intel today
| is like buying already-obsolete technology. Have I fallen too
| much under the influence of advertising etc., or is that valid
| to an extent? My laptop is currently 3 years old so I am looking
| for a replacement, and it seems like there is no point in buying
| anything right now apart from the M1, and AMD is out of stock
| everywhere. But even the latest AMD processors are not that
| great of an upgrade. So I am left with the M1, but I cannot
| support that company's politics, and my conclusion is that I am
| going to stick with my old laptop for the time being...
| wmf wrote:
| You're basically right. Tiger Lake-H and Alder Lake should
| catch up to AMD this year though.
| zokier wrote:
| Alder Lake is still only 8+8 big+small cores, while you can
| already get 16 big cores in a 5950X, with Zen 3 Threadrippers
| hopefully coming soon now that Milan is out. Feels like Intel
| has little to offer in competition.
| NathanielK wrote:
| 3 years isn't that old. Unless someone else is footing the
| bill, keep using what works for you. The 14nm Intel laptops
| haven't changed much in that time.
|
| Very small laptops with Intel Tiger Lake are on a level with
| AMD and Apple products. They have all the new IO bits (PCIe 4,
| LPDDR4x, WiFi 6) and low power usage on 10nm.
|
| If you want a bit more battery life or performance, or just
| want to try a fancier display, upgrading could be nice.
| xiphias2 wrote:
| I don't see Apple laptops having worse politics than other
| companies. On my iPad I feel a lot of the problems of the closed
| ecosystem, but M1 laptops are accessible enough for developers
| to work with (even though the hardware is sadly undocumented).
| bserge wrote:
| Performance aside, that thing looks beautiful
| choppaface wrote:
| The Cerebras prototype was about 0.86 PFLOPS (?) for a whole
| wafer (1T transistors), so this Intel chip looks like a
| potentially viable competitor at 1 PFLOPS for only 100B
| transistors (even if just FP16). I'm sure Intel will want to
| chase NVidia, but Cerebras is also a threat given it already has
| software support (TensorFlow, PyTorch, etc). Maybe I'm making an
| unfair comparison, but it looks like Ponte Vecchio would put
| Intel just above where Cerebras was a couple of years ago.
|
| https://www.nextbigfuture.com/2020/11/cerebras-trillion-tran...
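|
| Taking both headline numbers at face value (and they may not be
| the same precision), a rough FLOPS-per-transistor comparison:
|
|     cerebras_pflops, cerebras_trans = 0.86, 1.0e12  # ~1T, as cited
|     pv_pflops, pv_trans = 1.0, 100e9  # Intel's "PetaFLOP" claim
|
|     # PFLOPS per trillion transistors
|     print(cerebras_pflops / cerebras_trans * 1e12)  # 0.86
|     print(pv_pflops / pv_trans * 1e12)              # 10.0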
| aokiji wrote:
| Let's not forget all of the Intel backdoors that were exploited
| and forced us to use patched hardware with lower performance
| than what was advertised.
| caycep wrote:
| Actually - regardless of the performance of this, and perhaps
| this is orthogonal to their GPU - with the global crunch in
| chips/GPUs, would this be a natural market space for Intel to
| compete in, especially with the new foundry services? I would
| imagine there is a lot of business to be had from Nvidia/AMD for
| GPUs... assuming the mining boom holds up.
| wmf wrote:
| Intel has the same capacity shortage as everyone else and GPUs
| actually seem pretty cheap (i.e. less profitable) given their
| large dies.
| onli wrote:
| Intel with their own production facilities seems to manage
| the shortage better than everyone else. Their product may be
| worse, but their supply situation has been consistently
| better since December.
| Google234 wrote:
| Very cool! I'm looking forward to seeing how it performs.
| cs702 wrote:
| NVIDIA's hardware and software (CUDA) badly need competition in
| this space -- from Intel, from AMD, from anyone, please.
|
| If anyone at Intel is reading this, please consider releasing all
| Ponte Vecchio drivers under a permissive open-source license; it
| would facilitate and encourage faster adoption.
| dogma1138 wrote:
| Intel's OneAPI is already miles ahead of AMD's ROCm, which is
| pretty awesome.
| zepmck wrote:
| When? Where? How can it be miles ahead if the hardware has
| not been released yet?
| baybal2 wrote:
| Yes, seconding that.
|
| What's the point of using OneAPI, yet another compute API
| wrapper, to make software for just a single platform?
|
| You can just use regular computing libs and C or C++.
|
| Serious HPC will still stick with its own serious HPC stuff -
| super-optimised C and Fortran code - no matter how labour
| intensive it is.
|
| So I see very little point in it.
| dogma1138 wrote:
| OneAPI is already cross-platform through Codeplay's
| implementation, which can also run on NVIDIA GPUs; its
| whole point is to be an open, cross-platform framework that
| targets a wide range of hardware.
|
| Whether it will be successful or not is up in the air, but
| its goals are pretty solid.
| my123 wrote:
| So basically, a thing that will provide first-class
| capabilities only on Intel hardware, and elsewhere won't
| really be optimised for maximum performance or expose all
| the underlying capabilities of the hardware.
| pjmlp wrote:
| Now they need to catch up with polyglot CUDA eco-system.
| johnnycerberus wrote:
| I really don't get this push to polyglot programming when
| 99% of the high-performance libraries use C++. What's more,
| oneAPI has DPC++, SPIR-V has SYCL, and CUDA is even building
| a heterogeneous C++ standard library supporting both CPU and
| GPU, libcu++. Seriously now, how many people from the JVM or
| CLR world actually need this level of performance? How many
| actually push kernels to the GPU from these runtimes? I have
| yet to see a programming language that will replace C++ at
| what it does best. Maybe Zig, because it is streamlined and
| easier to get into, will be a true contender to C++ in HPC,
| but only time will tell.
| pjmlp wrote:
| Enough people to keep a couple of companies in business,
| and to have NVidia doing collaboration projects with
| Microsoft and Oracle; HPC is not the only market for CUDA.
| bionhoward wrote:
| Whenever I hit AI limits, it's due to memory. That's why
| I would argue the future of AI is Rust, not C++. Memory
| efficiency matters!
| jacques_chester wrote:
| > _Seriously now, how many people from JVM or CLR world
| actually need this level of high performance?_
|
| The big data ecosystem is Java-centric.
| johnnycerberus wrote:
| Indeed it is, but the developers in these ecosystems
| created complements like Apache Arrow that offload the
| data into a language-independent columnar memory format
| for efficient analytics in services that run C++ on
| clusters of CPUs and GPUs. Even Spark has recently had its
| analytics engine rewritten in C++. These were created
| because of the limitations of the JVM. We have tried to
| move numerical processing away from C++ over the past
| decades, but we have always failed.
| jacques_chester wrote:
| You asked who in the JVM world would be interested in
| this kind of performance: that's big data folks. To the
| extent that improvements accrue to the JVM they accrue to
| that world without needing to rewrite into C++.
| dogma1138 wrote:
| Finance too: large exchanges with microsecond latency
| have their core systems written in Java; CME Globex and
| EBS/BrokerTec are written in Java.
| spijdar wrote:
| Sadly, that's not a very high bar to set...
| xiphias2 wrote:
| CUDA is not as important as TensorFlow, PyTorch and JAX support
| at this point. Those frameworks are what people code against,
| so having high-quality backends for them is more important
| than the drivers themselves.
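|
| Most user code only touches the framework's device abstraction,
| so which vendor backend does the work is (ideally) a one-line
| switch. A hypothetical sketch in PyTorch terms (the non-CUDA
| device names depend on what each vendor actually ships):
|
|     import torch
|
|     # Model code targets the framework, not the vendor API.
|     device = torch.device("cuda" if torch.cuda.is_available()
|                           else "cpu")
|     x = torch.randn(1024, 1024, device=device)
|     w = torch.randn(1024, 1024, device=device)
|     y = x @ w   # dispatched to whichever backend "device" names
|
|     # If Intel ships a solid backend, swapping the device string
|     # (e.g. a hypothetical "xpu") is most of the user-visible work.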
| elihu wrote:
| The One-API and OpenCL implementations, Intel Graphics
| Compiler, and Linux driver are all open source. Ponte Vecchio
| support just hasn't been publicly released yet.
|
| https://github.com/intel/compute-runtime
|
| https://github.com/intel/intel-graphics-compiler
|
| https://github.com/torvalds/linux/tree/master/drivers/gpu/dr...
| zepmck wrote:
| One-API is not completely open source. Support for Ponte
| Vecchio will not be released as open source, for many reasons.
| elihu wrote:
| I don't have specific knowledge of Ponte Vecchio in
| particular, so I'll defer to you if you have such info. The
| support for their mainstream GPU products is open source,
| though.
| nine_k wrote:
| Where to find more details?
| pjmlp wrote:
| One-API focuses too much on C++ (SYCL + Intel's own stuff),
| while OpenCL is all about C.
|
| CUDA is polyglot, with very nice graphical debuggers that can
| even single-step shaders.
|
| Something that the anti-CUDA crowd keeps forgetting.
| UncleOxidant wrote:
| oneAPI support in Julia:
| https://github.com/JuliaGPU/oneAPI.jl
| pjmlp wrote:
| Nice to know, thanks.
| dogma1138 wrote:
| CUDA's biggest advantage over OpenCL, other than not being a
| camel, was its C++ support, which is still the main language
| in use for CUDA in production. I doubt FORTRAN was the reason
| why CUDA got to where it is; C++, on the other hand, had quite
| a lot to do with it during the initial days when OpenCL was
| still stuck in OpenGL C-land.
|
| NVIDIA also understood early on the importance of first-party
| libraries and commercial partnerships, something Intel also
| understands, which is why OneAPI already has wider adoption
| than ROCm.
| pjmlp wrote:
| CUDA supports many more languages than just C++ and
| Fortran.
|
| .NET, Java, Julia, Python (RAPIDS/cuDF), and Haskell don't
| have a place on OneAPI so far.
|
| And yes, going back to C++, the hardware is based on the
| C++11 memory model (which was based on the Java/.NET models).
|
| So there's plenty of stuff to catch up on, besides "we can
| do C++".
| dragandj wrote:
| How does CUDA support any of these (.NET, Java, etc.)?
| It's the first time I've heard this claim. There are 3rd-
| party wrappers in Java, .NET, etc. that call CUDA's C++
| API, and that's all. Equivalent APIs exist for OpenCL
| too...
| my123 wrote:
| The CUDA runtime takes the PTX intermediate language as
| input.
|
| The toolkit ships with compilers from C++ and Fortran to
| NVVM, and provides documentation about the PTX
| virtual machine at https://docs.nvidia.com/cuda/parallel-
| thread-execution/index... and about the higher-level NVVM
| (which compiles down to PTX) at
| https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html.
| navaati wrote:
| Oooh, I didn't know PTX was an intermediate
| representation and explicitly documented as such; I
| really thought it was the actual assembly run by the
| chips...
| my123 wrote:
| You can get the GPU-targeted assembly (sometimes called
| SASS by NVIDIA) by compiling specifically for a given GPU
| and then using nvdisasm; there is also a very terse
| definition of the underlying instruction set in the docs
| (https://docs.nvidia.com/cuda/cuda-binary-
| utilities/index.htm...).
|
| But it's one-way only: NVIDIA ships a disassembler but
| explicitly doesn't ship an assembler.
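|
| A minimal sketch of that workflow (assuming the CUDA toolkit is
| on PATH and some "kernel.cu" exists; sm_80 is just an example
| target):
|
|     import subprocess
|
|     # Compile for one specific GPU architecture...
|     subprocess.run(["nvcc", "-arch=sm_80", "-cubin", "kernel.cu",
|                     "-o", "kernel.cubin"], check=True)
|     # ...then disassemble the cubin into SASS with nvdisasm.
|     sass = subprocess.run(["nvdisasm", "kernel.cubin"], check=True,
|                           capture_output=True, text=True).stdout
|     print(sass.splitlines()[:10])  # native SASS; no official assembler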
| The_rationalist wrote:
| https://github.com/NVIDIA/grcuda
| dogma1138 wrote:
| There are Java and C# compilers for CUDA such as JCUDA
| and http://www.altimesh.com/hybridizer-essentials/ but
| the CUDA runtime, libraries and first-party compiler
| only support C/C++ and Fortran; for Python you need to
| use something like Numba.
|
| Most non-C++ frameworks and implementations, though, would
| simply use wrappers and bindings.
|
| I'm also not aware of any high-performance lib for CUDA
| that wasn't written in C++.
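|
| For reference, the Numba route looks roughly like this (a
| minimal sketch; Numba JIT-compiles the Python kernel down to
| PTX, which the CUDA driver then loads):
|
|     import numpy as np
|     from numba import cuda
|
|     @cuda.jit
|     def add_kernel(x, y, out):
|         i = cuda.grid(1)          # global thread index
|         if i < out.size:
|             out[i] = x[i] + y[i]
|
|     n = 1 << 20
|     x = np.ones(n, dtype=np.float32)
|     y = np.ones(n, dtype=np.float32)
|     out = np.zeros(n, dtype=np.float32)
|
|     threads = 256
|     blocks = (n + threads - 1) // threads
|     add_kernel[blocks, threads](x, y, out)  # arrays copied to/from GPU
|     print(out[:4])                          # [2. 2. 2. 2.]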
| pjmlp wrote:
| "Hybridizer: High-Performance C# on GPUs"
|
| https://developer.nvidia.com/blog/hybridizer-csharp/
|
| "Simplifying GPU Access: A Polyglot Binding for GPUs with
| GraalVM"
|
| https://developer.nvidia.com/gtc/2020/video/s21269-vid
|
| And then you can browse for products on
| https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
| Cent...
| dogma1138 wrote:
| Hybridizer simply generates CUDA C++ code from C#, which
| is then compiled to PTX; it also does the same for AVX,
| which you can then compile with Intel's compiler or gcc.
| It's not particularly good: you often need to debug the
| generated CUDA source code yourself, and it doesn't
| always play well with the CUDA programming model,
| especially its more advanced features.
|
| And again, it's a commercial product developed by a 3rd
| party; while some people use it, I wouldn't even count it
| as a rounding error when accounting for why CUDA has the
| market share it has.
| pjmlp wrote:
| It is like everyone arguing about C++ for AAA studios, as
| if everyone were making Crysis and Fortnite clones, while
| forgetting the legions of people making money selling A
| games.
|
| Or forgetting the days when games written in C were
| actually full of inline assembly.
|
| It is still CUDA, regardless of whether it goes through
| PTX or CUDA C++ as an implementation detail for the high-
| level code.
| my123 wrote:
| https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/
| com... goes a level above:
|
| "Alternatively you can let the virtual machine (VM) make
| this decision automatically by setting a system property
| on the command line. The JIT can also offload certain
| processing tasks based on performance heuristics."
|
| A lot of what ultimately limits GPUs today is that they
| are connected over a relatively slow bus (PCIe); this
| will change in the future, allowing smaller and smaller
| tasks to be offloaded.
| The_rationalist wrote:
| In addition, grCUDA is a breakthrough that enables interop
| with many more languages, such as Ruby, R, JS (soon
| Python), etc.: https://github.com/NVIDIA/grcuda
___________________________________________________________________
(page generated 2021-03-27 23:00 UTC)