[HN Gopher] AMD NPU and Xilinx Versal AI Engines Signal Processi...
___________________________________________________________________
AMD NPU and Xilinx Versal AI Engines Signal Processing in Radio
Astronomy (2024) [pdf]
Author : transpute
Score : 57 points
Date : 2025-04-13 11:16 UTC (11 hours ago)
(HTM) web link (git.astron.nl)
(TXT) w3m dump (git.astron.nl)
| CheeksTheGeek wrote:
| WHY ARE VERSAL BOARDS SO EXPENSIVE (i had to rant somewhere)
|
| I'm waiting for a cost reduction like the one that eventually
| hit UltraScale+ devices, where we finally got something like
| the ZUBoard.
| fargle wrote:
| Versal "edge" VE2302 boards are coming from multiple vendors,
| with much better pricing.
|
| i'm guessing they will be available in a month or so - they are
| supposed to be "Q2" but seem to be a little bit late (as is
| typical).
| fecal_henge wrote:
| Are the boards significantly more expensive than the devices?
| transpute wrote:
| We need something like this $160 ZUBoard as an entry point to the
| 5-figure Zynq market, https://news.avnet.com/press-releases/press-release-
| details/...
|
| _> the smallest, lowest power, and most cost-optimized
| member of the Zynq UltraScale+ family.. jump-start.. MPSoC-
| based end systems like miniaturized, compute-intensive edge
| applications in industrial and healthcare IoT systems,
| embedded vision cameras, AV-over-IP 4K and 8K-ready
| streaming, hand-held test equipment, consumer, medical
| applications and more.. board is ideal for design engineers,
| software engineers, system architects, hobbyists, makers and
| even students_
| echelon wrote:
| Can someone with more knowledge of AMD explain if these are
| useful for real AI work? Without CUDA does it feel like working
| in the dark ages?
| KeplerBoy wrote:
| They are useful for AI, but it's a completely different beast
| than a GPU.
| _sbrk wrote:
| F = Field
| P = Programmable
| G = Gate   <---- important
| A = Array
|
| You aren't "programming", you're "wiring gates together". In
| other words, you can build custom hardware to solve a problem
| without using a generic CPU (or GPU) to do it. FPGAs are
| implemented as a fabric of LUTs (Look-up Tables) which take
| 4 or 6 (or more) inputs and produce an output. That allows
| Boolean algebra functions to be implemented. The tools you use
| (Vivado / ISE / Yosys / etc.) take your intended design,
| written in an HDL (Hardware Description Language) such as
| Verilog or VHDL, and turn it into a configuration file which is
| loaded into the FPGA, configuring it into the hardware you want
| (if you've done it right). FPGAs are a
| stepping stone between generic hardware such as a CPU or GPU
| and a custom ASIC. They win when you can express the problem
| in specialized hardware much better than writing code to do
| something on a CPU/GPU. Parallelization is the key to many
| FPGA designs. Also, you don't have to spend >$1MM on a mask
| set to go have an ASIC fabricated by TSMC, etc.
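|
| Concretely, a 4-input LUT is just a 16-entry truth table, so any
| Boolean function of up to 4 inputs fits in one LUT. A toy Python
| model (purely illustrative, nothing vendor-specific):
|
|   def lut4(init, a, b, c, d):
|       # The four inputs form a 4-bit address into a
|       # 16-bit INIT value.
|       index = (d << 3) | (c << 2) | (b << 1) | a
|       return (init >> index) & 1
|
|   # Build the INIT value whose table equals (a AND b) XOR (c OR d).
|   init = 0
|   for idx in range(16):
|       a = idx & 1
|       b = (idx >> 1) & 1
|       c = (idx >> 2) & 1
|       d = (idx >> 3) & 1
|       init |= ((a & b) ^ (c | d)) << idx
|
|   assert lut4(init, 1, 1, 0, 0) == 1   # (1 AND 1) XOR (0 OR 0) = 1
|   assert lut4(init, 1, 1, 1, 0) == 0   # (1 AND 1) XOR (1 OR 0) = 0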
| OneDeuxTriSeiGo wrote:
| It depends where you get them from. A lot of the dev boards
| have extra tooling and of course a healthy chunk of "dev tax"
| unfortunately. Luckily you can find much more barebones boards
| available if you know where to look.
|
| https://www.en.alinx.com/Product/SoC-Development-Boards/Vers...
| _sbrk wrote:
| It's not just that the boards are expensive; you'll also need
| a Vivado license to create any designs for them. That license
| is at least several thousand dollars for the Versal devices.
| transpute wrote:
| It's taken many years of reverse engineering, but there's
| now an efficient OSS toolchain for the smaller Artix-7 FPGA
| family, https://antmicro.com/blog/2020/05/multicore-vex-in-
| litex/
| tux3 wrote:
| This blog post doesn't seem to talk about the OSS toolchain;
| LiteX/VexRiscv are very neat, but they don't replace Vivado,
| right?
| transpute wrote:
| Like all open-source, it's an ongoing effort. Bunnie has
| a comparison,
| https://www.bunniestudios.com/blog/2017/litex-vs-vivado-
| firs...
|
| _> Thanks to the extensive work of the MiSoC and LiteX
| crowd, there's already IP cores for DRAM, PCI express,
| ethernet, video, a softcore CPU (your choice of or1k or
| lm32) and more.. LiteX produces a design that uses about
| 20% of an XC7A50 FPGA with a runtime of about 10 minutes,
| whereas Vivado produces a design that consumes 85% of the
| same FPGA with a runtime of about 30-45 minutes.. LiteX,
| in its current state, is probably best suited for people
| trained to write software who want to design hardware,
| rather than for people classically trained in circuit
| design who want a tool upgrade._
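|
| As a flavour of what "software people designing hardware" means
| here: in LiteX the hardware itself is described in Python via
| Migen. A minimal sketch (illustrative only, assuming the current
| Migen API; the Blinker module and its parameters are made up,
| not from bunnie's post):
|
|   from migen import Module, Signal, If
|
|   class Blinker(Module):
|       def __init__(self, led, period=50_000_000):
|           counter = Signal(max=period)
|           # Synchronous logic: count clock cycles and toggle
|           # the LED once per period.
|           self.sync += If(counter == period - 1,
|                           counter.eq(0),
|                           led.eq(~led),
|                        ).Else(
|                           counter.eq(counter + 1),
|                        )
|
|   led = Signal()
|   blinker = Blinker(led)   # elaborated by the LiteX/Migen flow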
| oasisaimlessly wrote:
| I think transpute likely meant to link F4PGA[1] or one of
| the projects it makes use of (Yosys, nextpnr, Project
| IceStorm, Project X-Ray, etc).
|
| [1] https://f4pga.org/
| transpute wrote:
| Thanks for the pointer! DARPA ERI investment was
| initially directed to US academic teams, while Yosys &
| related decentralized OSS efforts were barely running on
| conviction fumes in the OSS wilderness. Glad to see this
| umbrella ecosystem structure from LF Chips Alliance. Next
| we need a cultural step change in commercial EDA tools.
| _sbrk wrote:
| Artix 7 is simplistic compared to any of the Versal
| chips. You buy an expensive FPGA and then try using an
| "open-source" tool chain that exposes 25% of the FPGA's
| potential. Not a great trade-off, eh?
| imtringued wrote:
| The Versal AI edge SOMs are mildly overpriced. The boards are
| worth it, but in the embedded space Nvidia is offering the
| cheapest solutions so an FPGA based application will always
| need to justify the additional cost for slightly worse
| performance, by arguing that the application has latency
| requirements that a GPU cannot help with.
|
| GPUs tend to perform worse when you have small batches and
| frequent kernel launches. This is especially annoying in cases
| where a simple grid-wide synchronization barrier would solve
| your problem, but CUDA expects you not to synchronize like that
| within a kernel; you're supposed to launch a sequence of
| kernels one after the other. That's not a good solution when a
| for loop over n iterations turns into n kernel calls.
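|
| To illustrate the launch-per-iteration pattern: without a
| cooperative launch there is no grid-wide barrier inside a
| kernel, so the host loop itself becomes the synchronization
| point. A rough CuPy sketch (the toy smoothing kernel and all
| names are made up for illustration):
|
|   import cupy as cp
|
|   # Toy 1D smoothing step; each launch is one iteration.
|   step = cp.RawKernel(r'''
|   extern "C" __global__
|   void step(const float* x_in, float* x_out, int n) {
|       int i = blockDim.x * blockIdx.x + threadIdx.x;
|       if (i > 0 && i < n - 1)
|           x_out[i] = 0.5f * (x_in[i - 1] + x_in[i + 1]);
|   }
|   ''', 'step')
|
|   n = 1 << 20
|   x = cp.random.rand(n).astype(cp.float32)
|   y = x.copy()
|   threads = 256
|   blocks = (n + threads - 1) // threads
|
|   # A for loop over n_iter iterations becomes n_iter kernel
|   # launches: the launch boundary is the grid-wide sync point.
|   for _ in range(100):
|       step((blocks,), (threads,), (x, y, cp.int32(n)))
|       x, y = y, x   # ping-pong buffers between launches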
| almostgotcaught wrote:
| The title is editorialized: this has nothing to do with the NPU
| (the term does not appear in the PDF), which is the term of art
| for the version of these cores sold in laptops.
| OneDeuxTriSeiGo wrote:
| The Versal AI Engine is the NPU. The Ryzen CPUs' NPU is almost
| exactly a Versal AI Engine IP block, to the point that in the
| Linux kernel they share the same driver (amdxdna), and the
| reference material the kernel docs link to for the Ryzen NPUs
| is the Versal SoC's AI Engine architecture reference manual.
|
| https://docs.kernel.org/next/accel/amdxdna/amdnpu.html
| transpute wrote:
| The authors ported their software between near-identical AMD
| AIE and NPU platforms, https://www.hackster.io/tina/tina-running-non-
| nn-algorithms-...
|
| _> The PFB is found in many different application domains such
| as radio astronomy, wireless communication, radar, ultrasound
| imaging and quantum computing.. the authors worked on the
| evaluation of a PFB on the AIE.. [developing] a performant
| dataflow implementation.. which made us curious about the AMD
| Ryzen NPU.
|
| > The [NPU] PFB figure shows.. speedup of circa 9.5x compared
| to the Ryzen CPU.. TINA allows running a non-NN algorithm on
| the NPU with just two extra operations or approximately 20
| lines of added code.. on [Nvidia] GPUs CUDA memory is a
| limiting factor.. This limitation is alleviated on the AMD
| Ryzen NPU since it shares the same memory with the CPU
| providing up to 64GB of memory._
|
| Consumer Ryzen NPU hardware is more accessible to students and
| hackers than industrial Versal AIE products.
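|
| For anyone who hasn't met the PFB (polyphase filter bank)
| mentioned in the quote: it is essentially a windowed, multi-tap
| FFT channelizer. A rough NumPy sketch of the usual
| weighted-overlap-add structure (sizes and names are
| illustrative, not taken from the paper):
|
|   import numpy as np
|
|   def pfb(x, n_chan=512, n_taps=8):
|       # Prototype low-pass filter: windowed sinc, reshaped so
|       # each column is one channel's branch of taps.
|       h = np.sinc(np.arange(n_taps * n_chan) / n_chan - n_taps / 2)
|       h = (h * np.hanning(n_taps * n_chan)).reshape(n_taps, n_chan)
|
|       # Chop the stream into blocks of n_chan samples.
|       n_blk = len(x) // n_chan
|       x = x[:n_blk * n_chan].reshape(n_blk, n_chan)
|
|       # Weight n_taps consecutive blocks, sum them, then FFT
|       # across the channels.
|       spectra = [np.fft.rfft((x[i:i + n_taps] * h).sum(axis=0))
|                  for i in range(n_blk - n_taps + 1)]
|       return np.array(spectra)
|
|   # Example: channelize a noisy tone.
|   t = np.arange(1 << 16)
|   s = np.sin(2 * np.pi * 0.123 * t) + 0.1 * np.random.randn(t.size)
|   print(pfb(s).shape)   # (121, 257): spectra x rfft bins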
| imtringued wrote:
| My issue with your comment is that you're acting as if you're
| clarifying something, but you're just replacing one confusion
| with another.
|
| There are three generations of AI Engines: AIE, AIE-ML and AIE-
| MLv2.
|
| The latter two are known as XDNA and XDNA2, which are available
| in laptops and in the 8000G series on desktops. The first is
| exclusively available on select FPGAs specialising in DSP with
| single-precision floating point.
|
| The AI-focused FPGAs use AIE-MLv2 and are therefore identical
| to XDNA2.
| almostgotcaught wrote:
| the cores/arches themselves are referred to by a bagillion
| different names - AIE1, AIE2, AIE-ML, Phoenix, Strix, blah blah
| (and *DNA refers to the driver/runtime, not the core/arch
| itself) - but NPU exclusively refers to consumer edge SoC
| products.
| 01100011 wrote:
| IIRC, the European Extremely Large Telescope (love the name) is
| using Nvidia GPUs to handle adaptive optics.
| KeplerBoy wrote:
| This AstronNL project also uses Nvidia GPUs, just a stage
| further down the processing chain.
|
| https://youtu.be/RpXTbcBRiRw?si=0yTCNmPZuK29Cf1-
| K7mR2vZq wrote:
| Interesting to see Astron developing a radio astronomy
| accelerator that handles 200 Gbps streams with modest power
| consumption. The FPGA + MISD approach seems well-matched to the
| problem domain. Curious how this compares to other astronomy
| processing architectures in terms of FLOPS/watt metrics.
___________________________________________________________________
(page generated 2025-04-13 23:01 UTC)