[HN Gopher] AMD NPU and Xilinx Versal AI Engines Signal Processi...
       ___________________________________________________________________
        
       AMD NPU and Xilinx Versal AI Engines Signal Processing in Radio
       Astronomy (2024) [pdf]
        
       Author : transpute
       Score  : 57 points
       Date   : 2025-04-13 11:16 UTC (11 hours ago)
        
 (HTM) web link (git.astron.nl)
 (TXT) w3m dump (git.astron.nl)
        
       | CheeksTheGeek wrote:
       | WHY ARE VERSAL BOARDS SO EXPENSIVE (i had to rant somewhere)
       | 
       | I'm waiting for a cost reduction like the one that happened to
       | UltraScale+ devices, which finally got us something like the
       | ZUBoard.
        
         | fargle wrote:
         | Versal "edge" VE2302 boards are coming from multiple vendors,
         | with much better pricing.
         | 
         | I'm guessing they will be available in a month or so - they were
         | supposed to ship in Q2 but seem to be running a little late (as
         | is typical).
        
         | fecal_henge wrote:
         | Are the boards significantly more expensive than the devices?
        
           | transpute wrote:
           | We need something like this $160 ZUBoard as an entry point to
           | the 5-figure Zynq market,
           | https://news.avnet.com/press-releases/press-release-
           | details/...
           | 
           |  _> the smallest, lowest power, and most cost-optimized
           | member of the Zynq UltraScale+ family.. jump-start.. MPSoC-
           | based end systems like miniaturized, compute-intensive edge
           | applications in industrial and healthcare IoT systems,
           | embedded vision cameras, AV-over-IP 4K and 8K-ready
           | streaming, hand-held test equipment, consumer, medical
           | applications and more.. board is ideal for design engineers,
           | software engineers, system architects, hobbyists, makers and
           | even students_
        
         | echelon wrote:
         | Can someone with more knowledge of AMD explain if these are
         | useful for real AI work? Without CUDA does it feel like working
         | in the dark ages?
        
           | KeplerBoy wrote:
           | They are useful for AI, but it's a completely different beast
           | than a GPU.
        
           | _sbrk wrote:
           | F = Field P = Programmable G = Gate <---- important A = Array
           | 
           | You aren't "programming", you're "wiring gates together". In
           | other words, you can build custom hardware to solve a problem
           | without using a generic CPU (or GPU) to do it. FPGAs are
           | implemented as a fabric of LUTs (look-up tables) which take
           | 4, 6, or more inputs and produce an output, so arbitrary
           | Boolean functions can be evaluated. The tools you use (Vivado
           | / ISE / Yosys / etc.) take your intended design, written in
           | an HDL (Hardware Description Language) such as Verilog or
           | VHDL, and turn it into a configuration bitstream which is
           | loaded into the FPGA, configuring it into the hardware you
           | want (if you've done it right). FPGAs are a stepping stone
           | between generic hardware such as a CPU or GPU and a custom
           | ASIC. They win when you can express the problem in
           | specialized hardware much better than writing code to do
           | something on a CPU/GPU. Parallelization is the key to many
           | FPGA designs. Also, you don't have to spend >$1MM on a mask
           | set to have an ASIC fabricated by TSMC, etc.
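           | 
           | For a taste of what "hardware as code" looks like, here's a
           | minimal sketch in Migen (the Python HDL that LiteX, mentioned
           | elsewhere in this thread, is built on). Names and the clock
           | period are made up; the FPGA tools would map the generated
           | Verilog onto LUTs and flip-flops:
           | 
           |   # Minimal Migen sketch: an LED blinker described in Python.
           |   from migen import Module, Signal, If
           |   from migen.fhdl import verilog
           | 
           |   class Blinker(Module):
           |       def __init__(self, led, period=12_000_000):
           |           counter = Signal(max=period)
           |           # Count clock cycles; toggle the LED once per period.
           |           self.sync += If(counter == period - 1,
           |                           counter.eq(0),
           |                           led.eq(~led)
           |                        ).Else(counter.eq(counter + 1))
           | 
           |   led = Signal()
           |   # Emit Verilog that Vivado / Yosys can turn into a bitstream.
           |   print(verilog.convert(Blinker(led), ios={led}))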
        
         | OneDeuxTriSeiGo wrote:
          | It depends on where you get them from. A lot of the dev boards
          | have extra tooling and, of course, a healthy chunk of "dev
          | tax", unfortunately. Luckily you can find much more barebones
          | boards if you know where to look.
         | 
         | https://www.en.alinx.com/Product/SoC-Development-Boards/Vers...
        
           | _sbrk wrote:
           | It's not just that the boards are expensive; you'll also need
           | a Vivado license to create any designs for it. That license
           | is at least several thousand dollars for the Versal devices.
        
             | transpute wrote:
             | It's taken many years of reverse engineering, but there's
             | now an efficient OSS toolchain for the smaller Artix-7 FPGA
             | family, https://antmicro.com/blog/2020/05/multicore-vex-in-
             | litex/
        
               | tux3 wrote:
               | This blog post doesn't seem to talk about the OSS
               | toolchain; LiteX/VexRiscv are very neat, but they don't
               | replace Vivado, right?
        
               | transpute wrote:
               | Like all open-source, it's an ongoing effort. Bunnie has
               | a comparison,
               | https://www.bunniestudios.com/blog/2017/litex-vs-vivado-
               | firs...
               | 
               |  _> Thanks to the extensive work of the MiSoC and LiteX
               | crowd, there's already IP cores for DRAM, PCI express,
               | ethernet, video, a softcore CPU (your choice of or1k or
               | lm32) and more.. LiteX produces a design that uses about
               | 20% of an XC7A50 FPGA with a runtime of about 10 minutes,
               | whereas Vivado produces a design that consumes 85% of the
               | same FPGA with a runtime of about 30-45 minutes.. LiteX,
               | in its current state, is probably best suited for people
               | trained to write software who want to design hardware,
               | rather than for people classically trained in circuit
               | design who want a tool upgrade._
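               | 
               | Building a small LiteX SoC looks roughly like this (a
               | hypothetical sketch; the BaseSoC name and keyword
               | arguments are assumed from the litex-boards Arty A7
               | target, not taken from the blog post above):
               | 
               |   # Hypothetical sketch: generate a LiteX SoC with a
               |   # VexRiscv softcore for an Artix-7 board. Assumes the
               |   # litex and litex-boards packages are installed.
               |   from litex_boards.targets.digilent_arty import BaseSoC
               |   from litex.soc.integration.builder import Builder
               | 
               |   soc = BaseSoC(sys_clk_freq=int(100e6),
               |                 cpu_type="vexriscv")
               |   builder = Builder(soc, output_dir="build/arty")
               |   # Runs whichever FPGA toolchain the target is configured
               |   # for (Vivado or an open flow) to produce a bitstream.
               |   builder.build()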
        
               | oasisaimlessly wrote:
               | I think transpute likely meant to link F4PGA[1] or one of
               | the projects it makes use of (Yosys, nextpnr, Project
               | IceStorm, Project X-Ray, etc).
               | 
               | [1] https://f4pga.org/
        
               | transpute wrote:
               | Thanks for the pointer! DARPA ERI investment was
               | initially directed to US academic teams, while Yosys &
               | related decentralized OSS efforts were barely running on
               | conviction fumes in the OSS wilderness. Glad to see this
               | umbrella ecosystem structure from LF Chips Alliance. Next
               | we need a cultural step change in commercial EDA tools.
        
               | _sbrk wrote:
               | Artix-7 is simplistic compared to any of the Versal
               | chips. You buy an expensive FPGA and then try using an
               | "open-source" toolchain that exposes 25% of the FPGA's
               | potential. Not a great trade-off, eh?
        
           | imtringued wrote:
           | The Versal AI Edge SOMs are mildly overpriced. The boards are
           | worth it, but in the embedded space Nvidia offers the cheapest
           | solutions, so an FPGA-based application always has to justify
           | the additional cost and slightly worse throughput by arguing
           | that it has latency requirements a GPU cannot meet.
           | 
           | GPUs tend to perform worse when you have small batches and
           | frequent kernel launches. This is especially annoying in
           | cases where a simple device-wide synchronization barrier
           | could solve your problem: CUDA expects you not to synchronize
           | like that within a kernel; instead you're supposed to launch
           | a sequence of kernels one after the other. That's not a good
           | solution when a for loop over n iterations turns into n
           | kernel launches.
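           | 
           | In CuPy terms the pattern looks like this (a sketch; the
           | update function is a stand-in for whatever the real
           | per-iteration kernel is):
           | 
           |   # Sketch: an iterative update where every step needs the
           |   # whole previous array, so each grid-wide sync point becomes
           |   # a kernel-launch boundary. Assumes CuPy and a CUDA GPU.
           |   import cupy as cp
           | 
           |   x = cp.arange(4096, dtype=cp.float32)
           | 
           |   def step(x):
           |       # Cheap whole-array update; launch overhead dominates.
           |       return 0.5 * (x + cp.roll(x, 1))
           | 
           |   for _ in range(10_000):     # n iterations -> n+ launches
           |       x = step(x)
           |   cp.cuda.Stream.null.synchronize()  # one host-side sync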
        
       | almostgotcaught wrote:
        | The title is editorialized: this has nothing to do with the NPU
        | (the term does not appear in the PDF), which is the term of art
        | for the version of these cores that is sold in laptops.
        
         | OneDeuxTriSeiGo wrote:
          | The Versal AI Engine is the NPU. The Ryzen CPUs' NPU is almost
          | exactly a Versal AI Engine IP block, to the point that in the
          | Linux kernel they share the same driver (amdxdna), and the
          | reference material the kernel docs link to for the Ryzen NPUs
          | is the Versal SoC's AI Engine architecture reference manual.
         | 
         | https://docs.kernel.org/next/accel/amdxdna/amdnpu.html
        
         | transpute wrote:
          | The author ported their software between near-identical AMD
          | AIE and NPU platforms,
          | https://www.hackster.io/tina/tina-running-non-
         | nn-algorithms-...
         | 
         |  _> The PFB is found in many different application domains such
         | as radio astronomy, wireless communication, radar, ultrasound
         | imaging and quantum computing.. the authors worked on the
         | evaluation of a PFB on the AIE.. [developing] a performant
         | dataflow implementation.. which made us curious about the AMD
         | Ryzen NPU.
         | 
         | > The [NPU] PFB figure shows.. speedup of circa 9.5x compared
         | to the Ryzen CPU.. TINA allows running a non-NN algorithm on
         | the NPU with just two extra operations or approximately 20
         | lines of added code.. on [Nvidia] GPUs CUDA memory is a
         | limiting factor.. This limitation is alleviated on the AMD
         | Ryzen NPU since it shares the same memory with the CPU
         | providing up to 64GB of memory._
         | 
         | Consumer Ryzen NPU hardware is more accessible to students and
         | hackers than industrial Versal AIE products.
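          | 
          | For anyone unfamiliar with the PFB, the critically sampled
          | version boils down to "window with a prototype filter, fold
          | the taps together, FFT". A numpy sketch (sizes are
          | illustrative, not from the paper):
          | 
          |   # Minimal polyphase filter bank front end in numpy.
          |   import numpy as np
          | 
          |   n_chan, n_taps = 256, 4        # channels, taps per branch
          |   n = n_chan * n_taps
          |   # Prototype low-pass filter: windowed sinc.
          |   win = np.sinc(np.linspace(-n_taps / 2, n_taps / 2, n)) \
          |         * np.hamming(n)
          | 
          |   def pfb_spectrum(x):
          |       """One output spectrum from n_chan*n_taps samples."""
          |       seg = x[:n] * win                           # filter
          |       folded = seg.reshape(n_taps, n_chan).sum(0) # fold taps
          |       return np.fft.fft(folded)                   # n_chan bins
          | 
          |   spectrum = pfb_spectrum(np.random.randn(n))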
        
         | imtringued wrote:
          | My issue with your comment is that you're acting as if you're
          | clarifying something, but you're just replacing one confusion
          | with another.
         | 
         | There are three generations of AI Engines: AIE, AIE-ML and AIE-
         | MLv2.
         | 
          | The latter two are known as XDNA and XDNA2, and are available
          | in laptops and in the 8000G series on desktops. The first is
          | available only on select FPGAs specialising in DSP with
          | single-precision floating point.
          | 
          | The AI-focused FPGAs use AIE-MLv2 and are therefore identical
          | to XDNA2.
        
           | almostgotcaught wrote:
            | The cores/arches themselves are referred to by a bazillion
            | different names - AIE1, AIE2, AIE-ML, Phoenix, Strix, blah
            | blah (and *DNA refers to the driver/runtime, not the
            | core/arch itself) - but NPU exclusively refers to the
            | consumer edge SoC products.
        
       | 01100011 wrote:
        | IIRC, the European Extremely Large Telescope (love the name) is
        | using Nvidia GPUs to handle adaptive optics.
        
         | KeplerBoy wrote:
          | This ASTRON project also uses Nvidia GPUs, just a stage
          | further down the processing chain.
         | 
         | https://youtu.be/RpXTbcBRiRw?si=0yTCNmPZuK29Cf1-
        
       | K7mR2vZq wrote:
       | Interesting to see Astron developing a radio astronomy
       | accelerator that handles 200 Gbps streams with modest power
       | consumption. The FPGA + MISD approach seems well-matched to the
       | problem domain. Curious how this compares to other astronomy
       | processing architectures in terms of FLOPS/watt metrics.
        
       ___________________________________________________________________
       (page generated 2025-04-13 23:01 UTC)