[HN Gopher] Xilinx HBM2 Internals (2023)
       ___________________________________________________________________
        
       Xilinx HBM2 Internals (2023)
        
       Author : hasheddan
       Score  : 67 points
       Date   : 2024-05-09 09:08 UTC (13 hours ago)
        
 (HTM) web link (lovehindpa.ws)
 (TXT) w3m dump (lovehindpa.ws)
        
       | willis936 wrote:
       | I'm not an expert on memory interfaces. How do you use HBM2's
       | 1024-bit interface when you have ~200 I/O on a Zynq UltraScale+?
       | Are these pseudo-channels a SerDes for the HBM2 bus?
        
         | someguydave wrote:
         | Look at the (non-Zynq) VCU128 board for an example. The HBM2 is
         | on the PL side, and the interconnect is via a die-to-die
         | interface. So the 32 AXI3 interfaces to HBM2 here are hard
         | silicon, not FPGA I/O pins.
        
         | huntero wrote:
         | The HBM stacks are on-package for these parts, so you don't
         | have to use any external I/O to interface with them.
         | 
         | You end up with a similar challenge accessing that much
         | bandwidth internally from your FPGA logic, though. It looks
         | like the Xilinx HBM IP presents a set of 16 or 32 separate AXI
         | interfaces, each of which gives you about 14.4 GB/s of
         | bandwidth
         | (https://docs.amd.com/r/en-US/pg276-axi-hbm/Introduction).
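
A rough sketch of the arithmetic behind that comment, assuming the quoted figure of about 14.4 GB/s applies equally to every AXI interface and that the ports can all be driven at once. The function name is illustrative and not part of any Xilinx tooling.

```python
# Per-port figure quoted in the comment; treated here as an approximation.
PER_AXI_GB_S = 14.4


def aggregate_bandwidth_gb_s(num_axi_ports: int,
                             per_port_gb_s: float = PER_AXI_GB_S) -> float:
    """Upper bound on total bandwidth if every AXI port is saturated."""
    return num_axi_ports * per_port_gb_s


# The comment mentions configurations with 16 or 32 AXI interfaces.
for ports in (16, 32):
    print(f"{ports} AXI ports: ~{aggregate_bandwidth_gb_s(ports):.1f} GB/s")
```

Reaching the upper figure means the FPGA logic has to keep all of the ports busy in parallel, which is the internal challenge the comment describes.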
        
       | TacticalCoder wrote:
       | > Conventions
       | 
       | > MiB = Megabytes (2^20 bytes)
       | 
       | > Gb = Gigabits (2^27 bytes, or 128MiB)
       | 
       | > GiB = Gibibytes (2^30 bytes)
       | 
       | Shouldn't MiB be Mebibytes then?
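
A small check of the quoted conventions, assuming "Gb" there means 2^30 bits (strictly a gibibit). The constant names are made up for this sketch.

```python
MIB_BYTES = 2**20            # mebibyte: 2^20 bytes
GIB_BYTES = 2**30            # gibibyte: 2^30 bytes
GBIT_BYTES = 2**30 // 8      # 2^30 bits expressed in bytes

# 2^30 bits / 8 bits per byte = 2^27 bytes = 128 MiB
print(GBIT_BYTES == 2**27)        # True
print(GBIT_BYTES // MIB_BYTES)    # 128
```

The numbers in the quoted conventions are binary (power-of-two) quantities throughout, which is why "Mebibytes" is the matching name for MiB.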
        
       | pclmulqdq wrote:
       | I wonder if the author is doing anything to overclock the HBM
       | here or if this is within the ratings of the Samsung HBM stacks.
       | It's nice to be able to do this when you have a few cards, but if
       | you are working with hundreds, it may not be practical to push
       | the HBM this far without overvolting them a bit.
        
         | latchkey wrote:
         | I automated the tuning of 150k gpus that were being used to
         | mine ethereum.
         | 
         | The trick was that as a whole, you knew the limits of the
         | hardware. You know how to set the knobs to max performance. Due
         | to the silicon lottery, cards that can't perform at max end up
         | crashing.
         | 
         | So what I did was kind of the opposite of what everyone else
         | was doing. I first set everything at max, watched for a crash,
         | then tuned the knobs to be a bit lower. All of this was done
         | with an automated piece of software that I built. The cards we
         | used essentially had 3 knobs to twist, which resulted in
         | hundreds of combinations. Eventually, the cards stop crashing,
         | so you're at the right settings, for that individual piece of
         | hardware.
         | 
         | We were running in seasonal climates too... so each
         | winter/summer, I'd reset things and let it auto tune back
         | again. Heat plays a huge factor on stability.
         | 
         | Of course, each workload has different settings too... so that
         | plays into it, but if everything else is static, this ended up
         | being a great way to do things.
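
A minimal sketch of the "start at max, back off on failure" loop described in that comment. Everything here is an assumption for illustration: the knob names, the candidate values, the round-robin back-off order, and the `is_stable` callback, which stands in for whatever health check the real tool ran (ideally one that also catches wrong results, not only crashes).

```python
from typing import Callable, Dict, List


def autotune(levels: Dict[str, List[int]],
             is_stable: Callable[[Dict[str, int]], bool]) -> Dict[str, int]:
    """Return the first stable settings found.

    `levels` maps each knob to candidate values ordered from most to
    least aggressive. Every knob starts at its most aggressive value.
    """
    names = list(levels)
    idx = {name: 0 for name in names}
    turn = 0
    while True:
        settings = {n: levels[n][idx[n]] for n in names}
        if is_stable(settings):
            return settings
        # Back off one knob per failure, round robin, skipping knobs
        # that are already at their least aggressive value.
        for _ in range(len(names)):
            name = names[turn % len(names)]
            turn += 1
            if idx[name] + 1 < len(levels[name]):
                idx[name] += 1
                break
        else:
            raise RuntimeError("no stable settings found")


# Hypothetical card: stable only at or below these limits.
def fake_card(s: Dict[str, int]) -> bool:
    return s["core_mhz"] <= 1150 and s["mem_mhz"] <= 2050


knobs = {
    "core_mhz": [1200, 1150, 1100],
    "mem_mhz": [2100, 2050, 2000],
    "power_pct": [100, 90, 80],
}
print(autotune(knobs, fake_card))
```

Re-running the same loop from the top after a seasonal temperature change corresponds to the reset-and-retune step the comment mentions.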
        
           | rowanG077 wrote:
           | That seems great if a failure always results in a crash.
           | There are a ton of failure modes where your result will just
           | spuriously be wrong.
        
             | latchkey wrote:
             | To my knowledge, HPC rarely tunes cards for max
             | performance. My MI300x cards are at stock settings and I
             | doubt I'll ever modify them.
        
           | pclmulqdq wrote:
           | Interesting, I generally assumed Eth miners would undervolt
           | their GPUs to get more life out of them rather than
           | overclocking them for absolute max performance.
        
             | latchkey wrote:
             | Undervolt / overclock / memory timings
        
         | Wolf9466 wrote:
         | Author here. I did overclock it - that was one of the points of
         | the writeup: when you modify the memory clock, you should
         | change the timings along with it, because they are often
         | specified in tCK (ticks of the memory clock), and as such,
         | their duration in absolute time changes when the clock changes.
         | 
         | I have reliable information from folks with several thousand of
         | these FPGAs that they reliably clock to 1100MHz - 1150MHz on
         | the HBM2 at stock voltage (or a bit less). This falls in line
         | with my personal experiences - I have seven XCVU35P FPGAs, and
         | they range from doing only 1100MHz to 1150MHz, to some handling
         | 1200MHz.
         | 
         | Samsung's documentation specifies this HBM2 for 1000MHz to
         | 1100MHz, based on binning - this is why I was annoyed that
         | Xilinx limited it to 900MHz, and worked to learn how to change
         | the PLL settings.
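
A hedged sketch of why timings given in tCK need revisiting after a clock change. The 14 ns requirement below is an invented example value, not taken from the writeup or from Samsung's documentation, and the helper names are hypothetical.

```python
def ns_to_tck(t_ps: int, f_mhz: int) -> int:
    """Smallest whole number of clock ticks covering t_ps picoseconds.

    One tick lasts 1_000_000 / f_mhz picoseconds; integer ceiling
    division avoids floating-point rounding surprises.
    """
    return -(-t_ps * f_mhz // 1_000_000)


def tck_to_ns(cycles: int, f_mhz: int) -> float:
    """Absolute duration of `cycles` ticks at f_mhz, in nanoseconds."""
    return cycles * 1000 / f_mhz


REQUIRED_PS = 14_000  # illustrative 14 ns requirement (assumption)

stock = ns_to_tck(REQUIRED_PS, 900)      # ticks needed at 900 MHz
print(stock)
# Same tick count at 1100 MHz covers less absolute time than required.
print(round(tck_to_ns(stock, 1100), 2))
# Recomputed tick count for the faster clock.
print(ns_to_tck(REQUIRED_PS, 1100))
```

Under these assumed numbers, keeping the 900 MHz tick count at 1100 MHz would shorten the delay below the 14 ns target, so the tick count has to rise with the clock.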
        
           | pclmulqdq wrote:
           | I am also aware that Xilinx sets their own clock specs
           | annoyingly conservatively, and I think they do it to preserve
           | device lifetime or something similar. However, I did want to
           | clarify whether you were overvolting these things or just
           | raising the clock frequency.
           | 
           | I have run into issues where you do get a dud FPGA that is
           | just a lot slower than other FPGAs of its speed bin (it must
           | have come from the edge of the wafer or something), and
           | debugging that is pretty annoying.
        
       | akira2501 wrote:
       | I feel like domains are pretty cheap so it would be easy to
       | separate your fetishes from your work life.
        
         | doctor_eval wrote:
         | I made the mistake of looking at the gallery. NSFW.
        
           | formerly_proven wrote:
           | There are no mistakes, just happy little accidents.
        
             | doctor_eval wrote:
             | You're right, I shouldn't have said mistake.
             | 
             | The context switch nearly gave me whiplash, tho.
        
       ___________________________________________________________________
       (page generated 2024-05-09 23:01 UTC)