[HN Gopher] Memory Mapping an FPGA from an STM32
       ___________________________________________________________________
        
       Memory Mapping an FPGA from an STM32
        
       Author : hasheddan
       Score  : 88 points
       Date   : 2024-07-25 14:21 UTC (8 hours ago)
        
 (HTM) web link (serd.es)
 (TXT) w3m dump (serd.es)
        
       | Already__Taken wrote:
       | Really quite a high-level question, sorry: most of your embedded
       | projects going forward are MCU+FPGA to do what? I thought a
       | custom router, but 284 Mbps isn't nearly fast enough for a
       | network.
        
         | UncleOxidant wrote:
         | It's a good question. A lot of FPGA projects I see (including
         | some real life products I've looked into recently) don't really
         | need an FPGA. One I was asked to evaluate recently could easily
         | have been done with a microcontroller with PWM outputs. The
         | frequencies involved were well under 40MHz. Yes, there were a
         | couple of multiplications going on in the FPGA, but those
         | could've been easily handled by a microcontroller. An RP2040
         | would've sufficed instead of what they had - a microcontroller
         | + an FPGA.
        
           | azonenberg wrote:
           | The projects in question include things like a 48 port
           | gigabit Ethernet switch with packet datapath in the FPGA, and
           | dual 10/25G SFP28 uplinks. You're not doing that on a MCU.
           | Also higher end oscilloscope work (e.g. 10 Gsps 12-bit
           | JESD204B)
           | 
           | But a STM32 is more than sufficient for the management
           | interface on both.
        
           | rvense wrote:
           | I think a lot of people don't fully appreciate how fast a
           | modern "microcontroller" is. That 'H735 is probably faster
           | than every computer I had up to and including the iBook G4 I
           | used until early 2009.
        
             | buescher wrote:
             | I keep running into that also. It's like the common mental
             | model of a microcontroller froze around Y2K as a sort of
             | headless VIC-20. I had an _FAE_, and a good one, from a
             | major supplier tell me "you can't implement a filter" on a
             | low-end micro that was roughly as powerful as an early
             | nineties DSP.
        
               | rvense wrote:
               | Cortex-Ms, man. A lot of 'em you just give 3.3V and a
               | couple of bypass caps, and GCC will use the single-cycle
               | hardware MAC (for M4 and above) if you just write the
               | straightforward C code and you can put it on there in
               | 600ms with DFU. I'm a hobbyist, not an embedded wizard,
               | but it really seems *pretty* good compared to what I
               | understand about the old days.
               | 
               | (Like I like retro stuff and during COVID I bought an old
               | DSP56k dev board with a book about the assembly language
               | but oh boy, oh dear)
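
As a rough illustration of the "straightforward C code" point above, here is a minimal sketch of a fixed-point FIR filter step. The function name `fir_q15` and the Q15 format are assumptions made for this example and do not come from the thread. Whether a compiler actually emits a hardware multiply-accumulate instruction for the loop depends on the target, flags, and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one output sample of a FIR filter in Q15 fixed point.
 * Each loop iteration is a single multiply-accumulate, which is the pattern
 * a compiler can map onto a hardware MAC where one is available. */
int32_t fir_q15(const int16_t *coeff, const int16_t *history, size_t taps)
{
    assert(coeff != NULL && history != NULL);

    int64_t acc = 0; /* wide accumulator so intermediate sums cannot overflow */
    for (size_t k = 0; k < taps; k++) {
        acc += (int32_t)coeff[k] * (int32_t)history[k];
    }

    /* Q15 x Q15 gives Q30, so shift back down to Q15. */
    return (int32_t)(acc >> 15);
}
```

With two taps of 0.5 each (16384 in Q15), the function returns the average of the two history samples.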
        
               | buescher wrote:
               | They're amazing. And - you can run a _PC emulator_ on an
               | ESP32. Sure, you need the fancy ram. OK. And then people
               | will tell me an ESP32 can't do things that people
               | definitely were doing on bare-bones PCs in the eighties.
        
           | 15155 wrote:
           | Zynq 7010s are $2.50 and are a hell of a lot more chip than
           | an RP2040. If you already have the design (or copy one of the
           | 50 available), it's a good option when you don't want to
           | fight the chip.
           | 
           | PIO has extraordinarily sloppy timing (skew in all
           | categories) compared to the cheapest and smallest FPGAs.
        
             | azonenberg wrote:
             | Where are you getting them for $2.50?? The XC7Z010-1CLG225C
             | is $74.83 at Digikey in qty 1.
             | 
             | Checking sketchier places Win-Source has the CLG400 package
             | for $22.20 and even the cheapest aliexpress seller wants
             | $4.84 for something marked as a 7Z010 that may or may not
             | be legit.
             | 
             | Also "fight the chip" is pretty much the definition of what
             | I did last time I did a zynq project. Just give me a plain
             | FPGA and MCU with no wizards or GUIs or automatic code
             | generation.
        
               | 15155 wrote:
               | https://www.aliexpress.us/item/3256803970893483.html
               | 
               | I've ordered trays (and they send the OEM tray) - unique
               | barcodes, legit.
               | 
               | > Just give me a plain FPGA and MCU with no wizards or
               | GUIs or automatic code generation.
               | 
               | You can pretty much cut out all of their tools and get a
               | pure Yocto/Vivado TCL build for the bitstream for the 7
               | series Zynqs. Very low touch.
               | 
               | Their IO planner (in the Vivado IP integrator) is
               | somewhat necessary for complex peripheral scenarios and
               | is one of the few things I ever use Xilinx GUI
               | applications for anymore.
        
         | azonenberg wrote:
         | The intent is for the high performance datapath to live
         | entirely in FPGA (and the project you're probably thinking of
         | is switching, not routing).
         | 
         | The MCU is for control plane only. Several hundred Mbps between
         | the control and data plane is more than enough for a SSH
         | management CLI and poking registers on the FPGA to move a port
         | to a different VLAN in response to a CLI command or add an ACL
         | rule or something.
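
A minimal sketch of what "poking registers on the FPGA" from a CLI handler might look like once the FPGA is memory mapped. The register layout (one 32-bit VLAN register per port), the names `set_port_vlan`, `fpga_write32`, and `fpga_read32`, and the port count are all assumptions for illustration, not the author's actual design. On real hardware the base pointer would be wherever the external memory window is mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PORTS 48u /* assumed port count for this example */

/* volatile forces every access to actually reach the bus. */
static inline void fpga_write32(volatile uint32_t *base, uint32_t index, uint32_t value)
{
    base[index] = value;
}

static inline uint32_t fpga_read32(volatile uint32_t *base, uint32_t index)
{
    return base[index];
}

/* Hypothetical layout: register N holds the VLAN ID of port N.
 * Returns 0 on success, -1 on bad arguments or a readback mismatch. */
int set_port_vlan(volatile uint32_t *fpga_base, uint32_t port, uint16_t vlan)
{
    assert(fpga_base != NULL);

    /* 802.1Q reserves VLAN IDs 0 and 4095. */
    if (port >= NUM_PORTS || vlan == 0 || vlan > 4094) {
        return -1;
    }

    fpga_write32(fpga_base, port, vlan);
    return (fpga_read32(fpga_base, port) == vlan) ? 0 : -1;
}
```

Passing the base pointer as a parameter keeps the sketch testable against an ordinary array in RAM.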
        
         | dragontamer wrote:
         | Embedded projects are never about doing things as fast as
         | computers: we have full scale computers (and routers, and
         | firewalls, and switches) for that.
         | 
         | Embedded is about solving problems more physical in nature, as
         | you are physically closer to reality in nearly all aspects.
         | 
         | --------
         | 
         | An MCU + FPGA project could implement... say... the VFIR IrDA
         | (Infrared) protocol at 16Mbit.
         | 
         | Traditional IrDA is widely supported at SIR and MIR levels
         | (up to 1.152 Mbit or so). Anything faster and the equipment has
         | basically been lost to the 1990s (and never was very popular
         | anyway).
         | 
         | IrDA I'd explain as a remote controller on steroids. It's
         | infrared-based (like TV remote controllers), so you need to
         | line up both devices and have them looking at each other.
         | Infrared can reliably travel about 3 meters over the open air
         | in a variety of conditions. IrDA allows for bidirectional
         | communications. It's a truly wireless protocol, albeit one that
         | requires significant alignment to function correctly. But ~3
         | meters is good range and practical for many applications.
         | 
         | Nominally, you could use an entire MCU to handle the encoding /
         | decoding of these light-pulses. However, that's a bit
         | redundant. It's far more cost efficient to dedicate a few LUTs
         | in an FPGA to the task.
         | 
         | Yes, the MCU is needed for the final application-level / OSI
         | layer 4/5/6/7 aspects of IrDA protocol. But the lowest PHY and
         | MAC levels of the protocol can and (probably) should be a small
         | section of FPGA.
         | 
         | Upgrading from the standard MCU ~1 Mbit to 16 Mbit would be
         | roughly a 16x improvement in throughput compared to what's
         | readily available with commercial-off-the-shelf solutions. If
         | you've determined that IR communication is good for whatever
         | purpose you're using, maybe that 16x improvement is going to
         | be useful.
         | 
         | ------------
         | 
         | EDIT: The "physicality" of this is because photodiodes react
         | very quickly to light pulses. And an expensive enough
         | transistor can amplify that at the ~100MHz speeds needed to run
         | VFIR (at least in theory. I've never done this).
         | 
         | The FPGA (or MCU if you go that route...) just needs to clock
         | at 100MHz or so, and interpret the start-of-frame and end-of-
         | frame signals, while also interpreting a few other low-level
         | details. Overall, this turns the sequence of light pulses into
         | bits-and-bytes for higher-level processing (which code can and
         | should handle).
        
       | throwawayabcdef wrote:
       | This is dope. I work with Zynq/Versal quite a bit and respect and
       | understand (conceptually) the decisions you have made!
       | 
       | You get to own every aspect of your toolchain and with that will
       | come a lot of power.
       | 
       | Are you familiar with:
       | 
       | https://github.com/corundum/corundum
       | 
       | Perhaps you can build a support package for your platform.
        
       | chillingeffect wrote:
       | Neat! I love that H7 chip and its gargantuan instruction
       | manual... ...and you didn't even mention its 2nd core :)
        
         | azonenberg wrote:
         | H735 is one of the single core SKUs. Just a 550 MHz M7.
         | 
         | Would not surprise me if the M4 was there and fused off (i.e.
         | same die as multicore H7 offerings), but it's not active.
        
           | duskwuff wrote:
           | Probably not. The dual-core parts are DIE450 (which is shared
           | with some single-core parts like the H750 series!), but
           | STM32H735 is DIE483.
        
             | azonenberg wrote:
             | I have a H735 on a retired board slated for decap so we'll
             | find out once I open it up.
             | 
             | Do you know if it's fabbed in house, TSMC, or Samsung? I've
             | seen ST silicon from all 3 foundries but the only thing
             | I've seen stated publicly is 40nm. When I get it opened up
             | it should be easy to tell, TSMC and Samsung processes have
             | distinctive features on them that I recognize by sight.
        
               | duskwuff wrote:
               | No idea - I'm reading the die IDs out of the STM32Cube
               | DB. I haven't looked at the silicon, but I have no reason
               | to doubt what the DB says, especially since it confirms
               | that a lot of allegedly different parts use the same
               | dies.
        
       | 15155 wrote:
       | I recommend checking out SpinalHDL generally - I do a ton of this
       | very same kind of work with these same chips (7 series, US+) and
       | would never look back to Verilog!
       | 
       | AXI (and all memory-mapped bus protocol schemes) becomes very
       | very _pleasant._ SV interfaces get you 5% of the way there,
       | though!
       | 
       | Also - I was under the impression that S1000-2M is a higher-end
       | material, not cost-optimized? (But not Rogers, of course.)
        
         | azonenberg wrote:
         | S1000-2 is quite cheap and lossy (Df 0.016), slightly better
         | than Isola 370HR (0.021) but nowhere near the stuff I usually
         | use. At my usual Chinese board house it's one of the lowest
         | cost substrates available for prototypes since it's always in
         | stock and there's no need to special order.
         | 
         | For higher end digital work I typically reach for Taiwan Union
         | TU872SLK (Df 0.009) which also has a better range of prepregs
         | and glass styles available to help minimize fiber weave effect.
         | Still quite a bit lossier than e.g. RO4350B but far less
         | expensive and if you have decent equalizers on your SERDES the
         | difference is typically not significant unless you're making
         | some kind of humongous backplane. I get wide open eyes with
         | just a tiny bit of post-cursor emphasis on the TX FFE at
         | 10.3125 Gbps on TU872SLK for my typical shortish high speed
         | tracks (FPGA to SFP+ cage).
        
           | 15155 wrote:
           | Curious who you are using in CN for higher-speed FPGA boards,
           | if you can share!
           | 
           | I haven't seen these as directly-advertised options at any of
           | my usual suspects.
        
             | azonenberg wrote:
             | Multech (multech-pcb.com) is my preferred manufacturer
             | these days for high end stuff. I've done six layer HDI any-
             | layer via stackups, ten layers with filled via-in-pad,
             | RO4350B, TU872SLK, flex, 75 micron trace/space, etc. And
             | that's nowhere near the limit of their capabilities, I just
             | haven't needed higher end yet.
             | 
             | I have some 25/100G stuff in the pipe for probably some
             | time next year that I plan to make with them too.
             | 
             | Their website undersells, I get the impression most of the
             | actual sales contacts are word of mouth. I talk to my sales
             | rep by skype mostly (the alternatives are expensive
             | international phone calls or wechat).
             | 
             | The really cool thing is that you get a 10+ page QA report
             | with every order including measured
             | copper/dielectric/soldermask thicknesses, hole sizes, ionic
             | contamination measurements, and a ton of other metrics. And
             | they send the TDR strips and polished cross section with
             | every order as their way of saying "look, we actually did
             | the QA, double check our measurements if you don't trust
             | us". (I actually have repeated some of the measurements to
             | spot-check and got results within a few percent of their QA
             | department, no surprises there).
             | 
             | And they don't make silent gerber changes or anything. They
             | do a full CAM review and send you working gerbers and a
             | list of suggested DFM tweaks for you to sign off before
             | beginning manufacture. If something doesn't look right you
             | have a chance to say "wait there's a problem".
             | 
             | For example, one time they wanted to make a really large
             | width adjustment for impedance on some RF traces that I had
             | carefully modeled in an EM solver. But they didn't make a
             | bad board without telling me, they flagged it on the CAM
             | review and we went back and forth before realizing the
             | mistake was on their end (they had calculated impedance
             | assuming solder mask over the traces, while they were
             | actually exposed copper). They re-ran the numbers which
             | then closely matched my simulations, I signed off on the
             | modified design, and the board was manufactured without
             | issue.
        
           | buescher wrote:
           | Also S1000-2 is not rated/controlled past 1GHz. It shouldn't
           | vary that much so for small runs the risk is minimal. But for
           | volume production that's exactly the sort of thing you never
           | want to have to investigate in hindsight.
        
       | buescher wrote:
       | This is really crisp work and nice to see. Before the Zynq era I
       | worked with some designs that used a DSP or StrongARM along with
       | a medium-sized FPGA, where the FPGA would be both the glue logic
       | for RAM as well as custom peripherals, but I've been out of that
       | world for a while. It would be fun to find an application for a
       | big FPGA and a modern microcontroller.
        
       | dmitrygr wrote:
       | Be veeeery careful. STM32H QSPI peripheral is _FULL OF_ very
       | nasty bugs, especially the second version (supports writes) that
       | you find in STM32H0B chips. You are currently avoiding them by
       | having QSPI mapped as device memory, but the minute you attempt
       | to use it with cache or run code from it, or (god help you) put
       | your stack, heap, and/or vector table on a QSPI device, you are
       | in for a world of poorly-debuggable 1:1,000,000 failures. STM
       | knows but refuses to publicly acknowledge, even if they privately
       | admit some other customers have "hit similar issues". Issues I've
       | found, demonstrated to them, and wrote reliable replications of:
       | 
       | * non-4-byte-sized writes randomly lost about 1/million writes if
       | QSPI is writeable and not cached
       | 
       | * non-4-byte-sized writes randomly rounded up in size to 2 or 4
       | bytes with garbage, overwriting nearby data about 1/million
       | writes if QSPI is writeable and cached
       | 
       | * when PC, SP, and VTOR all point to QSPI memory, any interrupt
       | has about a 1/million chance of reading garbage instead of the
       | proper vector from the vector table if it interrupts a LDM/STM
       | instruction targeting the QSPI memory and it is cached and misses
       | the cache
       | 
       | Some of these have workarounds that I found (contact me). I am
       | refusing to disclose them to STM until they acknowledge the bugs
       | publicly.
       | 
       | I recommend NOT using STM32H7 chips in any product where you want
       | QSPI memory to work properly.
        
         | azonenberg wrote:
         | I have encountered issues with QSPI (mostly caused by the
         | annoying prefetch queue) which is why I am switching to the FMC
         | for FPGA interfacing (i.e. not using OCTOSPI). That was the
         | whole point of this experiment, validating FMC as a replacement
         | for my legacy OCTOSPI based MCU-APB bridge. I have a previous
         | board using QSPI reliably in indirect mode (i.e. not memory
         | mapped) but found it was full of pain when memory mapped
         | specifically in writes. So that firmware memory maps it for
         | reads but switches to indirect mode for writes. And has cache
         | disabled.
         | 
         | So far I have it working quite reliably (my test firmware does
         | a loopback test with 100K reads/writes of a 32-bit register at
         | the start that I had written with intent of using it for link
         | training of the PLLs to optimize read/write capture timing but
         | never ended up using as such) and my iperf test can push tens
         | of thousands of packets per second without issue.
        
           | 15155 wrote:
           | The NXP IMXRT-series chips have a similar EMC (external
           | memory controller) as well as "FlexIO" - PIO-like
           | programmable IO. I've used both for this kind of FPGA
           | interface without issue.
           | 
           | The IMXRT1064 is around $7 and is also an M7 core with an HS
           | USB PHY, programmable PLL-connected LVDS clock output, 2
           | EMACs, excellent hardened IP generally.
        
             | azonenberg wrote:
             | I have some RT1176's in my "to try" pile.
             | 
             | The big thing holding me back was that their crypto
             | accelerators were all locked behind NDAs (a dealbreaker for
             | F/OSS work) while the ST ones are documented in the freely
             | downloadable datasheet you can just google up.
             | 
             | But I did find some third party wrapper libraries that
             | seemed to be able to use the crypto registers so it might
             | be possible to figure things out from that. I haven't tried
             | yet.
             | 
             | The other issue I had with the RT is that they lacked
             | internal flash so PCB complexity is slightly higher than
             | with a STM32.
        
               | 15155 wrote:
               | > I have some RT1176's in my "to try" pile.
               | 
               | Keep in mind the dual-core 11xx chips are a bit harder to
               | boot than the rest of the line - but you probably need
               | the power domain flexibility for most FPGA projects (1064
               | has way fewer practically-usable 1v8 banks.)
               | 
               | > crypto accelerators were all locked behind NDAs
               | 
               | I've been able to use every bit of hard IP and high-
               | assurance boot from registers using no vendor code
               | whatsoever.
               | 
               | Here's what you are looking for:
               | 
               | https://github.com/JayHeng/imxrt-level2-boot/blob/master/dev...
               | 
               | > The other issue I had with the RT is that they lacked
               | internal flash
               | 
               | The IMXRT1064 has a 4MB Winbond QSPI chip in-package, by
               | the way!
               | 
               | > PCB complexity is slightly higher than with a STM32.
               | 
               | The Xilinx FPGA that is sitting next to your MCU incurs
               | multiple orders of magnitude more PCB-complexity than a
               | little QSPI flash, haha.
        
           | dmitrygr wrote:
           | > 100K reads/writes of a 32-bit register
           | 
           | You'll hit almost no bugs if you keep accessing the same
           | address in a loop. Lucky you :)
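
To illustrate the difference between hammering one address and varying the access pattern, here is a minimal sketch of a loopback test that writes pseudo-random values to pseudo-random word offsets and reads them back. The names `loopback_test` and `xorshift32` are chosen for this example, and the sketch is not the firmware discussed in the thread. It also makes no claim to reproduce the reported peripheral bugs. It only shows the test structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG so the test needs no library support. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Write a random value to a random word in the window, read it straight
 * back, and count mismatches. Returns the number of failed readbacks. */
uint32_t loopback_test(volatile uint32_t *window, uint32_t words,
                       uint32_t iterations, uint32_t seed)
{
    assert(window != NULL);
    if (words == 0) {
        return 0;
    }

    uint32_t state = seed ? seed : 1u; /* xorshift state must be nonzero */
    uint32_t errors = 0;

    for (uint32_t i = 0; i < iterations; i++) {
        uint32_t index = xorshift32(&state) % words;
        uint32_t value = xorshift32(&state);

        window[index] = value;
        if (window[index] != value) {
            errors++;
        }
    }
    return errors;
}
```

Run against ordinary RAM this should always return zero. Pointed at a memory-mapped window, any nonzero count would indicate a lost or corrupted access.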
        
             | azonenberg wrote:
             | Yeah but again, we're talking about the FMC here not the
             | OCTOSPI.
             | 
             | Have you hit issues with the FMC? From what other people
             | are telling me, the OCTOSPI is full of land mines and the
             | FMC is pretty decent. The worst errata I've encountered so
             | far is two dummy clocks with CS# asserted at the end of a
             | read burst.
        
         | mips_r4300i wrote:
         | Thanks for the heads up. I have a design at fab that uses the
         | H7's OctoSPI so this concerns me. I steered away from the
         | memory mapped mode because it seemed too good to be true -
         | wanted to be able to qsort() and put heaps in this extra space.
         | 
         | I suspect ST only ever tested it with their single PSRAM they
         | intend this mode for. My intent is to use indirect mode and
         | manually poke the peripheral, though DMA will have to happen
         | still.
         | 
         | Back on the PIC32MX platform there was a similar type of bug
         | that nobody but me seems to have run into: if any interrupt
         | fires while the PMP peripheral is doing a DMA, there is a 1 in
         | a million chance that it will silently drop 1 byte. Noticed
         | this because all my accesses were 32-bit (4 bytes) and broke
         | horribly at the misalignment. The solution is to disable all
         | interrupts while doing DMA.
        
           | dmitrygr wrote:
           | it is worse: i think they also did not test random access. I
           | suspect their test was to: fill PSRAM linearly and then read
           | it back and verify linearly. Random word accesses in
           | uncached mode also randomly lose writes. I am unable to
           | replicate quickly _on purpose_, only randomly, so i guess it
           | is under 1/100mil so it is not in my list above. My
           | workarounds avoid these crashes too though.
        
         | azonenberg wrote:
         | As far as using QSPI memory, one thing I have planned (and will
         | be thoroughly testing) is using an external SPI flash as
         | configuration data storage. Right now if I want to store any
         | nonvolatile settings with power loss protection I need to burn
         | two 128 kB erase blocks (one primary and one secondary, so I
         | can ping-pong data between them and not lose anything if I have
         | a power loss during a write cycle or similar) of the on-chip
         | flash, space that I'd much rather use for firmware.
         | 
         | MicroKVS expects to be able to memory map data fetches
         | (uncached), but is fine with using indirect access for writes.
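
The ping-pong scheme described above can be sketched roughly as follows. This is not MicroKVS's actual implementation. The `bank_t` layout, the `BANK_MAGIC` marker, and both function names are hypothetical, and the two banks are plain structs in RAM standing in for flash erase blocks. The idea is that the commit marker is written last, so an interrupted write leaves the older bank as the newest valid copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BANK_MAGIC 0x4B565331u /* hypothetical commit marker */

typedef struct {
    uint32_t magic;    /* written last; absent means the write never finished */
    uint32_t sequence; /* higher means newer (wraparound ignored here) */
    uint32_t payload;  /* stand-in for the real config data */
} bank_t;

/* Return the index of the newest committed bank, or -1 if neither is valid. */
int active_bank(const bank_t banks[2])
{
    assert(banks != NULL);

    int best = -1;
    for (int i = 0; i < 2; i++) {
        if (banks[i].magic != BANK_MAGIC) {
            continue;
        }
        if (best < 0 || banks[i].sequence > banks[best].sequence) {
            best = i;
        }
    }
    return best;
}

/* Write new data to the inactive bank, leaving the active one untouched
 * until the commit marker lands. */
void store_config(bank_t banks[2], uint32_t payload)
{
    int cur = active_bank(banks);
    int next = (cur == 0) ? 1 : 0;
    uint32_t seq = (cur < 0) ? 1u : banks[cur].sequence + 1u;

    banks[next].magic = 0; /* stand-in for erasing the block */
    banks[next].sequence = seq;
    banks[next].payload = payload;
    banks[next].magic = BANK_MAGIC; /* commit */
}
```

A real implementation would also need a checksum over the payload and would have to respect the flash's erase and program granularity. Both are left out here to keep the sketch short.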
        
           | azonenberg wrote:
           | But if I can memory map the FPGA via the FMC, I can simply
           | put an APB memory mapped QSPI controller on the FPGA and
           | store my config there, using the same flash for the FPGA
           | bitstream as well.
           | 
           | This saves a chip on the board, reduces the amount of PCB
           | routing required, and eliminates use of the sketchy OCTOSPI
           | peripheral entirely. Testing that out is on my list of things
           | to do on this board eventually.
        
             | 15155 wrote:
             | I almost always include I2C EEPROM - just too cheap and
             | pretty easy to route.
        
               | azonenberg wrote:
               | That can't be memory mapped, so I'd need to rewrite my
               | KVS code which currently expects to be able to return a
               | pointer to the raw on-flash image of the config data.
               | Doable but a pain.
        
         | mystified5016 wrote:
         | What the hell is going on at ST? Every STM uC I've tried to use
         | in the past few years has had showstopper bugs with loads of
         | very similar complaints online dating back to the release of
         | the part. Bugs that have been in the wild for _years_ and still
         | exist in the current production run.
         | 
         | After burning enough company time chasing bugs through ST's
         | crappy silicon, I've had to just swear them off entirely. We're
         | an Atmel house now. Significantly fewer (zero) problems, and
         | some pretty nifty features like UPDI.
        
           | hmry wrote:
           | In college, our SoC design instructor told us that to pass
           | the class, our modules should be better than ST's "which is
           | not that high of a bar" :P
        
           | mips_r4300i wrote:
           | They churn out new parts and don't bring in fixes. See all
           | the chips in their lineup that have a USB host controller.
           | Every one of them (they use Synopsys IP) will fail with
           | multiple LS devices through a hub. We talked to our FAE about
           | this and they have no plans to fix it. The bug has existed
           | for years and the bad IP is being baked into all the new
           | chips still. Solution? Just use yet another chip for its host
           | controller, and don't use a hub.
        
       ___________________________________________________________________
       (page generated 2024-07-25 23:04 UTC)