[HN Gopher] Pico Cray - Small scale distributed computing
       ___________________________________________________________________
        
       Pico Cray - Small scale distributed computing
        
       Author : ecliptik
       Score  : 118 points
       Date   : 2023-04-20 20:45 UTC (1 days ago)
        
 (HTM) web link (www.extremeelectronics.co.uk)
 (TXT) w3m dump (www.extremeelectronics.co.uk)
        
       | crest wrote:
       | It would've been cray-cray to boot them from a single board with
       | extra flash via SWD. Does anyone know how fast the RP2040 SWD
       | clock can be driven on a Pico soldered to a central board? How
       | many DMA channels and PIO state machines would be required to
       | build an even faster remote memory access interface? Can a
       | central 9th RP2040 serve as fast RDMA router for 8 other Pico
       | boards? So many unanswered questions ;-).
       | 
       | One way the RP2040 really punches above its price class is
       | internal bandwidth. It has four AHB-Lite bus masters (two M0+
       | cores with one read-write port each, and the DMA engine with a
       | read and a write port) connected to a full crossbar switch. There
       | are six SRAM banks, of which the four largest are commonly used
       | word interleaved, but if you want full control over memory timing
       | and bandwidth you can use the uninterleaved alias instead. Most
       | chips priced around a RP2040 (and the SPI flash to go with it)
       | have only a single small SRAM bank. I've found the DMA engine
       | surprisingly pleasant to work with (easy to understand, no fixed
       | request routing to fight, but still very flexible which can save
       | a lot of interrupts). Whoever designed the combination of the PIO
       | channel I/O coprocessors and this DMA engine had a clever
       | tinkerer's mind. I have to stop myself from wasting time looking
       | for ways to save CPU cycles with trickery that only makes the
       | code harder to maintain.
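
The word interleaving described above can be illustrated with a small address calculation. This is only a sketch: the base addresses and the 64 KiB bank size are assumptions taken from my reading of the RP2040 datasheet's address map, not from the comment, and the function names are made up for illustration.

```python
from typing import Tuple

# Assumed RP2040 address map (check the datasheet before relying on it).
STRIPED_BASE = 0x20000000    # four large banks, word-interleaved
UNSTRIPED_BASE = 0x21000000  # the same four banks laid out one after another
BANK_SIZE = 64 * 1024        # bytes per large bank
NUM_BANKS = 4
WORD = 4                     # bytes per 32-bit word


def striped_to_bank(addr: int) -> Tuple[int, int]:
    """Return (bank, byte offset inside that bank) for a striped address."""
    off = addr - STRIPED_BASE
    if not 0 <= off < NUM_BANKS * BANK_SIZE:
        raise ValueError("address outside the striped SRAM region")
    word_index, byte_in_word = divmod(off, WORD)
    # Consecutive words rotate through banks 0, 1, 2, 3, 0, ...
    bank_word, bank = divmod(word_index, NUM_BANKS)
    return bank, bank_word * WORD + byte_in_word


def unstriped_alias(addr: int) -> int:
    """Address of the same byte in the non-interleaved alias."""
    bank, bank_off = striped_to_bank(addr)
    return UNSTRIPED_BASE + bank * BANK_SIZE + bank_off


if __name__ == "__main__":
    for a in range(STRIPED_BASE, STRIPED_BASE + 0x20, WORD):
        bank, off = striped_to_bank(a)
        print(f"{a:#010x} -> bank {bank}, offset {off:#06x}, "
              f"alias {unstriped_alias(a):#010x}")
```

With this mapping, a linear DMA transfer and two cores walking through different arrays tend to land on different banks in any given cycle, which is where the interleaving pays off. Placing a buffer through the alias instead pins it to one bank, trading that statistical spreading for predictable timing.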
        
         | dfox wrote:
         | One of the things that I put some thought into is that the PIO
         | block is fast and you can build a custom fast interconnect from
         | that (or maybe even Transputer Link) and then build a large
         | distributed memory "supercomputer" from that.
         | 
         | One of the limiting factors is that RP2040 expects to boot from
         | SPI Flash and there is no documented way for external SSRAM
         | (the Synopsys DW IP that is used by RP2040 as SPI flash
         | controller probably even can do that, but its documentation in
         | RP2040 datasheet seems somewhat redacted).
        
         | ta988 wrote:
         | SWD is 25MHz max. IOs on normal ports can reach CPU freq in
         | some cases (so 133-250MHz).
        
           | crest wrote:
           | That's the official spec, but given how far the rest of the
           | chip overclocks I would like to check that. The problem is
           | that none of my SWD probes go above 24MHz.
        
       | fredrikholm wrote:
       | This is adorable!
       | 
       | Always loved the CRAY aesthetic. Saw one in a museum a while
       | back, couldn't get away from it.
        
         | mstade wrote:
         | Me too - I saw it at the Computer History Museum in Mountain
         | View, CA. Gorgeous machine! The museum was great too, highly
         | recommend a visit to anyone interested in the history of
         | computing.
        
       | azubinski wrote:
       | To be honest, it's not exactly Cray, and not even Cray at all.
       | 
       | Here is the homebrew cycle-accurate Cray-1, and it's a really
       | impressive project:
       | 
       | http://www.chrisfenton.com/homebrew-cray-1a/
        
         | [deleted]
        
         | fentonc wrote:
         | But mine was only 1/10-scale!
         | 
         | You're correct this isn't really a Cray other than the vaguely
         | rounded shape, but it is still adorable and looks super fun to
         | program.
        
       | tleb_ wrote:
       | About I2C addressing: I've thought about the issue. That solution
       | could fail in theory because of collisions, i.e. a micro-controller
       | cannot atomically check the assert pin and raise it if low. The
       | project uses 1000ms + rand(0, 5000ms), which sounds big enough.
       | 
       | I've always wondered if a solution based on multi-master I2C
       | support could work: all children shout a unique ID and the parent
       | answers with their assigned I2C address. Children that do not win
       | the arbitration will detect collisions and stop because of
       | arbitration errors. Unsure how well it could work with many
       | children.
       | 
       | See https://www.i2c-bus.org/multimaster/ and
       | https://www.i2c-bus.org/i2c-primer/clock-generation-stretchi...
       | for introductions to I2C multimaster and arbitration. The spec is
       | of course not open.
       | 
       | I'd be interested in discussing the topic more!
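
The multi-master idea above can be modelled in a few lines. This is a toy simulation, not firmware. It assumes an idealised wired-AND data line, perfectly synchronised clocks, and IDs that are unique and fit in `width` bits. The names `arbitrate` and `enumerate_children` and the starting address `0x10` are hypothetical.

```python
def arbitrate(ids, width=8):
    """Run one arbitration round and return the ID that survives.

    Every contender shifts its ID out MSB first. The bus is wired-AND,
    so a single 0 pulls the line low. A contender that released the
    line (sent 1) but reads back 0 has lost and stops transmitting.
    """
    contenders = set(ids)
    if not contenders:
        raise ValueError("no contenders")
    for bit in reversed(range(width)):
        driven = {i: (i >> bit) & 1 for i in contenders}
        bus = min(driven.values())  # wired-AND of all driven bits
        contenders = {i for i, b in driven.items() if b == bus}
    (winner,) = contenders          # unique IDs leave exactly one survivor
    return winner


def enumerate_children(ids, first_address=0x10, width=8):
    """Repeat arbitration until every child has been given an address."""
    pending = set(ids)
    assigned = {}
    address = first_address
    while pending:
        winner = arbitrate(pending, width)
        assigned[winner] = address  # the parent answers the winner
        address += 1
        pending.remove(winner)      # losers retry in the next round
    return assigned


if __name__ == "__main__":
    for uid, addr in enumerate_children([0x5A, 0x13, 0xC4]).items():
        print(f"child ID {uid:#04x} -> I2C address {addr:#04x}")
```

In this model the lowest ID always wins a round, so N children need N rounds. On real hardware the open questions are whether each controller reports arbitration loss reliably and how the losers time their retries.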
        
         | dfox wrote:
         | I vaguely remember that Access.BUS (I2C based PC peripheral bus
         | that was kind of a precursor to USB) did address assignment
         | using somewhat similar protocol.
        
       | markus_zhang wrote:
       | Nicely done. I wonder if it can hold a term based BBS server and
       | if so how many users it can serve. But I guess we need to figure
       | out storage first. Is SD card good for pico?
        
       | billpg wrote:
       | How many Raspberry Pis have the equivalent processing power to an
       | original Cray-1?
       | 
       | (I would not be at all surprised if a single RPi were as powerful
       | as twenty Cray-1s.)
        
         | AquaLineSpirit wrote:
         | A Raspberry Pi 4B has 13.5 GFLOPS[0] while a Cray-1 has 160
         | MFLOPS[1] so you need about 85 Cray-1s :)
         | 
         | Couldn't find any numbers for a Pi Pico.
         | 
         | 0:
         | https://web.eece.maine.edu/~vweaver/group/green_machines.htm...
         | 1: https://en.wikipedia.org/wiki/Cray-1
        
         | ta988 wrote:
         | Are you talking about Pis, or Picos as mentioned in this
         | article?
        
         | thrtythreeforty wrote:
         | Wikipedia says the Cray-1 was capable of 160 MFLOPS.
         | 
         | As a rule of thumb, modern scalar pipelines can sustain one ALU
         | op per cycle, and you see nearly all Linux-capable CPUs quoted
         | in GHz. So we should expect gigaflops, minimum. And indeed [1]
         | suggests that the Pi 4 is capable of 13.5 GFLOPS, so about 84
         | Cray-1s. (The further speedups come from the fact that ARM
         | also has NEON vector instructions, and from multiple cores.)
         | 
         | The Pi _Pico_ , on the other hand, does not have a floating
         | point unit. So it emulates it in software (soft float). The C
         | SDK docs [2] suggest 13.8kHz (!) operation for single-precision
         | add. I'll be generous and suggest that the 2x cores could
         | double this performance. So then, it'd achieve about 0.017% of
         | the Cray-1's performance. Oof.
         | 
         | If you're willing to do integer arithmetic, things look much
         | better for the Pico, of course - it runs at 125MHz and the
         | above scalar rule of thumb applies.
         | 
         | [1]:
         | https://web.eece.maine.edu/~vweaver/group/green_machines.htm...
         | 
         | [2]: https://datasheets.raspberrypi.com/pico/raspberry-pi-
         | pico-c-...
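
The arithmetic behind these ratios is easy to check. The sketch below uses only the figures quoted in this thread (13.5 GFLOPS, 160 MFLOPS, 13.8 kHz soft-float adds) and treats them as given. The variable and function names are made up for illustration.

```python
PI4_FLOPS = 13.5e9        # Pi 4 figure quoted in the thread
CRAY1_FLOPS = 160e6       # Cray-1 peak quoted in the thread
PICO_SOFT_ADD_HZ = 13.8e3 # soft-float add rate quoted in the thread
PICO_CORES = 2


def percent_of_cray(flops: float) -> float:
    """Express a throughput as a percentage of the quoted Cray-1 peak."""
    return 100.0 * flops / CRAY1_FLOPS


pi4_in_crays = PI4_FLOPS / CRAY1_FLOPS                     # about 84.4
pico_one_core_pct = percent_of_cray(PICO_SOFT_ADD_HZ)      # about 0.0086 %
pico_two_core_pct = percent_of_cray(PICO_SOFT_ADD_HZ * PICO_CORES)  # about 0.017 %

if __name__ == "__main__":
    print(f"Pi 4 = {pi4_in_crays:.1f} Cray-1s")
    print(f"Pico, one core:  {pico_one_core_pct:.4f} % of a Cray-1")
    print(f"Pico, two cores: {pico_two_core_pct:.4f} % of a Cray-1")
```

Note that 0.0086% is the single-core figure, and doubling it for both cores gives roughly 0.017%.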
        
         | crest wrote:
         | The original Cray 1 already had a very powerful vector
         | execution unit and low memory latency relative to the CPU
         | frequency (by today's standards). The RP2040 has even lower
         | memory latency, but only 1/4 MiB RAM and no hardware floating
         | point support or SIMD (neither packed-SIMD nor "true" Cray
         | style vectors). At least the BootROM includes hand optimised
         | Soft-FP code and the single-cycle I/O block even includes a
         | memory mapped hardware integer divider and two interpolators
         | per CPU core. These can help a lot with fixed point DSP
         | workloads, but are totally different from what made the Cray 1
         | special in its time. Today's better MCUs may match certain
         | performance numbers of early supercomputers, but their designs
         | have more differences than similarities. There have been
         | attempts to recreate early Crays in FPGAs, but so little
         | software for them has been preserved (in a way accessible to
         | the public) that it's difficult to judge how good the
         | recreations are and even the later Cray 1 based designs need a
         | really big FPGA to reimplement all relevant pieces at least as
         | fast as the original. Cray didn't just build fast floating
         | point adders and multipliers, but the main memories and
         | interconnects between them to make them useful. Good luck
         | finding a way to attach 100s of interleaved DRAM banks to your
         | FPGA, because your internal block RAM won't be enough as main
         | memory and memory access timings have improved the least over
         | time.
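
For readers unfamiliar with the fixed-point arithmetic mentioned above, here is a minimal Q16.16 sketch. The format choice and all function names are assumptions for illustration. Python integers never overflow, so the comments note where C on a 32-bit core would need a 64-bit intermediate.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in Q16.16


def to_fixed(x: float) -> int:
    """Convert a float to Q16.16, rounding to the nearest step."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a Q16.16 value back to a float."""
    return q / ONE


def fx_mul(a: int, b: int) -> int:
    """Q16.16 multiply; in C the product needs a 64-bit intermediate."""
    return (a * b) >> FRAC_BITS


def fx_div(a: int, b: int) -> int:
    """Q16.16 divide; this integer division is the step a hardware
    divider would take over on a core without a divide instruction."""
    return (a << FRAC_BITS) // b


def fx_dot(samples, taps) -> int:
    """Dot product of two Q16.16 sequences, the core of an FIR filter."""
    return sum(fx_mul(s, t) for s, t in zip(samples, taps))


if __name__ == "__main__":
    samples = [to_fixed(v) for v in (1.0, 2.0, 3.0, 4.0)]
    taps = [to_fixed(0.25)] * 4  # 4-tap moving average
    print(to_float(fx_dot(samples, taps)))
    print(to_float(fx_div(to_fixed(3.0), to_fixed(2.0))))
```

Everything here is integer multiply, shift, add and divide, which is why such workloads suit a core with no floating point unit.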
        
         | ignite wrote:
         | And what's the relative power consumption? :-)
        
           | adestefan wrote:
           | Add in cooling requirements, too.
        
         | mechagodzilla wrote:
         | A Cray-1 ran at 80 MHz, and with careful coding could sustain
         | about 2 double-precision floating point operations per cycle -
         | so 160 MFLOPS. Looking at linpack benchmarks, it looks like
         | it's right around a Raspberry Pi 2 running at 1 GHz (169 DP
         | MFLOPS), and a little worse than a Raspberry Pi 3 at 180
         | MFLOPS.
        
       ___________________________________________________________________
       (page generated 2023-04-21 23:01 UTC)