[HN Gopher] Pico Cray - Small scale distributed computing
___________________________________________________________________
Pico Cray - Small scale distributed computing
Author : ecliptik
Score : 118 points
Date : 2023-04-20 20:45 UTC (1 days ago)
(HTM) web link (www.extremeelectronics.co.uk)
(TXT) w3m dump (www.extremeelectronics.co.uk)
| crest wrote:
| It would've been cray-cray to boot them from a single board with
| extra flash via SWD. Does anyone know how fast the RP2040 SWD
| clock can be driven on a Pico soldered to a central board? How
| many DMA channels and PIO state machines would be required to
| build an even faster remote memory access interface? Can a
| central 9th RP2040 serve as fast RDMA router for 8 other Pico
| boards? So many unanswered questions ;-).
|
| One way the RP2040 really punches above its price class is
| internal bandwidth. It has four AHB-Lite bus masters (two M0+
| cores with one read-write port each, and the DMA engine with a
| read and a write port) connected to a full crossbar switch. There
| are six SRAM banks, of which the four largest are commonly used
| word interleaved, but if you want full control over memory timing
| and bandwidth you can use the uninterleaved alias instead. Most
| chips priced around an RP2040 (and the SPI flash to go with it)
| have only a single small SRAM bank. I've found the DMA engine
| surprisingly pleasant to work with (easy to understand, no fixed
| request routing to fight, but still very flexible, which can save
| a lot of interrupts). Whoever designed the combination of the PIO
| channel I/O coprocessors and this DMA engine had a clever
| tinkerer's mind. I have to stop myself from wasting time looking
| for ways to save CPU cycles with trickery that only makes the code
| harder to maintain.
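|
| A rough sketch of the bank interleaving described above, in Python
| for readability. The base addresses and the use of address bits
| 3:2 as the bank select are my reading of the RP2040 datasheet, and
| the function names are made up for illustration:

```python
# Assumed RP2040 SRAM layout (check against the datasheet):
#   0x20000000  striped alias, banks 0-3 interleaved word by word
#   0x21000000  unstriped alias, banks 0-3 back to back, 64 KiB each
STRIPED_BASE = 0x20000000
UNSTRIPED_BASE = 0x21000000
BANK_SIZE = 64 * 1024
NUM_BANKS = 4


def bank_striped(addr):
    """Bank hit by a byte address in the striped (interleaved) alias."""
    offset = addr - STRIPED_BASE
    if not 0 <= offset < NUM_BANKS * BANK_SIZE:
        raise ValueError("address outside the striped SRAM alias")
    # Bits 3:2 of the address pick the bank, so each 32-bit word
    # lands in the next bank along.
    return (offset >> 2) & 0x3


def bank_unstriped(addr):
    """Bank hit by a byte address in the unstriped alias."""
    offset = addr - UNSTRIPED_BASE
    if not 0 <= offset < NUM_BANKS * BANK_SIZE:
        raise ValueError("address outside the unstriped SRAM alias")
    # Each bank is one contiguous 64 KiB region.
    return offset // BANK_SIZE


if __name__ == "__main__":
    words = [STRIPED_BASE + 4 * i for i in range(8)]
    print([bank_striped(a) for a in words])
    print(bank_unstriped(UNSTRIPED_BASE + BANK_SIZE + 4))
```

| Consecutive words in the striped alias rotate through banks 0 to
| 3, so several bus masters streaming through memory tend to land on
| different banks in the same cycle. The unstriped alias lets you
| pin a buffer to one bank instead.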
| dfox wrote:
| One of the things that I put some thought into is that the PIO
| block is fast and you can build a custom fast interconnect from
| that (or maybe even Transputer Link) and then build a large
| distributed memory "supercomputer" from that.
|
| One of the limiting factors is that RP2040 expects to boot from
| SPI Flash and there is no documented way for external SSRAM
| (the Synopsys DW IP that is used by RP2040 as SPI flash
| controller probably even can do that, but its documentation in
| RP2040 datasheet seems somewhat redacted).
| ta988 wrote:
| SWD is 25MHz max. IOs on normal ports can reach CPU freq in
| some cases (so 133-250MHz).
| crest wrote:
| That's the official spec, but given how far the rest of the
| chip overclocks I would like to check that. The problem is
| that none of my SWD probes go above 24MHz.
| fredrikholm wrote:
| This is adorable!
|
| Always loved the CRAY aesthetic. Saw one in a museum a while
| back, couldn't get away from it.
| mstade wrote:
| Me too - I saw it at the Computer History Museum in Mountain
| View, CA. Gorgeous machine! The museum was great too, highly
| recommend a visit to anyone interested in the history of
| computing.
| azubinski wrote:
| To be honest, it's not exactly Cray, and not even Cray at all.
|
| Here is the homebrew cycle-accurate Cray-1, and it's a really
| impressive project:
|
| http://www.chrisfenton.com/homebrew-cray-1a/
| [deleted]
| fentonc wrote:
| But mine was only 1/10-scale!
|
| You're correct this isn't really a Cray other than the vaguely
| rounded shape, but it is still adorable and looks super fun to
| program.
| tleb_ wrote:
| About I2C addressing: I've thought about the issue. That solution
| could fail in theory because of collisions, i.e. a micro-controller
| cannot atomically check the assert pin and raise it if low. The
| project uses 1000ms + rand(0, 5000ms), which sounds big enough.
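|
| To put a rough number on "big enough", here is a small sketch. It
| assumes a deliberately simple model that is mine, not the
| project's: every board draws one delay uniformly from the 5000ms
| range, and a collision happens when two boards check the pin
| within some window of each other. The window length and the board
| count of 8 are guesses:

```python
def collision_probability(n_boards, window_ms, spread_ms=5000.0):
    """Chance that at least two of n_boards pick delays closer than window_ms.

    The fixed 1000ms offset is common to all boards and cancels out.
    """
    if n_boards < 2:
        return 0.0
    blocked = (n_boards - 1) * window_ms
    if blocked >= spread_ms:
        return 1.0
    # For n independent uniform draws on [0, spread], the chance that
    # every pair is more than `window` apart is
    # (1 - (n - 1) * window / spread) ** n.
    return 1.0 - (1.0 - blocked / spread_ms) ** n_boards


if __name__ == "__main__":
    for window in (0.01, 0.1, 1.0):
        p = collision_probability(8, window)
        print(f"window {window} ms: P(collision) = {p:.4%}")
```

| Under these assumptions the risk scales with the window, so the
| real question is how long the check-then-raise sequence takes on
| the Pico.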
|
| I've always wondered if a solution based on multi-master I2C
| support could work: all children shout a unique ID and the parent
| answers with their assigned I2C address. Children that do not win
| the arbitration will detect collisions and stop because of
| arbitration errors. Unsure how well it could work with many
| children.
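|
| A toy model of that idea, to show why exactly one child survives
| each round. It only models the wired-AND data line, bit by bit,
| MSB first. The 8-bit IDs, the starting address 0x10 and the
| function names are all made up for illustration, and real I2C
| timing, clock stretching and ACKs are ignored:

```python
def arbitrate(ids, bits=8):
    """Return the ID that wins wired-AND arbitration among unique ids."""
    contenders = set(ids)
    if not contenders:
        raise ValueError("need at least one contender")
    for bit in reversed(range(bits)):  # MSB first
        # Open-drain bus: the line reads 0 if any contender drives 0.
        bus = min((i >> bit) & 1 for i in contenders)
        # A contender that sent 1 but reads back 0 has lost and stops.
        contenders = {i for i in contenders if ((i >> bit) & 1) == bus}
    (winner,) = contenders  # unique IDs leave exactly one survivor
    return winner


def assign_addresses(ids, first_address=0x10, bits=8):
    """Repeat arbitration rounds; the parent hands each winner an address."""
    remaining = set(ids)
    table = {}
    address = first_address
    while remaining:
        winner = arbitrate(remaining, bits)
        table[winner] = address
        address += 1
        remaining.discard(winner)  # the winner stops shouting
    return table


if __name__ == "__main__":
    for uid, addr in assign_addresses([0x5A, 0x3C, 0x71]).items():
        print(f"id 0x{uid:02X} -> address 0x{addr:02X}")
```

| In this model the lowest ID always wins, so the number of rounds
| equals the number of children, which at least bounds how long
| enumeration takes.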
|
| See https://www.i2c-bus.org/multimaster/ and
| https://www.i2c-bus.org/i2c-primer/clock-generation-stretchi...
| for introductions to I2C multimaster and arbitration. The spec is
| of course not open.
|
| I'd be interested in discussing the topic more!
| dfox wrote:
| I vaguely remember that Access.BUS (I2C based PC peripheral bus
| that was kind of a precursor to USB) did address assignment
| using somewhat similar protocol.
| markus_zhang wrote:
| Nicely done. I wonder if it can host a terminal-based BBS server
| and if so how many users it can serve. But I guess we need to
| figure out storage first. Is an SD card good for the Pico?
| billpg wrote:
| How many Raspberry Pis have the equivalent processing power to an
| original Cray-1?
|
| (I would not be at all surprised if a single RPi were as powerful
| as twenty Cray-1s.)
| AquaLineSpirit wrote:
| A Raspberry Pi 4B has 13.5 GFLOPS[0] while a Cray-1 has 160
| MFLOPS[1] so you need about 85 Cray-1s :)
|
| Couldn't find any numbers for a Pi Pico.
|
| 0:
| https://web.eece.maine.edu/~vweaver/group/green_machines.htm...
| 1: https://en.wikipedia.org/wiki/Cray-1
| ta988 wrote:
| Are you talking about Pis or Picos, like the ones mentioned in
| this article?
| thrtythreeforty wrote:
| Wikipedia says the Cray-1 was capable of 160 MFLOPS.
|
| As a rule of thumb, modern scalar pipelines can sustain one ALU
| op per cycle, and you see nearly all Linux-capable CPUs quoted
| in GHz. So we should expect gigaflops, minimum. And indeed [1]
| suggests that the Pi 4 is capable of 13.5 GFLOPS, so about 84
| Cray-1s. (The further speedups come from the fact that ARM
| also has NEON vector instructions, and from multiple cores.)
|
| The Pi _Pico_ , on the other hand, does not have a floating
| point unit. So it emulates it in software (soft float). The C
| SDK docs [2] suggest 13.8kHz (!) operation for single-precision
| add. I'll be generous and suggest that the 2x cores could
| double this performance. So then, it'd achieve about 0.017% of
| the Cray-1's performance. Oof.
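|
| The arithmetic, spelled out as a quick sketch. The three inputs
| are just the figures quoted in this comment (160 MFLOPS, 13.5
| GFLOPS, 13.8k soft-float adds per second), and perfect scaling
| across both cores is an optimistic assumption:

```python
CRAY1_FLOPS = 160e6              # Cray-1 figure quoted from Wikipedia
PI4_FLOPS = 13.5e9               # Pi 4 figure quoted from [1]
PICO_SOFT_FADD_PER_SEC = 13.8e3  # per-core soft-float add rate quoted above


def pi4_in_cray1s():
    """How many Cray-1s one Pi 4 is worth on these numbers."""
    return PI4_FLOPS / CRAY1_FLOPS


def pico_fraction_of_cray1(cores=1):
    """Pico soft-float throughput as a fraction of a Cray-1."""
    return cores * PICO_SOFT_FADD_PER_SEC / CRAY1_FLOPS


if __name__ == "__main__":
    print(f"Pi 4 = {pi4_in_cray1s():.1f} Cray-1s")
    print(f"Pico, 1 core  = {pico_fraction_of_cray1(1):.4%} of a Cray-1")
    print(f"Pico, 2 cores = {pico_fraction_of_cray1(2):.4%} of a Cray-1")
```

| On these inputs one core comes out near 0.0086% of a Cray-1 and
| both cores together near 0.017%.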
|
| If you're willing to do integer arithmetic, things look much
| better for the Pico, of course - it runs at 125MHz and the
| above scalar rule of thumb applies.
|
| [1]:
| https://web.eece.maine.edu/~vweaver/group/green_machines.htm...
|
| [2]: https://datasheets.raspberrypi.com/pico/raspberry-pi-
| pico-c-...
| crest wrote:
| The original Cray 1 already had a very powerful vector
| execution unit and low memory latency relative to the CPU
| frequency (by today's standards). The RP2040 has even lower
| memory latency, but only 1/4 MiB RAM and no hardware floating
| point support or SIMD (neither packed-SIMD nor "true" Cray
| style vectors). At least the BootROM includes hand optimised
| Soft-FP code and the single-cycle I/O block even includes a
| memory mapped hardware integer divider and two interpolators
| per CPU core. These can help a lot with fixed point DSP
| workloads, but are totally different from what made the Cray 1
| special in its time. Today's better MCUs may match certain
| performance numbers of early supercomputers, but their designs
| have more differences than similarities. There have been
| attempts to recreate early Crays in FPGAs, but so little
| software for them has been preserved (in a way accessible to
| the public) that it's difficult to judge how good the
| recreations are and even the later Cray 1 based designs need a
| really big FPGA to reimplement all relevant pieces at least as
| fast as the original. Cray didn't just build fast floating
| point adders and multipliers, but the main memories and
| interconnects between them to make them useful. Good luck
| finding a way to attach 100s of interleaved DRAM banks to your
| FPGA, because your internal block RAM won't be enough as main
| memory and memory access timings have improved the least over
| time.
| ignite wrote:
| And what's the relative power consumption? :-)
| adestefan wrote:
| Add in cooling requirements, too.
| mechagodzilla wrote:
| A Cray-1 ran at 80 MHz, and with careful coding could sustain
| about 2 double-precision floating point operations per cycle -
| so 160 MFLOPS. Looking at linpack benchmarks, it looks like
| it's right around a Raspberry Pi 2 running at 1 GHz (169 DP
| MFLOPS), and a little worse than a Raspberry Pi 3 at 180
| MFLOPS.
___________________________________________________________________
(page generated 2023-04-21 23:01 UTC)