https://axio.ms/projects/2024/06/16/MicroMac.html
MicroMac, a Macintosh for under £5
Jun 16, 2024
A microcontroller Macintosh
This all started from a conversation about the RP2040 MCU, and
building a simple desktop/GUI for it. I'd made a comment along the
lines of "or, just run some old OS", and it got me thinking about the
original Macintosh.
The original Macintosh was released 40.5 years before this post, and
is a pretty cool machine especially considering that the hardware is
very simple. Insanely Great and folklore.org are fun reads, and give
a glimpse into the Macintosh's development. Memory was a squeeze; the
original 128KB version was underpowered and only sold for a few
months before being replaced by the Macintosh 512K, arguably a more
appropriate amount of memory.
But, the 128 still runs some real applications and, though it
pre-dates MultiFinder/actual multitasking, I found it pretty
charming. As a tourist. In 1984 the Mac cost roughly 1/3 as much as a
VW Golf and, as someone who's into old computers and old cars, it's
hard to decide which is more frustrating to use.
So back to this £3.80 RPi Pico microcontroller board: The RP2040's
264KB of RAM gives a lot to play with after carving out the Mac's
128KB - how cool would it be to do a quick hack, and play with a Mac
on it?
Time passes. A lot of time. But I totally delivered on the janky hack
front:
[DEL:You won't believe that this quality item didn't take that long
to build.:DEL] So the software was obviously the involved part, and
turned into work on 3 distinct projects.
This post is going to be a "development journey" story, as a kind of
code/design/venting narrative. If you're just here for the pictures,
scroll along!
What is pico-mac?
A Raspberry Pi RP2040 microcontroller (on a Pico board), driving
monochrome VGA video and taking USB keyboard/mouse input, emulating a
Macintosh 128K computer and disc storage. The RP2040 has easily
enough RAM to house the Mac's memory, plus that of the emulator; it's
fast enough (with some tricks) to meet the performance of the real
machine, has USB host capability, and the PIO department makes
driving VGA video fairly uneventful (with some tricks). The basic
Pico board's 2MB of flash is plenty for a disc image with OS and
software.
Here's the Pico MicroMac in action, ready for the paperless office of
the future:
The Pico MicroMac RISC CISC workstation of the future
I hadn't really used a Mac 128K much before; a few clicks on a museum
machine once. But I knew they ran MacDraw, and MacWrite, and
MacPaint. All three of these applications are pretty cool for a 128K
machine; a largely WYSIWYG word processor with multiple fonts, and a
vector drawing package.
A great way of playing with early Macintosh system software, and
applications of these wonderful machines, is via
https://infinitemac.org, which has packaged up the Mini vMac
emulator, built with Emscripten, to run in the browser. Highly
recommended, lots to play with.
As a spoiler, MicroMac does run MacDraw, and it was great to play
with it on "real fake hardware":
(Do you find "Pico Micro Mac" doesn't really scan? I didn't think
this taxonomy through, did I?)
GitHub links are at the bottom of this page: the pico-mac repo has
construction directions if you want to build your own!
The journey
Back up a bit. I wasn't committed to building a Pico thing, but was
vaguely interested in whether it was feasible, so started tinkering
with building a Mac 128K emulator on my normal computer first.
The three rules
I had a few simple rules for this project:
1. It had to be fun. It's OK to hack stuff to get it working, it's
not as though I'm being paid for this.
2. I like writing emulation stuff, but I really don't want to learn
68K assembler, or much about the 68K. There's a lot of love for
68K out there and that's cool, but meh I don't adore it as a CPU.
So, right from the outset I wanted to use someone else's 68K
interpreter - I knew there were loads around.
3. Similarly, there are a load of OSes whose innards I'd like to
learn more about, but the shittiest early Mac System software
isn't high on the list. Get in there, emulate the hardware, boot
the OS as a black box, done.
I ended up breaking two, and sometimes all three, of these rules
during this project.
The Mac 128K
The machines are generally pretty simple, and of their time. I
started with schematics and Inside Macintosh, PDFs of which covered
various details of the original Mac hardware, memory map, mouse/
keyboard, etc.
* https://tinkerdifferent.com/resources/macintosh-128k-512k-schematics.79/
* https://vintageapple.org/inside_o/ Inside Macintosh Volumes I-III
are particularly useful for hardware information; also Guide to
Macintosh Family Hardware 2nd Edition.
The Macintosh has:
* A Motorola 68000 CPU running at [DEL:7.whatever MHz:DEL] roughly
8MHz
* Flat memory, decoded into regions for memory-mapped IO going to
the 6522 VIA, the 8530 SCC, and the IWM floppy controller. (Some
of the address decoding is a little funky, though.)
* Keyboard and mouse hang off the VIA/SCC chips.
* No external interrupt controller: the 68K has 3 IRQ lines, and
there are 3 IRQ sources (VIA, SCC, programmer switch/NMI).
* "No slots" or expansion cards.
* No DMA controller: a simple autonomous PAL state machine scans
video (and audio samples) out of DRAM. Video is fixed at 512x342
1BPP.
* The only storage is an internal FDD (plus an external drive),
driven by the IWM chip.
The first three Mac models are extremely similar:
* The Mac 128K and Mac 512K are the same machine, except for RAM.
* The Mac Plus added SCSI to a convenient space in the memory map
and an 800K floppy drive, which is double-sided whereas the
original was a single 400K side.
* The Mac Plus ROM also supports the 128K/512K, and was an upgrade
to create the Macintosh 512Ke. 'e' for Extra ROM Goodness.
The Mac Plus ROM supports the HD20 external hard disc, and HFS, and
Steve Chamberlin has annotated a disassembly of it. This was the ROM
to use: I was making a Macintosh 128Ke.
Mac emulator: umac
After about 8 minutes of research, I chose the Musashi 68K
interpreter. It's C, simple to interface to, and had a simple
out-of-box example of a 68K system with RAM, ROM, and some IO.
Musashi is structured to be embedded in bigger projects: wire in
memory read/write callbacks, a function to raise an IRQ, call execute
in a loop, done.
I started building an emulator around it, which ultimately became the
umac project. The first half (of, say, five halves) went pretty well:
1. A simple commandline app loading the ROM image, allocating RAM,
providing debug messages/assertions/logging, and configuring
Musashi.
2. Add address decoding: CPU reads/writes are steered to RAM, or
ROM. The "overlay" register lets the ROM boot at 0x00000000 and
then trampoline up to a high ROM mirror after setting up CPU
exception vectors - this affects the address decoding. This is
done by poking a VIA register, so decoded just that bit of that
register for now.
3. At this point, the ROM starts running and accessing more
non-existent VIA and SCC registers. Added more decoding and a
skeleton for emulating these devices elsewhere - the MMIO read/
writes are just stubbed out.
4. There are some magic addresses that the ROM accesses that "miss"
documented devices: there's a manufacturing test option that
probes for a plugin (just thunk it), and then we witness the RAM
size probing. The Mac Plus ROM is looking for up to 4MB of RAM.
In the large region devoted to RAM, the smaller amount of actual
RAM is mirrored over and over, so the probe writes a magic value
at high addresses and spots where it starts to wrap around.
5. RAM is then initialised and filled with a known pattern. This was
an exciting point to get to because I could dump the RAM, convert
the region used for the video framebuffer into an image, and see
the "diagonal stripe" pattern used for RAM testing! "She's alive!"
6. Not all of the device code enjoyed reading all zeroes, so there
was a certain amount of referring to the disassembly and
returning, uh, 0xffffffff sometimes to push it further. The goal
was to get it as far as accessing the IWM chip, i.e. trying to
load the OS.
7. After seeing some IWM accesses there and returning random rubbish
values, the first wonderful moment was getting the "Unknown Disc"
icon with the question mark - real graphics! The ROM was REALLY
DOING SOMETHING!
8. I think I hadn't implemented any IRQs at this point, and found
the ROM in an infinite loop: it was counting a few Vsyncs to
delay the flashing question mark. Diversion into a better VIA,
with callbacks for GPIO register read/write, and IRQ handling.
This also needed to wire into Musashi's IRQ functions.
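The address decoding and RAM-size probe from steps 2 and 4 can be sketched roughly as below. This is a minimal illustration, not umac's actual code: the region boundaries (ROM mirror at 0x400000, RAM visible at 0x600000 while the overlay is set) are my reading of the published Mac memory map, the names decode, bus_read8, bus_write8 and probe_ram_size are made up for the example, and the probe walks upwards through power-of-two addresses rather than reproducing the ROM's exact sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAM_SIZE (128u * 1024u)  /* emulated RAM; a power of two so masking works */
#define ROM_SIZE (128u * 1024u)  /* Mac Plus ROM image size */

enum region { REGION_RAM, REGION_ROM, REGION_IO };

static uint8_t ram[RAM_SIZE];
static bool overlay = true;      /* set at reset; the ROM clears it via a VIA bit */

/* Map a CPU address to a region and an offset within that region. */
static enum region decode(uint32_t addr, uint32_t *offset)
{
    addr &= 0x00ffffffu;                   /* the 68000 drives 24 address bits */
    if (addr < 0x400000u) {                /* low 4MB: RAM, or ROM while overlaid */
        *offset = addr & ((overlay ? ROM_SIZE : RAM_SIZE) - 1u);
        return overlay ? REGION_ROM : REGION_RAM;
    }
    if (addr < 0x600000u) {                /* high ROM mirror */
        *offset = addr & (ROM_SIZE - 1u);
        return REGION_ROM;
    }
    if (addr < 0x800000u && overlay) {     /* RAM stays reachable here during overlay */
        *offset = addr & (RAM_SIZE - 1u);
        return REGION_RAM;
    }
    *offset = addr;
    return REGION_IO;                      /* VIA, SCC, IWM: stubbed out here */
}

static uint8_t bus_read8(uint32_t addr)
{
    uint32_t off;
    return decode(addr, &off) == REGION_RAM ? ram[off] : 0;  /* ROM/IO reads stubbed */
}

static void bus_write8(uint32_t addr, uint8_t val)
{
    uint32_t off;
    if (decode(addr, &off) == REGION_RAM)
        ram[off] = val;                    /* writes elsewhere are dropped */
}

/* Write a marker at each power-of-two address and watch for it appearing
 * at address 0: that is where the mirrored RAM has wrapped around. */
static uint32_t probe_ram_size(void)
{
    assert(!overlay);                      /* RAM must be mapped at 0 to probe it */
    bus_write8(0, 0x00);
    for (uint32_t size = 1024u; size < 0x400000u; size <<= 1) {
        bus_write8(size, 0xa5);
        if (bus_read8(0) == 0xa5)
            return size;
    }
    return 0x400000u;
}
```

With the overlay cleared, probe_ram_size() finds the wrap at 128KB, because the mask in decode() is what creates the mirroring the probe relies on.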
This was motivating to get to - remembering rule #1 - and "graphics",
even though via a manual memory dump/ImageMagick conversion, was
great.
I knew the IWM was an "interesting" chip, but didn't know details. I
planned to figure it out when I got there (rule #1).
IWM, 68K, and disc drivers
My god, I'm glad I put IWM off until this point. If I'd read the
"datasheet" (vague register documentation) first, I'd've just gone to
the pub instead of writing this shitty emulator.
IWM is very clever, but very very low-level. The disc controllers in
other contemporary machines, e.g. WD1770, abstract the disc physics.
At one level, you can poke regs to step to track 17 and then ask the
controller to grab sector 3. Not so with IWM: first, the discs are
Constant Linear Velocity, meaning the rotation speed needs to
change to suit whichever track you're on, and second the IWM
just gives the CPU a firehose of crap from the disc head (with
minimal decoding). I spent a while reading through the disassembly of
the ROM's IWM driver (breaking rule #2 and rule #1): there's some
kind of servo control loop where the driver twiddles PWM values sent
to a DAC to control the disc motor, measured against a VIA timer
reference to do some sort of dynamic rate-matching to get the correct
bitrate from the disc sectors. I think once it finds the track start
it then streams the track into memory, and the driver decodes the
symbols (more clever encoding) and selects the sector of interest.
I was sad. Surely Basilisk II and Mini vMac etc. had solved this in
some clever way - they emulated floppy discs. I learned they do not,
and do the smart engineering thing instead: avoid the problem.
The other emulators do quite a lot of ROM patching: the ROM isn't run
unmodified. You can argue that this then isn't a perfect hardware
emulation if you're patching out inconvenient parts of the ROM, but
so what. I suspect they were also abiding by a rule #1 too.
I was going to do the same: I figured out a bit of how the Mac driver
interface works (gah, rule #3!) and understood how the other
emulators patched this. They use a custom paravirtualised 68K driver
which is copied over the ROM's IWM driver, servicing .Sony requests
from the block layer and routing them to more convenient host-side
code to manage the requests. Basilisk II uses some custom 68K opcodes
and a simple driver, and Mini vMac a complex driver with trappy
accesses to a custom region of memory. I reused the Basilisk II
driver but converted to access a trappy region (easier to route: just
emulate another device). The driver callbacks land in the host/C side
and some cut-down Basilisk II code interprets the requests and copies
data to/from the OS-provided buffers. Right now, all I needed was to
read blocks from one disc: I didn't need different formats (or even
write support), or multiple drives, or ejecting/changing images.
Getting the first block loaded from disc took waaaayyy longer than
the first part. And, I'd had to learn a bit of 68K (gah), but just in
the nick of time I got a Happy Mac icon as the System software
started to load.
This was still a simple Linux commandline application, with zero UI.
No keyboard or mouse, no video. Time to wrap it in an SDL2 frontend
(the unix_main test build in the umac project), and I could watch the
screen redraw live. I hadn't coded the 1Hz timer interrupt into the
VIA, and after adding that it booted to a desktop!
The first boot
As an aside, I try to create a dual-target build for all my embedded
projects, with a native host build for rapid prototyping/debugging;
libSDL instead of an LCD. It means I don't need to code at the MCU,
so I can code in the garden. :)
Next was mouse support. Inside Macintosh and the schematics show how
it's wired, to the VIA (good) and the SCC (a beast). The SCC is my
second least-favourite chip in this machine; it's complex and the
datasheet/manual seems to be intentionally written to hide
information, piss off readers, get one back at the world. (I didn't
go near the serial side, its main purpose, just external IRQ
management. But, it'll do all kinds of exciting 1980s line coding
schemes, offloading bitty work from the CPU. It was key for
supporting things like AppleTalk.)
Life was almost complete at this point; with a working mouse I could
build a new disc image (using Mini vMac, an exercise in itself) with
Missile Command. This game is pretty fun for under 10KB on disc.
So:
* Video works
* Boots from disc
* Mouse works, Missile Command
I had no keyboard, but it's largely working now. Time to start on
sub-project numero due:
Hardware and RP2040
Completely unrelated to umac, I built up a circuit and firmware with
two goals:
1. Display 512x342x1 video to VGA with minimal components,
2. Get the TinyUSB HID example working and integrated.
This would just display a test image copied to a framebuffer, and
printf() keyboard/mouse events, as a PoC. The video portion was fun:
I'd done some I2S audio PIO work before, but here I wanted to scan
out video and arbitrarily control Vsync/Hsync.
Well, to test I needed a circuit. VGA wants 0.7V max on the video
R,G,B signals and (mumble, some volts) on the syncs. The R,G,B
signals are 75Ω to ground: with some maths, a 3.3V GPIO driving all
three through a 100Ω resistor is roughly right.
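The "some maths" is a plain resistor divider. Here is a small sketch of it, assuming (my reading of the text) one series resistor feeding all three colour inputs tied together, an ideal 3.3V source, and DC behaviour only; vga_level_volts is a name invented for the example.

```c
#include <assert.h>

/* Voltage at the monitor inputs when one GPIO drives `n_loads` identical
 * terminated inputs through a single series resistor. */
static double vga_level_volts(double v_gpio, double r_series, double r_term, int n_loads)
{
    assert(n_loads > 0);
    double r_parallel = r_term / n_loads;   /* identical loads in parallel */
    return v_gpio * r_parallel / (r_series + r_parallel);
}
```

Three 75Ω loads in parallel are 25Ω, so vga_level_volts(3.3, 100.0, 75.0, 3) works out to 3.3 × 25 / 125 = 0.66V, just under the 0.7V maximum.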
The day I started soldering it together I needed a VGA connector. I
had a DB15 but wanted it for another project, and felt bad about
cutting up a VGA cable. But when I took a walk at lunchtime, no
shitting you, I passed some street cables. I had a VGA cable - the
rust helps with the janky aesthetic.
Free VGA cable
The VGA PIO side was pretty fun. It ended up as PIO reading config
info dynamically to control Hsync width, display position, and so on,
and then some tricks with DMA to scan out the config info interleaved
with framebuffer data. By shifting the bits in the right direction
and by using the byteswap option on the RP2040 DMA, the big-endian
Mac framebuffer can be output directly without CPU-side copies or
format conversion. Cool. This can be fairly easily re-used in other
projects: see video.c.
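Why the byteswap makes this work can be checked on a host machine. The sketch below is an illustration under stated assumptions, not code from video.c: it models the DMA's 32-bit read as a little-endian load, applies a byte swap, and compares an MSB-first shift-out against the Mac's framebuffer layout (leftmost pixel in the top bit of the first byte). All function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: pixel n of a scanline, leftmost pixel in the MSB of byte 0. */
static int fb_pixel(const uint8_t *line, unsigned n)
{
    return (line[n / 8] >> (7 - n % 8)) & 1;
}

/* What a little-endian 32-bit read of four framebuffer bytes produces. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Software stand-in for the DMA channel's byteswap option. */
static uint32_t bswap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u) |
           ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Pixel n (0..31) as an MSB-first (shift-left) output shifter would emit it. */
static int shifted_pixel(uint32_t fifo_word, unsigned n)
{
    assert(n < 32);
    return (fifo_word >> (31 - n)) & 1;
}
```

For the bytes 0x80, 0x01, 0xf0, 0x3c, the swapped word is 0x8001f03c, and shifting it out MSB-first gives the same 32 pixels, in the same order, as reading the framebuffer bytes directly.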
But. I ended up (re)writing the video side three times in total:
First version had two DMA channels writing to the PIO TX FIFO. The
first would transfer the config info, then trigger the second to
transfer video data, then raise an IRQ. The IRQ handler would then
have a short time (the FIFO depth!) to choose a new framebuffer
address to read from, and reprogram DMA. It worked OK, but was highly
sensitive to other activity in the system. First and most obvious fix
is that any latency-sensitive IRQ handler must have the
__not_in_flash_func() attribute so as to run out of RAM. But even
with that, the design didn't give much time to reconfigure the DMA:
random glitches and blanks occurred when moving the mouse rapidly.
Second version did double-buffering with the goal of making the IRQ
handler's job trivial: poke in a pre-prepared DMA config quickly,
then after the critical rush calculate the buffer to use for next
time. Lots better, but still some glitches under high load. Even
weirder, it'd sometimes just blank out completely, requiring a reset.
This was puzzling for a while; I ended up printing out the PIO FIFO's
FDEBUG register to try to catch the bug in the act. I saw that the
TXOVER overflow flag was set, and this should be impossible: the
FIFOs pull data from DMA on demand with DMA requests and a credited
flow-contr...OH WAIT. If credits get messed up or duplicated, too many
transfers can happen, leading to an overflow at the receiver side.
Well, I'd missed a subtle rule in the RP2040 DMA docs:
Another caveat is that multiple channels should not be connected
to the same DREQ.
So the third version...... doesn't break this rule, and is more
complicated as a result:
* One DMA channel transfers to the PIO TX FIFO
* Another channel programs the first channel to send from the
config data buffer
* A third channel programs the first to send the video data
* The programming of the first triggers the corresponding "next
reprogram me" channel
The nice thing - aside from no lock-ups or video corruption - is that
this now triggers a Hsync IRQ during the video line scan-out, greatly
relaxing the deadline of reconfiguring the DMA. I'd like to further
improve this (with yet another DMA channel) to transfer without an
IRQ per line, as the current IRQ overhead of about 1% of CPU time can
be avoided.
(It would've been simpler to just hardwire the VGA display timing in
the PIO code, but I like (for future projects) being able to
dynamically-reconfigure the video mode.)
So now we have a platform and firmware framework to embed umac into,
HID in and video out. The hardware's done, fuggitthat'lldo, let's
throw it over to the software team:
How it all works
Back to emulating things
A glance at the native umac binary showed a few things to fix before
it could run on the Pico:
* Musashi constructed a huge opcode decode jumptable at runtime, in
RAM. It's never built differently, and never changes at runtime.
I added a Musashi build-time generator so that this table could
be const (and therefore live in flash).
* The disassembler was large, and not going to be used on the Pico,
so another option to build without.
* Musashi tries to accurately count execution cycles for each
instruction, with more large lookup tables. Maybe useful for
console games, but the Mac doesn't have the same degree of timing
sensitivity. REMOVED.
(This work is in my small-build branch.)
pico-mac takes shape, with the ROM and disc image in flash, and
enjoyably it now builds and runs on the Pico! With some careful
attention to not shoving stuff in RAM, the RAM use is looking pretty
good. The emulator plus HID code is using about 35-40KB on top of the
Mac's 128KB RAM area - there's 95+KB of RAM still free.
This was a good time to finish off adding the keyboard support to
umac. The Mac keyboard is interfaced serially through the VIA 'shift
register', a basic synchronous serial interface. This was logically
simple, but frustrating because early attempts at replying to the
ROM's "init" command were persistently ignored. The ROM
disassembly was super-useful again: reading the keyboard init code,
it looked like a race condition in interrupt acknowledgement if the
response byte appears too soon after the request is sent. Shoved in a
delay to hold off a reply until a later poll, and then it was just a
matter of mapping keycodes (boooooorrrriiiiing).
With a keyboard, the end-of-level MacWrite boss is reached:
One problem though: it totally sucked. It was suuuuper slow. I added
a 1Hz dump of instruction count, and it was doing about 300 KIPS.
The 68000 isn't an amazing CPU in terms of IPC. Okay, there are some
instructions that execute in 4 cycles. But you want to use those
extravagant addressing modes don't you, and touching memory is
spending those cycles all over the place. Not an expert, but
targeting about 1 MIPS for a roughly 8MHz 68000 seems right. Only 3x
improvement needed.
Performance
I didn't say I wasn't gonna cheat: let's run that Pico at 250MHz
instead of 125MHz. Okay better, but not 2x better. From memory, only
about 30% better. Damn, no free lunch today.
Musashi has a lot of configurable options. My first goal was to get
its main loop (as seen from disassembly/post-compile end!) small: the
Mac doesn't report Bus Errors, so the registers don't need copies for
unwinding. The opcodes are always fetched from a 16b boundary, so
don't need alignment checking, and can use halfword loads (instead of
two byte loads munged into a halfword!). For the Cortex-M0+/armv6m
ISA, reordering some of the CPU context structure fields enabled
immediate-offset access and better code. The CPU type, mysteriously,
was dynamically-changeable and led to a bunch of runtime indirection.
Looking better, maybe 2x improvement, but not enough. Missile Command
was still janky and the mouse wasn't smooth!
Next, some naughty/dangerous optimisations: remove address alignment
checking, because unaligned accesses don't happen in this constrained
environment.
(This work is in my umac-hacks branch.)
But the real perf came from a different trick. First, a diversion!
RP2040 memory access
The RP2040 has fast RAM, which is multi-banked so as to allow
generally single-cycle access to multiple users (2 CPUs, DMA, etc.).
Out of the box, most code runs via XIP from external QSPI flash. The
QSPI usually runs at the core clock (125MHz default), but has a
latency of ~20 cycles for a random word read. The RP2040 uses a
relatively simple 16KB cache in front of the flash to protect you
from horrible access latency, but the more code you have the more
likely you are to call a function and have to crank up QSPI. When
overclocking to 250MHz, the QSPI can't go that fast so stays at
125MHz (I think). Bear in mind, then, that your 20ish QSPI cycles on
a miss become 40ish CPU cycles.
The particular rock-and-a-hard-place here is that Musashi build-time
generates a ton of code, a function for each of its 1968 opcodes,
plus that 256KB opcode jumptable. Even if we make the inner execution
loop completely free, the opcode dispatch might miss in the flash
cache, and the opcode function itself too. (If we want to get 1 MIPS
out of about 200 MIPS, a few of these delays are going to really add
up.)
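To put rough numbers on that, here is a back-of-envelope sketch using only the figures quoted above (250MHz core, QSPI assumed to stay at 125MHz, about 20 QSPI cycles per miss, 1 MIPS target). The function names are invented, and it treats one clock as one unit of work, which is generous to the Cortex-M0+.

```c
#include <assert.h>
#include <stdint.h>

/* Host clock cycles available per emulated instruction. */
static uint32_t cycles_per_guest_insn(uint32_t host_hz, uint32_t guest_ips)
{
    assert(guest_ips > 0);
    return host_hz / guest_ips;
}

/* Cost of one flash-cache miss in CPU cycles when QSPI runs slower than the core. */
static uint32_t miss_cost_cpu_cycles(uint32_t qspi_cycles, uint32_t cpu_hz, uint32_t qspi_hz)
{
    assert(qspi_hz > 0);
    return (uint32_t)((uint64_t)qspi_cycles * cpu_hz / qspi_hz);
}
```

At 250MHz and 1 MIPS the budget is 250 cycles per 68K instruction, and each miss costs about 40 of them. An instruction that misses on both the jumptable lookup and the opcode function has spent roughly a third of its budget before doing any work.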
The __not_in_flash_func() attribute can be used to copy a given
function into RAM, guaranteeing fast execution. At the very minimum,
the main loop and memory accessors are decorated: every instruction
is going to access an opcode and most likely read or write RAM.
This improves performance a few percent.
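For reference, the decoration looks like the sketch below. The fallback #define is an assumption added so that the same source also compiles in a native host build, where the Pico SDK header that provides the macro isn't included; ram_read16 is a hypothetical accessor, not a function from umac or Musashi.

```c
#include <assert.h>
#include <stdint.h>

/* On the Pico, include the SDK headers first so their definition wins.
 * On a host build, fall back to a no-op that just yields the name. */
#ifndef __not_in_flash_func
#define __not_in_flash_func(name) name
#endif

#define RAM_SIZE (128u * 1024u)
static uint8_t ram[RAM_SIZE];

/* Hot-path accessor: read a big-endian 16-bit value from emulated RAM. */
static uint16_t __not_in_flash_func(ram_read16)(uint32_t addr)
{
    /* Wrap into the RAM buffer and force an even address, since 68000
     * word accesses are always even-aligned. */
    addr &= (RAM_SIZE - 1u) & ~1u;
    assert(addr + 1u < RAM_SIZE);
    return (uint16_t)((ram[addr] << 8) | ram[addr + 1u]);
}
```

On the Pico the macro places the function in a RAM section, so calling it never waits on the flash cache; on the host it is an ordinary function.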
Then, I tried decorating whole classes of opcodes: move is frequent,
as are branches, so put 'em in RAM. This helped a lot, but the
remaining free RAM was used up very quickly, and I wasn't at my goal
of much above 1 MIPS.
Remember that RISC architecture is gonna change everything?
We want to put some of those 1968 68K opcodes into RAM to make them
fast. What are the top 10 most often-used instructions? Top 100? By
adding a 64K table of counters to umac, booting the Mac and running
key applications (okay, playing Missile Command for a bit), we get a
profile of dynamic instruction counts. It turns out that the 100
hottest opcodes (5% of the total) account for 89% of the execution.
And the top 200 account for a whopping 98% of execution.
Armed with this profile, the umac build post-processes the Musashi
auto-generated code and decorates the top 200 functions with
__not_in_flash_func(). This adds only 17KB of extra RAM usage
(leaving 95KB spare), and hits about 1.4 MIPS! Party on!
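The profiling step can be sketched as follows. This is a simplified stand-in rather than the umac build tooling: it counts per 16-bit opcode word and reports what share of execution the hottest n cover, whereas the real flow also has to map opcode words onto Musashi's generated handler functions before decorating them. profile_opcode and top_n_coverage are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_OPCODES 65536u

static uint64_t op_count[NUM_OPCODES];    /* one counter per 16-bit opcode word */

/* Call once per executed instruction. */
static void profile_opcode(uint16_t opcode)
{
    op_count[opcode]++;
}

/* qsort comparator: larger counts sort first. */
static int cmp_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Percentage of all executed instructions covered by the n hottest opcodes. */
static double top_n_coverage(unsigned n)
{
    static uint64_t sorted[NUM_OPCODES];
    uint64_t total = 0, top = 0;

    assert(n <= NUM_OPCODES);
    for (unsigned i = 0; i < NUM_OPCODES; i++) {
        sorted[i] = op_count[i];
        total += op_count[i];
    }
    qsort(sorted, NUM_OPCODES, sizeof sorted[0], cmp_desc);
    for (unsigned i = 0; i < n; i++)
        top += sorted[i];
    return total ? 100.0 * (double)top / (double)total : 0.0;
}
```

Calling profile_opcode() from the emulator's dispatch loop on a host build, then printing top_n_coverage(100) and top_n_coverage(200) after a session, would give the kind of figures quoted above.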
At last, the world can enjoy Missile Command's dark subject matter in
performant comfort:
Missile Command on pico-mac Missile Command on pico-mac
What about MacPaint?
Everyone loves MacPaint. Maybe you love MacPaint, and have noticed
I've deftly avoided mentioning it. Okay, FINE:
There is not enough memory for MacPaint!
It doesn't run on a Mac 128Ke, because the Mac Plus ROM uses more RAM
than the original. :sad-face:
I'd seen this thread on 68kMLA about a "Mac 256K":
https://68kmla.org/bb/index.php?threads/the-mythical-mac-256k.46149/
Chances are that the Mac 128K was really a Mac 256K in the lab (or
maybe even intended to have 256K and cost-cut before release), as the
OS functions fine with 256KB.
I wondered, does the Mac ROM/OS need a power-of-two amount of RAM? If
not, I have that 95K going spare. Could I make a "Mac 200K", and then
run precious MacPaint?
Well, I tried a local hack that patches the ROM to update its global
memTop variable based on a given memory size, and yes, System 3.2 is
happy with non-power-of-2 sizes. I booted with 256K, 208K, and 192K.
However, there were some additional problems to solve: the ROM
memtest craps itself without a power-of-2 size (totally fair), and
NOPping that out leads to other issues. These can be fixed, though
some parts of boot also access off the end of RAM. A power-of-2 size
means a cheap address mask wraps RAM accesses to the valid buffer,
and that can't be done with 192K.
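The difference is easy to see in a few lines. This is an illustrative sketch with made-up function names; the modulo version is just one possible way to handle an arbitrary size, not something the emulator is stated to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* Cheap wrap: a single AND, valid only when ram_size is a power of two. */
static uint32_t wrap_mask(uint32_t addr, uint32_t ram_size)
{
    assert(is_pow2(ram_size));
    return addr & (ram_size - 1u);
}

/* General wrap: works for any size, but needs a division per access,
 * and the Cortex-M0+ has no divide instruction. */
static uint32_t wrap_mod(uint32_t addr, uint32_t ram_size)
{
    assert(ram_size != 0);
    return addr % ram_size;
}
```

For 192K, size - 1 is 0x2ffff, which is not a solid run of one bits: masking an access 4 bytes past the end (0x30004) gives 0x20004 rather than 4, so stray accesses land in the wrong place instead of wrapping cleanly.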
Unfortunately, when I then tested MacPaint it still wouldn't run
because it wanted to write a scratch file to the read-only boot
volume. This is totally breaking rule #1 by this point, so we are
staying with 128KB for now.
However, a 256K MicroMac is extremely possible. We just need an MCU
with, say, 300KB of RAM... Then we'd be cooking on gas.
Goodbye, friend
Well, dear reader, this has been a blast. I hope there's been
something fun here for ya. Ring off now, caller!
[Image gallery. Captions: The MicroMac!; HDMI monitor, using a
VGA-to-HDMI box; umac screenshot; System 3.2, Finder 5.3; Performance
tuning; Random disc image working OK]
Resources
* https://github.com/evansm7/umac
* https://github.com/evansm7/pico-mac
* https://www.macintoshrepository.org/7038-all-macintosh-roms-68k-ppc-
* https://winworldpc.com/product/mac-os-0-6/system-3x
* https://68kmla.org/bb/index.php?threads/macintosh-128k-mac-plus-roms.4006/
* https://docs.google.com/spreadsheets/d/1wB2HnysPp63fezUzfgpk0JX_b7bXvmAg6-Dk7QDyKPY/edit#gid=840977089
axio.ms is Matt Evans's blog and project-writeup site. All content is
copyright (c) 2002-2024 Matt Evans, unless otherwise stated.