http://rv6502.ca/post/2019/01/05/starduino-3d-gaming-in-28kb-behind-the-pixels/ * Wiki * About Me * Twitter Stephane Hockenhull's website Bits and Can-do Dad * Home * Blog * Experiments * Arduino * ATmega32U4 * AVR * Game Boy * Audio * Computer Graphics * Computer Vision * Stephane Hockenhull's website Starduino - 3D Gaming in 28KB - Behind the Pixels 0 January 5, 2019 by rv6502 [cover] [starduino_title] I was looking for an interesting hobby project to hone my AVR assembly skills when I found the Arduboy, a sturdy "Arduino" I could keep in my pocket. There were a lot of fun games already and even some in 3D but no filled-polygon games? This had to be rectified. First, what do we have to work with? [arduboy-small]The Arduboy The Arduboy is a portable game console compatible with the Arduino IDE and uses the same microcontroller as the Arduino Leonardo, an ATmega32u4 but unlike the Leonardo board it comes with hardware pre-attached all in a nice sturdy casing. The ATmega32u4 contains an 8 bits CPU, a tiny bit of RAM, some flash ROM for the program and data, a tiny EEPROM for saving data when the power goes out, and features a bunch of GPIO pins to connect to external hardware. CPU 8bits ATmega32u4 Speed 16Mhz 2.5KB SRAM 1KB taken by the display buffer 1.5KB left to work with 28KB For the application Flash ROM 4KB For the boot loader 32KB Total EEPROM 1KB Screen 128x64 1bpp monochrome OLED Buttons D-pad and 2 buttons Sound 1 bit x 2 channels (or 3 levels.) Speaker Piezoelectric speaker Software Controllable LEDs 1 RGB, 1 Green, 1 Yellow External IO Ports 1 micro USB port. Device mode only. The display is 128x64 1bpp which means that a full frame takes away 1KB of our memory. Unless we're generating the display on the fly (which isn't practical for a polygon-based 3D game, I tried. More on that later) we're left with 1.5KB of RAM for all our working data including the CPU stack. It is connect to the MCU via one-way SPI and we don't get the v-sync signal from the OLED controller as that particular display manufacturer did not expose the signal on the ribbon cable. There is no audio generation circuitry. The sound has to be generated in software through an IRQ for each PWM pulse. Eating away at our precious CPU time. The early tests In late May 2018 when I got my unit I started with a filled-polygon rasterizer to see if it'd be possible to get anything at a playable speed. Then I started experimenting with different game concepts. To see which styles would fit the OLED display better, the engine's performance, ROM space, limited controls, etc. I settled for a concept based on Nintendo's Star Fox on the SNES. SNES Arduboy 16 bits 8 bits 3 CPUs (65816, SuperFX, SPC700) 1 CPU ROM: 1024KB ROM: 28KB Main RAM: 128KB Video RAM: 64KB Audio RAM: 64KB 2.5KB RAM SuperFX RAM: 256KB Total RAM: 512KB 10Mhz (SuperFX) 16Mhz (Ha! Take that!) Saving time and preparing for the inevitable debugging ahead Debugging on MCU hardware is never as convenient as on your favorite workstation in your favorite IDE with all the nice tools like boundary checkers and log files. So I moved the project out of the Arduino IDE and into my own Makefile-based build system with a native PC port so I could do as much testing on PC without having to program the flash every time. [Screenshot_2019-01-05_15-29-21-] It also lets me do some convenient time-saving things like showing the bounding boxes and bezier paths in a level preview window and live-editing the levels and sounds. [Screenshot_2019-01-05_15-45-50-]Starduino - Debugging on PC If you're planning on spending months of your spare time working on a game invest time in making your life easier first. Doesn't take long to pay off. Throw out what you don't need Moving out of the Arduino IDE let me throw out the USB library and a few others pieces. But that meant I didn't have access to the USB serial port for debugging... Remember that part about having the engine & game running on PC too? I'll live. The OLED Display By default the display controller is in "horizontal mode" which is fine for 2D blitting but isn't convenient for rasterizing polygon. Luckily the controller has a "Vertical Addressing Mode" which turns the memory mapping into a familiar and rasterizing-friendly 64x128 screen (sideways) of 8x128 bytes starting at the top left, going right, with the lsb on the leftmost pixel (0x01 = leftmost, 0x80 = rightmost). [OLED] The initialisation commands used: static const uint8_t init_cmds[] PROGMEM = { 0xAE, 0xA8, 64-1, 0xD3, 0x00, 0x40, 0x8D, 0x14, 0xC8, 0xDA, 0x12, 0x81, 0xFF, 0xd9, 0x22, 0xDB, 0x00, 0x2E, 0xA6, 0xA4, 0xAF, 0x20, 0x01, 0xA0 }; The great thing about making a 3D engine is you don't really care which way the screen is. You can just fix this in the projection matrix. Rendering I'm not going to get into the details of rasterizing polygons as there's already plenty of books and tutorials about this both online and in dead tree format. Pick your favorite. As long as texturing isn't involved the CPU can handle it just fine. With 32 registers it's very apt at this job. [background]Sky Rendering However I used a little trick for the sky box. It's actually a dynamically generated 2x96 texture drawn over the screen in a skewed fashion according to the camera's angle by projecting the horizon onto the edges of the screen to figure out the height and slope. The lines representing the ground are drawn onto the background pattern before clearing the frame buffer with it. Their position is calculated according to the projection matrix. Let's butcher some 3D models ROM space is at a premium here so obviously data needs to be compressed somehow. We also don't have much RAM to work with so everything will have to be decompressed (or generated!) on the fly. And a balance has to be struck between the saved data and the decompressor code. Each instruction on the CPU is 2 bytes and this can add up quickly. I didn't want to model 3D objects by entering vertex values by hand so I made an OBJ to C++ converter-compressor tool and modelled the objects in Modo. The first obvious place to save bytes is to mirror your models whenever possible so the compressor looks for matching mirrored polygons on all 3 dimensions (X,Y,Z) and merge them into mirrored batches. This means all I need is 3 bits to tell the decompressor if it has to repeat the poly batch mirrored on X, Y, and/or Z. This all fitted nicely into a 16bit batch header VVZYXIIIIICCCCCC where * V is the number of vertices per poly-1 (dot, line, tri, quad), * ZYX are the mirror bits, * I is the index offset of vertex 0 in the batch, * C is the number of primitives to render in the batch. The engine uses 32bits signed fixed point math in a 16.16 format (16 bits of fraction). This meant 12 bytes per vertex but this is much too wasteful. That's when floating points become your friend. But wait! Floats are 32bits! That's still 12 bytes! Yes, standard floats are but I'm not using that. The trick I used was to encode the vertices as 4bits of significand and a shared 4 bits of exponent, for a total of 2 bytes per vertex. Modern GPUs use a similar trick to encode HDR colors such as RGB9_E5_EXT in OpenGL in only 32bits. void VectorUnpack(Fixed *out_v, const uint16_t *v) { uint16_t t = pgm_read_word_near(v); byte e = (t & 0xF) + 8; int32_t *dst = (int32_t *)out_v; for(byte i=3; i--;){ int8_t tx = (t & 0xF0); tx >>= 4; *dst++ = ((int32_t)tx) << e; t >>= 4; } } This is where the 3D models get butchered by the compressor tool but with careful adjustment of the vertices I could get the models to look fine even after rounding. Beggars can't be choosers. Each poly has a 4bit "color" index into a 16 entries palette that maps to 9 different dithering patterns. The indirection allows me to create the pulsing brightness animation and blinking. Animations [modo]Models Animated Usually animations in 3D engines are done using bone matrix transforms with complex bezier curves but this was going to be much too large for this project and the poor little CPU could not have dealt with multiple bone matrices in any reasonable time frame. So animations were done using simple vertex interpolation instead. When vertices are only 2 bytes (and mirrored!) it becomes more reasonable for the few meshes that would get animated. Bounding boxes and collisions Ugh! more bytes, right? [arch_boundary-300x256] Nope. Not even going to store any bounding boxes. All the objects have a scaling value to reuse the meshes and the bounding box was implicitly 1x1x1 centered around the object's origin. (-0.5 to 0.5) then scaled by the instance's scaling. Complex collisions were created using invisible objects. [arch_ingame-300x233]Extra invisible collision objects You'd think the extra invisible objects would add more bytes than storing bounding boxes, however the extra code complexity necessary to handle multiple bounding boxes of varying sizes and positions per objects out-weights the savings. The level The level was encoded using spawning sections that are indexed to be reused. [Screenshot_2019-01-05_17-27-12-300x233]Level sections in red-green pattern The game contains a single intro level where elements are gradually introduced to let new players learn the controls. After which a random level generator picks from 3 pools of level segments: * Dangers * Breaks * Energy Refills When the game starts only the first few segments of each category can be selected by the RNG. As the score goes up the game: * Accelerates from 1x at 0 points to 2x at 128K points. * The more dangerous segments from each pool get enabled. * More dangers get selected before breaks are picked. * More breaks get selected before an energy refill gets picked. The random number generator is seeded with a constant so the same level gets generated every time to keep the high scores fair. Text rendering The font is encoded as 5x3 pixels per glyph into a 16bit value. The glyph is first unpacked into an 8x5 buffer (remember the screen is on its side). A style buffer tells the text rendering routine how to render the multiple text passes: * Expand Border * Fill 8x5 Cell * NAND Mask Draw (black) * XOR Draw (invert) Drawing white text is a mask to set to black followed by an invert. It seems wasteful but the black outline step takes care of clearing to black so the 2nd pass only needs to invert. This allows various styles using the same text routine and saves on code space. Reducing the lag To help controlling the game when the frame rate dips the inputs are read 60 times per second (same rate as the game play logic) and stored into a circular buffer so inputs are processed at their input time. Internally the game simulation runs at 60 ticks per second regardless of frame rate. The killer math Doing 32bit fixed point 3D math really kills that poor 8bit CPU. Around 50% of the time is spent doing the geometry transforms, another 30% for the clipping. I had to write the 16.16 x 16.16 -> 16.16 fixed point multiplication code in assembly to get a decent frame rate. The main advantage other than not doing the ridiculously overkill 32 x 32 -> 64 fixed point math in C++ was that I was able to skip some of the long-multiplication steps as 3D vertex*matrix transform has a lot of zero bytes the way the models are rounded off so the assembly code skips an entire (byte * 4 bytes) multiply-accumulate chunk of code for each zero bytes. [fixed-point-300x255] Step 1 and Step 4 (not pictured) can be entirely skipped in this example. The fastest routine is the one that does nothing. This was the biggest gain in speed. The tally Data 3220 Bytes Code 25232 Bytes Frame buffer 1024 Bytes Work memory 1314 Bytes Stack left 222 Bytes Download the .arduboy file (98KB Turbo version) (Classic version: .arduboy file (100KB) ) to flash the game on your Arduboy. Or the full .zip file with the printable PDF instruction booklet (600KB) and contains the .hex file to upload using avrdude or your method of choice. Direct link to the .hex file (Turbo Version with 50% higher FPS) Direct link to the .hex file (Classic Version) Or if you need only the instruction booklet (516KB). * [starduino_title] * [screenshot01] * [screenshot03] [booklet]If you're gonna spoof something, go all the way : Arduino, ATmega32U4, AVR, Computer Graphics, Games, GPU, Hardware, Renderer : 3D, Arduino, Assembly, AVR, Computer Graphics Leave a Reply Cancel reply Your email address will not be published. Required fields are marked * [ ] [ ] [ ] [ ] [ ] [ ] [ ] Comment [ ] Name * [ ] Email * [ ] Website [ ] [Post Comment] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] RSS feed for this post (comments) * TrackBack URI Previous Post << Starduino For Arduboy - 3D Gaming Inside 28KB Next Post Starduino Turbo For Arduboy - Faster 3D Gaming Inside 28KB >> * [Search here .... ] * Recent Posts + We built an EDM BBQ + Starduino Turbo For Arduboy - Faster 3D Gaming Inside 28KB + Starduino - 3D Gaming in 28KB - Behind the Pixels + Starduino For Arduboy - 3D Gaming Inside 28KB + Bad Duino - Bad Apple!! on Arduino * Recent Comments + Roman on Starduino For Arduboy - 3D Gaming Inside 28KB + Picea Pungens on Starduino For Arduboy - 3D Gaming Inside 28KB + Maciej on Starduino For Arduboy - 3D Gaming Inside 28KB + Starduino Brings Star Fox Game Play to Arduboy | Retro Gaming Magazine on Starduino For Arduboy - 3D Gaming Inside 28KB + z-nexx on Game Boy Tracker Replay Routines Released * Archives + October 2019 + July 2019 + January 2019 + January 2018 + February 2015 + July 2014 + June 2014 + May 2012 + March 2012 + January 2012 + December 2011 + December 2010 * Categories + Arduino + ATmega32U4 + Audio + AVR + Blog + Computer Graphics + Computer Vision + Experiments + Game Boy + Games + GPU + Hardware + Linux + Mac OSX + MS Windows + MSDOS + Open Pandora + Raspberry Pi + Renderer + Software Releases + Technical + Uncategorized * Meta + Log in + Entries feed + Comments feed + WordPress.org * For more stuff PayPal - The safer, easier way to pay online! * * Tags 3D Arduino Assembly AVR Bad Apple!! Computer Graphics Ying Hui Dong Fang Project * Meta + Log in + Entries feed + Comments feed + WordPress.org * January 2019 M T W T F S S 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 << Jan Jul >> Stephane Hockenhull's website Albizia Theme designed by itx