# Running Julia baremetal on an Arduino

1. Preamble
2. an LED in C
3. A first piece of julia pseudocode
    1. Datasheets & Memory Mapping
4. Compiling our code
    1. Configuring LLVM
    2. Defining an architecture
5. Looking at the binary
    1. Atomicity
    2. Inline LLVM-IR
6. an LED in Julia
7. Limitations
8. Links & references

## Preamble

I don't really have much experience with microcontrollers. I've played around with some arduinos before and the main entry point for my home network is a Raspberry Pi, but that's about it for recent experience. I did take a single course on microcontrollers a few years back, and I was hilariously bad at it, barely reaching a passing grade. Nonetheless, I am fascinated by them - they're low-powered devices that we can program to make almost anything happen, as long as we're a little careful with resource management and don't shoot ourselves in the foot.

One thing that is always implicitly assumed when talking about julia is the requirement for a runtime and garbage collector. Most of the time, optimizing julia (or any code, really) comes down to two things:

1. Minimize the time spent running code you didn't write.
2. Have as much of the code you want to run compiled to the native instructions of the machine you want to run it on.

Requirement 1) results more or less in "don't talk to the runtime & GC if you don't have to", and 2) boils down to "make sure you don't run unnecessary code, like an interpreter" - i.e. statically compile your code and avoid dynamicness wherever you can.^[1] I'm already used to 1) from regular optimization work when helping people on Slack and Discourse, and with better static compilation support inching ever closer over the past few years and me procrastinating on writing my bachelor's thesis last week, I thought to myself:
1. Julia is based on LLVM and is basically already a compiled language.
2. You've got some old arduinos lying around.
3. You know those take in some AVR blob to run as their code.
4. LLVM has an AVR backend.

and the very next thought I had was "that can't be too difficult to get to work, right?". This is the (unexpectedly short) story of how I got julia code to run on an arduino.

[1]: Funnily enough, once you're looking for it, you can find these concepts everywhere. For example, you want to minimize the number of times you talk to the linux kernel on an OS, since context switches are expensive. You also want to call into fast native code as often as possible, as is done in python by calling into C when performance is required.

## an LED in C

So, what are we dealing with? Well, even arduino don't sell these anymore:

[arduino]

This is an Arduino Ethernet R3, a variation on the common Arduino UNO. It's the third revision, boasting an ATmega328p, an ethernet port, a slot for an SD card, as well as 14 I/O pins, most of which are reserved. It has 32 KiB of flash memory, 2 KiB SRAM and 1 KiB EEPROM. Its clock runs at a measly 16 MHz, there's a serial interface for an external programmer, and it weighs 28 g.

With this documentation, the schematic for the board, the datasheet for the microcontroller and a good amount of "you've done harder things before", I set out to achieve the simplest goal imaginable: let the LED labeled L9 (see the lower left corner of the board in the image above, right above the "on" LED above the power connector) blink.

For comparison's sake, and to have a working implementation to check our arduino with, here's a C implementation of what we're trying to do:

```c
#include <avr/io.h>
#include <util/delay.h>

#define MS_DELAY 3000

int main (void) {
    DDRB |= _BV(DDB1);

    while(1) {
        PORTB |= _BV(PORTB1);
        _delay_ms(MS_DELAY);
        PORTB &= ~_BV(PORTB1);
        _delay_ms(MS_DELAY);
    }
}
```

This short piece of code does a few things.
It first configures our LED pin as an output, which we can do by setting bit DDB1^[2] in DDRB (which is a contraction of "Data Direction Register port B" - it controls whether a given I/O pin is interpreted as input or output). After that, it enters an infinite loop, where we first set bit PORTB1 in PORTB to HIGH (or 1) to instruct our controller to power the LED. We then wait for MS_DELAY milliseconds, or 3 seconds. Then, we unpower the LED by setting the same PORTB1 bit to LOW (or 0), and wait again.

Compiling & flashing this code like so^[3]:

```
avr-gcc -Os -DF_CPU=16000000UL -mmcu=atmega328p -c -o blink_led.o blink_led.c
avr-gcc -mmcu=atmega328p -o blink_led.elf blink_led.o
avr-objcopy -O ihex blink_led.elf blink_led.hex
avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:blink_led.hex
```

results in a nice, blinking LED. These few shell commands compile our .c source code to an .o object file targeting our microcontroller, link it into an .elf, translate that to the Intel .hex format the controller expects, and finally flash it to the controller with the appropriate settings for avrdude. Pretty basic stuff.

It shouldn't be hard to translate this, so where's the catch? Well, most of the code above is not even C, but C preprocessor directives tailored to do exactly what we mean to do. We can't make use of them in julia and we can't import those .h files, so we'll have to figure out what they mean. I haven't checked, but I think not even _delay_ms is a function. On top of this, we don't have a convenient existing avr-gcc to compile julia to AVR for us. However, if we manage to produce a .o file, we should be able to make the rest of the existing toolchain work for us - after all, avr-gcc can't tell the difference between a julia-created .o and an avr-gcc-created .o.

[2]: Finding the right pin & port took a while. The documentation states that the LED is connected to "digital pin 9", which is supported by the label L9 next to the LED itself.
It then goes on to say that on most of the arduino boards, this LED is placed on pin 13, which on mine is used for SPI instead. This is confusing, because the datasheet for our board connects this LED to pin 13 (PB1, port B bit 1) on the controller, which has a split trace leading to pin 9 of the J5 pinout. I mistakenly thought "pin 9" referred to the microcontroller, and tried to control the LED through PD5 (port D, bit 5) for quite some time before I noticed my mistake. The upside was that I now had a known-good piece of code that I could compare to - even on the assembly level.

[3]: -DF_CPU=16000000UL is required for _delay_ms to figure out how to translate from milliseconds to "number of cycles required to wait" in our loops. While it's nice to have, it's not really required - we only have to wait some visibly distinct amount to notice the blinking, and as such, I've skipped implementing this in the julia version.

## A first piece of julia pseudocode

So with all that in mind, let's sketch out what we think our code should look like:

```julia
const DDRB = ??
const PORTB = ??

function main()
    set_high(DDRB, DDB1) # ??
    while true
        set_high(PORTB, PORTB1) # ??
        for _ in 1:500000
            # busy loop
        end
        set_low(PORTB, PORTB1) # ??
        for _ in 1:500000
            # busy loop
        end
    end
end
```

From a high level, it's almost exactly the same: set bits, busy loop, unset bits, loop. I've marked all the places where we have to do something, though we don't know exactly what yet, with ??. All of these places are a bit interconnected, so let's dive in with the first big question: how can we replicate what the C macros DDRB, DDB1, PORTB and PORTB1 end up doing?

### Datasheets & Memory Mapping

To answer this, we first have to take a step back, forget that these are defined as macros in C, and think back to what they represent. Both DDRB and PORTB reference specific I/O registers in our microcontroller. DDB1 and PORTB1 refer to the (zero-based) 1st bit of the respective register.
In theory, we only have to set these bits in the registers above to make the controller blink our little LED. How do you set a bit in a specific register, though? This has to be exposed to a high-level language like C somehow. In assembly code we'd just access the register natively, but save for inline assembly, we can't do that in either C or julia.

When we take a look in our microcontroller datasheet, we notice that there's a chapter 36, "Register Summary", from page 621 onwards. This section is a register reference table. It has an entry for each register, specifying an address, a name, the name of each bit, as well as the page in the datasheet where further documentation, such as initial values, can be found. Scrolling to the end, we find what we've been looking for:

| Address | Name | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 | Page |
|---|---|---|---|---|---|---|---|---|---|---|
| 0x05 (0x25) | PORTB | PORTB7 | PORTB6 | PORTB5 | PORTB4 | PORTB3 | PORTB2 | PORTB1 | PORTB0 | 100 |
| 0x04 (0x24) | DDRB | DDB7 | DDB6 | DDB5 | DDB4 | DDB3 | DDB2 | DDB1 | DDB0 | 100 |

So PORTB is mapped to addresses 0x05 and 0x25, while DDRB is mapped to addresses 0x04 and 0x24. Which memory are those addresses referring to? We have EEPROM, flash memory as well as SRAM, after all. Once again, the datasheet comes to our help: chapter 8, "AVR Memories", has a short section on our SRAM memory, with a very interesting figure ("Data Memory Map") as well as this explanation:

> The first 32 locations [of SRAM] address the Register File, the next 64 locations the standard I/O memory, then 160 locations of Extended I/O memory, and the next 512/1024/1024/2048 locations address the internal data SRAM.

So the addresses we got from the register summary actually correspond 1:1 to SRAM addresses^[4]. Neat!
Translating what we've learned into code, our prototype now looks like this:

```julia
const DDRB = Ptr{UInt8}(36)  # 0x24, but julia only provides conversion methods for `Int`
const PORTB = Ptr{UInt8}(37) # 0x25

# The bits we're interested in are the same bit as in the datasheet
#              76543210
const DDB1   = 0b00000010
const PORTB1 = 0b00000010

function main_pointers()
    unsafe_store!(DDRB, DDB1)

    while true
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb | PORTB1) # enable LED
        for _ in 1:500000
            # busy loop
        end
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb & ~PORTB1) # disable LED
        for _ in 1:500000
            # busy loop
        end
    end
end
```

We can write to our registers by storing some data at their address, as well as read from a register by loading from the same address. In one fell swoop, we got rid of all of our ?? at once! This code now seemingly has everything the C version has, so let's start on the biggest unknown: how do we compile this?

[4]: This is in contrast to more high-level systems like an OS kernel, which utilize virtual memory and paging of sections of memory to give the illusion of being on the "baremetal" machine and handling raw pointers.

## Compiling our code

Julia has for quite some time now run on more than just x86(_64) - it also has support for Linux as well as macOS on ARM. These are, in large part, possible due to LLVM supporting ARM. However, there is one other large space where julia code can run directly: GPUs. For a while now, the package GPUCompiler.jl has done a lot of work to compile julia down to NVPTX and AMDGPU, the NVidia- and AMD-specific architectures supported by LLVM. Because GPUCompiler.jl interfaces with LLVM directly, we can hook into this same mechanism to have it produce AVR instead - the interface is extensible!

### Configuring LLVM

The default julia install does not come with the AVR backend of LLVM enabled, so we have to build both LLVM and julia ourselves. Be sure to do this on one of the 1.8 betas, like v1.8.0-beta3.
More recent commits currently break GPUCompiler.jl in this setup, which should be fixed in the future as well. Julia luckily already supports building its dependencies, so we just have to make a few changes to two make files. First, we enable the backend:

```diff
diff --git a/deps/llvm.mk b/deps/llvm.mk
index 5afef0b83b..8d5bbd5e08 100644
--- a/deps/llvm.mk
+++ b/deps/llvm.mk
@@ -60,7 +60,7 @@ endif
 LLVM_LIB_FILE := libLLVMCodeGen.a

 # Figure out which targets to build
-LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF
+LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF;AVR
 LLVM_EXPERIMENTAL_TARGETS :=

 LLVM_CFLAGS :=
```

and then instruct julia not to use the prebuilt LLVM by setting a flag in Make.user:

```
USE_BINARYBUILDER_LLVM=0
```

Now, after running make to start the build process, LLVM is downloaded, patched & built from source and made available to our julia code. The whole LLVM compilation took about 40 minutes on my laptop. I honestly expected worse!

### Defining an architecture

With our custom LLVM built, we can define everything that's necessary for GPUCompiler.jl to figure out what we want. We start by importing our dependencies, defining our target architecture and its target triple:

```julia
using GPUCompiler
using LLVM

#####
# Compiler Target
#####

struct Arduino <: GPUCompiler.AbstractCompilerTarget end

GPUCompiler.llvm_triple(::Arduino) = "avr-unknown-unknown"
GPUCompiler.runtime_slug(::GPUCompiler.CompilerJob{Arduino}) = "native_avr-jl_blink"

struct ArduinoParams <: GPUCompiler.AbstractCompilerParams end
```

We're targeting a machine that's running AVR, with no known vendor and no OS - we're baremetal, after all. We're also providing a runtime slug to identify our binary by. Finally, we're defining a dummy struct to hold additional parameters for our target architecture. We don't require any, so we can just leave it empty and otherwise ignore it.
Since the julia runtime can't run on GPUs, GPUCompiler.jl also expects us to provide a replacement module for various operations we might want to do, like allocating memory on our target architecture or throwing exceptions. We're of course not going to do any of that, which is why we can just define an empty placeholder for these as well:

```julia
module StaticRuntime
    # the runtime library
    signal_exception() = return
    malloc(sz) = C_NULL
    report_oom(sz) = return
    report_exception(ex) = return
    report_exception_name(ex) = return
    report_exception_frame(idx, func, file, line) = return
end

GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{<:Any,ArduinoParams}) = StaticRuntime
GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{Arduino}) = StaticRuntime
GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{Arduino,ArduinoParams}) = StaticRuntime
```

In the future, these calls may be used to provide a simple bump allocator or to report exceptions via the serial bus for other code targeting the arduino. For now though, this "do nothing" runtime is sufficient.^[5]

Now for the compilation. We first define a job for our pipeline:

```julia
function native_job(@nospecialize(func), @nospecialize(types))
    @info "Creating compiler job for '$func($types)'"
    source = GPUCompiler.FunctionSpec(
                func,                               # our function
                Base.to_tuple_type(types),          # its signature
                false,                              # whether this is a GPU kernel
                GPUCompiler.safe_name(repr(func)))  # the name to use in the asm
    target = Arduino()
    params = ArduinoParams()
    job = GPUCompiler.CompilerJob(target, source, params)
end
```

This then gets passed to our LLVM IR builder:

```julia
function build_ir(job, @nospecialize(func), @nospecialize(types))
    @info "Building LLVM IR for '$func($types)'"
    mi, _ = GPUCompiler.emit_julia(job)
    ir, ir_meta = GPUCompiler.emit_llvm(
                    job,                    # our job
                    mi;                     # the method instance to compile
                    libraries=false,        # whether this code uses libraries
                    deferred_codegen=false, # is there runtime codegen?
                    optimize=true,          # do we want to optimize the llvm?
                    only_entry=false,       # is this an entry point?
                    ctx=JuliaContext())     # the LLVM context to use
    return ir, ir_meta
end
```

We first get a method instance from the julia runtime and ask GPUCompiler to give us the corresponding LLVM IR for our given job, i.e. for our target architecture. We don't use any libraries and we can't run codegen, but julia-specific optimizations sure would be nice. They're also required for us, since they remove obviously dead code regarding the julia runtime, which we neither want nor can call into. If it remained in the IR, we'd error out when trying to build our ASM, due to the missing symbols.

After this, it's just a matter of emitting AVR ASM:

```julia
function build_obj(@nospecialize(func), @nospecialize(types); kwargs...)
    job = native_job(func, types)
    @info "Compiling AVR ASM for '$func($types)'"
    ir, ir_meta = build_ir(job, func, types)
    obj, _ = GPUCompiler.emit_asm(
                job,            # our job
                ir;             # the IR we got
                strip=true,     # should the binary be stripped of debug info?
                validate=true,  # should the LLVM IR be validated?
                format=LLVM.API.LLVMObjectFile) # what format would we like to create?
    return obj
end
```

We're also going to strip out debug info, since we can't debug anyway, and we're additionally asking LLVM to validate our IR - a very useful feature!

[5]: The eagle-eyed may notice that this is suspiciously similar to what one needs for Rust - something to allocate and something to report errors. This is no coincidence - it's the minimum required for a language that usually has a runtime that handles things like signals and allocation of memory for you. Spinning this further could lead one to think that Rust, too, is garbage collected, since you never have to call malloc and free yourself - it's all handled by the runtime & compiler, which insert calls to these (or another allocator) in the appropriate places.
## Looking at the binary

When calling this like build_obj(main_pointers, Tuple{}) (we don't pass any arguments to main), we receive a String containing binary data - this is our compiled object file:

```julia
obj = build_obj(main_pointers, Tuple{})
```

```
\x7fELF\x01\x01\x01\0\0\0\0\0\0\0\0\0\x01\0S\0\x01\0\0\0\0\0\0\0\0\0\0\0\xf8\0\0\0\x02\0\0\x004\0\0\0\0\0(\0\x05\0\x01\0\x82\xe0\x84\xb9\0\xc0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\a\0\0\0\0\0\0\0\0\0\0\0\x04\0\xf1\xff\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\x02\0\e\0\0\0\0\0\0\0\x06\0\0\0\x12\0\x02\0?\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x04\0\0\0\x03\x02\0\0\x04\0\0\0\0.rela.text\0__do_clear_bss\0julia_main_pointers\0.strtab\0.symtab\0__do_copy_data\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0/\0\0\0\x03\0\0\0\0\0\0\0\0\0\0\0\xa8\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\x06\0\0\0\x01\0\0\0\x06\0\0\0\0\0\0\x004\0\0\0\x06\0\0\0\0\0\0\0\0\0\0\0\x04\0\0\0\0\0\0\0\x01\0\0\0\x04\0\0\0\0\0\0\0\0\0\0\0\x9c\0\0\0\f\0\0\0\x04\0\0\0\x02\0\0\0\x04\0\0\0\f\0\0\x007\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0<\0\0\0`\0\0\0\x01\0\0\0\x03\0\0\0\x04\0\0\0\x10\0\0\0
```

Let's take a look at the disassembly, to confirm that this is what we expect to see:

```julia
function builddump(fun, args)
    obj = build_obj(fun, args)
    mktemp() do path, io
        write(io, obj)
        flush(io)
        str = read(`avr-objdump -dr $path`, String)
    end |> print
end

builddump(main_pointers, Tuple{})
```

```
/tmp/jl_uOAUKI:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_pointers>:
   0:   82 e0           ldi     r24, 0x02       ; 2
   2:   84 b9           out     0x04, r24       ; 4
   4:   00 c0           rjmp    .+0             ; 0x6
                        4: R_AVR_13_PCREL       .text+0x4
```

Well, that doesn't look good - where has all our code gone? All that's left is a single out, followed by a single do-nothing relative jump. That's almost nothing if we compare to the equivalent C code:

```
$ avr-objdump -d blink_led.elf
[...]
00000080 <main>:
  80:   21 9a           sbi     0x04, 1 ; 4
  82:   2f ef           ldi     r18, 0xFF       ; 255
  84:   8b e7           ldi     r24, 0x7B       ; 123
  86:   92 e9           ldi     r25, 0x92       ; 146
  88:   21 50           subi    r18, 0x01       ; 1
  8a:   80 40           sbci    r24, 0x00       ; 0
  8c:   90 40           sbci    r25, 0x00       ; 0
  8e:   e1 f7           brne    .-8             ; 0x88
  90:   00 c0           rjmp    .+0             ; 0x92
  92:   00 00           nop
  94:   29 98           cbi     0x05, 1 ; 5
  96:   2f ef           ldi     r18, 0xFF       ; 255
  98:   8b e7           ldi     r24, 0x7B       ; 123
  9a:   92 e9           ldi     r25, 0x92       ; 146
  9c:   21 50           subi    r18, 0x01       ; 1
  9e:   80 40           sbci    r24, 0x00       ; 0
  a0:   90 40           sbci    r25, 0x00       ; 0
  a2:   e1 f7           brne    .-8             ; 0x9c
  a4:   00 c0           rjmp    .+0             ; 0xa6
  a6:   00 00           nop
  a8:   ec cf           rjmp    .-40            ; 0x82
[...]
```

This sets the same bit as our code on 0x04 (remember, this was DDRB), initializes a loop variable spread over three registers, branches, jumps, sets and clears bits... basically everything we'd expect our code to do as well, so what gives?

In order to figure out what's going on, we have to remember that julia, LLVM and gcc are optimizing compilers. If they can deduce that some piece of code has no visible effect, for example because you're always overwriting previous loop iterations with known constants, the compiler is usually free to just delete the superfluous writes, because you can't observe the difference anyway. Here, I believe two things happened:

1. The initial unsafe_load from our pointer triggered undefined behavior, since the initial value behind a given pointer is not defined. LLVM saw that, saw that we actually used the read value, and eliminated both load & store: since it's free to pick the value it "read" to be the one we wrote, the load/store pair became superfluous.
2. The now-empty loops serve no purpose, so they got removed as well.

In C, you can solve this problem by using volatile. That keyword is a very strict way of telling the compiler: "Look, I want every single read & write from and to this variable to happen. Don't eliminate any, and don't shuffle them around (except for non-volatile accesses - you're free to shuffle those around)."
In contrast, julia doesn't have this concept at all - but we do have atomics. So let's use them to see if they're enough, even though semantically they're a tiny bit different.^[6]

### Atomicity

With the atomics, our code now looks like this:

```julia
const DDRB = Ptr{UInt8}(36)  # 0x24, but julia only provides conversion methods for `Int`
const PORTB = Ptr{UInt8}(37) # 0x25

# The bits we're interested in are the same bit as in the datasheet
#              76543210
const DDB1   = 0b00000010
const PORTB1 = 0b00000010

function main_atomic()
    ddrb = unsafe_load(PORTB)
    Core.Intrinsics.atomic_pointerset(DDRB, ddrb | DDB1, :sequentially_consistent)

    while true
        pb = unsafe_load(PORTB)
        Core.Intrinsics.atomic_pointerset(PORTB, pb | PORTB1, :sequentially_consistent) # enable LED
        for _ in 1:500000
            # busy loop
        end
        pb = unsafe_load(PORTB)
        Core.Intrinsics.atomic_pointerset(PORTB, pb & ~PORTB1, :sequentially_consistent) # disable LED
        for _ in 1:500000
            # busy loop
        end
    end
end
```

> **Note**
> This is not how you'd usually use atomics in julia! I'm using intrinsics in hopes of communicating with LLVM directly, since I'm dealing with pointers here. For more high-level code, you'd use @atomic operations on struct fields.

giving us the following assembly:

```
/tmp/jl_UfT1Rf:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_atomic>:
   0:   85 b1           in      r24, 0x05       ; 5
   2:   82 60           ori     r24, 0x02       ; 2
   4:   a4 e2           ldi     r26, 0x24       ; 36
   6:   b0 e0           ldi     r27, 0x00       ; 0
   8:   0f b6           in      r0, 0x3f        ; 63
   a:   f8 94           cli
   c:   8c 93           st      X, r24
   e:   0f be           out     0x3f, r0        ; 63
  10:   85 b1           in      r24, 0x05       ; 5
  12:   a5 e2           ldi     r26, 0x25       ; 37
  14:   b0 e0           ldi     r27, 0x00       ; 0
  16:   98 2f           mov     r25, r24
  18:   92 60           ori     r25, 0x02       ; 2
  1a:   0f b6           in      r0, 0x3f        ; 63
  1c:   f8 94           cli
  1e:   9c 93           st      X, r25
  20:   0f be           out     0x3f, r0        ; 63
  22:   98 2f           mov     r25, r24
  24:   9d 7f           andi    r25, 0xFD       ; 253
  26:   0f b6           in      r0, 0x3f        ; 63
  28:   f8 94           cli
  2a:   9c 93           st      X, r25
  2c:   0f be           out     0x3f, r0        ; 63
  2e:   00 c0           rjmp    .+0             ; 0x30
                        2e: R_AVR_13_PCREL      .text+0x18
```

At first glance, it doesn't look too bad.
We have a little bit more code and we see some out instructions, so are we good? Unfortunately, no. There is only a single rjmp, meaning our nice busy loops got eliminated again. I also had to insert those unsafe_load calls to not get a segfault during compilation. Further, the atomics seem to have ended up touching some pretty weird addresses - they read/write 0x3f (address 63), which is mapped to SREG, the status register. Even weirder is what the code does with the value it read:

```
   8:   0f b6           in      r0, 0x3f        ; 63
   a:   f8 94           cli
    ...
   e:   0f be           out     0x3f, r0        ; 63
```

First it reads SREG into r0, then clears the global interrupt flag with cli, then writes the saved value back out. This is the standard AVR pattern for making a store atomic with respect to interrupts - save the status register, disable interrupts, store, restore the status register - but it's not what we're after here, and the loops are still gone. So atomics are not the way to go.

[6]: "Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile, which ensures that every volatile load and store happens and is performed in the stated order. A couple examples: if a SequentiallyConsistent store is immediately followed by another SequentiallyConsistent store to the same address, the first store can be erased. This transformation is not allowed for a pair of volatile stores." - LLVM Documentation - Atomics

### Inline LLVM-IR

The other option we still have at our disposal is writing inline LLVM IR.
Julia has great support for such constructs, so let's use them:

```julia
const DDRB = Ptr{UInt8}(36)
const PORTB = Ptr{UInt8}(37)

const DDB1 = 0b00000010
const PORTB1 = 0b00000010
const PORTB_none = 0b00000000 # We don't need any other pin - set everything low

function volatile_store!(x::Ptr{UInt8}, v::UInt8)
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid,
        Tuple{Ptr{UInt8},UInt8},
        x,
        v
    )
end

function main_volatile()
    volatile_store!(DDRB, DDB1)

    while true
        volatile_store!(PORTB, PORTB1) # enable LED
        for _ in 1:500000
            # busy loop
        end
        volatile_store!(PORTB, PORTB_none) # disable LED
        for _ in 1:500000
            # busy loop
        end
    end
end
```

with our disassembly looking like:

```
/tmp/jl_3twwq9:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_volatile>:
   0:   82 e0           ldi     r24, 0x02       ; 2
   2:   84 b9           out     0x04, r24       ; 4
   4:   90 e0           ldi     r25, 0x00       ; 0
   6:   85 b9           out     0x05, r24       ; 5
   8:   95 b9           out     0x05, r25       ; 5
   a:   00 c0           rjmp    .+0             ; 0xc
                        a: R_AVR_13_PCREL       .text+0x6
```

Much better! Our out instructions store to the correct register. Unsurprisingly, all loops are still eliminated. We could force the busy-loop variable to exist by writing its value somewhere in SRAM, but that's a little wasteful.
Instead, we can go one step deeper with our nesting and have inline AVR assembly in our inline LLVM-IR:

```julia
const DDRB = Ptr{UInt8}(36)
const PORTB = Ptr{UInt8}(37)

const DDB1 = 0b00000010
const PORTB1 = 0b00000010
const PORTB_none = 0b00000000 # We don't need any other pin - set everything low

function volatile_store!(x::Ptr{UInt8}, v::UInt8)
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid,
        Tuple{Ptr{UInt8},UInt8},
        x,
        v
    )
end

function keep(x)
    return Base.llvmcall(
        """
        call void asm sideeffect "", "X,~{memory}"(i16 %0)
        ret void
        """,
        Cvoid,
        Tuple{Int16},
        x
    )
end

function main_keep()
    volatile_store!(DDRB, DDB1)

    while true
        volatile_store!(PORTB, PORTB1) # enable LED
        for y in Int16(1):Int16(3000)
            keep(y)
        end
        volatile_store!(PORTB, PORTB_none) # disable LED
        for y in Int16(1):Int16(3000)
            keep(y)
        end
    end
end
```

This slightly unorthodox not-even-a-nop construct pretends to execute an instruction that has some side effect, using our loop variable as an argument. I've changed the loop to run for fewer iterations, because that makes the assembly easier to read. Checking the disassembly, we get...
```
/tmp/jl_xOZ5hH:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_keep>:
   0:   82 e0           ldi     r24, 0x02       ; 2
   2:   84 b9           out     0x04, r24       ; 4
   4:   21 e0           ldi     r18, 0x01       ; 1
   6:   30 e0           ldi     r19, 0x00       ; 0
   8:   9b e0           ldi     r25, 0x0B       ; 11
   a:   40 e0           ldi     r20, 0x00       ; 0
   c:   85 b9           out     0x05, r24       ; 5
   e:   62 2f           mov     r22, r18
  10:   73 2f           mov     r23, r19
  12:   e6 2f           mov     r30, r22
  14:   f7 2f           mov     r31, r23
  16:   31 96           adiw    r30, 0x01       ; 1
  18:   68 3b           cpi     r22, 0xB8       ; 184
  1a:   79 07           cpc     r23, r25
  1c:   6e 2f           mov     r22, r30
  1e:   7f 2f           mov     r23, r31
  20:   01 f4           brne    .+0             ; 0x22
                        20: R_AVR_7_PCREL       .text+0x16
  22:   45 b9           out     0x05, r20       ; 5
  24:   62 2f           mov     r22, r18
  26:   73 2f           mov     r23, r19
  28:   e6 2f           mov     r30, r22
  2a:   f7 2f           mov     r31, r23
  2c:   31 96           adiw    r30, 0x01       ; 1
  2e:   68 3b           cpi     r22, 0xB8       ; 184
  30:   79 07           cpc     r23, r25
  32:   6e 2f           mov     r22, r30
  34:   7f 2f           mov     r23, r31
  36:   01 f4           brne    .+0             ; 0x38
                        36: R_AVR_7_PCREL       .text+0x2c
  38:   00 c0           rjmp    .+0             ; 0x3a
                        38: R_AVR_13_PCREL      .text+0xc
```

Huzzah! Pretty much everything we'd expect to see is here:

* We write to 0x05 with out.
* We have some brne instructions to busy loop with.
* We increment a register pair (adiw) for our looping.

Granted, the binary is not as small as the one we compiled with -Os from C, but it should work! The only remaining step is to get rid of all those .+0 jump targets, which would otherwise prevent us from actually looping. I've also enabled dumping of relocation labels (that's the R_AVR_7_PCREL stuff) - these are inserted by the compiler to make the code relocatable in an ELF file, and are used by the linker during final linking of the assembly.
Now that we're probably ready to flash, we can link our code into a binary (thereby resolving those relocation labels) and flash it onto our arduino:

```
$ avr-ld -o jl_blink.elf jl_blink.o
$ avr-objcopy -O ihex jl_blink.elf jl_blink.hex
$ avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:jl_blink.hex

avrdude: AVR device initialized and ready to accept instructions

Reading | ################################################## | 100% 0.00s

avrdude: Device signature = 0x1e950f (probably m328p)
avrdude: NOTE: "flash" memory has been specified, an erase cycle will be performed
         To disable this feature, specify the -D option.
avrdude: erasing chip
avrdude: reading input file "jl_blink.hex"
avrdude: input file jl_blink.hex auto detected as Intel Hex
avrdude: writing flash (168 bytes):

Writing | ################################################## | 100% 0.04s

avrdude: 168 bytes of flash written

avrdude done.  Thank you.
```

and after flashing we get...

## an LED in Julia

[video: the L9 LED blinking]

Now THAT is what I call two days well spent! The arduino is powered through the serial connector on the right, which I also use to flash programs.

I want to thank everyone in the JuliaLang Slack channel #static-compilation for their help during this! Without them, I wouldn't have thought of the relocation labels in linking, and their help was invaluable when figuring out what does and does not work when compiling julia to an architecture as exotic, for this language, as AVR.

## Limitations

Would I use this in production? Unlikely, but possibly in the future. It was finicky to get going, and random segmentation faults during the compilation process itself are bothersome. But then again - none of this was part of a supported workflow, so I guess I'm happy that it has worked as well as it has! I do believe that this area will steadily improve - after all, it's already working well on GPUs and FPGAs (or so I'm told - julia on an FPGA is apparently some commercial offering from a company).
From what I know, this is the first julia code to run native & baremetal on any Arduino/ATmega-based chip, which in and of itself is already exciting. Still, the fact that there is no such thing as a runtime for this (julia uses libuv for tasks - getting that onto an arduino seems challenging) means you're mostly going to be limited to self-written or vetted code that doesn't rely on too-advanced features, like a GC.

Some niceties I'd like to have are better custom-allocator support, to allow actual, proper "heap" allocation. I haven't tried yet, but I think immutable structs (which are often placed on the stack already - and the ATmega328p does have a stack!) should work out of the box. I'm looking forward to trying out some I2C and SPI communication, but my gut tells me it won't be much different from writing this in C (unless we get custom allocator support, or I use one of the malloc-based arrays from StaticTools.jl, that is).

## Links & references

* Arduino Ethernet R3 Documentation
* Arduino Ethernet R3 Schematic
* ATmega328p Datasheet
* GPUCompiler
* LLVM Documentation - Atomics

CC BY-SA 4.0 Sukera. Last modified: May 23, 2022. Website built with Franklin.jl and the Julia programming language.