# Running Julia baremetal on an Arduino

1. Preamble
2. an LED in C
3. A first piece of julia pseudocode
    1. Datasheets & Memory Mapping
4. Compiling our code
    1. Configuring LLVM
    2. Defining an architecture
5. Looking at the binary
    1. Atomicity
    2. Inline LLVM-IR
6. an LED in Julia
7. Limitations
8. Links & references

## Preamble

I don't really have much experience with microcontrollers. I've played around with some arduinos before and the main entry point for my home network is a Raspberry Pi, but that's about it for recent experience. I did take a single course on microcontrollers a few years back, and I was hilariously bad at it, barely reaching a passing grade. Nonetheless, I am fascinated by them - they're low-powered devices that we can program to make almost anything happen, as long as we're a little careful with resource management and don't shoot ourselves in the foot.

One thing that is always implicitly assumed when talking about julia is the requirement for a runtime and garbage collector. Most of the time, optimizing julia (or any code, really) comes down to two things:

1. Minimize the time spent running code you didn't write.
2. Have as much of the code you want to run compiled to the native instructions of the machine you want to run it on.

Requirement 1) results more or less in "don't talk to the runtime & GC if you don't have to", and 2) boils down to "make sure you don't run unnecessary code, like an interpreter" - i.e. statically compile your code and avoid dynamicness wherever you can.^[1] I'm already used to 1) from regular optimization work when helping people on Slack and Discourse, and with better static compilation support inching ever closer over the past few years and me procrastinating on writing my bachelor's thesis last week, I thought to myself:
1. Julia is based on LLVM and is basically already a compiled language.
2. You've got some old arduinos lying around.
3. You know those take in some AVR blob to run as their code.
4. LLVM has an AVR backend.

and the very next thought I had was "that can't be too difficult to get to work, right?". This is the (unexpectedly short) story of how I got julia code to run on an arduino.

[1]: Funnily enough, once you're looking for it, you can find these concepts everywhere. For example, you want to minimize the number of times you talk to the linux kernel on an OS, since context switches are expensive. You also want to call into fast native code as often as possible, as is done in python by calling into C when performance is required.

## an LED in C

So, what are we dealing with? Well, even arduino don't sell these anymore:

[arduino]

This is an Arduino Ethernet R3, a variation on the common Arduino UNO. It's the third revision, boasting an ATmega328p, an ethernet port, a slot for an SD card, as well as 14 I/O pins, most of which are reserved. It has 32 KiB of flash memory, 2 KiB SRAM and 1 KiB EEPROM. Its clock runs at a measly 16 MHz, there's a serial interface for an external programmer, and it weighs 28 g.

With this documentation, the schematic for the board, the datasheet for the microcontroller and a good amount of "you've done harder things before", I set out to achieve the simplest goal imaginable: let the LED labeled L9 (see the lower left corner of the board in the image above, right above the "on" LED above the power connector) blink.

For comparison's sake, and to have a working implementation to check our arduino with, here's a C implementation of what we're trying to do:

```c
#include <avr/io.h>
#include <util/delay.h>

#define MS_DELAY 3000

int main (void) {
    DDRB |= _BV(DDB1);

    while(1) {
        PORTB |= _BV(PORTB1);
        _delay_ms(MS_DELAY);
        PORTB &= ~_BV(PORTB1);
        _delay_ms(MS_DELAY);
    }
}
```

This short piece of code does a few things.
It first configures our LED pin as an output, which we can do by setting bit DDB1^[2] in DDRB (which is a contraction of "Data Direction Register port B" - it controls whether a given I/O pin is interpreted as input or output). After that, it enters an infinite loop, where we first set bit PORTB1 in PORTB to HIGH (or 1) to instruct our controller to power the LED. We then wait for MS_DELAY milliseconds, or 3 seconds. Then, we unpower the LED by setting the same PORTB1 bit to LOW (or 0), and wait again.

Compiling & flashing this code like so^[3]:

```
avr-gcc -Os -DF_CPU=16000000UL -mmcu=atmega328p -c -o blink_led.o blink_led.c
avr-gcc -mmcu=atmega328p -o blink_led.elf blink_led.o
avr-objcopy -O ihex blink_led.elf blink_led.hex
avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:blink_led.hex
```

results in a nice, blinking LED. These few shell commands compile our .c source code to an .o object file targeting our microcontroller, link it into an .elf, translate that to the Intel .hex format the controller expects, and finally flash it to the controller with the appropriate settings for avrdude. Pretty basic stuff.

It shouldn't be hard to translate this, so where's the catch? Well, most of the code above is not even C, but C preprocessor directives tailored to do exactly what we mean to do. We can't make use of them in julia and we can't import those .h files, so we'll have to figure out what they mean. I haven't checked, but I think not even _delay_ms is a function. On top of this, we don't have a convenient existing avr-gcc to compile julia to AVR for us. However, if we manage to produce a .o file, we should be able to make the rest of the existing toolchain work for us - after all, avr-gcc can't tell the difference between a julia-created .o and an avr-gcc-created .o.

[2]: Finding the right pin & port took a while. The documentation states that the LED is connected to "digital pin 9", which is supported by the label L9 next to the LED itself.
It then goes on to say that on most of the arduino boards, this LED is placed on pin 13, which on mine is used for SPI instead. This is confusing, because the datasheet for our board connects this LED to pin 13 (PB1, port B bit 1) on the controller, which has a split trace leading to pin 9 of the J5 pinout. I mistakenly thought "pin 9" referred to the microcontroller, and tried to control the LED through PD5 (port D, bit 5) for quite some time before I noticed my mistake. The upside was that I now had a known-good piece of code that I could compare to - even on the assembly level.

[3]: -DF_CPU=16000000UL is required for _delay_ms to figure out how to translate from milliseconds to "number of cycles required to wait" in our loops. While it's nice to have, it's not really required - we only have to wait some visibly distinct amount to notice the blinking, and as such, I've skipped implementing this in the julia version.

## A first piece of julia pseudocode

So with all that in mind, let's sketch out what we think our code should look like:

```julia
const DDRB = ??
const PORTB = ??

function main()
    set_high(DDRB, DDB1) # ??
    while true
        set_high(PORTB, PORTB1) # ??
        for _ in 1:500000
            # busy loop
        end
        set_low(PORTB, PORTB1) # ??
        for _ in 1:500000
            # busy loop
        end
    end
end
```

From a high level, it's almost exactly the same: set bits, busy loop, unset bits, loop. I've marked all the places where we have to do something, though we don't know exactly what yet, with ??. All of these places are a bit interconnected, so let's dive in with the first big question: how can we replicate what the C macros DDRB, DDB1, PORTB and PORTB1 end up doing?

### Datasheets & Memory Mapping

To answer this, we first have to take a step back, forget that these are defined as macros in C, and think back to what they represent. Both DDRB and PORTB reference specific I/O registers in our microcontroller. DDB1 and PORTB1 refer to the (zero-based) 1st bit of the respective register.
In theory, we only have to set these bits in the registers above to make the controller blink our little LED. How do you set a bit in a specific register, though? This has to be exposed to a high-level language like C somehow. In assembly code we'd just access the register natively, but save for inline assembly, we can't do that in either C or julia.

When we take a look in our microcontroller datasheet, we notice that there's a chapter 36, "Register Summary", from page 621 onwards. This section is a register reference table. It has an entry for each register, specifying an address, a name, the name of each bit, as well as the page in the datasheet where further documentation, such as initial values, can be found. Scrolling to the end, we find what we've been looking for:

| Address | Name | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1 | Bit 0 | Page |
|---|---|---|---|---|---|---|---|---|---|---|
| 0x05 (0x25) | PORTB | PORTB7 | PORTB6 | PORTB5 | PORTB4 | PORTB3 | PORTB2 | PORTB1 | PORTB0 | 100 |
| 0x04 (0x24) | DDRB | DDB7 | DDB6 | DDB5 | DDB4 | DDB3 | DDB2 | DDB1 | DDB0 | 100 |

So PORTB is mapped to addresses 0x05 and 0x25, while DDRB is mapped to addresses 0x04 and 0x24. Which memory are those addresses referring to? We have EEPROM, flash memory as well as SRAM, after all. Once again, the datasheet comes to our help: chapter 8, "AVR Memories", has a short section on our SRAM memory, with a very interesting figure ("Data Memory Map") as well as this explanation:

> The first 32 locations [of SRAM] address the Register File, the next 64 locations the standard I/O memory, then 160 locations of Extended I/O memory, and the next 512/1024/1024/2048 locations address the internal data SRAM.

So the addresses we got from the register summary actually correspond 1:1 to SRAM addresses^[4]. Neat!
Translating what we've learned into code, our prototype now looks like this:

```julia
const DDRB = Ptr{UInt8}(36)  # 0x24, but julia only provides conversion methods for `Int`
const PORTB = Ptr{UInt8}(37) # 0x25

# The bits we're interested in are the same bit as in the datasheet
#              76543210
const DDB1   = 0b00000010
const PORTB1 = 0b00000010

function main_pointers()
    unsafe_store!(DDRB, DDB1)

    while true
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb | PORTB1) # enable LED
        for _ in 1:500000
            # busy loop
        end
        pb = unsafe_load(PORTB)
        unsafe_store!(PORTB, pb & ~PORTB1) # disable LED
        for _ in 1:500000
            # busy loop
        end
    end
end
```

We can write to our registers by storing some data at their address, as well as read from a register by loading from the same address. In one fell swoop, we got rid of all of our ?? at once! This code now seemingly has everything the C version has, so let's start on the biggest unknown: how do we compile this?

[4]: This is in contrast to more high-level systems like an OS kernel, which utilize virtual memory and paging of sections of memory to give the illusion of being on the "baremetal" machine and handling raw pointers.

## Compiling our code

Julia has for quite some time now run on more than just x86(_64) - it also has support for Linux as well as macOS on ARM. These are, in large part, possible due to LLVM supporting ARM. However, there is one other large space where julia code can run directly: GPUs. For a while now, the package GPUCompiler.jl has done a lot of work to compile julia down to NVPTX and AMDGPU, the NVidia- and AMD-specific architectures supported by LLVM. Because GPUCompiler.jl interfaces with LLVM directly, we can hook into this same mechanism to have it produce AVR instead - the interface is extensible!

### Configuring LLVM

The default julia install does not come with the AVR backend of LLVM enabled, so we have to build both LLVM and julia ourselves. Be sure to do this on one of the 1.8 betas, like v1.8.0-beta3.
More recent commits currently break GPUCompiler.jl in this setup, which should be fixed in the future as well. Julia luckily already supports building its dependencies, so we just have to make a few changes to two make files. First, we enable the backend:

```diff
diff --git a/deps/llvm.mk b/deps/llvm.mk
index 5afef0b83b..8d5bbd5e08 100644
--- a/deps/llvm.mk
+++ b/deps/llvm.mk
@@ -60,7 +60,7 @@ endif
 LLVM_LIB_FILE := libLLVMCodeGen.a

 # Figure out which targets to build
-LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF
+LLVM_TARGETS := host;NVPTX;AMDGPU;WebAssembly;BPF;AVR
 LLVM_EXPERIMENTAL_TARGETS :=

 LLVM_CFLAGS :=
```

and then instruct julia not to use the prebuilt LLVM by setting a flag in Make.user:

```
USE_BINARYBUILDER_LLVM=0
```

Now, after running make to start the build process, LLVM is downloaded, patched & built from source and made available to our julia code. The whole LLVM compilation took about 40 minutes on my laptop. I honestly expected worse!

### Defining an architecture

With our custom LLVM built, we can define everything that's necessary for GPUCompiler.jl to figure out what we want. We start by importing our dependencies, defining our target architecture and its target triple:

```julia
using GPUCompiler
using LLVM

#####
# Compiler Target
#####

struct Arduino <: GPUCompiler.AbstractCompilerTarget end

GPUCompiler.llvm_triple(::Arduino) = "avr-unknown-unknown"
GPUCompiler.runtime_slug(::GPUCompiler.CompilerJob{Arduino}) = "native_avr-jl_blink"

struct ArduinoParams <: GPUCompiler.AbstractCompilerParams end
```

We're targeting a machine that's running AVR, with no known vendor and no OS - we're baremetal, after all. We're also providing a runtime slug to identify our binary by. Finally, we're defining a dummy struct to hold additional parameters for our target architecture. We don't require any, so we can just leave it empty and otherwise ignore it.
Since the julia runtime can't run on GPUs, GPUCompiler.jl also expects us to provide a replacement module for various operations we might want to do, like allocating memory on our target architecture or throwing exceptions. We're of course not going to do any of that, which is why we can just define an empty placeholder for these as well:

```julia
module StaticRuntime
    # the runtime library
    signal_exception() = return
    malloc(sz) = C_NULL
    report_oom(sz) = return
    report_exception(ex) = return
    report_exception_name(ex) = return
    report_exception_frame(idx, func, file, line) = return
end

GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{<:Any,ArduinoParams}) = StaticRuntime
GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{Arduino}) = StaticRuntime
GPUCompiler.runtime_module(::GPUCompiler.CompilerJob{Arduino,ArduinoParams}) = StaticRuntime
```

In the future, these calls may be used to provide a simple bump allocator or to report exceptions via the serial bus for other code targeting the arduino. For now though, this "do nothing" runtime is sufficient.^[5]

Now for the compilation. We first define a job for our pipeline:

```julia
function native_job(@nospecialize(func), @nospecialize(types))
    @info "Creating compiler job for '$func($types)'"
    source = GPUCompiler.FunctionSpec(
                func,                               # our function
                Base.to_tuple_type(types),          # its signature
                false,                              # whether this is a GPU kernel
                GPUCompiler.safe_name(repr(func)))  # the name to use in the asm
    target = Arduino()
    params = ArduinoParams()
    job = GPUCompiler.CompilerJob(target, source, params)
end
```

This then gets passed to our LLVM IR builder:

```julia
function build_ir(job, @nospecialize(func), @nospecialize(types))
    @info "Building LLVM IR for '$func($types)'"
    mi, _ = GPUCompiler.emit_julia(job)
    ir, ir_meta = GPUCompiler.emit_llvm(
                    job,                    # our job
                    mi;                     # the method instance to compile
                    libraries=false,        # whether this code uses libraries
                    deferred_codegen=false, # is there runtime codegen?
                    optimize=true,          # do we want to optimize the llvm?
                    only_entry=false,       # is this an entry point?
                    ctx=JuliaContext())     # the LLVM context to use
    return ir, ir_meta
end
```

We first get a method instance from the julia runtime and ask GPUCompiler to give us the corresponding LLVM IR for our given job, i.e. for our target architecture. We don't use any libraries and we can't run codegen, but julia-specific optimizations sure would be nice. They're also required for us, since they remove obviously dead code regarding the julia runtime, which we neither want nor can call into. If it remained in the IR, we'd error out when trying to build our ASM, due to the missing symbols.

After this, it's just a matter of emitting AVR ASM:

```julia
function build_obj(@nospecialize(func), @nospecialize(types); kwargs...)
    job = native_job(func, types)
    @info "Compiling AVR ASM for '$func($types)'"
    ir, ir_meta = build_ir(job, func, types)
    obj, _ = GPUCompiler.emit_asm(
                job,            # our job
                ir;             # the IR we got
                strip=true,     # should the binary be stripped of debug info?
                validate=true,  # should the LLVM IR be validated?
                format=LLVM.API.LLVMObjectFile) # what format would we like to create?
    return obj
end
```

We're also going to strip out debug info, since we can't debug anyway, and we're additionally asking LLVM to validate our IR - a very useful feature!

[5]: The eagle-eyed may notice that this is suspiciously similar to what one needs for Rust - something to allocate and something to report errors. This is no coincidence - it's the minimum required for a language that usually has a runtime that handles things like signals and allocation of memory for you. Spinning this further could lead one to think that Rust, too, is garbage collected, since you never have to call malloc and free yourself - it's all handled by the runtime & compiler, which insert calls to these (or another allocator) in the appropriate places.
## Looking at the binary

When calling this like build_obj(main_pointers, Tuple{}) (we don't pass any arguments to main), we receive a String containing binary data - this is our compiled object file:

```julia
obj = build_obj(main_pointers, Tuple{})
```

```
\x7fELF\x01\x01\x01\0\0\0\0\0\0\0\0\0\x01\0S\0\x01\0\0\0\0\0\0\0\0\0\0\0\xf8\0\0\0\x02\0\0\x004\0\0\0\0\0(\0\x05\0\x01\0\x82\xe0\x84\xb9\0\xc0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\a\0\0\0\0\0\0\0\0\0\0\0\x04\0\xf1\xff\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\x02\0\e\0\0\0\0\0\0\0\x06\0\0\0\x12\0\x02\0?\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x04\0\0\0\x03\x02\0\0\x04\0\0\0\0.rela.text\0__do_clear_bss\0julia_main_pointers\0.strtab\0.symtab\0__do_copy_data\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0/\0\0\0\x03\0\0\0\0\0\0\0\0\0\0\0\xa8\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\x06\0\0\0\x01\0\0\0\x06\0\0\0\0\0\0\x004\0\0\0\x06\0\0\0\0\0\0\0\0\0\0\0\x04\0\0\0\0\0\0\0\x01\0\0\0\x04\0\0\0\0\0\0\0\0\0\0\0\x9c\0\0\0\f\0\0\0\x04\0\0\0\x02\0\0\0\x04\0\0\0\f\0\0\x007\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0<\0\0\0`\0\0\0\x01\0\0\0\x03\0\0\0\x04\0\0\0\x10\0\0\0
```

Let's take a look at the disassembly, to confirm that this is what we expect to see:

```julia
function builddump(fun, args)
    obj = build_obj(fun, args)
    mktemp() do path, io
        write(io, obj)
        flush(io)
        str = read(`avr-objdump -dr $path`, String)
    end |> print
end

builddump(main_pointers, Tuple{})
```

```
/tmp/jl_uOAUKI:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_pointers>:
   0:   82 e0           ldi     r24, 0x02       ; 2
   2:   84 b9           out     0x04, r24       ; 4
   4:   00 c0           rjmp    .+0             ; 0x6
                        4: R_AVR_13_PCREL       .text+0x4
```

Well, that doesn't look good - where has all our code gone? All that's left is a single out, followed by a single do-nothing relative jump. That's almost nothing if we compare to the equivalent C code:

```
$ avr-objdump -d blink_led.elf
[...]
00000080 <main>:
  80:   21 9a           sbi     0x04, 1 ; 4
  82:   2f ef           ldi     r18, 0xFF       ; 255
  84:   8b e7           ldi     r24, 0x7B       ; 123
  86:   92 e9           ldi     r25, 0x92       ; 146
  88:   21 50           subi    r18, 0x01       ; 1
  8a:   80 40           sbci    r24, 0x00       ; 0
  8c:   90 40           sbci    r25, 0x00       ; 0
  8e:   e1 f7           brne    .-8             ; 0x88
  90:   00 c0           rjmp    .+0             ; 0x92
  92:   00 00           nop
  94:   29 98           cbi     0x05, 1 ; 5
  96:   2f ef           ldi     r18, 0xFF       ; 255
  98:   8b e7           ldi     r24, 0x7B       ; 123
  9a:   92 e9           ldi     r25, 0x92       ; 146
  9c:   21 50           subi    r18, 0x01       ; 1
  9e:   80 40           sbci    r24, 0x00       ; 0
  a0:   90 40           sbci    r25, 0x00       ; 0
  a2:   e1 f7           brne    .-8             ; 0x9c
  a4:   00 c0           rjmp    .+0             ; 0xa6
  a6:   00 00           nop
  a8:   ec cf           rjmp    .-40            ; 0x82
[...]
```

This sets the same bit as our code on 0x04 (remember, this was DDRB), initializes a loop variable spread over three registers, branches, jumps, sets and clears bits... basically everything we'd expect our code to do as well, so what gives?

In order to figure out what's going on, we have to remember that julia, LLVM and gcc are optimizing compilers. If they can deduce that some piece of code has no visible effect, for example because you're always overwriting previous loop iterations with known constants, the compiler is usually free to just delete the superfluous writes, because you can't observe the difference anyway. Here, I believe two things happened:

1. The initial unsafe_load from our pointer triggered undefined behavior, since the initial value behind a given pointer is not defined. LLVM saw that, saw that we actually used the read value, and eliminated both load & store: since it's free to pick the value it "read" to be the one we wrote, the load/store pair became superfluous.
2. The now-empty loops serve no purpose, so they got removed as well.

In C, you can solve this problem by using volatile. That keyword is a very strict way of telling the compiler: "Look, I want every single read & write from and to this variable to happen. Don't eliminate any, and don't shuffle them around (except for non-volatile accesses - you're free to shuffle those around)."
In contrast, julia doesn't have this concept at all - but we do have atomics. So let's use them to see if they're enough, even though semantically they're a tiny bit different.^[6]

### Atomicity

With the atomics, our code now looks like this:

```julia
const DDRB = Ptr{UInt8}(36)  # 0x24, but julia only provides conversion methods for `Int`
const PORTB = Ptr{UInt8}(37) # 0x25

# The bits we're interested in are the same bit as in the datasheet
#              76543210
const DDB1   = 0b00000010
const PORTB1 = 0b00000010

function main_atomic()
    ddrb = unsafe_load(PORTB)
    Core.Intrinsics.atomic_pointerset(DDRB, ddrb | DDB1, :sequentially_consistent)

    while true
        pb = unsafe_load(PORTB)
        Core.Intrinsics.atomic_pointerset(PORTB, pb | PORTB1, :sequentially_consistent) # enable LED
        for _ in 1:500000
            # busy loop
        end
        pb = unsafe_load(PORTB)
        Core.Intrinsics.atomic_pointerset(PORTB, pb & ~PORTB1, :sequentially_consistent) # disable LED
        for _ in 1:500000
            # busy loop
        end
    end
end
```

> **Note**
> This is not how you'd usually use atomics in julia! I'm using intrinsics in hopes of communicating with LLVM directly, since I'm dealing with pointers here. For more high-level code, you'd use @atomic operations on struct fields.

giving us the following assembly:

```
/tmp/jl_UfT1Rf:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_atomic>:
   0:   85 b1           in      r24, 0x05       ; 5
   2:   82 60           ori     r24, 0x02       ; 2
   4:   a4 e2           ldi     r26, 0x24       ; 36
   6:   b0 e0           ldi     r27, 0x00       ; 0
   8:   0f b6           in      r0, 0x3f        ; 63
   a:   f8 94           cli
   c:   8c 93           st      X, r24
   e:   0f be           out     0x3f, r0        ; 63
  10:   85 b1           in      r24, 0x05       ; 5
  12:   a5 e2           ldi     r26, 0x25       ; 37
  14:   b0 e0           ldi     r27, 0x00       ; 0
  16:   98 2f           mov     r25, r24
  18:   92 60           ori     r25, 0x02       ; 2
  1a:   0f b6           in      r0, 0x3f        ; 63
  1c:   f8 94           cli
  1e:   9c 93           st      X, r25
  20:   0f be           out     0x3f, r0        ; 63
  22:   98 2f           mov     r25, r24
  24:   9d 7f           andi    r25, 0xFD       ; 253
  26:   0f b6           in      r0, 0x3f        ; 63
  28:   f8 94           cli
  2a:   9c 93           st      X, r25
  2c:   0f be           out     0x3f, r0        ; 63
  2e:   00 c0           rjmp    .+0             ; 0x30
                        2e: R_AVR_13_PCREL      .text+0x18
```

At first glance, it doesn't look too bad.
We have a little bit more code and we see some out instructions, so are we good? Unfortunately, no. There is only a single rjmp, meaning our nice busy loops got eliminated again. I also had to insert those unsafe_load calls to not get a segfault during compilation. Further, the atomics seem to have ended up touching some pretty weird addresses - they read/write 0x3f (address 63), which is mapped to SREG, the status register. Even weirder is what the code does with the value it read:

```
   8:   0f b6           in      r0, 0x3f        ; 63
   a:   f8 94           cli
    ...
   e:   0f be           out     0x3f, r0        ; 63
```

First it reads SREG into r0, then clears the global interrupt flag with cli, then writes the saved value back out. This is the standard AVR pattern for making a store atomic with respect to interrupts - save the status register, disable interrupts, store, restore the status register - but it's not what we're after here, and the loops are still gone. So atomics are not the way to go.

[6]: "Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile, which ensures that every volatile load and store happens and is performed in the stated order. A couple examples: if a SequentiallyConsistent store is immediately followed by another SequentiallyConsistent store to the same address, the first store can be erased. This transformation is not allowed for a pair of volatile stores." - LLVM Documentation - Atomics

### Inline LLVM-IR

The other option we still have at our disposal is writing inline LLVM IR.
Julia has great support for such constructs, so let's use them:

```julia
const DDRB = Ptr{UInt8}(36)
const PORTB = Ptr{UInt8}(37)

const DDB1 = 0b00000010
const PORTB1 = 0b00000010
const PORTB_none = 0b00000000 # We don't need any other pin - set everything low

function volatile_store!(x::Ptr{UInt8}, v::UInt8)
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid,
        Tuple{Ptr{UInt8},UInt8},
        x,
        v
    )
end

function main_volatile()
    volatile_store!(DDRB, DDB1)

    while true
        volatile_store!(PORTB, PORTB1) # enable LED
        for _ in 1:500000
            # busy loop
        end
        volatile_store!(PORTB, PORTB_none) # disable LED
        for _ in 1:500000
            # busy loop
        end
    end
end
```

with our disassembly looking like:

```
/tmp/jl_3twwq9:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_volatile>:
   0:   82 e0           ldi     r24, 0x02       ; 2
   2:   84 b9           out     0x04, r24       ; 4
   4:   90 e0           ldi     r25, 0x00       ; 0
   6:   85 b9           out     0x05, r24       ; 5
   8:   95 b9           out     0x05, r25       ; 5
   a:   00 c0           rjmp    .+0             ; 0xc
                        a: R_AVR_13_PCREL       .text+0x6
```

Much better! Our out instructions store to the correct register. Unsurprisingly, all loops are still eliminated. We could force the busy-loop variable to exist by writing its value somewhere in SRAM, but that's a little wasteful.
Instead, we can go one step deeper with our nesting and have inline AVR assembly in our inline LLVM-IR:

```julia
const DDRB = Ptr{UInt8}(36)
const PORTB = Ptr{UInt8}(37)

const DDB1 = 0b00000010
const PORTB1 = 0b00000010
const PORTB_none = 0b00000000 # We don't need any other pin - set everything low

function volatile_store!(x::Ptr{UInt8}, v::UInt8)
    return Base.llvmcall(
        """
        %ptr = inttoptr i64 %0 to i8*
        store volatile i8 %1, i8* %ptr, align 1
        ret void
        """,
        Cvoid,
        Tuple{Ptr{UInt8},UInt8},
        x,
        v
    )
end

function keep(x)
    return Base.llvmcall(
        """
        call void asm sideeffect "", "X,~{memory}"(i16 %0)
        ret void
        """,
        Cvoid,
        Tuple{Int16},
        x
    )
end

function main_keep()
    volatile_store!(DDRB, DDB1)

    while true
        volatile_store!(PORTB, PORTB1) # enable LED
        for y in Int16(1):Int16(3000)
            keep(y)
        end
        volatile_store!(PORTB, PORTB_none) # disable LED
        for y in Int16(1):Int16(3000)
            keep(y)
        end
    end
end
```

This slightly unorthodox not-even-a-nop construct pretends to execute an instruction that has some side effect, using our loop variable as an argument. I've changed the loop to run for fewer iterations, because that makes the assembly easier to read. Checking the disassembly, we get...
```
/tmp/jl_xOZ5hH:     file format elf32-avr


Disassembly of section .text:

00000000 <julia_main_keep>:
   0:   82 e0           ldi     r24, 0x02       ; 2
   2:   84 b9           out     0x04, r24       ; 4
   4:   21 e0           ldi     r18, 0x01       ; 1
   6:   30 e0           ldi     r19, 0x00       ; 0
   8:   9b e0           ldi     r25, 0x0B       ; 11
   a:   40 e0           ldi     r20, 0x00       ; 0
   c:   85 b9           out     0x05, r24       ; 5
   e:   62 2f           mov     r22, r18
  10:   73 2f           mov     r23, r19
  12:   e6 2f           mov     r30, r22
  14:   f7 2f           mov     r31, r23
  16:   31 96           adiw    r30, 0x01       ; 1
  18:   68 3b           cpi     r22, 0xB8       ; 184
  1a:   79 07           cpc     r23, r25
  1c:   6e 2f           mov     r22, r30
  1e:   7f 2f           mov     r23, r31
  20:   01 f4           brne    .+0             ; 0x22
                        20: R_AVR_7_PCREL       .text+0x16
  22:   45 b9           out     0x05, r20       ; 5
  24:   62 2f           mov     r22, r18
  26:   73 2f           mov     r23, r19
  28:   e6 2f           mov     r30, r22
  2a:   f7 2f           mov     r31, r23
  2c:   31 96           adiw    r30, 0x01       ; 1
  2e:   68 3b           cpi     r22, 0xB8       ; 184
  30:   79 07           cpc     r23, r25
  32:   6e 2f           mov     r22, r30
  34:   7f 2f           mov     r23, r31
  36:   01 f4           brne    .+0             ; 0x38
                        36: R_AVR_7_PCREL       .text+0x2c
  38:   00 c0           rjmp    .+0             ; 0x3a
                        38: R_AVR_13_PCREL      .text+0xc
```

Huzzah! Pretty much everything we'd expect to see is here:

* We write to 0x05 with out.
* We have some brne instructions to busy loop with.
* We increment a register pair (adiw) for our looping.

Granted, the binary is not as small as the one we compiled with -Os from C, but it should work! The only remaining step is to get rid of all those .+0 jump targets, which would otherwise prevent us from actually looping. I've also enabled dumping of relocation labels (that's the R_AVR_7_PCREL stuff) - these are inserted by the compiler to make the code relocatable in an ELF file, and are used by the linker during final linking of the assembly.
Now that we're probably ready to flash, we can link our code into a binary (thereby resolving those relocation labels) and flash it onto our arduino:

```
$ avr-ld -o jl_blink.elf jl_blink.o
$ avr-objcopy -O ihex jl_blink.elf jl_blink.hex
$ avrdude -V -c arduino -p ATMEGA328P -P /dev/ttyACM0 -U flash:w:jl_blink.hex

avrdude: AVR device initialized and ready to accept instructions

Reading | ################################################## | 100% 0.00s

avrdude: Device signature = 0x1e950f (probably m328p)
avrdude: NOTE: "flash" memory has been specified, an erase cycle will be performed
         To disable this feature, specify the -D option.
avrdude: erasing chip
avrdude: reading input file "jl_blink.hex"
avrdude: input file jl_blink.hex auto detected as Intel Hex
avrdude: writing flash (168 bytes):

Writing | ################################################## | 100% 0.04s

avrdude: 168 bytes of flash written

avrdude done.  Thank you.
```

and after flashing we get...

## an LED in Julia

[video: the L9 LED blinking]

Now THAT is what I call two days well spent! The arduino is powered through the serial connector on the right, which I also use to flash programs.

I want to thank everyone in the JuliaLang Slack channel #static-compilation for their help during this! Without them, I wouldn't have thought of the relocation labels in linking, and their help was invaluable when figuring out what does and does not work when compiling julia to an architecture as exotic, for this language, as AVR.

## Limitations

Would I use this in production? Unlikely, but possibly in the future. It was finicky to get going, and random segmentation faults during the compilation process itself are bothersome. But then again - none of this was part of a supported workflow, so I guess I'm happy that it has worked as well as it has! I do believe that this area will steadily improve - after all, it's already working well on GPUs and FPGAs (or so I'm told - julia on an FPGA is apparently some commercial offering from a company).
From what I know, this is the first julia code to run native & baremetal on any Arduino/ATmega-based chip, which in and of itself is already exciting. Still, the fact that there is no such thing as a runtime for this (julia uses libuv for tasks - getting that onto an arduino seems challenging) means you're mostly going to be limited to self-written or vetted code that doesn't rely on too-advanced features, like a GC.

Some niceties I'd like to have are better custom-allocator support, to allow actual, proper "heap" allocation. I haven't tried yet, but I think immutable structs (which are often placed on the stack already - and the ATmega328p does have a stack!) should work out of the box. I'm looking forward to trying out some I2C and SPI communication, but my gut tells me it won't be much different from writing this in C (unless we get custom allocator support, or I use one of the malloc-based arrays from StaticTools.jl, that is).

## Links & references

* Arduino Ethernet R3 Documentation
* Arduino Ethernet R3 Schematic
* ATmega328p Datasheet
* GPUCompiler
* LLVM Documentation - Atomics

CC BY-SA 4.0 Sukera. Last modified: May 23, 2022. Website built with Franklin.jl and the Julia programming language.