https://eclecticlight.co/2021/07/06/code-in-arm-assembly-flow-pipelines-and-performance/

The Eclectic Light Company
hoakley, July 6, 2021
Macs, Technology

Code in ARM Assembly: Flow, pipelines and performance

Before moving on to look at integer and other instructions involving the general-purpose registers, this article rounds off the topic of controlling flow by considering performance, and compares two different conditional loop schemes with those generated by Apple's Swift compiler.

However good learning assembly language may be for the soul, and for those who need to be able to grok disassembled code, the most popular reason for wanting to run your own assembly routines is performance. However good modern compilers are, they have to make compromises when generating code. When you write your own assembly language you can optimise it for your particular purpose.

Pipelines

One of the problems with writing code for modern processors is that they work quite differently from the simple model most of us still have in mind. Simple processors used to chug steadily through code, one instruction at a time: load that value into a register, multiply the contents of two registers putting the result into a third, and so on.

As each instruction requires a series of smaller operations, such as decoding the operation itself, calculating a memory address, fetching a value from an address, and more, what modern processors do is process several instructions at a time. This works a bit like a production line, with four or more instructions at various stages of completion at any one time, in the processor's pipeline.

When that processor is running simple unbranching code, it's easy for it to keep the pipeline full and the processing of instructions running at maximum throughput.
But the moment that a branch instruction enters the pipeline there's a problem: which instruction should be loaded next, one that follows the branch, or one that doesn't? To ensure that its pipeline doesn't grind to a halt each time a branch instruction is loaded, a processor uses heuristics to try to predict which branch will be followed, and then loads those instructions. If branch prediction gets it right almost all the time, then performance remains essentially unaffected by branching.

Normally, conditional loops, such as those used instead of for and while statements, can be predicted most reliably. Those used for if ... else ... cascades are harder to predict, so ARM64 assembly language offers conditional selection as a more efficient alternative, something I'll consider in a future article in this series.

Evaluating performance

One ARM64 feature I'm particularly interested in is its small family of combined floating point multiply-and-add instructions such as

    FMADD D1, D2, D3, D4

which multiplies D2 and D3, adds D4 to that result, and stores the final result in D1 (in this example). There's an equivalent in Intel x86, and these combined instructions are known not so much for their effects on performance as for reducing error, as the intermediate result of the multiplication can be kept in higher precision and the final result is rounded only once.

To take a first tentative look at this, I coded two iterative loops thus:

    for_loop:
        FMADD D0, D4, D5, D6
        FSUB  D0, D0, D6
        FDIV  D4, D0, D5
        FADD  D4, D4, D7
        ADD   X5, X5, #1
        CMP   X5, X4
        B.LE  for_loop

which tests in the tail, just like a conventional for loop, and

    while_loop:
        CMP   X5, X4
        B.GE  while_done
        FMADD D0, D4, D5, D6
        FSUB  D0, D0, D6
        FDIV  D4, D0, D5
        FADD  D4, D4, D7
        ADD   X5, X5, #1
        B     while_loop
    while_done:

which tests at the head, like a while loop.
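To make the arithmetic and the two loop shapes easier to follow, here is a minimal Swift sketch of what the assembly above appears to compute. It is only an illustration under stated assumptions: the function names are hypothetical and do not come from AsmAttic, and the starting values of the loop counters are guesses, as the register initialisation isn't shown in the listings. It relies on the standard library's addingProduct(_:_:), which performs a fused multiply-add with a single rounding, so that FMADD D0, D4, D5, D6 corresponds to (D4 * D5) + D6.

```swift
// Hypothetical helpers for illustration only; none of these names are from AsmAttic.

/// Like FMADD Dd, Dn, Dm, Da: computes (n * m) + a with one rounding at the end.
func fusedMultiplyAdd(_ n: Double, _ m: Double, _ a: Double) -> Double {
    return a.addingProduct(n, m)
}

/// Like FMUL followed by FADD: the product is rounded before the addition.
func separateMultiplyAdd(_ n: Double, _ m: Double, _ a: Double) -> Double {
    let product = n * m
    return product + a
}

/// Tail-tested loop in the shape of for_loop: the body always runs at least once.
/// Assumption: the counter starts at 1, which the assembly listing doesn't show.
func tailTested(reps: Int, a: Double, b: Double, c: Double, inc: Double) -> Double {
    var tempA = a
    var counter = 1
    repeat {
        let dZero = fusedMultiplyAdd(tempA, b, c)   // FMADD D0, D4, D5, D6
        tempA = ((dZero - c) / b) + inc             // FSUB, FDIV, FADD
        counter += 1                                // ADD X5, X5, #1
    } while counter <= reps                         // CMP X5, X4 then B.LE
    return tempA
}

/// Head-tested loop in the shape of while_loop: the body may not run at all.
/// Assumption: the counter starts at 0, which the assembly listing doesn't show.
func headTested(reps: Int, a: Double, b: Double, c: Double, inc: Double) -> Double {
    var tempA = a
    var counter = 0
    while counter < reps {                          // CMP X5, X4 then B.GE while_done
        let dZero = fusedMultiplyAdd(tempA, b, c)
        tempA = ((dZero - c) / b) + inc
        counter += 1
    }
    return tempA
}
```

The difference in rounding shows up with carefully chosen values. With x = 1 + 2^-27, the exact square is 1 + 2^-26 + 2^-54, whose last term is lost when the product is rounded to a double. Subtracting 1 + 2^-26 therefore gives 0 with separateMultiplyAdd, but 2^-54 with fusedMultiplyAdd. The two loops differ only when reps is zero, where the tail-tested version still runs its body once.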
To add a bit of excitement, I pitted my code against that generated by Apple's Swift compiler, providing it with the code

    for _ in 1...theReps {
        dZero = (tempA * theB) + theC
        tempA = ((dZero - theC)/theB) + theInc
    }

which explains what I am calculating. I then used the Hopper disassembler to see what the compiler produced:

        mov   x23, x0
        adrp  x27, #0x100005000
        adrp  x26, #0x100005000
        adrp  x25, #0x100005000
        b.eq  loc_10000430c
        ldr   d0, [x27, #0xba0]
        ldr   d1, [x26, #0xba8]
        fmov  d2, #0x3ff0000000000000
        ldr   d3, [x25, #0xbb0]
    loc_1000042f0:
        fmul  d4, d11, d0
        fadd  d4, d4, d1
        fadd  d4, d4, d3
        fdiv  d4, d4, d0
        fadd  d11, d4, d2
        subs  x8, x8, #0x1
        b.ne  loc_1000042f0

Despite my dropping it a big hint, the Swift compiler didn't use the combined FMADD instruction, but separate FMUL and FADD. Instead of its tail test using a separate CMP, it actually counts down with its loop counter, and uses a SUBS instruction, which effectively includes comparison with 0 to detect the end of the loop. Swift optimises exceedingly well.

To compare between these, I've updated my AsmAttic app to obtain high-precision timings, and improved its output view. Version 2 of AsmAttic, complete with the source for this test, is available from here: asmattic2

Running performance tests in Xcode needs great care. For instance, my first test runs showed consistently that my assembly while loop was fastest, slightly ahead of the for loop, with Swift trailing a long way behind. But by default Xcode builds debug versions: that has no effect on my assembly code, but slows Swift down something rotten. So you need to build for distribution rather than debugging before you can make meaningful comparisons.

The call overhead, taken from running just one loop, is negligibly small across all three tests. Running a million loops, Swift was slightly faster than the while loop, and the for loop was slightly slower again.
But increase the number of loops to 10 million or more and my hand-coded while loop was fastest of all:

    Time for = 0.10187779100000001 seconds
    Time while = 0.063158666 seconds
    Time Swift = 0.071843375 seconds

What was disappointing, though, was that my crude estimates of error showed that Swift's code is consistently more accurate. I clearly need to look in more detail at how FMADD performs, compared with separate instructions.

LDR or ADR?

If you've read Stephen Smith's article about building Hello World in assembly language for M1 Macs, you may now be wondering why my code doesn't:
* use ADR rather than LDR to load addresses;
* use the .align 2 directive to ensure it starts on a 4-byte boundary.

The second is easy to address: his example is a standalone executable, which must be correctly aligned. My assembly routines are part of a whole project, and within that the build tools will ensure appropriate alignment of the whole.

ADR and LDR are less clear. In my assembly code, I use instructions such as

    LDR D5, B_DOUBLE

to load the double defined at B_DOUBLE into the floating point register D5, which appears to work fine when used within a full Xcode app. I note that the code generated by the Swift compiler is similar, for example using

    ldr d3, [x25, #0xbb0]

to load a double constant. I don't know whether this is tolerated here because of the different build tools being used. However, if you do experience problems with LDR, you now know that it might need to be converted to ADR.

And that finally does lead me on to the next article.
Previous articles in this series:
1: Building an app to develop assembly routines, including an explanation of calling assembly language from Swift, with a complete Xcode project
2: Registers explained
3: Working with pointers
4: Controlling flow
5: Conditional loops

Downloads:
* ARM register summary
* ARM operand architecture
* Conditions and conditional branching instructions
* Control Flow
* AsmAttic 2, a complete Xcode project (version 2)
* AsmAttic, a complete Xcode project (version 1)

References
* Procedure Call Standard for the Arm 64-bit Architecture (ARM) from Github
* Writing ARM64 Code for Apple Platforms (Apple)
* Stephen Smith (2020) Programming with 64-Bit ARM Assembly Language, Apress, ISBN 978 1 4842 5880 4.
* Daniel Kusswurm (2020) Modern Arm Assembly Language Programming, Apress, ISBN 978 1 4842 6266 5.
* ARM64 Instruction Set Reference (ARM).