https://eclecticlight.co/2021/07/06/code-in-arm-assembly-flow-pipelines-and-performance/

Skip to content
[eclecticlight]

The Eclectic Light Company

Macs, painting, and more

Main navigation

Menu

  * Downloads
  * M1 Macs
  * Mac Problems
  * Mac articles
  * Art
  * Macs
  * Painting

hoakley July 6, 2021 Macs, Technology

Code in ARM Assembly: Flow, pipelines and performance

[hopper02]

Before moving on to look at integer and other instructions involving
the general-purpose registers, this article rounds off the topic of
controlling flow by considering performance, and compares two
different conditional loop schemes with those generated by Apple's
Swift compiler. However good learning assembly language may be for
the soul, and for those who need to be able to grok disassembled
code, if you want to run your own assembly routines, the most popular
reason is performance. However good modern compilers are, they have
to make compromises when generating code. When you write your own
assembly language you can optimise it for your particular purpose.

Pipelines

One of the problems with writing code for modern processors is that
they work quite differently from the simple model most still have in
our mind. Simple processors used to chug steadily through code, one
instruction at a time: load that value into a register, multiply the
contents of two registers putting the result into a third, and so on.
As each instruction requires a series of smaller operations, such as
decoding the operation itself, calculating a memory address, fetching
a value from an address, and more, what modern processors do is
process several instructions at a time.

This works a bit like a production line, with four or more
instructions at various stages of completion at any one time, in the
processor's pipeline. When that processor is running simple
unbranching code, it's easy for it to keep the pipeline full and the
processing of instructions running at maximum throughput. But the
moment that a branch instruction enters the pipeline there's a
problem: which instruction should be loaded next, one that follows
the branch, or one that doesn't?

To ensure that its pipeline doesn't grind to a halt each time a
branch instruction is loaded, a processor uses heuristics to try to
predict which branch will be followed, and then load those
instructions. If branch prediction gets it right almost all the time,
then performance remains essentially unaffected by branching.
Normally, conditional loops, such as those used instead of for and
while statements, can be predicted most reliably. Those used for if ...
else ... cascades are harder to predict, so ARM64 assembly language
offers conditional selection as a more efficient alternative -
something I'll consider in a future article in this series.

Evaluating performance

One ARM64 feature I'm particularly interested in is its small family
of combined floating point multiply-and-add instructions such as
FMADD D1, D2, D3, D4
which multiplies D3 and D4, adds D2 to that result, and stores the
final result in D1 (in this example). There's an equivalent in Intel
x86, and these combined instructions are known not for their effects
on performance, but on reducing error, as the intermediate result of
the multiplication can be kept in higher precision. To take a first
tentative look at this, I coded two iterative loops thus:

for_loop:
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
ADD X5, X5, #1
CMP X5, X4
B.LE for_loop
which tests in the tail, just like a conventional for loop, and
while_loop:
CMP X5, X4
B.GE while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
ADD X5, X5, #1
B while_loop
while_done:
which tests at the head, like a while loop.

hopper01

To add a bit of excitement, I pitted my code against that generated
by Apple's Swift compiler, providing it with the code
for _ in 1...theReps {
dZero = (tempA * theB) + theC
tempA = ((dZero - theC)/theB) + theInc
}
which explains what I am calculating. I then used the Hopper
disassembler to see what the compiler produced:

hopper02

mov x23, x0
adrp x27, #0x100005000
adrp x26, #0x100005000
adrp x25, #0x100005000
b.eq loc_10000430c
ldr d0, [x27, #0xba0]
ldr d1, [x26, #0xba8]
fmov d2, #0x3ff0000000000000
ldr d3, [x25, #0xbb0]
loc_1000042f0:
fmul d4, d11, d0
fadd d4, d4, d1
fadd d4, d4, d3
fdiv d4, d4, d0
fadd d11, d4, d2
subs x8, x8, #0x1
b.ne loc_1000042f0

Despite dropping it a big hint, the Swift compiler didn't use the
combined FMADD instruction, but separate FMUL and FADD. Instead of
its tail test using a separate CMP, it actually counts down with its
loop counter, and uses a SUBS instruction, which effectively includes
comparison with 0 to set the end of the loop. Swift optimises
exceedingly well.

To compare between these, I've updated my AsmAttic app to obtain
high-precision timings, and improved its output view. Version 2 of
AsmAttic, complete with the source for this test, is available from
here: asmattic2

Running performance tests in Xcode needs great care. For instance, my
first test runs showed consistently that my assembly while loop was
fastest, slightly ahead of the for loop, with Swift trailing a long
way behind. But by default Xcode builds debug versions: that has no
effect on my assembly code, but slows Swift down something rotten. So
you need to build for distribution rather than debugging before you
can make meaningful comparisons.

The call overhead, taken from running just one loop, is negligibly
small across all three tests. Running a million loops, Swift was
slightly faster than the while loop, and the for loop was slightly
slower again. But increase the number of loops to 10 million or more
and my hand-coded while loop was fastest of all:
Time for = 0.10187779100000001 seconds
Time while = 0.063158666 seconds
Time Swift = 0.071843375 seconds

What was disappointing, though, was that my crude estimates of error
showed that Swift's code is consistently more accurate. I clearly
need to look in more detail at how FMADD performs, compared with
separate instructions.

LDR or ADR?

If you've read Steven Smith's article about building Hello World in
assembly language for M1 Macs, you may now be wondering why my code
doesn't:

  * use ADR rather than LDR to load addresses;
  * use the .align 2 directive to ensure it starts on a 64-bit
    boundary.

The second is easy to address: his example is a standalone
executable, which must be correctly aligned. My assembly routines are
part of a whole project, and within that the build tools will ensure
appropriate alignment of the whole.

ADR and LDR are less clear. In my assembly code, I use instructions
such as
LDR D5, B_DOUBLE
to load the double defined at B_DOUBLE into the floating point
register D5, which appears to work fine when used within a full Xcode
app. I note that the code generated by the Swift compiler is similar,
for example using
ldr d3, [x25, #0xbb0]
to load a double constant. I don't know whether this is tolerated
here because of the different build tools being used. However, if you
do experience problems with LDR, you now know that it might need to
be converted to ADR. And that finally does lead me on to the next
article.

Previous articles in this series:

1: Building an app to develop assembly routines, including an
explanation of calling assembly language from Swift, with a complete
Xcode project
2: Registers explained
3: Working with pointers
4: Controlling flow
5: Conditional loops

Downloads:

ARM register summary
ARM operand architecture
Conditions and conditional branching instructions
Control Flow
AsmAttic 2, a complete Xcode project (version 2)
AsmAttic, a complete Xcode project (version 1)

References

Procedure Call Standard for the Arm 64-bit Architecture (ARM) from
Github
Writing ARM64 Code for Apple Platforms (Apple)
Stephen Smith (2020) Programming with 64-Bit ARM Assembly Language,
Apress, ISBN 978 1 4842 5880 4.
Daniel Kusswurm (2020) Modern Arm Assembly Language Programming,
Apress, ISBN 978 1 4842 6266 5.
ARM64 Instruction Set Reference (ARM).

Share this:

  * Twitter
  * Facebook
  * Reddit
  * Pinterest
  * Email
  * Print
  * 

Like this:

Like Loading...

Related

Posted in Macs, Technology and tagged Apple silicon, ARM, assembler,
assembly language, M1, Swift, Xcode. Bookmark the permalink.

iThere are no comments

Add yours

Leave a Reply Cancel reply

Enter your comment here...
[                    ]

Fill in your details below or click an icon to log in:

  *  
  *  
  *  
  *  

Gravatar
Email (required) (Address never made public)
[                    ]
Name (required)
[                    ]
Website
[                    ]
WordPress.com Logo

You are commenting using your WordPress.com account. ( Log Out / 
Change )

Google photo

You are commenting using your Google account. ( Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. ( Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. ( Log Out /  Change )

Cancel

Connecting to %s

[ ] Notify me of new comments via email.

[ ] Notify me of new posts via email.

[Post Comment] 

[                                             ]
[                                             ]
[                                             ]
[                                             ]
[                                             ]
[                                             ]
[                                             ]
[                                             ]
This site uses Akismet to reduce spam. Learn how your comment data is
processed.

Quick Links

  * Downloads
  * Mac Troubleshooting Summary
  * M1 Macs
  * Mac problem-solving
  * Painting topics
  * Painting
  * Long Reads

Search

Search for: [                    ] [Search]
Monthly archives

  * July 2021 (16)
  * June 2021 (71)
  * May 2021 (80)
  * April 2021 (79)
  * March 2021 (77)
  * February 2021 (75)
  * January 2021 (75)
  * December 2020 (77)
  * November 2020 (84)
  * October 2020 (81)
  * September 2020 (79)
  * August 2020 (103)
  * July 2020 (81)
  * June 2020 (78)
  * May 2020 (78)
  * April 2020 (81)
  * March 2020 (86)
  * February 2020 (77)
  * January 2020 (86)
  * December 2019 (82)
  * November 2019 (74)
  * October 2019 (89)
  * September 2019 (80)
  * August 2019 (91)
  * July 2019 (95)
  * June 2019 (88)
  * May 2019 (91)
  * April 2019 (79)
  * March 2019 (78)
  * February 2019 (71)
  * January 2019 (69)
  * December 2018 (79)
  * November 2018 (71)
  * October 2018 (78)
  * September 2018 (76)
  * August 2018 (78)
  * July 2018 (76)
  * June 2018 (77)
  * May 2018 (71)
  * April 2018 (67)
  * March 2018 (73)
  * February 2018 (67)
  * January 2018 (83)
  * December 2017 (94)
  * November 2017 (73)
  * October 2017 (86)
  * September 2017 (92)
  * August 2017 (69)
  * July 2017 (81)
  * June 2017 (76)
  * May 2017 (90)
  * April 2017 (76)
  * March 2017 (79)
  * February 2017 (65)
  * January 2017 (76)
  * December 2016 (75)
  * November 2016 (68)
  * October 2016 (76)
  * September 2016 (78)
  * August 2016 (70)
  * July 2016 (74)
  * June 2016 (66)
  * May 2016 (71)
  * April 2016 (67)
  * March 2016 (71)
  * February 2016 (68)
  * January 2016 (90)
  * December 2015 (96)
  * November 2015 (103)
  * October 2015 (119)
  * September 2015 (115)
  * August 2015 (117)
  * July 2015 (117)
  * June 2015 (105)
  * May 2015 (111)
  * April 2015 (119)
  * March 2015 (69)
  * February 2015 (54)
  * January 2015 (39)

Tags

Adobe APFS Apple AppleScript Apple silicon App Store backup Big Sur
Blake Bonnard bug bugs Catalina Consolation Console diagnosis Disk
Utility Dore El Capitan extended attributes Finder firmware
Gatekeeper Gerome HFS+ High Sierra history history of painting iCloud
Impressionism iOS landscape LockRattler log logs M1 Mac Mac history
macOS macOS 10.12 macOS 10.13 macOS 10.14 macOS 10.15 macOS 11
malware Metamorphoses Mojave Monet Moreau MRT myth narrative OS X
Ovid painting Pissarro Poussin privacy realism riddle Rubens Sargent
scripting security Sierra Swift symbolism Time Machine Turner update
upgrade vulnerability xattr Xcode XProtect

Statistics

  * 9,128,097 hits

Blog at WordPress.com.

Footer navigation

  * About & Contact
  * Macs
  * Painting
  * Language
  * Tech
  * Life
  * General
  * Downloads
  * Mac problem-solving
  * Extended attributes (xattrs)
  * Painting topics
  * Hieronymus Bosch
  * English language
  * LockRattler: 10.12 Sierra
  * LockRattler: 10.13 High Sierra
  * LockRattler: 10.11 El Capitan
  * Updates: El Capitan
  * Updates: Sierra, High Sierra, Mojave, Catalina, Big Sur
  * LockRattler: 10.14 Mojave
  * SilentKnight, silnite, LockRattler, SystHist & Scrub
  * DelightEd & Podofyllin
  * xattred, Metamer, Sandstrip & xattr tools
  * 32-bitCheck & ArchiChect
  * T2M2, Ulbow, Consolation and log utilities
  * Cirrus & Bailiff
  * Taccy, Signet, Precize, Alifix, UTIutility, Sparsity, alisma
  * Revisionist & DeepTools
  * Text Utilities: Nalaprop, Dystextia and others
  * PDF
  * Keychains & Permissions
  * LockRattler: 10.15 Catalina
  * Updates
  * Spundle, Cormorant, Stibium, Dintch, Fintch and cintch
  * Long Reads
  * LockRattler: 11.0 Big Sur
  * Mac Troubleshooting Summary
  * M1 Macs
  * Mints: a multifunction utility

Secondary navigation

  * Search

Post navigation

 
Don Quixote 17: The captive's tale
Painting within tent: Contents and indexes
Search for: [                    ] [Search]
Begin typing your search above and press return to search. Press Esc
to cancel.

Write a Comment... [                    ]
Email (Required) [                    ] Name (Required)
[                    ] Website [                    ]
[Post Comment]  
Loading Comments...
Comment

x
Send to Email Address [                    ] Your Name
[                    ] Your Email Address [                    ]
[                         ]
loading [Send Email] Cancel
Post was not sent - check your email addresses!
Email check failed, please try again
Sorry, your blog cannot share posts by email.
%d bloggers like this:

[b]