https://github.com/TobyLobster/multiply_test

Skip to content
Toggle navigation
 
Sign up

  * Product
      +  
        Actions
        Automate any workflow
      +  
        Packages
        Host and manage packages
      +  
        Security
        Find and fix vulnerabilities
      +  
        Codespaces
        Instant dev environments
      +  
        Copilot
        Write better code with AI
      +  
        Code review
        Manage code changes
      +  
        Issues
        Plan and track work
      +  
        Discussions
        Collaborate outside of code
    Explore
      + All features
      + Documentation
      + GitHub Skills
      + Blog
  * Solutions
    For
      + Enterprise
      + Teams
      + Startups
      + Education
    By Solution
      + CI/CD & Automation
      + DevOps
      + DevSecOps
    Resources
      + Learning Pathways
      + White papers, Ebooks, Webinars
      + Customer Stories
      + Partners
  * Open Source
      +  
        GitHub Sponsors
        Fund open source developers
      +  
        The ReadME Project
        GitHub community articles
    Repositories
      + Topics
      + Trending
      + Collections
  * Pricing

Search or jump to...

Search code, repositories, users, issues, pull requests...

Search
[                    ]
Clear

Search syntax tips

Provide feedback

We read every piece of feedback, and take your input very seriously.

[                    ] [ ] Include my email address so I can be
contacted
Cancel Submit feedback

Saved searches

Use saved searches to filter your results more quickly

Name [                    ] 
Query [                    ]

To see all available qualifiers, see our documentation.

Cancel Create saved search
Sign in
Sign up
You signed in with another tab or window. Reload to refresh your
session. You signed out in another tab or window. Reload to refresh
your session. You switched accounts on another tab or window. Reload
to refresh your session. Dismiss alert
{{ message }}
TobyLobster / multiply_test Public

  * Notifications
  * Fork 2
  * Star 43

Comparing 6502 multiply routines

43 stars 2 forks Activity
Star
Notifications

  * Code
  * Issues 0
  * Pull requests 0
  * Actions
  * Projects 0
  * Security
  * Insights

Additional navigation options

  * Code
  * Issues
  * Pull requests
  * Actions
  * Projects
  * Security
  * Insights

TobyLobster/multiply_test

This commit does not belong to any branch on this repository, and may
belong to a fork outside of the repository.
main
Switch branches/tags
[                    ]
Branches Tags
Could not load branches
Nothing to show
{{ refName }} default View all branches
Could not load tags
Nothing to show
{{ refName }} default
View all tags

Name already in use

A tag already exists with the provided branch name. Many Git commands
accept both tag and branch names, so creating this branch may cause
unexpected behavior. Are you sure you want to create this branch?
Cancel Create
1 branch 0 tags
Code

  * Local
  * Codespaces

  *  
    Clone
    HTTPS GitHub CLI
    [https://github.com/T]

    Use Git or checkout with SVN using the web URL.

    [gh repo clone TobyLo]

    Work fast with our official CLI. Learn more about the CLI.

  * Open with GitHub Desktop
  * Download ZIP

Sign In Required

Please sign in to use Codespaces.

Launching GitHub Desktop

If nothing happens, download GitHub Desktop and try again.

Launching GitHub Desktop

If nothing happens, download GitHub Desktop and try again.

Launching Xcode

If nothing happens, download Xcode and try again.

Launching Visual Studio Code

Your codespace will open once ready.

There was a problem preparing your codespace, please try again.

Latest commit

@TobyLobster
TobyLobster Added missing average cycles counts to source
...
ebf614d Oct 19, 2023
Added missing average cycles counts to source
ebf614d

Git stats

  * 129 commits

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
plotter
Added mult84.a, fixed typo.
January 23, 2023 10:13
results
Added omult31, updated mult27, added caveat to signedness
calculation.
January 26, 2023 11:50
source
fix for random testing. Added omult20
January 12, 2023 16:15
tests
Added missing average cycles counts to source
October 19, 2023 21:46
.gitignore
Initial commit
December 22, 2022 23:56
README.md
Added omult31, updated mult27, added caveat to signedness
calculation.
January 26, 2023 11:50
go
Added omult31, updated mult27, added caveat to signedness
calculation.
January 26, 2023 11:50
go_plot
Initial commit
December 22, 2022 23:56
log.py
More commentary, more tests.
December 31, 2022 22:41
log2.py
More commentary, more tests.
December 31, 2022 22:41
View code
[                    ]
6502 Integer Multiplication - which is best? Contents Introduction
The Implementations Unsigned multiply Signed multiply Miscellaneous
multiply The Results 8 bit x 8 bit unsigned multiply, with 16 bit
result 16 bit x 16 bit unsigned multiply, with 32 bit result Signed
multiply Miscellaneous multiply The Algorithms 1. Binary
Multiplication (Shift and Add) 2. Modified Shift and Add 3. Tables of
Squares 4. Logarithms GO64! magazine articles (omult9.a) Elite,
Master version (omult7.a) Elite, Second Processor version (omult8.a)
Alternative: a table-of-squares approximation (omult11.a) 5. Four bit
multiply 6. Booth's Algorithm 7. Hardware support 8. Repeated
addition Customising 1. Changing the number of bits 2. Changing the
In and Out Parameters 8 bit x 8 bit = 16 bit 16 bit x 16 bit = 32 bit
3. Only Using Partial Results 4. Making Signed Multiply Routines 5.
Self modifying code 6. Multiply using Binary Coded Decimal (BCD) How
to run the tests Dependencies Go How testing works See Also

README.md

 6502 Integer Multiplication - which is best?

 Contents

  * Introduction
  * The Implementations
  * The Results
  * The Algorithms
      + Binary Multiplication (Shift and Add)
      + Modified Shift and Add
      + Tables of Squares
      + Logarithms
      + Four bit multiply
      + Booth's Algorithm
      + Hardware support
      + Repeated addition
  * Customising
  * How to run the tests

 Introduction

    "The search for the ultimate multiply routine seems
    never-ending." - Brooke W. Boering (December 1980)

This document compares the runtime performance and memory used by a
wide variety of general purpose multiplication routines for the 6502
CPU. Over 120 different routines have been exhaustively tested, cycle
counted, and the results plotted.

There is no one 'best' routine or algorithm, because there are always
trade-offs between speed and memory. By speed, I mean the average,
best and worst cases of how many cycles are needed to perform the
multiplication. By memory I mean the total number of bytes needed for
the code itself and all necessary data tables.

There may be other gains based on the context in which it is being
used, e.g. the memory cost can be shared if data tables can be reused
by other routines (for example a square root or division routine).
Perhaps the multiplicands are more likely to lie in a given range. So
it is not possible to recommend a single routine as 'best'. What we
can say is that some routines are almost always, or actually always,
worse than others. In practice, only a few are worth considering.

The most common routines available are for unsigned numbers, either 8
bit x 8 bit with a 16 bit result, or 16 bit x 16 bit with a 32 bit
result. These are the natural focus, however several other routines
are listed further down. There is also a section later that discusses
how to how to customise the routines, e.g. how to handle signed
numbers, adjusting to different bit sizes, etc.

 The Implementations

I have tested the following routines:

 Unsigned multiply

 Source  Bits     Method     From
  code
mult1.a  16x16 shift and add 6502 Software Design by Leo J Scanlon
          =32                (1980) and codebase64 (2015)
         16x16               The Merlin 128 Macro Assembler disk
mult2.a   =32  shift and add (Commodore 128) by Glen Bredon (1986),
                             via The Fridge
mult3.a  16x16 shift and add Neil Parker (2005)
          =32
mult4.a  16x16 shift and add mult39 combined into a 16 bit multiply
          =32                by TobyLobster (2022)
mult5.a  8x8=    tables of   yerricde at everything2 (2001)
          16      squares
mult6.a  8x8=    tables of   eurorusty at everything2 (2013)
          16      squares
mult7.a  8x8=  shift and add Apple Assembly Line, January 1986
          16
mult8.a  8x8=  shift and add Apple Assembly Line, January 1986
          16
mult9.a  8x8=  shift and add The Fridge (2000)
          16
mult10.a 8x8=  shift and add White Flame at codebase64 (2015)
          16
mult11.a 8x8=  shift and add graham at codebase64 (2015)
          16
mult12.a 8x8=  shift and add djmips at codebase64 (2020)
          16
mult13.a 8x8=    tables of   Apple Assembly Line, March 1986
          16      squares
mult14.a 8x8=    tables of   Jackasser at codebase64 (2015)
          16      squares
mult15.a 16x16   tables of   Repose at codebase64 (2017)
          =32     squares
mult16.a 8x8=    tables of   litwr (Vladimir Lidovski) and Urusergi
          16      squares    codebase64 (2015)
mult17.a 8x8=  shift and add Elite (BBC Micro) (1984)
          16
mult18.a 8x8=  shift and add Elite (BBC Master version) (1986)
          16
mult19.a 8x8=  shift and add Australian Personal Computer, Aug 1984
          16                 and Neil Parker (2005)
mult20.a 8x8=  shift and add Becoming Julie (2020)
          16
         8x8=                Machine Code for the Atmos and Oric-I by
mult21.a  16   shift and add Bruce Smith (1984) and Niels Moller
                             (2015)
mult22.a 8x8=    tables of   Niels Moller (2015)
          16      squares
mult23.a 8x8=  shift and add tepples at NesDev (2017)
          16
mult24.a 8x8=  shift and add tepples unrolled at NesDev (2017)
          16
mult25.a 8x8=  shift and add Bregalad at NesDev (2009)
          16
mult26.a 8x8=  shift and add frantik at NesDev (2017)
          16
mult27.a 8x8=    tables of   H2Obsession (2013)
          16      squares
mult28.a 8x8=  shift and add Apple Assembly Line, January 1986
          16
mult29.a 8x8=  shift and add Apple Assembly Line, January 1986
          16                 (unrolled)
mult30.a 8x8=  shift and add tepples unrolled at NesDev (2017)
          16                 (adjusted)
mult31.a 16x16   tables of   Jackasser at codebase64 (2015)
          =32     squares
mult32.a 8x8=      4 bit     keldon at everything2 (2017)
          16     multiply
mult33.a 16x16   tables of   Retro64 (2019)
          =32     squares
mult34.a 8x8=  shift and add Atari Roots by Mark Andrews (1984)
          16
mult35.a 8x8=  shift and add Atari Roots by Mark Andrews (1984)
          16
         16x16               Best of Personal Computer World,
mult36.a  =32  shift and add ASSEMBLER ROUTINES FOR THE 6502 by David
                             Barrow (1985)
mult37.a 8x8=  shift and add Andrew Blance, at codeburst (2020)
          16
mult38.a 8x8=      4 bit     Aviator (BBC Micro) (1983)
          16     multiply
mult39.a 8x8=  shift and add Revs (BBC Micro) (1985)
          16
mult40.a 8x8=  shift and add Meteors (BBC Micro) (1982)
          16
mult41.a 16x16   tables of   mult13 combined into a 16 bit multiply
          =32     squares    by TobyLobster (2022)
mult42.a 16x16   tables of   mult16 combined into a 16 bit multiply
          =32     squares    by TobyLobster (2022)
mult43.a 8x8=  shift and add 6502 assembly language programming by
          16                 Lance A. Leventhal
mult44.a 8x8=  shift and add The Sentinel (BBC Micro) (1988)
          16
mult45.a 16x16 shift and add How to program the Apple II Using 6502
          =32                Assembly Language by Randy Hyde (1981)
mult46.a 16x16 shift and add Apple Programmers Handbook by Paul Irwin
          =32                (1984)
mult47.a 8x8=  shift and add Neil Parker (2005)
          16
mult48.a 16x16 shift and add Micro 6502 Journal Issue 31, Dec 1980,
          =32                p71-74 by Brooke Boering
         16x16               Micro 6502 Journal Issue 31, Dec 1980,
mult49.a  =32  shift and add p71-74 by Brooke Boering, with 8x16
                             multiply 'shortcut' removed
mult50.a 16x16 shift and add mult2 unrolled by TobyLobster (2023)
          =32
mult51.a 16x16 shift and add mult2 unrolled by TobyLobster (2023)
          =32
mult52.a 16x16 shift and add mult2 unrolled by TobyLobster (2023)
          =32
mult53.a 16x16 shift and add mult2 unrolled by TobyLobster (2023)
          =32
mult54.a 16x16 shift and add mult2 unrolled by TobyLobster (2023)
          =32
mult55.a 16x16 shift and add mult2 fully unrolled by TobyLobster
          =32                (2023)
mult56.a 16x16   tables of   mult27 combined into a 16 bit multiply
          =32     squares    by TobyLobster (2023)
mult57.a 8x8=    tables of   H2Obsession smaller memory version
          16      squares    without idTab (2013)
mult58.a 16x16   tables of   mult16 combined into a 16 bit multiply
          =32     squares    by TobyLobster (2023)
mult59.a 16x16   modified    Dr Jefyll (2012)
          =32  shift and add
mult60.a 16x16   modified    Dr Jefyll (2012) with modifications by
          =32  shift and add TobyLobster (2023)
mult61.a 16x16   modified    Dr Jefyll (2012) with modifications and
          =32  shift and add unrolling by TobyLobster (2023)
mult62.a 16x16   modified    Dr Jefyll (2012) with modifications and
          =32  shift and add unrolling by TobyLobster (2023)
mult63.a 16x16   modified    Dr Jefyll (2012) with modifications and
          =32  shift and add unrolling by TobyLobster (2023)
mult64.a 16x16   modified    Dr Jefyll (2012) with modifications and
          =32  shift and add unrolling by TobyLobster (2023)
mult65.a 8x8=    tables of   Nick Jameson's 3D Demo for the BBC Micro
          16      squares    (1994)
         8x8=    tables of   TobyLobster (2023), based on Nick
mult66.a  16      squares    Jameson's 3D Demo for the BBC Micro
                             (1994)
mult67.a 16x16   tables of   julie_m at stardot (Note: preserves
          =32     squares    carry) (2023)
mult68.a 8x8=  shift and add Programming The 6502 by Rodnay Zaks
          16                 (1983)
         16x16               Machine Language Routines for the
mult69.a  =32  shift and add Commodore 64 and 128 by Todd D Heimarck
                             and Patrick Parrish (1987)
         8x8=    repeated    Machine Language Routines for the
mult70.a  16     addition    Commodore 64 and 128 by Todd D Heimarck
                             and Patrick Parrish (1987)
         8x8=    repeated    Machine Language Routines for the
mult71.a  16     addition    Commodore 64 and 128 by Todd D Heimarck
                             and Patrick Parrish (1987)
mult72.a 8x8=    repeated    TobyLobster (2023)
          16     addition
mult73.a 8x8=    repeated    TobyLobster (2023)
          16     addition
mult74.a 16x16 shift and add Mikroprozessoren 6502, 6800, 8080, Z80,
          =32                9900 by Harald Schumny (1983)
mult75.a 8x8=  shift and add Practical Microcomputer Programming by
          16                 Walter J Weller (1980)
mult76.a 8x8=  shift and add Microcomputing magazine (June 1981)
          16                 article by Leo Scanlon
mult77.a 8x8=  shift and add Instrumentation of a Savonius Wind
          16                 Turbine by Samuel Martin Babb (1979)
mult78.a 8x8=  shift and add Commodore 128 Assembly Language
          16                 Programming by Mark Andrews (1986)
mult79.a 8x8=  shift and add NASA Report (1981)
          16
mult80.a 8x8=      4 bit     Kakemoms at Denial, the Commodore Vic 20
          16     multiply    Forum (2015)
mult81.a 8x8=  shift and add Graphics Extension ROM by Acornsoft
          16                 (1985) at $b8d4
mult82.a 8x8=    tables of   Retro Software (2008)
          16      squares
mult83.a 8x8=    tables of   Retro Software (2008)
          16      squares

 Signed multiply

 Source     Bits       Method     From
  code
smult1.a  8x8=16 (    tables of   Jackasser at codebase64 (2015)
          signed)      squares
smult2.a  8x8=16 (     Booth's    Marcus Thill (2017)
          signed)     algorithm
smult3.a 16x16=32 (   tables of   Jackasser at codebase64 (2015)
          signed)      squares
smult4.a  8x8=16 (  shift and add Neil Parker (2005)
          signed)
smult5.a  8x8=16 (  shift and add mult9 converted to signed multiply
          signed)                 by TobyLobster (2022)
smult6.a  8x8=16 (  shift and add EDN magazine (5th Sept 1979),
          signed)                 article by Arch D Robison
smult7.a  8x8=16 (    tables of   Oswald/Resource at codebase64
          signed)      squares    (2015)
smult8.a  8x8=16 (    tables of   mult65 converted to signed multiply
          signed)      squares    by TobyLobster (2022)
smult9.a 16x16=32 (   modified    Dr Jefyll (2012) with modifications
          signed)   shift and add by TobyLobster (2023)

 Miscellaneous multiply

Specialised multiply routines often find their niche in games.
Partial results (a result with fewer bits than expected) are common
for fixed point arithmetic. Even approximate results can be used in
cases where speed is more important than absolute accuracy.

 Source         Bits           Method     From
  code
          16x16=16 (partial               Programming the 65816 by
omult1.a    result,low 16   shift and add David Eyes (1986)
             bits only)
           8x8=8 (partial                 The BBC Micro Compendium by
omult2.a  result, low byte  shift and add Jeremy Ruston (1983), also
                only)                     Nightshade (1985) at $6121
           8x8=8 (partial                 Elite for the BBC Micro
omult3.a  result, high byte shift and add (1984)
                only)
              24x8=32 (                   Elite for the BBC Micro
omult4.a   sign-magnitude   shift and add (1984)
              numbers)
              16x16=16                    The Sentinel for the BBC
omult5.a   (approximate 2   shift and add Micro (1988)
          high bytes only)
          16x16=16 (low 16                The Commodore 64 BASIC/
omult6.a   bit result, or   shift and add KERNAL ROM at $b357 and
            carry set if                  Applesoft II BASIC at $e2b8
          overflow occurs)
           8x8=8 (partial    log and exp  Elite, BBC Master version
omult7.a   result, approx      tables     (1986) and APPLE II Elite
             high byte)                   (1985)
           8x8=8 (partial    log and exp  Elite, Second Processor
omult8.a   result, approx      tables     version (1985)
             high byte)
                                          from articles by Gunnar
           8x8=8 (partial    log and exp  'Krill' Ruthenburg / Plush
omult9.a   result, approx      tables     in the German GO64!
             high byte)                   magazine (2000), via
                                          codebase64
          16x32=32 (partial               BBC BASIC ROM integer
omult10.a   result,low 32   shift and add multiply code (1981)
             bits only)
           8x8=8 (partial     tables of   mult13 reduced to return
omult11.a result, high byte    squares    high byte only by
                only)                     TobyLobster (2023)
           8x8=8 (partial                 Gateway to Apshai, for the
omult12.a result, low byte  shift and add Atari 8-bit family (1983)
                only)
omult13.a 16x8=16 (partial  shift and add Stellar 7, for the Apple II
          result, div 128)                (1983)
          16x16=16 (partial               FastBasic BASIC interpreter
omult14.a   result,low 16   shift and add for the Atari 8-bit
             bits only)                   computers (2017)
          16x16=16 (partial   modified    Dr Jefyll (2012) with
omult15.a   result,low 16   shift and add modifications by
             bits only)                   TobyLobster (2023)
          16x16=16 (partial   tables of   BBC BASIC ROM
omult16.a   result,low 16      squares    multidimensional array
             bits only)                   access code (1981)
          16x8=16 (partial                How to program
omult17.a   result,low 16   shift and add microcomputers by William T
             bits only)                   Barden (1977)
          mxn=n+m (variable               Microcomputing magazine
omult18.a  size multiply)   shift and add (June 1981) article by Leo
                                          J Scanlon
omult19.a     24x24=48      shift and add Graphics Extension ROM by
                                          Acornsoft (1985) at $beb5
                                          6502.org based on 6502
omult20.a     32x32=64      shift and add Software Design by Leo J
                                          Scanlon (1980), expanded by
                                          Greg (1999)
                                          Dr Jefyll (2012) with
omult21.a     24x24=48        modified    modifications and expanded
                            shift and add to 24 bit by TobyLobster
                                          (2023)
                                          Dr Jefyll (2012) with
omult22.a     32x32=64        modified    modifications and expanded
                            shift and add to 32 bit by TobyLobster
                                          (2023)
                                          Dr Jefyll (2012) with
omult23.a mxn=n+m (variable   modified    modifications and
           size multiply)   shift and add generalised to mxn by
                                          TobyLobster (2023)
omult24.a     24x24=24      shift and add Neils at codebase64 (2018)
           3x8=8 (partial                 Starship Command at $1e69
omult25.a  result, high 8   shift and add (1983)
             bits only)
           8x8=8 (partial                 Starship Command at $0fc3
omult26.a  result, high 8   shift and add (1983)
             bits only)
          16x8=16 (partial                Starship Command at $0fa8
omult27.a  result, high 16  shift and add and $10be (1983)
             bits only)
          24x8=24 (partial                Starship Command at $1095
omult28.a  result, high 24  shift and add (1983)
             bits only)
                                          Splitting the Atom (The
          16x8=16 (partial                Acorn Recommended Advanced
omult29.a  result, low 16   shift and add User Guide) by J.R.
             bits only)                   Stevenson and John C.
                                          Rockett (early 1980s)
          24x8=24 (partial
omult30.a  result, high 24  shift and add TobyLobster (2023)
             bits only)
          24x8=24 (partial    tables of
omult31.a  result, high 24     squares    TobyLobster (2023)
             bits only)

 The Results

 8 bit x 8 bit unsigned multiply, with 16 bit result

In the diagrams below, grey dots are the also-rans. They are are
beaten for both cycles and memory by the better orange dots.

Results of 8 x 8 bit unsigned multiply

Take note that the fastest routines vary largely in size, but with
very little difference in cycle counts.

There's one trick however: if you are multiplying lots of numbers by
the same multiplier then these routines can be optimised further.
e.g. The largest (mult14) takes 45.99 cycles on average normally but
takes just 27.99 cycles if the multiplier (in A) doesn't change
between calls. This is because the first instructions of the routine
are setup code based on the multiplier that takes 18 cycles. This
only needs to be done once, leaving a faster multiply. This same
trick can also be done for a smaller benefit (6 cycles) to mult66,
mult27 and mult57.

To see the results of the smaller routines more clearly, here is a
zoomed in view of the same results:

Results of 8 x 8 bit unsigned multiply (detail)

All cycle counts and byte counts include the final RTS (1 byte, 6
cycles), but do not include any initial JSR mult (3 bytes, 6 cycles).

 Source   Average   Memory My Changes
  code     Cycles  (bytes)
mult5.a     92.01      834
mult6.a    137.21      620
mult7.a    133.53       36 with slight change to swap output
                           parameters
mult8.a    153.45       29
mult9.a    162.00       17
mult10.a   221.08       27
mult11.a   162.00       17
mult12.a   108.64       71 slightly tweaked
mult13.a    54.00     1075
mult14.a    45.99     2077
mult16.a    67.48      574
mult17.a   150.47       28 tweaked to handle X=0 on input
mult18.a   111.62       73 tweaked to handle X=0 on input and
                           unrolled
mult19.a   185.00       18
mult20.a   244.00       27 bug fixed
mult21.a   150.00       18
mult22.a    77.49      563
mult23.a   153.00       21
mult24.a   110.63       70 slightly tweaked
mult25.a   243.00       28 bug fixed, tweaked parameter passing
mult26.a   278.14       47 bug fixed
mult27.a    51.49     1316 slightly tweaked
mult28.a   130.00       27 tweaked
mult29.a   120.00       43 tweaked
mult30.a   114.00       74 tweaked
mult32.a   117.14      592
mult34.a   280.00       36
mult35.a   188.00       20
mult37.a   278.00       35
mult38.a    97.00     1345
mult39.a   107.00       69 tweaked slightly
mult40.a   278.00       35
mult43.a   208.90       26
mult44.a   109.00       69
mult47.a   175.00       20
mult57.a    48.49     1058
mult65.a    47.49     1061
mult66.a    45.49     1580
mult68.a   188.00       20 at label 'noadd' use 'ror' not 'lsr' as
                           seen in some editions of the book
mult70.a  1987.11       31
mult71.a  1572.91       41
mult72.a  1544.56       16
mult73.a  1174.08       28
mult75.a   205.90       24 bugs fixed
mult76.a   185.00       18
mult77.a   288.00       43
mult78.a   188.00       20 fixed misleading variable names
mult79.a   399.00       39
mult80.a   110.00      325
mult81.a   199.00       26
mult82.a    67.24      827
mult83.a    56.00     1079

 16 bit x 16 bit unsigned multiply, with 32 bit result

Results of 16 x 16 bit unsigned multiply

To see the results of the smaller routines more clearly, here is a
zoomed in view of the same results:

Results of 16 x 16 bit unsigned multiply (detail)

 Source   Average   Memory My Changes
           Cycles  (bytes)
mult1.a    751.00       38
mult2.a    578.00       33 optimised slightly
mult3.a    711.00       36
mult4.a    567.00      137 I use mult39 from Revs and combine to make
                           16x16
mult15.a   206.60     2181
mult31.a   238.07     2219
mult33.a   609.86     1276 with test code removed, and tables page
                           aligned. Stores numbers in MSB order
mult36.a   957.01       55
mult41.a   350.00     1149 I use mult13 and combine to make 16x16
mult42.a   403.83      647 I use mult16 and combine to make 16x16
mult45.a   695.00       38 optimised slightly
mult46.a   655.00       40
mult48.a   707.11       69
mult49.a   703.00       43 version of mult48 with 8x16 multiply
                           'shortcut' removed
mult50.a   534.00       55 unrolled mult2
mult51.a   524.00       69 unrolled mult2
mult52.a   519.00       75 unrolled mult2
mult53.a   514.00       95 unrolled mult2
mult54.a   497.00      192 unrolled mult2
mult55.a   483.50      344 fully unrolled mult2
mult56.a   259.96     1210 I use mult27 and combine to make 16x16
mult58.a   365.03      772 I use mult16 and combine to make 16x16
mult59.a   553.99       67
mult60.a   527.00       39 mult59 but I use fixed zero page
                           addresses, remove 'decrement to avoid clc'
mult61.a   482.00       57 ...then unrolled the outer loop
mult62.a   442.00       93 ...then unrolled the two inner loops once
mult63.a   422.00      165 ...then unrolled the two inner loops twice
mult64.a   392.00      285 ...then unrolled the two inner loops
                           fully, and optimise register use
mult67.a   633.00       37
mult69.a   946.52       65
mult74.a  1358.00       86 bug fixed

 Signed multiply

Here are some example signed multiply routines. The signed routines
are usually just an unsigned routine with adjustments made before and
/or after it. See below for how to adapt an unsigned multiply into a
signed multiply routine.

 Source  Average  Memory                    Notes
          cycles (bytes)
                         8 x 8 bit signed multiply (16 bit result),
smult1.a   62.99    2095 tweaked for size and speed (based on
                         mult14.a)
smult2.a  329.67      49 8 x 8 bit signed multiply (16 bit result),
                         Booth's Algorithm, bug fixed and optimised
smult3.a  277.57    2253 16 x 16 bit signed multiply (32 bit result),
                         tweaked slightly (based on mult31.a)
smult4.a  242.52      67 8 x 8 bit signed multiply (16 bit result)
                         based on the unsigned mult19
smult5.a  180.50      35 8 x 8 bit signed multiply (16 bit result)
                         based on the unsigned mult9
smult6.a  158.00      39 8 x 8 bit signed multiply (16 bit result)
smult7.a   88.50    1400 8 x 8 bit signed multiply (16 bit result)
                         with bug fix
smult8.a   62.99    1068 8 x 8 bit signed multiply (16 bit result)
                         based on the unsigned mult65
smult9.a  570.00      81 16 x 16 bit signed multiply (32 bit result)
                         based on the unsigned mult60

 Miscellaneous multiply

Other miscellaneous multiply routines with something 'specialised'
about it e.g. only returning an approximate result, or with different
bit depths. A decent variable bit length multiply is available in
omult23.a, but for other maths operations, see BBC Micro Machine Code
Portfolio by Bruce Smith (1984).

 Source   Average  Memory                    Notes
           cycles (bytes)
omult1.a   649.00      33 16 x 16 bit unsigned multiply, ONLY low 16
                          bit result
omult2.a   145.00      16 8 x 8 bit unsigned multiply, ONLY low 8 bit
                          result
omult3.a   128.00      24 8 x 8 bit unsigned multiply, ONLY high 8
                          bit result
omult4.a   686.88      70 24 x 8 bit sign-magnitude multiply, 32 bit
                          result
omult5.a   492.96     196 16 x 16 bit signed/sign-magnitude multiply,
                          16 bit signed approximate result
omult6.a   153.46      38 16 x 16 bit unsigned multiply, ONLY low 16
                          bit result (or carry set on overflow)
omult7.a    46.72     802 8 x 8 bit unsigned multiply, 8 bit high
                          byte approximate result
omult8.a    49.20    1075 8 x 8 bit unsigned multiply, 8 bit high
                          byte approximate result
omult9.a    22.97     780 8 x 8 bit unsigned multiply, 8 bit high
                          byte approximate result
omult10.a  909.00      50 16 x 32 bit unsigned multiply, 32 bit low
                          bytes result
omult11.a   43.00     547 8 x 8 bit unsigned multiply, ONLY
                          approximate high 8 bit result
omult12.a  181.04      27 8 x 8 bit unsigned multiply, ONLY low 8 bit
                          result
omult13.a  202.01     179 16 signed x 8 bit sign-magnitude, 16 bit
                          result, div 128
omult14.a  575.00      43 16 x 16 bit unsigned multiply, ONLY low 16
                          bit result
omult15.a  390.00      47 16 x 16 bit unsigned multiply, ONLY low 16
                          bit result
omult16.a  223.69      33 16 x 16 bit unsigned multiply, ONLY low 16
                          bit result (or carry set on overflow)
omult17.a  267.00      34 16 x 8 bit unsigned multiply, ONLY low 16
                          bit result
omult18.a 2036.00      76 variable m x n byte unsigned multiply (all
                          16 bit x 16 bit multiplies tested)
                          24 x 24 bit unsigned multiply, 48 bit
omult19.a 2169.00      48 result (tested over millions of random
                          inputs, and all 16 bit inputs)
                          32 x 32 bit unsigned multiply, 64 bit
omult20.a 2741.00      66 result (tested over millions of random
                          inputs, and all 16 bit inputs)
                          24 x 24 bit unsigned multiply, 48 bit
omult21.a 1014.00      49 result (tested over millions of random
                          inputs, and all 16 bit inputs)
                          32 x 32 bit unsigned multiply, 64 bit
omult22.a 1653.00      59 result (tested over millions of random
                          inputs, and all 16 bit inputs)
omult23.a 1381.00      76 variable m x n byte unsigned multiply (all
                          16 bit x 16 bit multiplies tested)
                          24 x 24 bit unsigned multiply, ONLY low 24
omult24.a 1356.94      61 bit result (tested over millions of random
                          inputs, and all 16 bit inputs)
omult25.a   60.00      16 3 x 8 bit unsigned multiply, ONLY high 8
                          bit result
omult26.a  145.00      16 8 x 8 bit unsigned multiply, ONLY high 8
                          bit result
omult27.a  444.00      22 16 x 8 bit unsigned multiply, ONLY high 16
                          bit result
omult28.a  897.00      24 24 x 8 bit unsigned multiply, ONLY high 24
                          bit result
omult29.a  267.00      34 16 x 8 bit unsigned multiply, ONLY low 16
                          bit result
omult30.a  310.00      40 24 x 8 bit unsigned multiply, ONLY high 24
                          bit result
omult31.a  168.90    2162 24 x 8 bit unsigned multiply, ONLY high 24
                          bit result

 The Algorithms

 1. Binary Multiplication (Shift and Add)

This is the classic algorithm found in all good textbooks, similar to
pen and paper 'long multiplication', but in base 2. A friendly
introduction is found here. In short, one number is shifted left
(doubled) each time around a loop, and the binary bits of the other
number are used to determine whether to add this shifted number to a
running total.

This is the method used by most programs that need multiplication. It
has the advantage that the code is small and it performs reasonably
well. Also known as Ancient Egyptian multiplication.

 2. Modified Shift and Add

This is a clever variation of the standard shift and add algorithm
that reduces the number of shifts required for a 16 bit multiply (and
larger). In the standard algorithm each of the 16 loop iterations
requires four byte shifts. In this variant each iteration only
requires three shifts. This was found by Dr Jefyll in 2012, and is
described here. The animated diagram is instructive.

 3. Tables of Squares

By storing tables of square numbers, we can speed up multiplication.
This uses:

$$ab = f(a+b) - f(a-b), where f(x) = \frac{x^2} {4}$$

So using two table lookups, an addition and two subtractions, we can
multiply. This is faster than 'shift and add'. The downside is how
much memory needed to store the data. For 8 bit multiplication, the
amount of data varies depending on the exact implementation, but is
either 2k of data (fastest), or 1k (only marginally slower), or 512
bytes (slightly slower again).

An added feature of the 1k and 2k routines particularly is that if
many multiplications are being done with one of the inputs unchanging
then some setup code can be skipped, for even better performance. For
example if a number of points are being rotated by some known angle.

The data tables can be either loaded from storage, or initialised in
code.

 4. Logarithms

This is an approximation for multiplication. This uses:

$$log(ab) = log(a) + log(b)$$

By using a log and exponentiation tables, we can multiply using just
three table lookups and one addition. This is fast.

However, since we are working with integers and not floating point,
this is only an approximation. In particular, when multiplying 8 bit
x 8 bit and returning an 8 bit (high byte) result only, this can give
a reasonable approximation.

The method is described further here here. It has an implementation
we look at next, and compare it with others:

 GO64! magazine articles (omult9.a)

This uses a 256 byte log table and a 511 byte antilog table (total:
767 bytes of data).

Note that its formula for the antilog table $y=2^{(x/f-8)}+.5$ should
not have the $+.5$ as this makes the results less accurate. In
particular, testing with $+.5$ over all 65536 possible inputs we get
the following results:

Error: -5  count: 1
Error: -4  count: 32
Error: -3  count: 262
Error: -2  count: 1086
Error: -1  count: 3934
Error: 0  count: 26871
Error: 1  count: 28384
Error: 2  count: 3937
Error: 3  count: 833
Error: 4  count: 180
Error: 5  count: 16

Root-mean-square deviation: 257.06 (smaller is better)

omult9 results with 0.5 bias

which is more often wrong than it is right. Without the $+.5$ the
code gives more accurate results:

Error: -5  count: 9
Error: -4  count: 93
Error: -3  count: 468
Error: -2  count: 2088
Error: -1  count: 10529
Error: 0  count: 41848
Error: 1  count: 8275
Error: 2  count: 1753
Error: 3  count: 411
Error: 4  count: 61
Error: 5  count: 1

Root-mean-square deviation: 211.64 (smaller is better)

omult9 results without 0.5 bias

 Elite, Master version (omult7.a)

The Master and Second Processor versions of Elite for the BBC Micro
also use logarithms for approximating some 8 bit x 8 bit = 8 bit
(high byte) multiplications (see here).

The BBC Master and Apple II versions of Elite have identical routines
with two log tables and an antilog table (total: 768 bytes of data)
for a version that is wrong by no more than six:

Error -6: 10
Error -5: 119
Error -4: 626
Error -3: 2590
Error -2: 7082
Error -1: 20656
Error 0: 34451
Error 1: 2

Root-mean-square deviation: 292.66 (smaller is better)

omult7 results

 Elite, Second Processor version (omult8.a)

The Second Processor version of Elite has a more accurate version
using an extra antilog table (total: 1024 bytes of data), for a
version that is wrong by no more than three:

Error -3: 90
Error -2: 1981
Error -1: 19356
Error 0: 44109

Root-mean-square deviation: 167.60 (smaller is better)

omult8 results

 Alternative: a table-of-squares approximation (omult11.a)

The same log and antilog tables can be used to implement an
approximate division.

If division is not needed however, then a table of squares method can
be used (total: 512 bytes of data), and assuming (as with log based
methods above) only the high byte of the product is required, the
code for the low byte can be removed, for a version that is wrong by
no more than one:

Error -1: 4707
Error 0: 43681
Error 1: 17148
Root-mean-square deviation: 147.83 (smaller is better)

The table has been biased by '-0.74' by manual experimentation to
minimize the root mean square deviation.

omult11 results

 5. Four bit multiply

Instead of 'binary multiplication' using base 2 (as described above),
we use base 16 (hexadecimal). We use a 256 byte table that stores the
result of multiplying two 4 bit numbers together.

To get an 8 bit x 8 bit multiply, we think of our two 8 bit numbers
as being two pairs of hex digits AB and CD. We multiply each pair of
hex digits together using the lookup table, and add them together as
shown below. This is the same method as regular pen and paper 'long
multiplication':

        AB
       *CD
      ----
        xx      (B*D)+
       xx0      (A*D*16)+
       xx0      (B*C*16)+
      xx00      (A*C*256)

This algorithm is not the fastest, it's nearly 2 times slower than a
regular shift and add.

Aviator for the BBC Micro uses this method (see here).

 6. Booth's Algorithm

The classic shift and add algorithm can sometimes end up doing a lot
of addition. For instance multiplying by 15 involves four additions
since 15 = 1+2+4+8, corresponding to a run of set bits in the
multiplier. It would be quicker to multiply by 16 and subtract the
original number.

Booth's Algorithm tracks when successive bits of the multiplier
change and either adds or subtracts the other number from the total
as needed.

Unusually, this method is designed for signed numbers, not unsigned.

This method turns out to be ~2.7 times slower on the 6502 than an
equivalent 'shift-and-add' routine, so doesn't seem to be used much
in practice. It's used more in designing hardware circuits.

Further explanation of Booth's algorithm here.

 7. Hardware support

Some hardware has multiplication support in silicon. These are likely
to be fastest where available. For instance, the SNES CPU with its
extended 6502 instruction set has hardware for 'unsigned 8 bit x 8
bit = 16 bit' and 'signed 16 bit x 8 bit = 24 bit' routines.

Some early vector based arcade machines like Tempest and Battlezone
were programmed in 6502, with an external processor (Atari's Math Box
) to handle the vector maths, including multiply routines.

 8. Repeated addition

To multiply m*n, just add m, n times. This is stupidly slow for
anything that isn't very small in n, so avoid in general. With 8 bit
multiply, if I show them with all the others, the graph looks like
this:

all results

Only one (mult72, being smallest) is just worthy of an orange dot, in
the unlikely scenario that you can afford 16 bytes but not 17:

repeated addition results

The booby prize for the least efficient multiply goes to mult70 at
nearly 2000 cycles average.

So it's better generally to use binary multiplication instead (e.g.
mult9 or mult11 are 17 bytes). However, for multiplying by small
numbers (between 0 and 24), mult72 is more efficient than mult9.

 Customising

 1. Changing the number of bits

The most common routines I've found either multiply two 8 bit values
to get a 16 bit result, or multiply two 16 bit values to get a 32 bit
result.

These are useful, but in practice what you may need is something
different, something custom made. For example you may need to
multiply a 24 bit number by an 8 bit number, scaling the result down
by 256 to get a new 24 bit number.

The shift-and-add method is straightforward to extend to larger the
number of bits, since the principles are the same no matter how many
bits are used. An m-bit by n-bit multiply needs a result of m+n bits.

It also helps to realise that you can make these routines by building
on your favourite standard 8 bit x 8 bit = 16 bit routine.

Just as binary multiplication works in base 2, this works in base
256. Each byte is one digit. For example, to make a 16 x 16 bit
multiply:

        AB
       *CD
      ----
        xx      (B*D)+
       xx0      (A*D*256)+
       xx0      (B*C*256)+
      xx00      (A*C*256*256)

Adding the four partial results as shown.

 2. Changing the In and Out Parameters

Routines can take input values either from registers or from memory.
It can also return results in registers and/or memory.

 8 bit x 8 bit = 16 bit

The 8 bit routines I have presented here will generally use whichever
parameter method is fastest.

However, the calling code may want to use registers for the
parameters for the multiply for both input and output as this is
often most efficient. You may want to adjust the in/out parameters of
the routine depending on your usage.

In particular, if on exiting the routine the low byte of the result
is in A, then it can be used as the starting point for a subsequent
add or subtract, as used when combining to make a larger bit
multiply. Sometimes carry is guaranteed clear after the multiply
which also helps with optimising a subsequent addition.

 16 bit x 16 bit = 32 bit

These routines mostly use memory locations for in/out parameters, as
there are too many values to hold in the registers.

 3. Only Using Partial Results

For speed, some routines only provide a partial answer. e.g. it may
return only the high byte of the result (as an approximation, often
used with fixed point calculations) or the low byte (for multiplying
small numbers that don't lead to results larger than one byte).

For example, if a routine wants to multiply a 16 bit number by the
sine of an angle this is a problem for an integer routine since the
sine of an angle is a floating point number not an integer. By
scaling up the fractional value to an integer e.g. N=256*sin(angle),
then the integer multiplication by N can happen and the result scaled
down by 256. Note also that negative numbers will need special
treatment:

 4. Making Signed Multiply Routines

Two's complement representation is most commonly used to represent
signed numbers. Occasionally routines use a sign-magnitude
representation (e.g. omult4.a), but I will assume here the standard
two's complement representation is used.

There are two methods of dealing with multiplying signed numbers; one
obvious, the other less obvious but faster. The more obvious method
is:

 1. remember if the sign of the two input numbers differ
 2. remove the sign of each input value (abs)
 3. do a regular unsigned multiply
 4. recall if the signs differ and negate the result if needed

The faster, craftier method is:

 1. do a regular unsigned multiply
 2. If the first input is negative, subtract the second input from
    the high byte(s) of the result.
 3. If the second input is negative, subtract the first input from
    the high byte(s) of the result.

This takes less memory and fewer cycles than the more obvious method.
See C=Hacking16 for more details.

Caveat: If you are using a shift-and-add (or modified shift-and-add)
then a small negative number like -1 will have lots of bits set,
meaning lots of adds occur in the unsigned multiply. But it works
well for table-of-squares routines.

The code to do this can be optimised to be quite small. For instance
smult1.a has:

    ; Step 1: Unsigned multiply
    ;     <do an unsigned multiply here>
    ; Suppose at this point:
    ;     X=one of the original input numbers
    ;     A=high byte of result
    ;     Y=low byte of result

    ; Step 2: apply sign.
    cpx #$80             ; check the sign of one input              2 cycles
    bcc +                ; branch if positive.                      2/3/4
    sbc sm1              ; take off the other input.                3
+
    bit sm1              ; check the sign with of the other input.  3
    bpl +                ; branch if positive.                      2/3/4
    stx temp             ; store the amount to subtract.            4
    sec                  ; prepare carry for subtract.              2
temp = * + 1
    sbc #0               ; subtract (self modifying code).          2
+

Corollary: For an 8 bit x 8 bit multiply where only the low 8 bits of
the result are required, there is no difference between the unsigned
and signed result, the same answer works for both. For a 16 bit x 16
bit multiply where only the lower 16 bits are required, the same is
true.

 5. Self modifying code

If not running from ROM, self-modifying code can be used to optimise
for speed. The table of squares routines often do this, for example.
Implementations that use the shift-and-add algorithm will often add
the multiplicand in a single location in the loop. It may be possible
to replace these 'adc multiplicand' and 'adc multiplicand+1'
instructions with immediate versions 'adc #0' and write the
multiplicand into the adc operands directly from the caller code.

If using self-modifying code, putting the code itself in zero page
can make it run a little faster, if you have the space!

 6. Multiply using Binary Coded Decimal (BCD)

This can be done, but not very efficiently. Here is an implementation
that uses the 'Russian peasant multiplication'. There is discussion
of various methods on the 6502.org forum. A multibyte BCD multiply
(for numbers up to 255 bytes long!) is in 6502 Assembly Lanuage
Routines.

 How to run the tests

 Dependencies

  * I use the MOS Technology 6502 CPU Emulator to emulate the 6502.
  * I use the Z header only library as it is required by the
    emulator.
  * I use the libpng library to plot the log error images.
  * I use matplotlib python library to plot the graphs.
  * I use the acme assembler.
  * I use clang to compile the C code.
  * I use python3 to create the graphs.

 Go

  * I'm set up for macOS. The 'go' script specifies which tests to
    execute. Uncomment the test(s) you want to run. Run the 'go'
    script to execute the tests.
  * The 'tests' folder contains a number of 6502 assembly language
    files ('.a' files) to test.
  * The testing is configured by a small associated '.c' file.
  * The test results are written to the results/ folder.
  * Tests can be executed on multiple threads for speed. Adjust this
    in the go script: -n<number> on the command line for the tester
    program specifies the number of threads.
  * The 'go_plot' script is used to create the graphs from the
    results as svg files.

 How testing works

  * The assembly language code is assembled into a binary file using
    the acme assembler.
  * The tester C code is compiled (using clang) along with the test
    parameters '.c' file.
  * The 6502 binary is loaded and executed (simulated) multiple
    times, over all possible inputs (specified by the test's '.c'
    file).
  * Any unexpected results (e.g. due to errors in the algorithm or
    the test) are reported. The test case that failed is re-run with
    a full disassembly output to aid debugging.
  * The average cycle count is reported and results are output to a
    json file.

 See Also

See also my sqrt_test repository for comparing implementations of
square root.

About

Comparing 6502 multiply routines

Resources

Readme
Activity

Stars

43 stars

Watchers

4 watching

Forks

2 forks
Report repository

Contributors 2

  * @TobyLobster TobyLobster Toby Nelson
  * @Adrian-Nelson Adrian-Nelson Adrian Nelson

Languages

  * C 90.0%
  * Python 10.0%

Footer

 (c) 2023 GitHub, Inc.

Footer navigation

  * Terms
  * Privacy
  * Security
  * Status
  * Docs
  * Contact
  * 

You can't perform that action at this time.