Path: ns-mx!hobbes.physics.uiowa.edu!zaphod.mps.ohio-state.edu!rpi!uwm.edu!ogicse!news.u.washington.edu!nntp.uoregon.edu!cie.uoregon.edu!nparker
From: nparker@cie.uoregon.edu (Neil Parker)
Newsgroups: comp.sys.apple2
Subject: Re: Memory Moving
Message-ID: <1992Mar8.121259.2394@nntp.uoregon.edu>
Date: 8 Mar 92 12:12:59 GMT
References: <c573199@pro-calgary.cts.com>
Sender: news@nntp.uoregon.edu
Organization: The Universal Society for the Prevention of Reality
Lines: 95

In article <c573199@pro-calgary.cts.com> qd@pro-calgary.cts.com (Quinn Dunki) writes:
>Could somebody please instruct me on the use of the opcodes MVP and MVN?
>Based on the descriptions in my books, I haven't gotten them to work yet.

The MVN and MVP instructions move data from one bank of the IIGS's memory to
another.  

MVN and MVP are both 3-byte instructions.  The format in memory is as
follows:

     MVN:    $54 dd ss
     MVP:    $44 dd ss

where "dd" is a byte indicating the destination bank, and "ss" is a
byte indicating the source bank.
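
To make the byte order concrete, here is a small Python sketch that builds
the three bytes from a source and destination bank.  Python is used only for
illustration, and the name encode_block_move is made up for this example.

```python
MVN_OPCODE = 0x54
MVP_OPCODE = 0x44

def encode_block_move(opcode, src_bank, dst_bank):
    """Return the 3 bytes as they sit in memory: opcode, dd, ss."""
    if opcode not in (MVN_OPCODE, MVP_OPCODE):
        raise ValueError("opcode must be $54 (MVN) or $44 (MVP)")
    if not (0 <= src_bank <= 0xFF and 0 <= dst_bank <= 0xFF):
        raise ValueError("bank numbers are one byte each")
    # The destination bank comes first in memory, then the source bank.
    return bytes([opcode, dst_bank, src_bank])

# A forward move from bank 1 to bank 2.
print(encode_block_move(MVN_OPCODE, src_bank=0x01, dst_bank=0x02).hex(" "))
```

Note that the destination byte precedes the source byte in memory, which is
the reverse of the order most people say it in.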

Both instructions use the X register to indicate the location of the source
data in the source bank, the Y register to indicate the location of the
destination data in the destination bank, and the A register to indicate
how many bytes to move.

The MVN instruction is a "forward" move.  It works like this:

     REPEAT
          Get the byte at location $ss0000+X
          Store it at location $dd0000+Y
          X := X + 1
          Y := Y + 1
          A := A - 1
     UNTIL A = $FFFF

The MVP instruction is a "backward" move.  It works like this:

     REPEAT
          Get the byte at location $ss0000+X
          Store it at location $dd0000+Y
          X := X - 1
          Y := Y - 1
          A := A - 1
     UNTIL A = $FFFF

When either instruction is finished, it leaves the data bank register
pointing at the destination bank.
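
The two loops above can be modelled in a few lines of Python.  This is only a
sketch of the pseudocode, assuming 16-bit A, X and Y; the function name
block_move and the 3-bank memory array are inventions for the example, and the
numbers are those of a 1024-byte move from $01/2345 to $02/6789.

```python
def block_move(mem, opcode, src_bank, dst_bank, x, y, a):
    """Model MVN ($54) or MVP ($44) with 16-bit A, X and Y.

    mem is a bytearray covering the banks involved.
    Returns the final (x, y, a, data_bank) values.
    """
    if opcode not in (0x54, 0x44):
        raise ValueError("opcode must be $54 (MVN) or $44 (MVP)")
    step = 1 if opcode == 0x54 else -1      # MVN counts up, MVP counts down
    while True:
        mem[(dst_bank << 16) + y] = mem[(src_bank << 16) + x]
        x = (x + step) & 0xFFFF             # registers wrap within the bank
        y = (y + step) & 0xFFFF
        a = (a - 1) & 0xFFFF
        if a == 0xFFFF:                     # UNTIL A = $FFFF
            break
    return x, y, a, dst_bank                # data bank ends up = destination

mem = bytearray(3 * 0x10000)                # banks $00-$02 only
mem[0x012345:0x012345 + 1024] = bytes(i & 0xFF for i in range(1024))

x, y, a, dbr = block_move(mem, 0x54, 0x01, 0x02, 0x2345, 0x6789, 0x03FF)
print(hex(x), hex(y), hex(a), dbr)
```

Because the test comes at the bottom of the loop, A must hold the byte count
minus one: A = 0 still moves one byte.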

These instructions work in either emulation mode or native mode, with
the index registers and accumulator either 8 or 16 bits wide, but they are
most useful in native mode with all registers 16 bits wide.  If the
accumulator is 8 bits wide, you will only be able to move 256 bytes or fewer,
and if X and Y are 8 bits wide, you will only be able to move the first 256
bytes of the source bank into the first 256 bytes of the destination bank.

As an example, suppose we want to move 1024 bytes (4 pages) from location
$01/2345 to location $02/6789:

     CLC          ;Make sure we're in 16-bit native mode
     XCE
     REP #$30
     LDX #$2345   ;Source location
     LDY #$6789   ;Destination location
     LDA #$03FF   ;Number of bytes minus 1
     MVN $0102    ;Move from bank 1 to bank 2 (note flipped operand bytes)

We could do exactly the same thing with the MVP instruction, but we have to
remember that for MVP, the X and Y registers start at the END of the source
and destination areas and work backwards:

     CLC
     XCE
     REP #$30
     LDX #$2744   ;($2744 = $2345 + $03FF)
     LDY #$6B88   ;($6B88 = $6789 + $03FF)
     LDA #$03FF
     MVP $0102
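
The register values in the two listings can be derived mechanically.  The
Python below is just a sketch of that arithmetic; mvn_setup and mvp_setup are
hypothetical helper names, not part of any assembler or toolbox.

```python
def mvn_setup(src, dst, length):
    """Registers for a forward move: X and Y point at the first byte."""
    return {"X": src, "Y": dst, "A": length - 1}

def mvp_setup(src, dst, length):
    """Registers for a backward move: X and Y point at the last byte."""
    return {"X": (src + length - 1) & 0xFFFF,
            "Y": (dst + length - 1) & 0xFFFF,
            "A": length - 1}

# 1024 bytes from offset $2345 to offset $6789, as in the examples above.
for name, regs in (("MVN", mvn_setup(0x2345, 0x6789, 1024)),
                   ("MVP", mvp_setup(0x2345, 0x6789, 1024))):
    print(name, {r: f"${v:04X}" for r, v in regs.items()})
```

Both forms load the same count into A; only the starting X and Y differ.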

In all the time I've owned my IIGS, I don't think I've ever programmed an MVP
instruction--it's easier to use MVN and start at the beginning than to use
MVP and start at the end.  Still, MVP does have a useful purpose...if the
source and destination bank are the same, and the destination area starts
inside the source area, MVN won't copy correctly, and MVP will (to see why,
try working through a few iterations of the MVN and MVP pseudocode above
with ss=dd and Y=X+1).  (Actually, MVP suffers from the opposite problem--
it fails if the destination area ENDS inside the source area, in which
case MVN works fine.)
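
For anyone who would rather not work the iterations by hand, here is a tiny
Python sketch of the overlap case with ss=dd and Y=X+1.  The function move and
the 8-byte buffer are made up for the demonstration; only the copy order
matters.

```python
def move(buf, x, y, count, step):
    """Byte-at-a-time copy inside one bank; step=+1 is MVN, -1 is MVP."""
    for _ in range(count):
        buf[y] = buf[x]
        x += step
        y += step

data = bytearray(b"ABCDEF..")

fwd = bytearray(data)
move(fwd, x=0, y=1, count=6, step=+1)       # MVN with Y = X + 1

bwd = bytearray(data)
move(bwd, x=5, y=6, count=6, step=-1)       # MVP, starting at the last bytes

print(fwd)   # the first byte is smeared over the whole destination
print(bwd)   # the six bytes arrive intact, shifted up by one
```

The forward copy overwrites each source byte just before it is read, so the
first byte propagates through the whole area; the backward copy reads each
byte before anything lands on it.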

               - Neil Parker

P.S.  The two examples above use the mini-assembler syntax for MVN and
MVP--the mini-assembler wants a two-byte operand with the source bank in
the high byte and the destination bank in the low byte.  Other assemblers
will probably want a different syntax...for example, the APW assembler
syntax is "MVN expr,expr" and "MVP expr,expr".
--
Neil Parker                 No cute ASCII art...no cute quote...no cute
nparker@cie.uoregon.edu     disclaimer...no deposit, no return...
parker@corona.uoregon.edu   (This space intentionally left blank:           )
Newsgroups: comp.sys.apple2.programmer
Path: news.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!jussieu.fr!univ-lyon1.fr!ghost.dsi.unimi.it!batcomputer!munnari.oz.au!labtam!philip
From: philip@labtam.oz.au (Philip Stephens)
Subject: Re: Move on IIGS
Message-ID: <1994May18.070458.2418@labtam.labtam.oz.au>
Sender: philip@labtam.labtam.oz.au (Philip Stephens)
Reply-To: philip@labtam.oz.au
Organization: Labtam Australia Pty. Ltd.
References: <hO4P9lA.ewall@delphi.com>
Date: Wed, 18 May 1994 07:04:58 GMT
Lines: 57

Ed Wall writes:

>I was reading the book on Apple IIGs Assembly Language Prog. by Lichty
>and Eyes and they discuss some problems with the MVP and the MVN commands
>and end by saying that it is faster to code a series of 16-bit load/store
>instruction pairs in a loop than use these instructions. My question is
>how many to be faster and how does this all compare to the BlockMove
>command.

  The block move command takes 7 cycles to copy a single byte.  Consider
an example where you copy a whole 64K bank to another bank: the block move
will thus take 7 x 65536 = 458752 cycles.

  Now let's do the same copy using absolute indexing (16-bit accumulator
and index register):

	ldx #0			;3
Loop	lda sourcebank,x	;5 cycles x 32768 + 256
	sta targetbank,x	;6 cycles x 32768
	inx			;2 cycles x 32768
	inx			;2 cycles x 32768
	bne Loop		;3 cycles x 32768 - 1

  This loop completes in 590082 cycles, so is obviously slower than a block
move.  But now let's unroll the loop so that we perform 128 loads and
stores before we bump up the index register by 256:

	ldx #0			;3		
	clc			;2
Loop	lda sourcebank,x	;5 cycles x 32768 + 256
	sta targetbank,x	;6 cycles x 32768
	lda sourcebank+2,x	;...
	sta targetbank+2,x
	...
	lda sourcebank+254,x	;128th pair: 128 x 2 bytes = 256 bytes per pass
	sta targetbank+254,x
	txa			;2 cycles x 256
	adc #256		;3 cycles x 256
	tax			;2 cycles x 256
	beq Done		;2 cycles x 256 + 1
	jmp Loop		;3 cycles x 256
Done	...

  This loop completes in 363781 cycles, and hence is _faster_ than the block
move.  The more loads and stores you do in a single loop, the less overhead
you have and hence the faster the copy operation will be.
  The disadvantage of this technique is obvious: lots of code :-)  But if you
need the speed, that tradeoff is acceptable enough (just use a macro in your
assembly code so that _you_ don't see the massive increase in code ;-)
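
  As a rough check on the arithmetic, and to put a number on "how many", here
is a short Python sketch using the per-instruction cycle counts from the
listings above.  The break-even estimate ignores the small page-crossing term
and the one-off setup cycles, and cycles_per_byte is a name made up for this
example.

```python
BYTES = 65536
WORDS = BYTES // 2

# Block move: 7 cycles for every byte.
block_move = 7 * BYTES

# Simple loop: ldx, then lda/sta/inx/inx/bne once per word.
simple_loop = (3 + (5 * WORDS + 256) + 6 * WORDS
               + 2 * WORDS + 2 * WORDS + (3 * WORDS - 1))

def cycles_per_byte(pairs, overhead=2 + 3 + 2 + 2 + 3):
    """Average cost per byte for `pairs` lda/sta pairs per pass.

    overhead is txa + adc + tax + beq + jmp = 12 cycles per pass.
    """
    return (pairs * (5 + 6) + overhead) / (2 * pairs)

# Smallest number of pairs per pass that beats 7 cycles per byte.
break_even = next(n for n in range(1, 129) if cycles_per_byte(n) < 7)
print(block_move, simple_loop, break_even)
```

  Under these assumptions the unrolled loop starts to beat the block move at
about five load/store pairs per pass; four pairs come out at exactly 7 cycles
per byte.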



-- 
=========== Philip Stephens, Systems Programmer, Labtam Australia ============
====== Owned by Motor (Mo for short), the power-purrer from Down Under =======
=== DS (B+S)t Y 2 X L  W C+++ I T+++ A+ E H+++ S+ V+ F Q+ P++ B+ PA+ PL+++ ===
"Many views yield the truth.  Therefore, be not alone." -- Viggies' Prime Song
Path: news.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!agate!library.ucla.edu!csulb.edu!paris.ics.uci.edu!bonnie.ics.uci.edu!jlee
From: jlee@ics.uci.edu (Orion Pax)
Newsgroups: comp.sys.apple2.programmer
Subject: Re: Move on IIGS
Date: 18 May 1994 09:55:07 GMT
Organization: UC Irvine Department of ICS
Lines: 43
Message-ID: <2rcolr$lg3@paris.ics.uci.edu>
References: <hO4P9lA.ewall@delphi.com> <1994May18.070458.2418@labtam.labtam.oz.au>
NNTP-Posting-Host: bonnie.ics.uci.edu
X-Newsreader: TIN [version 1.2 PL2]

Philip Stephens (philip@labtam.oz.au) wrote:
[ block move of 64k takes 458752 cycles ]
[ absolute index copy takes 590082 cycles ]
[ absolute index copy in 128 pairs takes 363781 ]

As Nate said, these speed comparisons are only valid on stock machines.
Things go totally wacky if an accelerator is involved.  It's then a matter
of how much of the instructions and data gets cached for the code speedup.

Take block move: it moves one byte in 7 cycles.  At 2.5 MHz, each cycle
takes 400 ns, so moving a whole word takes 400 x 7 x 2 = 5600 ns.  On an
accelerator, say a generic 7 MHz/8K TWGS, 5 cycles of the move are entirely
cached while two are forced to run at 2.5 MHz (to read/write the byte from/to
memory).  A cycle at 7 MHz takes approximately 142.8571 ns, so a move takes
400 x 2 x 2 + 142.8571 x 5 x 2 = 3028.5714 ns, which is 54% of the stock
time (about 1.85 times as fast).

An lda abs,x is 5 cycles, so 2000/1228.5714 ns (stock/accelerated); sta abs,x
is 6 cycles, so 2400/1371.4286 ns; inx is 2 cycles (800/285.7143 ns), so two
of them take 1600/571.4286 ns; and a bne is 3 cycles (1200/428.5714 ns).  The
total time per word comes out to 7200/3600 ns, which makes it pretty close to
a move at accelerated speed.

Eeek, ok, now for the 128 lda/sta paired unrolled loop.  Using the
previous calculations, that's (lda + sta) * 128, or (4400/2600 ns * 128).
The txa/adc/tax combination is (2800/1000 ns).  The calculation for this
is more complex because the txa/adc/tax time is added only once per 128
pairs.  However, the average time is still faster than that of mvp/mvn.
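
The figures above can be reproduced with a few lines of Python.  This is a
sketch of the same simple model, nothing more: it assumes that only the cycles
that actually read or write data run at 2.5 MHz and that everything else is
cached at 7 MHz.  The helper names times and add are made up here.

```python
SLOW_NS = 1000 / 2.5     # 400 ns per cycle at 2.5 MHz
FAST_NS = 1000 / 7       # about 142.857 ns per cycle at 7 MHz

def times(total_cycles, memory_cycles):
    """Return (stock ns, accelerated ns) for one instruction."""
    stock = total_cycles * SLOW_NS
    accel = memory_cycles * SLOW_NS + (total_cycles - memory_cycles) * FAST_NS
    return stock, accel

def add(*pairs):
    """Add up (stock, accelerated) pairs column by column."""
    return tuple(sum(col) for col in zip(*pairs))

# Block move: 7 cycles per byte, 2 of them touching memory; two bytes per word.
mv = add(times(7, 2), times(7, 2))
lda, sta = times(5, 2), times(6, 2)
inx, bne = times(2, 0), times(3, 0)
loop = add(lda, sta, inx, inx, bne)
# Unrolled: 128 pairs plus one txa/adc/tax (7 cached cycles), averaged per word.
unrolled = tuple(t / 128 for t in add(*([lda, sta] * 128), times(7, 0)))

for name, (stock, accel) in (("mvn/mvp", mv), ("loop", loop),
                             ("unrolled", unrolled)):
    print(f"{name:9s} {stock:8.1f} ns  {accel:8.1f} ns  per word")
```

Averaged over a pass, the unrolled loop comes to roughly 4422 ns per word
stock and 2608 ns accelerated, against 5600 and 3029 ns for the block move.
Like the text, this leaves out the beq/jmp at the end of each pass.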

Deciding which coding technique is better depends on the application
and usage.  Caching speedup only works on the second time around and
unrolled loop code tends to flush the cache and counter the speedup
because of its size.

This is some of the weird stuff one has to learn and be aware of to
write really good code on an Apple II.  Thanks to Toshi for "learning"
me.  Life just isn't simple ;-p.

Joseph
--
jlee@bonnie.ics.uci.edu  | "If builders built buildings the way programmers
-------------------------+  wrote programs, then the first woodpecker that
     Creative 'Ware      |  came along would destroy civilization."
        II(><)II         |  -- Weinberg's Law
Path: news.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!news.cac.psu.edu!psuvm!kcr103
Organization: Penn State University
Date: Wed, 18 May 1994 20:29:02 EDT
From: Ken Richardson <KCR103@psuvm.psu.edu>
Message-ID: <94138.202902KCR103@psuvm.psu.edu>
Newsgroups: comp.sys.apple2.programmer
Subject: Move Routines
Lines: 100


Try this loop:

QuickCopy  anop
           phb
           phk
           plb
           ldx  #$0ffe
Loop       lda source,x
           sta >destination,x
           lda source+$1000,x
           sta >destination+$1000,x
           .
           .
           .
           lda source+$f000,x
           sta >destination+$f000,x
           dex
           dex
           bne Loop
           lda source
           sta >destination
           .
           .
           .
           lda source+$f000
           sta >destination+$f000

           plb
           rtl

This routine, including the jsl and rtl, takes 374805 cycles.
Since the loop consists of 16 loads and stores, it should
also stay in the cache through consecutive loops.
If not, you could reduce the number of loads and stores
to 8 and still have a reasonable loop (389141 cycles).
If 374805 is still too slow, you could double the loop
size and reduce the cycles to 367637.  This is only 4000
or so cycles more than the 128 load/store loop, but is
one fourth the size.  It is only approx. 1% slower.
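
The three totals can be reproduced with a simple Python model of the routine.
Treat it as a sketch: it assumes 5 cycles per lda, 6 per sta, 2 per dex and
3 per taken bne, and the fixed overhead of 29 cycles (jsl, rtl, phb/phk/plb,
ldx) is an assumption chosen so that the model matches the figures quoted
above.  The function name quick_copy_cycles is made up.

```python
def quick_copy_cycles(pairs, fixed_overhead=29):
    """Estimated cycles to copy 64K with `pairs` lda/sta pairs per pass."""
    chunk = 65536 // pairs              # bytes handled by each lda/sta pair
    passes = chunk // 2 - 1             # X runs from chunk-2 down to 2
    per_pass = pairs * (5 + 6) + 2 + 2 + 3      # pairs, dex, dex, bne
    tail = pairs * (5 + 6)              # final unindexed copy for X = 0
    # The last bne falls through, which costs one cycle less.
    return passes * per_pass - 1 + tail + fixed_overhead

for pairs in (8, 16, 32):
    print(pairs, quick_copy_cycles(pairs))
```

Doubling the number of pairs halves the dex/dex/bne overhead each time, so the
gain shrinks quickly: going from 8 to 16 pairs saves about 14000 cycles, and
going from 16 to 32 saves only about 7000.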
Ken

