Path: ns-mx!hobbes.physics.uiowa.edu!zaphod.mps.ohio-state.edu!rpi!uwm.edu!ogicse!news.u.washington.edu!nntp.uoregon.edu!cie.uoregon.edu!nparker
From: nparker@cie.uoregon.edu (Neil Parker)
Newsgroups: comp.sys.apple2
Subject: Re: Memory Moving
Message-ID: <1992Mar8.121259.2394@nntp.uoregon.edu>
Date: 8 Mar 92 12:12:59 GMT
References:
Sender: news@nntp.uoregon.edu
Organization: The Universal Society for the Prevention of Reality
Lines: 95

In article qd@pro-calgary.cts.com (Quinn Dunki) writes:
>Could somebody please instruct me on the use of the opcodes MVP and MVN?
>Based on the descriptions in my books, I haven't gotten them to work yet.

The MVN and MVP instructions move data from one bank of the IIGS's memory
to another.

MVN and MVP are both 3-byte instructions. The format in memory is as
follows:

     MVN: $54 dd ss
     MVP: $44 dd ss

where "dd" is a byte indicating the destination bank, and "ss" is a byte
indicating the source bank.

Both instructions use the X register to indicate the location of the
source data in the source bank, the Y register to indicate the location of
the destination data in the destination bank, and the A register to
indicate how many bytes to move.

The MVN instruction is a "forward" move. It works like this:

     REPEAT
          Get the byte at location $ss0000+X
          Store it at location $dd0000+Y
          X := X + 1
          Y := Y + 1
          A := A - 1
     UNTIL A = $FFFF

The MVP instruction is a "backward" move. It works like this:

     REPEAT
          Get the byte at location $ss0000+X
          Store it at location $dd0000+Y
          X := X - 1
          Y := Y - 1
          A := A - 1
     UNTIL A = $FFFF

When either instruction is finished, it leaves the data bank register
pointing at the destination bank.

These instructions work in either emulation mode or native mode, with the
index registers and accumulator either 8 or 16 bits wide, but they are most
useful in native mode with all registers 16 bits wide.
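
The two REPEAT loops above can be stepped through in a small Python model.
This is only a sketch of the pseudocode, not an emulator: the names
block_move, mvn and mvp and the dictionary used as memory are illustrative
assumptions, all three registers are assumed to be 16 bits wide, and the
data bank register and cycle counts are not modelled.

```python
# Sketch of the MVN/MVP pseudocode; illustrative names, not a real emulator API.
MASK16 = 0xFFFF  # assumption: X, Y and A are all 16 bits wide


def block_move(mem, src_bank, dst_bank, x, y, a, step):
    """Run REPEAT ... UNTIL A = $FFFF; step is +1 for MVN, -1 for MVP.

    mem is a sparse dict mapping 24-bit addresses to byte values.
    Returns the final (X, Y, A) register values.
    """
    while True:
        # Get the byte at $ss0000+X and store it at $dd0000+Y.
        mem[(dst_bank << 16) | y] = mem.get((src_bank << 16) | x, 0)
        x = (x + step) & MASK16
        y = (y + step) & MASK16
        a = (a - 1) & MASK16
        if a == MASK16:
            return x, y, a


def mvn(mem, src_bank, dst_bank, x, y, a):
    """Forward move: X and Y start at the beginning of each area."""
    return block_move(mem, src_bank, dst_bank, x, y, a, +1)


def mvp(mem, src_bank, dst_bank, x, y, a):
    """Backward move: X and Y start at the end of each area."""
    return block_move(mem, src_bank, dst_bank, x, y, a, -1)


# Move 4 bytes from $01/2345 to $02/6789, first forward, then backward.
mem = {0x012345 + i: 0x10 + i for i in range(4)}
regs_n = mvn(mem, 0x01, 0x02, 0x2345, 0x6789, 3)      # A = byte count - 1
forward = [mem[0x026789 + i] for i in range(4)]

mem2 = {0x012345 + i: 0x10 + i for i in range(4)}
regs_p = mvp(mem2, 0x01, 0x02, 0x2348, 0x678C, 3)     # start at the last byte
backward = [mem2[0x026789 + i] for i in range(4)]
```

Both calls leave the same four bytes in bank 2; only the final X and Y
differ. The same functions can be called with src_bank equal to dst_bank
to step through a move whose source and destination areas overlap.
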
If the accumulator is 8 bits wide, you will only be able to move 256 bytes
or less, and if X and Y are 8 bits wide, you will only be able to move the
first 256 bytes of the source bank into the first 256 bytes of the
destination bank.

As an example, suppose we want to move 1024 bytes (4 pages) from location
$01/2345 to location $02/6789:

     CLC            ;Make sure we're in 16-bit native mode
     XCE
     REP #$30
     LDX #$2345     ;Source location
     LDY #$6789     ;Destination location
     LDA #$03FF     ;Number of bytes minus 1
     MVN $0102      ;Move from bank 1 to bank 2 (note flipped operand bytes)

We could do exactly the same thing with the MVP instruction, but we have to
remember that for MVP, the X and Y registers start at the END of the source
and destination areas and work backwards:

     CLC
     XCE
     REP #$30
     LDX #$2744     ;($2744 = $2345 + $03FF)
     LDY #$6B88     ;($6B88 = $6789 + $03FF)
     LDA #$03FF
     MVP $0102

In all the time I've owned my IIGS, I don't think I've ever programmed an
MVP instruction--it's easier to use MVN and start at the beginning than to
use MVP and start at the end. Still, MVP does have a useful purpose...if
the source and destination bank are the same, and the destination area
starts inside the source area, MVN won't copy correctly, and MVP will (to
see why, try working through a few iterations of the MVN and MVP pseudocode
above with ss=dd and Y=X+1). (Actually, MVP suffers from the opposite
problem--it fails if the destination area ENDS inside the source area, in
which case MVN works fine.)

               - Neil Parker

P.S. The two examples above use the mini-assembler syntax for MVN and
MVP--the mini-assembler wants a two-byte operand with the source bank in
the high byte and the destination bank in the low byte. Other assemblers
will probably want a different syntax...for example, the APW assembler
syntax is "MVN expr,expr" and "MVP expr,expr".
--
Neil Parker                 No cute ASCII art...no cute quote...no cute
nparker@cie.uoregon.edu     disclaimer...no deposit, no return...
parker@corona.uoregon.edu   (This space intentionally left blank:           )

Newsgroups: comp.sys.apple2.programmer
Path: news.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!jussieu.fr!univ-lyon1.fr!ghost.dsi.unimi.it!batcomputer!munnari.oz.au!labtam!philip
From: philip@labtam.oz.au (Philip Stephens)
Subject: Re: Move on IIGS
Message-ID: <1994May18.070458.2418@labtam.labtam.oz.au>
Sender: philip@labtam.labtam.oz.au (Philip Stephens)
Reply-To: philip@labtam.oz.au
Organization: Labtam Australia Pty. Ltd.
References:
Date: Wed, 18 May 1994 07:04:58 GMT
Lines: 57

Ed Wall writes:
>I was reading the book on Apple IIGs Assembly Language Prog. by Lichty
>and Eyes and they discuss some problems with the MVP and the MVN commands
>and end by saying that it is faster to code a series of 16-bit load/store
>instruction pairs in a loop than use these instructions. My question is
>how many to be faster and how does this all compare to the BlockMove
>command.

   The block move command takes 7 cycles to copy a single byte. Consider
an example where you copy a whole 64K bank to another bank: the block move
will thus take 7 x 65536 = 458752 cycles.

   Now let's do the same copy using absolute indexing (16-bit accumulator
and index register):

        ldx #0                  ;3
Loop    lda sourcebank,x        ;5 cycles x 32768 + 256
        sta targetbank,x        ;6 cycles x 32768
        inx                     ;2 cycles x 32768
        inx                     ;2 cycles x 32768
        bne Loop                ;3 cycles x 32768 - 1

   This loop completes in 590082 cycles, so is obviously slower than a block
move. But now let's unroll the loop so that we perform 128 loads and
stores before we bump up the index register by 256:

        ldx #0                  ;3
        clc                     ;2
Loop    lda sourcebank,x        ;5 cycles x 32768 + 256
        sta targetbank,x        ;6 cycles x 32768
        lda sourcebank+2,x      ;...
        sta targetbank+2,x
        ...
        lda sourcebank+126,x
        sta targetbank+126,x
        txa                     ;2 cycles x 256
        adc #256                ;3 cycles x 256
        tax                     ;2 cycles x 256
        beq Done                ;2 cycles x 256 + 1
        jmp Loop                ;3 cycles x 256
Done    ...

   This loop completes in 363781 cycles, and hence is _faster_ than the block
move. The more loads and stores you do in a single loop, the less overhead
you have and hence the faster the copy operation will be.

   The disadvantage of this technique is obvious: lots of code :-) But if you
need the speed, that tradeoff is acceptable enough (just use a macro in your
assembly code so that _you_ don't see the massive increase in code ;-)

--
=========== Philip Stephens, Systems Programmer, Labtam Australia ============
====== Owned by Motor (Mo for short), the power-purrer from Down Under =======
=== DS (B+S)t Y 2 X L W C+++ I T+++ A+ E H+++ S+ V+ F Q+ P++ B+ PA+ PL+++ ===
"Many views yield the truth. Therefore, be not alone." -- Viggies' Prime Song

Path: news.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!agate!library.ucla.edu!csulb.edu!paris.ics.uci.edu!bonnie.ics.uci.edu!jlee
From: jlee@ics.uci.edu (Orion Pax)
Newsgroups: comp.sys.apple2.programmer
Subject: Re: Move on IIGS
Date: 18 May 1994 09:55:07 GMT
Organization: UC Irvine Department of ICS
Lines: 43
Message-ID: <2rcolr$lg3@paris.ics.uci.edu>
References: <1994May18.070458.2418@labtam.labtam.oz.au>
NNTP-Posting-Host: bonnie.ics.uci.edu
X-Newsreader: TIN [version 1.2 PL2]

Philip Stephens (philip@labtam.oz.au) wrote:

[ block move of 64k takes 458752 cycles ]
[ absolute index copy takes 590082 cycles ]
[ absolute index copy in 128 pairs takes 363781 ]

As Nate said, these speed comparisons are only valid on stock machines.
Things go totally wacky if an accelerator is involved. It's then a matter
of how many instructions and how much data get cached for the code speedup.

Take block move, it moves one byte in 7 cycles. At 2.5 mhz, each cycle
takes 400 ns so to move a whole word takes 400 x 7 x 2 = 5600 ns. On an
accelerator, say a generic 7mhz/8k TWGS, 5 cycles of the move are entirely
cached while two are forced to run at 2.5 (to read/write the byte from/to
memory.)
A cycle at 7mhz takes approximately 142.8571 ns, so a move takes
400 x 2 x 2 + 142.8571 x 5 x 2 = 3028.5714 ns, which is about 54% of the
stock time (roughly 1.85 times as fast).

A lda abs,x is 5 cycles so 2000/1228.5714 ns, sta abs,x is 6 cycles so
2400/1371.4286 ns, inx is 2 cycles (800/285.7143 ns) so for two it's
(1600/571.4286 ns), and a bne is 3 cycles (1200/428.5714). The total time
comes out to 7200/3600 ns which makes it pretty close to a move at
accelerated speed.

Eeek, ok now for the 128 lda/sta paired unrolled loop. Using the previous
calculations, that's (lda + sta) * 128 or (4400/2600 ns * 128). The
txa/adc/tax combination is (2800/1000 ns). The calculation for this is more
complex because txa/adc/tax times are added in every 128th time. However,
the average time is still faster than that of mvp/mvn.

Deciding which coding technique is better depends on the application and
usage. Caching speedup only works on the second time around and unrolled
loop code tends to flush the cache and counter the speedup because of its
size.

This is some of the weird stuff one has to learn and be aware of to write
really good code on an Apple II. Thanks to Toshi for "learning" me. Life
just isn't simple ;-p.

Joseph
--
jlee@bonnie.ics.uci.edu  | "If builders built buildings the way programmers
-------------------------+  wrote programs, then the first woodpecker that
     Creative 'Ware      |  came along would destroy civilization."
        II(><)II         |                           -- Weinberg's Law

Path: news.weeg.uiowa.edu!news.uiowa.edu!hobbes.physics.uiowa.edu!math.ohio-state.edu!howland.reston.ans.net!news.cac.psu.edu!psuvm!kcr103
Organization: Penn State University
Date: Wed, 18 May 1994 20:29:02 EDT
From: Ken Richardson
Message-ID: <94138.202902KCR103@psuvm.psu.edu>
Newsgroups: comp.sys.apple2.programmer
Subject: Move Routines
Lines: 100

Try this loop:

QuickCopy anop
          phb
          phk
          plb
          ldx #$0ffe
Loop      lda source,x
          sta >destination,x
          lda source+$1000,x
          sta >destination+$1000,x
          .
          .
          .
          lda source+$f000,x
          sta >destination+$f000,x
          dex
          dex
          bne Loop
          lda source
          sta >destination
          .
          .
          .
          lda source+$f000
          sta >destination+$f000
          plb
          rtl

This routine, including the jsl and rtl, takes 374805 cycles. Since the
loop consists of 16 loads and stores it should also stay in the cache
through consecutive loops. If not, you could reduce the number of loads and
stores to 8 and still have a reasonable loop (389141 cycles).

If 374805 is still too slow, you could double the loop size and reduce the
cycles to 367637. This is only 4000 or so cycles greater than the 128
load/store loop but is one fourth the size. It is only approx. 1% slower.

Ken
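
The figures quoted across this thread can be tallied in a few lines of
Python. This is only a sketch of the arithmetic already given in the posts:
the cycle totals are copied from the posts as stated, the split between
cached cycles and memory cycles is the assumption made in the accelerator
post (400 ns per stock cycle, 1000/7 ns per accelerated cycle), and the
labels and function names are illustrative.

```python
# Tally of the cycle and timing figures quoted in the thread (names illustrative).
BLOCK_MOVE = 7 * 65536  # MVN/MVP: 7 cycles per byte over one 64K bank

# Simple indexed loop, summed from its per-instruction annotations:
# ldx, then lda / sta / inx / inx / bne for each of 32768 words.
SIMPLE_LOOP = 3 + (5 * 32768 + 256) + 6 * 32768 + 2 * (2 * 32768) + (3 * 32768 - 1)

# Totals as quoted by the posters; not re-derived here.
QUOTED = {
    "128-pair unrolled loop": 363781,
    "QuickCopy, 32 pairs": 367637,
    "QuickCopy, 16 pairs": 374805,
    "QuickCopy, 8 pairs": 389141,
}

STOCK_NS = 400.0        # one cycle at 2.5 MHz
FAST_NS = 1000.0 / 7.0  # one cycle at 7 MHz


def percent_of_block_move(cycles):
    """Express a cycle total as a percentage of the block move total."""
    return 100.0 * cycles / BLOCK_MOVE


def accelerated_ns(cycles, memory_cycles):
    """Time for one instruction when only its memory cycles run at stock speed."""
    return memory_cycles * STOCK_NS + (cycles - memory_cycles) * FAST_NS


# Per-word times on the accelerator, following the accelerator post.
move_word_stock = 2 * 7 * STOCK_NS
move_word_fast = 2 * accelerated_ns(7, 2)
loop_word_fast = (accelerated_ns(5, 2) + accelerated_ns(6, 2)
                  + 2 * accelerated_ns(2, 0) + accelerated_ns(3, 0))

for name, cycles in sorted(QUOTED.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} {cycles:7d} cycles "
          f"{percent_of_block_move(cycles):5.1f}% of block move")
print(f"block move per word: {move_word_stock:.1f} ns stock, "
      f"{move_word_fast:.1f} ns accelerated")
```

Every quoted unrolled variant comes in under the block move total on a
stock machine, and the 32-pair QuickCopy is within about 1% of the 128-pair
loop.
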