[HN Gopher] How fast can a 6502 transfer memory?
       ___________________________________________________________________
        
       How fast can a 6502 transfer memory?
        
       Author : xmyatniyx
       Score  : 164 points
       Date   : 2022-06-16 11:24 UTC (11 hours ago)
        
 (HTM) web link (imapenguin.com)
 (TXT) w3m dump (imapenguin.com)
        
       | forinti wrote:
       | I once thought about reading HD floppies on a BBC Micro (which
       | can handle SD and DD). But it turns out it can't handle the speed
       | at which the bits come in (500kbps).
       | 
       | SD and even DD are fine (125kbps and 250kbps), so you could read
       | 360KB floppies from a PC.
        
       | spc476 wrote:
       | The 6502 doesn't have a pipeline, so it's quite easy to count
       | instruction cycles and find out how much time a given piece of
       | code will take. I used this technique to bit-bang a serial port
       | back in the day (given two one-bit ports with the CPU doing all
       | the work because the system was too cheap to have an actual
       | UART).
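The cycle counting described above amounts to fitting the work for one bit into a fixed cycle budget and padding the rest. A minimal sketch in Python, assuming a 1 MHz clock, 9600 baud and 40 cycles of per-bit work (all three numbers are illustrative and not from the comment; `padding_cycles` is a hypothetical helper):

```python
def padding_cycles(cpu_hz, baud, work_cycles):
    """Return how many idle cycles must pad each bit so the loop hits the baud rate."""
    bit_time = round(cpu_hz / baud)  # CPU cycles available per serial bit
    if work_cycles > bit_time:
        raise ValueError("per-bit work does not fit in one bit time")
    return bit_time - work_cycles


# Assumed figures: 1 MHz CPU, 9600 baud, 40 cycles spent shifting and writing the bit.
print(padding_cycles(1_000_000, 9600, 40))  # 104 cycles per bit, so 64 cycles of padding
```

On a CPU without a pipeline or cache, that padding can be made up of instructions with known fixed cycle counts.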
        
       | joosters wrote:
       | To be really pedantic, there's a big difference between 'memory
       | bandwidth' and 'memory transfer speed'. The former is just
       | reading (or writing) to a block of memory, and the latter is
       | copying data from one location to another. So a 'memory transfer
       | speed' is going to be slower.
        
         | jmull wrote:
         | I think it's actually not that pedantic.
         | 
         | "Memory bandwidth" is being used in marketing materials today,
         | so it's a little useful to understand what it means. (The
         | author of this article confuses it with memory transfer,
         | probably others do as well.)
        
       | bluemax wrote:
       | Wow, this brings back memories from more than 3 decades ago. I
       | created a routine on the C64 to copy memory and calculated its
       | performance back then at around 25KB/sec.
       | 
       | The first version contained a memory corrupting bug that took
       | some time to figure out. Depending on the locations of the source
       | and destination you have to start copying forwards from the
       | beginning of the source, or backwards from the end. If there's an
       | overlap you risk overwriting the source before it is copied to
       | the destination.
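The direction rule described above can be sketched in a few lines. This is a minimal Python illustration, not the original C64 routine; the name `copy_block` and the bytearray standing in for memory are assumptions made for the example.

```python
def copy_block(mem, src, dst, length):
    """Copy length bytes inside mem from src to dst, safe for overlapping ranges."""
    if dst <= src:
        # Destination is at or below the source: copy forwards from the start,
        # so each source byte is read before it can be overwritten.
        for i in range(length):
            mem[dst + i] = mem[src + i]
    else:
        # Destination is above the source: copy backwards from the end.
        for i in range(length - 1, -1, -1):
            mem[dst + i] = mem[src + i]


mem = bytearray(b"ABCDEFGH")
copy_block(mem, 0, 2, 4)  # overlapping move towards higher addresses
print(mem)                # bytearray(b'ABABCDGH')
```

A forwards-only copy of the same overlapping range would have produced `ABABABGH`, which is the kind of corruption the comment describes.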
        
       | scarface74 wrote:
       | I immediately noticed that in one of the code samples that he is
       | loading and storing data from memory that's not on the first page
       | - memory locations $00-$FF - memory access to and from the first
       | page took one less clock cycle.
       | 
       | The LDA, STA, etc operators for zero page access are different
       | opcodes than their two byte address equivalents
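To make the zero page point concrete, here is a small Python table of the standard NMOS 6502 encodings for LDA and STA in both addressing modes. The dictionary layout and the `copy_cost` helper are illustrative assumptions; the opcode, length and cycle values are the documented ones.

```python
# (mnemonic, addressing mode) -> (opcode, instruction length in bytes, cycles)
OPCODES = {
    ("LDA", "zeropage"): (0xA5, 2, 3),
    ("LDA", "absolute"): (0xAD, 3, 4),
    ("STA", "zeropage"): (0x85, 2, 3),
    ("STA", "absolute"): (0x8D, 3, 4),
}


def copy_cost(mode):
    """Cycles for one LDA/STA pair (one byte moved) in the given addressing mode."""
    return OPCODES[("LDA", mode)][2] + OPCODES[("STA", mode)][2]


print(copy_cost("zeropage"))  # 6
print(copy_cost("absolute"))  # 8
```

Each zero page instruction is also one byte shorter, since the address operand is a single byte.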
        
       | mmphosis wrote:
       | This also misses a lot of other tricks: self-modifying unrolled
       | code, keeping track of blocks of memory that don't need to be
       | updated, or blocks of memory that have the same value. Memory
       | moves may not be the fastest way.
       | 
       | 0 HIMEM: 5608
       | 1DATA553256330092213902139021393213504523826353650303725262750384
       | 56153058173120922939029328358454788756458365750371488455503510839
       | 66132653706165276445377489621322135821322334213502130051520282036
       | 2DATAQLNZQLNZQAQQDSAQRDSAQQDSAQVDSAQXCKDNAPFXANAQXXANANXNXNXQXNXC
       | QXKXNZQXKXNZQCQODMAQODMAUUYQXCKATAUVQXCKMNXQXKANXUAUTCQXKENXUMUSQ
       | JHSAHFQZTXRDFQZTAOCQZTAOBQHDSAPDSAQADSANZNZKDSAQZDSAXZQZOZXZUAXZJ
       | 3 READ L$: READ H$
       | 4 FOR I = 1 TO  LEN (L$)
       | 5POKE767+I,10*(ASC(MID$(H$,I,1))-65)+VAL(MID$(L$,I,1))
       | 6 NEXT
       | RUN
       | HGR : CALL 768: CALL 5608
        
       | metadat wrote:
       | Is Mario Bros a ripoff of Sam's Journey? Or vice versa?
       | 
       | Either way, beautiful.
        
         | gs17 wrote:
         | Sam's Journey looks to be a 2017 release, so it definitely
         | didn't inspire Mario.
         | 
         | [0] https://www.knightsofbytes.games/
        
         | cldellow wrote:
         | Vice versa, Sam's Journey was released in 2017:
         | https://www.knightsofbytes.games/samsjourney
        
       | dusted wrote:
       | I wouldn't call the 6502 a RISC CPU.. It was clearly designed for
       | humans to program, it has multiple and complex addressing modes
       | and instructions to make it easier for us..
       | 
       | Sure it is a small instruction set compared to modern CPUs, but
       | RISC is an idea, not a number.
       | 
       | I'd venture to say that RISC is designing with the goal of making
       | very efficient instructions and allow very efficient compilers to
       | be written for it.. It's the idea to have one faster way of doing
       | something rather than multiple convenient ways, because the
       | compiler doesn't care, and compiler vendors appreciate not having
       | to choose between multiple almost similar instructions that may or
       | may not be faster in some particular case if they don't have to.
        
         | jsrcout wrote:
         | Don't remember where I saw this, but someone said the 6502 was
         | really an RTCC (Reduced Transistor Count Computer).
        
         | [deleted]
        
         | pvg wrote:
         | It's a joke, as the article says.
        
       | vardump wrote:
       | Two nits to pick from this article:
       | 
       | While the article does mention it ignores loop unrolling, it's a
       | bit disingenuous, because that almost DOUBLES performance and
       | it's what nearly all real world code is doing.
       | 
       | Also Sam's Journey PAL version does not need any kind of DMA
       | transfer tricks. NTSC version is just a tiny bit behind in
       | timing, so to be glitch free, REU is used. PAL version still
       | works with minor glitches on an NTSC system.
       | 
       | This is because NTSC has 263 * 63 - 25 * 40 = 15569 cycles
       | available per frame (ignoring those stolen by sprites) and PAL
       | 263 * 63 - 25 * 40 = 18656 cycles (again, ignoring sprites).
       | 
       | The difference is enough that the NTSC version can't move
       | required 2000 bytes of color RAM and character RAM in time in the
       | worst case without REU.
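The "almost doubles" claim about unrolling can be checked by counting cycles per byte. The sketch below is a Python estimate under stated assumptions: the two loop shapes (an indirect-indexed `LDA (zp),Y` / `STA (zp),Y` loop versus fully unrolled absolute `LDA` / `STA` pairs) are guesses at typical code, not taken from the article, and page-crossing penalties are ignored. The per-instruction cycle counts are the standard 6502 ones.

```python
# Standard NMOS 6502 cycle counts (page-crossing penalties ignored).
CYCLES = {
    "LDA abs": 4,
    "STA abs": 4,
    "LDA (zp),Y": 5,
    "STA (zp),Y": 6,
    "INY": 2,
    "BNE taken": 3,
}


def cycles_per_byte(body, overhead, unroll=1):
    """Cycles per byte copied: body runs once per byte, loop overhead once per `unroll` bytes."""
    per_byte = sum(CYCLES[op] for op in body)
    per_iteration = sum(CYCLES[op] for op in overhead)
    return per_byte + per_iteration / unroll


looped = cycles_per_byte(["LDA (zp),Y", "STA (zp),Y"], ["INY", "BNE taken"])
unrolled = cycles_per_byte(["LDA abs", "STA abs"], [])
print(looped, unrolled)  # 16.0 8.0
```

Under these assumptions the unrolled version needs half the cycles per byte, at the cost of six bytes of code for every byte copied.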
        
         | Rediscover wrote:
         | I'm remembering (possibly incorrectly) PAL being 312 (not 263).
         | 
         | Is that what You intended?
        
           | vardump wrote:
           | Yeah. I accidentally left that value wrong after a copy
           | paste. But the result is right. :-)
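With the line count corrected to 312 for PAL, the arithmetic from the comment above works out as follows. This Python sketch uses the figures exactly as given in the thread (63 cycles per line, 25 bad lines of 40 stolen cycles each); the 8 cycles per byte for the copy is an added assumption corresponding to unrolled absolute `LDA` / `STA` pairs, and `cycles_per_frame` is a hypothetical helper.

```python
def cycles_per_frame(lines, cycles_per_line, bad_lines=25, stolen_per_bad_line=40):
    """CPU cycles left per frame after the VIC-II steals cycles on bad lines (sprites ignored)."""
    return lines * cycles_per_line - bad_lines * stolen_per_bad_line


ntsc = cycles_per_frame(263, 63)  # figures as stated in the comment
pal = cycles_per_frame(312, 63)   # 312 lines, per the correction in the reply
needed = 2000 * 8                 # assumption: 2000 bytes at 8 cycles per byte

print(ntsc, pal, needed)          # 15569 18656 16000
print(ntsc >= needed, pal >= needed)  # False True
```

Under that assumption the copy fits in a PAL frame with room to spare but overruns the NTSC budget, which is consistent with the comment.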
        
       | Joyfield wrote:
       | My DOCSIS 3.1 Internet connection has more download bandwidth
       | than my Amiga 500 had to RAM. Latency, not so much.
        
       | cmrdporcupine wrote:
       | Anybody else interested in writing a WASM VM for the 6502 or
       | 65816 etc? This was my brainwave this week. I think this would be
       | a supremely nerdy fun thing to do.
        
       | DeathArrow wrote:
       | Beat that, Apple!
        
         | cestith wrote:
         | Apple used the very same processor family at one time. ;-)
         | 
         | They've come a long way.
        
           | NobodyNada wrote:
           | For a very loose definition of "same processor family", they
           | still do -- ARM is sort of a spiritual successor to the 6502:
           | https://en.wikipedia.org/wiki/ARM_architecture_family#Histor.
           | ..
           | 
           | ARM was designed by the team at Acorn that had worked on the
           | BBC Micro, which used a 6502. They decided to design a custom
           | processor because they felt none of the 16- or 32-bit
           | processors on the market at the time met the standard set by
           | the 6502 for simplicity and low cost. So, they designed their
           | own architecture which took cues from both the cutting-edge
           | RISC research in academia, and the simple practicality of the
           | 6502.
           | 
           | (On a similar note: the 6502's main competitor, the Zilog
           | Z80, is an early ancestor of x86! The Z80 is an enhanced
           | clone of the Intel 8080, which of course the 8086 was heavily
           | based on.)
           | 
           | This legacy still shows up today in the instruction
           | mnemonics: ARM uses "branch" naming (BEQ - branch if equal,
           | BCS - branch if carry set, etc) because that's what the 6502
           | used, whereas x86 spells it "jump" (JE, JC, etc.). ARM uses
           | LDR/STR to load and store registers from memory (like the
           | 6502's LDA/LDX/LDY/STA/STX/STY), whereas x86 just spells
           | everything "MOV". ARM only uses memory-mapped I/O to access
           | hardware, whereas x86 has separate input and output ports.
        
             | cestith wrote:
             | The 6502 was a clone-ish of the Motorola 6800 made to be
             | lower cost. The 6800 led to the 6809 (another
             | competitor, used by the Tandy CoCo and IIRC the Dragon) and
             | to the 68000 series, used by Apple in the Mac, Sun in its
             | early systems, NeXT, Amiga, Atari in their later systems,
             | and more. That led to the PowerPC partnership of Motorola,
             | Apple, and IBM.
             | 
             | PowerPC was outliving its useful life due not to ISA, but
             | manufacturing limitations. So Apple went to Intel, but that
             | wasn't fit for mobile. Apple partnered with ARM to make
             | their mobile chips. Then their mobile chips grew into the
             | M1 and M2 along with ARM, bringing them back to a RISC-ish
             | platform like they had with PowerPC. So it's sort of a dual
             | path back to the same place.
        
               | NobodyNada wrote:
               | > So Apple went to Intel, but that wasn't fit for mobile.
               | Apple partnered with ARM to make their mobile chips.
               | 
               | There's a lot of interesting history there too: in 1990,
               | after seeing the first-generation ARM CPU, Apple
               | partnered with Acorn to co-found ARM Ltd and develop a
               | mobile processor for the Apple Newton. Although the
               | Newton was a failure, ARM was very successful and powered
               | pretty much the entirety of the mobile device revolution
               | -- including of course the iPod and iPhone.
               | 
               | Apple's co-founder status gives them a lot of influence
               | over the ARM architecture -- they led the AArch64 design
               | process, and they seem to be allowed to do things that
               | even other architectural licensees aren't allowed to do,
               | like implementing custom instructions in their ARM cores:
               | https://news.ycombinator.com/item?id=29783549
               | 
               | And Apple's iteration of ARM owes a lot to the PowerPC
               | world as well -- Apple's processor design team was
               | originally PA Semi, a company that designed PowerPC
               | cores.
        
               | klelatti wrote:
               | > they led the AArch64 design process
               | 
               | Interesting - is there a reference for this?
        
               | NobodyNada wrote:
               | Here's a Twitter thread from a former Apple engineer:
               | https://twitter.com/stuntpants/status/1346470705446092811
               | 
               | > arm64 is the Apple ISA, it was designed to enable
               | Apple's microarchitecture plans. There's a reason Apple's
               | first 64 bit core (Cyclone) was years ahead of everyone
               | else, and it isn't just caches.
               | 
               | > Arm64 didn't appear out of nowhere, Apple contracted
               | ARM to design a new ISA for its purposes. When Apple
               | began selling iPhones containing arm64 chips, ARM hadn't
               | even finished their own core design to license to others.
               | 
               | > ARM designed a standard that serves its clients and
               | gets feedback from them on ISA evolution. In 2010 few
               | cared about a 64-bit ARM core. Samsung & Qualcomm, the
               | biggest mobile vendors, were certainly caught unaware by
               | it when Apple shipped in 2013.
               | 
               | > Apple planned to go super-wide with low clocks, highly
               | OoO, highly speculative. They needed an ISA to enable
               | that, which ARM provided.
               | 
               | > M1 performance is not so because of the ARM ISA, the
               | ARM ISA is so because of Apple core performance plans a
               | decade ago.
        
               | klelatti wrote:
               | Very interesting - many thanks!
               | 
               | Edit: I'm a bit puzzled by the claim that Apple was
               | selling Aarch64 before Arm had finished their first
               | design - A7 announced at end 2013 but A53 appeared in
               | 2012?
        
               | NobodyNada wrote:
               | It looks like A53 was _announced_ in October 2012, but
               | I've found no indication of whether the design was
               | actually finished by then [0]. And remember that ARM just
               | sells IP and other companies are responsible for
               | manufacturing it; it doesn't look like anyone actually
               | produced A53 cores until 2015 [1] -- whereas Apple was
               | shipping actual consumer products with A7's in them by
               | October 2013.
               | 
               | [0]: https://www.techspot.com/news/50656-arm-
               | announces-64-bit-cor...
               | 
               | [1]: https://en.wikichip.org/wiki/arm_holdings/microarchi
               | tectures...
        
               | klelatti wrote:
               | Very fair point. OTOH there was a lot of detailed info on
               | the A53 available in 2013 and SoCs were being announced
               | with it.
               | 
               | I suspect this thread may be slightly exaggerating the
               | position but certainly the case that Apple were well
               | ahead of all the competitors - and no doubt they were
               | deeply involved in the ISA design.
        
               | cmrdporcupine wrote:
               | I honestly don't think there's any kind of straight line
               | from the 6809 to the 68000. They share little in common
               | other than the '68' prefix and coming from the same
               | company and being big endian. The instruction sets are
               | very different. Designed by different teams. The
               | peripheral chip set and bus management was different too.
               | 
               | The 68k shares more with 1970s minicomputers especially
               | the PDP-11 and/or VAX architectures than any MPU that
               | preceded it.
        
             | DeathArrow wrote:
             | >the Zilog Z80, is an early ancestor of x86! The Z80 is an
             | enhanced clone of the Intel 8080, which of course the 8086
             | was heavily based on
             | 
             | I owned a Z80-based computer when I was 8 to 10 years old.
             | To me, its instruction set and memory access bear no
             | resemblance to the 8086's.
             | 
             | They seem like very distant relatives.
        
               | klelatti wrote:
               | The 8086 was designed to allow automated translation of
               | 8080 assembly to 8086 assembly - so the instruction set
               | may 'look' different but in fact has a lot in common.
               | 
               | It's also not quite right to call the Z80 an ancestor of the
               | 8086 but certainly closely related due to the common
               | inheritance from the 8080.
        
               | NobodyNada wrote:
               | Yeah, perhaps more of an uncle than a direct ancestor :)
        
               | klelatti wrote:
               | Indeed - someone should do a family tree of CPUs!
        
               | cestith wrote:
               | We need to decide where the NEC v20, v30, v40, and v50
               | live.
        
               | krallja wrote:
               | And the NSC-800, which is like a Z80 with 8085 half-
               | interrupts!
        
               | klelatti wrote:
               | It's the offspring of the marriage of Z80 and 8085!
        
       | MarkusWandel wrote:
       | This also misses loop unrolling, combined with an assembly
       | language version of "Duff's device" to be able to do an arbitrary
       | number of transfers even if your loop is unrolled to, say, 8
       | transfers.
       | 
       | This stuff used to matter! I had an NCR5380 chip on an Amiga,
       | simple, memory mapped I/O, no DMA or interrupts. To get a tape
       | drive to stream (remember that?) the byte transfer loop really
       | had to be tweaked. But once fully tweaked, "whoooooooosh" instead
       | of "chugga chugga chugga".
       | 
       | And truly heroic programming techniques had to be employed on the
       | C64 to do X/Y smooth scrolling games. Often a static part of the
       | screen, conveniently displaying scores etc, existed to make it
       | work - there was just enough bandwidth to do 80% of the screen,
       | say, so you find an excuse to keep the rest of it static.
       | 
       | I kinda miss those days, and I kinda don't. I guess it was good
       | to have experienced them.
        
         | djmips wrote:
         | Those days still exist! If you want your mind blown watch the
         | Epic Games Nanite talk from last year's SIGGRAPH where the core
         | rendering of the dense vertex data is done directly in Compute,
         | i.e. software rendered, instead of using the rasterization
         | hardware, which has a minimum 4 pixel invocation
         | overhead which gets expensive with very small triangles.
         | 
         | This is but one example of this that's happening every day,
         | there is much much more like hair rendering in EA FIFA soccer
         | or automatic trading financial software running on GPUs.
         | 
         | There's a whole world of applications where people are still
         | concerned with every last cycle of performance just like in the
         | C64 days.
        
           | djmips wrote:
           | Here I'll save you the trouble of trying to find the video.
           | 
           | https://www.youtube.com/watch?v=eviSykqSUUw&list=PLabw4gCouT.
           | ..
        
         | amelius wrote:
         | We're now doing sort of the same tricks, but with power
         | management on mobile devices.
        
         | djmips wrote:
         | Kind of off topic for 6502 memory copy speeds but with regard
         | to scrolling, there emerged a pretty cool software hack for the
         | C64 (called VSP) where you could trick the poor VIC chip into
         | starting scanning out the screen position later in memory. Move
         | the start one character and the whole screen shifts left by 8
         | pixels. You only need to repaint a vertical column for this
         | 'coarse' scroll instead of moving the entire screen of
         | characters. This is something that should have been built into
         | the hardware and was very useful on other systems that had that
         | ability (like the NES for example)
         | 
         | With it you can reduce the amount of memory you need to copy
         | every 8 pixels (the 8 pixel part can be done with smooth scroll
         | registers).
         | 
         | There's a thread and example code on github here.
         | https://www.lemon64.com/forum/viewtopic.php?t=70539
         | 
         | Also note it's such a terrible hack on the DRAM that it doesn't
         | work on all C64s and there's a technical discussion about that
         | here. https://www.linusakesson.net/scene/safevsp/index.php
         | 
         | Hardware mod if VSP doesn't work on your C64 and more technical
         | details. http://wiki.icomp.de/wiki/VSP-Fix
         | 
         | Also it makes mention of the C64-Reloaded which is a modern C64
         | product that includes the fix.
        
         | js2 wrote:
         | > This also misses loop unrolling.
         | 
         | It's mentioned at the bottom of the article in the "Thoughts"
         | section:
         | 
         |  _You could certainly use self modifying code and unroll this
         | copy routine to get better performance at the price of
         | flexibility and arguably understanding for the average casual
         | 6502 assembly coder. Again, this was not a "how fast can we
         | absolutely make it" but an everyday use examination._
        
           | hinkley wrote:
           | Duff's device is a fixed size loop unrolling with an ugly
           | hack to make it behave for arbitrary inputs. The assembly
           | makes sense but the C code is rough.
           | 
           | It's not quite as fast as self-modifying or custom compiled
           | code, but it's pretty close.
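The idea behind Duff's device, a fixed 8x unrolled body that still handles arbitrary counts, can be sketched in Python. Python has no fall-through switch or computed jump, so this version handles the leftover `count % 8` transfers in a separate loop instead of jumping into the middle of the unrolled body; the name `copy_unrolled8` is an assumption for the example.

```python
def copy_unrolled8(src, dst, count):
    """Copy count items with an 8x unrolled body, handling the remainder first."""
    i = 0
    # Leftover transfers: Duff's device performs these by entering the
    # unrolled body part-way through; here they get their own short loop.
    for _ in range(count % 8):
        dst[i] = src[i]
        i += 1
    # Main loop: one loop test per 8 transfers.
    for _ in range(count // 8):
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        dst[i + 2] = src[i + 2]
        dst[i + 3] = src[i + 3]
        dst[i + 4] = src[i + 4]
        dst[i + 5] = src[i + 5]
        dst[i + 6] = src[i + 6]
        dst[i + 7] = src[i + 7]
        i += 8


src = list(range(21))
dst = [0] * 21
copy_unrolled8(src, dst, 21)  # 5 leftover transfers, then 2 blocks of 8
print(dst == src)             # True
```

In 6502 assembly the same effect is usually achieved by jumping to a computed entry point inside the unrolled sequence.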
        
         | tialaramex wrote:
         | Tom Duff's device was doing that because he's doing MMIO, you
         | should not [I know you're not suggesting it, but just in case
         | anybody reading thinks it's clever] do this today when you
         | don't want MMIO, your compiler is very capable of just doing an
         | actual copy quickly, so tell it that's what you want, don't
         | write gymnastics like Duff's device.
         | 
         | However, expressing these partially unrolled loops nicely is a
         | nice performance-not-safety feature of WUFFS called "Iterate
         | loops":
         | 
         | https://github.com/google/wuffs/blob/main/doc/note/iterate-l...
         | 
         | Well, I say performance not safety, as always they want both,
         | but you _could_ safely just write the never unrolled case,
         | while the existence of Iterate loops allows you to express a
         | much faster special case but know the compiler will fix things
         | up properly no matter what.
        
           | vardump wrote:
           | We're talking about C64 (and maybe Amiga).
           | 
           | The compiler is not going to do anything for you on
           | those retro platforms.
        
             | tialaramex wrote:
             | Aw, just needs a better compiler (with a 6502 target) :D
             | 
             | Jason Turner's CppCon 2021 talk, "Your New Mental Model of
             | constexpr" has half the presentation as a C64 program
             | (though for practical reasons not actually running on a C64
             | but instead an emulator) because _most_ of the heavy
             | lifting is done by the C++ 20 compiler.
             | https://youtu.be/MdrfPSUtMVM?t=1422
             | 
             | Now, Jason's approach is not going to beat hand-crafted
             | 6502 machine code _in a fair fight_ but he often doesn't
             | need to fight fair and that's the point of his talk.
        
         | localhost wrote:
         | The C64 VIC-II chip would grab the address bus from the CPU
         | every 8 scan lines on the screen. Some of the early "fast load"
         | cartridges like the Epyx FastLoad cartridge that would
         | accelerate loading games from the floppy drive would blank the
         | entire screen during load so that their async data transfer
         | routines wouldn't get interrupted by the VIC-II chip grabbing
         | the bus. I wrote a similar (better?) cartridge where I would
         | need to use the register on the VIC-II chip that reported the
         | scan line as a sync marker to transfer 3 bytes asynchronously
         | from the 1541 down the clock and data lines of the serial bus.
         | Good times.
        
           | MarkusWandel wrote:
           | In my recollection Epyx Fastload did not blank the screen,
           | though some earlier fast loaders did.
           | 
           | I also remember the software voice synthesizer "SAM" needing
           | to blank the screen to render glitch-free sampled audio. Then
           | along came, what was it "Impossible Mission" ("Another
           | visitor! Stay a while...") doing pretty clean sampled audio
           | with the screen on. Not that the C64 SID chip was even
           | remotely intended to be able to play sampled audio in the
           | first place!
           | 
           | The Amiga was unimaginably powerful by comparison. Even a
           | basic configuration had 8x the memory a C64 had, and it had
           | all those fancy DMA toys to offload the CPU.
        
             | jimsmart wrote:
             | > Not that the C64 SID chip was even remotely intended to
             | be able to play sampled audio in the first place!
             | 
             | I don't recall how SAM did it, but sample playing on the
             | C64 SID chip was indeed a nice trick -- it was actually
             | done by modulating the main output volume, which made a
             | slight click when changing.
             | 
             | Eventually this got used by some of the C64 musicians /
             | music player libs, so one could play a channel of samples
             | as well as the three regular synth channels on the SID.
             | IIRC, Outrun used this particularly well in its title
             | screen and/or loading music, having some vocal samples
             | "O-O-Outrun!" (and skidding sound effects) as well as
             | sampled drums.
             | 
             | Annoyingly, IIRC, some revisions of the SID chip behaved
             | slightly differently, and had louder or softer sample
             | playback when this hack was used. But still: clever stuff.
        
               | weinzierl wrote:
               | ... and main output volume had only 16 levels so the
               | samples were 4 bit quantized. It is a wonder that we
               | could get understandable vocal samples with this hack at
               | all. I distinctly remember "Goal!" from Peter Shilton's
               | Football and "Accolade presents" from Test Drive [1]. In
               | the examples one can hear the amount of quantization
               | noise that low bit depth caused.
               | 
               | [1] https://m.youtube.com/watch?v=L1u-WydiiCI
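The 4-bit quantization mentioned above is easy to illustrate. This Python sketch assumes unsigned 8-bit input samples and simply keeps the top four bits, which is one straightforward way to map them onto 16 volume levels; real players may have scaled or dithered differently, and both function names are made up for the example.

```python
def to_volume_levels(samples_8bit):
    """Map unsigned 8-bit samples (0..255) to 4-bit volume levels (0..15)."""
    return [s >> 4 for s in samples_8bit]


def quantization_error(samples_8bit):
    """Difference between each sample and its 4-bit level scaled back to 8 bits."""
    return [s - ((s >> 4) << 4) for s in samples_8bit]


samples = [0, 15, 16, 128, 255]
print(to_volume_levels(samples))    # [0, 0, 1, 8, 15]
print(quantization_error(samples))  # [0, 15, 0, 0, 15]
```

Each level covers 16 input values, so up to 15 steps of the original 8-bit resolution are lost per sample, which is heard as quantization noise.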
        
               | vardump wrote:
               | You can actually get about 6-7 bits resolution out of
               | same SID volume register. 4 bits from the volume
               | register, channel 3 disable and 3 filter bits. Requires
               | some setup to get SID in a particular state first.
               | 
               | For details, see: https://livet.se/mahoney/c64-files/Musi
               | k_RunStop_Technical_D...
        
         | le-mark wrote:
         | Crypto mining with fpga was in the weeds down this path, or 100
         | Gbps signal processing. Two examples where low level stuff is
         | still relevant, just not commodity and widely available like
         | the 8 bit micros were.
        
           | aerique wrote:
           | I also read a high frequency trading blog that was posted
           | here a few years ago. Same thing: hacking hardware and
           | software so the first bytes of info could be grabbed from a
           | stream and acted upon, instead of needing to wait for the
           | whole package to have come in.
           | 
           | Also when I was in the demo scene on the Atari ST one had to
           | do specific timings in the assembly code to be able to draw
           | outside the screen's borders (so the borders were on screen
           | but couldn't be drawn on by code).
        
         | Psyladine wrote:
         | >I kinda miss those days, and I kinda don't. I guess it was
         | good to have experienced them.
         | 
         | There's a certain reflective quality and even satisfaction from
         | using a chainsaw after coming up using a tree saw by hand. It
         | feels progressive, even if it is just optimizing for time at
         | the expense of energy.
        
         | 6510 wrote:
         | I eventually managed to do scroll texts fast enough to update
         | them while the pixels of the scroll text were printed to the
         | screen. It was even fast enough to have char combinations in
         | the scroll text that modified its speed and direction with
         | speeds like "one time scroll 1 pixel the next 2". One of the
         | tricks was to do "poor" timing with NOP's by requiring an empty
         | row of bits between each scroll text. (The text becomes
         | unreadable anyway if there is no space between the lines)
        
         | vidarh wrote:
         | Most C64 games used character-based graphics (coupled with the
         | smooth scrolling support in the VIC) which meant you'd at most
         | move 2000 bytes to scroll the entire screen every 4 to 8 pixels
         | scrolled.
         | 
         | You can easily scroll the entire screen on a C64 if that's all
         | you're doing.
         | 
         | Some games did also scroll bitmaps. There the naive version
         | requires moving 9000 bytes (40x25x8 for the bitmap data, 40x25
         | for the colour data) every time you need to scroll, and that
         | indeed starts to bite. There are games which reduce this cost
         | using a trick called AGSP ("Any Given Screen Position").
         | 
         | But you're right static parts of the screen were often larger
         | to reduce the dynamic part. That was rarely down to just
         | scrolling the screen in isolation, though, but because the
         | overall budget of cycles you had to work with was tiny. Often
         | you might also have a lot of other stuff which consumed lots of
         | cycles _affected directly by the size of the playing field_.
         | E.g. if you did sprite multiplexing (moving a sprite after it
         | had been partially or fully rendered to reuse the same hardware
         | sprite), you might well be keeping the CPU busy throughout the
         | full rendering of the playing field.
         | 
         | There was also the consideration of how much effort you wanted
         | to go to in order to avoid glitches, since unless you could do
         | the scrolling entirely while the VIC was rendering the parts of
         | the screen outside the playing field, you'd need to make sure
         | the rendering and copying didn't overlap, and of course just
         | restricting playing field size was an easy workaround for that
         | problem.
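The byte counts quoted above follow directly from the 40x25 character grid. A small Python check, with the constant and function names chosen only for illustration:

```python
COLS, ROWS = 40, 25  # character cells on the C64 screen


def char_mode_bytes():
    """Bytes to move for a character-mode scroll: screen codes plus colour data."""
    return COLS * ROWS + COLS * ROWS


def bitmap_mode_bytes():
    """Bytes to move for a naive bitmap scroll: 8 bytes per cell plus colour data."""
    return COLS * ROWS * 8 + COLS * ROWS


print(char_mode_bytes())    # 2000
print(bitmap_mode_bytes())  # 9000
```

The bitmap case moves 4.5 times as much data per coarse scroll step, which is why it "starts to bite".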
        
           | MarkusWandel wrote:
           | I didn't get that fancy. I got as far as a horizontal smooth
           | scroller, but with the "move the screen memory during one
           | 1/60 second redraw cycle" mentality - racing the redraw, and
           | when it was just about caught up, whoa, time for the static
           | bar at the bottom.
           | 
           | Quite right, one could prepare the moved version in the
           | background during the 7 steps where you're merely diddling
           | the smooth scroll register, and then flip to it in an
           | instant. But wait, was it possible to page flip the colour
           | map? Also, always having the appropriate moved version ready
           | even as the player is doing unpredictable things goes into
           | the "heroic programming techniques" zone again.
           | 
           | As for glitches, it's amazing what can be done if perfection
           | is sacrificed and there were plenty of good games that did
           | have them, e.g. sprite multiplexing. But I did mean
           | "effortless looking perfect" smooth scrolling.
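
The "7 steps of diddling the smooth scroll register, then move memory" pattern can be sketched as splitting a pixel scroll distance into a coarse character column and a fine 0-7 register value. The function names are hypothetical, and the sketch assumes the screen content moves left, so the fine value counts down from 7 to 0 before wrapping.

```python
def split_scroll(pixels_scrolled):
    """Split a leftward scroll distance into (coarse column, fine value)."""
    coarse = pixels_scrolled // 8       # whole characters: needs a memory move
    fine = 7 - (pixels_scrolled % 8)    # 0-7: only a register write
    return coarse, fine


def needs_memory_move(prev_pixels, new_pixels):
    """True when the coarse column changed, i.e. the expensive frame."""
    return split_scroll(prev_pixels)[0] != split_scroll(new_pixels)[0]


for p in range(18):
    coarse, fine = split_scroll(p)
    note = "  <- shift screen memory" if p and needs_memory_move(p - 1, p) else ""
    print(f"pixel {p:2d}: column {coarse}, fine scroll {fine}{note}")
```

Only one frame in eight needs the memory move; the other seven are spare time in which the shifted copy can be prepared.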
        
             | jimsmart wrote:
             | Ex-C64 games coder here: you are correct - no, you
             | couldn't relocate/page-flip the colour map, like you could
             | the character map. So you had to update it all somehow on
             | the required frame.
             | 
             | The fastest technique I saw for updating the colour map in
             | a single go, was to have the whole thing as a huge block of
             | immediate mode load-stores, then one could 'scroll' the
             | data across the LDA instructions within that code, in
             | advance, over n-frames, then call this self-modified code
             | block when one did the character screen flip (immediate
             | load-stores was faster than load-stores from colour ram).
             | e.g.
             | 
             |     scroll_splat_colour:
             |         LDA #$00   # colour data for char
             |         STA $D800  # colour ram
             |         LDA #$00
             |         STA $D801
             |         # etc., for every visible char onscreen in scrolling area
             | 
             | And one would be updating/scrolling those values loaded
             | into the A register, in chunks over previous frames,
             | similar to:
             | 
             |     LDA scroll_splat_colour+6
             |     STA scroll_splat_colour+1
             |     LDA scroll_splat_colour+11
             |     STA scroll_splat_colour+6
             |     # etc., for every lda/sta in the above
             | 
             | Perhaps not the clearest explanation, but hopefully enough
             | to communicate the idea.
             | 
             | FWIW, I didn't invent that technique, it was an improvement
             | Jon Williams made to my code, whilst we both worked for
             | Images Software (now Climax). Not sure where he got it
             | from, maybe he invented it himself, maybe he cribbed it
             | from elsewhere.
             | 
             | Related: I thought sprite multiplexing was awesome, and
             | there were quite a few tricks there too to get it
             | performant. But that's another far more complex topic.
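
The colour-splat idea can be modelled in a few lines of Python. This is a toy simulation, not the original game code: it builds the block of LDA #imm / STA abs pairs as raw bytes, scrolls the immediate operands in place, then "executes" the block while counting cycles. The helper names are hypothetical, and for simplicity the whole area is scrolled as one long row (a real routine would feed new data in at the end of each row).

```python
LDA_IMM, STA_ABS, RTS = 0xA9, 0x8D, 0x60   # real 6502 opcodes
COLOUR_RAM = 0xD800
PAIR_SIZE = 5          # LDA #imm is 2 bytes, STA abs is 3 bytes


def build_splat(cells):
    """Generate one LDA #$00 / STA colour_ram+i pair per cell, then RTS."""
    code = bytearray()
    for i in range(cells):
        addr = COLOUR_RAM + i
        code += bytes([LDA_IMM, 0x00, STA_ABS, addr & 0xFF, addr >> 8])
    code.append(RTS)
    return code


def scroll_operands_left(code, cells):
    """Copy each LDA operand from the next pair (offset +6 into offset +1, ...)."""
    for i in range(cells - 1):
        code[PAIR_SIZE * i + 1] = code[PAIR_SIZE * (i + 1) + 1]


def run_splat(code, memory):
    """Interpret the generated block; return the 6502 cycles it would take."""
    pc = a = cycles = 0
    while code[pc] != RTS:
        if code[pc] == LDA_IMM:
            a = code[pc + 1]
            pc += 2
            cycles += 2            # LDA immediate: 2 cycles
        else:                      # STA absolute: 4 cycles
            memory[code[pc + 1] | (code[pc + 2] << 8)] = a
            pc += 3
            cycles += 4
    return cycles + 6              # RTS: 6 cycles


code = build_splat(1000)
for i in range(1000):
    code[PAIR_SIZE * i + 1] = i % 16     # made-up colour data
scroll_operands_left(code, 1000)
ram = bytearray(0x10000)
cycles = run_splat(code, ram)
print(len(code), "bytes of code,", cycles, "cycles")
```

At 6 cycles per cell the splat beats an unrolled LDA abs / STA abs copy (8 cycles per cell), at the price of roughly 5 bytes of code per cell.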
        
               | vidarh wrote:
               | Another "obvious" trick is to narrow the playing field
               | but animate the rest, and then save cycles by a
               | combination of bands that require fewer updates and
               | sprites. E.g. Pole Position is a classic example, where
               | the graphics cover most of the screen, but only about
               | half actually has gameplay. The rest consists of a very
               | narrow band of mountains, and a couple of bands of
               | clouds. I haven't looked at what they did for Pole
               | Position, but that pattern of the actual gameplay being
               | constrained to a much smaller portion of the screen than
               | what looks like the playing area is pretty common.
        
               | vardump wrote:
               | How to handle the case when player changes direction to
               | exact opposite immediately after the frame color data was
               | transferred? Double buffering splatting code? Although
               | one copy of it is 5001 bytes, ouch.
        
               | jimsmart wrote:
               | You generally don't handle that case! :) -- Instead you
               | let the player move within a rectangular area onscreen,
               | and decide on which way you are going to scroll the
               | screen in advance (or rather: after the fact, depending
               | on how one looks at things), based upon where the player
               | is inside that rectangle. So the screen catches up with
               | where the player is moving/pushing.
               | 
               | Eight-way scrolling like this was always a massive pain
               | on the C64 (and other systems that used buffered
               | scrolling with no h/w, e.g. Atari ST), but that way (a
               | box the player moved around inside) was the only
               | realistic way of handling it if you had to do a bunch of
               | work in advance before doing the actual scrolling. Turns
               | out that having the player in a loose rectangle is also
               | easier on the eye too, which is perhaps why it's also
               | used on systems that don't suffer the same h/w
               | restrictions.
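
The "loose rectangle" approach described above can be sketched as a dead-zone camera: the scroll direction is decided from where the player sits inside a box, so the screen only starts to follow once the player pushes against an edge. The function name, box edges and speed are made-up illustrative values, and only one axis is shown.

```python
def camera_step(camera_x, player_x, box_left=120, box_right=200, speed=1):
    """Return the new camera x for one frame.

    camera_x and player_x are world pixels; box_left/box_right are the
    screen-relative edges of the free-movement rectangle (hypothetical).
    """
    screen_x = player_x - camera_x
    if screen_x > box_right:
        return camera_x + speed      # player pushes right: scroll right
    if screen_x < box_left:
        return camera_x - speed      # player pushes left: scroll left
    return camera_x                  # inside the box: no scrolling needed


camera = 0
for player in (150, 190, 205, 210, 215):
    camera = camera_step(camera, player)
    print(f"player at {player}, camera at {camera}")
```

Because the scroll decision lags the player, the code gets advance notice of the direction and can spread the preparation work over earlier frames.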
               | 
               | Yeah, the colour RAM update was a lot of bytes to move.
               | But dedicating a big chunk of code to it meant one could
               | be a little freer to use slightly slower techniques
               | elsewhere in the update cycle. Side note: the C64
               | actually only had 39 visible char across the screen when
               | in 'scrolling' mode, because the borders were shrunk in
               | slightly (and slightly more than one expects). So one
               | less char to worry about per line. That saved a tiny
               | amount of code / memory / execution-time for the colour
               | splat (and the scrolling of partial chunks - whether on
               | back buffers, or the data within the colour splat code -
               | over the other frames). Sure, it's only one less
               | character. But it saved some cycles. And cycles mattered!
               | Particularly when doing something with that much data to
               | move / that took that much time.
        
               | vardump wrote:
               | > C64 actually only had 39 visible char across the screen
               | 
               | But 40 color cells were still visible, unless horizontal
               | scroll register was 0.
        
               | jimsmart wrote:
               | No, but that's an easy enough mistake to make :) -- It's
               | called 38-column mode, and when enabled the VIC shrinks
               | both borders in by 8 pixels, and then offsets the screen
               | according to the x-scroll register bits.
               | 
               | [Edit: another source says it's actually 7-pixels hidden
               | on the left, and 9 on the right. But whatever: same
               | principle, the screen is shrunk by 16 pixels in total
               | horizontally]
               | 
               | Which meant that at most only 39 characters were visible
               | across the screen -- with two of those, one at each end
               | of the row, being partially visible -- and that applies
               | to both the character screen and its associated colour
               | RAM. Only 38 characters were visible when the x scroll
               | register was zero, and as soon as one shifted to a value
               | of 1-7, the 39th column became visible (and the 1st one
               | became partially offscreen). But the 40th column is never
               | visible when in that mode.
               | 
               | For more info see:
               | 
               | http://www.devili.iki.fi/Computers/Commodore/C64/Programm
               | ers...
               | 
               | "When scrolling in the X direction, it is necessary to
               | place the VIC-II chip into 38 column mode. This gives new
               | data a place to scroll from. When scrolling LEFT, the new
               | data should be placed on the right. When scrolling RIGHT
               | the new data should be placed on the left. Please note
               | that there are still 40 columns to screen memory, but
               | only 38 are visible."
               | 
               | -- But it's discussed on a handful of other pages too, if
               | you google.
        
               | vardump wrote:
               | Oh damn... and I did a fair bit of coding on C64 back in
               | the day. :-D
               | 
               | Somehow I thought it hid 4 pixels both sides. Totally
               | wrong.
               | 
               | PS. Then it's so unfair bad line still takes 40 cycles!
        
               | jimsmart wrote:
               | Please stop giving the above comment downvotes because of
               | this person's lack of knowledge: we all have to learn
               | things -- there was once a time I didn't know this
               | either.
               | 
               | It's not like vardump here was being a dick about
               | anything in their comment, cut them some slack!
        
               | vardump wrote:
               | Thanks. Although I really should have known better, wrote
               | scrolling routines 35 years ago.
               | 
               | It's scary time can corrupt memories we consider as
               | facts.
        
               | Luc wrote:
               | I made an 8-way full-screen scroller.
               | 
               | To avoid situations like this the player sprite at the
               | center of the screen had momentum, i.e. the sprite had to
               | rotate 180 degrees to change to the opposite direction,
               | giving a few frames time to set everything up.
        
               | jimsmart wrote:
               | That's a nice little trick, cheers for sharing. (Not that
               | I'll get a chance to use it these days, but still)
        
               | 6510 wrote:
               | > # etc., for every visible char onscreen in scrolling
               | area
               | 
               | For every changed char. (which is sometimes more and
               | sometimes less)
               | 
               | You could do them in order but if you're using only a few
               | characters you need only 1 LDA for each char. (How to do
               | this is left as a creative exercise for the reader)
        
               | jimsmart wrote:
               | But the overheads of tracking which characters might have
               | been changed here completely outweighed simply scrolling
               | / updating the whole thing. The code becomes too involved
               | in tracking changes, and fudging about with- / rewriting-
               | the splat code.
               | 
               | You can leave it as 'a creative exercise for the reader',
               | but that's because you can't solve this for the generic
               | case (i.e. any map the graphics artists might give you)
               | in fewer cycles than simply dealing with each and every
               | character, which is the worst case.
               | 
               | Processing that many bytes, and doing comparisons and
               | extra branches, simply becomes overheads, and, very
               | quickly, your code is slower than simply updating /
               | scrolling everything simply.
               | 
               | For the colour splat routine, having a giant, pre-
               | assembled block of immediate-mode load store pairs for
               | every character is as optimal as it gets -- and handles
               | all cases -- on the C64, you only have a frame to update
               | the colour RAM (because it cannot be relocated/paged),
               | and you are generally chasing the scan beam to move that
               | much data before the next frame.
               | 
               | You don't have the luxury of having extra cycles to re-
               | write that block of code at runtime, and rewrite the code
               | that scrolls the data within that code, nor do you have
               | the luxury of having enough spare cycles to be comparing
               | data, and branching conditionally depending on if it has
               | changed or not.
               | 
               | Perhaps you misunderstand the technique I describe, or
               | perhaps you under-estimate the overheads required to
               | perform what you describe. Or perhaps both.
        
               | egypturnash wrote:
               | That's... that's horrible. Beautiful, but horrible.
               | 
               | Which kinda describes any advanced c64 technique, really.
        
               | jimsmart wrote:
               | Indeed, I totally agree on all points :)
        
               | justinlloyd wrote:
               | Yeah, compiled graphics and compiled colour tables, also,
               | a routine that could self-modify code in regions of RAM
               | to do the colour table writes. A slow set-up function at
               | level start would build the code to be JSR'd later in the
               | level. We did that on a few games on the C64 and the
               | Speccy and Beeb and Atari. Later used the same techniques
               | in DOS on PC. And of course, doing the same tricks but
               | with D0 through D7 and A0 through A6 on Atari ST and
               | Amiga. Also doing "stuff" in zero page because the
               | address loads were shorter. And avoiding 256-byte page
               | boundaries where possible because of the cycle penalty.
        
               | jimsmart wrote:
               | > A slow set-up function at level start would build the
               | code
               | 
               | Interesting, and good thinking :)
               | 
               | IIRC, when we used this technique on the C64, we didn't
               | build the code during init at runtime, we actually built
               | the code in the dev environment, using macros, so it got
               | built at assembly/compile time. So we skipped the small
               | time hit at runtime init, at the expense of a slightly
               | longer load time for the user (and a tiny bit longer on
               | our assembly/compile times, although that was fairly
               | negligible cos we were building on PCs).
        
           | jimsmart wrote:
           | Ex C64-games coder here! -- If your sprite multiplexer was
           | taking most of the CPU during the screen draw time, then
           | honestly it was not a particularly great multiplexer! ;)
           | 
           | Most decent multiplexers took just a scanline or two/three,
           | multiple times down the screen (i.e. whenever relocating any
           | already drawn sprites) -- often with decent sized gaps (time
           | when the CPU wasn't involved in manipulating sprites and
           | could do other things), with a larger chunk during the
           | offscreen period / at the bottom of the screen, when one was
           | prepping the data (mostly sorting the sprites' y-coords) for
           | the next frame's screen draw.
           | 
           | -- During debugging/etc, we'd often enable colour changes to
           | the screen border, at the beginning and end of the
           | multiplexer code (for both the interrupt stuff in the
           | playfield, and the non-playfield section), so we could
           | visually see how it was working/performing.
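
The off-screen preparation step of a multiplexer (sort by y, then decide which hardware sprite each logical sprite reuses) can be sketched as below. This is a simplified model, not any particular game's multiplexer: slots are handed out round-robin, and a sprite is marked as not displayable if its hardware slot is still busy drawing an earlier sprite. The function name is hypothetical.

```python
SPRITE_HEIGHT = 21   # C64 hardware sprites are 21 pixels tall
HW_SPRITES = 8       # the VIC-II has 8 hardware sprites


def plan_multiplex(y_coords):
    """Return a list of (y, hardware_slot, displayable) in drawing order."""
    plan = []
    last_y = [None] * HW_SPRITES          # y of the last sprite put in each slot
    for n, y in enumerate(sorted(y_coords)):
        slot = n % HW_SPRITES
        free = last_y[slot] is None or y >= last_y[slot] + SPRITE_HEIGHT
        if free:
            last_y[slot] = y              # slot is reused from this y onward
        plan.append((y, slot, free))
    return plan


for y, slot, ok in plan_multiplex([50, 52, 60, 61, 70, 75, 80, 81, 90, 120]):
    print(f"y={y:3d} -> hardware sprite {slot} {'ok' if ok else 'DROPPED'}")
```

Each reuse in the plan corresponds to one short burst of register writes partway down the screen, which is why the per-frame CPU cost can stay at a few scanlines.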
        
             | vidarh wrote:
             | Sure, the "nice" way of doing it is to rely on the raster
             | interrupt. But I've also seen way too much C64 code where
             | pretty much everything ran in the interrupt handler, with
             | associated stupid busy waiting because it saved people from
             | having to synchronise. I'd guess more commonly for cheap
             | and cheerful ports from less capable machines, but it's
             | been a couple of decades since I've actually looked at any
             | of this code.
        
             | cesaref wrote:
             | 64 coder here too!
             | 
             | The border changing thing has just reminded me how bad the
             | development process was using the commodore assembler with
             | a 1541 drive which was horribly slow. assemble, dump image,
             | reboot, crash, reboot, load assembler, try and work out
             | what had happened :)
             | 
             | At some point I ended up with a PC running a system called,
             | I think PDS, which was a cross assembler with dongle to
             | push the image straight into the memory of the C64. I even
             | think you could inspect and change memory on the running
             | machine - it was amazing!
        
               | jimsmart wrote:
               | Yeah, we all used PDS too, although not originally.
               | Pretty good system, particularly for that era, and
               | cost/capability-wise (though they weren't that cheap, and
               | folk eventually started cloning the boards for them,
               | IIRC).
               | 
               | I remember it was annoying to have only 8 main source
               | files in PDS though, most big projects went past the 8
               | files of however many kb (although it could also handle
               | include files, which was how one got around that limit).
               | 
               | Although when I actually started out as a C64 games dev,
               | my dev system was a BBC Micro B, linked to a C64. Not
               | quite as cool as PDS, but it could assemble code 2x the
               | speed of the C64 (the processor clocked twice the speed
               | on the Beeb), and it was great having a separate 'host'
               | system for development.
        
               | jimsmart wrote:
               | Here's a link to info about the PDS kit, in case anyone
               | is interested:
               | 
               | https://www.cpcwiki.eu/index.php/PDS_development_system
        
           | mgkimsal wrote:
           | Just watched a video of C64 "Seven Cities of Gold" with a
           | colleague yesterday, trying to convey just how... exciting
           | that was in 1984. Watching on YouTube, I had forgotten just
           | how small the playing 'viewport' was. It seems like possibly
           | a more extreme example - I don't remember too many other
           | games having an action viewport that small.
        
       | kken wrote:
       | This doesn't even cover all the neat assembly tricks with self-
       | modifying code that you would actually use on a 6502 to speed up
       | memory transfer.
        
         | MatthiasWandel wrote:
         | For games that scrolled the screen, those had to happen
         | essentially between scans, so a lot of tricks were employed.
         | Fixed addresses in the code, unrolled loops, and self modifying
         | code to avoid the expensive zero page indirect indexed
         | addressing mode (the slowest instruction on the CPU). The other
         | trick was to start moving the first line of screen just after
         | it got displayed, which would give you nearly two jiffies to do
         | it before the scan caught up to you on the next frame.
        
           | vardump wrote:
           | No need, it can easily happen during the scan. As long as the
           | scan and update memory location never meet, there's
           | absolutely no problem.
        
           | natly wrote:
           | It's crazy how much work went into those old games. I have a
           | feeling those programmers weren't even paid that well
           | considering how few people owned computers back then (so the
           | market can't have been large).
        
             | wkearney99 wrote:
             | If you ever play(ed) the Atari 2600 version of River Raid,
             | you got to witness some SERIOUS tweaking to work around the
             | limits of that console. Every scanline processed on the fly
             | during the vertical blanking interval. No screen buffer.
             | The animation was soooo smooth.
        
             | kabdib wrote:
             | My first job out of college at Atari in 1982, writing game
             | cartridges for the 400/800 computers, paid $25K a year. My
             | first raise after a year was to $30K.
             | 
             | There were programmers in other divisions making royalties
             | off of their games. Tod Frye famously got $700K or so for
             | his terrible version of 2600 Pac-Man (it was terrible not
             | because he was a bad programmer, but because marketing
             | decided that 2K of ROM had to be enough, and he was smart
             | enough to pull off a miracle . . . of sorts).
             | 
             | Also, the OP apparently doesn't know how to unroll loops,
             | which is the first thing you do to your game's hot spots.
             | (Never had to resort to self-modifying code).
        
         | vikingerik wrote:
         | I did this in a homebrew Atari 2600 game. For a Space Invaders
         | grid of sprites. Each is triggered by writing to a register, as
         | the electron beam scans through to display each sprite.
         | 
         | The interval between sprites on the same scanline is 3 cpu
         | cycles. That's a single 6502 instruction, the write to that
         | register. How do you do any kind of load or compare instruction
         | along with that to decide whether to display that sprite?
         | 
         | The answer was to copy that stream of instructions to RAM ahead
         | of time, and replace each write to a missing invader with a no-
         | op. The code is here if anyone wants to see (the "inv3" demo):
         | http://dos486.com/atari/
        
           | krallja wrote:
           | > copy that stream of instructions to RAM ahead of time
           | 
           | Even this is easier said than done: there are only 128 bytes
           | of RAM in the entire machine, and that has to suffice for
           | global variables and stack memory in addition to storing
           | modified code like this!
        
         | rasz wrote:
         | Afaik it's <120KB/s with all the tricks. 6502 was hand designed
         | and brain optimized for clever use of available silicon real-
         | estate, roughly 20% of CPU bus cycles are dead/bogus/useless.
         | RTS wastes 3 of its 6 cycles, RTI 2 of 6 wasted, JSR 1 of 6
         | wasted, all increments at least 1 cycle wasted etc. Sad to
         | think state machine handling DMA transfers in REU is probably
         | less than 50 macrocells, and Commodore ran its own fab, they
         | could have built REU DMA into the C128 and it would cost cents.
        
           | mywittyname wrote:
           | Is there a way to make a compatible 6502 variant that doesn't
           | have this waste?
        
             | krallja wrote:
             | "The 100 MHz 6502" does a different clever thing - it
             | copies all the dedicated RAM and ROM into its own FPGA
             | copy. Then it can perform 7 to 25 instructions before the
             | next external read/write cycle!
             | 
             | http://www.e-basteln.de/computing/65f02/65f02/
        
             | rasz wrote:
             | https://en.wikipedia.org/wiki/CSG_65CE02#Pipeline_improveme
             | n... fixed most painful ones, but afaik not all dead
             | cycles. But it was 1988 and Commodore didn't bother putting
             | it into anything other than some IO card for the AMIGA, not
             | to mention it still did nothing to cover slowness of moving
             | data around. Japanese decided to do something about it for
             | TurboGrafx-16 in 1987 Hu6502
             | http://shu.emuunlim.com/download/pcedocs/pce_cpu.html
             | 
             | Transfer Alternate Increment (TAI), Transfer Increment
             | Alternate (TIA), Transfer Decrement Decrement (TDD),
             | Transfer Increment Increment (TII) - pretty much x86 'rep
             | movsb', except not great at 6 cycles per byte (~160KB/s).
             | For contrast 5 years older 80286 already did 'rep movsw' at
             | 2 cycles per byte. 6 years later Pentium did 'rep movsd' at
             | 4 bytes per cycle. Nowadays Cannonlake can do 'rep movsb'
             | full cachelines at a time at full cache/memory controller
             | speed.
        
               | JPLeRouzic wrote:
               | I think there are tricks to rewrite the microcode on
               | Pentium, do similar tricks exist for 80286, 386 or 68K?
               | 
               | It would be fun to reconfigure one as a high speed 6502.
        
       | cmrdporcupine wrote:
       | The 65816's MVP/MVN opcodes can do bulk transfers a teeny bit
       | faster.
        
         | lscharen wrote:
         | For more 16-bit 65816 context -- other than for space-savings,
         | these instructions are never used when performance is needed
         | due to the low effective throughput of 7 cycles per byte. A
         | basic unrolled loop using 16-bit instructions is 20 - 30%
         | faster and specialized graphics routines that are able to use
         | the stack can approach 3 cycles per byte using the PEA and PEI
         | instructions.
        
           | cmrdporcupine wrote:
           | I'll defer to you I guess, as you seem to know more about
           | this than me. The only thing is searching through the
           | 6502.org forums I don't see a consensus on this? Plenty of
           | people talking about the advantages of MVN/MVP for bulk
           | transfers. I seem to recall doing the cycle counting myself
           | at one point, too, and finding it advantageous.
           | 
           | One neat trick (I remember reading about from Alan Cox I
           | believe) if you have control over the hardware is to memory
           | map I/O devices like serial input / output such that
           | incrementing addresses starting at a given address all point
           | to the same physical device/register. E.g. allocate 256
           | contiguous bytes in your memory map to point to the same
           | thing. This way you can do bulk I/O transfers to/from memory
           | using MVP/MVN instead of "get a byte, put a byte" instruction
           | by instruction.
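
The mirrored-port trick can be modelled with a toy bus: every address inside a window decodes to the same device register, so an incrementing block move reads a fresh byte from the device on each step. The class, window address and size below are hypothetical, and the model ignores 65816 banks entirely.

```python
class Bus:
    """64K address space with one device register mirrored across a window."""

    def __init__(self, port_base, port_span, device_bytes):
        self.ram = bytearray(0x10000)
        self.port_base = port_base
        self.port_span = port_span
        self.stream = iter(device_bytes)    # bytes the device will hand out

    def read(self, addr):
        if self.port_base <= addr < self.port_base + self.port_span:
            return next(self.stream)        # low address bits are ignored
        return self.ram[addr]

    def write(self, addr, value):
        self.ram[addr] = value


def block_move(bus, src, dst, count):
    """Model of an incrementing block move: source and destination both advance."""
    for i in range(count):
        bus.write(dst + i, bus.read(src + i))


bus = Bus(port_base=0xC000, port_span=256, device_bytes=b"sector data")
block_move(bus, src=0xC000, dst=0x2000, count=11)
print(bytes(bus.ram[0x2000:0x200B]))
```

The window size caps how many bytes one block move can fetch, so a 256-byte window as in the comment means one block move per 256 bytes transferred.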
        
             | rasz wrote:
             | The trick you describe was being used by Silicon Valley
             | Computer ADP50L IDE controller from early nineties (1991).
             | Memory mapped I/O instead of traditional x86 port access
             | lets you replace a manual loop with 'rep movsb'; the result
             | can be a 50% speed bump
             | 
             | https://forum.vcfed.org/index.php?threads/performance-of-
             | lo-...
             | 
             | Port IO Read Speed : 219.39 KB/s
             | 
             | MMIO Read Speed : 310.77 KB/s
             | 
             | Some variants of XTIDE hardware also implement this, as
             | does the free bios.
        
             | ksherlock wrote:
             | MVP/MVN are 7 cycles per byte.
             | 
             | If you're moving memory around in bank 0 (or have memory
             | mapping), you can use the direct page register to
             | read/write anywhere in bank 0 and the stack to read/write
             | anywhere in bank 0.
             | 
             | 16-bit LDA dp, PHA is 4 + 4 = 8 cycles or 4 cycles per
             | byte. Best case would be if you know it's constant data
             | before hand, eg, LDA #0, PHA, PHA .... 2 cycles per byte!
             | 
             | For general purpose copying MVP and MVN are easier and have
             | better code density.
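
The cycle counts quoted above translate into throughput as follows. The per-byte costs come from the comments; the 1 MHz clock is an assumption chosen only to make the comparison easy, and the function name is hypothetical.

```python
def bytes_per_second(clock_hz, cycles_per_byte):
    """Peak copy rate, ignoring loop overhead and any wait states."""
    return clock_hz / cycles_per_byte


CLOCK_HZ = 1_000_000    # assumption: 1 MHz, for easy comparison

methods = {
    "MVN/MVP block move": 7,
    "16-bit LDA dp + PHA": (4 + 4) / 2,           # 8 cycles per 2 bytes
    "PHA of a preloaded 16-bit constant": 4 / 2,  # 4 cycles per 2 bytes
}

for name, cycles_per_byte in methods.items():
    rate = bytes_per_second(CLOCK_HZ, cycles_per_byte)
    print(f"{name}: {cycles_per_byte:g} cycles/byte, {rate / 1000:.0f} kB/s")
```

Scaling to another clock speed is just a matter of changing CLOCK_HZ; the ranking of the methods stays the same.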
        
               | mmphosis wrote:
               | _2 cycles per byte!_ It takes 4 cycles for PHA to push
               | the 16-bit Accumulator, two bytes, onto the stack.
               | There's also 16-bit PHD, PHX and PHY.
        
             | cmrdporcupine wrote:
             | Ah here it is:
             | http://forum.6502.org/viewtopic.php?f=2&t=5035 referencing
             | a now-lost G+ post from Alan Cox:
             | 
             |  _" The emulator also has a fun hack for disk performance
             | I'm hoping will get replicated in some of the upcoming
             | retro 65C816 board design. Like the 6502 the 65C816 sucks
             | at continually reading from an MMIO port and writing it to
             | sequential memory locations. It sucks less than a 6502
             | because you've got 16bit index registers, but at the same
             | clock it was doing about 100K/second that a Z80 can do 250K
             | (with ini loops). The revised emulated disk interface has
             | the same mmio port replicated across a chunk of address
             | space and this allows a block move instruction (MVN) to do
             | all the work at 6 clocks/byte. At that point the 65C816
             | suddenly jumps to twice as fast as the Z80 on disk I/O."_
        
       | [deleted]
        
       | joosters wrote:
       | On the original ZX Spectrum, you could measure the write
       | bandwidth visually, because on startup it would write the value 2
       | into each byte in memory (which included the graphics RAM). It
       | would then re-read and decrease the value of each byte twice, to
       | check for any faulty memory.
       | 
       | You could see these patterns on-screen as the reads and writes
       | took place (I think it took about a couple of seconds to do this
       | to 48k of RAM)
        
         | becurious wrote:
         | You could change the stack pointer to the top of the area of
         | memory you wanted to fill and then use PUSH to fill at I think
         | 11 clock cycles per two bytes. It was faster than unrolled LDI
         | or LD (HL),A followed by INC HL. It would be filling memory in
         | the wrong direction for a Rainbow processor but you could use
         | it for repeating patterns. I think I did a checkerboard pattern
         | that would shift every frame and it was pretty smooth.
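
The comparison in the comment above can be checked against the standard Z80 timings. The T-state counts are the documented ones; the screen size (6144 bytes of bitmap) and frame length (69888 T-states on a 48K Spectrum) are assumptions added here, and memory contention is ignored.

```python
T_STATES = {"PUSH rr": 11, "LDI": 16, "LD (HL),A": 7, "INC HL": 6}

per_byte = {
    "PUSH (2 bytes each)": T_STATES["PUSH rr"] / 2,
    "unrolled LDI": T_STATES["LDI"],
    "LD (HL),A + INC HL": T_STATES["LD (HL),A"] + T_STATES["INC HL"],
}

SCREEN_BYTES = 6144       # Spectrum bitmap area (assumption)
FRAME_T_STATES = 69888    # one 48K Spectrum frame (assumption)

for name, cost in per_byte.items():
    total = SCREEN_BYTES * cost
    print(f"{name}: {cost:g} T-states/byte, "
          f"{100 * total / FRAME_T_STATES:.0f}% of a frame for the full bitmap")
```

Under these assumptions only the PUSH fill fits a full-bitmap rewrite inside one frame, which fits the report of a smooth per-frame checkerboard.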
        
       ___________________________________________________________________
       (page generated 2022-06-16 23:01 UTC)