[HN Gopher] The 100MHz 6502 (2022)
___________________________________________________________________
The 100MHz 6502 (2022)
Author : throwup238
Score : 151 points
Date : 2024-01-27 21:58 UTC (1 day ago)
(HTM) web link (www.e-basteln.de)
(TXT) w3m dump (www.e-basteln.de)
| pvg wrote:
| Big previous thread from 2021:
|
| https://news.ycombinator.com/item?id=28852857
| dmitrygr wrote:
| There are 300MHz 8051s around in all sorts of unlikely places.
| Many pocket mp3 players used to be based on them.
| Moto7451 wrote:
| Is there an advantage to a 300MHz 8051 vs a Cortex M0? Predated
| their existence?
|
| I know there's a lot of 8051 tooling but I'm only a dabbler in
| microcontrollers and it seems like AVRs and M0/M3s have taken
| the place of PICs and 8051s in hobby world.
| dmitrygr wrote:
| 8051 has no cost to license and if you are mostly using an
| accelerator to decode mp3, then the servicing of it is simple
| enough. Why rewrite the code you already have (from before
| cortex-m0 existed) or redesign the accelerator you already
| have?
| ddingus wrote:
| There are a few:
|
| If you need rapid, real time responses to external signals,
| the faster clocked 8 bitters are excellent! Many chips can
| get into an interrupt service routine in just a handful of
| cycles. In tandem with this, these devices can pack a lot of
| logic into a small amount of code.
|
| From Dallas Semi: Our 1 clock-per-machine-cycle processors
| reached a remarkable performance goal--1 clock-per-machine-
| cycle, currently at 33 million instructions per second
| (MIPS).
|
| From Silicon Labs: The proven 8051 core received a welcome
| second wind when its architecture lost patent protection in
| 1998. [...] The original Intel 8051 core took 12 cycles to
| execute 1 instruction; thus, at 12 MHz, it ran at 1 million
| instructions per second (1 MIP). In contrast, a 100 MHz
| Silicon Labs 8051 core will run at 100 MIPS or 100 times
| faster than the classic 8051 at a frequency that is only
| about 8x the classic 8051's frequency.
|
| That's really fast, when it comes to responding to external
| events!
|
| Say one needs to read an incoming data stream, or control
| something moving at a high rate of speed. Both of these tasks
| depend on a device that can sense and respond in as close to
| real time as things get.
|
| Large, production proven, time tested code bodies. 6502,
| 8051, z80, etc... all have a ton of library code that's not
| difficult to understand and make use of.
|
| Often, these 8 bit designs can run on crazy low power, or
| operate very efficiently at full speed.
|
| Licensing isn't generally an issue. Adding a well documented
| and production proven 8 bit core to a specialized design
| works pretty well! Often, the custom hardware on chip does
| the heavy lifting, leaving UX and control tasks, both of
| which are easy and lean enough for 8 bits of CPU to make
| sense.
|
| The last thing I would put here is subjective: ease of
| development can be an advantage, but it depends on the
| developer. Once someone has bootstrapped themselves onto 8
| bit computing, the constraints on development both limit
| possible application scope and with that limit comes ease of
| development. When used to their strengths, simple chips like
| these are easy to develop for. It's possible for one person
| to completely understand a device and with that understanding
| fully exploit it.
| nullc wrote:
| I wonder what the limit of computing power per joule is
| with current technology, assuming you were freely able to
| change the architecture.
|
| For example, perhaps you wouldn't use an integer
| instruction pointer, because a full adder to increment it
| is expensive. Instead you could use a LFSR where each
| increment requires only a couple xor gates and some wires.
| But it would mean that your code would have to be scattered
| in memory in a funny order. No problem for a smart
| assembler.
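The LFSR "program counter" idea can be sketched in a few lines. A Galois LFSR increment is just a shift plus a conditional XOR with a tap mask; 0xB400 below is the textbook maximal-length 16-bit example, chosen for illustration rather than taken from any real chip:

```python
def lfsr_step(state, taps=0xB400):
    """One 'increment' of a 16-bit Galois LFSR: a shift plus a
    conditional XOR, far cheaper in gates than a 16-bit adder."""
    lsb = state & 1
    state >>= 1
    return state ^ (taps if lsb else 0)

# A smart assembler would place instruction i at address seq[i],
# so code really is "scattered in memory in a funny order":
state, seq = 1, []
for _ in range(8):
    seq.append(state)
    state = lfsr_step(state)
print(seq[:4])   # [1, 46080, 23040, 11520]
```

With these taps the counter visits all 65535 nonzero 16-bit values before repeating, so no addresses are wasted.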
|
| How much computing power could you drive from a device
| powered by nothing other than ambient RF?
| dmitrygr wrote:
| The thing you describe about replacing PC with an LFSR
| has actually been done, to simplify the silicon. Some
| very cheap 4-bit micro controllers, often used for TV
| Remotes, in fact, do this.
| ddingus wrote:
| I think of this too. And I break it down into ads and
| operations. An ad is simply summing two inputs of some
| kind. In operation, might be a change of state or an
| input coming online or going away. Something analogous to
| the bit operations and or not exclusive or and friends.
|
| With all the physics talk of information being
| fundamental, I suspect we will find both an upper bound
| and a lower bound.
|
| The upper bound will be something like compute power per
| volume divided by energy or some similar construct.
| Basically you can only pack so much information and so
| many operations into a given region of space and energy
| level possible for that energy for that region of space
| to contain.
|
| The lower bound might be something like the Planck
| constant for computation. What's the smallest unit of
| space and energy level that can support an add, for
| example. It's interesting to think about.
|
| Sorry for the typos I used voice dictation on this one
| crotchfire wrote:
| ... and adding an integer to a pointer would be
| hellaciously expensive.
|
| If you have the foundry SPICE models you can calculate
| these kinds of lower-limit values. I did this for 45nm a
| long time ago and vaguely recall getting numbers down in
| the double-digit femtojoule range for 32-bit addition,
| measuring only transistor R+C.
|
| But the transistors don't really cost anything; all the
| CV^2 is in charging and discharging the wires. And the
| "C" is totally geometry-dependent. It's not like software
| -- at least not when you're pushing all the limits --
| everything affects everything else.
| nullc wrote:
| > ... and adding an integer to a pointer would be
| hellaciously expensive.
|
| Fair, though if it were only the PC and instruction
| memory that were permuted that isn't much of an issue.
|
| It's not _that bad_: the circuit looks more like a
| multiplier than an adder (search terms would be
| LFSR fast-forward or jump-ahead).
|
| PC is updated presumably on every cycle, while adding an
| integer to it is probably a rare operation (just don't
| use computed jumps...).
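On the jump-ahead point: one LFSR step is a linear map over GF(2), so advancing n steps means applying the n-th power of the step matrix, evaluated by square-and-multiply; that structure is why the circuit resembles a multiplier. A hedged Python sketch (tap mask 0xB400 is again just the textbook 16-bit example):

```python
def lfsr_step(state, taps=0xB400):
    lsb = state & 1
    return (state >> 1) ^ (taps if lsb else 0)

def jump_ahead(state, n, taps=0xB400, width=16):
    """Advance the LFSR by n steps at once. One step is a linear map
    over GF(2); n steps is its n-th matrix power, computed here by
    square-and-multiply. Structurally this is close to a GF(2)
    multiplier, not an adder."""
    step = [lfsr_step(1 << i, taps) for i in range(width)]  # basis images

    def apply(mat, s):                     # matrix * vector over GF(2)
        out = 0
        for i in range(width):
            if (s >> i) & 1:
                out ^= mat[i]
        return out

    def compose(a, b):                     # matrix * matrix over GF(2)
        return [apply(b, a[i]) for i in range(width)]

    while n:
        if n & 1:
            state = apply(step, state)
        step = compose(step, step)
        n >>= 1
    return state
```

For a one-off branch this costs a handful of XOR reductions instead of n sequential shifts.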
| IIAOPSW wrote:
| >I wonder what the limit of computing power per joule is
|
| Today you're going to learn that the universe puts a hard
| limit on this known as "Landauer's principle".
|
| To derive it in short, the (information) entropy on the
| input side of a single traditional logic gate is 2 bits,
| but on the output side it is just 1 bit. This seems to
| imply that the (physical) entropy of the computer would
| go down after the computation, because your computer had
| more possible physical states it could be in at the time
| of input than it seems to have at the time of output. But
| this is impossible as it violates the 2nd law of
| thermodynamics.
|
| To resolve the contradiction, each logic gate must be
| putting the missing bit of entropy on an untracked non-
| computational degree of freedom within the physical
| system. In other words the untracked "missing"
| information is encoded as seemingly random waste heat,
| and dumped into the environment at room temperature.
|
| https://en.wikipedia.org/wiki/Landauer%27s_principle
| ComputerGuru wrote:
| From the linked Wikipedia, as a more direct answer to
| GP's question:
|
| > Modern computers use about a billion times as much
| energy per operation [than this theoretical minimum
| energy per bit of entropy "erased"]
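For concreteness, the Landauer bound is E = kT·ln 2 per erased bit. Plugging in room temperature (standard constants; the arithmetic here is mine, not the thread's):

```python
import math

k_B = 1.380649e-23          # Boltzmann constant, J/K (exact SI value)
T = 300.0                   # room temperature, K

landauer_j_per_bit = k_B * T * math.log(2)
print(f"{landauer_j_per_bit:.2e} J per erased bit")   # ~2.87e-21 J

# Very rough modern figure, tens of picojoules per op (ballpark
# assumption, not a measurement of any particular CPU):
modern_j_per_op = 10e-12
print(f"~{modern_j_per_op / landauer_j_per_bit:.0e}x above the limit")
```

The ratio lands in the low billions, consistent with the Wikipedia quote above.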
| nullc wrote:
| well aware, which is why I put the technology limit on
| it!
| iamflimflam1 wrote:
| Interesting - I've been playing with some SD Card to USB
| interface ICs and almost all of them include an 8051 core.
| guenthert wrote:
| I'd think, if you need rapid, real time response to
| external signals, you don't use interrupts, since then you
| usually need it to be deterministic as well. Either use
| 'nother micro, a spare core, some specialized hardware
| (e.g. the PRUs in TI's Sitara MCUs, the PIO of the RP2040,
| the P8X32A), or roll your own (these days probably using FPGAs).
| SomeoneFromCA wrote:
| You are wrong, as polling always introduces jitter (due
| to the indeterminacy of the event's arrival moment between
| the last entry to the polling loop and the comparison
| operation). If you want really precise timing while
| reacting to an external event, the way to get it is to sit
| in a "halt" state with the interesting interrupts enabled.
| ddingus wrote:
| Another micro = big BOM cost increase
|
| Another core = overall device cost likely unnecessary.
|
| Both blow the power budget up.
|
| Regarding determinism, relative to what?
|
| An interrupt will be, or can be made, very consistent
| relative to the signal. Polling jitters relative to the
| same signal. That may not be desirable.
|
| Now, I did see you mention a Propeller chip!
|
| The first chip worked as you describe and it is
| beautiful.
|
| There are good reasons why interrupts were added to the
| second generation chip. They are mostly the reasons found
| in this discussion.
| pclmulqdq wrote:
| M0s have taken the place of many 8051s in the "pro" world as
| well. There's still the niche that sibling comments have
| mentioned, but a lot of "default small MCUs" for new projects
| used to be 8051s and are now M0s.
| addaon wrote:
| Besides the other comments, you can get 8051s at much lower
| power than M0s... think 1 picojoule per (8 bit) op vs 10
| picojoules per (32 bit) op, give or take. It's pretty common
| to see 8051s in the low power zone of microcontrollers that
| also have one or more 32 bit cores on them. Generally the low
| power zone (including the 8051) can be run off an external
| clock (so 25 MHz - 100 MHz) in the 1 mW range, or can be run
| off an RC oscillator at a lower speed (like 7 +- 3 MHz) in
| the 100 uA range, both of which usefully extend the ability
| of the system to monitor for wake events and decide when to
| bring those Arm big boys on line. Some can even take the 8051
| down to your 32 kHz real time clock for < 40 uA operation.
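Those per-op figures roughly check out against the quoted power budgets. A back-of-envelope sketch (all numbers are the comment's ballpark figures, not datasheet values):

```python
pj = 1e-12                     # joules per picojoule
ops_per_s = 100e6              # 100 MHz, assuming one op per clock
energy_per_op = 1 * pj         # the quoted ~1 pJ per 8-bit op

core_power_w = ops_per_s * energy_per_op
print(round(core_power_w * 1e3, 3), "mW")   # ~0.1 mW of switching power
# The quoted ~1 mW system figure leaves headroom for clock
# distribution, leakage, and peripherals.
```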
| userbinator wrote:
| The Z80 was common in MP3/"MP4" players too.
| Dwedit wrote:
| Throwing a cache into a system that never had a cache before can
| be quite tricky.
|
| You could have these kinds of memory pages:
|
| * Fixed ROM bank
|
| * Bankswitchable ROM bank
|
| * Fixed RAM bank
|
| * Bankswitchable RAM bank
|
| * IO memory
|
| * RAM that's read by external devices
|
| * RAM that's written to by external devices (basically just IO)
|
| Caching is _trivially easy_ for a fixed ROM or RAM bank that is
| not used by other devices. Caching a bankswitchable bank requires
| either invalidating on bankswitch, or knowing the bank switching
| well enough to just cache everything. Pure IO memory is simple:
| no caching for that at all. For RAM that's read by other
| devices, Write-Through caching would work.
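A toy model of that policy table might look like the following. The class, names, and page map are invented for illustration, and a real design would track dirty lines for write-back, which this sketch elides:

```python
# Toy sketch of per-page cache policy for a 6502 accelerator.
# Illustrative only: names and the page map are invented, not taken
# from the actual e-basteln design.
FIXED_ROM, BANKED_ROM, FIXED_RAM, BANKED_RAM, IO, SHARED_READ = range(6)

class PageCache:
    def __init__(self, policy_for_addr):
        self.policy = policy_for_addr      # address -> page type
        self.lines = {}                    # cached address -> value

    def read(self, addr, bus_read):
        kind = self.policy(addr)
        if kind == IO:
            return bus_read(addr)          # never cache I/O
        if addr not in self.lines:
            self.lines[addr] = bus_read(addr)
        return self.lines[addr]

    def write(self, addr, val, bus_write):
        kind = self.policy(addr)
        if kind == IO:
            bus_write(addr, val)           # straight to the bus
            return
        self.lines[addr] = val
        if kind == SHARED_READ:
            bus_write(addr, val)           # write-through, so external
                                           # readers always see fresh data

    def bank_switch(self, in_switched_region):
        # Invalidate every cached line in the switched region.
        self.lines = {a: v for a, v in self.lines.items()
                      if not in_switched_region(a)}
```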
| rbanffy wrote:
| The article mentions the bank-switching issues and that the
| FPGA only has 64K, which limits emulation of higher memory
| configs - it'd not emulate a //e with 80-column display (which
| requires 128K).
| bpye wrote:
| You could plausibly make bank switching work, but it'd take
| some effort. You'd want your block RAM to act as a write back
| cache, and then any bank switch must be intercepted and
| delayed until you can flush the full contents of the cache to
| memory.
|
| Or if bank switching is fast and occurs too frequently for
| that to be viable, you could avoid the flush across a bank
| switch, but then you may need to perform a bank switch for an
| eviction.
| masfuerte wrote:
| The BBC Master had an even funkier mode. The RAM bank accessed
| could depend on the current program counter.
|
| Imagine a video display taking 16K of RAM. This would be
| situated between addresses 0x4000 and 0x8000. This same memory
| range also included non-video RAM. The hardware transparently
| selected the video or non-video RAM depending on which code was
| accessing it.
|
| Specifically, if the program counter was at 0xC000 or above
| (i.e. code in the OS ROM was running) then accesses to the
| video range would go to the video RAM. But if the program
| counter was elsewhere (i.e. running user code or an application
| ROM) then accesses to the video range would go to user RAM, not
| video RAM.
|
| Additionally, there was a hardware register controlling this so
| that user code could choose to directly access the video RAM,
| and OS code could access the user RAM.
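The selection rule described above can be modeled as a small function. This is a toy reconstruction of the comment's description only; the override flag stands in for the hardware control register, whose real name and exact semantics differ:

```python
def select_bank(addr, pc, acc_override=False):
    """Toy model of the BBC Master shadow-RAM selection rule as
    described above: accesses to 0x4000-0x7FFF go to video RAM only
    when the program counter is at 0xC000 or above (OS ROM code).
    acc_override is a stand-in for the hardware control register."""
    if not (0x4000 <= addr < 0x8000):
        return "main"                      # outside the shadowed window
    pc_in_os_rom = pc >= 0xC000
    if acc_override:
        pc_in_os_rom = not pc_in_os_rom    # flip the normal selection
    return "video" if pc_in_os_rom else "user"

print(select_bank(0x5000, 0xD000))  # video: OS code touching the window
print(select_bank(0x5000, 0x2000))  # user: application code, same address
```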
| forinti wrote:
| I've been looking for a detailed explanation on this, because
| I would like to know how they read the PC.
|
| The New Advanced User's Manual describes the flag at &FE34 on
| the Master and B+ and I've read this thing about using the PC
| from a few places but I haven't found any specifics.
|
| Could you clarify? How do you know from the outside what's in
| the PC?
| ddingus wrote:
| Would they not just mooch the address off the address bus?
| morcheeba wrote:
| Yep. The 65C02 has a SYNC output that goes high to
| indicate an instruction is being fetched on the current
| cycle. Since there is no cache, it's pretty simple to use
| this to determine the PC.
| ddingus wrote:
| Thought so. From there, it is a bit of logic to map the
| chip selects to make the writes and reads come and go
| from the intended resources.
| forinti wrote:
| That's it then. I was missing the sync pin. Thank you.
| flohofwoe wrote:
| There were also 8-bit home computers (like the Amstrad CPC
| and C64) where a different memory bank would be accessed
| depending on whether the CPU would do a read or write access,
| e.g. a read would access a ROM bank, but a write would access
| a RAM bank at the same address (usually called 'shadow ROM').
| rjeli wrote:
| Huh? How do you read from the ram after you write to it?
| vidarh wrote:
| You bank switch then. A common use would be to e.g. be
| able to use the ROM to load code into the RAM under the
| ROM and _then_ bank switch, but also e.g. for extensions
| where you might want to first copy the ROM into the
| underlying RAM, and then patch whatever changes you
| wanted into it before bank switching.
| flohofwoe wrote:
| And sometimes you also don't need to read the written
| data back (for instance when writing to video memory).
| okl wrote:
| A minimum solution could be an instruction fetch buffer that
| memorizes the last N instructions (maybe even after decoding?)
| to alleviate pipeline bubbles and when jumping back.
| satiric wrote:
| I'd be surprised if 6502-based computers really ran OK at 100MHz:
| surely you'd run into EMI or timing issues when using the same
| motherboard at 100 times the original clock speed?
| rbanffy wrote:
| It slows down to access the bus when needed. Memory access runs
| at full speed all the time as the memory is inside the FPGA.
| Someone wrote:
| This doesn't do that. It runs an FPGA-built 6502 with 64kB of
| RAM at 100MHz. The FPGA knows which memory addresses it has to
| read and write to the actual memory of the system it's plugged
| in, and, when needed, accesses that at the speed the system
| expects.
| satiric wrote:
| Oh I see, that's neat. That makes way more sense than what I
| was thinking.
| bitwize wrote:
| People write games for the TI-99/4A in TI BASIC (or Extended
| BASIC) that would be too slow to be any fun on the original
| hardware, but flip on Classic99's Turbo mode and suddenly you
| have arcade action!
|
| I can see an upgrade like this enabling games, demos, and other
| software that wouldn't be possible on the stock Apple or
| Commodore systems.
| ddingus wrote:
| It does not take 100 MHz to do that.
|
| A while back I bought a FastChip for my Apple 2e. That delivers
| a 16 MHz 65C02 or 65C816 (I bought the latter option).
|
| The trick to getting superfast Applesoft is to copy it into the
| card's fast RAM. Otherwise, Applesoft is still faster than stock,
| but main board RAM is still clocked at 1 MHz. Not enough of a
| boost to really matter.
|
| However, once Applesoft is on the card, the situation is
| reversed! All accesses to main board RAM are 1 MHz, but that is
| plenty fast to draw a ton of graphics. Applesoft programs run
| crazy fast when the 16 MHz 6502 has the job.
| leptons wrote:
| There are demos written for the SuperCPU 20 MHz accelerator for
| the C64. There are also demos written for the Ram Expansion
| Unit. I think there was one recent demo released that runs in
| an emulator with no throttling so it's something like a 40MHz
| C64.
|
| One interesting demo combines 4 Commodore 64s to run one demo,
| called "Quad Core"
|
| https://youtu.be/B4UBlpTucFc?si=a1irvH7CRYhETnk9
| ddingus wrote:
| The Quad one is neat!
| nubinetwork wrote:
| Just throwing it out there, but I think you can still buy eZ180s
| that run at something like 133 MHz.
|
| Edit: hmm nope, mouser and digikey don't have them anymore...
| fastest I could find was 50 MHz and it was marked as not for new
| designs. Bummer.
| Theodores wrote:
| I wish I could go back in time to run Mandelbrot fractals on a
| BBC Micro with this - think how impressed people would be with
| instant zoom rather than a half hour wait!
| jacquesm wrote:
| Part of the magic was the half hour wait though. It felt as
| though you were doing some serious computation!
| ddingus wrote:
| Same with things like sine plots.
|
| The reveal is part of the fun.
|
| I just had a thought about what might be really fun at
| 100 MHz, and that is cellular automata. There are the classic
| Game of Life rules. Going fast on those is fun.
|
| But, maybe a more general engine is worth writing. I may try
| it on my 16 MHz 6502 system.
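For reference, the classic Game of Life rules mentioned here fit in a few lines. A minimal sparse-set sketch in Python (nothing 6502-specific; an 8-bit implementation would of course look very different):

```python
from collections import Counter

def life_step(cells):
    """One generation of Conway's Game of Life on a sparse set of
    live (x, y) cells, using the classic B3/S23 rules."""
    neigh = Counter((x + dx, y + dy)
                    for (x, y) in cells
                    for dx in (-1, 0, 1)
                    for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0))
    return {c for c, n in neigh.items()
            if n == 3 or (n == 2 and c in cells)}

blinker = {(0, 1), (1, 1), (2, 1)}       # the period-2 'blinker'
print(sorted(life_step(blinker)))        # [(1, 0), (1, 1), (1, 2)]
```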
| Lio wrote:
| This sounds like it would be really suitable for a BBC Micro
| second processor.
|
| They had designs for, amongst others, a 64K 65C02 running at a
| different clock speed[1].
|
| Back in the day I always wanted one for playing Elite[2] (but
| then I also wished that Acorn had provided an official hardware
| update for 16 colours instead of 8.)
|
| 1. https://en.wikipedia.org/wiki/BBC_Micro_expansion_unit
|
| 2. https://www.bbcelite.com/6502sp/
| jacquesm wrote:
| It'd be amazing to have it replace the primary one. It would be
| so fast you'd not need the tube at all. Though, with the tube
| you probably would have fewer timing issues to contend with. I
| wonder if the Elite code would seamlessly adapt to being run
| that much faster.
| indrora wrote:
| I'm curious how competitors in the "tiny FPGA" market are going
| to affect things.
|
| I'd love to see this rebuilt not on a Xilinx but something like
| the Gowin GW1N FPGAs:
| https://www.gowinsemi.com/en/product/detail/46/
| jacquesm wrote:
| That's still such an amazing accomplishment. Look at how
| densely packed the bottom of the board is; this is nothing
| short of jewelry.
| syngrog66 wrote:
| hopefully someone forwards this to Woz, he might get a kick out of it
___________________________________________________________________
(page generated 2024-01-28 23:01 UTC)