[HN Gopher] Intel x86 documentation has more pages than the 6502...
___________________________________________________________________
Intel x86 documentation has more pages than the 6502 has
transistors (2013)
Author : optimalsolver
Score : 221 points
Date : 2023-08-05 10:49 UTC (12 hours ago)
(HTM) web link (www.righto.com)
(TXT) w3m dump (www.righto.com)
| ddingus wrote:
| That is an intriguing stat.
| znpy wrote:
| So what? The 6502 is stuck at what, 40 years ago?
| masklinn wrote:
| 50.
|
| The 6502 was a famously simplistic and cheap processor, with
| about 2/3rds the transistors of an 8080 or 6800 (and like
| 20~25% the price at release), half that of a Z80.
| mpweiher wrote:
| Back in 2006, Intel's Montecito had a transistor budget for more
| ARM6 cores than the ARM6 had transistors, so:
|
| Transistor : ARM6 = ARM6 : Montecito
|
| And that was a long time ago, Montecito had a measly 1.72 billion
| transistors, Apple's M2 Ultra has 130 billion.
|
| Makes the whole Transputer (Transistor : Computer) idea seem
| somewhat prescient.
|
| https://blog.metaobject.com/2007/09/or-transistor.html
|
| And an idea how you could use those transistors differently from
| now:
|
| https://blog.metaobject.com/2015/08/what-happens-to-oo-when-...
| [deleted]
| atomlib wrote:
| There are transistors that have more than one page of
| documentation.
| ngneer wrote:
| Brilliant!
| FartyMcFarter wrote:
| In other mind-boggling stats: A single modern games console has
| more RAM than all of the Atari 2600s ever manufactured put
| together.
| FirmwareBurner wrote:
| _> A single modern games console has more RAM than all of the
| Atari 2600s ever manufactured put together_
|
| In other mind-boggling movie lore, an RTX 4090 has twice the
| TFLOPS of the killer AI Skynet used in Terminator 3[1].
|
| The writers back then probably thought 60 TFLOPS was such a
| ridiculously high sci-fi number for a world-ending AI that
| nothing could possibly come close to it, and 20 years later
| consumers can have twice that computing power in their home
| PCs.
|
| It's also a nice reminder how far technology has progressed in
| the last decades even if the pace has slowed down in the last
| years.
|
| [1] https://youtu.be/_Wlsd9mljiU?t=155
| segfaultbuserr wrote:
| > _RTX 4090 has 2x the TFLOPS computing power than the killer
| AI Skynet used in Terminator 3. The writers back then
| probably though 60 TFLOPS is such a ridiculously high sci-fi
| number_
|
| A fact worth noting, but routinely ignored in the popular
| press, is that these astronomical peak floating-point
| ratings of modern hardware are only achievable for a small
| selection of algorithms and problems. In practice, realizable
| performance is often much worse; efficiency can be as low as
| 1%.
|
| First, not all algorithms are best suited for the von Neumann
| architecture. Today, the memory wall is higher than ever. The
| machine balance (FLOPS vs. load/store) of modern hardware is
| around 100:1. To maximize floating-point operations, all data
| must fit in cache. This requires the algorithm to have a high
| level of data reuse via cache blocking. Some algorithms do it
| especially well, like dense linear algebra (Top500 LINPACK
| benchmark). Other algorithms are less compatible with this
| paradigm; they're going to be slow no matter how good the
| optimization is. Examples include many iterative physics
| simulation problems, sparse matrix code, and graph algorithms
| (Top500 HPCG benchmark). In the Top500 list, HPCG is usually
| 1% as fast as LINPACK. Best-optimized simulation code can
| perhaps reach 20% of Rpeak.
|
| This is why both Intel and AMD started offering special
| large-cache CPUs, either using on-package HBM or 3D-VCache.
| They're all targeted for HPC. Meanwhile in machine learning,
| people also made the switch to FP16, BF16 and INT8 largely
| because of the memory wall. Doing inference is a relatively
| cache-friendly problem, many HPC simulations are much worse
| in this aspect.
|
| Next, even if the algorithm is well-suited for cache
| blocking, peak datasheet performance is usually still
| unobtainable, because it's often calculated from the peak FMA
| throughput. This is unrealistic in real problems; you can't
| just do everything in FMA - 70% is a more realistic target.
| In the worst case, you get 50% of the performance
| (disappointing, but not as bad as the memory wall). In
| contrast to the theoretical datasheet peak Rpeak, the LINPACK
| result Rmax is measured by a real benchmark.
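The memory-wall argument above can be sketched with a toy roofline model. The peak, bandwidth, and arithmetic-intensity numbers below are illustrative assumptions, not measurements of any particular chip:

```python
# Hedged sketch: a toy roofline estimate of attainable FLOPS.
# All numbers here are illustrative assumptions.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: performance is capped by compute or by memory."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

peak = 60_000      # 60 TFLOPS expressed in GFLOPS (assumed peak)
bw = 1_000         # 1 TB/s memory bandwidth (assumed)

# Dense matmul with good cache blocking: high arithmetic intensity.
dense = attainable_gflops(peak, bw, flops_per_byte=100)

# Sparse/graph kernel: ~0.25 FLOPs per byte moved.
sparse = attainable_gflops(peak, bw, flops_per_byte=0.25)

print(dense)   # compute-bound: hits the 60,000 GFLOPS peak
print(sparse)  # memory-bound: 250 GFLOPS, under 1% of peak
```

The sparse case illustrates why low-intensity kernels sit around 1% of peak regardless of how many ALUs the chip has.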
| dogma1138 wrote:
| The 4090 provides over 80 TFLOPS in bog standard raw FP32
| compute, no tensor cores or MAD/FMA or any fancy
| instructions.
| jcranmer wrote:
| When you measure peak FLOPS, especially "my desktop
| computer has X FLOPS", you're generally computing N FMA
| units * f frequency: a theoretical maximum FLOPS figure. This
| number, as you note, has basically no relation to anything
| practical: we've long been at the point where our ability
| to stamp out ALUs greatly exceeds our ability to keep those
| units fed with useful data.
|
| Top500 measures FLOPS on a different basis. Essentially,
| see how long it takes to solve an NxN equation Ax=b (where
| N is large enough to stress your entire system), and use a
| synthetic formula to convert N into FLOPS. However, this
| kind of dense linear algebra is an unusually computation-
| heavy benchmark--you need to do about n^1.5 FLOPS per n
| words of data. Most kernels tend to do more like O(n) or
| maybe as high as O(n lg n) work for O(n) data, which
| requires a lot higher memory bandwidth than good LINPACK
| numbers do.
|
| Furthermore, graph or sparse algorithms tend to do really
| bad because the amount of work you're doing isn't able to
| hide the memory latency (think one FMA per A[B[i]] access--
| you might be able to do massive memory bandwidth fetches on
| the first B[i] access, but you end up with a massive memory
| gather operation for the A[x] access, which is extremely
| painful).
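The "synthetic formula" Top500 uses is the standard HPL operation count for an LU solve of an NxN system. A minimal sketch, with an illustrative problem size and solve time (not real benchmark data):

```python
# How Top500 converts problem size N and solve time into FLOPS,
# using the standard HPL operation count for LU-solving Ax=b:
#   flops(N) = (2/3)*N**3 + 2*N**2

def hpl_flops(n):
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def hpl_gflops(n, seconds):
    return hpl_flops(n) / seconds / 1e9

# Illustrative numbers only: N = 100,000 solved in 1000 s.
print(round(hpl_gflops(100_000, 1000.0)))  # ≈ 667 GFLOPS
```

Note the N³ work over N² data: arithmetic intensity grows with N, which is exactly why dense LU is unusually friendly to the memory hierarchy compared to O(n) kernels.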
| YetAnotherNick wrote:
| > Meanwhile in machine learning, people also made the
| switch to FP16, BF16 and INT8 largely because of the memory
| wall
|
| FP16 doesn't work any faster than mixed precision on Nvidia
| or any other platform (I have benchmarked GPUs, CPUs and
| TPUs). For matrix multiplication, computation is still the
| bottleneck due to N^3 computation vs N^2 memory access.
| CyberDildonics wrote:
| _In other mind boggling movie lore, an RTX 4090 has 2x the
| TFLOPS computing power than the killer AI Skynet used in
| Terminator 3[1]._
|
| That isn't really mind boggling since you are quoting
| fiction.
| FirmwareBurner wrote:
| _> That isn't really mind boggling since you are quoting
| fiction_
|
| Fiction of the past plays an important role in seeing how
| far tech has progressed, that what was once fiction is now
| a commodity.
| CyberDildonics wrote:
| How does this opinion explain a made up number as "mind
| boggling"?
| FirmwareBurner wrote:
| What makes you think it's a made up number? Just because
| it's been featured in a movie doesn't mean the number
| can't be grounded in the reality of the era. Yes, there are
| exaggerations, but big budget movies usually hire technical
| consultants to aid writers, prop builders and art directors
| with setting scenes that look realistic, rather than pulling
| random numbers out of thin air, which could be embarrassing
| mistakes for tech-savvy moviegoers.
|
| 60 TFLOPS is the equivalent of 10,000 PS2s of processing
| power, the most powerful console at the time, or 2x NEC
| Earth Simulator, the most powerful supercomputer at the
| time, which seems about right for what would be a virus
| taking over all the compute power of the DoD.
|
| So the writers definitely consulted people who knew something
| about computers to get a figure grounded in the reality of
| the time, rather than pulling a random number out of thin
| air. Especially since, at the time, even average joes were
| hearing about FLOPS as a measure of compute power, advertised
| in PC and gaming console specs, the writers naturally had to
| come up with a number that seemed very impressive but was
| also believable.
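A rough sanity check of those comparisons, treating the commonly cited peak figures (PS2 Emotion Engine ~6.2 GFLOPS, NEC Earth Simulator ~40.96 TFLOPS Rpeak) as assumptions:

```python
# Back-of-the-envelope check of the comparison above. The peak
# figures are commonly cited numbers, taken here as assumptions:
#   PS2 Emotion Engine:  ~6.2 GFLOPS peak
#   NEC Earth Simulator: ~40.96 TFLOPS Rpeak (Top500 #1 in 2003)
skynet = 60e12   # 60 TFLOPS, per the Terminator 3 dialogue
ps2 = 6.2e9
earth_sim = 40.96e12

print(round(skynet / ps2))           # ≈ 9677, close to "10,000 PS2s"
print(round(skynet / earth_sim, 1))  # ≈ 1.5x the era's top machine
```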
| pests wrote:
| It's not fiction that the writers thought 60TFLOPS would be
| huge today.
| CyberDildonics wrote:
| So what?
|
| It's a made up number that's supposed to sound fancy. It
| is for people who don't know much about computers. It's
| probably just there because people have heard the prefix
| 'tera', but wouldn't know what 'exa' or any other prefix
| means.
|
| It doesn't mean anything. Documentation made by people
| having more pages than a CPU which was also made by
| people is interesting because these are real things made
| for specific purposes, not a number pulled out of thin
| air for fiction.
|
| There is nothing 'mind blowing' about an uninformed
| person just being wrong. Is it 'mind blowing' that the
| original terminator was supposed to run on a 6502?
|
| In Johnny Mnemonic, 320 GB was supposed to be a lot of
| data in 2021, when today it costs the same as lunch for two
| people.
|
| https://www.imdb.com/title/tt0113481/plotsummary/
| iamgopal wrote:
| Isn't it a great reminder that technology has not progressed
| enough to even take advantage of 60 TFLOPS?
| FirmwareBurner wrote:
| _> to even take advantage of 60TFLOPS_
|
| Rendering Electron apps and mining Dogecoin?
| segfaultbuserr wrote:
| In scientific computing, it has become a serious problem.
| Because of the memory wall, many important algorithms can
| _never_ take advantage of 60 TFLOPS due to their low
| arithmetic intensity. The only solutions are (1) stop using
| these algorithms, (2) stop using von Neumann computers
| (e.g. in-memory computing). The stop-gap solution is HBM or
| 3D-VCache.
| pdimitar wrote:
| The writers could still turn out to be right. I am not sure
| we are making good use of all that hardware yet.
| koolba wrote:
| The only thing keeping us alive is that skynet is an
| electron app.
| MegaSec wrote:
| Damm high level programming languages. Just go back to
| assembly, that'll fix everything.
| pdimitar wrote:
| Yeah how dare they. ;)
|
| Truth be told though, I believe we are in for some more
| innovation in the area, especially with the advent of ARM
| lately. It's always kinda funny how these mega-machines
| we have still manage to stutter.
| SmellTheGlove wrote:
| > It's always kinda funny how these mega-machines we have
| still manage to stutter.
|
| I just figured that's the trade-off for general purpose
| computing. We can optimize for whatever covers the wide
| swath of use cases, but we can't optimize for everything,
| and some will continue to be mutually exclusive. Mind you,
| I'm no expert; I'm just extrapolating from how differently
| CPUs and GPUs have been optimized, both these days and
| historically.
| automatic6131 wrote:
| > I am not sure we are making good use of all that hardware
| yet.
|
| Dunno, working out the color of 8 million pixels every 6ms
| seems pretty good to me
| pdimitar wrote:
| True, though I was talking about the AI workloads.
| merob wrote:
| > The writers back then probably though 60 TFLOPS is such a
| ridiculously high sci-fi number for the world-ending AI, that
| nothing could possibly come close to it, and 20 years later
| consumers can have twice more computing power in their home
| PCs.
|
| If you look at the Top500 supercomputer list of the time [1],
| they actually nailed it: the #1 rank at the time hit a peak
| of 40 TFLOPS.
|
| [1] https://www.top500.org/lists/top500/2003/06/
| kabdib wrote:
| An SF book published in the 1950s (I have forgotten title and
| author, sigh) featured a then-imagined supercomputer with
|
| - 1M _bits_ of storage
|
| - A mile on a side
|
| - Located in Buffalo, NY, and cooled by Niagara Falls
| (vacuum tubes, natch)
|
| - Able to surveil every citizen in the nation for suspicious
| activity
|
| No mention of clock speed, cache-line size, or instruction
| set. I guess SF writers aren't computer designers :-)
| [deleted]
| karmakaze wrote:
| Would be more impressive if the 2600 had more than 128 bytes
| RAM--that's bytes not KB.
| zwirbl wrote:
| For reference, the Atari 2600 had 128 bytes of RAM with about
| 30 million devices sold
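Those two numbers make the thread's claim easy to check; the 16 GB modern-console figure is an assumption (roughly a PS5-class machine):

```python
# Sanity check of the claim above, under stated assumptions:
# ~30 million Atari 2600s sold, 128 bytes of RAM each, vs. a
# modern console assumed to have 16 GiB.
total_2600_ram = 30_000_000 * 128   # bytes across every 2600 made
modern_console = 16 * 1024**3       # 16 GiB in one console

print(total_2600_ram)                   # 3,840,000,000 bytes (~3.6 GiB)
print(modern_console > total_2600_ram)  # True
```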
| kristopolous wrote:
| I thought I sort of understood how computers work until I saw
| that.
|
| I really can't figure out how to do a full screen video game
| with state in 128B
| FartyMcFarter wrote:
| There was no framebuffer in those consoles [1]. So you
| pretty much only have to store game state and some
| auxiliary data in those 128 bytes, which starts sounding a
| lot easier.
|
| [1]
| https://en.wikipedia.org/wiki/Television_Interface_Adaptor
| [deleted]
| jdblair wrote:
| or, a lot harder, since your code can only draw a line at a
| time, not work with the whole frame buffer!
| charcircuit wrote:
| Modern games now have programmers deal with drawing a
| frame a pixel at a time when writing shaders. The GPUs
| themselves render a tile at a time and not the whole
| buffer.
| TheRealSteel wrote:
| Look up 'racing the beam' if you haven't before. The answer
| is... you can't! It didn't have a frame buffer and lines
| had to be written to the display one at a time. There was a
| lot of sprite flicker as many games had more on screen than
| the console could actually display in one frame.
| beebeepka wrote:
| Pacman was horrible.
| fredoralive wrote:
| It's basically only the state in the RAM, the game code is
| in ROM on the cartridge (you can have up to 4KB of ROM
| before having to rely on bank switching tricks). Video on the
| 2600 is weird, there isn't any video memory to speak of,
| you basically set up the video chip line by line in code.
| Tuna-Fish wrote:
| The program and assets are stored in a ROM cartridge, so
| only mutable data needs RAM.
|
| Actually drawing things on screen depends on two things:
|
| The first is racing the beam. The display output traces
| across the entire screen at 60Hz, one scanline at a time.
| At no point does a complete image exist, instead you just
| make sure to apply changes so that the machine can draw
| what needs to be drawn just before the beam traces that
| part of the image. You will need to cycle count so that
| your program takes exactly the right time to execute every
| section, because you certainly won't have time for
| interrupts or other synchronization.
|
| The second is using dedicated hardware, where you store the
| location on screen, color and memory address of a sprite,
| and the hardware draws it for you. There are a very limited
| number of sprites available, which limits the amount of
| things that can happen in a single line.
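The cycle budget behind racing the beam can be worked out from commonly cited NTSC 2600 figures, taken here as assumptions: a 1.19 MHz CPU clock, 262 scanlines per frame, ~60 frames per second:

```python
# The per-scanline budget behind "racing the beam", from commonly
# cited Atari 2600 NTSC figures (assumptions, not exact specs).
cpu_hz = 1.19e6
lines_per_frame = 262
frames_per_sec = 60

cycles_per_line = cpu_hz / (lines_per_frame * frames_per_sec)
print(round(cycles_per_line))  # ≈ 76 CPU cycles per scanline
```

With 6502 instructions taking 2-7 cycles each, that is only a couple dozen instructions in which to set up each line.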
| ddingus wrote:
| There is no frame buffer. The graphics are all drawn by
| manipulating a few registers in the video chip.
|
| Everything is scan lines and cycles. You get a one bit per
| pixel 40 pixel wide background, a couple of single color 8
| bit sprites and a couple more two bit wide sprites and that
| is pretty much it. A lot can be done by simply changing a
| color register at specific times too. Reusing sprites
| happens regularly as well. (A sprite drawn at the left of
| the screen can be repositioned to the right to be seen
| again.) That is the "racing the beam" part you may have
| heard people mention.
|
| Most of the CPU run time available for each frame is spent
| generating the display a scan line at a time.
|
| The real game happens during the vertical blanking period.
|
| Almost everything comes from ROM, leaving RAM for game
| state and the few objects that may need to be dynamic, and
| even those are in ROM when there is room for all the
| states.
|
| It is an odd feeling when you run out of room. The phrase,
| "I used every BIT of RAM" is literal! Happened to me once.
| No more bits and I had to either leave out a feature, or
| take a speed penalty by packing multiple states into single
| bytes.
| gerwim wrote:
| Great video [1] on how some clever tricks are used to stay
| within memory constraints.
|
| [1]: https://www.youtube.com/watch?v=sw0VfmXKq54
| csours wrote:
| https://humbletoolsmith.com/2017/07/08/learning-from-the-
| ata...
| andai wrote:
| > Anyway, at the time I did these measurements, my 4.2 GHz kaby
| lake had the fastest single-threaded performance of any machine
| you could buy but had worse latency than a quick machine from
| the 70s (roughly 6x worse than an Apple 2), which seems a bit
| curious. To figure out where the latency comes from, I started
| measuring keyboard latency because that's the first part of the
| pipeline. My plan was to look at the end-to-end pipeline and
| start at the beginning, ruling out keyboard latency as a real
| source of latency.
|
| > But it turns out keyboard latency is significant! I was
| surprised to find that the median keyboard I tested has more
| latency than the entire end-to-end pipeline of the Apple 2. If
| this doesn't immediately strike you as absurd, consider that an
| Apple 2 has 3500 transistors running at 1MHz and an Atmel
| employee estimates that the core used in a number of high-end
| keyboards today has 80k transistors running at 16MHz. That's
| 20x the transistors running at 16x the clock speed -- keyboards
| are often more powerful than entire computers from the 70s and
| 80s! And yet, the median keyboard today adds as much latency as
| the entire end-to-end pipeline as a fast machine from the 70s.
|
| https://danluu.com/keyboard-latency/
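The quoted ratios check out, using the figures the article itself cites (taken as assumptions here):

```python
# Checking the quoted ratios, using the figures danluu cites
# (assumed here): an Apple 2 with ~3,500 transistors at 1 MHz
# vs. a high-end keyboard controller with ~80,000 at 16 MHz.
apple2_transistors, apple2_mhz = 3_500, 1
kbd_transistors, kbd_mhz = 80_000, 16

print(round(kbd_transistors / apple2_transistors))  # ≈ 23, i.e. "20x"
print(kbd_mhz // apple2_mhz)                        # 16x the clock
```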
| _trampeltier wrote:
| A USB-C charger has much more computing power than the
| Apollo moon lander.
|
| https://www.theverge.com/tldr/2020/2/11/21133119/usb-c-
| anker...
| FredPret wrote:
| We'll have computronium soon if we carry on like this!
| ddingus wrote:
| But it is seriously i/o deficient!
| KronisLV wrote:
| > https://danluu.com/keyboard-latency/
|
| This might be a bit off topic, but it was surprising to see a
| Logitech K120 have the same latency as a Unicomp Model M or
| other keyboards that are 5-10x more expensive than it.
|
| No wonder I liked using it for work years ago: as far as
| membrane keyboards go, it's pretty dependable and decently
| nice to use, definitely so for its price.
| ddingus wrote:
| Why have that kind of resource in a keyboard?
|
| Some keyboards were made with 4 bit processors. I have yet to
| look one up and perhaps I should.
|
| Pretty much any 8 bit CPU would be luxurious. And low latency
| due to the single task, respectful code density, and rapid
| state changes for interrupts.
| ImAnAmateur wrote:
| That write up is fantastic but it's undated and probably from
| 2016/2017.
| mysterydip wrote:
| I know it's not a straight conversion, but I've wondered what
| performance would be like on a multi-core, multi-gigahertz 6502
| successor. Even a 486 had a million transistors, think how many
| 6502's could be on a die with that same count.
| junon wrote:
| Not very good, I wouldn't think. It didn't have many
| instructions or registers. There would be a LOT of memory
| reads/writes, and a lot of extra finagling of things that
| modern architectures can do in one or two instructions very
| cleanly.
|
| Assuming no extra modern stuff was added on top of the 6502 and
| you just got a bog-standard 6502 just at a very high clock
| speed, then as the other comment mentioned there was no memory
| caching, no pipelining (though there have been lots of people
| interested in designing one with a pipeliner), and not a whole
| lot of memory space to boot as it had 16-bit wide registers.
|
| Thus, most 6502 applications (including NES games, for example)
| had to use mappers to map in and out different physical memory
| regions around the memory bus, similar to modern day MMUs (just
| without the paging/virtualization). It would be hammering
| external memory like crazy.
| vidarh wrote:
| There's the mostly 6502 compatible 16-bit WDC 65C816 that'd
| probably be a much better, less painfully minimalist starting
| point.
|
| Apart from an MMU, _the_ biggest hurdle I think for modern
| dev on the original 6502 is the lack of multiply/divide and
| the pain of having to either forgo modern levels of stack
| use or emulate a bigger stack.
| Someone wrote:
| > and not a whole lot of memory space to boot as it had
| 16-bit wide registers.
|
| If only. Its program counter is 16 bits, but the other
| registers are 8 bits.
|
| > It would be hammering external memory like crazy.
|
| For some 6502s, like _crazier_. The NMOS 6502 has a 'feature'
| called 'phantom reads' where it will read memory more often
| than required.
|
| https://www.bigmessowires.com/2022/10/23/nmos-6502-phantom-r.
| ..:
|
| _"On the 65C02, STA (indirect),Y requires five clock cycles
| to execute: two cycles to fetch the opcode and operand, two
| cycles to read two bytes in the 16-bit pointer, and one cycle
| to perform the write to the calculated address including the
| Y offset. All good. But for the NMOS 6502 the same
| instruction requires six clock cycles. During the extra clock
| cycle, which occurs after reading the 16-bit pointer but
| before doing the write, the CPU performs a useless and
| supposedly-harmless read from the unadjusted pointer address:
| $CFF8. This is simply a side-effect of how the CPU works, an
| implementation detail footnote, because this older version of
| the 6502 can't read the pointer and apply the Y offset all in
| a single clock cycle."_
|
| I don't know whether it's true, but
| https://www.applefritter.com/content/w65c02s-6502 (Comment 8)
| claims:
|
| _"The Woz machine on the Disk II card is controlled by the
| address lines (which set its mode) and exploits this "phantom
| read" to switch from SHIFT to LOAD mode during floppy disk
| write operations so in the CPU cycle immediately following
| the phantom read, which is the write cycle, to the same
| address as the phantom read (assuming no page crossing this
| time), the Woz machine will be in LOAD mode and grab the disk
| byte ("nibble") from the 6502 data bus. The Woz machine does
| a blind grab of that byte, it does not look at the R/W signal
| to determine if it's a write cycle. If the phantom read is
| missing, it just won't work because the data byte will be
| gone before the mode changes to LOAD.
|
| I am quite sure that this is how the Disk II works. But this
| is where my own hands-on experience ends."_
| junon wrote:
| > If only. Its program counter is 16 bits, but the other
| registers are 8 bits.
|
| Oh wow, I didn't remember that. I had to go back to double
| check it! Maybe the NES had wider registers? Or maybe I'm
| just completely misremembering, which is definitely
| probably absolutely the case.
|
| Also, thanks for all the other links! Really interesting
| stuff.
| ddingus wrote:
| Back in the day, Rockwell had a dual 6502 variant in their
| databook. Two cores, each basically interleaving memory access.
|
| The fun part about 6502 and many friends from that era was
| memory was direct access, bytes moving in and out every CPU
| cycle. In fact the CPU was slow enough that this often did
| not happen, allowing for things like transparent DMA.
|
| Clock a 6502 at some GHz frequency and would there even be RAM
| fast enough?
|
| Would be a crazy interesting system though.
|
| I have a 16MHz one in my Apple and it is really fast! Someone
| made a 100MHz one with an FPGA.
|
| Most 6502 ops took 3 to 5, maybe 7 cycles to perform.
|
| At 1MHz that is roughly 250 to 300k instructions per second.
|
| 1GHz would scale the same way, assuming fast enough RAM,
| yielding roughly 300M instructions per second.
|
| I have always thought of it as adds and bit ops per second, and
| that is all a 6502 can really do.
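The instructions-per-second arithmetic above, as a sketch; the ~3.5 average cycles per instruction is an assumption in line with the 3-7 cycle range mentioned:

```python
# Sketch of the 6502 throughput estimate above. The 3.5-cycle
# average is an assumption, not a measured figure.
def instructions_per_sec(clock_hz, avg_cycles=3.5):
    return clock_hz / avg_cycles

print(round(instructions_per_sec(1e6)))  # ≈ 285,714 at 1 MHz
print(round(instructions_per_sec(1e9)))  # same ratio at 1 GHz, if RAM kept up
```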
| larschdk wrote:
| No division, no floating point, only 8-bit multiplication and
| addition, no pipeline, no cache, no MMU, no preemptive
| multitasking, very inefficient for higher level languages (even
| C). But you would get about 450 of them for the number of
| transistors on a 486.
| cbm-vic-20 wrote:
| The 6502 does not have any multiplication instructions. It
| does have very fast interrupt latency for the processors of
| its time, though.
| Findecanor wrote:
| Multitasking is tricky because the stack pointer is hard-
| coded to use page 1. It could be moved first on the 65CE02
| and on the 16-bit 65C816.
|
| In a code density comparison, the 6502 is as bad as some RISC
| processors where instructions are four bytes instead of one.
| <https://www.researchgate.net/publication/224114307_Code_dens
| ...>
|
| But it was fast, with the smallest instructions taking 2
| cycles (one to read the op-code, one to execute). A 1 MHz
| 6502 is considered on par with a 3.5 MHz Z80, overall.
| FartyMcFarter wrote:
| I'm not a hardware engineer, so take this with a grain of salt:
|
| The 6502 doesn't have a cache, only external memory. So
| performance would probably be much worse than naively expected
| (except perhaps for workloads that fit in CPU registers, _edit_
| - not even those, due to the lack of a code cache as well).
| Memory latencies haven't improved nearly as much as CPU
| speeds have, which is why modern CPUs have big caches.
|
| The CPU would be idly waiting for memory to respond most of the
| time, which completely kills performance.
| RetroTechie wrote:
| That's because on (most) modern systems, main RAM is combined
| (shared memory space) and external from CPU, connected
| through fat but high-latency pipe.
|
| A solution is to include RAM with each CPU core on-die. Afaik
| this is an uncommon approach because semiconductor
| fabrication processes for RAM vs. CPU don't match well? But
| it's not impossible - ICs like this exist, and e.g. SoCs
| with integrated RAM, CPU caches etc. are a thing.
|
| So imagine a 'compute module' consisting of a 6502 class CPU
| core + a few (dozen?) KB of RAM directly attached + some
| peripheral I/O to talk to neighbouring CPUs.
|
| Now imagine a matrix of 1000s of such compute modules,
| integrated on a single silicon die, all concurrently
| operating @ GHz clock speeds. Sounds like a GPU, with main
| RAM distributed across its compute units. :-)
|
| Examples: GreenArrays GA144, Cerebras Wafer Scale Engine
|
| (not sure about nature of the 'CPU' cores on the latter.
| General purpose or minimalist, AI/neural network processing
| specialized?)
|
| Afaik the 'problem' is more how to program such systems
| easily / securely, how to arrange common peripherals like
| USB, storage, video output etc. in a practical manner so as
| to utilize the immense compute power + memory bandwidth in
| such a system.
| masklinn wrote:
| > Now imagine a matrix of 1000s of such compute modules,
| integrated on a single silicon die, all concurrently
| operating @ GHz clock speeds. Sounds like a GPU, with main
| RAM distributed across its compute units. :-)
|
| Sounds like a new version of the Connection Machine (the
| classic / original one).
| hasmanean wrote:
| Back then, memory latency was shorter than a CPU clock
| cycle, so a cache was not needed.
| brazzy wrote:
| Now I have to wonder whether it would be possible to build a
| cache _around_ an unmodified 6502. I.e. a component that just
| sits between the (unmodified) CPU and memory and does
| caching.
| Someone wrote:
| I don't think an unmodified 6502 would get faster; it
| assumes memory is fast enough for its clock speed, and
| won't, for example, prefetch code or data.
|
| Also, if you add a modern-sized cache, you won't need the
| memory at all in a single-CPU system; the cache would
| easily be larger than the addressable memory.
| vidarh wrote:
| It'd only benefit if the latency of the external memory
| is bad enough to keep the cores waiting, sure, and
| certainly you're right regarding a single-CPU system. I
| think this fantasy experiment only "makes sense" (for low
| enough values of making sense) if you assume a massively
| parallel system. I don't think it'd be easy (or even
| doable) to get good performance out of it by modern
| standards - it'd be a fun weird experiment though.
| Someone wrote:
| If I were to build a system with lots of 6502s, I think I
| would build something transputer-like, so that each CPU
| has its own memory, and the total system can have more
| than 64 kB of RAM.
|
| The alternative of a massively parallel system with
| shared memory, on 6502, would mean all CPUs have 64 kB or
| memory, combined. I think that will limit what you can do
| with such a system too much.
| vidarh wrote:
| See: https://news.ycombinator.com/item?id=37011671
|
| These were intended to get to 4k cores (but 32 bit),
| w/32K on-die memory, but the ability to read from/write
| to other cores. A 6502 inspired variant would be cool,
| but to not get killed on the cycle cost of triggering
| reads/writes you'd probably want a modified core, or
| maybe a 65C816 (16-bit mostly 6502 compatible).
| ack_complete wrote:
| The Apple IIc+ did this with a 4MHz 65C02.
| [deleted]
| toast0 wrote:
| Should be, the 6502 has a RDY signal, to insert wait states
| in case accesses aren't ready yet. You'll need some way to
| configure the cache for address ranges and access types,
| but that would presumably just be more mmio.
| masklinn wrote:
| > The 6502 doesn't have a cache
|
| I don't think CPU caches were much of a thing back then, at
| least in the segments involved. AFAIK Intel wouldn't get on-
| die cache until the 486 (the 386 did support external L1, I
| think IBM's licensed variants also had internal caches but
| they were only for IBM to use).
|
| The most distinguishing feature of the 6502 is that it had
| almost no _registers_ and used external memory for most of
| the working set (it had shorter encodings for the first page,
| and IIRC later integrators would use faster memory for the
| zero page).
| FartyMcFarter wrote:
| > I don't think CPU caches were much of a thing back then
|
| Indeed, because memory was fast enough to respond within a
| single cycle back then. Or alternatively, CPU cycles were
| slow enough for that to happen depending on how you want to
| look at it :)
| vidarh wrote:
| You'd probably want to add at least a small cache, sure.
| Putting the zero page (256 bytes) on-die on each core
| (because it's typically used as "extra registers" of sort in
| 6502 code) plus a small additional instruction cache could do
| wonders. But you probably still wouldn't get anywhere near
| enough performance for it to be competitive with a more
| complex modern architecture.
|
| It'd be a fun experiment to see someone do in an FPGA,
| though.
| bitwize wrote:
| 6502 systems had RAM that ran at system speed. Everything was
| so much slower then so this was feasible... plus, some
| systems (like the Commodore VIC-20) used fast static RAM.
|
| The 6502 even had an addressing mode wherein accesses to the
| first 256 bytes of RAM -- the "zero page" -- were faster due
| to the shorter one-byte address, to the point where it was
| like having 256 extra CPU registers. Critical state for
| hot loops and such was often placed in the zero page.
|
| Do not presume to apply principles of modern systems to these
| old systems. They were very different.
| vidarh wrote:
| The saving for the zero page is only the more compact
| encoding, which saves one cycle per memory access (for the
| non-indexed variants, anyway).
|
| E.g. LDA $42 is encoded as $A5 $42, and so takes 3 memory
| accesses (two to read the instruction, one to read from
| address $42). LDA $1234 is encoded as $AD $34 $12, and so
| takes 4 cycles: three to read $AD $34 $12, and one to read the
| contents of address $1234. Same for the other zeropage vs.
| equivalent absolute instructions.
|
| See e.g. timing chart for LDA here:
|
| http://www.romdetectives.com/Wiki/index.php?title=LDA
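The timings above as data; these are the standard 6502 cycle counts, with zero page saving exactly one cycle (and one byte) over absolute addressing:

```python
# The LDA timings from the comment above, as data. Standard 6502
# cycle counts for the non-indexed load addressing modes.
lda_cycles = {
    "zeropage": 3,  # LDA $42   -> $A5 $42      (2-byte encoding)
    "absolute": 4,  # LDA $1234 -> $AD $34 $12  (3-byte encoding)
}
print(lda_cycles["absolute"] - lda_cycles["zeropage"])  # 1 cycle saved
```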
| jdblair wrote:
| The 6502 didn't need cache because its clock speed was slower
| than the DRAM connected to it. That made memory accesses very
| inexpensive.
|
| The Apple II took advantage of the speed difference between
| CPU and DRAM and designed the video hardware to read from
| memory every other memory cycle, interleaved with CPU memory
| access.
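|
| The interleave can be sketched like this (a toy model, not
| actual Apple II bus timing: each memory cycle has two phases,
| video reads on one and the CPU on the other):

```python
def bus_owner(phase):
    # The video circuitry reads during one phase of each cycle
    # and the CPU during the other, so video fetches never stall
    # the CPU (and, as a bonus, double as DRAM refresh).
    return "video" if phase == 0 else "cpu"

# Over any stretch of cycles, video and CPU alternate cleanly.
schedule = [bus_owner(p) for _ in range(3) for p in (0, 1)]
```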
| Vogtinator wrote:
| > The Apple II took advantage of the speed difference
| between CPU and DRAM and designed the video hardware to
| read from memory every other memory cycle, interleaved with
| CPU memory access.
|
| Same for the C64. Sometimes it was necessary to read
| slightly more than that for video display though, so the
| VIC (video chip) had to pause the CPU for a bit sometimes,
| resulting in so-called "badlines".
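|
| The condition the VIC-II tests for is roughly this (a
| simplified sketch; register names follow the usual VIC-II
| documentation):

```python
def is_badline(raster, yscroll, den_seen):
    # A badline occurs when the raster line is inside the display
    # window ($30-$f7), its low three bits match YSCROLL, and the
    # display-enable (DEN) bit was seen set.
    return (0x30 <= raster <= 0xF7
            and (raster & 7) == (yscroll & 7)
            and den_seen)

# With the default YSCROLL of 3, every eighth raster line in the
# window is a badline -- one per text row, 25 in total.
badlines = [r for r in range(0x30, 0xF8) if is_badline(r, 3, True)]
```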
| smolder wrote:
| I think they used SRAM and not DRAM, which is how it was
| faster than a clock cycle.
| Gracana wrote:
| Nope, they used 16k x 1-bit DRAM ICs.
| monocasa wrote:
| Both were faster. DRAM was a little cheaper, but required
| more circuitry to handle refresh on most CPUs making it a
| wash cost-wise on some designs. Typical Woz engineering
| got the cost of the refresh circuitry down to where DRAM
| made sense economically on the Apple ][.
|
| Interestingly, Z80s had DRAM refresh circuitry built in,
| which was one reason for their prevalence.
| peterfirefly wrote:
| Screen output _was_ the DRAM refresh.
|
| And for the Z80: also that it only needed GND and 5V. The
| 8080 also needed 12V. And the Z80 only needed a single
| clock phase -- the 8080 needed two. The 6502 also only
| needed 5V and a single clock input (the 6800 needed two
| clock phases). The 6502 and Z80 were simply a lot easier
| to work with than most of the competition.
| causality0 wrote:
| This comment makes me want to read an article focusing on how
| performance-per-transistor has changed over time.
| vidarh wrote:
| For an idea in that vein (many 6502's on a die), look at (the
| sadly defunct) Epiphany, in the Parallela [1] (the buy links
| are still there but I doubt any are still available).
|
| I have two, and they're fun little toys and I wish they'd have
| gotten further, with the caveat that actually making use of
| them in a way that'd be better than just resorting to a GPU is
| hard.
|
| The 16-core Epiphany[2] in the Parallella is too small, but
| they were hoping for 1k or 4k core version.
|
| I'm saying it's a similar idea because the Epiphany cores had
| 32KB on-die RAM per core, and a predictable "single cycle per
| core traversed" (I think, not checked the docs in years) cost
| of accessing RAM on the other cores, arranged in a grid. Each
| core is very simple, though still nowhere near the simplicity
| of a 6502 (they're 32-bit, with a much more "complete" modern
| instruction set).
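|
| As a rough sketch of that cost model (hypothetical numbers:
| one extra cycle per core traversed, approximated as Manhattan
| distance on the core grid):

```python
def remote_access_cost(src, dst, base_cycles=1):
    # One extra cycle per core hop on the on-chip mesh; a local
    # access (src == dst) costs just the base cycles.
    (sx, sy), (dx, dy) = src, dst
    hops = abs(sx - dx) + abs(sy - dy)
    return base_cycles + hops

# On a 4x4 grid (like the 16-core Epiphany), the worst case is a
# corner-to-corner access: 1 + 6 cycles under this model.
worst = remote_access_cost((0, 0), (3, 3))
```

| The point being that remote accesses are predictable and
| cheap-ish, but still grow with distance, which is why keeping
| data in the local 32KB matters so much.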
|
| The challenge is finding a problem that 1) decomposes well
| enough to benefit from many cores despite low RAM per core (if
| you need to keep hitting main system memory or neighbouring
| core RAM, you lose performance fast), 2) _does not_ decompose
| well into SIMD style processing where a modern GPU would make
| mincemeat of it, and 3) is worth the hassle of figuring out how
| to fit your problem into the memory constraints.
|
| I don't know if this is an idea that will ever work - I suspect
| the problem set where they could potentially compete is too
| small, but I really wish we'd see more weird hardware attempt
| like this anyway.
|
| [1] https://parallella.org/
|
| [2] https://www.adapteva.com/docs/e16g301_datasheet.pdf
| LeonenTheDK wrote:
| That reminds me of the Green Arrays computer with 144 cores,
| programmed in Color Forth.
|
| Chuck Moore has a talk on it here, very interesting stuff:
| https://www.youtube.com/watch?v=0PclgBd6_Zs
|
| I'm not sure how far this has gone since then. The site is
| still up with buy links as well:
| https://www.greenarraychips.com
| vidarh wrote:
| Yeah, I think they're even more minimalist. It's an
| interesting design space, but hard to see it taking off,
| especially with few developers comfortable with thinking
| about the kind of minimalism required to take advantage of
| these kinds of chips.
| LeonenTheDK wrote:
| Completely agree, it's an almost total shift from
| everything that's in use these days. Very interesting to
| play with though, I'd love to see some real world use.
| RetroTechie wrote:
| And 4) how to handle problems that _don't_ map well onto a
| massively-parallel machine. Some problems / algorithms are
| inherently serial by nature.
|
| Some applications like games decompose relatively easy into
| separate jobs like audio, video, user input, physics,
| networking, prefetch game data etc. Further decomposing those
| tasks... _not_ so easy. So e.g. 4-8 cores are useful. 100 or
| 1k+ cores otoh... hmmm.
| vidarh wrote:
| True, but I'd expect for a chip like that you'd do what the
| parallella did, or what we do with CPU + GPU and pair it
| with a chip with a smaller number of higher powered cores.
| E.g. the Parallella had 2x ARM cores along with the 16x
| Epiphany cores.
| convolvatron wrote:
| for control parallelism. for data parallelism you get to
| soak up as many cores as you want if you can structure your
| problem the right way.
|
| personally I'd love to see more exploration at hybrid
| approaches. the Tera MTA supported both modes pretty well.
| 3cats-in-a-coat wrote:
| I'm trying to be wowed by this, but... yes, modern tech be
| complex. A modern website has more source code than the first few
| versions of Windows put together.
| serf wrote:
| anecdote: most 'high volume source code' sites I've
| encountered are that way because of a need to meet standards
| on a large number of devices, compliance regulations, and
| scaffolding/boiler-plate; 90% of it being dead-code
| boilerplate.
|
| the logic comes at the very tail end, and often it is
| exceedingly limited, handing off most of the work to third
| party APIs/whatever.
|
| I guess what I'm trying to say is that source-code volume is a
| lousy metric for ascertaining 'complexity'; something can be
| huge and cumbersome but still only use simple logic that's easy
| to follow once you get past the cruft.
| 3cats-in-a-coat wrote:
| What you describe is basically how DNA works. Most of it is
| inactive junk. Parts of which, however, get activated
| when... who knows what happens.
|
| Basically we got coded by millions of interns.
| gary_0 wrote:
| And some of the code is never called, but removing it
| causes weird crashes because the cell.exe compiler is
| finicky. And some of the code shouldn't be there, but it
| was left behind by a virus infection that wasn't completely
| cleared out.
| rvba wrote:
| Most are done this way since they are outsourced to the
| cheapest bidders who "glue" multiple technologies poorly on
| top of each other. And then add tons of ads / tracker /
| analytics code.
|
| Lots of websites don't care that their technology is terrible.
| b5n wrote:
| There are parts that should be read, but when you only need a
| reference a good lookup tool is invaluable:
|
| https://github.com/skeeto/x86-lookup
| amelius wrote:
| It probably also has more bugs than the 6502 has transistors :)
| masklinn wrote:
| All processors have bugs. Despite its simplicity the 6502
| was hardly bug-free:
| https://en.wikipedia.org/wiki/MOS_Technology_6502#Bugs_and_q...
| kens wrote:
| Wow, I was surprised to see my article from 2013 appear on HN
| today! (It's hard to believe that the article is 10 years old.)
| In that time, the x86 documentation has expanded from four
| volumes to 10 volumes. Curiously, the number of pages has only
| grown by 21% (4181 pages to 5066). The x86 instruction set has
| added a bunch of features: AVX, FMA3, TSX, BMI, VNNI, and CET.
| But I guess the number of new instructions is relatively small
| compared to the existing ones.
| tos1 wrote:
| Update to the 2013 numbers: As of today, the Intel SDM
| (https://cdrdv2-public.intel.com/782158/325462-sdm-vol-1-2abc...)
| has 5066 pages.
___________________________________________________________________
(page generated 2023-08-05 23:01 UTC)