[HN Gopher] Intel x86 documentation has more pages than the 6502...
       ___________________________________________________________________
        
       Intel x86 documentation has more pages than the 6502 has
       transistors (2013)
        
       Author : optimalsolver
       Score  : 221 points
       Date   : 2023-08-05 10:49 UTC (12 hours ago)
        
 (HTM) web link (www.righto.com)
 (TXT) w3m dump (www.righto.com)
        
       | ddingus wrote:
       | That is an intriguing stat.
        
       | znpy wrote:
       | So what? The 6502 is stuck at what, 40 years ago?
        
         | masklinn wrote:
         | 50.
         | 
         | The 6502 was a famously simplistic and cheap processor, with
         | about 2/3rds the transistors of an 8080 or 6800 (and like
         | 20~25% the price at release), half that of a Z80.
        
       | mpweiher wrote:
       | Back in 2006, Intel's Montecito had a transistor budget for more
       | ARM6 cores than the ARM6 had transistors, so:
       | 
       | Transistor : ARM6 = ARM6 : Montecito
       | 
        | And that was a long time ago: Montecito had a measly 1.72
        | billion transistors; Apple's M2 Ultra has 130 billion.
       | 
       | Makes the whole Transputer ( Transistor:Computer ) idea seem
       | somewhat prescient.
       | 
       | https://blog.metaobject.com/2007/09/or-transistor.html
       | 
       | And an idea how you could use those transistors differently from
       | now:
       | 
       | https://blog.metaobject.com/2015/08/what-happens-to-oo-when-...
        
         | [deleted]
        
       | atomlib wrote:
       | There are transistors that have more than one page of
       | documentation.
        
         | ngneer wrote:
         | Brilliant!
        
       | FartyMcFarter wrote:
       | In other mind-boggling stats: A single modern games console has
       | more RAM than all of the Atari 2600s ever manufactured put
       | together.
        
         | FirmwareBurner wrote:
         | _> A single modern games console has more RAM than all of the
         | Atari 2600s ever manufactured put together_
         | 
          | In other mind-boggling movie lore, an RTX 4090 has 2x the
          | TFLOPS computing power of the killer AI Skynet used in
          | Terminator 3[1].
          | 
          | The writers back then probably thought 60 TFLOPS was such a
          | ridiculously high sci-fi number for the world-ending AI that
          | nothing could possibly come close to it, and 20 years later
          | consumers can have twice that computing power in their home
          | PCs.
         | 
          | It's also a nice reminder of how far technology has progressed
          | in recent decades, even if the pace has slowed down in recent
          | years.
         | 
         | [1]https://youtu.be/_Wlsd9mljiU?t=155
        
           | segfaultbuserr wrote:
           | > _RTX 4090 has 2x the TFLOPS computing power than the killer
           | AI Skynet used in Terminator 3. The writers back then
            | probably thought 60 TFLOPS is such a ridiculously high sci-fi
           | number_
           | 
            | A fact worth noting, but routinely ignored in the popular
            | press, is that these astronomical peak floating-point
            | ratings of modern hardware are only achievable for a small
            | selection of algorithms and problems. In practice, realizable
            | performance is often much worse; efficiency can be as low as
            | 1%.
           | 
           | First, not all algorithms are best suited for the von Neumann
           | architecture. Today, the memory wall is higher than ever. The
           | machine balance (FLOPS vs. load/store) of modern hardware is
           | around 100:1. To maximize floating-point operations, all data
           | must fit in cache. This requires the algorithm to have a high
           | level of data reuse via cache blocking. Some algorithms do it
           | especially well, like dense linear algebra (Top500 LINPACK
            | benchmark). Other algorithms are less compatible with this
            | paradigm; they're going to be slow no matter how good the
            | optimization is. Examples include many iterative physics
           | simulation problems, sparse matrix code, and graph algorithms
           | (Top500 HPCG benchmark). In the Top500 list, HPCG is usually
           | 1% as fast as LINPACK. Best-optimized simulation code can
           | perhaps reach 20% of Rpeak.
           | 
           | This is why both Intel and AMD started offering special
           | large-cache CPUs, either using on-package HBM or 3D-VCache.
           | They're all targeted for HPC. Meanwhile in machine learning,
           | people also made the switch to FP16, BF16 and INT8 largely
           | because of the memory wall. Doing inference is a relatively
           | cache-friendly problem, many HPC simulations are much worse
           | in this aspect.
           | 
           | Next, even if the algorithm is well-suited for cache
           | blocking, peak datasheet performance is usually still
           | unobtainable because it's often calculated from the peak FMA
            | throughput. This is unrealistic in real problems: you can't
            | just do everything in FMA, so 70% is a more realistic target.
           | In the worst case, you get 50% of the performance
           | (disappointing, but not as bad as the memory wall). In
           | contrast to datasheet peak performance, the LINPACK peak
           | performance Rpeak is measured by a real benchmark.
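The memory-wall argument above can be sketched with a toy roofline model. All hardware numbers below are illustrative assumptions (a hypothetical 60 TFLOP/s, 1 TB/s machine), not measurements of any real GPU:

```python
# Toy roofline model: attainable FLOP/s is capped by either peak compute
# or memory bandwidth times arithmetic intensity (FLOPs per byte moved).
# The hardware numbers below are illustrative assumptions only.

PEAK_FLOPS = 60e12  # 60 TFLOP/s peak compute (hypothetical)
PEAK_BW = 1e12      # 1 TB/s memory bandwidth (hypothetical)

def attainable(intensity_flops_per_byte):
    """Roofline: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * intensity_flops_per_byte)

# Dense linear algebra blocks well: high intensity, compute-bound.
dense = attainable(100.0)   # 100 FLOPs/byte -> hits the compute roof
# Sparse/graph kernels: ~1 FLOP per 8-byte word streamed from memory.
sparse = attainable(1 / 8)  # memory-bound

print(f"dense:  {dense / PEAK_FLOPS:.0%} of peak")   # 100%
print(f"sparse: {sparse / PEAK_FLOPS:.2%} of peak")  # 0.21%
```

With these assumed numbers the low-intensity kernel lands in the same ~1%-of-peak regime as HPCG vs. LINPACK in the comment above.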
        
             | dogma1138 wrote:
              | The 4090 provides over 80 TFLOPS in bog-standard raw FP32
              | compute, with no tensor cores, MAD/FMA, or any fancy
              | instructions.
        
             | jcranmer wrote:
             | When you measure peak FLOPS, especially "my desktop
             | computer has X FLOPS", you're generally computing N FMA
              | units * f frequency - a theoretical maximum. This
             | number, as you note, has basically no relation to anything
             | practical: we've long been at the point where our ability
             | to stamp out ALUs greatly exceeds our ability to keep those
             | units fed with useful data.
             | 
             | Top500 measures FLOPS on a different basis. Essentially,
             | see how long it takes to solve an NxN equation Ax=b (where
             | N is large enough to stress your entire system), and use a
             | synthetic formula to convert N into FLOPS. However, this
             | kind of dense linear algebra is an unusually computation-
             | heavy benchmark--you need to do about n^1.5 FLOPS per n
             | words of data. Most kernels tend to do more like O(n) or
             | maybe as high as O(n lg n) work for O(n) data, which
              | requires a lot higher memory bandwidth than good LINPACK
              | numbers do.
             | 
              | Furthermore, graph or sparse algorithms tend to do really
              | badly because the amount of work you're doing isn't able to
             | hide the memory latency (think one FMA per A[B[i]] access--
             | you might be able to do massive memory bandwidth fetches on
             | the first B[i] access, but you end up with a massive memory
             | gather operation for the A[x] access, which is extremely
             | painful).
        
             | YetAnotherNick wrote:
             | > Meanwhile in machine learning, people also made the
             | switch to FP16, BF16 and INT8 largely because of the memory
             | wall
             | 
             | FP16 doesn't work any faster than mixed precision on Nvidia
              | or any other platform (I have benchmarked GPUs, CPUs and
             | TPUs). For matrix multiplication, computation is still the
             | bottleneck due to N^3 computation vs N^2 memory access.
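The N^3-compute vs. N^2-data point can be made concrete: for an N x N matrix multiply, arithmetic intensity grows linearly with N, so large matrices become compute-bound. A quick sketch (counting 4-byte floats and ignoring caching details, which is a simplifying assumption):

```python
def matmul_intensity(n, bytes_per_elem=4):
    """FLOPs per byte for a naive N x N matrix multiply:
    ~2*N^3 FLOPs (multiply + add), 3*N^2 elements touched (A, B, C).
    Simplifies to n / (1.5 * bytes_per_elem), i.e. linear in n."""
    flops = 2 * n**3
    data_bytes = 3 * n**2 * bytes_per_elem
    return flops / data_bytes

print(matmul_intensity(16))    # ~2.7 FLOPs/byte: small tiles stay memory-bound
print(matmul_intensity(4096))  # ~682.7 FLOPs/byte: big matmuls are compute-bound
```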
        
           | CyberDildonics wrote:
           | _In other mind boggling movie lore, an RTX 4090 has 2x the
           | TFLOPS computing power than the killer AI Skynet used in
           | Terminator 3[1]._
           | 
           | That isn't really mind boggling since you are quoting
           | fiction.
        
             | FirmwareBurner wrote:
             | _> That isn't really mind boggling since you are quoting
             | fiction_
             | 
             | Fiction of the past plays an important role in seeing how
             | far tech has progressed, that what was once fiction is now
             | a commodity.
        
               | CyberDildonics wrote:
               | How does this opinion explain a made up number as "mind
               | boggling"?
        
               | FirmwareBurner wrote:
                | What makes you think it's a made-up number? Just because
                | it's been featured in a movie doesn't mean the number
                | can't be grounded in the reality of the era. Yes, there
                | are exaggerations, but big budget movies usually hire
                | technical consultants to aid writers, prop builders and
                | art directors with setting scenes that look realistic,
                | and don't just pull random numbers out of thin air, which
                | could be embarrassing mistakes for tech-savvy moviegoers.
               | 
                | 60 TFLOPS is the equivalent of 10,000 PS2s of processing
               | power, the most powerful console at the time, or 2x NEC
               | Earth Simulator, the most powerful supercomputer at the
               | time, which seems about right for what would be a virus
               | taking over all the compute power of the DoD.
               | 
                | So the writers definitely consulted with some people who
                | knew something about computers to get a figure grounded
                | in reality at the time, and didn't just pull a random
                | number out of thin air - especially since at the time
                | even average joes were hearing about FLOPS as a measure
                | of compute power, advertised in PC and gaming console
                | specs, so naturally they had to come up with a number
                | that seemed very impressive but was also believable.
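The back-of-the-envelope arithmetic roughly checks out. Using the commonly quoted ~6.2 GFLOPS peak for the PS2's Emotion Engine and the Earth Simulator's ~35.9 TFLOPS LINPACK result (both outside figures, not from this thread):

```python
ps2_gflops = 6.2          # commonly quoted PS2 Emotion Engine peak
skynet_tflops = 60.0      # figure quoted in Terminator 3
earth_sim_tflops = 35.86  # Earth Simulator LINPACK Rmax, circa 2002

print(skynet_tflops * 1000 / ps2_gflops)  # ~9700 PS2s, close to "10,000x"
print(skynet_tflops / earth_sim_tflops)   # ~1.7x the Earth Simulator
```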
        
             | pests wrote:
             | It's not fiction that the writers thought 60TFLOPS would be
             | huge today.
        
               | CyberDildonics wrote:
               | So what?
               | 
               | It's a made up number that's supposed to sound fancy. It
               | is for people who don't know much about computers. It's
               | probably just there because people have heard the prefix
               | 'tera', but wouldn't know what 'exa' or any other prefix
               | means.
               | 
               | It doesn't mean anything. Documentation made by people
               | having more pages than a CPU which was also made by
               | people is interesting because these are real things made
               | for specific purposes, not a number pulled out of thin
               | air for fiction.
               | 
               | There is nothing 'mind blowing' about an uninformed
               | person just being wrong. Is it 'mind blowing' that the
               | original terminator was supposed to run on a 6502?
               | 
               | In Johnny Mnemonic 320 GB was supposed to be a lot of
               | data in 2021 when it costs the same as lunch for two
               | people.
               | 
               | https://www.imdb.com/title/tt0113481/plotsummary/
        
           | iamgopal wrote:
            | Isn't it a great reminder that technology has not progressed
            | enough to even take advantage of 60 TFLOPS?
        
             | FirmwareBurner wrote:
             | _> to even take advantage of 60TFLOPS_
             | 
              | Rendering Electron apps and mining Dogecoins?
        
             | segfaultbuserr wrote:
             | In scientific computing, it has become a serious problem.
             | Because of the memory wall, many important algorithms can
             | _never_ take advantage of 60 TFLOPS due to their low
             | arithmetic intensity. The only solutions are (1) stop using
             | these algorithms, (2) stop using von Neumann computers
             | (e.g. in-memory computing). The stop-gap solution is HBM or
             | 3D-VCache.
        
           | pdimitar wrote:
           | The writers could still turn out to be right. I am not sure
           | we are making good use of all that hardware yet.
        
             | koolba wrote:
             | The only thing keeping us alive is that skynet is an
             | electron app.
        
             | MegaSec wrote:
              | Damn high-level programming languages. Just go back to
              | assembly, that'll fix everything.
        
               | pdimitar wrote:
               | Yeah how dare they. ;)
               | 
               | Truth be told though, I believe we are in for some more
               | innovation in the area, especially with the advent of ARM
               | lately. It's always kinda funny how these mega-machines
               | we have still manage to stutter.
        
               | SmellTheGlove wrote:
               | > It's always kinda funny how these mega-machines we have
               | still manage to stutter.
               | 
               | I just figured that's the trade-off for general purpose
               | computing. We can optimize for whatever covers the wide
               | swath of use cases, but we can't optimize for everything,
                | and some will continue to be mutually exclusive. Mind
                | you, I'm no expert; I'm just extrapolating from how
                | differently CPUs and GPUs are optimized these days
                | versus historically.
        
             | automatic6131 wrote:
              | > _I am not sure we are making good use of all that
              | hardware yet._
              | 
              | Dunno, working out the color of 8 million pixels every
              | 6ms seems pretty good to me.
        
               | pdimitar wrote:
               | True, though I was talking about the AI workloads.
        
           | merob wrote:
            | > The writers back then probably thought 60 TFLOPS is such a
           | ridiculously high sci-fi number for the world-ending AI, that
           | nothing could possibly come close to it, and 20 years later
           | consumers can have twice more computing power in their home
           | PCs.
           | 
            | If you look at the top500 supercomputer list of the time [1],
            | they actually nailed it, with the #1 rank at the time hitting
            | a peak of 40 TFLOPS.
           | 
           | [1] https://www.top500.org/lists/top500/2003/06/
        
           | kabdib wrote:
           | An SF book published in the 1950s (I have forgotten title and
           | author, sigh) featured a then-imagined supercomputer with
           | 
           | - 1M _bits_ of storage
           | 
           | - A mile on a side
           | 
            | - Located in Buffalo, NY, and cooled by Niagara Falls (vacuum
           | tubes, natch)
           | 
           | - Able to surveil every citizen in the nation for suspicious
           | activity
           | 
           | No mention of clock speed, cache-line size, or instruction
           | set. I guess SF writers aren't computer designers :-)
        
           | [deleted]
        
         | karmakaze wrote:
         | Would be more impressive if the 2600 had more than 128 bytes
         | RAM--that's bytes not KB.
        
         | zwirbl wrote:
         | For reference, the Atari 2600 had 128 bytes of RAM with about
         | 30 million devices sold
        
           | kristopolous wrote:
           | I thought I sort of understood how computers work until I saw
           | that.
           | 
           | I really can't figure out how to do a full screen video game
           | with state in 128B
        
             | FartyMcFarter wrote:
             | There was no framebuffer in those consoles [1]. So you
             | pretty much only have to store game state and some
             | auxiliary data in those 128 bytes, which starts sounding a
             | lot easier.
             | 
             | [1]
             | https://en.wikipedia.org/wiki/Television_Interface_Adaptor
        
               | [deleted]
        
               | jdblair wrote:
                | or, a lot harder, since your code can only draw a line at
                | a time, not work with the whole frame buffer!
        
               | charcircuit wrote:
               | Modern games now have programmers deal with drawing a
               | frame a pixel at a time when writing shaders. The GPUs
               | themselves render a tile at a time and not the whole
               | buffer.
        
             | TheRealSteel wrote:
             | Look up 'racing the beam' if you haven't before. The answer
             | is... you can't! It didn't have a frame buffer and lines
             | had to be written to the display one at a time. There was a
             | lot of sprite flicker as many games had more on screen than
             | the console could actually display in one frame.
        
               | beebeepka wrote:
               | Pacman was horrible.
        
             | fredoralive wrote:
              | It's basically only the state in the RAM; the game code is
              | in ROM on the cartridge (you can have up to 4KB of ROM
              | before having to rely on bank switching tricks). Video on
              | the 2600 is weird: there isn't any video memory to speak
              | of, you basically set up the video chip line by line in
              | code.
        
             | Tuna-Fish wrote:
             | The program and assets are stored in a ROM cartridge, so
             | only mutable data needs RAM.
             | 
             | Actually drawing things on screen depends on two things:
             | 
             | The first is racing the beam. The display output traces
             | across the entire screen at 60Hz, one scanline at a time.
             | At no point does a complete image exist, instead you just
             | make sure to apply changes so that the machine can draw
             | what needs to be drawn just before the beam traces that
             | part of the image. You will need to cycle count so that
             | your program takes exactly the right time to execute every
             | section, because you certainly won't have time for
             | interrupts or other synchronization.
             | 
              | The second is using dedicated hardware, where you store the
              | location on screen, color and memory address of a sprite,
              | and the hardware draws it for you. There is a very limited
              | number of sprites available, which limits the number of
              | things that can happen in a single line.
        
             | ddingus wrote:
             | There is no frame buffer. The graphics are all drawn by
             | manipulating a few registers in the video chip.
             | 
             | Everything is scan lines and cycles. You get a one bit per
             | pixel 40 pixel wide background, a couple of single color 8
             | bit sprites and a couple more two bit wide sprites and that
             | is pretty much it. A lot can be done by simply changing a
              | color register at specific times too. Reusing sprites
              | happens regularly as well. (A sprite drawn at the left of
              | the screen can be repositioned to the right to be seen
              | again. That is the "racing the beam" part you may have
              | heard people mention.)
             | 
             | Most of the CPU run time available for each frame is spent
             | generating the display a scan line at a time.
             | 
             | The real game happens during the vertical blanking period.
             | 
              | Almost everything comes from ROM, leaving RAM for game
             | state and the few objects that may need to be dynamic, and
             | even those are in ROM when there is room for all the
             | states.
             | 
             | It is an odd feeling when you run out of room. The phrase,
             | "I used every BIT of RAM" is literal! Happened to me once.
             | No more bits and I had to either leave out a feature, or
             | take a speed penalty by packing multiple states into single
             | bytes.
        
             | gerwim wrote:
             | Great video [1] on how some clever tricks are used to stay
             | within memory constraints.
             | 
             | [1]: https://www.youtube.com/watch?v=sw0VfmXKq54
        
             | csours wrote:
             | https://humbletoolsmith.com/2017/07/08/learning-from-the-
             | ata...
        
         | andai wrote:
         | > Anyway, at the time I did these measurements, my 4.2 GHz kaby
         | lake had the fastest single-threaded performance of any machine
         | you could buy but had worse latency than a quick machine from
         | the 70s (roughly 6x worse than an Apple 2), which seems a bit
         | curious. To figure out where the latency comes from, I started
         | measuring keyboard latency because that's the first part of the
         | pipeline. My plan was to look at the end-to-end pipeline and
         | start at the beginning, ruling out keyboard latency as a real
         | source of latency.
         | 
         | > But it turns out keyboard latency is significant! I was
         | surprised to find that the median keyboard I tested has more
         | latency than the entire end-to-end pipeline of the Apple 2. If
          | this doesn't immediately strike you as absurd, consider that an
         | Apple 2 has 3500 transistors running at 1MHz and an Atmel
         | employee estimates that the core used in a number of high-end
         | keyboards today has 80k transistors running at 16MHz. That's
         | 20x the transistors running at 16x the clock speed -- keyboards
         | are often more powerful than entire computers from the 70s and
         | 80s! And yet, the median keyboard today adds as much latency as
         | the entire end-to-end pipeline as a fast machine from the 70s.
         | 
         | https://danluu.com/keyboard-latency/
        
           | _trampeltier wrote:
            | A USB-C charger has much more computing power than the
            | Apollo Moon lander.
           | 
           | https://www.theverge.com/tldr/2020/2/11/21133119/usb-c-
           | anker...
        
             | FredPret wrote:
             | We'll have computronium soon if we carry on like this!
        
             | ddingus wrote:
             | But it is seriously i/o deficient!
        
           | KronisLV wrote:
           | > https://danluu.com/keyboard-latency/
           | 
           | This might be a bit off topic, but it was surprising to see a
           | Logitech K120 have the same latency as a Unicomp Model M or
           | other keyboards that are 5-10x more expensive than it.
           | 
           | No wonder I liked using it for work years ago: as far as
           | membrane keyboards go, it's pretty dependable and decently
            | nice to use, definitely so for its price.
        
           | ddingus wrote:
           | Why have that kind of resource in a keyboard?
           | 
           | Some keyboards were made with 4 bit processors. I have yet to
           | look one up and perhaps I should.
           | 
           | Pretty much any 8 bit CPU would be luxurious. And low latency
           | due to the single task, respectful code density, and rapid
           | state changes for interrupts.
        
           | ImAnAmateur wrote:
           | That write up is fantastic but it's undated and probably from
           | 2016/2017.
        
       | mysterydip wrote:
       | I know it's not a straight conversion, but I've wondered what
       | performance would be like on a multi-core, multi-gigahertz 6502
        | successor. Even a 486 had a million transistors; think how many
        | 6502s could be on a die with that same count.
        
         | junon wrote:
         | Not very good, I wouldn't think. It didn't have many
         | instructions or registers. There would be a LOT of memory
          | reads/writes, a lot of extra finagling of things that modern
         | architectures can do in one or two instructions very cleanly.
         | 
         | Assuming no extra modern stuff was added on top of the 6502 and
         | you just got a bog-standard 6502 just at a very high clock
         | speed, then as the other comment mentioned there was no memory
         | caching, no pipelining (though there have been lots of people
         | interested in designing one with a pipeliner), and not a whole
         | lot of memory space to boot as it had 16-bit wide registers.
         | 
         | Thus, most 6502 applications (including NES games, for example)
         | had to use mappers to map in and out different physical memory
         | regions around the memory bus, similar to modern day MMUs (just
         | without the paging/virtualization). It would be hammering
         | external memory like crazy.
        
           | vidarh wrote:
           | There's the mostly 6502 compatible 16-bit WDC 65C816 that'd
           | probably be a much better, less painfully minimalist starting
           | point.
           | 
            | Apart from an MMU, _the_ biggest hurdle I think for modern
            | dev on the original 6502 is the lack of multiply/divide and
            | the pain of having to either forgo modern levels of stack
            | use or emulate a bigger stack.
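The missing multiply is classically worked around with a shift-and-add loop, one multiplier bit per iteration, which is how hand-rolled 6502 routines do it. This is a Python sketch of the idea, not 6502 code:

```python
def mul8(a, b):
    """8-bit shift-and-add multiply, the classic software workaround
    on CPUs with no MUL instruction. Returns a 16-bit product."""
    product = 0
    for _ in range(8):       # one pass per bit of the 8-bit multiplier
        if b & 1:            # low bit of multiplier set?
            product += a     # add the shifted multiplicand
        a <<= 1              # shift multiplicand left
        b >>= 1              # shift multiplier right
        # each iteration costs several instructions on a 6502, so an
        # 8x8 multiply can easily run to over 100 cycles
    return product & 0xFFFF

print(mul8(13, 11))  # 143
```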
        
           | Someone wrote:
           | > and not a whole lot of memory space to boot as it had
           | 16-bit wide registers.
           | 
           | If only. Its program counter is 16 bits, but the other
           | registers are 8 bits.
           | 
           | > It would be hammering external memory like crazy.
           | 
            | For some 6502s, make that _crazier_. The NMOS 6502 has a 'feature'
           | called 'phantom reads' where it will read memory more often
           | than required.
           | 
           | https://www.bigmessowires.com/2022/10/23/nmos-6502-phantom-r.
           | ..:
           | 
           |  _"On the 65C02, STA (indirect),Y requires five clock cycles
           | to execute: two cycles to fetch the opcode and operand, two
           | cycles to read two bytes in the 16-bit pointer, and one cycle
           | to perform the write to the calculated address including the
           | Y offset. All good. But for the NMOS 6502 the same
           | instruction requires six clock cycles. During the extra clock
           | cycle, which occurs after reading the 16-bit pointer but
           | before doing the write, the CPU performs a useless and
           | supposedly-harmless read from the unadjusted pointer address:
           | $CFF8. This is simply a side-effect of how the CPU works, an
           | implementation detail footnote, because this older version of
           | the 6502 can't read the pointer and apply the Y offset all in
           | a single clock cycle."_
           | 
           | I don't know whether it's true, but
           | https://www.applefritter.com/content/w65c02s-6502 (Comment 8)
           | claims:
           | 
           |  _"The Woz machine on the Disk II card is controlled by the
           | address lines (which set its mode) and exploits this "phantom
           | read" to switch from SHIFT to LOAD mode during floppy disk
           | write operations so in the CPU cycle immediately following
           | the phantom read, which is the write cycle, to the same
           | address as the phantom read (assuming no page crossing this
           | time), the Woz machine will be in LOAD mode and grab the disk
           | byte ("nibble") from the 6502 data bus. The Woz machine does
           | a blind grab of that byte, it does not look at the R/W signal
           | to determine if it's a write cycle. If the phantom read is
           | missing, it just won't work because the data byte will be
           | gone before the mode changes to LOAD.
           | 
           | I am quite sure that this is how the Disk II works. But this
           | is where my own hands-on experience ends."_
        
             | junon wrote:
             | > If only. Its program counter is 16 bits, but the other
             | registers are 8 bits.
             | 
             | Oh wow, I didn't remember that. I had to go back to double
             | check it! Maybe the NES had wider registers? Or maybe I'm
             | just completely misremembering, which is definitely
             | probably absolutely the case.
             | 
             | Also, thanks for all the other links! Really interesting
             | stuff.
        
         | ddingus wrote:
         | Back in the day, Rockwell had a dual 6502 variant in their
          | databook. Two cores, each basically interleaving memory access.
         | 
          | The fun part about the 6502 and many friends from that era was
          | that memory was direct access, bytes moving in and out every
          | CPU cycle. In fact the CPU was slow enough that it often did
          | not use the bus, allowing for things like transparent DMA.
         | 
          | Clock a 6502 at some GHz frequency and would there even be RAM
         | fast enough?
         | 
         | Would be a crazy interesting system though.
         | 
          | I have a 16 MHz one in my Apple and it is really fast! Someone
          | made a 100 MHz one with an FPGA.
         | 
         | Most 6502 ops took 3 to 5, maybe 7 cycles to perform.
         | 
          | At 1 MHz that is roughly 250 to 300k instructions per second.
         | 
          | 1 GHz would scale the same assuming fast enough RAM, yielding
          | roughly 300M instructions per second.
         | 
         | I have always thought of it as adds and bit ops per second, and
         | that is all a 6502 can really do.
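ddingus's throughput estimate scales linearly with clock, as a quick check shows. The 3.5 cycles-per-instruction average is an assumption for typical code, consistent with the 3-to-7-cycle range above:

```python
def instructions_per_second(clock_hz, avg_cycles=3.5):
    """Rough 6502 throughput: clock / average cycles per instruction.
    3.5 cycles/instruction is an assumed average for typical code."""
    return clock_hz / avg_cycles

print(instructions_per_second(1e6))  # ~286k at 1 MHz, matching "250-300k"
print(instructions_per_second(1e9))  # ~286M at 1 GHz, given fast enough RAM
```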
        
         | larschdk wrote:
         | No division, no floating point, only 8-bit multiplication and
         | addition, no pipeline, no cache, no MMU, no preemptive
         | multitasking, very inefficient for higher level languages (even
         | C). But you would get about 450 of them for the number of
         | transistors on a 486.
        
           | cbm-vic-20 wrote:
           | The 6502 does not have any multiplication instructions. It
           | does have very fast interrupt latency for the processors of
           | its time, though.
        
           | Findecanor wrote:
          | Multitasking is tricky because the stack pointer is
          | hard-coded to use page 1. It could first be moved on the
          | 65CE02 and on the 16-bit 65C816.
           | 
           | In a code density comparison, the 6502 is as bad as some RISC
           | processors where instructions are four bytes instead of one. 
           | <https://www.researchgate.net/publication/224114307_Code_dens
           | ...>
           | 
           | But it was fast, with the smallest instructions taking 2
           | cycles (one to read the op-code, one to execute). A 1 MHz
           | 6502 is considered on par with a 3.5 MHz Z80, overall.
        
         | FartyMcFarter wrote:
         | I'm not a hardware engineer, so take this with a grain of salt:
         | 
         | The 6502 doesn't have a cache, only external memory. So
         | performance would probably be much worse than naively expected
         | (except perhaps for workloads that fit in CPU registers;
         | _edit:_ not even those, due to the lack of a code cache as
         | well). Memory latencies haven't improved nearly as much as
         | CPU speeds have, which is why modern CPUs have big caches.
         | 
         | The CPU would be idly waiting for memory to respond most of the
         | time, which completely kills performance.
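A minimal stall model makes the point concrete; all the numbers here are illustrative assumptions, not measurements:

```python
# Hypothetical stall model: a 6502-style core touches memory on every
# cycle, so no cycle can complete faster than the memory latency.
def effective_mips(clock_mhz: float, cycles_per_instr: float,
                   mem_latency_ns: float) -> float:
    """Millions of instructions/sec when every cycle is a memory access."""
    cycle_ns = 1000.0 / clock_mhz
    # A cycle stretches to the memory latency if memory is the slower one.
    effective_cycle_ns = max(cycle_ns, mem_latency_ns)
    return 1000.0 / (cycles_per_instr * effective_cycle_ns)

slow = effective_mips(1, 3.5, 500)    # 1 MHz core, 500 ns DRAM: no stalls
fast = effective_mips(1000, 3.5, 50)  # 1 GHz core, 50 ns DRAM: stalled 50x
```

Under these assumed numbers the 1 GHz core delivers only ~5.7 MIPS instead of the ~286 MIPS that stall-free clock scaling would suggest.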
        
           | RetroTechie wrote:
           | That's because on (most) modern systems, main RAM is a
           | combined (shared) memory space external to the CPU,
           | connected through a fat but high-latency pipe.
           | 
           | A solution is to include RAM with each CPU core on-die.
           | Afaik this is an uncommon approach because semiconductor
           | fabrication processes for RAM vs. CPU don't match well? But
           | it's not impossible - ICs like this exist, and e.g. SoCs
           | with integrated RAM, CPU caches etc. are a thing.
           | 
           | So imagine a 'compute module' consisting of a 6502-class
           | CPU core + a few (dozen?) KB of RAM directly attached + some
           | peripheral I/O to talk to neighbouring CPUs.
           | 
           | Now imagine a matrix of 1000s of such compute modules,
           | integrated on a single silicon die, all concurrently
           | operating @ GHz clock speeds. Sounds like a GPU, with main
           | RAM distributed across its compute units. :-)
           | 
           | Examples: GreenArrays GA144, Cerebras Wafer Scale Engine.
           | 
           | (not sure about nature of the 'CPU' cores on the latter.
           | General purpose or minimalist, AI/neural network processing
           | specialized?)
           | 
           | Afaik the 'problem' is more how to program such systems
           | easily / securely, how to arrange common peripherals like
           | USB, storage, video output etc. in a practical manner as to
           | utilize the immense compute power + memory bandwidth in such
           | a system.
        
             | masklinn wrote:
             | > Now imagine a matrix of 1000s of such compute modules,
             | integrated on a single silicon die, all concurrently
             | operating @ GHz clock speeds. Sounds like a GPU, with main
             | RAM distributed across its compute units. :-)
             | 
             | Sounds like a new version of the Connection Machine (the
             | classic / original one).
        
           | hasmanean wrote:
           | Back then memory latency was shorter than a CPU clock
           | cycle, so a cache was not needed.
        
           | brazzy wrote:
           | Now I have to wonder whether it would be possible to build a
           | cache _around_ an unmodified 6502. I.e. a component that just
           | sits between the (unmodified) CPU and memory and does
           | caching.
        
             | Someone wrote:
             | I don't think an unmodified 6502 would get faster; it
             | assumes memory is fast enough for its clock speed, and
             | won't, for example, prefetch code or data.
             | 
             | Also, if you add a modern-sized cache, you won't need the
             | memory at all in a single-CPU system; the cache would
             | easily be larger than the addressable memory.
        
               | vidarh wrote:
               | It'd only benefit if the latency of the external memory
               | is bad enough to keep the cores waiting, sure, and
               | certainly you're right regarding a single-CPU system. I
               | think this fantasy experiment only "makes sense" (for low
               | enough values of making sense) if you assume a massively
               | parallel system. I don't think it'd be easy (or even
               | doable) to get good performance out of it by modern
               | standards - it'd be a fun weird experiment though.
        
               | Someone wrote:
               | If I were to build a system with lots of 6502s, I think I
               | would build something transputer-like, so that each CPU
               | has its own memory, and the total system can have more
               | than 64 kB of RAM.
               | 
               | The alternative, a massively parallel system with
               | shared memory on 6502s, would mean all CPUs have 64 kB
               | of memory, combined. I think that would limit what you
               | can do with such a system too much.
        
               | vidarh wrote:
               | See: https://news.ycombinator.com/item?id=37011671
               | 
               | These were intended to get to 4k cores (but 32 bit),
               | w/32K on-die memory, but the ability to read from/write
               | to other cores. A 6502 inspired variant would be cool,
               | but to not get killed on the cycle cost of triggering
               | reads/writes you'd probably want a modified core, or
               | maybe a 65C816 (16-bit mostly 6502 compatible).
        
             | ack_complete wrote:
             | The Apple IIc+ did this with a 4 MHz 65C02.
        
             | [deleted]
        
             | toast0 wrote:
             | It should be; the 6502 has a RDY signal to insert wait
             | states in case accesses aren't ready yet. You'll need
             | some way to configure the cache for address ranges and
             | access types, but that would presumably just be more mmio.
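A toy sketch of that idea: a direct-mapped cache sitting between an unmodified 6502 and slow RAM, pulling RDY low on a miss. The geometry and interface here are made up for illustration:

```python
# Hypothetical external cache for a 6502: on a hit, RDY stays high and
# the CPU proceeds at full speed; on a miss, the cache would hold RDY
# low while it fills the line from slow backing RAM.
# The geometry (256 lines x 8 bytes) is an arbitrary example.
class ExternalCache:
    def __init__(self, num_lines: int = 256, line_size: int = 8):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # one tag per direct-mapped line
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """True = hit (RDY stays high), False = miss (stall via RDY)."""
        line_addr = addr // self.line_size
        index = line_addr % self.num_lines
        if self.tags[index] == line_addr:
            self.hits += 1
            return True
        self.tags[index] = line_addr  # fill the line from backing RAM
        self.misses += 1
        return False

# Sequential code fetch is cache-friendly: 64 accesses -> only 8 misses.
cache = ExternalCache()
for addr in range(64):
    cache.access(addr)
```

As Someone notes above, though, a cache of any modern size would already rival the 64 kB address space, which is part of why this only works as a thought experiment.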
        
           | masklinn wrote:
           | > The 6502 doesn't have a cache
           | 
           | I don't think CPU caches were much of a thing back then, at
           | least in the segments involved. AFAIK Intel wouldn't get on-
           | die cache until the 486 (the 386 did support an external
           | cache; I think IBM's licensed variants also had internal
           | caches, but those were only for IBM to use).
           | 
           | The most distinguishing feature of the 6502 is that it had
           | almost no _registers_ and used external memory for most of
           | the working set (it had shorter encodings for the first page,
           | and IIRC later integrators would use faster memory for the
           | zero page).
        
             | FartyMcFarter wrote:
             | > I don't think CPU caches were much of a thing back then
             | 
             | Indeed, because memory was fast enough to respond within a
             | single cycle back then. Or alternatively, CPU cycles were
             | slow enough for that to happen depending on how you want to
             | look at it :)
        
           | vidarh wrote:
           | You'd probably want to add at least a small cache, sure.
           | Putting the zero page (256 bytes) on-die on each core
           | (because it's typically used as "extra registers" of sort in
           | 6502 code) plus a small additional instruction cache could do
           | wonders. But you probably still wouldn't get anywhere near
           | enough performance for it to be competitive with a more
           | complex modern architecture.
           | 
           | It'd be a fun experiment to see someone do in an FPGA,
           | though.
        
           | bitwize wrote:
           | 6502 systems had RAM that ran at system speed. Everything
           | was so much slower then that this was feasible... plus,
           | some systems (like the Commodore VIC-20) used fast static
           | RAM.
           | 
           | The 6502 even had an addressing mode wherein accesses to the
           | first 256 bytes of RAM -- the "zero page" -- were much faster
           | due to not needing address translation, to the point where it
           | was like having 256 extra CPU registers. Critical state for
           | hot loops and such was often placed in the zero page.
           | 
           | Do not presume to apply principles of modern systems to these
           | old systems. They were very different.
        
             | vidarh wrote:
             | The saving for the zero page is only the more compact
             | encoding, which saves one cycle per memory access (for
             | the non-indexed variants, anyway).
             | 
             | E.g. LDA $42 is encoded as $A5 $42, and so takes 3
             | cycles: two to read the instruction, one to read from
             | address $42. LDA $1234 is encoded as $AD $34 $12, and so
             | takes 4 cycles: three to read $AD $34 $12, and one to
             | read the contents of address $1234. Same for the other
             | zeropage vs. equivalent absolute instructions.
             | 
             | See e.g. timing chart for LDA here:
             | 
             | http://www.romdetectives.com/Wiki/index.php?title=LDA
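The timings above can be summarized in a small table; the pattern (one cycle per instruction byte fetched, plus one data read) holds for both non-indexed modes:

```python
# Cycle counts for non-indexed LDA, as quoted above: each instruction
# byte fetched costs one cycle, plus one cycle for the data read.
LDA_TIMING = {
    "zeropage": {"opcode": 0xA5, "bytes": 2, "cycles": 3},  # LDA $42
    "absolute": {"opcode": 0xAD, "bytes": 3, "cycles": 4},  # LDA $1234
}

for mode, t in LDA_TIMING.items():
    # cycles = instruction bytes + 1 data access, for these modes
    assert t["cycles"] == t["bytes"] + 1
```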
        
           | jdblair wrote:
           | The 6502 didn't need a cache because its clock speed was
           | slower than the DRAM connected to it. That made memory
           | accesses very inexpensive.
           | 
           | The Apple II took advantage of the speed difference between
           | CPU and DRAM and designed the video hardware to read from
           | memory every other memory cycle, interleaved with CPU memory
           | access.
        
             | Vogtinator wrote:
             | > The Apple II took advantage of the speed difference
             | between CPU and DRAM and designed the video hardware to
             | read from memory every other memory cycle, interleaved with
             | CPU memory access.
             | 
             | Same for the C64. Sometimes it was necessary to read
             | slightly more than that for the video display though, so
             | the VIC (video chip) had to pause the CPU for a bit,
             | resulting in so-called "badlines".
        
             | smolder wrote:
             | I think they used SRAM and not DRAM, which is how it was
             | faster than a clock cycle.
        
               | Gracana wrote:
               | Nope, they used 16k x 1-bit DRAM ICs.
        
               | monocasa wrote:
               | Both were faster. DRAM was a little cheaper, but
               | required more circuitry to handle refresh on most CPUs,
               | making it a wash cost-wise on some designs. Typical Woz
               | engineering got the cost of the refresh circuitry down
               | to where DRAM made sense economically on the Apple ][.
               | 
               | Interestingly, Z80s had DRAM refresh circuitry built
               | in, which was one reason for their prevalence.
        
               | peterfirefly wrote:
               | Screen output _was_ the DRAM refresh.
               | 
               | And for the Z80: also that it only needed GND and 5V. The
               | 8080 also needed 12V. And the Z80 only needed a single
               | clock phase -- the 8080 needed two. The 6502 also only
               | needed 5V and a single clock input (the 6800 needed two
               | clock phases). The 6502 and Z80 were simply a lot easier
               | to work with than most of the competition.
        
         | causality0 wrote:
         | This comment makes me want to read an article focusing on how
         | performance-per-transistor has changed over time.
        
         | vidarh wrote:
         | For an idea in that vein (many 6502's on a die), look at (the
         | sadly defunct) Epiphany, in the Parallela [1] (the buy links
         | are still there but I doubt any are still available).
         | 
         | I have two, and they're fun little toys, and I wish they'd
         | gotten further - with the caveat that actually making use of
         | them in a way that'd be better than just resorting to a GPU
         | is hard.
         | 
         | The 16-core Epiphany[2] in the Parallella is too small, but
         | they were hoping for a 1k- or 4k-core version.
         | 
         | I'm saying it's a similar idea because the Epiphany cores had
         | 32KB on-die RAM per core, and a predictable "single cycle per
         | core traversed" (I think; I've not checked the docs in
         | years) cost
         | of accessing RAM on the other cores, arranged in a grid. Each
         | core is very simple, though still nowhere near the simplicity
         | of a 6502 (they're 32-bit, with a much more "complete" modern
         | instruction set).
         | 
         | The challenge is finding a problem that 1) decomposes well
         | enough to benefit from many cores despite low RAM per core (if
         | you need to keep hitting main system memory or neighbouring
         | core RAM, you lose performance fast), 2) _does not_ decompose
         | well into SIMD style processing where a modern GPU would make
         | mincemeat of it, and 3) is worth the hassle of figuring out how
         | to fit your problem into the memory constraints.
         | 
         | I don't know if this is an idea that will ever work - I suspect
         | the problem set where they could potentially compete is too
         | small, but I really wish we'd see more weird hardware
         | attempts like this anyway.
         | 
         | [1] https://parallella.org/
         | 
         | [2] https://www.adapteva.com/docs/e16g301_datasheet.pdf
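The "single cycle per core traversed" cost mentioned above can be modelled as a Manhattan-distance hop count on the grid; this is a simplification of my own for illustration, not taken from the Epiphany docs:

```python
# Simplified cost model for a mesh of cores with local RAM: accessing
# another core's RAM costs one extra cycle per core traversed.
# (Illustrative only; real NoC routing and contention are ignored.)
def remote_access_cycles(src: tuple, dst: tuple,
                         base_cycles: int = 1) -> int:
    """Cycles to reach RAM on core `dst` from core `src` on a 2D grid."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return base_cycles + hops

# Local access is cheap; opposite corners of a 64x64 grid pay 126 extra
# hops, which is why workloads must keep their memory traffic local.
local = remote_access_cycles((0, 0), (0, 0))    # 1 cycle
far = remote_access_cycles((0, 0), (63, 63))    # 127 cycles
```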
        
           | LeonenTheDK wrote:
           | That reminds me of the Green Arrays computer with 144 cores,
           | programmed in Color Forth.
           | 
           | Chuck Moore has a talk on it here, very interesting stuff:
           | https://www.youtube.com/watch?v=0PclgBd6_Zs
           | 
           | I'm not sure how far this has gone since then. The site is
           | still up with buy links as well:
           | https://www.greenarraychips.com
        
             | vidarh wrote:
             | Yeah, I think they're even more minimalist. It's an
             | interesting design space, but hard to see it taking off,
             | especially with few developers comfortable with thinking
             | about the kind of minimalism required to take advantage
             | of these kinds of chips.
        
               | LeonenTheDK wrote:
               | Completely agree, it's an almost total shift from
               | everything that's in use these days. Very interesting to
               | play with though, I'd love to see some real world use.
        
           | RetroTechie wrote:
           | And 4) how to handle problems that _don't_ map well onto a
           | massively-parallel machine. Some problems / algorithms are
           | inherently serial by nature.
           | 
           | Some applications like games decompose relatively easily
           | into separate jobs like audio, video, user input, physics,
           | networking, prefetching game data etc. Further decomposing
           | those tasks... _not_ so easy. So e.g. 4..8 cores are
           | useful. 100 or 1k+ cores otoh... hmmm.
        
             | vidarh wrote:
             | True, but I'd expect for a chip like that you'd do what the
             | parallella did, or what we do with CPU + GPU and pair it
             | with a chip with a smaller number of higher powered cores.
             | E.g. the Parallella had 2x ARM cores along with the 16x
             | Epiphany cores.
        
             | convolvatron wrote:
             | for control parallelism. for data parallelism you get to
             | soak up as many cores as you want if you can structure your
             | problem the right way.
             | 
             | personally I'd love to see more exploration of hybrid
             | approaches. the Tera MTA supported both modes pretty well.
        
       | 3cats-in-a-coat wrote:
       | I'm trying to be wowed by this, but... yes, modern tech be
       | complex. A modern website has more source code than the first few
       | versions of Windows put together.
        
         | serf wrote:
         | anecdote: most 'high volume source code' sites i've
         | encountered are that way because of a need to meet standards
         | on a large number of devices, compliance regulations, and
         | scaffolding/boiler-plate; 90% of it being dead-code
         | boilerplate.
         | 
         | the logic comes at the very tail end, and often it is
         | exceedingly limited, handing off most of the work to third
         | party APIs/whatever.
         | 
         | I guess what i'm trying to say is that source-code volume is a
         | lousy metric for ascertaining 'complexity'; something can be
         | huge and cumbersome but still only use simple logic that's easy
         | to follow once you get past the cruft.
        
           | 3cats-in-a-coat wrote:
           | What you describe is basically how DNA works. Most of it is
           | inactive junk. Parts of it, however, get activated when...
           | who knows what happens.
           | 
           | Basically we got coded by millions of interns.
        
             | gary_0 wrote:
             | And some of the code is never called, but removing it
             | causes weird crashes because the cell.exe compiler is
             | finicky. And some of the code shouldn't be there, but it
             | was left behind by a virus infection that wasn't completely
             | cleared out.
        
           | rvba wrote:
           | Most are done this way since they are outsourced to the
           | cheapest bidders, who "glue" multiple technologies poorly
           | on top of each other. And then add tons of ad / tracker /
           | analytics code.
           | 
           | Lots of websites don't care that their technology is
           | terrible.
        
       | b5n wrote:
       | There are parts that should be read, but when you only need a
       | reference a good lookup tool is invaluable:
       | 
       | https://github.com/skeeto/x86-lookup
        
       | amelius wrote:
       | It probably also has more bugs than the 6502 has transistors :)
        
         | masklinn wrote:
         | All processors have bugs. Despite its simplicity, the 6502
         | was hardly bug-free:
         | https://en.wikipedia.org/wiki/MOS_Technology_6502#Bugs_and_q...
        
       | kens wrote:
       | Wow, I was surprised to see my article from 2013 appear on HN
       | today! (It's hard to believe that the article is 10 years old.)
       | In that time, the x86 documentation has expanded from four
       | volumes to 10 volumes. Curiously, the number of pages has only
       | grown by 21% (4181 pages to 5066). The x86 instruction set has
       | added a bunch of features: AVX, FMA3, TSX, BMI, VNNI, and CET.
       | But I guess the number of new instructions is relatively small
       | compared to the existing ones.
        
       | tos1 wrote:
       | Update to the 2013 numbers: As of today, the Intel SDM
       | (https://cdrdv2-public.intel.com/782158/325462-sdm-vol-1-2abc...)
       | has 5066 pages.
        
       ___________________________________________________________________
       (page generated 2023-08-05 23:01 UTC)