[HN Gopher] Execution units are often pipelined
       ___________________________________________________________________
        
       Execution units are often pipelined
        
       Author : ingve
       Score  : 132 points
       Date   : 2024-12-27 09:02 UTC (3 days ago)
        
 (HTM) web link (blog.xoria.org)
 (TXT) w3m dump (blog.xoria.org)
        
       | Vogtinator wrote:
       | For x86 cores this is visible in Agner Fog's instruction
       | performance tables: https://agner.org/optimize/#manuals
       | 
        | The latency shows after how many cycles the result of an
        | instruction can be consumed by another instruction, while the
        | throughput shows how many such instructions can be started per
        | cycle, i.e. executed in parallel.
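        | 
        | For illustration (a minimal sketch added here, not part of the
        | parent comment; it assumes a 64-bit target and a compiler that
        | keeps the loop as written), the latency side shows up as a
        | dependent chain:
        | 
        |     #include <stdint.h>
        | 
        |     /* Each multiply depends on the previous result, so this
        |        loop runs at roughly one multiply *latency* per
        |        iteration (often ~3 cycles on recent x86 cores), even
        |        though the core could start a new, independent multiply
        |        every cycle. Seed x with an odd value so it never
        |        collapses to 0. */
        |     uint64_t dependent_chain(uint64_t x, long iters) {
        |         for (long i = 0; i < iters; i++)
        |             x *= x;
        |         return x;
        |     }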
        
         | wtallis wrote:
         | I believe the throughput shown in those tables is the total
         | throughput for the whole CPU core, so it isn't immediately
         | obvious which instructions have high throughput due to
         | pipelining within an execution unit and which have high
         | throughput due just to the core having several execution units
         | capable of handling that instruction.
        
           | BeeOnRope wrote:
            | That's true, but another part of the tables shows how many
            | "ports" the operation can be executed on, which is enough
            | information to conclude that an operation is pipelined.
           | 
           | For example, for many years Intel chips had a multiplier unit
           | on a single port, with a latency of 3 cycles, but an inverse
           | throughput of 1 cycle, so effectively pipelined across 3
           | stages.
           | 
           | In any case, I think uops.info [1] has replaced Agner for up-
           | to-date and detailed information on instruction execution.
           | 
           | ---
           | 
           | [1] https://uops.info/table.html
        
             | Earw0rm wrote:
             | Shame it doesn't seem to have been updated with Arrow Lake,
             | Zen 5 and so on yet.
        
               | BeeOnRope wrote:
               | Yes. In the past new HW has been made available to the
               | uops.info authors in order to run their benchmark suite
               | and publish new numbers: I'm not sure if that just hasn't
               | happened for the new stuff, or if they are not interested
               | in updating it.
        
         | ajross wrote:
         | FWIW, there are _two_ ideas of parallelism being conflated
         | here. One is the parallel execution of the different sequential
         | steps of an instruction (e.g. fetch, decode, operate, retire).
          | That's "pipelining", and it's a different idea than decoding
         | multiple instructions in a cycle and sending them to one of
         | many execution units (which is usually just called "dispatch",
         | though "out of order execution" tends to connote the same idea
         | in practice).
         | 
          | The Fog tables try hard to show the former, not the latter. You
         | measure dispatch parallelism with benchmarks, not microscopes.
         | 
         | Also IIRC there are still some non-pipelined units in Intel
         | chips, like the division engine, which show latency numbers ~=
         | to their execution time.
        
           | BeeOnRope wrote:
            | I don't think anyone is talking about "fetch, decode,
            | operate, retire" pipelining (though that is certainly called
            | pipelining): only pipelining within the execution of an
            | instruction that takes multiple cycles just to execute (i.e.,
            | latency from input-ready to output-ready).
           | 
            | Pipelining in stages like fetch and decode is mostly hidden
            | in these small benchmarks, but becomes visible when there are
            | branch mispredictions, other types of flushes, I$ misses and
            | so on.
        
             | ajross wrote:
             | > I don't think anyone is talking about "fetch, decode,
             | operate, retire" pipelining (though that is certainly
              | > called pipelining): only pipelining within the execution
              | > of an instruction that takes multiple cycles just to execute
             | (i.e., latency from input-ready to output-ready).
             | 
             | I'm curious what you think the distinction is? Those
             | statements are equivalent. The circuit implementing "an
             | instruction" can't work in a single cycle, so you break it
             | up and overlap sequentially issued instructions. Exactly
             | what they do will be different for different hardware,
             | sure, clearly we've moved beyond the classic four stage
             | Patterson pipeline. But that doesn't make it a different
             | kind of pipelining!
        
               | formerly_proven wrote:
               | Independently scheduled and queued execution phases are
               | qualitatively different from a fixed pipeline.
        
               | gpderetta wrote:
               | An OoO design is qualitatively different from an in-order
               | one because of renaming and dynamic scheduling, but the
               | pipelining is essentially the same and for the same
               | reasons.
        
               | BeeOnRope wrote:
               | We are interested in the software visible performance
               | effects of pipelining. For small benchmarks that don't
               | miss in the predictors or icache, this mostly means
               | execution pipelining. That's the type of pipelining the
               | article is discussing and the type of pipelining
                | considered in the instruction performance breakdowns
                | published by Agner and uops.info, simulated by LLVM-MCA,
                | etc.
               | 
               | I.e., a lot of what you need to model for tight loops
               | only depends on the execution latencies (as little as 1
               | cycle), and not on the full pipeline end-to-end latency
               | (almost always more than 10 cycles on big OoO, maybe more
               | than 20).
        
           | gpderetta wrote:
           | > and it's a different idea than decoding multiple
           | instructions in a cycle and sending them to one of many
           | execution units (which is usually just called "dispatch",
           | though "out of order execution
           | 
           | Being able to execute multiple instructions is more properly
           | superscalar execution, right? In-order designs are also
            | capable of doing it and the separate execution units do not
           | even need to run in lockstep (consider the original P5 U and
           | V pipes).
        
             | ajross wrote:
             | In-order parallel designs are "VLIW". The jargon indeed
             | gets thick. :)
             | 
             | But as to OO: the whole idea of issuing _sequential_
             | instructions in parallel means that the hardware needs to
              | track dependencies between them so they can't race ahead
             | of their inputs. And if you're going to do that anyway,
             | allowing them to retire out of order is a big
             | performance/transistor-count win as it allows the pipeline
             | lengths to be different.
        
               | gpderetta wrote:
               | VLIW is again a different thing. It uses a single
               | instruction that encodes multiple independent operations
               | to simplify decoding and tracking, usually with exposed
               | pipelines.
               | 
               | But you can have, for example, a classic in-order RISC
               | design that allows for parallel execution. OoO renaming
               | is not necessary for dependency tracking (in fact even
               | scalar in order CPUs need dependency tracking to solve
                | RAW and other hazards); it is "only" needed for executing
               | around stalled instructions (while an in order design
               | will stall the whole pipeline).
               | 
                | Again, P5 (i.e. the original Pentium) was a very
               | traditional in order design, yet could execute up to two
               | instructions per cycle.
        
               | ajross wrote:
               | > VLIW is again a different thing.
               | 
                | No it isn't. I'm being very deliberate here in refusing
                | pedantry. _In practice_, "multiple dispatch" means "OO"
               | in the same way that "VLIW" means "parallel in order
               | dispatch". Yes, you can imagine hypothetical CPUs that
               | mix the distinction, but they'd be so weird that they'd
               | never be built. Discussing the jargon without context
               | only confuses things.
               | 
               | > you can have, for example, a classic in-order RISC
               | design that allows for parallel execution.
               | 
               | Only by inventing VLIW, though, otherwise there's no way
               | to tell the CPU how to order what it does. Which is my
               | point; the ideas are joined at the hip. Note that the
               | Pentium had two defined pipes with specific rules about
               | how the pairing was encoded in the instruction stream. It
               | was, in practice, a VLIW architecture (just one with a
               | variable length encoding and where most of the available
               | instruction bundles only filled one slot)! Pedantry hurts
               | in this world, it doesn't help.
        
               | gpderetta wrote:
                | I'm sorry, but if P5 was VLIW then the word has lost all
                | meaning. They couldn't possibly be more different.
        
             | wtallis wrote:
              | Right; it's easy to forget that superscalar CPU cores don't
              | actually have to be out-of-order, but most of them _are_ out-
              | of-order because that's usually necessary to make good use of
              | a wide superscalar core.
             | 
             | (What's the best-performing in-order general purpose CPU
             | core? POWER6 was notably in-order and ran at quite high
             | clock speeds for the time. Intel's first-gen Atom cores
              | were in-order and came out around the same time as POWER6,
              | but at half the clock speed. SPARC T3 ran at an even lower
              | clock speed.)
        
               | gpderetta wrote:
               | POWER6 might indeed have been the last In-Order speed
               | demon.
        
           | eigenform wrote:
           | > Also IIRC there are still some non-pipelined units in Intel
           | chips, like the division engine, which show latency numbers
           | ~= to their execution time
           | 
           | I don't think that's accurate. That latency exists _because_
           | the execution unit is pipelined. If it were not pipelined,
           | there would be no latency. The latency corresponds to the
           | fact that  "doing division" is distributed across multiple
           | clock cycles.
        
       | delusional wrote:
        | Is this still the case if I have different ALU operations? Say
        | I have a single ALU on a single x86 core. Would the ALU be able
        | to interleave, say, ADDs and MULs? Or would I incur the latency
        | penalty for each operation switch?
        | 
        | I know that some ALUs have multiple ADD complexes, and I assume
        | that would influence the answer, hence why I specified x86.
        
         | barbegal wrote:
          | You need to be far more specific than x86; x86 is just the
          | instruction set, and the actual architecture can vary massively
          | for the same instruction set.
         | 
         | In general though there is no penalty for interleaved
         | operations.
        
         | bjourne wrote:
         | A modern ALU has multiple pipelined data paths for its
         | operations. So maybe three adders with a one-cycle latency, two
         | multipliers with a three-cycle latency, and one divider with a
          | 16-cycle latency. Sustained throughput depends on the
          | operation: maybe one per cycle for add and multiply, but only
          | one every eight cycles for divide.
        
         | BeeOnRope wrote:
         | Yes, it applies to different operations. E.g. you could
         | interleave two or three different operations with 3 cycle
         | latency and 1 cycle inv throughput on the same port and get 1
         | cycle inv throughput in aggregate for all of them. There is no
          | restriction that they must be the same operation.
         | 
         | In some cases mixing operations with _different_ latencies on
         | the same execution port will leave you with less throughput
         | than you expect due to "writeback conflicts", i.e., two
         | instructions finishing on the same cycle (e.g., a 2 cycle
         | operation starting on cycle 0 and a 1 cycle operation on cycle
          | 1 will both finish on cycle 2, and in some CPUs this will delay
         | the results of one of the operations by 1 cycle due to a
         | conflict).
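          | 
          | A minimal sketch of that interleaving (illustrative, not from
          | the parent comment; it assumes 64-bit multiplies with ~3-cycle
          | latency and 1-cycle inverse throughput, and a compiler that
          | keeps the three chains as written):
          | 
          |     #include <stdint.h>
          | 
          |     /* Three independent chains advanced each iteration: while
          |        one multiply is still in flight, the other two can enter
          |        the same pipelined unit, so the loop can approach one
          |        multiply per cycle in aggregate instead of one every
          |        three cycles. */
          |     uint64_t interleaved_chains(uint64_t a, uint64_t b,
          |                                 uint64_t c, long iters) {
          |         for (long i = 0; i < iters; i++) {
          |             a *= a;
          |             b *= b;
          |             c *= c;
          |         }
          |         return a ^ b ^ c;
          |     }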
        
         | Tuna-Fish wrote:
          | You can interleave most operations however you like, without
          | any extra latency, each op starting as soon as all of its
          | inputs are ready, regardless of where they were computed.
         | 
          | There are some exceptions where "domain crossing", i.e. using a
          | very different kind of operation, costs an extra clock cycle or
          | more. Notably, in vector registers, using an FP or integer
          | operation on the result of the other type of op: modern CPUs
          | don't actually hold FP values in their IEEE 754 interchange
          | format inside registers; instead the registers have hidden
          | extra bits and store all values in normalized form, allowing
          | fast operations on denormals. The downside is that if you
          | alternate between FP and INT operations, the CPU has to insert
          | extra conversion ops between them. Typical cost is 1-3 cycles
          | per domain crossing.
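          | 
          | A sketch of the kind of mix that can hit such a penalty
          | (illustrative only; it assumes SSE2 intrinsics, and whether
          | there is any cost at all depends on the microarchitecture):
          | 
          |     #include <immintrin.h>
          | 
          |     /* The casts emit no instructions; the potential penalty
          |        comes from feeding an FP result (addps) into an integer
          |        op (pslld) and then feeding that back into FP math
          |        (mulps). The shifted value is meaningless as a float;
          |        this only illustrates the domain mix. */
          |     __m128 cross_domains(__m128 a, __m128 b) {
          |         __m128  sum  = _mm_add_ps(a, b);             /* FP   */
          |         __m128i bits = _mm_castps_si128(sum);        /* cast */
          |         __m128i shl  = _mm_slli_epi32(bits, 1);      /* int  */
          |         return _mm_mul_ps(_mm_castsi128_ps(shl), a); /* FP   */
          |     }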
        
       | barbegal wrote:
       | These days CPUs are so complex and have so many interdependencies
       | that the best way to simulate them is simply to run them!
       | 
       | In most real code the high throughput of these sorts of
       | operations means that something else is the limiting factor. And
       | if multiplier throughput is limiting performance then you should
       | be using SIMD or a GPU.
        
         | sroussey wrote:
         | > These days CPUs are so complex and have so many
         | interdependencies that the best way to simulate them is simply
         | to run them!
         | 
         | True. You can imagine how difficult it is for the hardware
         | engineer designing and testing these things before production!
        
         | alain94040 wrote:
         | Very true. To paraphrase a saying, CPU amateurs argue about
          | micro-benchmarks on HN; the pros simulate real code.
        
           | pizlonator wrote:
           | Or even just A:B test real code.
        
           | devit wrote:
           | The amateurs usually run benchmarks (because they can't
           | reason about it as they lack the relevant knowledge) and
            | believe they got a useful result on some aspect when in
            | fact the benchmark usually depends on other arbitrary random
            | factors (e.g. maybe they think they are measuring FMA
            | throughput, but are in fact measuring whether the compiler
            | autovectorizes or whether it fuses multiplies and adds
            | automatically).
           | 
           | A pro would generally only run benchmarks if it's the only
           | way to find out (or if it's easy), but isn't going to trust
           | it unless there's a good explanation for the effects, or
           | unless they actually just want to compare two very specific
           | configurations rather than coming up with a general finding.
        
             | LegionMammal978 wrote:
             | Then again, 'reasoning about it' can easily go awry if your
             | knowledge doesn't get updated alongside the CPU
             | architectures. I've seen people confidently say all sorts
             | of stuff about optimization that hasn't been relevant since
             | the early 2000s. Or in the other direction, some people
             | treat modern compilers/CPUs like they can perform magic, so
             | that it makes no difference what you shovel into them
             | (e.g., "this OOP language has a compiler that always knows
             | when to store values on the stack"). Benchmarks can help
             | dispel some of the most egregious myths, even if they are
             | easy to misuse.
        
         | Joker_vD wrote:
         | > the best way to simulate them is simply to run them!
         | 
         | And it's quite sad because when you are faced with choosing
         | between two ways to express something in the code, you can't
         | predict how fast one or another option will run. You need to
          | actually run both, preferably in an environment close to prod,
          | and under similar load, to get an accurate idea of which one is
          | more performant.
         | 
         | And the worst thing is, you most likely can't extract any
         | useful general principle out of it, because any small
          | perturbation in the problem will result in code that is very
         | similar yet has completely different latency/throughput
         | characteristics.
         | 
         | The only saving grace is that modern computers are really
         | incredibly fast, so layers upon layers of suboptimal code
         | result in applications that mostly perform okay, with maybe
         | some places where they perform egregiously slow.
        
           | mpweiher wrote:
           | > you can't predict how fast one or another option will run
           | 
           | "The best way to predict the future is to invent it" -- Alan
           | Kay
           | 
           | > You need to actually run both
           | 
           | Always! If you're not measuring, you're not doing performance
           | optimization. And if you think CPUs are bad: try benchmarking
           | I/O.
           | 
           | Operating System (n) -- Mechanism designed specifically to
           | prevent any meaningful performance measurement (every
           | performance engineer ever)
           | 
           | If it's measurable and repeatable, it's not meaningful. If
           | it's meaningful, it's not measurable or repeatable.
           | 
           | Pretty much.
           | 
           | Or put another way: an actual stop watch is a very meaningful
           | performance measurement tool.
        
       | codedokode wrote:
       | What I am interested to know is who invented pipelining? I tried
       | googling but without much success. Does anybody know?
        
         | paulsutter wrote:
         | Probably the first person to ask, 'how can I speed up this
         | processor, maybe there's a way to do more than one processing
         | step at a time for each instruction'
        
           | zelos wrote:
           | Presumably by analogy to production lines in factories?
        
             | sroussey wrote:
             | Actually, it's just obvious when you design the most basic
             | ALU and the circuits for doing the carry.
        
               | codedokode wrote:
               | Pipelining requires inserting extra (costly) flip-flops
               | and intuitively one might think that this would decrease
               | the performance. It is not that easy.
        
             | NotCamelCase wrote:
             | Is it really that much different than e.g. a cook preparing
             | carrots while onions are being cooked? Nobody needed to
             | invent this as far as my knowledge of history goes; so,
             | maybe there really is nothing new under the sun :)
             | 
             | Joking aside, of course, the ingenuity lies in finding a
              | fitting solution to the problem at hand, as well as knowing how
             | to apply it.
        
           | adrian_b wrote:
           | Actually there are 2 ways to do more than one processing step
           | at a time: parallel execution and pipelined (a.k.a.
           | overlapped) execution.
           | 
           | Both ways had been used for many centuries for the
           | manufacturing of complex things, as ways to organize the work
           | of multiple workers.
        
             | taneq wrote:
             | Slightly offtopic but did we go to uni together and team up
             | on a couple of demo compos and VJ some events? If so hey
             | it's been ages and whats up? :D If not, cool name and
             | stuff. :)
        
         | PittleyDunkin wrote:
         | I'm guessing https://en.wikipedia.org/wiki/Donald_B._Gillies is
         | a good bet for this claim.
        
           | adrian_b wrote:
           | Nope.
           | 
           | 1957 in a transistorized computer is far too late for the
           | origin of pipelining.
           | 
           | Pipelining had already been used in computers with vacuum
           | tubes a few years before and it had also been used already a
           | decade earlier in computers with electromechanical relays,
           | i.e. IBM SSEC, which had a 3-stage pipeline for the execution
           | of its instructions (IBM SSEC had a Harvard architecture,
           | with distinct kinds of memories for program and for data).
           | 
           | Before 1959, pipelining was named "overlapped execution".
           | 
           | The first use of the word "pipeline" was as a metaphor in the
           | description of the IBM Stretch computer, in 1959. The word
           | "pipeline" began to be used as a verb, with forms like
           | "pipelining" and "pipelined" around 1965. The first paper
           | where I have seen such a verbal use was from the US Army.
           | 
           | Outside computing, pipelining as a method of accelerating
           | iterative processes had been used for centuries in the
           | progressive assembly lines, e.g. for cars, and before that
           | for ships or engines.
           | 
           | Pipelined execution and parallel execution are dual methods
           | for accelerating an iterative process that transforms a
           | stream of data. In the former the process is divided into
           | subprocesses through which the stream of data passes in
           | series, while in the latter the stream of data is divided
           | into substreams that go in parallel through multiple
           | instances of the process.
        
             | cogman10 wrote:
             | The earliest "pipeline" example I can find is the
              | Portsmouth Block Mills [1], implemented by Marc
              | Brunel [2] / Henry Maudslay [3] around 1802.
             | 
             | [1] https://en.wikipedia.org/wiki/Portsmouth_Block_Mills
             | 
             | [2] https://en.wikipedia.org/wiki/Marc_Isambard_Brunel
             | 
             | [3] https://en.wikipedia.org/wiki/Henry_Maudslay
        
               | bjourne wrote:
               | Before that you had bucket brigades.
        
             | formerly_proven wrote:
             | Zuse Z3 had a three stage instruction pipeline.
        
             | Voultapher wrote:
             | > Pipelined execution and parallel execution are dual
             | methods for accelerating an iterative process that
             | transforms a stream of data. In the former the process is
             | divided into subprocesses through which the stream of data
             | passes in series, while in the latter the stream of data is
             | divided into substreams that go in parallel through
             | multiple instances of the process.
             | 
             | With parallel do you mean superscalar execution i.e. having
             | two ALUs or do you mean having multiple cores?
        
         | nxobject wrote:
         | EDIT: Adrian_B's sibling comment has even earlier historical
         | examples that illustrate how architecture ideas are as old as
         | the hills.
         | 
         | There are good historical hardware architecture textbooks:
         | Peter Kogge's "The Architecture of Pipelined Computers" ('81)
          | mentions that the UNIVAC I pipelined IO (from which dedicated
         | IO units sprung out), the IBM 7094 similarly interleaved
         | multiple-state memory access cycles between multiple memory
         | units to increase throughput, and then the IBM STRETCH had a
          | two-part fetch-execute without attempting to address hazards.
         | 
          | So pipelining came out of the ad-hoc application of the patterns
         | of interleaving and hiving off functionality to dedicated
         | units. In general, a lot of hardware architecture ideas cropped
         | up very early in big iron, with microcoding even co-existing
         | with them, without all of the extra full ideas being worked out
         | (e.g. forwarding, hazard resolution of all types, etc.)
         | 
         | There's a whole fun universe of vintage hardware architecture
          | textbooks that teach the essential concepts through
         | historical big iron, if you're willing to look beyond Hennessy
         | and Patterson and Shen and Lipasti! I am thinking that the
         | latter might have references to other historical textbooks on
         | pipelining, though. "The Anatomy of a High-Performance
         | Microprocessor", a pedagogical deep-dive into the design of the
         | AMD K6 uarch (up to detailed algorithms and HDL!), will have
         | good references to historical textbooks too since these authors
         | were trained before the crop of "modern" (microprocessor-era)
         | textbooks.
        
         | AtlasBarfed wrote:
         | Come on, this isn't like the invention of calculus. Pipelining
         | is just assembly line processing that anyone who managed or
         | designed any factory would have implemented as soon as the
         | transistor real estate became available.
        
       | saagarjha wrote:
        | Unrelated, but for benchmarks like these you can just ask the
        | kernel how many clock cycles you've used, via proc_pid_rusage,
        | rather than hardcoding it.
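        | 
        | A minimal sketch of that (illustrative; it assumes macOS, where
        | proc_pid_rusage with the RUSAGE_INFO_V4 flavor reports ri_cycles
        | and ri_instructions for the calling process):
        | 
        |     #include <libproc.h>
        |     #include <sys/resource.h>
        |     #include <stdio.h>
        |     #include <unistd.h>
        | 
        |     int main(void) {
        |         struct rusage_info_v4 ri;
        |         /* Ask the kernel how many cycles and instructions this
        |            process has used so far. */
        |         if (proc_pid_rusage(getpid(), RUSAGE_INFO_V4,
        |                             (rusage_info_t *)&ri) == 0) {
        |             printf("cycles: %llu  instructions: %llu\n",
        |                    (unsigned long long)ri.ri_cycles,
        |                    (unsigned long long)ri.ri_instructions);
        |         }
        |         return 0;
        |     }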
        
       | taneq wrote:
       | Is anything important not pipelined?
        
         | monocasa wrote:
         | Operations that are important, but take a single cycle.
        
         | BeeOnRope wrote:
          | Division and gather are multi-cycle instructions which have
          | typically had little or no pipelining on x86.
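          | 
          | For contrast with the pipelined-multiply case (a sketch, not
          | from the parent comment; the divisor is a runtime parameter so
          | the compiler emits real divide instructions -- call it with
          | d != 0):
          | 
          |     #include <stdint.h>
          | 
          |     /* Three independent division chains: on cores where the
          |        divider is not pipelined, interleaving them barely
          |        helps, because the unit only accepts a new divide every
          |        N cycles regardless of dependencies. */
          |     uint64_t interleaved_divs(uint64_t a, uint64_t b,
          |                               uint64_t c, uint64_t d,
          |                               long iters) {
          |         for (long i = 0; i < iters; i++) {
          |             a = a / d + 1;
          |             b = b / d + 3;
          |             c = c / d + 5;
          |         }
          |         return a ^ b ^ c;
          |     }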
        
         | syncsynchalt wrote:
         | Memory barriers come to mind.
        
       | ryukoposting wrote:
       | If the two ALUs could feed results into each other, that would
       | make things _really_ interesting, especially when you consider
       | the way it would affect reordering.
       | 
        | Imagine if the OP's benchmark had two of those multiplication
        | chains on independent registers:
        | 
        |     mul x1, x0, x0 // a
        |     mul x2, x1, x1 // b
        |     mul x3, x2, x2 // c
        |     mul x4, x3, x3 // d
        | 
        |     mul x6, x5, x5 // e
        |     mul x7, x6, x6 // f
        |     mul x8, x7, x7 // g
        |     mul x9, x8, x8 // h
       | 
       | If you had two independent ALUs, my intuition is that you'd want
       | to dispatch the instructions as AEBFCGDH, interleaving the two
       | chains.
       | 
       | But, if the two ALUs could feed each other, perhaps you'd
       | actually want to leave it in ABCDEFGH order? Hmm.
        
         | monocasa wrote:
         | They are able to do this via what's called a bypass network.
        
         | wtallis wrote:
         | > If you had two independent ALUs, my intuition is that you'd
         | want to dispatch the instructions as AEBFCGDH, interleaving the
         | two chains. But, if the two ALUs could feed each other, perhaps
         | you'd actually want to leave it in ABCDEFGH order? Hmm.
         | 
         | Once the frontend has determined what the dependencies are
         | between instructions, you no longer have a linear sequence of
         | instructions being dispatched to execution units but a DAG, and
         | with multiple instructions starting on the same clock cycle the
         | execution order would be more like (AE)(BF)(CG)(DH) without any
         | ordering between instructions in the same parenthesized group.
         | 
         | The reorder buffer in the CPU will be large enough to handle
         | the compiler emitting either ABCDEFGH or AEBFCGDH, but the
         | interleaved ordering is less likely to run into limitations of
         | the instruction decoder (if this chunk of code isn't already in
         | a decoded uop cache).
        
       | filiphorvat wrote:
        | Unrelated: do FPUs on modern CPUs use FMA units for both
        | multiplies and adds, or do they use separate mul/add-only units?
        
         | bonzini wrote:
         | Probably to do multiplies, as the extra add is basically free.
         | Adds are cheaper.
        
           | thesz wrote:
           | Adds are cheaper only for fixed-point computations. Floating
            | point addition needs to denormalize one of its arguments,
           | perform an (integer) addition and then normalize the result.
           | 
           | Usually FP adds take a cycle or two longer than FP
           | multiplication.
        
         | gpderetta wrote:
         | I don't think there is a generally optimal design. There are
          | pros and cons to using the same homogeneous FMA units for
          | adds, multiplies and FMAs, even at the cost of making adds
          | slower (simpler design, and having all instructions of the same
          | latency greatly simplifies scheduling). IIRC Intel cycled
          | through 4-cycle FMA, add and mul; then 4-cycle add and mul with
          | 5-cycle FMA; then a dedicated 3-cycle add.
         | 
         | The optimal design depends a lot on the rest of the
         | microarchitecture, the loads the core is being optimized for,
         | the target frequency, the memory latency, etc.
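          | 
          | As a small aside on FMA being a single fused operation rather
          | than a mul feeding an add (illustrative; depending on the
          | compiler you may need -ffp-contract=off so the first expression
          | is not contracted into an FMA as well):
          | 
          |     #include <math.h>
          |     #include <stdio.h>
          | 
          |     int main(void) {
          |         double a = 1.0 + 0x1p-30, b = 1.0 - 0x1p-30, c = -1.0;
          |         /* a*b rounds to 1.0 in double, so mul-then-add gives 0,
          |            while fma() keeps the exact product and exposes the
          |            -0x1p-60 term. */
          |         printf("mul+add: %g\n", a * b + c);
          |         printf("fma:     %g\n", fma(a, b, c));
          |         return 0;
          |     }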
        
       | rasz wrote:
       | Pipelining is the direct reason behind 1996 Quake running 30 fps
       | on Intel Pentium 133 but requiring 233MHz from Cyrix and 166MHz
        | from AMD K6 to reach the same result. 20 fps needed a 75MHz
        | Intel Pentium, a 133MHz Cyrix, or a 100MHz AMD K5.
        
         | gary_0 wrote:
         | Specifically, Carmack exploited the fact that on the Pentium,
         | integer instructions could run in parallel with floating-point
         | division[0]. This goes to show that an optimization that
         | usually gets you ~10% might get you 200% depending on what
         | software you're running. And that no implementation detail is
         | safe from an ambitious software engineer.
         | 
         | [0] https://news.ycombinator.com/item?id=38249029
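          | 
          | Roughly, the structure of that trick looks like this (a
          | hypothetical sketch, not Quake's actual code): start the
          | expensive divide for the next span, then do the cheap integer
          | work for the current span while the divide is still in flight.
          | 
          |     /* Perspective-correct spans: one 1/z divide per span,
          |        overlapped with the per-pixel loop of the previous span.
          |        dst must hold spans * span_len bytes. */
          |     void draw_spans(unsigned char *dst, float z, float z_step,
          |                     int spans, int span_len) {
          |         float inv_z_next = 1.0f / z;      /* divide for span 0 */
          |         for (int s = 0; s < spans; s++) {
          |             float inv_z = inv_z_next;
          |             z += z_step * (float)span_len;
          |             inv_z_next = 1.0f / z; /* next span's divide, can
          |                                       overlap the loop below */
          |             for (int i = 0; i < span_len; i++)
          |                 dst[s * span_len + i] =
          |                     (unsigned char)(inv_z * 255.0f);
          |         }
          |     }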
        
           | gpderetta wrote:
           | So that was not actually pipelining, but superscalar
           | execution. Incidentally division wasn't pipelined at all.
        
       | dhash wrote:
        | My favorite illustration of the concepts discussed here (in an
        | accessible form, not the processor optimization manuals) has long
        | been [0].
       | 
       | For me, this really makes working with a modern microprocessor a
        | _science_, as anyone who has written benchmarks knows -- it's
       | difficult to reason about the complex behaviour and performance
       | cliffs without testing.
       | 
       | Another excellent example of the weirdness has to be JVM anatomy
       | quarks [1]
       | 
       | [0] https://www.lighterra.com/papers/modernmicroprocessors/
       | 
       | [1] https://shipilev.net/jvm/anatomy-quarks/
        
         | gpderetta wrote:
          | The first link is very nice, worthy of a submission of its own.
        
       ___________________________________________________________________
       (page generated 2024-12-30 23:00 UTC)