[HN Gopher] The J1 Forth CPU
       ___________________________________________________________________
        
       The J1 Forth CPU
        
       Author : Cieplak
       Score  : 198 points
       Date   : 2021-01-13 08:29 UTC (14 hours ago)
        
 (HTM) web link (www.excamera.com)
 (TXT) w3m dump (www.excamera.com)
        
       | bmitc wrote:
       | Forth is fun. I haven't learned much of it or used it much, but
       | it has been enjoyable.
        
       | pkaye wrote:
       | I always wondered if it is feasible to do a pipelined/super
       | scalar stack based CPU.
        
         | FullyFunctional wrote:
          | Contrary to a common misunderstanding, sure, no real
          | problem.
         | 
          | The semantics of each instruction is to consume the two top
          | stack elements and replace them with the result. You handle
          | this by having a rename stack of physical registers (with
          | additional complication for handling under- and overflows).
          | Assume the current stack is
          | 
          |     pr3 pr4 pr5
          | 
          | and the first free register is pr56.
          | 
          | Then a "+" instruction is interpreted as "add pr56, pr4, pr5";
          | pr56 is consumed from the free list, and pr4 and pr5 are
          | marked to be freed when this commits.
         | 
          | Because stack machines inherently introduce a lot of tight
          | dependencies, you will need to use dynamic scheduling (OoOE)
          | to go super-scalar, but it's not a problem.
          | 
          | Upside: incredible instruction density. Downside: slightly
          | harder to do good code generation, but not by much.
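          | 
          | (A rough C sketch of that renaming step, purely for
          | illustration; the array layout and register numbers here are
          | made up, not from any real design:)
          | 
          |     #include <stdio.h>
          | 
          |     /* Maps stack slots to physical registers; slot 0 is the
          |        bottom of the stack, depth - 1 is the top. */
          |     int rename_stack[64] = { 3, 4, 5 };   /* pr3 pr4 pr5 */
          |     int depth = 3;
          |     int next_free = 56;            /* next free phys. reg */
          | 
          |     /* A Forth "+" becomes one three-operand micro-op. */
          |     void rename_add(void) {
          |         int src2 = rename_stack[--depth];         /* pr5  */
          |         int src1 = rename_stack[--depth];         /* pr4  */
          |         int dst  = next_free++;                   /* pr56 */
          |         printf("add pr%d, pr%d, pr%d\n", dst, src1, src2);
          |         rename_stack[depth++] = dst;  /* result is new top */
          |         /* pr4 and pr5 are freed when the micro-op commits. */
          |     }
          | 
          |     int main(void) { rename_add(); return 0; }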
        
       | peter_d_sherman wrote:
       | >"J1 is a
       | 
       |  _small (200 lines of Verilog)_
       | 
        | stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes
        | of RAM fits easily on a small Xilinx FPGA."
       | 
       | PDS: A long time ago, I thought that Forth was "just another
       | computer programming language", but back then, I had no
       | appreciation for it, because I didn't understand where it fits in
       | on the abstraction hierarchy that we call modern day computing...
       | 
        | I didn't have this appreciation because, at the time, I had not
        | yet worked on writing compilers...
       | 
       | Now, let me explain to you why Forth is so gloriously wonderful!
       | 
       | You see, in terms of abstraction, where transistors are the
       | lowest level, and Lisp, and Lisp-like languages are the highest,
       | and most compilers are somewhat mid-level,
       | 
       |  _Forth is the next step up above machine language -- but a step
       | below most compilers!_
       | 
       | You'd know this, you'd _derive_ this, if you ever tried to write
       | a compiler from the bottom up (going from assembly language to a
       | compiler) rather than from the top down (what is taught about
       | compiler construction in most University CS classes, i.e., the
       | "Dragon Book", etc.)
       | 
       | See, Forth is sort of what happens when you add an at-runtime
       | symbol (AKA, "keyword") lookup table to your assembler code, such
       | that addresses (AKA "labels", AKA, "functions", AKA "procedures",
       | AKA "methods") can be dynamically executed at runtime, and you
       | create an interpreter / stack-machine -- around this concept.
       | 
       | There's a little bit more to that of course, but that's the basic
       | idea.
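        | 
        | (A rough C sketch of that idea: a runtime name-to-routine
        | lookup table plus a data stack. All names here are made up for
        | illustration, not taken from any particular Forth.)
        | 
        |     #include <stdio.h>
        |     #include <stdlib.h>
        |     #include <string.h>
        | 
        |     typedef void (*word_fn)(void);
        | 
        |     int stack[16];
        |     int sp = 0;                       /* next free slot */
        | 
        |     void push(int v) { stack[sp++] = v; }
        |     int  pop(void)   { return stack[--sp]; }
        | 
        |     void w_add(void) { int b = pop(), a = pop(); push(a + b); }
        |     void w_dot(void) { printf("%d ", pop()); }
        | 
        |     /* The runtime symbol ("word") table. */
        |     enum { NWORDS = 2 };
        |     struct { const char *name; word_fn fn; } dict[NWORDS] = {
        |         { "+", w_add },
        |         { ".", w_dot },
        |     };
        | 
        |     /* Look each token up in the dictionary and execute it;
        |        anything not found is treated as a number literal. */
        |     void interpret(const char *token) {
        |         for (int i = 0; i < NWORDS; i++)
        |             if (strcmp(token, dict[i].name) == 0) {
        |                 dict[i].fn();
        |                 return;
        |             }
        |         push(atoi(token));
        |     }
        | 
        |     int main(void) {
        |         const char *program[] = { "3", "4", "+", "." };
        |         for (int i = 0; i < 4; i++) interpret(program[i]);
        |         printf("\n");                 /* prints: 7 */
        |         return 0;
        |     }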
       | 
       | Then you _derive_ Forth!
       | 
       | That's _why_ Forth is so gorgeous and brilliant!
       | 
        | And so aesthetically pleasing, because of its small size, yet
        | amazing functionality!
       | 
       | And why you don't need big, complex CPU's to run it!
       | 
       | It exactly follows the "Einsteinian mantra" of
       | 
       |  _" As simple as possible, but not simpler!"_
       | 
       | Anyway, thanks again for your great work with the J1 Forth CPU!
        
         | tkinom wrote:
         | Each "J1" seems to be able to map well(easily) to a tensor.
         | 
         | Thousands (or a lot more) of them can be mapped to a TensorFlow
         | network....
        
         | fmakunbound wrote:
         | Agree with all that!
         | 
         | Except:
         | 
         | > Lisp, and Lisp-like languages are the highest, and most
         | compilers are somewhat mid-level
         | 
         | Most Lisps have compilers (except the toy ones), and have had
         | compilers for several decades.
        
           | Jtsummers wrote:
           | I think GP may have worded that poorly, and "most compilers"
           | in your quoted portion would be better read as "most other
           | languages".
           | 
           | Taken this way, the sentence becomes something like: For
           | computing, transistors are the lowest level of abstraction
           | (they are literally the things doing the computing), Lisp and
           | kin languages are at the extreme end of abstraction, and most
           | other languages fit somewhere in between.
           | 
           | Though not entirely true (Lisps let you get pretty low level
           | if you want), it's a reasonable perspective if you look not
           | at what the languages permit, but what they encourage. Forth
           | encourages building your own abstractions from the bottom up.
           | Lisp encourages starting with an abstract machine model and
           | building up even higher levels of abstractions (same with
           | most functional and declarative languages). C++ (for
           | something in between) encourages understanding the underlying
           | machine, but not necessarily to the degree that Forth does.
           | Forths and Forth programs are often closely tied to a
           | specific CPU or instruction set. C++ can be seen as
           | abstracted across a particular machine model that's typical
           | of many modern CPUs but not with a direct tie to a specific
           | instruction set (doing so constrains many programs more than
           | necessary). And Lisp's abstract model is (or can be) even
           | more disconnected from the particular physical machine.
           | 
            | I'd put Haskell, SQL, Prolog, and languages at their level
            | at the extreme end of abstraction, before Lisp. Their
            | abstract machines are even further from the hardware (for
            | most users of the languages) than Lisp's.
        
       | moonchild wrote:
       | > double precision (i.e. 32 bit) math
       | 
       | Is this standard nomenclature anywhere? IME 'double precision'
       | generally refers to 64-bit floating values; and 32-bit is called
       | 'single precision'.
        
         | howerj wrote:
         | It should also be noted that double in this context has nothing
         | to do with floating point numbers, Forth implementations often
         | do not have the functions (called "words") to manipulate FP
         | numbers. Instead I believe it refers to the double-number word
         | set, words like "d+", "d-". See
         | <http://lars.nocrew.org/forth2012/double.html>.
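          | 
          | (A rough C sketch of the concept, not Forth itself: a 32-bit
          | "double" value kept as two 16-bit cells with the carry
          | propagated by hand, roughly what a "d+" does on a 16-bit
          | machine.)
          | 
          |     #include <stdint.h>
          |     #include <stdio.h>
          | 
          |     /* A double number: two 16-bit cells, low and high. */
          |     typedef struct { uint16_t lo, hi; } dcell;
          | 
          |     dcell dplus(dcell a, dcell b) {
          |         dcell r;
          |         uint32_t lo = (uint32_t)a.lo + b.lo;
          |         r.lo = (uint16_t)lo;
          |         r.hi = (uint16_t)(a.hi + b.hi + (lo >> 16)); /* carry */
          |         return r;
          |     }
          | 
          |     int main(void) {
          |         dcell a = { 0xFFFF, 0x0001 };      /* 0x0001FFFF */
          |         dcell b = { 0x0001, 0x0000 };      /* 0x00000001 */
          |         dcell r = dplus(a, b);
          |         printf("0x%04X%04X\n", r.hi, r.lo); /* 0x00020000 */
          |         return 0;
          |     }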
         | 
          | Forth often uses different, or slightly odd, terminology for
          | more common computer science terms because of how the language
          | evolved independently of universities and research labs. For
          | example, "meta-compiler" is used when "cross-compiler" would
          | be more appropriate; instead of functions Forth has "words",
          | which are combined into a "dictionary"; and because the term
          | "word" is already taken, "cell" is used instead when referring
          | to a machine's word width.
         | 
         | Edit: Grammar.
        
         | chalst wrote:
          | It used to be the case that double precision meant two words,
          | which fits for this 16-bit CPU. It's fairly rare these days
          | now that we care more about portability.
        
           | [deleted]
        
         | sanjarsk wrote:
          | since it's a 16-bit CPU, double precision implies 32 bits
        
         | jabl wrote:
         | If you look at e.g. x86 ASM manuals, you have WORD (16 bits),
         | double word (DWORD, 32 bits), and quadword (QWORD, 64 bits). So
         | even if it's nowadays a 64-bit CPU, the nomenclature from the
         | 16-bit days sticks.
         | 
         | Double precision usually refers to 64-bit floating point, like
         | you say.
         | 
         | I would agree that this usage is not standard.
        
           | jimktrains2 wrote:
           | > Double precision usually refers to 64-bit floating point,
           | like you say.
           | 
            | Is it? Doesn't `double` in C refer to a 32-bit value?
           | 
           | EDIT:
           | 
            | So, it seems I've not dealt with this in much too long, am
            | misremembering, and am therefore wrong.
            | 
            |     #include <stdio.h>
            | 
            |     int main() {
            |         printf("sizeof(float) = %zu\n", sizeof(float));
            |         printf("sizeof(double) = %zu\n", sizeof(double));
            |         return 0;
            |     }
            | 
            | yields
            | 
            |     sizeof(float) = 4
            |     sizeof(double) = 8
            | 
            | on an Intel(R) Core(TM) i5-6200U (more-or-less a run-of-the-
            | mill 64-bit x86-family core). I don't have a 32-bit processor
            | handy to test, but I don't believe it'd change the results.
        
             | froydnj wrote:
             | Not usually; `float` is 32 bits and `double` is 64 bits on
             | virtually every common platform (maybe not on some DSP
             | chips or certain embedded chips?). But the C++ standard
             | (and probably the C one) only requires that `double` have
             | at least as much precision as a `float`, so it's
              | conceivable you could have a C++ implementation with 32-bit
              | `float` and `double`, or 16-bit `float` and 32-bit `double`.
        
             | Something1234 wrote:
             | float == 32 bits, double == 64 bits.
        
             | coliveira wrote:
             | C never required that the size of int or double be the same
             | across compilers. Even in the same machine they can have
             | different sizes.
        
               | zokier wrote:
               | "F.2 Types
               | 
               | The C floating types match the IEC 60559 formats as
               | follows:
               | 
               | -- The float type matches the IEC 60559 single format.
               | 
               | -- The double type matches the IEC 60559 double format.
               | 
               | -- The long double type matches an IEC 60559 extended
               | format, else a non-IEC 60559 extended format, else the
               | IEC 60559 double format.
               | 
               | Any non-IEC 60559 extended format used for the long
               | double type shall have more precision than IEC 60559
               | double and at least the range of IEC 60559 double.
               | 
               | Recommended practice
               | 
               | The long double type should match an IEC 60559 extended
               | format."
               | 
               | ISO/IEC 9899:1999, Annex F
        
       | Cieplak wrote:
       | Also worth checking out James's simplified version designed for
       | the Lattice iCE40 [1][2].
       | 
       | [1] https://www.excamera.com/sphinx/article-j1a-swapforth.html
       | 
       | [2] https://youtube.com/watch?v=rdLgLCIDSk0
        
         | tieze wrote:
         | Oh, thanks so much for this pointer :)
        
         | UncleOxidant wrote:
         | Will the non-simplified J1 fit/run in a larger iCE40 FPGA?
        
           | avhon1 wrote:
           | Probably not.
           | 
           | iCE40 FPGAs come with up to 7,680 logic cells.
           | 
           | The J1 was designed for a board with a Xilinx XC3S1000, which
           | has 17,280 logic cells.
        
             | sitkack wrote:
              | There are RISC-V cores that fit in about 1k LUTs. One could
              | build a NoC of RISC-V cores using an XC3S1000.
        
             | UncleOxidant wrote:
             | This is probably because the RAM is internal to the FPGA:
             | 
             | > A complete J1 with 16Kbytes of RAM fits easily on a small
             | Xilinx FPGA.
             | 
              | I'd guess the CPU itself would easily fit into an iCE40
              | (given that RISC-V cores fit and the J1 should be simpler)
              | with the RAM external. Several of the iCE40 boards have
              | external RAM.
        
       | rbanffy wrote:
       | In college, one of our assignments was to design a CPU. I started
       | with a plain register-based one, but ended up moving to a stack-
       | based design. It was one of the best decisions I made in the
       | project, simplified the microcode enormously and made programming
       | it easy and fun.
        
         | loa_in_ wrote:
          | Did you roll your own language for this, use an existing one,
          | or did you program it in machine code?
        
           | rbanffy wrote:
           | Machine code felt very close to Forth (I worked with GraFORTH
           | on the Apple II prior to that), or an HP calculator
           | (ironically, now I finally own one). The CPU was never
           | implemented and large parts of the microcode were never
           | written, but I have written some simple programs for it.
           | 
           | The printouts and the folder are in my mom's house in Brazil,
           | so I can't really look them up. Not sure the (5.25") floppy
           | is still readable or what could interpret the netlist.
           | 
           | For some ops I cheated a bit. The top of the stack could be
           | referenced as registers, R0 to R7, but I don't think I used
           | those shortcuts in sample code.
        
       | fctorial wrote:
        | What are the advantages of an FPGA over an emulator running on a
        | board like a Raspberry Pi? Isn't an FPGA just a hardware-level
        | emulator?
        
         | rcxdude wrote:
          | It can be more efficient. It depends on how sophisticated an
          | emulator you are writing and how well the emulated CPU
          | semantics match the host CPU semantics. For a simple emulator
          | of a CPU which is vastly different from the host CPU, an FPGA
          | will likely be much faster and consume less power: although
          | FPGAs are generally slower than dedicated silicon, they are
          | really good at simulating arbitrary digital logic very fast,
          | while CPUs are quite poor at it.
        
         | progre wrote:
          | While FPGAs are often used for implementing CPUs, that's
          | really not the best use of an FPGA in my opinion. Of course
          | it's very useful when prototyping a CPU, but if you only want
          | a CPU... you can just buy one.
          | 
          | I think a more interesting use is for hardware that you
          | _can't_ buy. Like the balls in this project:
          | 
          | https://en.wikipedia.org/wiki/IceCube_Neutrino_Observatory
          | 
          | Those balls contain frequency analyzers implemented on FPGAs,
          | essentially doing Fourier transforms using lookup-table maths.
          | This means they can do transforms much, much faster than in
          | software.
        
         | TFortunato wrote:
         | Seconding folks re: some of the advantages on latency,
         | efficiency, etc.
         | 
          | I also wouldn't describe it as "just" an emulator, or think of
          | it as something interpreting / emulating some digital logic.
          | It's much lower level than that, in that it is actually
          | implementing your design in real hardware (consisting of lookup
          | tables / logic gates / memory cells, etc., all "wired"
          | together).
          | 
          | As such, they can be useful not only for implementing some
          | custom bits of glue logic (e.g. interfaces which need high-
          | speed serialization / deserialization of data) or accelerating
          | a particular calculation, but also anywhere you need really
          | deterministic performance in a way that isn't easy or even
          | possible to achieve in software / an emulator alone.
        
         | opencl wrote:
         | FPGAs can generally achieve much lower latencies than systems
         | like the Pi (but so can microcontrollers). For some use cases
         | they can be more power efficient. If you need some oddball I/O
         | that isn't widely available in existing systems you can
         | implement it in gateware.
         | 
          | In general I would say there aren't a whole lot of cases where
          | it makes sense to use an FPGA over a microcontroller or SBC,
          | given how cheap and fast those are these days. Of course, as
          | with a lot of hobbyist tech, people will choose a certain
          | technology for fun/learning.
          | 
          | Note that the Pi didn't exist yet when this CPU was designed;
          | the embedded world was very different 10 years ago.
        
       | jandrese wrote:
       | > one's complement addition
       | 
       | That's going to catch some people off guard, especially on a 16
       | bit system where it's so easy to overflow.
        
       | one-more-minute wrote:
       | Tangential question about FPGAs: Is there any work on compiling
       | code to a combination of hardware and software? I'm imagining
       | that the "outer loop" of a program is still fairly standard ARM
       | instructions, or similar, but the compiler turns some subroutines
       | into specialised circuits. Even more ambitiously you could JIT-
       | compile hot loops from machine instructions to hardware.
       | 
       | We already kind of do this manually over the long term (eg things
       | like bfloat16, TF32 and hardware support for them in ML, or
       | specialised video decoders). With mixed compilation you could do
       | things like specify a floating point format on-the-fly, or mix
       | and match formats, in software and still get high performance.
        
         | rwmj wrote:
         | The "execution model" is so vastly different it's hard to even
         | know what "JIT-compile hot loops from machine instructions to
         | hardware" even means. I wouldn't even call HDLs "execution" -
         | they describe how to interconnect electronic circuits, and if
         | they can be said to "execute" at all, it's that everything runs
         | in parallel, processing signals across all circuits to the beat
         | of a clock (usually, not always).
        
         | LargoLasskhyfv wrote:
         | There was. But for Mips. By Microsoft which used NetBSD.
         | 
         | https://www.microsoft.com/en-us/research/project/emips/
         | 
         | https://www.microsoft.com/en-us/research/publication/multico...
         | 
         | http://blog.netbsd.org/tnf/entry/support_for_microsoft_emips...
         | 
         | The thing is, this is not just another step up in complexity as
         | another poster wrote here, but several.
         | 
          | Because it requires _partial dynamic reconfiguration_, which
          | works with RAM-based FPGAs only (the ones which load their
          | bitstream, containing their initial configuration, from
          | somewhere on startup), not flash-based ones which are "instant
          | on" in their fixed configuration.
          | 
          | Regardless of that, partial dynamic reconfiguration takes time.
          | The larger the reconfigured parts, the more time.
          | 
          | This is all very annoying because of vendor lock-in from
          | proprietary tools, IP protection, and so much more.
          | 
          | The few FPGAs which have open source tool chains are unsuitable
          | because they are all flash-based AFAIK, and it doesn't seem to
          | be on the radar of the people involved in developing these,
          | because why bother, if it's flash anyway?
        
           | duskwuff wrote:
           | > The few fpgas which have open source tool chains are
           | unsuitable because they are all flash based AFAIK...
           | 
           | Not true at all. The flagship open-source FPGAs are the
           | Lattice iCE40 series, which are SRAM-based. There's also been
           | significant work towards open-source toolchains for Xilinx
           | FPGAs, which are also SRAM-based.
           | 
           | The real limitation is in capabilities. The iCE40 series is
           | composed of relatively small FPGAs which wouldn't be
           | particularly useful for this type of application.
        
             | LargoLasskhyfv wrote:
              | OK? I didn't follow the efforts for Lattice because of
              | insufficient resources for my needs. I'm aware of efforts
              | for Xilinx, but they aren't covering the SKUs/models I'm
              | working with. Is there anything for Altera/Intel now?
        
               | duskwuff wrote:
               | I'm not aware of any significant reverse-engineering
               | efforts for Intel FPGAs. QUIP [1] might be an interesting
               | starting point, but there may be significant IP licensing
               | restrictions surrounding that data.
               | 
               | Out of curiosity, which Xilinx models are you hoping to
               | see support for?
               | 
               | [1]: https://www.intel.com/content/www/us/en/programmable
               | /support...
        
             | nereye wrote:
             | Lattice ECP5 is an SRAM-based FPGA which has up to 84K LUTs
             | (vs ~5K for iCE40) and is supported by an open source tool
              | chain. E.g. see
              | https://www.crowdsupply.com/radiona/ulx3s.
        
         | ekiwi wrote:
         | You might be interested in this work which integrates a
         | programmable fabric directly with a MIPS core in order to speed
         | up inner loops: http://brass.cs.berkeley.edu/garp.html
        
         | vidarh wrote:
         | The challenge is that reformulating problems to parallel
         | computation steps is something we're in general still really
         | bad at.
         | 
         | We're struggling with taking full advantage of GPUs and many-
         | core CPUs as it is.
         | 
          | FPGAs are one step up in complexity.
          | 
          | I'd expect JIT'ing to FPGA acceleration to show up, other than
          | as very limited research prototypes, only _after_ people have
          | first done a lot more research on JIT'ed auto-parallelisation
          | to multiple CPU cores or GPUs.
        
       | cbmuser wrote:
       | The name is a bit confusing as there is already a J2 CPU which is
       | the open source variant of the SuperH SH-2.
        
         | jecel wrote:
         | The J1 was published in 2010 and the J2 in 2015, so it was up
         | to them to avoid the confusion.
        
       ___________________________________________________________________
       (page generated 2021-01-13 23:01 UTC)