[HN Gopher] The J1 Forth CPU
___________________________________________________________________
The J1 Forth CPU
Author : Cieplak
Score : 198 points
Date : 2021-01-13 08:29 UTC (14 hours ago)
(HTM) web link (www.excamera.com)
(TXT) w3m dump (www.excamera.com)
| bmitc wrote:
| Forth is fun. I haven't learned much of it or used it much, but
| it has been enjoyable.
| pkaye wrote:
| I always wondered if it is feasible to build a
| pipelined/superscalar stack-based CPU.
| FullyFunctional wrote:
| Contrary to common misunderstanding, sure, no real problem.
|
| The semantics of each instruction are: consume the two top
| stack elements and replace them with the result. You handle
| this by keeping a rename stack of physical registers (with
| additional complication for handling under- and overflow).
| Say the current stack is
|
|       pr3 pr4 pr5
|
| and the first free physical register is pr56.
|
| Then a "+" instruction is interpreted as "add pr56, pr4, pr5";
| pr56 is taken from the free list, and pr4 and pr5 are marked
| to be freed when the instruction commits.
|
| Because stack machines inherently introduce a lot of tight
| dependencies, you will need dynamic scheduling (OoOE) to go
| superscalar, but it's not a problem.
|
| The upside is incredible instruction density. The downside:
| code generation is slightly harder to do well, but not by
| much.
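|
| A minimal C sketch of that renaming step (my illustration, not
| from any real core; the bump-counter "free list" and register
| numbers are just for the example, and under/overflow handling
| is omitted):
|
|       #include <stdio.h>
|
|       #define STACK_DEPTH 64
|
|       /* Maps architectural stack slots to physical registers. */
|       static int rename_stack[STACK_DEPTH];
|       static int sp = 0;
|       static int next_free = 56; /* toy free list: a bump counter */
|
|       static void push(int preg) { rename_stack[sp++] = preg; }
|       static int  pop(void)      { return rename_stack[--sp]; }
|
|       static void rename_add(void)
|       {
|           int src2 = pop();       /* top of stack */
|           int src1 = pop();       /* second on stack */
|           int dst  = next_free++; /* grab a free physical register */
|           printf("add pr%d, pr%d, pr%d\n", dst, src1, src2);
|           /* src1/src2 return to the free list when this commits. */
|           push(dst);              /* result is the new top of stack */
|       }
|
|       int main(void)
|       {
|           push(3); push(4); push(5); /* current stack: pr3 pr4 pr5 */
|           rename_add();              /* emits: add pr56, pr4, pr5 */
|           return 0;
|       }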
| peter_d_sherman wrote:
| >"J1 is a
|
| _small (200 lines of Verilog)_
|
| stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes
| of RAM fits easily on a small Xilinx FPGA.
|
| PDS: A long time ago, I thought that Forth was "just another
| computer programming language", but back then I had no
| appreciation for it, because I didn't understand where it fits
| into the abstraction hierarchy that we call modern-day
| computing...
|
| I didn't have this appreciation because, at the time, I had
| not yet worked on writing compilers...
|
| Now, let me explain to you why Forth is so gloriously wonderful!
|
| You see, in terms of abstraction, where transistors are the
| lowest level, and Lisp, and Lisp-like languages are the highest,
| and most compilers are somewhat mid-level,
|
| _Forth is the next step up above machine language -- but a step
| below most compilers!_
|
| You'd know this, you'd _derive_ this, if you ever tried to
| write a compiler from the bottom up (going from assembly
| language to a compiler) rather than from the top down (what is
| taught about compiler construction in most university CS
| classes, e.g., the "Dragon Book", etc.).
|
| See, Forth is sort of what happens when you add a runtime
| symbol (AKA "keyword") lookup table to your assembler code,
| such that addresses (AKA "labels", AKA "functions", AKA
| "procedures", AKA "methods") can be looked up and executed
| dynamically at runtime -- and you build an interpreter /
| stack machine around this concept.
|
| There's a little bit more to that of course, but that's the basic
| idea.
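|
| A toy sketch in C of that idea (purely illustrative, not real
| Forth internals): a "dictionary" maps names to addresses, a
| stack carries the data, and the interpreter just looks up each
| token and executes it:
|
|       #include <stdio.h>
|       #include <stdlib.h>
|       #include <string.h>
|
|       static int stack[64], sp = 0;
|       static void push(int v) { stack[sp++] = v; }
|       static int  pop(void)   { return stack[--sp]; }
|
|       /* Primitive words: each is just an address you can execute. */
|       static void w_add(void) { int b = pop(); push(pop() + b); }
|       static void w_dot(void) { printf("%d\n", pop()); }
|
|       /* The dictionary: a runtime lookup table, name -> address. */
|       struct word { const char *name; void (*code)(void); };
|       static const struct word dict[] = { { "+", w_add }, { ".", w_dot } };
|
|       static void interpret(char *line)
|       {
|           for (char *tok = strtok(line, " "); tok; tok = strtok(NULL, " ")) {
|               size_t i, n = sizeof dict / sizeof dict[0];
|               for (i = 0; i < n; i++)
|                   if (strcmp(tok, dict[i].name) == 0) { dict[i].code(); break; }
|               if (i == n) push(atoi(tok)); /* not a word: it's a number */
|           }
|       }
|
|       int main(void)
|       {
|           char program[] = "2 3 + .";     /* prints 5 */
|           interpret(program);
|           return 0;
|       }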
|
| Then you _derive_ Forth!
|
| That's _why_ Forth is so gorgeous and brilliant!
|
| And so aesthetically pleasing, because of its small size, yet
| amazing functionality!
|
| And why you don't need big, complex CPU's to run it!
|
| It exactly follows the "Einsteinian mantra" of
|
| _" As simple as possible, but not simpler!"_
|
| Anyway, thanks again for your great work with the J1 Forth CPU!
| tkinom wrote:
| Each "J1" seems to be able to map well(easily) to a tensor.
|
| Thousands (or a lot more) of them can be mapped to a TensorFlow
| network....
| fmakunbound wrote:
| Agree with all that!
|
| Except:
|
| > Lisp, and Lisp-like languages are the highest, and most
| compilers are somewhat mid-level
|
| Most Lisps have compilers (except the toy ones), and have had
| compilers for several decades.
| Jtsummers wrote:
| I think GP may have worded that poorly, and "most compilers"
| in your quoted portion would be better read as "most other
| languages".
|
| Taken this way, the sentence becomes something like: For
| computing, transistors are the lowest level of abstraction
| (they are literally the things doing the computing), Lisp and
| kin languages are at the extreme end of abstraction, and most
| other languages fit somewhere in between.
|
| Though not entirely true (Lisps let you get pretty low level
| if you want), it's a reasonable perspective if you look not
| at what the languages permit, but what they encourage. Forth
| encourages building your own abstractions from the bottom up.
| Lisp encourages starting with an abstract machine model and
| building up even higher levels of abstractions (same with
| most functional and declarative languages). C++ (for
| something in between) encourages understanding the underlying
| machine, but not necessarily to the degree that Forth does.
| Forths and Forth programs are often closely tied to a
| specific CPU or instruction set. C++ can be seen as
| abstracted across a particular machine model that's typical
| of many modern CPUs but not with a direct tie to a specific
| instruction set (doing so constrains many programs more than
| necessary). And Lisp's abstract model is (or can be) even
| more disconnected from the particular physical machine.
|
| I'd put Haskell, SQL, or Prolog, and languages at their level,
| at the extreme end of abstraction before Lisp. Their abstract
| machines are even further from the hardware (for most users of
| those languages) than Lisp's.
| moonchild wrote:
| > double precision (i.e. 32 bit) math
|
| Is this standard nomenclature anywhere? IME "double precision"
| generally refers to 64-bit floating-point values, and 32-bit
| is called "single precision".
| howerj wrote:
| It should also be noted that "double" in this context has
| nothing to do with floating-point numbers; Forth
| implementations often do not even have the functions (called
| "words") to manipulate FP numbers. Instead I believe it refers
| to the double-number word set -- words like "d+" and "d-". See
| <http://lars.nocrew.org/forth2012/double.html>.
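|
| Roughly, a double number is a value spanning two cells. A
| hedged C sketch of what "d+" has to do on a 16-bit machine (my
| illustration, not from any particular Forth):
|
|       #include <stdint.h>
|       #include <stdio.h>
|
|       typedef struct { uint16_t lo, hi; } dcell; /* 32-bit double number */
|
|       static dcell d_add(dcell a, dcell b)
|       {
|           dcell r;
|           uint32_t lo = (uint32_t)a.lo + b.lo;
|           r.lo = (uint16_t)lo;
|           r.hi = (uint16_t)(a.hi + b.hi + (lo >> 16)); /* carry */
|           return r;
|       }
|
|       int main(void)
|       {
|           dcell a = { 0xFFFF, 0x0001 }; /* 0x0001FFFF */
|           dcell b = { 0x0001, 0x0000 }; /* 0x00000001 */
|           dcell r = d_add(a, b);        /* 0x00020000 */
|           printf("0x%04X%04X\n", r.hi, r.lo);
|           return 0;
|       }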
|
| Forth often uses different, or slightly odd, terminology for
| common computer science terms because the language evolved
| independently of universities and research labs. For example,
| "meta-compiler" is used where "cross-compiler" would be more
| appropriate; instead of functions Forth has "words", which are
| combined into a "dictionary"; and because the term "word" is
| already taken, "cell" is used when referring to a machine's
| word width.
|
| Edit: Grammar.
| chalst wrote:
| It used to be the case that "double precision" meant two
| words, which fits this 16-bit CPU. That usage is fairly rare
| these days, now that we care more about portability.
| [deleted]
| sanjarsk wrote:
| Since it's a 16-bit CPU, double precision implies 32 bits.
| jabl wrote:
| If you look at e.g. x86 ASM manuals, you have WORD (16 bits),
| double word (DWORD, 32 bits), and quadword (QWORD, 64 bits).
| So even though x86 is nowadays a 64-bit architecture, the
| nomenclature from the 16-bit days sticks.
|
| Double precision usually refers to 64-bit floating point, like
| you say.
|
| I would agree that this usage is not standard.
| jimktrains2 wrote:
| > Double precision usually refers to 64-bit floating point,
| like you say.
|
| Is it? Doesn't `double` in C refer to a 32-bit value?
|
| EDIT:
|
| So, it seems I haven't dealt with this in much too long and am
| misremembering, and therefore wrong.
|
|       #include <stdio.h>
|
|       int main() {
|           /* %zu: sizeof yields a size_t */
|           printf("sizeof(float)  = %zu\n", sizeof(float));
|           printf("sizeof(double) = %zu\n", sizeof(double));
|           return 0;
|       }
|
| yields
|
|       sizeof(float)  = 4
|       sizeof(double) = 8
|
| on an Intel(R) Core(TM) i5-6200U (more-or-less a run-of-the-
| mill 64-bit x86-family core). I don't have a 32-bit processor
| handy to test, but I don't believe it'd change the results.
| froydnj wrote:
| Not usually; `float` is 32 bits and `double` is 64 bits on
| virtually every common platform (maybe not on some DSP chips
| or certain embedded chips?). But the C++ standard (and
| probably the C one) only requires that `double` have at least
| as much precision as a `float`, so it's conceivable you could
| have an implementation with 32-bit `float` and `double`, or
| 16-bit `float` and 32-bit `double`.
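|
| If your code relies on the common sizes, a couple of
| compile-time checks make the assumption explicit (a small
| sketch using C11's static_assert):
|
|       #include <assert.h>
|
|       /* Fail the build if this platform deviates from the usual
|          IEEE-754 sizes (C itself does not guarantee these). */
|       static_assert(sizeof(float) == 4,  "float is not 32 bits");
|       static_assert(sizeof(double) == 8, "double is not 64 bits");
|
|       int main(void) { return 0; }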
| Something1234 wrote:
| float == 32 bits, double == 64 bits.
| coliveira wrote:
| C never required that the size of `int` or `double` be the
| same across compilers. Even on the same machine they can have
| different sizes.
| zokier wrote:
| "F.2 Types
|
| The C floating types match the IEC 60559 formats as
| follows:
|
| -- The float type matches the IEC 60559 single format.
|
| -- The double type matches the IEC 60559 double format.
|
| -- The long double type matches an IEC 60559 extended
| format, else a non-IEC 60559 extended format, else the
| IEC 60559 double format.
|
| Any non-IEC 60559 extended format used for the long
| double type shall have more precision than IEC 60559
| double and at least the range of IEC 60559 double.
|
| Recommended practice
|
| The long double type should match an IEC 60559 extended
| format."
|
| ISO/IEC 9899:1999, Annex F
| Cieplak wrote:
| Also worth checking out James's simplified version designed for
| the Lattice iCE40 [1][2].
|
| [1] https://www.excamera.com/sphinx/article-j1a-swapforth.html
|
| [2] https://youtube.com/watch?v=rdLgLCIDSk0
| tieze wrote:
| Oh, thanks so much for this pointer :)
| UncleOxidant wrote:
| Will the non-simplified J1 fit/run in a larger iCE40 FPGA?
| avhon1 wrote:
| Probably not.
|
| iCE40 FPGAs come with up to 7,680 logic cells.
|
| The J1 was designed for a board with a Xilinx XC3S1000, which
| has 17,280 logic cells.
| sitkack wrote:
| There are RISC-V cores that fit in about 1k LUTs. One could
| build a NoC of RISC-V cores using an XC3S1000.
| UncleOxidant wrote:
| This is probably because the RAM is internal to the FPGA:
|
| > A complete J1 with 16Kbytes of RAM fits easily on a small
| Xilinx FPGA.
|
| I'd guess the CPU itself would easily fit into an iCE40 (given
| that RISC-V cores fit and the J1 should be simpler) with the
| RAM external. Several of the iCE40 boards have external RAM.
| rbanffy wrote:
| In college, one of our assignments was to design a CPU. I
| started with a plain register-based one, but ended up moving
| to a stack-based design. It was one of the best decisions I
| made in the project: it simplified the microcode enormously
| and made programming the CPU easy and fun.
| loa_in_ wrote:
| Did you roll your own language for this, use an existing one,
| or did you program it in machine code?
| rbanffy wrote:
| Machine code felt very close to Forth (I worked with GraFORTH
| on the Apple II prior to that), or an HP calculator
| (ironically, now I finally own one). The CPU was never
| implemented and large parts of the microcode were never
| written, but I have written some simple programs for it.
|
| The printouts and the folder are in my mom's house in Brazil,
| so I can't really look them up. Not sure the (5.25") floppy
| is still readable or what could interpret the netlist.
|
| For some ops I cheated a bit. The top of the stack could be
| referenced as registers, R0 to R7, but I don't think I used
| those shortcuts in sample code.
| fctorial wrote:
| What are the advantages of an FPGA over an emulator running on
| a board like the RPi? Isn't an FPGA just a hardware-level
| emulator?
| rcxdude wrote:
| It can be more efficient. It depends on how sophisticated an
| emulator you are writing and how well the emulated CPU
| semantics match the host CPU semantics. For a simple emulator
| of a CPU that is vastly different from the host CPU, an FPGA
| will likely be much faster and use much less power: although
| FPGAs are generally slower than dedicated silicon, they are
| really good at simulating arbitrary digital logic very fast,
| while CPUs are quite poor at it.
| progre wrote:
| While FPGAs are often used for implementing CPUs, that's
| really not the best use of an FPGA in my opinion. Of course
| it's very useful when prototyping a CPU, but if you only want
| a CPU... you can just buy one.
|
| I think a more interesting use is for hardware that you
| _can't_ buy. Like the balls in this project:
|
| https://en.wikipedia.org/wiki/IceCube_Neutrino_Observatory
|
| Those balls contain frequency analyzers implemented on FPGAs,
| essentially doing Fourier transforms using lookup-table maths.
| This means they can do transforms much, much faster than in
| software.
| TFortunato wrote:
| Seconding folks re: some of the advantages in latency,
| efficiency, etc.
|
| I also wouldn't describe it as "just" an emulator, or think of
| it as something interpreting / emulating digital logic. It's
| much lower level than that, in that it is actually
| implementing your design in real hardware (consisting of
| lookup tables / logic gates / memory cells, etc., all "wired"
| together).
|
| As such, FPGAs can be useful not only for implementing custom
| bits of glue logic (e.g. interfaces that need high-speed
| serialization / deserialization of data) or accelerating a
| particular calculation, but also anywhere you need really
| deterministic performance in a way that isn't easy or even
| possible to achieve in software / an emulator alone.
| opencl wrote:
| FPGAs can generally achieve much lower latencies than systems
| like the Pi (but so can microcontrollers). For some use cases
| they can be more power efficient. If you need some oddball I/O
| that isn't widely available in existing systems you can
| implement it in gateware.
|
| In general I would say there aren't a whole lot of cases where
| it makes sense to use an FPGA over a microcontroller or SBC,
| given how cheap and fast those are these days. Of course, as
| with a lot of hobbyist tech, people will choose a certain
| technology for fun/learning.
|
| Note that the Pi didn't exist yet when this CPU was designed;
| the embedded world was very different 10 years ago.
| jandrese wrote:
| > one's complement addition
|
| That's going to catch some people off guard, especially on a
| 16-bit system where it's so easy to overflow.
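|
| For anyone who hasn't run into it, the end-around carry is the
| surprising part. A quick illustrative sketch of 16-bit one's
| complement addition (not taken from the J1 source):
|
|       #include <stdint.h>
|       #include <stdio.h>
|
|       static uint16_t ones_add(uint16_t a, uint16_t b)
|       {
|           uint32_t sum = (uint32_t)a + b;
|           return (uint16_t)((sum & 0xFFFF) + (sum >> 16)); /* end-around carry */
|       }
|
|       int main(void)
|       {
|           /* In one's complement, -1 is 0xFFFE, and there are two
|              zeros (0x0000 and 0xFFFF). */
|           printf("0x%04X\n", ones_add(0x0003, 0xFFFE)); /* 3 + (-1) = 2 */
|           return 0;
|       }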
| one-more-minute wrote:
| Tangential question about FPGAs: Is there any work on compiling
| code to a combination of hardware and software? I'm imagining
| that the "outer loop" of a program is still fairly standard ARM
| instructions, or similar, but the compiler turns some subroutines
| into specialised circuits. Even more ambitiously you could JIT-
| compile hot loops from machine instructions to hardware.
|
| We already kind of do this manually over the long term (e.g.
| things like bfloat16 and TF32 and their hardware support in
| ML, or specialised video decoders). With mixed compilation you
| could do things like specify a floating-point format on the
| fly, or mix and match formats, in software and still get high
| performance.
| rwmj wrote:
| The "execution model" is so vastly different it's hard to even
| know what "JIT-compile hot loops from machine instructions to
| hardware" even means. I wouldn't even call HDLs "execution" -
| they describe how to interconnect electronic circuits, and if
| they can be said to "execute" at all, it's that everything runs
| in parallel, processing signals across all circuits to the beat
| of a clock (usually, not always).
| LargoLasskhyfv wrote:
| There was, but for MIPS, by Microsoft, using NetBSD.
|
| https://www.microsoft.com/en-us/research/project/emips/
|
| https://www.microsoft.com/en-us/research/publication/multico...
|
| http://blog.netbsd.org/tnf/entry/support_for_microsoft_emips...
|
| The thing is, this is not just another step up in complexity,
| as another poster wrote here, but several.
|
| Because it requires _partial dynamic reconfiguration_, which
| works only with RAM-based FPGAs (the ones that load their
| bitstream, containing their initial configuration, from
| somewhere on startup), not flash-based ones, which are
| "instant on" in their fixed configuration.
|
| Regardless of that, partial dynamic reconfiguration takes
| time. The larger the reconfigured parts, the more time.
|
| This is all very annoying because of vendor lock-in,
| proprietary tools, IP protection, and so much more.
|
| The few FPGAs which have open-source toolchains are unsuitable
| because they are all flash-based AFAIK, and it doesn't seem to
| be on the radar of the people developing those toolchains --
| because why bother, if flash anyway?
| duskwuff wrote:
| > The few FPGAs which have open-source toolchains are
| unsuitable because they are all flash-based AFAIK...
|
| Not true at all. The flagship open-source FPGAs are the
| Lattice iCE40 series, which are SRAM-based. There's also been
| significant work towards open-source toolchains for Xilinx
| FPGAs, which are also SRAM-based.
|
| The real limitation is in capabilities. The iCE40 series is
| composed of relatively small FPGAs which wouldn't be
| particularly useful for this type of application.
| LargoLasskhyfv wrote:
| OK? I didn't follow the efforts for Lattice because those
| parts have insufficient resources for my needs. I'm aware of
| efforts for Xilinx, but they don't cover the SKUs/models I'm
| working with. Is there anything for Altera/Intel now?
| duskwuff wrote:
| I'm not aware of any significant reverse-engineering
| efforts for Intel FPGAs. QUIP [1] might be an interesting
| starting point, but there may be significant IP licensing
| restrictions surrounding that data.
|
| Out of curiosity, which Xilinx models are you hoping to
| see support for?
|
| [1]: https://www.intel.com/content/www/us/en/programmable
| /support...
| nereye wrote:
| Lattice ECP5 is an SRAM-based FPGA which has up to 84K LUTs
| (vs ~5K for iCE40) and is supported by an open-source
| toolchain. E.g. see https://www.crowdsupply.com/radiona/ulx3s.
| ekiwi wrote:
| You might be interested in this work which integrates a
| programmable fabric directly with a MIPS core in order to speed
| up inner loops: http://brass.cs.berkeley.edu/garp.html
| vidarh wrote:
| The challenge is that reformulating problems to parallel
| computation steps is something we're in general still really
| bad at.
|
| We're struggling with taking full advantage of GPUs and many-
| core CPUs as it is.
|
| FPGAs are one step up in complexity from that.
|
| I'd expect JIT'ing to FPGA acceleration to show up, other than
| as very limited research prototypes, only _after_ people have
| first done a lot more research on JIT'ed auto-parallelisation
| to multiple CPU cores or GPUs.
| cbmuser wrote:
| The name is a bit confusing, as there is already a J2 CPU,
| which is the open-source variant of the SuperH SH-2.
| jecel wrote:
| The J1 was published in 2010 and the J2 in 2015, so it was up
| to them (the J2 project) to avoid the confusion.
___________________________________________________________________
(page generated 2021-01-13 23:01 UTC)