[HN Gopher] Show HN: A work-in-progress C compiler from scratch
___________________________________________________________________
Show HN: A work-in-progress C compiler from scratch
Author : r1chardnl
Score : 131 points
Date : 2021-08-03 12:50 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jjice wrote:
| This is just an anecdote, but I had to write a SQL DDL and DML
| parser during my last semester. I wrote the DDL parser by hand,
| and it wasn't as bad as I expected, but it was time consuming. I
| managed to convince the professor to give us the option of using
| a parser generator for the next phase (DML) since the point of
| the class wasn't parsing context free grammars and more focused
| on executing the SQL.
|
| I used Flex and Bison since the project was in C. Getting up and
| running and understanding how the tools have to be set up with
| different options was a bit tricky, but after that, my parser was
| up and running in about two hours, compared to probably four
| times that for the hand written DDL. Our DML subset was also much
| larger and more complex than our DDL, so I was very happy with
| the development speed increase.
|
| I had this idea that using a parser generator was slow and
| wasteful since many modern tutorials online write them by hand
| and speak against parser generators (possibly because there isn't
| a catch all for all languages). Turns out dev speed is way more
| important to me up front, because in the case that I notice
| parsing speed actually being an issue I should be happy that my
| MVP has gotten enough use.
|
| It's also nice because a lexer and parser can be pretty easily
| black-boxed and swapped out for your hand written, just keep the
| AST and API the same and you should be good.
|
| All that said, that's personal preferences and writing the parser
| by hand is definitely good experience and more extensible,
| especially for error handling. Nice work!
| brundolf wrote:
| To add an anecdote, I've gotten into a hobby of writing parsers
| and parser-related things, and I've developed a pattern and a
| handful of utility functions (which are small enough to be re-
| written from memory) that together make hand-writing a pretty
| breezy task. Just the other week at work I hand-wrote a GraphQL
| schema parser from scratch in about 2 hours using this
| technique (GraphQL's schemas have a relatively simple syntax,
| but I still think it's a decent data point)
| qorrect wrote:
| Is it hosted somewhere we can see it?
| brundolf wrote:
| It's not, though I've thought about open-sourcing or doing
| a write-up on my pattern, but I've just assumed there are
| tons of libraries out there for this that are probably
| better than my homegrown version. Could still make an
| interesting blog post though, I suppose
| WalterBright wrote:
| > dev speed is way more important to me up front
|
| Having written many parsers, I can attest that the time spent
| writing and debugging the parser is about 0.0000001% of the
| time you'll invest in the compiler.
|
| Heck, I spent more time trying to figure out how to integrate
| the tag name space from ImportC into the host D compiler's
| symbol table mechanism, than I did writing the C parser.
| dboreham wrote:
| Your post came through a wormhole from 1979.
| SavantIdiot wrote:
| Every time I see a compiler, the first thing a zero in on is the
| parser.
|
| I don't mean this in a bad way, but that tokenizer looks a bit
| ... simple. ;-)
|
| I say that because ~26 years ago I had a go at writing a C
| preprocessor with the goal of creating a ToC and Index for my C
| projects. I sat down with a BNF of C and lex/yacc and got ground
| up in the details, then found some software that did exactly what
| I wanted and moved on. I just remember a few parts of the C
| syntax that don't fit nicely into BNF and hadn't studied parsers
| enough to dig myself out of the hole.
|
| Good luck doing it manually, that's gonna be tricky as heck.
| fao_ wrote:
| Parsing it manually is actually the preferred choice for most
| compilers, both GCC and Clang do it manually, for example.
| SavantIdiot wrote:
| Really? That's probably why I failed! Ironically (given my
| post) I've never looked at Clang's parser. I have looked at
| GCC's and, well, sheepishly: I don't get it, it is a beast of
| a project and I wasn't able to crack the architecture. I
| should try again tho.
| rot13xor wrote:
| C is almost context-free with the exception of typedef. A
| custom parser helps with error diagnostics, e.g. missed
| semicolon or a "." swapped with "->".
|
| https://eli.thegreenplace.net/2007/11/24/the-context-
| sensiti...
| gwbas1c wrote:
| What are your goals? Is this a learning project? Are you trying
| to experiment with some kind of compiler architecture?
| r1chardnl wrote:
| Learning, hobby and fun.
|
| Aspiration to create a language that I would love to use some
| day.
| d6ba56c039d9 wrote:
| Here's an assignment for you.
|
| Have your compiler generate high quality assembly language
| for a DSP.
| bruce343434 wrote:
| Good luck with your project and language!
| r1chardnl wrote:
| Thanks, seeing it unfold whilst fixing and adding new features
| and compiling your program into a executable is really cool.
|
| Motivation is key though.
| thamer wrote:
| Just a few small suggestions:
|
| 1. Some code seems to be duplicated (although with slightly
| different formatting) across multiple files, like dd/dw/db in
| elf.c and x86.c for example. I'd suggest consolidating those.
|
| 2. It helps to define types when you have variables that can take
| a limited set of possible values; typedef works but enums are
| better. I'm thinking of variables like the `type` parameter to
| `primitive_data_type_size`, for example. The compiler can help
| you detect a missing case statement if you use an enum, but not
| an int.
|
| 3. It's a matter of preference, but typedef'ing your structs can
| make your code more concise and not have it littered with struct
| keywords.
|
| 4. You seem to have a mix of tabs and spaces in some files (e.g.
| in `elf.c`). I would recommend configuring your editor to use
| just one of the two or it'll start to be an issue as your code
| base grows.
|
| 5. On a related point, I'd suggest picking a formatting style and
| sticking to it. You have functions like `accept` in `ast.c`
| declared without spaces after the opening `(` and others like
| `read_file` in the same file that has spaces.
|
| Good luck! This is a great way to learn.
| MisterTea wrote:
| > _Programming language that compiles into a x86 ELF executable._
|
| > _a compiler somewhat resembling C_
|
| > _Add new or borrow from other language(s) features ontop of C,
| deviate away from just C_ function strlen(const
| char *s)
|
| This does not appear to be a C compiler. The title should reflect
| this.
| r1chardnl wrote:
| Can't edit my original reply anymore, but fair point, I've
| changed the keyword 'function' in the AST to a function return
| type instead. There's no error handling done yet to check
| whether the variable receiving the function return data matches
| the function return type, but it's on my todo list.
|
| https://github.com/riicchhaarrd/ocean/commit/0618e0810c8d437...
|
| Then again in C it would work aswell if you type casted the
| types. const char *f() { return
| (const char*)123; } int main() {
| int i = (int)f(); return 0; }
| tills13 wrote:
| Yeesh. This is probably the most pedantic comment I've read on
| HN.
|
| Contributes absolutely nothing to the conversation, here.
| r1chardnl wrote:
| The main goal at the moment is to create a C compiler, the
| function keyword there is just a placeholder for the time
| being. At the moment anything returned is just mov'd into the
| EAX register and then stored in a local variable if called that
| way.
| MisterTea wrote:
| If you're looking for C-like languages to study look up the
| plan 9 Alef language. It was a concurrent C-like that was
| designed to replace C until they realized its better off as a
| C library which is what plan 9's thread(2) is. It's mostly
| extinct but survives:
|
| http://doc.cat-v.org/plan_9/2nd_edition/papers/alef/
|
| https://seh.dev/go-legacy/
| spijdar wrote:
| Worth pointing out Plan9's C was itself sort of its own
| language. I think it was a strict subset, but I can't
| remember all the details. The preprocessor was much
| simpler, that much I remember.
| mananaysiempre wrote:
| Not exactly a subset, there were at least the anonymous
| struct/union members that partially made it into C11 much
| later.
|
| But then if we look beyond the syntax almost any C
| implementation is "its own language" in that it defines
| behaviour the standard doesn't specify, and frequently
| also changes behaviour the standard does, if in minor
| ways. Like how the Windows ABI forces a noncompliant
| wchar_t.
|
| (Even if we look at the preprocessor, the classic
| algorithm implemented in most places [that I can't be
| bothered to find a link for] in fact expands more than
| the standard strictly guarantees, _e.g._ you can
| sometimes get recursive expansion out of it by creative
| use of token pasting,--I remember reading the ANSI
| committee thought the case was too "perverse" and didn't
| specify anything [no link here as well, sorry... poke me
| again if you actually care about this]. The standalone
| preprocessor mcpp has warnings about this specification
| hole, but other implementations don't as far as I know.
| And you'd think the preprocessor, an almost purely
| syntactic thing, wouldn't have implementation differences
| worth keeping around.)
| MisterTea wrote:
| Yes, Plan 9 c has its own C library which does not follow
| ANSI/ISO though it is very similar to ISO C. Its syntax
| is pretty much c89 and the useful bits of c99. To build
| ANSI/POSIX you use the APE (ANSI/POSIX Environment)
| compilers and libraries which are built in. It's only
| reason for existing is to help with porting Unix baggage.
|
| The compiler is kencc which is why it is different. Plan
| 9 also still uses a.out for binary images.
| kubb wrote:
| Oh, is _this_ where the inspiration for Go came from? It
| would explain a lot.
| kragen wrote:
| Yes. Even before Alef, Newsqueak (01989) was almost
| identical to Golang, and of course Alef, Limbo, etc.,
| descend from Newsqueak.
| r1chardnl wrote:
| Looks really interesting, thanks for sharing this.
|
| I'm not sure whether I'll incorporate everything from C or
| just make it compatible with C and behind the scenes do
| things differently e.g instead of carrying const char
| */string pointers everywhere always carry the length of
| said data structure along with it. Or for instance push an
| additional pointer on the stack for memory zones, where an
| allocator is always present and knows it's context, sort of
| like __thiscall with class methods for this->.
| jeffrallen wrote:
| If you are exploring alternatives to the stdlib, you might
| commit yourself to no zero terminated strings and see where
| it takes you. A fundamental data type with a pointer, a
| length and a capacity does a lot to stop buffer overflows.
| guntars wrote:
| To replace C-style strings you only need the length! You
| also get cheap substrings which is such a common operation
| that it alone IMO makes it worth taking this route.
| justinlloyd wrote:
| Great job! Writing a compiler from scratch is a massively
| valuable learning experience. I wrote one in BASIC for the 6502,
| that became self-hosting on April 24th, 1982. I'm going to watch
| your project with interest and relive old memories.
| kinga254 wrote:
| Any specific resources you've used that you could recommend in
| line with compiler construction?
| optymizer wrote:
| I wrote a C compiler* using flex [1] and bison [2] that
| generated MIPS code. The biggest downside was the rather
| complex glue code.
|
| At some point ANTLR [3] looked promising, but these days I'd
| probably write a lexer and recursive descent parser by hand,
| then generate LLVM IR [4].
|
| [1] https://github.com/westes/flex
|
| [2] https://www.gnu.org/software/bison/
|
| [3] https://www.antlr.org/
|
| [4] https://llvm.org/
|
| * for a subset of the full C89 spec
| blcArmadillo wrote:
| This is specific to interpreters but the beginning part related
| to parsing and AST is common:
| http://craftinginterpreters.com/contents.html
| MH15 wrote:
| I've had good luck with ANTLR4 and ObjWeb ASM for my JVM
| language Imp https://github.com/mh15/imp
| r1chardnl wrote:
| You could use https://en.wikipedia.org/wiki/Yacc
|
| I decided to write my own lexer/parser/compiler it's pretty
| straightforward.
|
| Sources I use(d) to program this, e.g look at resulting
| assembly from other compilers or look at how the AST is
| generated for other languages.
|
| https://en.wikipedia.org/wiki/Recursive_descent_parser
|
| https://en.cppreference.com/w/c/language/operator_precedence
|
| https://godbolt.org/
|
| https://astexplorer.net/
|
| I didn't start writing C yesterday, and most of the things are
| just stuff I've learned over the years and seeing how other
| languages work and then just use what I know to try to program
| in a logical manner.
|
| Also I've written a implementation of a scripting language
| (compiler and VM) before in similar fashion.
|
| https://github.com/riicchhaarrd/gsc
| [deleted]
| [deleted]
___________________________________________________________________
(page generated 2021-08-03 23:00 UTC)