Wak Awk Implementation
======================

* An awk implementation
* Writing an awk
* Parsing awk is tricky
* Parsing awk
* Runtime internals
* Implementing regex RS
* Testing awk

An awk implementation
=====================

Rob Landley's toybox project provides a variety of Linux command-line tools, similar to busybox. I have written a compact but fairly complete awk implementation intended to integrate with toybox, but it can also build standalone.

This implementation is named wak, because all the good awk names are taken. But when used in toybox, it's just awk, or toybox awk.

wak is coded in C99. It uses mostly the C standard library, aka libc, but also needs some POSIX support, mostly for regular expressions. It is not public domain, but it does have Landley's very liberal 0BSD license.

These pages describe my awk implementation, but are not complete. The implementation is on GitHub.

Writing an awk
==============

Introduction
------------

I have written a compact but fairly complete awk implementation intended to integrate with Rob Landley's toybox project, but it can also build standalone. These pages document some aspects of this effort.

This implementation is named wak, because all the good awk names are taken. But when used in toybox, it's just awk, or toybox awk. It is not public domain, but it does have Landley's very liberal 0BSD license.

Who is this written for?
------------------------

This is written for anyone who wants to understand how wak works internally. It's partly for my own use, to document what I've done so I can find it later. It's also for anyone curious about how awk can be implemented, including some of the problems I encountered and how I dealt with them.

To understand this, it helps to know awk pretty well and to know a bit about how a lexical analyzer (aka lexer / scanner / tokenizer) and a recursive-descent parser work.

There is a POSIX specification for awk, and wak should conform to it with some exceptions (mostly) documented here. The reader should probably be familiar with the POSIX spec to follow this discussion, or the wak code.

Other implementations
---------------------

There are several implementations of awk readily available, which I have been testing my implementation against.

The original awk, written by Al Aho, Peter Weinberger, and Brian Kernighan, is still being maintained and was even updated recently with new features. Kernighan (bwk) calls it the One True Awk, and it's often referred to as nawk (new awk, as compared with the pre-1985 version, sometimes called old awk, which lacked many essential features). Around September 2023, nawk was updated with some new features, primarily around UTF-8 handling and the ability to read CSV (comma-separated values) files. I have used this later version in my testing, and refer to it here as nnawk (new new awk).

Many Linux distros include gawk (GNU awk, maintained for decades by Arnold Robbins). Some distros (like Raspberry Pi's Raspbian) have mawk (Mike Brennan's awk, now maintained by Thomas Dickey). Busybox has an awk implementation by Dmitry Zakharov, written in 2002 and currently maintained. More recently, Ben Hoyt wrote goawk, an implementation in the Go language.

Documents
---------

The 1985 version of nawk is documented in The AWK Programming Language (1988), by Aho, Kernighan, and Weinberger. I'll refer to that edition as the AWK book. Since nnawk (aka One True Awk or OTA) was enhanced in 2023 by Kernighan, the book has been issued in a second edition, available at Amazon.
I'll refer to that as the AWK book 2nd Ed.

In addition to The AWK Programming Language (1st and 2nd editions) and the POSIX spec mentioned above, the gawk manual by Arnold Robbins, titled GAWK: Effective AWK Programming, is a great resource. It documents gawk in detail, including "standard" awk and gawk-only extensions, and it also explains how to use awk. The explanations of awk's "dark corners" and many details omitted from the AWK book helped immensely in developing wak.

What makes a proper awk?
------------------------

None of the existing implementations follow POSIX to the letter, and no two of them behave exactly the same. As I understand it, the POSIX awk spec is largely based on the description in the AWK book, but there are some differences. The POSIX spec is more precise than the description in the AWK book, but not as detailed as, for example, the language spec for C. The grammar in the POSIX spec reflects the actual nawk implementation more closely than the book does, but is not based directly on the yacc/Bison grammar used by nawk. Where POSIX differs from nawk, most implementers seem to follow the actual behavior of nawk rather than POSIX.

wak intends to follow the POSIX spec as well, but there are departures. For example, wak does not handle locales except to the extent that toybox does.

Some restrictions imposed by toybox
-----------------------------------

Toybox is licensed 0BSD, Rob Landley's approach to having copyrighted code that is as free and as close to public domain as possible; see the link for more information. Rob does not accept code that has more restrictions than the 0BSD license, so he can't adopt any other current awk implementation. wak uses the 0BSD license.

Rob is adamant that it not have any outside dependencies except "standard" libc (plus some POSIX elements such as regular expression functions), so using yacc/Bison or lex/flex is out, and toybox awk has to use a hand-written lexical analyzer and parser.

In order to avoid any "contamination" of wak code by other implementations, I did not look at the code of other implementations. I did look at the yacc/Bison grammar for nawk, only to attempt to understand the actual language accepted by the original awk.

Program structure
-----------------

The design is a classic "scan, compile to internal format, interpret" interpreter. The compiler is a recursive descent parser with code generation rolled directly into it. It uses a Pratt/precedence-climbing approach for parsing expressions. It is a one-pass design, with a few "fixups" needed to deal with awk allowing functions to be called before being defined, and also to handle a few difficulties with the awk grammar.

The internal format is a "virtual machine" where the instructions are each a "word" (int). For no special reason, I call my generated format "zcode". The zcode machine is a straight stack machine.

Lexical analysis
----------------

The current wak lexical analyzer, scan.c, is about 300 LOC (counting non-blank, non-comment lines). It's a fairly typical scanner design, with functions to get tokens (numbers, strings, variables, keywords, regular expressions), look up keywords, etc. It also handles reading the program text given on the command line or stepping through the files of awk code specified on the command line.

The state of the scanner is maintained in a global of type struct scanner_state, updated as each token is scanned. This includes the state of the file being read (name, line number, etc.) and the particular token scanned, along with information about the token. The tokens are signified by integer values from an enum with names of the form tk..., such as tkfor, tkwhile, tklt (<), tknumber, tkstring, etc.
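As a rough illustration (the field names and most of the enum members here are invented for this sketch; only the tk... naming convention is wak's), the scanner state might look something like this:

    // Sketch only; wak's actual declarations differ.
    enum toktype {
        tkeof, tknl, tksemi,          // end of input, newline, semicolon
        tknumber, tkstring, tkregex,  // literals
        tkvar,                        // variable names
        tkfor, tkwhile, tkif,         // keywords ...
        tklt, tkdiv, tkdivasgn,       // operators: <  /  /=  ...
    };

    struct scanner_state {
        char *filename;   // current awk source file (or command-line text)
        int lineno;       // current line number, for error messages
        int tok;          // token just scanned (a tk... value)
        char *tokstr;     // token text, e.g. a string, regex, or name
        double numval;    // numeric value when tok == tknumber
        int prevtok;      // previous token, used to disambiguate '/'
    };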
One interesting detail is the handling of regular expressions (hereafter just "regex"). The language allows a literal regex specified as / ... regex here .../. The / symbol also indicates division, and /= is the divide-and-assign operator, just as in C. To disambiguate, the POSIX spec says:

> When an input sequence begins with a character in any syntactic
> context where the token '/' or DIV_ASSIGN could appear as the next
> token in a valid program, the longer of those two tokens that can
> be recognized shall be recognized. In any other syntactic context
> where the token ERE could appear as the next token in a valid
> program, the token ERE shall be recognized.

(DIV_ASSIGN is the `/=' token, and ERE means a POSIX Extended Regular Expression token.)

I asked Arnold Robbins and Ozan Yigit (the current OneTrueAwk maintainer) if I am correct that a `/' or `/=' token can appear only after a number, string, variable, getline keyword, right parenthesis `)', right bracket `]', or an increment or decrement operator `++' or `--'. I intended to have the scanner assume a `/' character after any of these means divide, and that a `/' in any other context indicates a regex literal. Neither gave me a definitive answer to that, but Ozan did not think it a good idea to have the scanner make the decision. He said "letting the parser inform the scanner what to collect is a reasonable thing to do" and that "both ota and gawk do exactly that for collecting regular expressions." That implied to me that the parser would have to tell the scanner when it is expecting a regex, which seemed to me trickier than having the scanner do as I suggested. Perhaps I was wrong, but I did not want to look at the code for OTA (nawk) or gawk to see just how that is done. I have taken the approach I described, and have not yet run into a problem with valid awk code being scanned or parsed wrongly as a result.

Internal format and zcode machine
---------------------------------

Each awk value is a zvalue struct, which can be a scalar, a literal regular expression (regex for short), or an array. A scalar can have a numeric value (C double) or a string value or both. A string (zstring) is a struct that is reference counted and can hold an arbitrary number of bytes. An array is a pointer to a zmap struct that holds a pointer to a hash table. (I often refer to the awk array structure as a map.) A regex is a pointer to a regex_t compiled POSIX regex. Because the string, array, and regex values are mutually exclusive within an awk variable, I use a union to hold them. Here are the zvalue and zstring structures:

    // zvalue: the main awk value type.
    // Can be number or string or both, or else map (array) or regex
    struct zvalue {
      unsigned flags;
      double num;
      union { // anonymous union not in C99
        struct zstring *vst;
        struct zmap *map;
        regex_t *rx;
      };
    };

    // zstring: flexible string type.
    // Capacity must be > size because we insert a NUL byte.
    struct zstring {
      int refcnt;
      unsigned size;
      unsigned capacity;
      char str[];   // C99 flexible array member
    };

The anonymous union in struct zvalue is not valid C99, but gcc accepts it, giving a warning with -Wpedantic. I may fix this, but maybe it would just clutter up the code with extra .z characters for no really good reason. This anonymous union is the only non-standard aspect of wak code, as far as I know. With all gcc warnings turned on, it compiles with no other warnings.
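Since zstring is reference counted, most of the string lifetime management reduces to hold/release helpers and a constructor. Here is a minimal sketch of how that can work with the zstring struct above; the function names and error handling are my own illustration, not necessarily wak's:

    #include <stdlib.h>
    #include <string.h>

    // Take a new reference to a zstring (sketch).
    static struct zstring *zstring_hold(struct zstring *s)
    {
        if (s) s->refcnt++;
        return s;
    }

    // Drop a reference; free the string when the last reference goes away.
    static void zstring_release(struct zstring *s)
    {
        if (s && --s->refcnt == 0) free(s);
    }

    // Make a new zstring holding n bytes plus a terminating NUL.
    static struct zstring *zstring_new(char *data, unsigned n)
    {
        struct zstring *s = malloc(sizeof(*s) + n + 1);
        if (!s) exit(2);          // real code would report the error
        s->refcnt = 1;
        s->size = n;
        s->capacity = n + 1;      // capacity > size, for the NUL byte
        memcpy(s->str, data, n);
        s->str[n] = 0;
        return s;
    }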
The stack, several compiler tables, and also the actual value structures of the arrays are held in expanding sequential list structures called zlist:

    // zlist: expanding sequential list
    struct zlist {
      char *base, *limit, *avail;
      size_t size;
    };

The size member holds the size of each entry in the list (entries are also referred to as slots). base points to the base of the list; limit points just past the end of the list, and avail points to the first unused slot. The list is reallocated to be 50% larger as needed as it fills up.

The map (awk array) structure is the zmap hash table:

    // zmap: Mapping data type for arrays; a hash table. Values in
    // hash are either 0 (unused), -1 (marked deleted), or one plus
    // the number of the zmap slot containing a key/value pair. The
    // zlist slot entries are numbered from 0 to count-1, so need to
    // add one to distinguish from unused. The probe sequence is
    // borrowed from Python dict, using the "perturb" idea to mix in
    // upper bits of the original hash value.
    struct zmap {
      unsigned mask;      // tablesize - 1; tablesize is 2 ** n
      int *hash;          // (mask + 1) elements
      int limit;          // 80% of table size ((mask+1)*8/10)
      int count;          // number of occupied slots in hash
      int deleted;        // number of deleted slots
      struct zlist slot;  // expanding list of zmap_slot elements
    };

The zlist slot member holds the actual values of the hash table:

    // Elements of the hash table (key/value pairs)
    struct zmap_slot {
      int hash;   // store hash key to speed hash table expansion
      struct zstring *key;
      struct zvalue val;
    };

The zmap hash member points to 2**n int values that in turn index the zmap_slot structs, each of which holds a zvalue, a key as a zstring, and its hash (to speed rehashing, a la Python). The hash table grows by doubling as needed, starting at 8 slots, and uses a probe sequence borrowed from Python:

    *probe = (*probe * 5 + 1 + (perturb >>= PSHIFT)) & m->mask;

See the comments in the Python dict source code for details; wak uses a simplified version of this approach.
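To make the probe sequence concrete, here is a sketch of a lookup over the zmap structures above. The zstring_cmp helper is an assumption, PSHIFT 5 follows Python's PERTURB_SHIFT, and wak's actual code also handles insertion and reuse of deleted entries:

    #define PSHIFT 5    // like Python's PERTURB_SHIFT; illustrative

    static int zstring_cmp(struct zstring *a, struct zstring *b); // 0 if equal

    // Sketch: return the index in m->hash where key lives, or -1 if absent.
    // Entries are 0 (unused), -1 (deleted), or zmap slot number + 1.
    static int zmap_find(struct zmap *m, struct zstring *key, int hashval)
    {
        unsigned probe = hashval & m->mask;
        unsigned perturb = hashval;
        struct zmap_slot *slots = (struct zmap_slot *)m->slot.base;
        for (;;) {
            int h = m->hash[probe];
            if (h == 0) return -1;    // unused entry: key is not present
            if (h > 0 && slots[h-1].hash == hashval
                    && !zstring_cmp(slots[h-1].key, key))
                return probe;         // found it
            // deleted entry or collision: keep probing, Python-dict style
            probe = (probe * 5 + 1 + (perturb >>= PSHIFT)) & m->mask;
        }
    }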
Parsing awk is tricky
=====================

What is the grammar and meaning of awk code? Maybe a yacc expert could know. If you only know what the AWK book (by A, K, & W, 1st or 2nd Ed.) says, you'd be missing some details. For example, the AWK book and many other documents say a for statement is:

    for (expression1; expression2; expression3) statement

(where any of the three expressions can be empty). But try this:

    awk 'BEGIN{for (; x++ < 3; print x);}'

(using gawk, nawk, nnawk, or goawk; mawk and bbawk (busybox awk) fail). This works because expression1 and expression3 are actually simple_statement, which can be an expression, delete statement, or print statement. (This is in both the yacc grammar for nawk/nnawk and in the POSIX reference grammar.) Why? Maybe no one knows. Maybe it somehow made the syntax better for yacc. I asked gawk maintainer Arnold Robbins about this, and he suggested

> I can imagine something like
>
>     for (printf("Enter data: ");
>             getline data > 0;
>             printf("Enter data: ")) {
>         ## stuff with data
>     }

But even if it could be useful, very few users would ever know about it, because it's undocumented outside the formal grammar.

A few interesting cases
-----------------------

Case 1: Consider this:

    awk 'BEGIN{if (1 < 2 && x = 3) print x}'

According to the POSIX grammar this should be a syntax error, but it prints 3. The spec says:

> This grammar has several ambiguities that shall be resolved as
> follows: Operator precedence and associativity shall be as
> described in Expressions in Decreasing Precedence in awk.

Assignment = has lower precedence than &&, which has lower precedence than <. The statement should parse as if (((1 < 2) && x) = 3), and the expression to the left of = must be an lvalue, which it clearly is not. (The additional parentheses are notional; no parentheses are allowed around an lvalue.) The rules are similar in C, and a statement like this is a syntax error in C.

Case 2: Another odd one, adapted from the gawk test getline.awk. Running this in bash (as e.g. ./getline_test1.sh gawk):

    #!/usr/bin/bash
    awk=$1
    echo -e 'A\nB\nC'|$awk 'BEGIN {
        x = y = "!"
        a = (getline x y); print a, x
        a = (getline x + 1); print a, x
        a = (getline x - 2); print a, x
    }'

with any of gawk, nawk, nnawk, goawk, bbawk gives the same result:

    1! A
    2 B
    -1 C

But running this (as getline_test2.awk):

    BEGIN {
        x = y = "!"
        cmd = "echo A"; a = (cmd | getline x y); close(cmd); print a, x
        cmd = "echo B"; a = (cmd | getline x + 1); close(cmd); print a, x
        cmd = "echo C"; a = (cmd | getline x - 2); close(cmd); print a, x
    }

gives the same result as above in gawk, mawk, and bbawk, but goawk gets a parsing error on the "echo A" line, and nawk/nnawk gives:

    1! A
    11 B
    1-2 C

This seems really odd to me. The original awk (nawk/nnawk - Kernighan's OneTrueAwk) apparently parses (cmd | getline x y) as ((cmd | getline x) y), but parses (cmd | getline x + 1) as ((cmd | getline x) (+ 1)) and (cmd | getline x - 2) as ((cmd | getline x) (- 2)). (That is, the y, the + 1, and the - 2 are treated as strings concatenated with the result of (cmd | getline x).) This is despite the fact that the right side of a concatenation excludes an expression beginning with + or -, according to POSIX, as it must, because otherwise a + b would be ambiguous between (a + b) and (a) (+b).

Case 3:

    BEGIN {$0 = "2 3 4"; $$0++; print}

prints 3, except bbawk says "Unexpected token", and my awk as of this writing prints 2 4 4. Which is "correct"? Apparently nawk, gawk, mawk, and goawk work as follows (bbawk gets a syntax error): in $$0++, the $0 is "2 3 4" but the ++ forces it numeric, as 2, so the post-increment applies to $0, which sets the entire record to 3. The subsequent outer $ selects field 3 of $0, which does not exist, so the value of $$0++ is the empty string. My awk tries to obey the precedence rules in POSIX, which have $ at higher precedence than ++, so it evaluates $$0 first as $2, selecting the second field, and then the ++ increments it to 4.

But wait, there's more! Try

    BEGIN {a = 2; b[1] = 2; $0 = "33 44"; print $a; $a++; print; print $b[1]; $b[1]++; print b[1]; print}

In gawk, mawk, goawk, bbawk, and my awk, it prints

    44
    33 45
    45
    2
    33 46

But in nawk/nnawk, it prints

    44
    33 45
    45
    3
    33 45

What's happening in nawk/nnawk is: $a++ increments the field $a (i.e. $2), but $b[1]++ increments the array value b[1] and does not change the field. All the other awks increment field 2 in both cases, as one might expect. Once again, the original One True Awk's behavior seems inconsistent and not easily discernible from the yacc grammar, nor from the documentation.

Case 4:

    BEGIN {$0="3 4 5 6 7 8 9"; a=3; print $$a++++; print}

prints

    7
    3 4 6 6 8 8 9

except bbawk and my awk currently indicate a syntax error.
Similar to case 3 above: $a selects field 3, which is 5, and increments it, but then $$a++ is a reference to field 5, which is 7, and that is printed as the value of $$a++++; but then that field is incremented, giving $0 the value "3 4 6 6 8 8 9". My awk (and apparently bbawk?) parses $$a++++ as if it were (($($a))++)++ and rejects the outer ++ because (($($a))++) is a value, not an lvalue. (The parentheses here are again notional; an lvalue cannot be in parentheses.)

Case 5:

    BEGIN {a[y] = 1; x = y in a + 2; print x}

prints:

* an empty line in bbawk
* 3 in wak
* 12 in nnawk/nawk (!)
* a syntax error message at `+' in gawk, mawk, goawk

Apparently nnawk/nawk treats the + 2 as a string to be concatenated with the 1 from evaluating y in a as true.

What causes these peculiarities? Aho, Weinberger, and Kernighan have written that

> The development of awk was significantly shortened by using UNIX
> tools. The grammar is specified with yacc; the lexical analysis is
> done by lex. Using these tools made it easy to vary the syntax of
> the language during development.

In The Unix Programming Environment, Kernighan and Rob Pike wrote (footnote, p. 254):

> The yacc message "reduce/reduce conflict" indicates a serious
> problem, more often the symptom of an outright error in the grammar
> than an intentional ambiguity.

Compiling One True Awk (nawk/nnawk) gets 85 reduce/reduce conflicts in addition to 44 shift/reduce conflicts. The original yacc grammar for awk has many ambiguities.

I don't know much about yacc yet, but I understand it's basically an LALR(1) parser generator, with additional features to resolve what would otherwise be ambiguities. I suspect that these features make it hard to understand exactly what the parser does if you are not a yacc expert. The actual awk language is determined by what the parser accepts and what it does with that. My guess is that there may not be any true LALR(1) grammar for awk, maybe no LR(1) grammar, and probably no LL(1) grammar either. This makes writing a recursive-descent parser by hand somewhat difficult.

The POSIX grammar is in a yacc-like format and was obviously written with some reference to the original yacc grammar for awk. I can say that with some confidence because it includes the for (simple_statement ...) business discussed above, which one would never know about without studying the original yacc grammar. But as shown above, existing implementations do not always follow the POSIX grammar, and when they differ they usually follow the original awk behavior.

I asked Arnold Robbins (gawk maintainer and contributor to One True Awk) about case 1. He replied "The answer, as unsatisfying as it may be, is that Awk has always worked this way."

Also regarding case 1, Ben Hoyt had the same issue in goawk. He resolved it with a commit that included this comment:

> The other awks support this by using a yacc grammar which supports
> backtracking, and as Vitus13 said on reddit: "If there are two
> syntactically valid parsings and one is a semantic error, the error
> handling may resolve the ambiguity towards the valid parsing. In
> this case, you can only assign to L values, so trying to assign to
> (1&&x) doesn't make any sense."

I believe this is a misconception; yacc does not support backtracking (though there is a different program, btyacc, that does). I think Ben and Vitus13 are wrong here: yacc does not see multiple syntactically valid parsings; it sees only the parsing that it performs deterministically, and many of the problems in understanding what nawk/nnawk does are due to the way yacc resolves conflicts.
Parsing awk
===========

I wanted to try parsing awk using recursive descent along with Pratt's method of handling expression syntax. Pure recursive descent without backtracking requires a grammar meeting certain requirements, specifically LL(1). The reference grammar for POSIX awk does not seem to meet this, and I am doubtful a correct LL(1) awk grammar exists. There are some difficulties discussed below, so a bit of finagling in the parser is needed.

If you are not familiar with Pratt parsing, there are several tutorials and discussions around on the Web. Vaughan Pratt introduced the idea in a 1973 paper; Douglas Crockford later sort of rediscovered and popularized it. The basic idea is that instead of having a separate routine to parse each operator or precedence level, as in pure recursive descent, you have a main "expression" routine with a loop that runs through the tokens of the expression, calling "nud" and "led" routines for each operator; those routines can recursively call the main expression routine to process subexpressions with the same or higher operator precedences. The "led" and "nud" (left denotation and null denotation, in Pratt's terms) routines are for operators that take a left operand (e.g. infix and postfix operators) or no left operand (prefix operators), respectively.

I was blocked for a time by the concatenation operator being present by its absence. That is, in awk there is no explicit operator for string concatenation; it happens when two expressions appear adjacent with no operator between them. I figured I had to detect it and insert it by looking for places where a symbol that can end an expression is followed by a symbol that can start an expression. But the ++ and -- operators threw me. They can either end or begin an expression. So I could not see how to have ++ and -- as both prefix and postfix operators, that is, handled by both the "nud" and "led" routines. Then I saw that if they followed an lvalue they were always taken as postfix. I modified my "nud" routine, which was handling prefix operators, to also consume ++ / -- operators if they followed an lvalue.

I also had my main expression routine expr() look for the start of an expression where it could be the right side of a concatenation. Note that the right side cannot start with + or -, or there would be confusion between a + b and a (concatenated with) +b. If you look at the POSIX grammar for awk, you'll see a long list of nearly identical productions for unary_expr and non_unary_expr that are needed solely to exclude the right side of a concatenation beginning with unary + or -. There is nothing like that in the yacc grammar for nawk/nnawk/OneTrueAwk, and I'm not clear on how yacc handles this.

But now my "nud" routine was processing not only prefix operators and plain variables, but also subscripted variables (array references), as well as postfix ++ and --. By the time I got it working, the "nud" routine was also handling the $ field reference operator and function calls, and the parenthesized expression list normally seen with the array-subscript test expression (expr, expr,...) in arrayname. The supposed Pratt parser was looking more similar to a precedence-climbing parser, though it still has a Pratt-like main loop routine for expressions and still uses the "left binding power" and "right binding power" concepts from Pratt parsing. Consequently, I renamed the nud function primary() and the led function binary_op().
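In outline, the main loop then looks something like the following. This is a simplified sketch only: wak's actual expr() also returns expression counts, takes the CALLED_BY_PRINT flag described below, and handles concatenation; curtok and advance() are illustrative names.

    static int curtok;             // current token (maintained by the scanner)
    static void advance(void);     // fetch the next token into curtok
    static void primary(void);     // parse whatever can start an expression
    static void binary_op(int op); // compile one infix/postfix operator
    static int getlbp(int tok);    // left binding power of a token

    // Sketch of the Pratt-style expression loop. rbp is the right
    // binding power of the operator that invoked us: keep consuming
    // operators while the next token binds more tightly to the left.
    static void expr(int rbp)
    {
        primary();                      // leftmost operand
        while (getlbp(curtok) > rbp) {
            int op = curtok;
            advance();                  // consume the operator token
            binary_op(op);              // parses the right side, emits zcode
        }
    }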
I hacked code into the expression routine expr() to handle the if (1 < 2 && x = 3) case mentioned in the previous section. (This is also an issue for some similar cases, such as 1 && x = 3 and 2 < x = 3.) It's not elegant, but it seems to work correctly.

Statement termination
---------------------

Another feature of awk is that statements normally end at the end of a line and do not require a semicolon. Some constructs require a semicolon in certain places: in if (condition) statement1; else statement2, where the else appears on the same line as statement1 and statement1 is a single statement (i.e. not a compound statement in {braces}); and in do statement; while (condition), where the while appears on the same line as the statement and the statement is not enclosed in braces. Multiple statements on the same line must be separated with semicolons.

The awk(1) man page says "Statements are terminated by semicolons, newlines or right braces." Actually, all statements must be terminated with a newline or semicolon, except the last statement before a closing right brace in a brace-enclosed statement list. The POSIX grammar has complications due to this, in that it has productions for terminated_statement_list, unterminated_statement_list, terminated_statement, unterminated_statement, and terminatable_statement.

The nawk/nnawk grammar avoids all this via a lexical trick. Since it is never wrong to put a semicolon before a closing brace, and it cannot change the meaning of a correct program, the nawk/nnawk lexical analyzer returns a faked semicolon before each closing brace. Then it can use a much simpler grammar structure. But you can detect this trick, because it can also make an incorrect program correct! Try

    awk 'BEGIN{while(n-->0);}'

with any awk and it will do nothing, silently. But

    awk 'BEGIN{while(n-->0)}'

(note the missing semicolon empty statement) works in nawk/nnawk but gets a syntax error in gawk, goawk, bbawk, and wak. (It also works in mawk, indicating that mawk uses the same lexical hack.)

I initially intended to do the same, but it seemed too hacky. Instead, when I encounter an else or the while of a do ... while statement, I look at the previous token to see if it's any of newline, semicolon, or closing brace. This has the equivalent effect and also avoids overly complicating the grammar and parsing logic.
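The check itself is tiny; it amounts to something like this (the helper name is mine; the tk... names follow the convention described earlier):

    // Sketch: the statement just parsed is properly terminated if the
    // previous token was a newline, a semicolon, or a closing brace.
    static int statement_terminated(int prevtok)
    {
        return prevtok == tknl || prevtok == tksemi || prevtok == tkrbrace;
    }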
Some other parsing-related problems
-----------------------------------

print and printf: The print and printf statements have some issues. They can be followed by one or more expressions, which may or may not be enclosed in parentheses. So you may encounter:

    print           # by itself, means print $0, the entire record (case 1)
    print 3, 4      # prints 3 4 (case 2)
    print(3, 4)     # same effect (case 3)
    print(1+2), 4   # same effect (case 4)

When all the parser has seen is print (, it cannot know if it's case 3 or 4 above, so it cannot know whether to look for an expression (case 4) or a list of expressions (case 3). Also, when it sees a list of expressions (case 3), it still cannot decide whether it's a list to be printed or the beginning of a subscripted in expression, as in print(3, 4) in a.

Any of the above print statements can be followed by a redirection, as in print 3, 4 > "some_file". This would make print "def" > "abc" ambiguous, so the language spec says that any comparison operator in a print statement must be inside parentheses, as in print("def" > "abc"), or print("def" > "abc"), 123 > "outfile". Actually, only the original nawk/nnawk enforces that for operators other than >. Other awks allow print "abc" < "def" > "outfile", for example.

I had a nasty kluge of switches and state variables in place to deal with all this, but have now simplified it somewhat. In the getlbp() function (which returns the Pratt left binding power), I have logic that says: if the parser is inside a print/printf statement, and not inside parentheses, and has a > token, return 0, which will end any expression. Then the > will be correctly treated as a redirection operator.
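That logic amounts to something like the following sketch (the flag and table names are illustrative, not wak's actual identifiers):

    static int in_print_stmt, paren_depth;  // maintained by the parser
    static int lbp_table[NUM_TOKENS];       // per-token left binding power

    // Sketch: Pratt left binding power, with the print special case.
    // Returning 0 for '>' inside an unparenthesized print/printf ends
    // the expression, so the '>' is then handled as output redirection.
    static int getlbp(int tok)
    {
        if (tok == tkgt && in_print_stmt && paren_depth == 0)
            return 0;
        return lbp_table[tok];
    }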
The primary() function returns the number of expressions in a parenthesized expression list, or 0 (or -1 for a potential lvalue, which is used to help with the (1 && a = 2) problem mentioned earlier). The print_stmt() function calls the expression parser as expr(CALLED_BY_PRINT), where CALLED_BY_PRINT is a "magic" value that flags the expr() function to see if the initial primary() call returns a statement list (>= 2) followed immediately by a token that can end a print statement (`>', `>>', `|', `;', `}', or newline). If so, it returns the expression count from primary() to print_stmt(); otherwise it continues parsing what is expected to be the first (or possibly only) expression of the print statement. I believe this correctly handles the print/printf statement.

Data types and scope
--------------------

There are two main data types in awk: scalar and array, or maybe three if you count regex literals (/regular expression here/) as a separate type. Scalars can be numeric (always floating point), string, or both, which might be considered sub-types. You may consider some special values, numeric string (defined in POSIX) or uninitialized, as sub-types as well. Arrays are associative, "subscripted" by string values. I often refer to them as maps; they are implemented with hash tables in wak.

The scope rules are pretty simple. Function formal parameters (variables supplied in the function definition function name(param1, param2, ...)) are local to the function, as far as name scoping is concerned. All other variables are global. When a function is called with a scalar variable, the call is by value, and changes made to it within the function have no effect outside the function. But when a function is called with an array variable, changes made to the array are visible outside the function.

A function can be called with fewer arguments than it has defined as formal parameters. The "extra" parameters will then be uninitialized local variables inside the function, and this is the only mechanism for defining local variables in a function.

This presents some problems for a one-pass parser. When the parser first encounters a variable, its type (scalar or array) is noted depending on whether or not it is followed by a subscript, or whether it appears after the in operator. With this:

    BEGIN{f(a); g(a)}

a one-pass parser can't know if a is scalar or array. If it's an array, it needs to be set up before calling f(a) so awk can pass the array to g(). In

    BEGIN{f(a); g(a)}; function f(x){x[1]=3}; function g(x){print x[1]}

goawk, gawk, mawk, nnawk (OneTrueAwk), and bbawk all work. But worse, it is possible for a function to be called with a scalar sometimes and an array (as the same parameter) other times. For example:

    BEGIN{f(1); f(0)}
    function f(a, b){if (a) g(b); else {h(b); i(b)}}
    function g(x){print x}
    function h(x){x[1] = 37}
    function i(x){print x[1]}

Here, gawk / nnawk / bbawk all print an empty line and then 37; mawk and goawk complain about type errors. POSIX says "The same name shall not be used within the same scope both as a scalar variable and as an array." But I don't see how that can be determined during compilation of f(a, b) in a one-pass parser. I'm curious how this works inside gawk, nnawk, and bbawk, but I'm not going to look. In compiling f(a, b), g(b) will pass a scalar, and h(b) and i(b) will pass an array.

In my parser, if a variable appears only as a "bare" variable name in a function call, it is left "untyped" (actually, it gets a ZF_MAYBEMAP flag). When it is pushed on the stack at runtime, an empty map is attached to it. Then if it is used as a scalar, it is converted to a scalar, and if it is used as an array, it is converted to an array.

Error recovery
--------------

I am using D.A. Turner's error recovery technique to handle syntax errors and continue parsing. Dick Grune's Annotated Bibliography of papers on parsing includes this reference for Turner's method:

> Turner, D. A. Error diagnosis and recovery in one pass compilers.
> Inform. Process. Lett., 6:113-115, 1977. Proposes an extremely
> simple(minded) error recovery method for recursive descent parsers:
> when an error occurs, the parser enters a recovering state. While
> in this recovering state, error messages are inhibited. Apart from
> that, the parser proceeds until it requires a definite symbol.
> Then, symbols are skipped until this symbol is found or the end of
> the input is reached. Because this method can result in a lot of
> skipping, some fine-tuning can be applied.

Code generation
---------------

Code is generated directly in the parser, using the functions gencd() and gen2cd(). These take one or two arguments: an opcode and (for gen2cd()) an operand. These are integers that are simply appended to the zcode zlist, which expands as needed. The opcodes are usually the same as the tokens representing the program element involved, for example tkmul to multiply the top two values on the stack and replace them with the product. There are also about 25 additional opcodes that don't correspond to an awk token, such as oppush and opdrop (stack manipulation), opjump and opjumptrue (unconditional and conditional jumps in the zcode program), etc. The zcode machine and its operations are kept simple, and no attempt is made at code optimization, in order to keep the implementation small.
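In sketch form, gencd() and gen2cd() amount to little more than this (assuming a global ZCODE zlist and a zlist_append() that appends one int-sized slot; the names and signature are illustrative):

    static struct zlist ZCODE;  // the generated zcode program
    static void zlist_append(struct zlist *z, void *item); // adds one slot

    // Sketch: append opcodes and operands (plain ints) to the zcode list.
    static void gencd(int op)
    {
        zlist_append(&ZCODE, &op);  // the zlist expands itself as needed
    }

    static void gen2cd(int op, int operand)
    {
        gencd(op);
        gencd(operand);
    }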
Runtime internals
=================

Much more needed here...

Stack
-----

The runtime stack is a zlist of zvalue structs. All variables and temporary values are stored on the stack. The "special variables" (FS, NF, RS, CONVFMT, etc.) are stored at the bottom of the stack. Program globals are stored above that. As the program runs, values are pushed, popped, operated on, etc. Function call arguments and other related information, including variables local to functions, are put on the stack as well, as discussed below.

In earlier versions, the zlist_append() function was called for every stack push. This ensured that the stack would never overflow unless memory was exhausted, but profiling showed that this was expensive. Now, a healthy margin of stack is checked (only) at every function call, where the stack will be expanded as needed to ensure the margin is maintained. Nearly all awk statements are "stack neutral": the stack is at the same point before and after a statement, except for function calls. One exception is for (index in array) ..., which maintains some array iteration information on the stack inside the for loop. The margin MIN_STACK_LEFT (currently 1024) ensures that only a pathological statement will be able to overflow the stack, but it is possible. It would be possible to analyze the maximum stack usage of each statement during compilation and avoid any possibility of overflow, but I believe it is not worth the added complexity.

Function definition, call, return, stack frame design
-----------------------------------------------------

A function definition function f(a, b, c, ...) { ... } generates:

    tkfunction function_number
    (code for function body)
    tknumber uninitvalue
    tkreturn number_of_params

As each parameter is parsed, it is added to a table of local variables for this function. The tkreturn is added to return an uninitialized value if the code falls off the end of the function. When a return keyword is encountered, the expression following it is compiled and its value is left on the stack. If no expression follows the return, then a tknumber 0 is compiled to push a zero on the stack.

A function call f(a, b, c, ...) generates:

    opprepcall function_number
    (code to push args)
    tkfunc number_of_args

At runtime, these work as follows. The runtime keeps an index into the stack called parmbase (a local variable of the main interpreter function interpx(), initially 0) that points into the current call stack frame as follows:

    return_value       parmbase-4
    return_addr        parmbase-3
    prev_parmbase      parmbase-2
    arg_count          parmbase-1
    function_number    parmbase
    arg1               parmbase+1
    arg2               parmbase+2
    ...

A function call executes as follows: opprepcall pushes placeholder 0 values for the return value, return address, previous parmbase, and argument count. Then it pushes the function number (from the opprepcall function_number code sequence). The arguments are pushed onto the stack after that.
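As a sketch, the opprepcall handler in the interpreter loop might look like this, based on the description above (push_int() and the ip handling are illustrative, not wak's actual code):

    static int *ip;               // zcode instruction pointer
    static void push_int(int n);  // push a numeric zvalue (illustrative)

    // Sketch: set up the fixed part of a call frame. The argument pushes
    // compiled between opprepcall and tkfunc execute after this returns.
    static void do_opprepcall(void)
    {
        int funcnum = *ip++;    // operand of "opprepcall function_number"
        push_int(0);            // return_value placeholder  (parmbase-4)
        push_int(0);            // return_addr placeholder   (parmbase-3)
        push_int(0);            // prev_parmbase placeholder (parmbase-2)
        push_int(0);            // arg_count placeholder     (parmbase-1)
        push_int(funcnum);      // function_number           (parmbase)
    }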
tkfunc retrieves the number of arguments (arg count) from the tkfunc number_of_args code sequence, calculates where the parmbase should be by subtracting the arg count from the current stack top index, then fills in the return address (the offset of the next zcode word in the zcode table) and the arg count in the stack frame. Finally, the tkfunc op pushes the arg count on the stack and sets the ip (zcode instruction pointer) to go to the tkfunction op of the function definition.

tkfunction finds the local variable table for the function and gets the number of parameters. It then pops the arg count (number of actual arguments) from the stack, calculates the new parmbase (stack top index minus arg count), stores the previous parmbase in the stack frame, and sets parmbase to the new parmbase. Next, it loops to drop excess calling arguments if more args have been supplied in the call than there are params defined for the function. (NOTE: This is an error and should be caught at compile time! FIXME) Then, if the number of supplied args is less than the number of defined params, additional "args" are pushed to be used as local variables by the function. This is where the "maybe map" variables may have to be created, as explained in "Parsing awk".

When the function returns, either via a return keyword or by "falling off the end" of the function (where a tkreturn op has been compiled), the tkreturn op picks up the param count (from the tkreturn number_of_params code sequence) [NOTE: this is unused; remove it? FIXME], gets the arg count from the stack frame, copies the return value from the stack into the return value slot in the stack frame, and drops the return value from the stack. At this point, the stack should hold just the arguments, including the locals created beyond the args supplied. The tkreturn op loops through the locals not supplied by the caller, releasing any map (array) data and dropping each local "arg" from the stack. Now only the args actually supplied by the caller remain on the stack, and these are dropped. Finally, the ip instruction pointer is set from the return address in the stack frame, the previous parmbase value is restored from the stack frame, and execution continues with the next instruction.

Implementing regex RS
=====================

The POSIX spec (and the original AWK book) accepts only a single character as the record separator, the default being a newline. And POSIX says "If RS contains more than one character, the results are unspecified." I noticed that all the awk implementations I have available have implemented the capability of using a regular expression for RS, the record separator string. This includes nawk (the original Kernighan "One True Awk", in the 2022 version) and nnawk (the latest version), gawk, mawk, goawk, and Busybox awk (bbawk). I believe mawk was the first to implement this feature; it was added to nawk in 2019. Maybe it will be in the next release of the POSIX spec for awk. So I decided to try adding it to my version of awk.

This proved a bit trickier than it looked at first. I had noticed that Busybox awk fails the rebuf.awk test in the gawk test set. The problem in the bbawk output looked like the same sort of problem that the rebuf.awk test (a bug reported in 2002) originally revealed in gawk. Looking at the nature of the output, where the regex RS="ti1\n(dwv,)?" was not being matched on some records, causing "dwv," to appear spuriously in the output, I guessed that the problem was that the data at the end of the buffer contained only a partial match for the regex. That is, if the buffer ended with "ti1\n", the RS would match, but the "dwv," would be at the start of the next buffer.

But when I tried to implement it, I found my code was getting similar results. I had logic such that if I got a match ending exactly at the end of the buffer, I would expand the buffer, read more data, and try again. The problem was that I could get a partial match near the end of the buffer and have the same problem. For example, if the buffer ends with "ti1\ndw", the RS will match the "ti1\n". So I have to expand the buffer whenever the match is near the end of the buffer. How near, though? After some experimentation, I settled tentatively on a buffer with an initial size of 8192 bytes and a "margin" of 1024 bytes from the end.
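The resulting logic is roughly as follows. This is a sketch only: wak's actual record reader also deals with EOF handling, buffer compaction, and extracting the record itself, and rs_regex, bufpos, bufend, at_eof, and expand_and_refill() are illustrative names.

    #include <regex.h>

    #define MARGIN 1024   // "too close to the end of the buffer"

    static regex_t *rs_regex;      // compiled RS regex
    static char *bufpos, *bufend;  // current position and end of buffered data
    static int at_eof;             // no more input available
    static void expand_and_refill(void);

    // Sketch: find the next RS match, refilling whenever the match ends
    // within MARGIN bytes of the end of the data, because reading more
    // data might extend the match. Returns 1 if a separator was found.
    static int find_rs_match(regmatch_t *rm)
    {
        int r;
        for (;;) {
            r = regexec(rs_regex, bufpos, 1, rm, 0);
            if (r == 0 && bufpos + rm->rm_eo < bufend - MARGIN)
                return 1;           // match is safely inside the buffer
            if (at_eof)
                return r == 0;      // no more data; use what we have
            expand_and_refill();    // grow the buffer and/or read more
        }
    }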
I wrote a program to stress test the awk versions on hand. I generated 8500 lines of data that look like:

    1wxyz
    2wxyyz
    3wxyyyz
    ...

And a test:

    BEGIN{ RS = "w(x[^z]*z\n)?" } 1

This did stress them. Correct behavior would be to print 8500 lines, 1, 2, ... 8500, with just the numbers on them.

    Version    # lines printed    # initial lines correct
    -------    ---------------    -----------------------
    nnawk             8501                 8180
    wak              11048                 1170
    mawk              8639                  717
    goawk             9078                  355
    gawk             12007                   40
    bbawk            16828                   25

nnawk did the best, running for 8180 lines before printing a couple of anomalous lines, then running OK for the rest of the input.

Testing awk
===========

I've been testing wak against the other awk implementations I've been able to obtain. I have recent versions of nnawk (Kernighan's One True Awk, the original Unix awk updated), gawk, mawk, goawk, and bbawk (busybox awk). As of this writing, the versions are:

* nnawk: awk version 20231228 (compiled 2024-01-23)
* gawk: GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1 (source 2023-11-02; compiled 2023-11-19)
* mawk: mawk 1.3.4 20231102 (compiled 2023-11-16)
* goawk: v1.25.0 (compiled 2024-01-23)
* bbawk: 2023-12-31 (compiled 2024-01-23)

I'm sure there are plenty of bugs. Brian Kernighan has until recently been maintaining the original Unix awk from its start almost 40 years ago, and has a "FIXES" file with over 200 entries from 1987 to 2023, continuing to the present in a separate file for the "second edition" of One True Awk. If Kernighan has been fixing bugs for 35+ years, I doubt I can make a bug-free awk. Some "bugs" are incompatible interpretations of awk compared with what other implementations do with certain features. No two versions of awk (original awk, gawk, mawk, goawk, Busybox awk, etc.) agree completely. Please report bugs to raygard at gmail.com.

Testing strategy
----------------

I have used the test files that come with existing awk implementations, plus some I've written. The original One True Awk comes with a folder testdir of about 315 files. Kernighan's README.TESTS file says there are about 60 small tests p.* from the first two chapters of The AWK Programming Language (1st ed.) that are basic stuff; about 160 small tests t.* that are "a random sampling of awk constructions collected over the years. Not organized, but they touch almost everything."; and about 20 tt.* files that are timing tests. The testdir folder also has about 30 T.* files that are "more systematic tests of specific language features", but unfortunately these are shell scripts that each test a single awk program to see if it computes correct output as compared with known good data built into the scripts. This makes them difficult to use for comparing my implementation against all the others in one pass, but I can run the scripts on wak separately.

gawk also comes with a folder of about 1475 files, and most of these are sets of foo.awk, foo.in, and foo.ok files. In each case, the foo.awk file is run with foo.in input, and the result can be compared with foo.ok. Some are standalone tests that do not need an input file, so there is sometimes no foo.in file.

I have a not-very-neat test driver, test_awk.py, that I can use to run a batch of tests, such as all t.* in testdir, at one time against several awk implementations, and see how they compare. In the case of testdir's p.* and t.* files, they are intended to use certain input files (test.countries and test.data), and the outputs are compared via MD5 hashes. Each unique output is saved for later examination. For the gawk-style tests, the program can compare the output against the foo.ok file and give a pass/fail result. If there is a non-zero exit code or an exception, that is noted in the test_awk.py output.
The output looks like this:

    =======             =======  =======  =======  =======  =======  =======  =======
    =versions=>         gawk     nnawk    mawk     goawk    bbawk    tbawk    muwak
                        ====     ====     ====     ====     ====     ====     ====
    [...]
    Test delarpm2.awk   dd8e2e5  8841567  NNAWK    c992867  gawk     GOAWK    GOAWK
    Test dfacheck1.awk  03a19ad  0000000  NNAWK    NNAWK    gawk     !3a19ad  TBAWK
    ERR: tbawk: awk: file tests/gawktests/dfacheck1.awk line 1: warning: '\<' -- unknown regex escape
    ERR: muwak: muwak: file tests/gawktests/dfacheck1.awk line 1: warning: '\<' -- unknown regex escape
    Test double1.awk    8c7dbdf  819e6db  gawk     351564a  d97b6e5  BBAWK    BBAWK
    Test double2.awk    4dbdb44  4941b67  gawk     7a665a2  acfcfe0  0124355  TBAWK
    Test dtdgport.awk   a916caa  gawk     gawk     gawk     !!00000  gawk     gawk
    RET: bbawk: 1
    ERR: bbawk: awk: tests/gawktests/dtdgport.awk:37: %*x formats are not supported

The hex values are the first 7 digits of the MD5 of the output file. If the output is an empty file, the MD5 is replaced with all zeroes to make it easier to spot. If any stderr output occurs, the first digit is replaced with a `!'; if a non-zero exit code occurs, the second digit is replaced with a `!'. In either case, the stderr output (ERR:) and exit code (RET:) are printed.

These (non-pass-fail, non-timing) reports always display a hash value of the output in the first column (i.e. for the first awk version tested). In subsequent columns, if the hash is different from the first column, that hash is listed; but if the hash matches a hash from a previous column, then the awk version of that column is listed, and if it differs from the first column it is upcased. So, for example, for delarpm2.awk, nnawk has a different output from gawk, mawk matches nnawk, goawk has yet another different output, bbawk matches gawk, and both tbawk (toybox awk - my awk for toybox) and muwak (my awk compiled with musl libc) match goawk. For dfacheck1.awk, gawk gave some output, nnawk produced no output, mawk and goawk also produced no output (matching nnawk), bbawk matched gawk, tbawk produced the same output as gawk but had stderr output, and muwak matched tbawk, including having stderr output.

The gawk tests were originally intended to be run via the supplied Makefile, and some of them use special gawk options, environment setup, etc., so that when the foo.awk file is run by test_awk.py it may not produce correct foo.ok output even from gawk. Because of this, I sifted the output from all the gawk tests against all the awk versions into several parts and moved the tests into corresponding folders: gawktests/allfail has tests that fail for all versions, including gawk; gawktests/allpass has tests that pass for all versions; gawktests/gawkonly has tests that pass for gawk and fail for all others (usually because they use gawk-only features); and gawktests has all the remaining tests.

I also wrote a shell script and an awk script to sift the resulting test output files into several categories. I usually have test results in columns for gawk, nnawk, mawk, goawk, bbawk, my awk within toybox (tbawk), and my awk standalone (which may be compiled with the ASAN sanitizer, or with musl libc, or some other variation). The order is significant, because I consider my result golden if it matches both gawk and nnawk, still good if it matches gawk or nawk, and less good if it matches (only) mawk, goawk, or bbawk. If my awks (the last two columns) differ, they go into a set_mismatch file; that's usually a result of differences in random number generators, or of differences (bugs?) in the musl regex functions.
If any "allfail" tests do not all fail, or any "allpass" tests do not all pass, or any "gawkonly" tests do not pass only for gawk, they go into separate files. If a test is pass/fail, then I put tests that my awk fails into a set_fail file; if it passes and both gawk and nawk pass, it goes into a set_good file; if it matches gawk or nawk, it goes into a set_gawk or set_nawk file; otherwise it goes into a general set_pass file (or set_passx if it passed but had stderr output or a non-zero exit code). If it's not pass/fail, then if my result matches both gawk and nawk, it goes into the set_good file; else if it matches gawk, it goes into the set_gawk file; else if it matches nawk, it goes into the set_nawk file; else if it matches mawk, goawk, or bbawk, it goes into a set_mawk, set_goawk, or set_bbawk file respectively. If it doesn't fit into any of those buckets, then it doesn't match any other implementation, and it goes into a set_odd file.

The set_odd file needs the closest scrutiny, as those are possible bugs in my implementation, though some differ from gawk and/or nawk only in that they have stderr output, usually warnings. Some others are due to different implementations iterating over arrays in different orders. (An annoyance for testing is that goawk doesn't usually traverse arrays the same way on different runs, due to Go's intentionally random hash behavior, which apparently cannot be turned off.) Currently, I have 30 tests in the set_odd category out of 1182 tests run. I believe only a relative few of these are actual bugs.

Here is an approximate breakdown of the current test results:

    category            count
    ---------------     -----
    set_good              736
    set_gawk               87
    set_nawk              158
    set_mawk               50
    set_goawk               8
    set_bbawk              10
    set_pass                2
    set_passx              11
    set_fail               47
    set_badpass             2
    set_badfail             2
    set_badgawkonly         1
    set_mismatch           38
    set_odd                30

To put the 47 set_fail results in perspective, all but two of those are also failed by at least one of gawk, nnawk, mawk, or goawk. The two others are also failed by bbawk. Still, I would like to make wak/toybox awk work on more of those cases as well. Also, the tests need some cleanup; there is some overlap and duplication.