[HN Gopher] The Awk state machine parser pattern (2018)
___________________________________________________________________
The Awk state machine parser pattern (2018)
Author : Tomte
Score : 155 points
Date : 2022-01-31 09:11 UTC (13 hours ago)
(HTM) web link (two-wrongs.com)
(TXT) w3m dump (two-wrongs.com)
| eqvinox wrote:
| Not to disparage the nice awk script, but reading from
| /sys/class/hwmon/* seems more sensible...
|
| (Which is my way of saying, rather than writing a script like
| this, I'd spend some time to get the data machine readable in the
| first place -- or even just dig up where to already find it in a
| machine readable form.)
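A minimal sketch of that approach (`hwmon_temps` is a name made up here, and the hwmon root is a parameter because the path layout varies by kernel, as discussed below):

```shell
#!/bin/sh
# hwmon exposes temperatures as millidegrees Celsius in temp*_input
# files; print each file alongside its value in degrees.
hwmon_temps() {
    root=${1:-/sys/class/hwmon}
    for f in "$root"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "$f" "$(awk '{ printf "%.1f", $1 / 1000 }' "$f")"
    done
}

hwmon_temps "$@"
```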
| nousermane wrote:
| Reading a file from sysfs is great and script-friendly, sure.
| OTOH, finding the right file to read is less straight-forward.
|
| For one, depending on kernel version and compile options,
| temperature/voltage/rpm files could be found under
| /sys/class/hwmon, or /sys/devices/virtual, or
| /sys/devices/platform/soc. And then, say, script found a dozen
| of those "temp", or "temp*_input", or "microvolts" files. How
| to figure out which one is for CPU, motherboard, battery, PSU,
| air intake? Probably with extra logic, reading corresponding
| "temp*_label", if those even exist? Parsing
| /sys/firmware/devicetree ? Taking hints from parts of the path-
| name, where such files are found?
|
| lm_sensors is no silver bullet here either, but at least it
| does passable job discovering/labeling sensors most of the
| time.
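One possible shape for that extra label-matching logic, sketched under the assumption that a `temp*_label` file sits next to its `temp*_input` (Tdie, Tctl, and "Package id 0" are labels used by the k10temp/coretemp drivers; other drivers differ, and the label file may not exist at all):

```shell
#!/bin/sh
# Walk temp*_label files and pick the first one that looks like a CPU
# sensor, then read the sibling temp*_input (millidegrees Celsius).
find_cpu_temp() {
    root=${1:-/sys/class/hwmon}
    for label in "$root"/hwmon*/temp*_label; do
        [ -r "$label" ] || continue
        case "$(cat "$label")" in
            Tdie|Tctl|"Package id 0"|CPU*)
                awk '{ printf "%.1f\n", $1 / 1000 }' "${label%_label}_input"
                return 0 ;;
        esac
    done
    return 1
}
```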
| eqvinox wrote:
| Well, in that case:
|
|     # sensors --help
|     Usage: sensors [OPTION]... [CHIP]...
|     ...
|     -j  Json output
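With `-j`, the output can be sliced with jq. The blob below is only shaped like lm-sensors JSON output (chip and field names follow the hwmon `temp*_input` convention, but treat the exact layout as an assumption and check your own `sensors -j`):

```shell
#!/bin/sh
# Pull one temperature out of sensors-style JSON.
# On a live system this would be:  sensors -j | jq -r '...'
jq -r 'to_entries[] | "\(.key) \(.value.temp1.temp1_input)"' <<'EOF'
{"k10temp-pci-00c3":{"Adapter":"PCI adapter","temp1":{"temp1_input":36.8,"temp1_crit":80.0}}}
EOF
# -> k10temp-pci-00c3 36.8
```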
| jjice wrote:
| I'm a bit upset to say that this is one of the few times I've
| seen AWK code outside of a one-liner (some of those one-liners
| are pretty beastly, but still).
|
| It reads pretty well, and now I'm interested in using it a bit
| more for my scripts. Any good AWK examples/resources anyone can
| recommend?
| b3morales wrote:
| This one is good; clear explanations of the concepts and
| various useful examples to crib from:
| https://www.grymoire.com/Unix/Awk.html
|
| Their document covering sed is also excellent.
| temp0826 wrote:
| The AWK book, written by the A, W, and K of the program name,
| has always been considered the bible.
|
| Edit: misattributed to k&r
| [deleted]
| yiyus wrote:
| I used a similar technique for my awk markdown parser:
| https://github.com/yiyus/md2html.awk
|
| An awk state machine is quite a straightforward way to deal with
| data like this log file. It is less clear that it is the best
| way to write a relatively large piece of software, like a
| markdown parser. (When I wrote md2html.awk in 2009, the standard
| md parser was the original one by John Gruber, written in Perl,
| so it actually was an improvement in code clarity, performance,
| and portability: we had no Perl in Plan 9! But nowadays it is
| easy to find much better solutions.)
| wernsey wrote:
| Wow, I've bookmarked this.
|
| I also wrote a Markdown-to-HTML converter in Awk once, though
| my purpose was to convert Markdown comments in source code
| to documentation. I started with an Awk script to extract the
| comments and then systematically added Markdown features. My
| end result isn't very elegant, though.
|
| https://github.com/wernsey/d.awk/blob/master/d.awk
| asicsp wrote:
| A state machine of the form
|
|     awk '/start/{f=1} f; /end/{f=0}'
|
| is commonly used to work with text bounded by unique markers. You
| can also use `awk '/start/,/end/'`, but the state machine format
| can be easily adapted for more variations (like excluding
| either/both of the markers).
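The marker-exclusion variants can be sketched like this (any POSIX awk; the sample input and markers are made up):

```shell
#!/bin/sh
# Flag-based extraction between unique markers, with variants that
# drop one or both marker lines.
input='a
start
b
end
c'

# include both markers
printf '%s\n' "$input" | awk '/start/{f=1} f; /end/{f=0}'
# -> start
#    b
#    end

# exclude the start marker: raise the flag, then skip the marker line
printf '%s\n' "$input" | awk '/start/{f=1; next} f; /end/{f=0}'
# -> b
#    end

# exclude both markers: clear the flag before the print rule runs
printf '%s\n' "$input" | awk '/start/{f=1; next} /end/{f=0} f'
# -> b
```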
|
| Here's a chapter from my GNU awk one-liners book with more such
| examples:
| https://learnbyexample.github.io/learn_gnuawk/processing-mul...
| marklgr wrote:
| Good book, for all levels; I recall stealing several snippets
| for my cheatsheet.
| nerdponx wrote:
| I always forget that omitting the "action" after a pattern
| prints the line by default. Very useful tip! I'll definitely
| pick up a copy of this book.
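That default (a pattern with no action prints the record) is easy to see with any POSIX awk:

```shell
#!/bin/sh
# A pattern with no action defaults to { print $0 }: these two
# one-liners are equivalent.
printf '1\n2\n3\n' | awk 'NR % 2'
# -> 1
#    3
printf '1\n2\n3\n' | awk 'NR % 2 { print $0 }'
# -> 1
#    3
```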
| [deleted]
| VeninVidiaVicii wrote:
| Glad I'm not the only one crazy enough to write long AWK parsers.
| Here's my tool to convert "feature tables" into "bed" files,
| both of which are formats that describe genomes; for some
| reason, NCBI uses the former, even though it's totally useless.
| https://github.com/ryandward/tbl2bed/blob/main/tbl2bed.awk
| kazinator wrote:
| TXR: @(bind idtab @(relate '("radeon" "k10temp")
| '("GPU" "CPU") "SYS"))
| @(bind thash @(hash)) @(repeat) @id-@bus-@code
| Adapter: @name @ (repeat) temp@num: @{temp}degC@nil
| @ (do (set [thash `@[idtab id]_@num`] temp)) @ (until)
| @ (end) @(end) @(do (dohash (tag temp thash)
| (sh `echo gmetric -t uint16 -u Celsius -n @tag -v @temp`)))
| $ txr temp.txr temp.dat gmetric -t uint16 -u Celsius -n
| SYS_2 -v +43.0 gmetric -t uint16 -u Celsius -n SYS_1 -v
| +31.0 gmetric -t uint16 -u Celsius -n CPU_1 -v +36.8
| gmetric -t uint16 -u Celsius -n GPU_1 -v +50.5 gmetric -t
| uint16 -u Celsius -n SYS_3 -v +38.0
|
| That's based on what I think the Awk is stuffing into the
| associative array. I was not able to run the code as pasted
| verbatim from the site:
|
|     $ gawk -f temp.awk temp.dat
|     gawk: temp.awk:32:         temp = substr(
|     gawk: temp.awk:32:                      ^ unexpected newline or end of string
|     gawk: temp.awk:35:         matches[1, "length"]
|     gawk: temp.awk:35:                             ^ unexpected newline or end of string
|     gawk: temp.awk:35:         );
|     gawk: temp.awk:35:          ^ 0 is invalid as number of arguments for substr
|     $ mawk -f temp.awk temp.dat
|     mawk: temp.awk: line 30: regular expression compile failed (missing operand)
|     +([0-9.]+)degC
|     mawk: temp.awk: line 30: syntax error at or near ,
|     mawk: temp.awk: line 31: missing ) near end of line
|     mawk: temp.awk: line 32: syntax error at or near ,
|     mawk: temp.awk: line 35: extra ')'
|
| I suspect the author may have tweaked the code for presentation
| in the blog without rechecking that it still works. Or else it
| needs some specific implementation and version of awk, with
| specific command line arguments that are not given,
| unfortunately.
| vcdimension wrote:
| My first thought when I saw this item in the HN list was TXR :)
| kazinator wrote:
| The blog's code works (with gawk) if some whitespace errors are
| fixed.
|
|     $ gawk -f temp.awk temp.dat   # echo inserted into command
|     gmetric -t uint16 -u Celsius -n GPU_1 -v 50.5
|     gmetric -t uint16 -u Celsius -n SYS_1 -v 31.0
|     gmetric -t uint16 -u Celsius -n SYS_2 -v 43.0
|     gmetric -t uint16 -u Celsius -n SYS_3 -v 38.0
|     gmetric -t uint16 -u Celsius -n CPU_1 -v 36.8
|
| I think this newline handling is a design flaw in Awk; I'm
| going to look into it and recommend changes to POSIX via the
| Austin Group mailing list, if that still exists.
|
| Awk has some newline sensitivities due to the following
| ambiguities:
|
|     condition             # condition with no action allowed: default { print }
|     { action }            # action with no condition allowed
|     condition { action }  # both
|
| Therefore, this is not allowed (or well, it is, but it codifies
| a separate condition with a default action, plus an
| unconditional action):
|
|     condition
|     { action }
|
| There can be no newline between a condition and the opening {
| of its action. And actions must be brace enclosed.
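The effect of that rule can be demonstrated directly (assuming a POSIX awk on PATH): moving the brace to the next line silently creates two rules instead of one.

```shell
#!/bin/sh
# One rule: pattern and action together.
printf 'a\nb\n' | awk 'NR == 1 { print "first" }'
# -> first

# Two rules: "NR == 1" with the default { print } action, plus an
# unconditional { print "first" } that fires for every record.
printf 'a\nb\n' | awk 'NR == 1
{ print "first" }'
# -> a
#    first
#    first
```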
|
| And thus (IIRC) the awk lexical analyzer (in the original One
| True Awk implementation) returns an explicit newline token to
| the Yacc parser. In any phrase structure that doesn't deal with
| that token, a newline will cause a syntax error:
|
|     function( # no good
|       arg )
|
|     function("string "  # no good
|       foo + bar
|       " catenation")
|
| When the lexer produces the token which is the opening brace of
| an action, it could shift into a freeform state, in which it
| consumes newlines internally. Then when the action is parsed,
| it can be returned to the newline-sensitive mode.
|
| The newline sensitivities don't seem to serve a purpose in the
| C-like language within the actions.
|
| That language also occurs outside of actions via the function
| construct: function whatever(...) { }
|
| here the lexer would also be shifted into the freeform mode, as
| appropriate.
| kazinator wrote:
| Here is the data converted into JSON:
|
|     $ txr json.txr temp.dat
|     {"radeon-pci-0100":{"temp1":{"crit":120,"hyst":90,"value":50.5}},
|      "f71889ed-isa-0480":{"temp2":{"hyst":77,"value":43,"high":85},
|       "alarm":{"fan3":true,"fan2":true},
|       "sensor":{"crit":100,"name":"thermistor","hyst":92},
|       "max-voltage":{"in1":2.04},
|       "temp1":{"hyst":81,"value":31,"high":85},
|       "voltage":{"in5":1.23,"+3.3V":3.23,"in6":1.53,"in2":1.09,"Vbat":3.31,"in1":1.07,"in4":0.58,"3VSB":3.25,"in3":0.89},
|       "temp3":{"hyst":68,"value":38,"high":70},
|       "rpm":{"fan3":0,"fan1":3978,"fan2":0}},
|      "k10temp-pci-00c3":{"temp1":{"crit":80,"hyst":78,"value":36.8,"high":70}}}
|
| Using just a straightforward approach of recognizing the cases
| that occur without trying to formally parse things. There is
| significant copy and paste between similar cases. I decided to
| use a post-processing pass on the dictionary to convert the
| numeric values to floating-point.
|
|     @(bind dict @(hash))
|     @(name file)
|     @(repeat)
|     @idstring
|     Adapter: @adapter
|     @  (collect :vars (entry))
|     @    (line line)
|     @    (assert error `unhandled stuff occurs at @file:@line`)
|     @    (some)
|     @{temp /temp\d+/}: @{val}degC  (crit = @{crit}degC, hyst = @{hyst}degC)
|     @      (bind entry @#J^{~temp : { "value" : ~val,
|                                       "crit" : ~crit,
|                                       "hyst" : ~hyst }})
|     @    (or)
|     @{temp /temp\d+/}: @{val}degC  (high = @{high}degC)
|                            (crit = @{crit}degC, hyst = @{hyst}degC)
|     @      (bind entry @#J^{~temp : { "value" : ~val,
|                                       "crit" : ~crit,
|                                       "hyst" : ~hyst,
|                                       "high" : ~high }})
|     @    (or)
|     @{temp /temp\d+/}: @{val}degC  (high = @{high}degC, hyst = @{hyst}degC)
|     @      (bind entry @#J^{~temp : { "value" : ~val,
|                                       "high" : ~high,
|                                       "hyst" : ~hyst }})
|     @    (or)
|                (crit = @{crit}degC, hyst = @{hyst}degC)  sensor = @sensor
|     @      (bind entry @#J^{"sensor" : { "name" : ~sensor,
|                                          "crit" : ~crit,
|                                          "hyst" : ~hyst }})
|     @    (or)
|     @label: @voltage V
|     @      (bind entry @#J^{"voltage" : {~label : ~voltage}})
|     @    (or)
|     @label: @voltage V  (max = @max V)
|     @      (bind entry @#J^{"voltage" : {~label : ~voltage},
|                             "max-voltage" : {~label : ~max}})
|     @    (or)
|     @label: @rpm RPM
|     @      (bind entry @#J^{"rpm" : {~label : ~rpm}})
|     @    (or)
|     @label: @rpm RPM  ALARM
|     @      (bind entry @#J^{"rpm" : {~label : ~rpm},
|                             "alarm" : {~label : true}})
|     @    (or)
|     @      (bind entry @#J{})
|     @    (end)
|     @  (until)
|
|     @  (end)
|     @  (do (set [dict idstring]
|              (reduce-left (op hash-uni @1 @2 hash-uni) entry #J{})))
|     @(end)
|     @(do (defun numify (dict)
|            (dohash (k v dict dict)
|              (typecase v
|                (string (iflet ((f (tofloat v)))
|                          (set [dict k] f)))
|                (hash (numify v)))))
|          (put-jsonl (numify dict)))
| EdwardDiego wrote:
| I have to say, that's the most readable and understandable Awk
| program I've seen.
|
| Does anyone know if there's a repository of similarly literate
| awk scripts?
| OskarS wrote:
| I've found that AWK is frequently surprisingly readable
| actually, as long as you understand the execution model. I
| think people tend to think of it (with some justification) as
| in the same vein as Perl, but it isn't nearly as surface-level
| cryptic. The syntax is just "C-style" with dynamic typing and
| built in regular expressions.
|
| I have a couple of AWK scripts to handle my personal finances
| (basically, consuming various bank/credit card statements and
| turning them to Ledger files) and it's just the perfect
| language for that kind of task. My scripts look fairly similar
| to the examples in the blog post, they also use the same state-
| machine trick. With the possible exception of Perl/Raku, it's
| my favourite language for that kind of thing.
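OskarS's scripts aren't shown, but the kind of statement-to-Ledger conversion described might look like this sketch (the CSV layout, account names, and amounts are all invented for illustration):

```shell
#!/bin/sh
# Hypothetical bank statement line: date,description,amount
# (negative = money out). Emit a Ledger-style transaction per line.
printf '%s\n' '2022-01-03,GROCERY MART,-42.50' |
awk -F, '{
    printf "%s %s\n", $1, $2                     # transaction header
    printf "    Expenses:Unknown  %8.2f\n", -$3  # where the money went
    printf "    Assets:Checking   %8.2f\n", $3   # where it came from
}'
# header line "2022-01-03 GROCERY MART", followed by two postings
```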
| asicsp wrote:
| Check out https://github.com/e36freak/awk-libs
| baddate wrote:
| nice!
| yiyus wrote:
| There used to be very good examples on awk.info. The domain is
| for sale now, but you can get all the old content from
| archive.org, and it is still very relevant.
| patrec wrote:
| Have a look at the classic "The AWK Programming Language" by
| the A, W, and K in awk. It's full of great examples.
|
| https://ia803404.us.archive.org/0/items/pdfy-MgN0H1joIoDVoIC...
| rottc0dd wrote:
| I don't know if this is the idiomatic way of doing awk. It is
| just a port of a Python script to awk.
|
| https://github.com/berry-thawson/diff2html
|
| This is my first attempt at writing an awk script. I would like
| to know how readable it is.
|
| Edit: added a new line; changed some words.
| dj_mc_merlin wrote:
| This is basically what Perl used to be used for too.
| nimrody wrote:
| If you think of Ruby as a more readable, maintainable Perl,
| it's much better suited to these text processing tasks.
|
| Ruby even supports Perl regular expressions, which are more
| powerful and convenient than Awk's.
|
| Some version of Ruby is usually in the base system of every
| Linux distribution (perl5 is more ubiquitous but much more
| cryptic).
| pphysch wrote:
| AWK supports conditional branching and switching, so you can
| represent nested states as well. I wouldn't recommend it beyond
| depth ~1, though... use a proper language for that.
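A sketch of what depth-1 nesting can look like: an outer state for the current section, an inner flag for a begin/end block (the input format here is invented for illustration):

```shell
#!/bin/sh
# Outer state: which [section] we are in.
# Inner state: whether we are inside a begin/end block of it.
printf '%s\n' '[net]' 'begin' 'mtu 1500' 'end' \
              '[disk]' 'begin' 'ssd 1' 'end' |
awk '
    /^\[/    { section = substr($0, 2, length($0) - 2); inblock = 0; next }
    /^begin/ { inblock = 1; next }
    /^end/   { inblock = 0; next }
    inblock  { print section ": " $0 }
'
# -> net: mtu 1500
#    disk: ssd 1
```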
___________________________________________________________________
(page generated 2022-01-31 23:01 UTC)