[HN Gopher] The Awk state machine parser pattern (2018)
       ___________________________________________________________________
        
       The Awk state machine parser pattern (2018)
        
       Author : Tomte
       Score  : 155 points
       Date   : 2022-01-31 09:11 UTC (13 hours ago)
        
 (HTM) web link (two-wrongs.com)
 (TXT) w3m dump (two-wrongs.com)
        
       | eqvinox wrote:
       | Not to disparage the nice awk script, but reading from
       | /sys/class/hwmon/* seems more sensible...
       | 
       | (Which is my way of saying, rather than writing a script like
       | this, I'd spend some time to get the data machine readable in the
       | first place -- or even just dig up where to already find it in a
       | machine readable form.)
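A minimal sketch of what reading hwmon directly might look like. The sysfs root is parameterized purely for illustration; which `temp*_input` and `temp*_label` files exist varies by kernel version and loaded drivers, exactly as the reply below points out.

```shell
#!/bin/sh
# Hedged sketch: enumerate hwmon temperature sensors under a sysfs root.
# Which temp*_input / temp*_label files exist varies by kernel and drivers.
list_temps() {
    root="${1:-/sys/class/hwmon}"
    for f in "$root"/hwmon*/temp*_input; do
        [ -e "$f" ] || continue              # glob may match nothing
        dir="${f%/*}"
        chip=$(cat "$dir/name")
        label_file="${f%_input}_label"
        if [ -r "$label_file" ]; then
            label=$(cat "$label_file")
        else
            label="${f##*/}"                 # fall back to the file name
        fi
        # sysfs reports millidegrees Celsius
        temp=$(awk '{ printf "%.1f", $1 / 1000 }' "$f")
        printf '%s/%s: %s C\n' "$chip" "$label" "$temp"
    done
}

list_temps   # on a real system, prints one line per temperature sensor
```

The labeling problem the reply describes shows up in the fallback branch: without a `temp*_label` file, all you have is the file name and the chip name.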
        
         | nousermane wrote:
         | Reading a file from sysfs is great and script-friendly, sure.
          | OTOH, finding the right file to read is less straightforward.
         | 
         | For one, depending on kernel version and compile options,
         | temperature/voltage/rpm files could be found under
         | /sys/class/hwmon, or /sys/devices/virtual, or
          | /sys/devices/platform/soc. And then, say, the script finds a
          | dozen of those "temp", "temp*_input", or "microvolts" files. How
         | to figure out which one is for CPU, motherboard, battery, PSU,
         | air intake? Probably with extra logic, reading corresponding
         | "temp*_label", if those even exist? Parsing
         | /sys/firmware/devicetree ? Taking hints from parts of the path-
         | name, where such files are found?
         | 
         | lm_sensors is no silver bullet here either, but at least it
          | does a passable job discovering/labeling sensors most of the
         | time.
        
           | eqvinox wrote:
            | Well, in that case:
            | 
            |     # sensors --help
            |     Usage: sensors [OPTION]... [CHIP]...
            |     ...
            |     -j     Json output
        
       | jjice wrote:
        | I'm a bit upset to say that this is one of the few times I've
        | seen AWK code outside of a one-liner (some of those one-liners
        | are pretty beastly, but still).
       | 
       | It reads pretty well, and now I'm interested in using it a bit
       | more for my scripts. Any good AWK examples/resources anyone can
       | recommend?
        
         | b3morales wrote:
         | This one is good; clear explanations of the concepts and
         | various useful examples to crib from:
         | https://www.grymoire.com/Unix/Awk.html
         | 
         | Their document covering sed is also excellent.
        
         | temp0826 wrote:
          | The AWK book, written by the A, W, and K of the program name,
          | has always been considered the bible.
         | 
         | Edit: misattributed to k&r
        
       | [deleted]
        
       | yiyus wrote:
       | I used a similar technique for my awk markdown parser:
       | https://github.com/yiyus/md2html.awk
       | 
        | An awk state machine is a quite straightforward way to deal with
        | data like this log file. It is not so clear that this is the best
        | way to write a relatively large piece of software, like a
        | markdown parser. When I wrote md2html.awk in 2009, the standard
        | md parser was the original one by John Gruber, written in Perl,
        | so it actually was an improvement in code clarity, performance,
        | and portability (we had no perl in Plan 9!), but nowadays it is
        | easy to find much better solutions.
        
         | wernsey wrote:
         | Wow, I've bookmarked this.
         | 
          | I also wrote a Markdown to HTML converter in Awk once, though
          | my purpose was to convert Markdown comments in source code
          | to documentation. I started with an Awk script to extract the
         | comments and then systematically added markdown features. My
         | end result isn't very elegant though.
         | 
         | https://github.com/wernsey/d.awk/blob/master/d.awk
        
       | asicsp wrote:
        | State machine of the form
        | 
        |     awk '/start/{f=1} f; /end/{f=0}'
        | 
        | is commonly used to work with text bounded by unique markers. You
        | can also use `awk '/start/,/end/'`, but the state machine format
        | can be easily adapted for more variations (like excluding
        | either/both of the markers).
       | 
       | Here's a chapter from my GNU awk one-liners book with more such
       | examples:
       | https://learnbyexample.github.io/learn_gnuawk/processing-mul...
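A quick illustration of the flag form next to the range form, plus the marker-excluding variation mentioned above (the sample input is made up):

```shell
#!/bin/sh
sample() { printf 'x\nstart\na\nb\nend\ny\n'; }

# Range form: prints everything from /start/ through /end/, inclusive.
sample | awk '/start/,/end/'

# Flag form, equivalent to the range form above.
sample | awk '/start/{f=1} f; /end/{f=0}'

# Reordering the rules excludes both markers: f is set only
# *after* the start line and cleared *before* the end line prints.
sample | awk '/end/{f=0} f; /start/{f=1}'   # prints a and b only
```

The reordering trick works because awk evaluates the rules top to bottom for each input line, so where the flag is flipped relative to the bare `f;` rule decides whether the marker lines themselves print.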
        
         | marklgr wrote:
         | Good book, for all levels, I recall stealing several snippets
         | into my cheatsheet.
        
         | nerdponx wrote:
         | I always forget that omitting the "action" after a pattern
         | prints the line by default. Very useful tip! I'll definitely
         | pick up a copy of this book.
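For instance (toy input), a bare pattern is shorthand for the same pattern with a `{ print }` action:

```shell
# A pattern with no action defaults to { print $0 }.
printf '1\n2\n3\n4\n' | awk 'NR % 2'             # prints 1 and 3
# Equivalent explicit form:
printf '1\n2\n3\n4\n' | awk 'NR % 2 { print }'   # prints 1 and 3
```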
        
         | [deleted]
        
       | VeninVidiaVicii wrote:
       | Glad I'm not the only one crazy enough to write long AWK parsers.
        | Here's my tool to convert "feature tables" into "bed" files --
        | both are formats that describe genomes, but for some reason NCBI
        | uses the former, even though it's totally useless.
       | https://github.com/ryandward/tbl2bed/blob/main/tbl2bed.awk
        
       | kazinator wrote:
        | TXR:
        | 
        |     @(bind idtab @(relate '("radeon" "k10temp")
        |                           '("GPU"    "CPU")
        |                           "SYS"))
        |     @(bind thash @(hash))
        |     @(repeat)
        |     @id-@bus-@code
        |     Adapter: @name
        |     @  (repeat)
        |     temp@num: @{temp}degC@nil
        |     @    (do (set [thash `@[idtab id]_@num`] temp))
        |     @  (until)
        |     @  (end)
        |     @(end)
        |     @(do (dohash (tag temp thash)
        |            (sh `echo gmetric -t uint16 -u Celsius -n @tag -v @temp`)))
        | 
        |     $ txr temp.txr temp.dat
        |     gmetric -t uint16 -u Celsius -n SYS_2 -v +43.0
        |     gmetric -t uint16 -u Celsius -n SYS_1 -v +31.0
        |     gmetric -t uint16 -u Celsius -n CPU_1 -v +36.8
        |     gmetric -t uint16 -u Celsius -n GPU_1 -v +50.5
        |     gmetric -t uint16 -u Celsius -n SYS_3 -v +38.0
       | 
       | That's based on what I think the Awk is stuffing into the
       | associative array. I was not able to run the code as pasted
        | verbatim from the site:
        | 
        |     $ gawk -f temp.awk temp.dat
        |     gawk: temp.awk:32:     temp = substr(
        |     gawk: temp.awk:32:                   ^ unexpected newline or end of string
        |     gawk: temp.awk:35:         matches[1, "length"]
        |     gawk: temp.awk:35:                             ^ unexpected newline or end of string
        |     gawk: temp.awk:35:     );
        |     gawk: temp.awk:35:     ^ 0 is invalid as number of arguments for substr
        | 
        |     $ mawk -f temp.awk temp.dat
        |     mawk: temp.awk: line 30: regular expression compile failed (missing operand)
        |     +([0-9.]+)degC
        |     mawk: temp.awk: line 30: syntax error at or near ,
        |     mawk: temp.awk: line 31: missing ) near end of line
        |     mawk: temp.awk: line 32: syntax error at or near ,
        |     mawk: temp.awk: line 35: extra ')'
       | 
       | I suspect the author may have tweaked the code for presentation
       | in the blog without rechecking that it still works. Or else it
       | needs some specific implementation and version of awk, with
       | specific command line arguments that are not given,
       | unfortunately.
        
         | vcdimension wrote:
         | My first thought when I saw this item in the HN list was TXR :)
        
         | kazinator wrote:
          | The blog's code works (with gawk) if some whitespace errors are
          | fixed.
          | 
          |     $ gawk -f temp.awk temp.dat  # echo inserted into command
          |     gmetric -t uint16 -u Celsius -n GPU_1 -v 50.5
          |     gmetric -t uint16 -u Celsius -n SYS_1 -v 31.0
          |     gmetric -t uint16 -u Celsius -n SYS_2 -v 43.0
          |     gmetric -t uint16 -u Celsius -n SYS_3 -v 38.0
          |     gmetric -t uint16 -u Celsius -n CPU_1 -v 36.8
         | 
          | I think these newline sensitivities are a design flaw in Awk;
          | I'm going to look into that and recommend changes to POSIX via
          | the Austin Group mailing list, if it still exists.
         | 
          | Awk has some newline sensitivities due to the following
          | ambiguities:
          | 
          |     condition             # condition with no action allowed: default { print }
          |     { action }            # action with no condition allowed
          |     condition { action }  # both
         | 
          | Therefore, this is not allowed (or well, it is, but codifies a
          | separate condition with a default action, and an unconditional
          | action).
          | 
          |     condition
          |     { action }
         | 
         | There can be no newline between a condition and the opening {
         | of its action. And actions must be brace enclosed.
         | 
         | And thus (IIRC) the awk lexical analyzer (in the original One
         | True Awk implementation) returns an explicit newline token to
         | the Yacc parser. In any phrase structure that doesn't deal with
         | that token, a newline will cause a syntax error:
          |     function(     # no good
          |         arg
          |     )
          | 
          |     function("string "   # no good
          |              foo + bar
          |              " catenation")
         | 
         | When the lexer produces the token which is the opening brace of
         | an action, it could shift into a freeform state, in which it
         | consumes newlines internally. Then when the action is parsed,
         | it can be returned to the newline-sensitive mode.
         | 
         | The newline sensitivities don't seem to serve a purpose in the
         | C-like language within the actions.
         | 
          | That language also occurs outside of actions via the function
          | construct:
          | 
          |     function whatever(...) {
          |     }
         | 
         | here the lexer would also be shifted into the freeform mode, as
         | appropriate.
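The ambiguity described above can be demonstrated directly with two toy awk programs: a newline between the pattern and its brace silently splits one intended rule into two.

```shell
#!/bin/sh
# One rule: pattern and action on the same line.
printf 'a\nb\n' | awk '/a/ { print "hit" }'    # prints: hit

# Newline between pattern and brace: now TWO rules --
# /a/ with the default { print } action, plus an unconditional action.
printf 'a\nb\n' | awk '/a/
{ print NR }'
# prints: a, then 1, then 2
```

So the second program is legal awk, just not the program its layout suggests, which is exactly the trap described in the comment.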
        
         | kazinator wrote:
          | Here is the data into JSON (keeping the values a
          | 
          |     $ txr json.txr temp.dat
          |     {"radeon-pci-0100":{"temp1":{"crit":120,"hyst":90,"value":50.5}},
          |      "f71889ed-isa-0480":{"temp2":{"hyst":77,"value":43,"high":85},
          |       "alarm":{"fan3":true,"fan2":true},
          |       "sensor":{"crit":100,"name":"thermistor","hyst":92},
          |       "max-voltage":{"in1":2.04},
          |       "temp1":{"hyst":81,"value":31,"high":85},
          |       "voltage":{"in5":1.23,"+3.3V":3.23,"in6":1.53,"in2":1.09,"Vbat":3.31,"in1":1.07,"in4":0.58,"3VSB":3.25,"in3":0.89},
          |       "temp3":{"hyst":68,"value":38,"high":70},
          |       "rpm":{"fan3":0,"fan1":3978,"fan2":0}},
          |      "k10temp-pci-00c3":{"temp1":{"crit":80,"hyst":78,"value":36.8,"high":70}}}
         | 
         | Using just a straightforward approach of recognizing the cases
         | that occur without trying to formally parse things. There is
         | significant copy and paste between similar cases. I decided to
         | use a post-processing pass on the dictionary to convert the
          | numeric values to floating-point.
          | 
          |     @(bind dict @(hash))
          |     @(name file)
          |     @(repeat)
          |     @idstring
          |     Adapter: @adapter
          |     @  (collect :vars (entry))
          |     @    (line line)
          |     @    (assert error `unhandled stuff occurs at @file:@line`)
          |     @    (some)
          |     @{temp /temp\d+/}: @{val}degC  (crit = @{crit}degC,
          |                                    hyst = @{hyst}degC)
          |     @      (bind entry @#J^{~temp : { "value" : ~val,
          |                                       "crit" : ~crit,
          |                                       "hyst" : ~hyst }})
          |     @    (or)
          |     @{temp /temp\d+/}: @{val}degC  (high = @{high}degC)
          |                                    (crit = @{crit}degC,
          |                                     hyst = @{hyst}degC)
          |     @      (bind entry @#J^{~temp : { "value" : ~val,
          |                                       "crit" : ~crit,
          |                                       "hyst" : ~hyst,
          |                                       "high" : ~high }})
          |     @    (or)
          |     @{temp /temp\d+/}: @{val}degC  (high = @{high}degC,
          |                                    hyst = @{hyst}degC)
          |     @      (bind entry @#J^{~temp : { "value" : ~val,
          |                                       "high" : ~high,
          |                                       "hyst" : ~hyst }})
          |     @    (or)
          |                                    (crit = @{crit}degC,
          |                                     hyst = @{hyst}degC)
          |                                    sensor = @sensor
          |     @      (bind entry @#J^{"sensor" : { "name" : ~sensor,
          |                                          "crit" : ~crit,
          |                                          "hyst" : ~hyst }})
          |     @    (or)
          |     @label: @voltage V
          |     @      (bind entry @#J^{"voltage" : {~label : ~voltage}})
          |     @    (or)
          |     @label: @voltage V (max = @max V)
          |     @      (bind entry @#J^{"voltage" : {~label : ~voltage},
          |                             "max-voltage" : {~label : ~max}})
          |     @    (or)
          |     @label: @rpm RPM
          |     @      (bind entry @#J^{"rpm" : {~label : ~rpm}})
          |     @    (or)
          |     @label: @rpm RPM ALARM
          |     @      (bind entry @#J^{"rpm" : {~label : ~rpm},
          |                             "alarm" : {~label : true}})
          |     @    (or)
          | 
          |     @      (bind entry @#J{})
          |     @    (end)
          |     @  (until)
          |     @  (end)
          |     @  (do (set [dict idstring]
          |            (reduce-left (op hash-uni @1 @2 hash-uni) entry #J{})))
          |     @(end)
          |     @(do
          |        (defun numify (dict)
          |          (dohash (k v dict dict)
          |            (typecase v
          |              (string (iflet ((f (tofloat v)))
          |                        (set [dict k] f)))
          |              (hash (numify v)))))
          |        (put-jsonl (numify dict)))
        
       | EdwardDiego wrote:
       | I have to say, that's the most readable and understandable Awk
       | program I've seen.
       | 
       | Does anyone know if there's a repository of similarly literate
       | awk scripts?
        
         | OskarS wrote:
         | I've found that AWK is frequently surprisingly readable
         | actually, as long as you understand the execution model. I
         | think people tend to think of it (with some justification) as
         | in the same vein as Perl, but it isn't nearly as surface-level
          | cryptic. The syntax is just "C-style" with dynamic typing and
          | built-in regular expressions.
         | 
         | I have a couple of AWK scripts to handle my personal finances
         | (basically, consuming various bank/credit card statements and
         | turning them to Ledger files) and it's just the perfect
         | language for that kind of task. My scripts look fairly similar
         | to the examples in the blog post, they also use the same state-
         | machine trick. With the possible exception of Perl/Raku, it's
         | my favourite language for that kind of thing.
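As a purely hypothetical sketch of that kind of statement-to-Ledger conversion (the CSV layout, account names, and amounts are all made up, not the commenter's actual scripts):

```shell
#!/bin/sh
# Hypothetical: turn "date,payee,amount" CSV rows into Ledger entries.
# Negative amounts are treated as expenses paid from a checking account.
printf '2022-01-31,GROCERY STORE,-42.50\n' | awk -F, '{
    printf "%s %s\n", $1, $2
    printf "    Expenses:Unsorted   %.2f\n", -$3
    printf "    Assets:Checking\n\n"
}'
```

A real version would mostly differ in the state-machine part: multi-line statement formats (running balances, continuation lines) are where the pattern from the article earns its keep.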
        
         | asicsp wrote:
         | Check out https://github.com/e36freak/awk-libs
        
           | baddate wrote:
           | nice!
        
         | yiyus wrote:
          | There used to be very good examples on awk.info. The domain is
          | for sale now, but you can get all the old content from
          | archive.org, and it still holds up.
        
         | patrec wrote:
         | Have a look at the classic "The AWK Programming Language" by
         | the A, W, and K in awk. It's full of great examples.
         | 
         | https://ia803404.us.archive.org/0/items/pdfy-MgN0H1joIoDVoIC...
        
         | rottc0dd wrote:
         | I don't know if this is idiomatic way of doing awk. It is just
         | a port of a python script to awk.
         | 
         | https://github.com/berry-thawson/diff2html
         | 
         | This is first attempt in writing awk script. Would like to know
         | how readable it is.
         | 
         | Edits : added a new line. changed some words.
        
       | dj_mc_merlin wrote:
       | This is basically what Perl used to be used for too.
        
         | nimrody wrote:
         | If you think of Ruby as a more readable / maintainable Perl --
         | it's much better suited to these text processing tasks.
         | 
         | Ruby even supports Perl regular expressions which are more
         | powerful and convenient than Awk's.
         | 
         | Some version of Ruby is usually in the base system of every
         | Linux system (perl5 is more ubiquitous but much more cryptic)
        
       | pphysch wrote:
        | AWK supports conditional branching and switching, so you can
        | represent nested states as well. I wouldn't recommend going
        | beyond depth ~1, though... Use a proper language for that.
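A sketch of one nested level, with a made-up input format: an outer "section" state carried in a variable, plus an inner start/end block flag like the one in the article.

```shell
#!/bin/sh
# Two-level state machine: outer section state, inner block flag.
# Input format is invented for illustration.
printf 'SECTION A\nstart\nkeep1\nend\nnoise\nSECTION B\nstart\nkeep2\nend\n' |
awk '
    /^SECTION / { section = $2; inblock = 0; next }  # outer state change
    /^start$/   { inblock = 1; next }                # enter inner block
    /^end$/     { inblock = 0; next }                # leave inner block
    inblock     { print section ": " $0 }            # emit tagged lines
'
# prints: A: keep1, then B: keep2
```

Past one level of this, an explicit state variable with a `switch` (gawk) or an if/else chain starts to read like a hand-rolled parser, which is the point where a proper language wins.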
        
       ___________________________________________________________________
       (page generated 2022-01-31 23:01 UTC)