--Topography--------------------------------------------------------------------
 AWK Workshop / Discussion - April 11 - 23, 2026
  * what: an informal exploration of plain AWK, aka "new AWK"
  * where: SDF.org - both pcom ("awk" room) and irc.sdf.org ("#awk")
  * when: Saturdays, 10-11am MDT ; Tuesdays & Thursdays, 6-7pm MDT
 --------------------------------------------------------------------------------

[b******s]  Day 2: regular & relational expressions, flow-control statements

[b******s]  in retrospect I think I could have structured this thing better..

[b******s]  AWK supports Extended Regular Expressions (ERE), similar to sed(1) &
            grep(1)                                                                 

[j******e]  It exists at all, and for that I'm thankful.

[b******s]  unlike sed & grep, AWK does *NOT* have back-reference support

[b******s]  probably everyone here knows what that is, things like '(pattern)\1'

[b******s]  not a huge deal but see 'Challenge Exercise'

[j******e]  I know what a regex is, but I'm unclear about what back-reference
            support means, tbh.                                                     

[b******s]  it seems to be part of the Basic REs; in the above the parens are used
            to mark a pattern you can later reference using the '\m' form, up to 9
            references are supported

[j******h]  I mostly use back-references in the replacement pattern (for a sed
            command); I'm not too comfortable using them in the search pattern as
            well.                                                                   

[b******s]  yeah they are useful, not sure why awk didn't get them

[b******s]  anyway, there are two AWK RegEx forms: /regex/ & "regex" (good for
            dynamic concatenation)                                                  
  
  <j******e is annoyed that he doesn't have this thing he didn't know
            existed until just now. ;)>

[b******s]  heh

[b******s]  the first form you've already seen last time

[b******s]  the latter makes use of AWK's string concatenation which we'll cover
            later                                                                   
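
 A small sketch of the two regex forms, assuming a made-up three-line input. The
 function name regex_forms and the variable proto are illustrative only. The
 string form lets the pattern be assembled at run time by concatenation.

```shell
# literal vs. dynamic (string) regex; the input lines and "proto" are made up
regex_forms() {
    printf 'gopher://sdf.org\nhttp://sdf.org\nftp://sdf.org\n' |
    awk -v proto="goph" '
        $0 ~ /^http/      { print "literal:", $0 }   # /regex/ form
        $0 ~ ("^" proto)  { print "dynamic:", $0 }   # "regex" form, built by concatenation
    '
}
regex_forms
```

 Only the gopher line matches the dynamic pattern and only the http line matches
 the literal one; the ftp line matches neither.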

[b******s]  here's the basics:

 ------------------------------------------------------------------------
 # AWK ERE meta-character basics:
 #  "." => any char ; "^" => match start of string ; "$" => match end of string
 #  "?" => match 0 or one of preceding ; "+" => match 1 or more of preceding
 #  "*" => match 0 or more of preceding ; "A-Z", "a-z", "0-9" => range set
 #  "[...]" => match any char contained; can be single char or range sets
 #  "[^..]" => match any char NOT contained (complement of above)
 #  "{n,m}" => match n-m number of preceding ; "{n,}" => matches n or more
 #
 # some common POSIX character classes - see re_format(7) for details:
 #   [:alpha:], [:alnum:], [:digit:], [:upper:], [:cntrl:], [:space:]
 #
 # metachars (usually) lose specialness inside "[]" ; escape w/ "\" otherwise
 #   ex.  "[.+?*]" => matches ".", "+", "?" or "*"
 #        "[\\]" => matches "\"  ; "[\]]" => matches "]"
 #
 #  some special cases:
 #     "[[]" => matches "[" *IF* first char in brackets
 #     "[-]" => matches "-" *IF* first char in brackets
 #
 ------------------------------------------------------------------------

[b******s]  probably a lot of that is familiar
  
  <b******s pauses and tries to clean up spilt water..>

[j******e]  b******s: Would {,m} cover zero to m?

[j******e]  ...I suppose you could just use {0,m} for that...

[b******s]  hmm, not sure - try it!
  
  <j******e retracts his question.>

[b******s]  BTW, the complement ( "[^c..]" ) won't break 1st char special
            interpretation                                                          

[b******s]  I probably spend more time trying to debug regex stuff than anything
            else WRT AWK                                                            

[b******s]  hi f******x - type 'R 123' to review

[b******s]  continue?

[j******h]  How greedy is the {n,m} syntax? To how much of the "preceding" tokens
            will it be applied, without explicit parentheses to indicate the scope? 

[b******s]  I'm not sure TBH; that actually came up a while back on the gawk mlist

[b******s]  well, greediness in general

[b******s]  I think it's probably best to use parens if the pattern isn't very
            simple                                                                  

[j******e]  I'd always thought it was one token unless parentheses are used?

[b******s]  right, that would be same as for ?, +, and *

[j******h]  The egrep example at the bottom of today's notes suggests that in
            '(.)\1{2,}' the {2,} is only being applied to the \1 back-reference, not
            to the (.) as well.

[b******s]  right but the parens serve a different function for backref

[b******s]  A.R. talks a bit about those in the Classic Shell Scripting book

[b******s]  some ERE examples:

 ------------------------------------------------------------------------
 # examples of typical ERE patterns:
 #    "^[sS]erver"  =>  matches str starting w/ "Server" or "server"
 #         "addr$"  =>  matches str ending w/ "addr"
 #   "(http|goph)"  =>  matches str containing "http" or "goph"
 #      "^[a-z]+$"  =>  matches str ONLY containing a-z
 #        "[^sdf]"  =>  matches str containing a char other than s, d, or f
 #          "t{2}"  =>  matches str containing "tt"
 #         "[qv]+"  =>  matches str with one or more "q" or "v"
 #
 ------------------------------------------------------------------------
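
 A quick way to try a few of these patterns, sketched against made-up sample
 strings (the function name ere_demo and the tag words are illustrative). It also
 shows that a complemented class like [^sdf] matches one character, so saying
 "contains none of s, d, f" needs the anchored form ^[^sdf]*$. LC_ALL=C is an
 assumption here to keep the a-z range strictly lowercase; the brace form is left
 out because not every awk supports interval expressions.

```shell
# tag each sample string with the patterns it matches
ere_demo() {
    printf 'Server01\nmy_addr\nsdf\ngopher\n' |
    LC_ALL=C awk '{
        out = $0 ":"
        if (/^[sS]erver/) out = out " starts-server"
        if (/addr$/)      out = out " ends-addr"
        if (/^[a-z]+$/)   out = out " only-a-z"
        if (/[^sdf]/)     out = out " has-non-sdf-char"
        if (/^[^sdf]*$/)  out = out " no-sdf-at-all"
        print out
    }'
}
ere_demo
```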


[b******s]  boolean "&&" and "||" along w/ "(|)" allow more complex matching

 ------------------------------------------------------------------------
 #  ex.  ^(GOPH|SERV)|[_]{2,3}|(ADDR|NAME)$
 #  => matches strings w/ any of the following characteristics:
 #       - begins w/ "GOPH" or "SERV"
 #       - contains 2 or more "_" chars in a row (unanchored => {2,3} can't cap it)
 #       - ends w/ "ADDR" or "NAME"
 #
 ------------------------------------------------------------------------
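
 The same kind of test can also be spelled as separate regexes joined with the
 boolean operators. This is only a sketch with made-up input: /__/ stands in for
 the underscore branch (unanchored, any run of two or more underscores matches),
 and the function name alt_demo is hypothetical.

```shell
# "|" works inside one regex; "||" and "&&" combine whole patterns
alt_demo() {
    printf 'GOPHER_HOST\nMY__VAR\nHOST_NAME\nPLAIN_X\nSERV_NAME\n' |
    awk '
        /^(GOPH|SERV)/ || /__/ || /(ADDR|NAME)$/  { print "any:  " $0 }
        /^(GOPH|SERV)/ && /(ADDR|NAME)$/          { print "both: " $0 }
    '
}
alt_demo
```

 PLAIN_X matches nothing, and only SERV_NAME satisfies the && rule.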


[b******s]  ... on to AWK relational expressions ...

[b******s]  these are typically part of a match pattern or used w/ if(), while(),
            etc.

[j******e]  Dumb question, what's the difference between | and ||?

[b******s]  '|' => or within a '()' group

[b******s]  the other is boolean counterpart to &&

[b******s]  the "(http|goph)" => matches str containing "http" or "goph" for example

[b******s]  continue?

[a******r]  what about the one right after that? after the )

[b******s]  it's a group so the parens are paired

[j******e]  I think I've got it, but I'll save and review the chat log to check back
            later.                                                                  

[b******s]  oh, you mean the ^(GOPH|SERV)|[_]{2,3}|(ADDR|NAME)$ example?

[a******r]  yes

[b******s]  it's a nested example

[a******r]  does it mean that the _ characters are optional?

[a******r]  but if there, must be 2 or 3?

[b******s]  no, the string contains 2 or 3 "_" chars

[j******e]  Wait, what does the [_] mean? What makes it different from just _?

[b******s]  the '[]' usually removes specialness from chars; in this case they're
            likely not needed, but I like putting brackets around weird characters

[j******h]  I wonder if you could do it without putting the underscore inside the
            character class, though. Since we agreed that the {n,m} syntax only
            applied to the preceding character ...                                  

[b******s]  yeah it should work

[b******s]  underscores can be hard to read => I often bracket them

[j******e]  Could you not use parens here? Does the underscore have any special
            meaning like ., ?, etc.?                                                

[b******s]  no it's not a metachar

[j******e]  Fair enough.

[b******s]  on to relational expressions ?

[b******s]  these are easier

[j******e]  Sounds good.

[b******s]  AWK relational operators: >, <, >=, <=, !=, ==, !~, ~

[b******s]  these we've kind of already seen in the pattern-action stuff

 ------------------------------------------------------------------------
 #  some examples:
 #   '$3 > $1' => lines w/ field 3 > field 1
 #   'NF >= 5' => lines w/ 5+ fields ; 'NR != 1' => NOT line 1
 #   'length < 72' => length w/o arg is quasi-var == length of $0
 #
 ------------------------------------------------------------------------


[b******s]  i.e. $ awk 'NF >= 5' fubar.txt => lines w/ 5+ fields
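
 A sketch of relational patterns at work on three made-up input lines (rel_demo
 is just an illustrative name). Each rule prints the record number when its
 pattern is true.

```shell
# relational expressions used directly as patterns
rel_demo() {
    printf '5 4 3 2 1\n1 2 3\n9 8 10\n' |
    awk '
        NF >= 5                { print NR ": 5+ fields" }
        $3 > $1                { print NR ": field 3 > field 1" }
        NR != 1 && length < 6  { print NR ": short, not line 1" }
    '
}
rel_demo
```

 The fields here look numeric, so $3 > $1 is a numeric comparison (10 > 9 on the
 last line).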

[a******r]  you are right, these are easier!

[j******e]  If memory serves, $0 is the concatenation of all the fields?

[b******s]  right.

[b******s]  for AWK's main body these patterns default to matching against $0 so
            there's no need to explicitly use it                                    

[j******h]  length of $0 sounds interesting. I assume $0 is not referring to the
            name of the awk script you ran, though. In bash you can use $0 to see
            how the command was called.                                             

[b******s]  ya 'length' w/o an expression is like a variable set to the length of $0

[a******r]  kind of like Perl's $_ ?

[b******s]  probably

[b******s]  length == length($0)

[b******s]  some AWK weirdness: AWK will happily compare numbers with strings

[b******s]  eg. '42 <= "42abc"' => TRUE! try to avoid this sort of thing..

[j******e]  ...because of course it will.

[b******s]  ya, no types

[b******s]  => context of usage often results in coercion str<>num

[b******s]  back when it was just plain ASCII that type of string comparison
            wasn't too bad

[j******e]  So, you won't get weird things like "2" > "12"?

[b******s]  as string constants that's actually TRUE => compared as strings, char by
            char; numeric-looking input fields compare as numbers, so there 2 !> 12

[b******s]  '2 > "abc12"' is F => a string constant forces a string comparison ;
            '2 > "abc12"+0' is T => there the str resolves to 0

[j******e]  I'm sure that can lead to some interesting bugs.

[j******e]  ...but seems mostly reasonable, considering.

[b******s]  ya, best to avoid
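
 A quick way to check what a particular awk does with mixed comparisons. As far
 as the POSIX rules go, a comparison is numeric only when both sides are numbers
 or numeric-looking input (fields, -v values); a string constant on either side
 makes it a string comparison. The values in the comments are what a conforming
 awk should print; LC_ALL=C is an assumption to keep string ordering
 predictable, and cmp_demo is a made-up name.

```shell
# number-vs-string comparisons; 1 = true, 0 = false
cmp_demo() {
    LC_ALL=C awk 'BEGIN {
        print (42 <= "42abc")     # 1: compared as strings, "42" sorts before "42abc"
        print ("2" > "12")        # 1: string constants => string comparison
        print (2 > "abc12")       # 0: string comparison, "2" sorts before "a"
        print (2 > "abc12" + 0)   # 1: +0 forces a number, "abc12" becomes 0
    }'
    echo '2 12' | awk '{ print ($1 > $2) }'   # 0: numeric-looking fields compare as numbers
}
cmp_demo
```

 Adding 0 (or concatenating "") is the usual way to force the comparison type
 you actually want.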

[j******e]  Would something like 0x10 be coerced into a number?

[b******s]  hmm, depends on how a particular AWK is handling hex I think

[j******e]  Or just straight decimal?

[j******e]  Probably another one to file under 'best to avoid' then. ;)

[b******s]  well, at least test for the specific AWK you're planning to use

[b******s]  patterns can be used for ranges, similar to sed(1)

[b******s]  matching can be done as a range: '/pattern_start/ , /pattern_end/'

[b******s]  eg. 'NR==10, NR==20' => lines 10-20

[b******s]  AWK lacks a way to directly reference the last line of data

[b******s]  work-around for last-line: eg. 'NR==42 , EOF' => lines 42-last:

[b******s]  => works if EOF remains unassigned (0) ; or just use '0'

[b******s]  probably using '0' is the best option unless you *only* want the last
            line => use END{print $0}                                               
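
 A sketch of range patterns, including the "never-true end pattern" trick for
 running to the last line. The input comes from seq and range_demo is an
 illustrative name; the END rule assumes an awk that keeps $0 from the last
 record, which current implementations generally do.

```shell
# range patterns: start , end
range_demo() {
    seq 1 8 | awk '
        NR==2, NR==4  { print "mid:", $0 }    # lines 2-4
        NR==7, 0      { print "tail:", $0 }   # end pattern never true => runs to last line
        END           { print "last:", $0 }   # $0 still holds the final record
    '
}
range_demo
```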

[j******e]  So you can't use, say, -1 for second to last line?

[j******e]  NR==-1

[j******e]  ...or last line

[b******s]  not sure really

[b******s]  try it

[b******s]  appears to work!

[j******e]  Woo! I'm useful!
  
  <b******s shoots sky>

[a******r]  good job!

[b******s]  not sure we'll have much time for flow-control statements unless folks
            can stay past the hour

[j******h]  So reaching the END block still leaves $0 intact, from the last line of
            the input file? awk doesn't flush $0 from memory unless there's another
            line to read?                                                           

[b******s]  ya. I think that wasn't always the case though

[a******r]  I can stay

[b******s]  alright, on to flow-control statements..

[j******e]  If I can't stay long enough, I can review the chat afterward.

[j******h]  I can stay too

[a******r]  so its another check your awk thing?

[b******s]  for $0 in END? I think it's pretty standard now

[b******s]  but... setting $0 in BEGIN isn't ; NetBSD AWK seems not to allow it
            while mawk and others do

[a******r]  okay

[b******s]  one can definitely assign it via getline() but that's a bit different

[b******s]  anyways..

[b******s]  AWK statements can be separated by newlines or ';', or continued by '\'

[b******s]  the '\' is pretty much like in the shell

 ------------------------------------------------------------------------
 #   ex.  Foo = "fu" ; Bar = "bar"
 #        Aws = 42
 #        BigStr = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa" \
 #                 "bbbbbbbbbbbbbbbbbbbbbbbbbbbb" \
 #                 "cccccccccccccccccccccccccccc"
 #
 ------------------------------------------------------------------------


[b******s]  branching & looping: if();else if();else, while(), do{}while(), various
            for()

[b******s]  see details in sect. 9.7 of Classic Shell Scripting, Chap. 9 reference

[j******e]  I think pcom stripped your backslash. I was confused for a minute until I
            read the dump.

[b******s]  oh it did! gotta escape it

[b******s]  '\\'

[j******e]  unsanitized inputs in com? Say it ain't so...

[b******s]  right?

[b******s]  for basic increment/decrement of vars one can use 'i++' or 'i--'

[b******s]  note '++i' increments *BEFORE* use, 'i++' increments *AFTER*

[b******s]  other compact variable assignments: +=, -=, *=, /=, %=, ^=, =

[b******s]  ex. if you'll not be needing 'i' later, one can do 'while(++i <= NF)' to
            iterate over fields

[b******s]  does that make sense?

[b******s]  see POSIX AWK standard sheet in reference materials for details
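
 A sketch of pre- vs. post-increment and the compound assignments, using a
 one-line made-up input (incr_demo and the variable names are illustrative).
 Note the explicit reset of i: since it is global it keeps its value between
 records, so the 'while (++i <= NF)' idiom only works if i starts at 0 each time.

```shell
# increment forms and compound assignment
incr_demo() {
    echo 'a b c' | awk '{
        i = 0                        # reset per record: i is global and keeps its value
        while (++i <= NF)            # pre-increment: i is bumped, then compared
            print i, $i
        print "after loop: i=" i     # i lives on after the loop (NF + 1)
        j = 5; x = j++; print x, j   # post-increment: x gets the old value
        k = 5; y = ++k; print y, k   # pre-increment: y gets the new value
        k += 4; k ^= 2; print k      # compound assignment: (6 + 4) ^ 2
    }'
}
incr_demo
```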

[a******r]  does that mean that once you create the 'i', it lives on even after the
            loop?                                                                   

[b******s]  ya vars will keep their value unless another reassignment occurs; in
            user-defined functions the local vars all reset to 0 or "" upon each new
            call                                                                    

[j******e]  I believe we previously covered that uninitialized variables are "" or
            0, right?                                                               

[j******e]  ...oh, and I guess we can initialize them in the BEGIN{} block.

[j******e]  I catch on quick; you just have to explain long.

[b******s]  I guess I should mention that vanilla AWK has a single namespace so
            except for local vars in user-defined functions all vars are global

[j******e]  b******s: Do they only reset within the scope of the function, or does
            calling the function reset all the variables?                           
  
  <j******e asked that question clumsily.>

[b******s]  we'll cover those later

[b******s]  F-C statements used in BEGIN/END, body {action} blocks, & defined funcs

 ------------------------------------------------------------------------
 #   ex.
 #      # processing cmd line args (ARGV is built-in arg array):
 #      BEGIN {
 #        if (ARGV[1])
 #            for (i=1 ; i < ARGC ; i++)
 #                if (ARGV[i] ~ /^-([hH]|-help)/)
 #                    show_usage()
 #                else
 #                    ...
 #      }
 #
 ------------------------------------------------------------------------

[j******e]  Okay, so scoping of variables inside of functions is a thing, and if you
            need to use an external one, it needs to be passed in.                  

[b******s]  basically; arrays are the exception => always passed by reference, and a
            function can't return one

[j******h]  so functions that call themselves somewhere deeper in the function body,
            cannot assume that the inner function evaluation will see a pristine
            copy of any variables defined earlier in the function body?             

[b******s]  oh a recursive function? I think each call still resets the vars
  
  <j******e assumes you close these with endif, endfor, etc.?>
  
  <j******h asked that question clumsily.>

[b******s]  I don't use recursion much as it's usually slower and harder for me to
            read
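
 A sketch of function-local variables, recursion, and array arguments. The
 functions fact() and fill() and all variable names are made up for
 illustration. In plain AWK, locals are declared as extra parameters (by
 convention set off with extra spaces); each call, recursive or not, gets its own
 fresh copies, while arrays are passed by reference.

```shell
# locals via extra parameters; arrays by reference
func_demo() {
    awk '
        # extra parameters (after the gap) act as local variables
        function fact(n,    tmp) {
            tmp = n                     # each call gets its own fresh tmp
            if (n <= 1) return 1
            return tmp * fact(n - 1)    # inner calls do not clobber the outer tmp
        }
        # arrays are passed by reference, so the caller sees the change
        function fill(arr) { arr["x"] = 42 }
        BEGIN {
            print fact(5)
            fill(a); print a["x"]
            print (tmp == "" ? "tmp unset outside fact()" : "tmp leaked")
        }
    '
}
func_demo
```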

[b******s]  unlike some languages, AWK isn't too picky regarding "{}", single
            statements don't need them

 ------------------------------------------------------------------------
 #   ex.
 #       # all one block => no '{}'s required...
 #       for (i=1 ; i <= NF ; i++)
 #           if ( $i !~ /^[0-9]+$/ )
 #               if ( length ($i) < 72 )
 #                   print $i
 #               else
 #                   fold($i)  # usr-defined func.
 #
 #       # same as above but w/ maximum '{}'s...
 #       for (i=1 ; i <= NF ; i++) {
 #           if ( $i !~ /^[0-9]+$/ ) {
 #               if ( length ($i) < 72 ) {
 #                   print $i
 #               } else {
 #                   fold($i)  # usr-defined func.
 #               }
 #           }
 #       }
 #
 
 ------------------------------------------------------------------------


[b******s]  C-style ternary conditional operator: expr1 ? expr2 : expr3

[j******e]  b******s: So kind of like C-style ifs.

[b******s]  yes, I think a lot of AWK was C inspired

[b******s]  I really like these ternary operators

[j******e]  Wait, what was the ~ operator in your if statement in the dump?

[b******s]  I think it was...

[j******e]  I assume "matches regex"?

[b******s]  yes, and '!~' is the complement

[b******s]  unless a particular field is being targeted in the main body the '~' can
            usually be left off the pattern                                         

[j******e]  Got it.
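
 A small sketch of the difference between a bare regex pattern (tested against
 $0) and an explicit '~' or '!~' against one field. Input and the name
 tilde_demo are made up.

```shell
# bare /regex/ vs. field ~ /regex/
tilde_demo() {
    printf 'foo bar\nbar foo\n' |
    awk '
        /foo/        { print "any:", $0 }   # bare regex => tested against $0
        $2 ~ /foo/   { print "f2:", $0 }    # explicit ~ to target one field
        $2 !~ /foo/  { print "not:", $0 }   # complement
    '
}
tilde_demo
```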

[b******s]  ternary example:

 ------------------------------------------------------------------------
 #   ex.  print even / odd:
 #     $ seq 1 9 |awk '{print $1, ($1%2 == 0 ? "even":"odd")}'
 #     1 odd
 #     2 even
 #     3 odd
 #     ...
 #
 
 ------------------------------------------------------------------------


[b******s]  ternary operators are nestable => they quickly become hard to read..
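
 A sketch of one level of nesting, with parentheses added for readability (the
 input values and the name sign_demo are illustrative). Anything deeper is
 probably better written as if/else.

```shell
# nested ternary: negative / zero / positive
sign_demo() {
    printf '%s\n' -3 0 7 |
    awk '{ print $1, ($1 < 0 ? "neg" : ($1 == 0 ? "zero" : "pos")) }'
}
sign_demo
```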

[b******s]  control-flow interruption: next, break, continue, exit

[b******s]  'break' & 'continue' used in for(), while(), do-while() loops

[b******s]  for nested loops 'break' & 'continue' interrupt the innermost loop

[b******s]  ^ I think this is a pretty common logic

[b******s]  'next' ceases matching of current record & moves to next record

[b******s]  'next' only meaningful in main body action { expression }

[b******s]  => data only read within the main body unless getline() is used
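
 A sketch combining 'next', 'continue' and 'break' on made-up input (flow_demo
 is an illustrative name): comment lines are skipped entirely, non-numeric
 fields are passed over, and a negative field ends the loop for that record.

```shell
# next / continue / break in one small program
flow_demo() {
    printf '# comment\n1 2 x 4\n5 -1 6\n' |
    awk '
        /^#/ { next }                            # skip record; the rule below never sees it
        {
            sum = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^-/) break             # negative field: leave the loop
                if ($i !~ /^[0-9]+$/) continue   # non-numeric field: skip it, keep looping
                sum += $i
            }
            print NR ": sum=" sum
        }
    '
}
flow_demo
```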

[j******h]  In Perl it's possible to give a name to each loop, and refer to that
            name explicitly as the argument of 'next'. No such feature in AWK, I
            assume?                                                                 

[b******s]  don't think so; I think it's due to the single namespace thing

[b******s]  gawk supports multiple namespaces though

[b******s]  'exit' w/ opt. expression, ie. 'exit 1', quits script completely
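
 A sketch of 'exit' with a status, on made-up input (exit_demo and the variable
 bad are illustrative). Calling exit in the main body stops reading input, but
 the END block still runs, and the status given to exit is what the shell sees.

```shell
# exit status from awk, with END still running
exit_demo() {
    printf 'ok\nok\nFAIL\nok\n' |
    awk '
        /FAIL/ { bad = NR; exit 1 }    # stop reading; END still runs
        END    { if (bad) print "failed at line " bad }
    ' || echo "status=$?"
}
exit_demo
```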

[b******s]  whew! the end

[b******s]  I'll paste in the challenge exercise

[j******e]  I think I got most of that. :)

 ------------------------------------------------------------------------
 # Challenge Exercise!
 #  Both sed(1) and egrep(1) can use back reference to match
 #  strings with multiple adjacent identical characters, i.e.
 #  finding dictionary words w/ 3+ identical chars in a row:
 #
 #     $ nice sed -n '/\(.\)\1\{2,\}/p' /usr/share/dict/words
 #     bossship
 #     demigoddessship
 #     goddessship
 #     headmistressship
 #     patronessship
 #     wallless
 #     whenceeer
 #
 #     $ nice egrep '(.)\1{2,}' /usr/share/dict/words
 #     bossship
 #     demigoddessship
 #     goddessship
 #     headmistressship
 #     patronessship
 #     wallless
 #     whenceeer
 #
 #  => how would you do this in AWK ?
 #
 ------------------------------------------------------------------------

[b******s]  didn't know demigoddessship was a thing

[b******s]  thanks for hanging past the hour

[b******s]  feel free to ask questions

[a******r]  well, if you are going to have a demigoddess, might as well have a
            demigoddessship

[j******e]  Thanks for taking the time to go overtime to cover all the material. :)

[j******h]  /(sss|eee|lll)/ { print }
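
 The alternation above works for the letters listed. One possible general
 approach, offered only as a sketch and not necessarily the intended answer, is
 to walk the string and compare each 3-char window against its first char
 repeated three times. The sample words and the name triple_demo are made up; on
 a real system the input would be /usr/share/dict/words.

```shell
# find words with 3+ identical chars in a row, no back-references needed
triple_demo() {
    printf 'bossship\nbookkeeper\nwallless\nhello\n' |
    awk '{
        for (i = 1; i <= length($0) - 2; i++) {
            c = substr($0, i, 1)
            if (substr($0, i, 3) == (c c c)) { print; break }   # print once per word
        }
    }'
}
triple_demo
```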

[b******s]  I hope it was at least semi-coherent

[a******r]  yes, thanks heaps!
  
  <j******e has to head out>

[j******h]  Thanks b******s!
  
  <j******e learned stuff today.>

[b******s]  have a great evening & hope to see ya Thursday

[j******e]  ...so that's good.

[a******r]  this group is going to be the awk gurus!

[b******s]  helps that it really is a small language; I don't have to relearn a ton
            after being away from it                                                

[a******r]  is gawk the only one with networking?

[b******s]  seems so

[b******s]  also only one w/ bi-directional pipes

[b******s]  depending on what is being attempted, ncat/nc can often be used for
            simple retrievals