--Topography--------------------------------------------------------------------
AWK Workshop / Discussion - April 11 - 23, 2026
 * what:  an informal exploration of plain AWK, aka "new AWK"
 * where: SDF.org - both pcom ("awk" room) and irc.sdf.org ("#awk")
 * when:  Saturdays, 10-11am MDT ; Tuesdays & Thursdays, 6-7pm MDT
--------------------------------------------------------------------------------
[b******s] Day 2: regular & relational expressions, flow-control statements
[b******s] in retrospect I think I could have structured this thing better..
[b******s] AWK supports Extended Regular Expressions (ERE), similar to sed(1) & grep(1)
[j******e] It exists at all, and for that I'm thankful.
[b******s] unlike sed & grep, AWK does *NOT* have back-reference support
[b******s] probably everyone here knows what that is, things like '(pattern)\1'
[b******s] not a huge deal but see 'Challenge Exercise'
[j******e] I know what a regex is, but I'm unclear about what back-reference support means, tbh.
[b******s] it seems to be part of the Basic REs; in the above the parens are used to mark a pattern you can later reference using the '\m' form, up to 9 references are supported
[j******h] I mostly use back-references in the replacement pattern (for a sed command); I'm not too comfortable using them in the search pattern as well.
[b******s] yeah they are useful, not sure why awk didn't get them
[b******s] anyway, there are two AWK RegEx forms: /regex/ & "regex" (good for dynamic concatenation)
[b******s] heh
[b******s] the first form you've already seen last time
[b******s] the latter makes use of AWK's string concatenation which we'll cover later
[b******s] here's the basics:
------------------------------------------------------------------------
# AWK ERE meta-character basics:
#   "." => any char ; "^" => match start of string ; "$" => match end of string
#   "?" => match 0 or 1 of preceding ; "+" => match 1 or more of preceding
#   "*" => match 0 or more of preceding ; "A-Z", "a-z", "0-9" => range set
#   "[...]" => match any char contained; can be single char or range sets
#   "[^..]" => match any char NOT contained (complement of above)
#   "{n,m}" => match n-m number of preceding ; "{n,}" => matches n or more
#
# some common POSIX character classes - see re_format(7) for details:
#   [:alpha:], [:alnum:], [:digit:], [:upper:], [:cntrl:], [:space:]
#
# metachars (usually) lose specialness inside "[]" ; escape w/ "\" otherwise
#   ex. "[.+?*]" => matches ".", "+", "?" or "*"
#       "[\\]" => matches "\" ; "[\]]" => matches "]"
#
# some special cases:
#   "[[]" => matches "[" *IF* first char in brackets
#   "[-]" => matches "-" *IF* first char in brackets
#
------------------------------------------------------------------------
[b******s] probably a lot of that is familiar
[j******e] b******s: Would {,m} cover zero to m?
[j******e] ...I suppose you could just use {0,m} for that...
[b******s] hmm, not sure - try it!
[b******s] BTW, the complement ( "[^c..]" ) won't break 1st char special interpretation
[b******s] I probably spend more time trying to debug regex stuff than anything else WRT AWK
[b******s] hi f******x - type 'R 123' to review
[b******s] continue?
[j******h] How greedy is the {n,m} syntax? To how much of the "preceding" tokens will it be applied, without explicit parentheses to indicate the scope?
[b******s] I'm not sure TBH; that actually came up a while back on the gawk mlist
[b******s] well, greediness in general
[b******s] I think it's probably best to use parens if the pattern isn't very simple
[j******e] I'd always thought it was one token unless parentheses are used?
[b******s] right, that would be same as for ?, +, and *
[j******h] The egrep example at the bottom of today's notes suggests that in '(.)\1{2,}' the {2,} is only being applied to the \1 back-reference, not to the (.) as well.
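A quick way to check the "one token unless parenthesized" reading is to compare a bare pattern with a grouped one. This is a minimal sketch: it uses "+" instead of "{n,m}" because interval support varies between AWK implementations, and the sample strings and the function name quantifier_scope are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report, per input line, which of two patterns match.
quantifier_scope() {
    printf '%s\n' 'abab' 'abbb' 'ab' |
    awk '{
        bare  = ($0 ~ /^ab+$/)   ? "yes" : "no"   # "+" applies to "b" only
        group = ($0 ~ /^(ab)+$/) ? "yes" : "no"   # "+" applies to the group "ab"
        print $0, "ab+:" bare, "(ab)+:" group
    }'
}
quantifier_scope
```

Only the grouped pattern accepts "abab", and only the bare one accepts "abbb", which is consistent with the quantifier binding to a single preceding token.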
[b******s] right but the parens serve a different function for backref
[b******s] A.R. talks a bit about those in the Classic Shell Scripting book
[b******s] some ERE examples:
------------------------------------------------------------------------
# examples of typical ERE patterns:
#   "^[sS]erver"  => matches str starting w/ "Server" or "server"
#   "addr$"       => matches str ending w/ "addr"
#   "(http|goph)" => matches str containing "http" or "goph"
#   "^[a-z]+$"    => matches str ONLY containing a-z
#   "[^sdf]"      => matches str containing any char other than s, d, or f
#   "t{2}"        => matches str containing "tt"
#   "[qv]+"       => matches str with one or more "q" or "v"
#
------------------------------------------------------------------------
[b******s] boolean "&&" and "||" along w/ "(|)" allow more complex matching
------------------------------------------------------------------------
# ex. ^(GOPH|SERV)|[_]{2,3}|(ADDR|NAME)$
#   => matches strings w/ any of the following characteristics:
#      - begins w/ "GOPH" or "SERV"
#      - contains 2 or 3 "_" chars
#      - ends w/ "ADDR" or "NAME"
#
------------------------------------------------------------------------
[b******s] ... on to AWK relational expressions ...
[b******s] these are typically part of a match pattern or used w/ if(), while(), etc.
[j******e] Dumb question, what's the difference between | and ||?
[b******s] '|' => or within a '()' group
[b******s] the other is boolean counterpart to &&
[b******s] the "(http|goph)" => matches str containing "http" or "goph" for example
[b******s] continue?
[a******r] what about the one right after that? after the )
[b******s] it's a group so the parens are paired
[j******e] I think I've got it, but I'll save and review the chat log to check back later.
[b******s] oh, you mean the ^(GOPH|SERV)|[_]{2,3}|(ADDR|NAME)$ example?
[a******r] yes
[b******s] it's a nested example
[a******r] does it mean that the _ characters are optional?
[a******r] but if there, must be 2 or 3?
[b******s] no, the string contains 2 or 3 "_" chars
[j******e] Wait, what does the [_] mean? What makes it different from just _?
[b******s] the '[]' usually removes specialness from chars; in this case they're likely not needed but w/ weird characters I like putting brackets around them
[j******h] I wonder if you could do it without putting the underscore inside the character class, though. Since we agreed that the {n,m} syntax only applied to the preceding character ...
[b******s] yeah it should work
[b******s] underscores can be hard to read => I often bracket them
[j******e] Could you not use parens here? Does the underscore have any special meaning like ., ?, etc.?
[b******s] no it's not a metachar
[j******e] Fair enough.
[b******s] on to relational expressions ?
[b******s] these are easier
[j******e] Sounds good.
[b******s] AWK relational operators: >, <, >=, <=, !=, ==, !~, ~
[b******s] these we've kind of already seen in the pattern-action stuff
------------------------------------------------------------------------
# some examples:
#   '$3 > $1'     => lines w/ field 3 > field 1
#   'NF >= 5'     => lines w/ 5+ fields ; 'NR != 1' => NOT line 1
#   'length < 72' => length w/o arg is quasi-var == length of $0
#
------------------------------------------------------------------------
[b******s] i.e. $ awk 'NF >= 5' fubar.txt => lines w/ 5+ fields
[a******r] you are right, these are easier!
[j******e] If memory serves, $0 is the concatenation of all the fields?
[b******s] right.
[b******s] for AWK's main body these patterns default to matching against $0 so there's no need to explicitly use it
[j******h] length of $0 sounds interesting. I assume $0 is not referring to the name of the awk script you ran, though. In bash you can use $0 to see how the command was called.
[b******s] ya 'length' w/o an expression is like a variable set to the length of $0
[a******r] kind of like Perl's $_ ?
[b******s] probably
[b******s] length == length($0)
[b******s] some AWK weirdness: AWK will attempt to mathematically compare strings
[b******s] eg. '42 <= "42abc"' => TRUE! try to avoid this sort of thing..
[j******e] ...because of course it will.
[b******s] ya, no types
[b******s] => context of usage often results in coercion str<>num
[b******s] back when it was just plain ASCII those types of string comparisons weren't too bad
[j******e] So, you won't get weird things like "2" > "12"?
[b******s] that'll be FALSE => both sides are coerced to digits so 2 !> 12
[b******s] '2 > "abc12"' is T => the str resolves to 0
[j******e] I'm sure that can lead to some interesting bugs.
[j******e] ...but seems mostly reasonable, considering.
[b******s] ya, best to avoid
[j******e] Would something like 0x10 be coerced into a number?
[b******s] hmm, depends on how a particular AWK is handling hex I think
[j******e] Or just straight decimal?
[j******e] Probably another one to file under 'best to avoid' then. ;)
[b******s] well, at least test for the specific AWK you're planning to use
[b******s] patterns can be used for ranges, similar to sed(1)
[b******s] matching can be done as a range: '/pattern_start/ , /pattern_end/'
[b******s] eg. 'NR==10, NR==20' => lines 10-20
[b******s] AWK lacks a way to directly reference the last line of data
[b******s] work-around for last-line: eg. 'NR==42 , EOF' => lines 42-last
[b******s] => works if EOF remains unassigned (0) ; or just use '0'
[b******s] probably using '0' is the best option unless you *only* want the last line => use END{print $0}
[j******e] So you can't use, say, -1 for second to last line?
[j******e] NR==-1
[j******e] ...or last line
[b******s] not sure really
[b******s] try it
[b******s] appears to work!
[j******e] Woo! I'm useful!
[a******r] good job!
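The comparison and range claims above are easy to test against whichever AWK you plan to use. As far as I understand the POSIX rules, a comparison is numeric only when both sides are numbers or numeric-looking input such as fields; if either side is a string constant, the comparison is done as strings. So quoted constants and fields can give different answers for the same digits. A minimal sketch follows; the function names are made up for illustration.

```shell
#!/bin/sh
# String constants: compared as strings, character by character.
compare_constants() {
    awk 'BEGIN {
        print ("2" > "12")       # string compare: "2" sorts after "1"
        print (2 > "abc12")      # 2 becomes "2", which sorts before "a"
        print (42 <= "42abc")    # "42" is a prefix of "42abc"
    }'
}

# Fields that look like numbers: compared numerically.
compare_fields() {
    echo '2 12' | awk '{ print ($1 > $2) }'
}

# Range whose end pattern never matches: runs through the last line.
range_to_end() {
    seq 1 6 | awk 'NR==4, 0'
}

compare_constants; compare_fields; range_to_end
```

With the AWKs I would expect to find on a typical system this prints 1, 0, 1 for the constants and 0 for the fields, then lines 4 through 6. An end pattern of NR==-1 would presumably behave like 0 for the same reason: it is never true, so the range simply never closes.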
[b******s] not sure we'll have much time for flow-control statements unless folks can stay past the hour
[j******h] So reaching the END block still leaves $0 intact, from the last line of the input file? awk doesn't flush $0 from memory unless there's another line to read?
[b******s] ya. I think that wasn't always the case though
[a******r] I can stay
[b******s] alright, on to flow-control statements..
[j******e] If I can't stay long enough, I can review the chat afterward.
[j******h] I can stay too
[a******r] so it's another check your awk thing?
[b******s] for $0 in END? I think it's pretty standard now
[b******s] but... setting $0 in BEGIN isn't ; NetBSD AWK seems not to allow it while mawk and others do
[a******r] okay
[b******s] one can definitely assign it via getline() but that's a bit different
[b******s] anyways..
[b******s] AWK statements can be separated by newlines or ';', or continued by '\'
[b******s] the '\' is pretty much like in the shell
------------------------------------------------------------------------
# ex. Foo = "fu" ; Bar = "bar"
#     Aws = 42
#     BigStr = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa" \
#              "bbbbbbbbbbbbbbbbbbbbbbbbbbbb" \
#              "cccccccccccccccccccccccccccc"
#
------------------------------------------------------------------------
[b******s] conditionals & looping: if();else if();else, while(), do{}while(), various for()
[b******s] see details in sect. 9.7 of Classic Shell Scripting, Chap. 9 reference
[j******e] I think pcom stripped your backslash. I was confused for a minute until I read the dump.
[b******s] oh it did! gotta escape it
[b******s] '\\'
[j******e] unsanitized inputs in com? Say it ain't so...
[b******s] right?
[b******s] for basic increment/decrement of vars one can use 'i++' or 'i--'
[b******s] note '++i' increments *BEFORE* use, 'i++' increments *AFTER*
[b******s] other compact variable assignments: +=, -=, *=, /=, %=, ^=, =
[b******s] ex. if you'll not be needing 'i' later, one can do 'while(++i <= NF)' to iterate over fields
[b******s] does that make sense?
[b******s] see POSIX AWK standard sheet in reference materials for details
[a******r] does that mean that once you create the 'i', it lives on even after the loop?
[b******s] ya vars will keep their value unless another reassignment occurs; in user-defined functions the local vars all reset to 0 or "" upon each new call
[j******e] I believe we previously covered that uninitialized variables are "" or 0, right?
[j******e] ...oh, and I guess we can initialize them in the BEGIN{} block.
[j******e] I catch on quick; you just have to explain long.
[b******s] I guess I should mention that vanilla AWK has a single namespace so except for local vars in user-defined functions all vars are global
[j******e] b******s: Do they only reset within the scope of the function, or does calling the function reset all the variables?
[b******s] we'll cover those later
[b******s] F-C statements used in BEGIN/END, body {action} blocks, & defined funcs
------------------------------------------------------------------------
# ex.
# # processing cmd line args (ARGV is built-in arg array):
#   BEGIN {
#     if (ARGV[1])
#       for (i=1 ; i < ARGC ; i++)
#         if (ARGV[i] ~ /^-([hH]|-help)/)
#           show_usage()
#         else
#           ...
#   }
#
------------------------------------------------------------------------
[j******e] Okay, so scoping of variables inside of functions is a thing, and if you need to use an external one, it needs to be passed in.
[b******s] basically, except you can't pass in arrays
[j******h] so functions that call themselves somewhere deeper in the function body, cannot assume that the inner function evaluation will see a pristine copy of any variables defined earlier in the function body?
[b******s] oh a recursive function? I think each call still resets the vars
[b******s] I don't use recursion much as it's usually slower and harder for me to read
[b******s] unlike some languages, AWK isn't too picky regarding "{}", single blocks don't need them
------------------------------------------------------------------------
# ex.
# # all one block => no '{}'s required...
#   for (i=1 ; i <= NF ; i++)
#     if ( $i !~ /^[0-9]+$/ )
#       if ( length ($i) < 72 )
#         print $i
#       else
#         fold($i)    # usr-defined func.
#
# # same as above but w/ maximum '{}'s...
#   for (i=1 ; i <= NF ; i++) {
#     if ( $i !~ /^[0-9]+$/ ) {
#       if ( length ($i) < 72 ) {
#         print $i
#       } else {
#         fold($i)    # usr-defined func.
#       }
#     }
#   }
#
------------------------------------------------------------------------
[b******s] C-style ternary conditional operator: expr1 ? expr2 : expr3
[j******e] b******s: So kind of like C-style ifs.
[b******s] yes, I think a lot of AWK was C inspired
[b******s] I really like these ternary operators
[j******e] Wait, what was the ~ operator in your if statement in the dump?
[b******s] I think it was...
[j******e] I assume "matches regex"?
[b******s] yes, and '!~' is the complement
[b******s] unless a particular field is being targeted in the main body the '~' can usually be left off the pattern
[j******e] Got it.
[b******s] ternary example:
------------------------------------------------------------------------
# ex. print even / odd:
#   $ seq 1 9 |awk '{print $1, ($1%2 == 0 ? "even":"odd")}'
#   1 odd
#   2 even
#   3 odd
#   ...
#
------------------------------------------------------------------------
[b******s] ternary operators are nestable => quickly become hard to read..
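To see why nested ternaries get hard to read, here is the same three-way classification written both ways. This is a sketch only; the labels, the sample input, and the function name classify are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: label each number using a nested ternary and an if chain.
classify() {
    printf '%s\n' -3 0 7 |
    awk '{
        # nested ternary: groups right-to-left as a ? b : (c ? d : e)
        t = ($1 < 0) ? "neg" : ($1 == 0) ? "zero" : "pos"

        # equivalent if / else if / else chain
        if ($1 < 0)       s = "neg"
        else if ($1 == 0) s = "zero"
        else              s = "pos"

        print $1, t, s
    }'
}
classify
```

Both forms give the same label for each input; the if chain takes more lines but stays readable as more cases are added.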
[b******s] control-flow interruption: next, break, continue, exit
[b******s] 'break' & 'continue' used in for(), while(), do-while() loops
[b******s] for nested loops 'break' & 'continue' interrupt the innermost loop
[b******s] ^ I think this is a pretty common logic
[b******s] 'next' ceases matching of current record & moves to next record
[b******s] 'next' only meaningful in main body action { expression }
[b******s] => data only read within the main body unless getline() is used
[j******h] In Perl it's possible to give a name to each loop, and refer to that name explicitly as the argument of 'next'. No such feature in AWK, I assume?
[b******s] don't think so; I think it's due to the single namespace thing
[b******s] gawk supports multiple namespaces though
[b******s] 'exit' w/ opt. expression, ie. 'exit 1', quits script completely
[b******s] whew! the end
[b******s] I'll paste in the challenge exercise
[j******e] I think I got most of that. :)
------------------------------------------------------------------------
# Challenge Exercise!
#   Both sed(1) and egrep(1) can use back-references to match
#   strings with multiple adjacent identical characters, i.e.
#   finding dictionary words w/ 3+ identical chars in a row:
#
#   $ nice sed -n '/\(.\)\1\{2,\}/p' /usr/share/dict/words
#   bossship
#   demigoddessship
#   goddessship
#   headmistressship
#   patronessship
#   wallless
#   whenceeer
#
#   $ nice egrep '(.)\1{2,}' /usr/share/dict/words
#   bossship
#   demigoddessship
#   goddessship
#   headmistressship
#   patronessship
#   wallless
#   whenceeer
#
#   => how would you do this in AWK ?
#
------------------------------------------------------------------------
[b******s] didn't know demigoddessship was a thing
[b******s] thanks for hanging past the hour
[b******s] feel free to ask questions
[a******r] well, if you are going to have a demigoddess, might as well have a demigoddessship
[j******e] Thanks for taking the time to go overtime to cover all the material. :)
[j******h] /(sss|eee|lll)/ { print }
[b******s] I hope it was at least semi-coherent
[a******r] yes, thanks heaps!
[j******h] Thanks b******s!
[b******s] have a great evening & hope to see ya Thursday
[j******e] ...so that's good.
[a******r] this group is going to be the awk gurus!
[b******s] helps that it really is a small language; I don't have to relearn a ton after being away from it
[a******r] is gawk the only one with networking?
[b******s] seems so
[b******s] also only one w/ bi-directional pipes
[b******s] depending on what is being attempted, ncat/nc can often be used for simple retrievals
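One possible AWK answer to the challenge, generalizing the /(sss|eee|lll)/ idea: without back-references, walk each word with substr() and count the length of the current run of identical characters. This is a sketch and not necessarily the intended solution; the sample words stand in for /usr/share/dict/words, and the function name find_runs and the variable min are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print input lines containing a run of
# at least "min" identical adjacent characters.
find_runs() {
    awk -v min=3 '{
        run = 1
        for (i = 2; i <= length($0); i++) {
            if (substr($0, i, 1) == substr($0, i - 1, 1)) {
                # same char as the previous one: extend the current run
                if (++run >= min) { print; break }
            } else
                run = 1    # different char: start a new run
        }
    }'
}
printf '%s\n' bossship goddess wallless moon | find_runs
```

Against the real dictionary the call would be something like find_runs < /usr/share/dict/words. Unlike the alternation approach, this does not need one branch per letter.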