--Topography--------------------------------------------------------------------
AWK Workshop / Discussion - April 11 - 23, 2026
 * what: an informal exploration of plain AWK, aka "new AWK"
 * where: SDF.org - both pcom ("awk" room) and irc.sdf.org ("#awk")
 * when: Saturdays, 10-11am MDT ; Tuesdays & Thursdays, 6-7pm MDT
--------------------------------------------------------------------------------
[b******s] Day 3: AWK builtins & defaults, basic IO, printf() / sprintf()
[b******s] AWK has several built-in variables; most can be reassigned
------------------------------------------------------------------------
# AWK built-in Variables and (default value):
#
#   FILENAME = name of current data file if any
#
#   NF       = number of fields in current record
#   NR       = number of records, total
#   FNR      = number of records, current file
#     => NF, NR, FNR set with each new record read
#     => NF can be reassigned, i.e.
#          $ echo 'a b c' |awk '{NF-- ; print $0}'
#          a b
#
#   FS       = field separator (" " ; can be regex)
#   OFS      = output field separator (" ")
#
#   RS       = record separator ("\n") *
#   ORS      = output record separator ("\n")
#     * RegEx sometimes supported; POSIX limits to single char
#
#   OFMT     = floating-point format string ("%.6g")
#   CONVFMT  = floating-point -> str format string ("%.6g")
#
#   RSTART   = start of pattern match in match()
#   RLENGTH  = length of pattern match in match()
#   SUBSEP   = "multi-dim" array indices separator ("\034")
#   ARGC     = # of cmd line args (includes calling cmd)
#
------------------------------------------------------------------------
[b******s] these are on the POSIX standard reference; I tried to group them by function
[b******s] AWK has two built-in arrays: ENVIRON & ARGV
[b******s] ARGV I think was shown on Day 1 IIRC
[b******s] ENVIRON = array of inherited shell environment variables
[b******s] ARGV = array of command line args ; ARGV[0] = "awk" ; ARGC = total
[b******s] ENVIRON["var"] reassignments won't affect inherited environment
[b******s] questions?
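A minimal sketch of the ENVIRON and ARGV points above, assuming a POSIX shell. DEMO_VAR and the arguments "x" and "y" are hypothetical names used only for illustration; they are not from the session.

```shell
# Sketch only: DEMO_VAR is a hypothetical variable name.
DEMO_VAR=outer
export DEMO_VAR

# Reassign ENVIRON["DEMO_VAR"] inside awk; only the script's own copy changes.
inside=$(awk 'BEGIN { ENVIRON["DEMO_VAR"] = "inner"; print ENVIRON["DEMO_VAR"] }')

# The shell's variable is untouched afterwards.
echo "inside awk: $inside"
echo "in shell:   $DEMO_VAR"

# ARGV[0] is the command name, so the real arguments start at ARGV[1];
# ARGC counts the command name along with them.
args=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) printf "%d=%s ", i, ARGV[i]; print ARGC }' x y)
echo "args: $args"
```

Because the whole program is a BEGIN block, awk never tries to open "x" or "y" as input files here.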
[j******e] So, if you change ENVIRON["var"] that won't change the value of the "var" environment variable outside the script.
[b******s] yes, it will just show the new assignment *in* the script
[j******h] The indices of ARGV are integers, but the indices of ENVIRON are the names of the environment variables themselves? How would you look up the indices of ENVIRON from within awk itself? (In Perl you have the function 'keys' for that.)
[b******s] for (v in ENVIRON) print ENVIRON[v]
[b******s] it's good to test for existence though so you don't inadvertently create things
[b******s] => if(XYZ in Arr)...
[b******s] it'll get covered in the arrays section
[j******h] doesn't the (v in ENVIRON) already limit the loop to the keys that already exist?
[b******s] yes
[j******e] If you're pulling keys out with a for loop as described above, is there a risk of calling a key that doesn't exist somehow, or is this just a general good idea when dealing with keys?
[b******s] if you reference a non-existing variable it gets created
[b******s] usually it's just a var=0
[j******e] Better than crashing I suppose.
[b******s] AWK has several built-in commands => see POSIX standard reference
[b******s] AWK command sampling..
[b******s] index(s, t) => returns position of "t" in "s" OR 0 if not found
[b******s] match(s, e) => returns position of "e" in "s" OR 0 if not found; the "e" can be string OR regex; sets RSTART & RLENGTH variables; often used with substr()
[b******s] substr(s, m[, n]) => returns portion of "s" from 'm' to end, or at most 'n' chars
[b******s] some examples..
------------------------------------------------------------------------
# ex. index(), match(), substr() sampling..
# $ echo 'a_b_c' |awk '!index($0,"Z"){print " => no Z"}'
#  => no Z
# $ echo 'a_b_c' |awk '{print substr($0,index($0,"b"))}'
# b_c
# $ echo 'a_b_c' |awk '{print substr($0,index($0,"b")-1,3)}'
# _b_
# $ echo 'a_b_c' |awk '{match($0,"b");print substr($0,RSTART)}'
# b_c
# $ echo 'a_b_c' |awk '{match($0,"b");print substr($0,RSTART,1)}'
# b
# $ echo 'a_b_c' |awk '{match($0,".b.");print substr($0,RSTART,RLENGTH)}'
# _b_
# $ echo 'a_b_c' |awk '{match($0,".b.");print substr($0,++RSTART,RLENGTH-2)}'
# b
#
------------------------------------------------------------------------
[j******h] For these pipes from echo to awk, would there be any value for the builtin variable FILENAME?
[b******s] I don't believe so since it's just stdin
[b******s] IO is coming up next!
[j******e] So the indexes start from 1 rather than 0 to prevent an index in the 0th position from being interpreted as false?
[b******s] yes, I think the idea is it allows testing for existence
[a******r] shows blank on mine
[b******s] which one?
[b******s] oh, the underscores?
[j******h] which test? probably echo 'a_b_c' |awk '{print FILENAME}'
[a******r] using pipe doesn't show the FILENAME variable being set
[b******s] right, because it's just stdin
[b******s] continue?
[j******e] Yes, please.
[a******r] please
[j******h] Yes, continue
[b******s] most AWK cmds are non-destructive
[b******s] AWK has two destructive commands: sub() and gsub()
[b******s] sub(e, r[, v]) => replace "e" with "r" in $0 or 'v' ; just once
[b******s] gsub(e, r[, v]) => replace "e" with "r" in $0 or 'v' ; multiple times
[b******s] for both, "e" can be string or regex ; assumes $0 if 'v' omitted
[b******s] some illustration..
------------------------------------------------------------------------
# ex. sub() and gsub() sample..
# $ echo 'a - b - c' |awk '{sub("-", "+")} ; //'
# a + b - c
# $ echo 'a - b - c' |awk '{sub("-", "+", $2)} ; //'
# a + b - c
# $ echo 'a - b - c' |awk '{sub("-", "+", $4)} ; //'
# a - b + c
# $ echo 'a - b - c' |awk '{gsub("-", "+")} ; //'
# a + b + c
# $ echo 'a - b - c' |awk '{gsub("[a-z]", "&&&")} ; //'
# aaa - bbb - ccc
# $ echo 'a - b - c' |awk '{gsub("[a-z]", $2)} ; //'
# - - - - -
#
------------------------------------------------------------------------
[b******s] the "&" works like in sed(1), copies the matched pattern
[b******s] onward?
[j******e] What's the ;// doing?
[j******h] okay
[b******s] '//' prints every line
[j******e] I know // is a match everything.
[b******s] right, the 1st match is just tweaking $0
[j******e] Oh, so the {print $0} is implied?
[j******h] '//' is more concise than an explicit 'print'
[b******s] ^
[j******e] Okay. Got it. I'm good to move on.
[b******s] ... onto AWK IO ...
[b******s] AWK special filenames: "/dev/stdin", "/dev/stdout", "/dev/stderr"
[b******s] => for POSIX systems; check manpage for proper referencing
[b******s] terminal IO: "-" == "/dev/stdin" ; "/dev/tty" == "/dev/stdout"
------------------------------------------------------------------------
# eg. woot all the way down..
# $ echo 'woot' |awk '//'
# woot
# $ echo 'woot' |awk '//' -
# woot
# $ echo 'woot' |awk '//' /dev/stdin
# woot
# $ echo 'woot' |awk '{print}'
# woot
# $ echo 'woot' |awk '{print > "/dev/tty"}'
# woot
# $ echo 'woot' |awk '{print > "/dev/stdout"}'
# woot
#
------------------------------------------------------------------------
[j******e] So these files will work even on Windows machines?
[b******s] I'm not sure; it depends on the OS and I don't think Windows is an officially supported OS
[b******s] is it considered POSIX?
[j******e] Fair enough. It's not exactly POSIX.
[b******s] I think they do have a POSIX-like environment for it now
[j******e] I thought that maybe AWK was doing some compatibility magic behind the scenes.
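Going back to the special filenames listed above: a small sketch of sending some records to "/dev/stderr" while the rest go to stdout. It assumes a POSIX-like system where awk accepts that name; split_streams and the sample lines are hypothetical, made up for illustration.

```shell
# Sketch only: split_streams is a hypothetical helper name.
# Lines starting with "ERR" go to stderr, everything else to stdout.
split_streams() {
    awk '/^ERR/ { print > "/dev/stderr"; next } { print }'
}

# Capture stdout only (stderr discarded).
out=$(printf 'ok 1\nERR bad\nok 2\n' | split_streams 2>/dev/null)

# Capture stderr only (stdout discarded).
err=$(printf 'ok 1\nERR bad\nok 2\n' | split_streams 2>&1 >/dev/null)

echo "stdout: $out"
echo "stderr: $err"
```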
[b******s] oh like python? I don't think so
[j******e] Yeah, Cygwin, I believe?
[b******s] that's a nice thing about python & similar; you don't have to worry so much about the OS
[b******s] AWK can use '<', '>', '>>', and '|' for external files & cmds
------------------------------------------------------------------------
# ex. random data sorted numerically ascending:
# $ shuffle -n7 |awk '{ print | "sort" } END{ close("sort") }'
# 0
# 1
# 2
# 3
# 4
# 5
# 6
#
------------------------------------------------------------------------
[b******s] use 'r' or 'R 123' to display
[b******s] note: the first use of 'print >"file"' zeros, subsequent use appends
[b******s] => safer to use "shell rules" and stick to '>>' for appending
[j******e] Having to 'close' a command you're piping to still feels weird to me.
[b******s] for simple one-liners it's probably unnecessary
[j******e] A file you're reading from/writing to, sure.
[j******e] What if you're doing a {print | "foo" | "bar"}? I assume you'd have to close both?
[b******s] you'd assign the entire pipe as a str to, say 'Cmd', then reference that
[b******s] that makes it easier to read and also allows some modularity to constructing such command strings
[j******h] if the command you're piping to is a lot longer than 4 chars, do you still have to use the entire command as the argument of 'close'? In that case, maybe define a variable for it first.
[a******r] like the Moo for cowsay variable in day 1
[b******s] yes
[b******s] yes that's mostly how I use it
[b******s] note: while AWK supports "/dev/stderr" it mostly lacks error handling
[b******s] => ie. no way to determine if a file is writable beforehand
[b******s] for finer-grained control use test(1) via system() instead:
------------------------------------------------------------------------
# ex. test if test-ro is writable:
# $ awk 'system("test -w " FILENAME){print "not writable"}' test-ro
# not writable
#
# note: system() only returns its exit status
#       (0 = success, so a non-zero result here means "not writable")
------------------------------------------------------------------------
[b******s] there is also a fflush() command for I guess buffered data
[b******s] only used that with gawk with network stuff
[b******s] doesn't look like fflush() is part of the POSIX standard
[a******r] that might explain trouble I was having experimenting with the gawk networking - didn't flush!
[b******s] perhaps
[b******s] more IO stuff w/ getline()
[b******s] .. onto printf() / sprintf() ...
[b******s] AWK's printf() is modelled on printf(3) from C language
[b******s] basic form: printf ("format_str", arg, arg, ...) ; '()' are optional
[b******s] AWK sprintf() similar but used for string creation ; '()' are required
[b******s] many AWKs have time functions as well; the strftime() is sort of similar to the above
[j******e] Gotta love that consistency with the requireness of the ()
[b******s] both printf() and sprintf() support dynamic field sizing:
------------------------------------------------------------------------
# ex. print floating point value w/ field width of 7 & precision of 2, 0 padded:
# $ awk 'BEGIN{printf "%0*.*f\n", 7, 2, 123.456789}'
# 0123.46
#
# note: zero padding only works w/ numbers
------------------------------------------------------------------------
[b******s] printf() quoted format strings can be assigned to variables and used
[b******s] => $ awk 'BEGIN{Fmt="%0*.*f\n"; printf Fmt, 7, 2, 123.456789}'
[b******s] I often do this for readability
[b******s] sprintf() often useful for string concatenation
------------------------------------------------------------------------
# ex. string concatenation 2 different ways..
# $ awk 'BEGIN{S = "a" ":" "b" ; print S}'
# a:b
# $ awk 'BEGIN{S = sprintf("%s:%s", "a", "b") ; print S}'
# a:b
#
------------------------------------------------------------------------
[b******s] we haven't really talked about concatenation in AWK
[a******r] r
[b******s] arrr
[b******s] that is actually all of the bullet points I have for today
[j******e] TBH, I'm going to have to go over the format strings a little more to really wrap my brain around them.
[b******s] there are lots of options and I find for odd stuff I have to look it up
[b******s] but at least it's almost the same as what is shown in printf(1)
[j******e] The padding options felt strange.
[j******e] ...but my ADHD has been in high gear today, making things difficult. Fortunately, I keep chat logs, so I can go over it again later.
[b******s] if I don't make notes for myself I have to relearn things over and over
[j******h] This talk about format strings reminds me of the fscanf man-page, whose CAVEATS section makes me wary about using it.
[b******s] I find it pretty useful for making gopher stuff since one has to use so many tabs
[j******e] Yeah, I've been doing that. I just find it easier to refine them after the fact, when I'm not trying to keep up with the chat.
[j******e] ...hence the logs.
[a******r] all those options sure give one a lot of flexibility to format your output though
[b******s] sometimes it's easier to just pipe the output to say pr(1) or rs(1)
[j******h] Quoting from fscanf(3): "It is very difficult to use these functions correctly, and it is preferable to read entire lines with fgets(3) or getline(3) and parse them later with sscanf(3) or more specialized functions such as strtol(3)."
[b******s] lol
[j******e] I love the UNIX philosophy of "just glue together whatever tools you need"
[b******s] me too
[j******h] Actually fscanf(3) is what I might reach for, if I had to process fixed-width data where some records are missing data. awk might not do so well on that task, because it collapses all whitespace.
[b******s] I think that's a common issue with CSV data
[b******s] oh, BTW, the nawk(1) installed on SDF appears to be a slightly old version of the One True Awk
[b******s] think it lacks csv support
[j******h] Heh, at least with CSV there's an explicit delimiter. But with fixed-width fields, a record that's missing one field might be interpreted in awk as having a smaller NF than the previous record.
[b******s] oh I see
[b******s] there is only so much that a FS regex pattern can capture
[j******e] Regarding CSV in AWK: this is a tool I made a while ago to convert CSV to/from a more AWK-friendly format: https://git.fingerprintsoftware.ca/j******e/csv-awk
[j******h] b******s: Here's an example of fixed-width fields.
    https://weather.uwyo.edu/cgi-bin/wyowx.fcgi?TYPE=sflist&DATE=current&HOUR=current&UNITS=A&STATION=BIL
[b******s] ah ok, I was thinking the fields were literally all the same width => not impossible to parse
[b******s] in AWK you can reassign NF for current record, i.e. to create more empty fields
[b******s] still likely to be a mess
[j******h] I seem to recall that MATLAB has a dedicated function for reading fixed-width fields. You can specify a format string that gives the width of each field, almost the reverse of the printf() feature.
[b******s] yeah matlab or octave would be the way to go I think
[b******s] well, it's 7:03 here
[j******h] Anyway, thanks for Day 3! I'm out for the night.
[b******s] have a good evening all
[a******r] as always, thanks heaps for all of your work!
[j******e] Good night!
[b******s] you're quite welcome!
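On the fixed-width question, one possible approach in plain AWK is to skip field splitting and slice $0 with substr(), which was covered earlier in the session. This is only a sketch under stated assumptions: the column layout (5, 4 and 4 characters), the helper name parse_fixed, and the sample rows are all made up for illustration and are not taken from the linked page.

```shell
# Sketch only: parse_fixed and its column layout are hypothetical.
parse_fixed() {
    awk '{
        station = substr($0, 1, 5)     # columns 1-5
        temp    = substr($0, 6, 4)     # columns 6-9
        wind    = substr($0, 10, 4)    # columns 10-13
        gsub(/ /, "", station)         # strip the padding
        gsub(/ /, "", temp)
        gsub(/ /, "", wind)
        if (temp == "") temp = "NA"    # blank column => missing value
        if (wind == "") wind = "NA"
        print station ":" temp ":" wind
    }'
}

# Build two fixed-width rows with printf(1); the second has no temperature.
rows=$({ printf '%-5s%4s%4s\n' KBIL 72 10
         printf '%-5s%4s%4s\n' KBIL '' 8; } | parse_fixed)
echo "$rows"
```

With default field splitting the second row would have NF == 2 and the wind value would land in $2; slicing by position keeps each value in its own column.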