[HN Gopher] Brian Kernighan adds Unicode support to Awk
___________________________________________________________________
Brian Kernighan adds Unicode support to Awk
Author : ducktective
Score : 288 points
Date : 2022-08-20 18:32 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| nanna wrote:
| Apparently Kernighan is also updating his 1988 Awk book this
| summer.
|
| https://irreal.org/blog/?p=10746
| 7thaccount wrote:
| That would be cool
| ducktective wrote:
| I became aware of this while watching Professor Brailsford's
| interview with him (Computerphile channel):
| https://www.youtube.com/watch?v=GNyQxXw_oMQ
|
| (Around 7-8 minute mark)
|
| Update:
|
| At 24-25 minute mark, he talks about the technologies he is
| inquiring to write his new book with (he mentions troff and
| groff).
|
| He says he wanted to try "XeTeX" (which supports Unicode) but
| "...I was going to download it as an experiment and they wanted 5
| gigabytes and 5 gigabytes at the particular boonies place I'm
| living would...mmm..not be finished yet!"
|
| So there we go... We had the opportunity to read the mind of the
| developer of awk and Unix, and co-author of the literal "C
| Programming Language", confronting the absolute state of the
| tooling of the modern world.
| mlyle wrote:
| > He says he wanted to try "XeTeX" (which supports Unicode) but
| "...I was going to download it as an experiment and they wanted
| 5 gigabytes and 5 gigabytes at the particular boonies place I'm
| living would...mmm..not be finished yet!"
|
| Man, I think once you're Kernighan there should be like a
| 1 gigabit/sec symmetric circuit wherever you go, just in case
| you use it to do something else useful.
| rustqt6 wrote:
| You put it jokingly, but distinguished people should
| definitely get some state-managed perks like politicians do.
| scoot wrote:
| This isn't about bandwidth, it's about the size of modern
| binaries.
| inglor_cz wrote:
| Pardon me, but why have modern binaries grown so big? When
| I wrote my thesis in TeX, the entire installation would fit
| in some 30 megabytes or so. It was actually right in the
| uncomfortable middle: far too big to carry around on a set
| of diskettes, but a CD would be a waste of space.
| mlyle wrote:
| You can fit a decent TeX distribution in <100MB.
|
| But if you want to have every macro package that everyone
| everywhere likes, you're going to use some space.
| Turing_Machine wrote:
| I've often wondered why there isn't a dependency system
| for TeX that lets you get only the packages you need...
| feed it a document and tell it to automatically download
| and install any missing packages.
|
| There may be some technical reason why this isn't
| practical. Anyone know, offhand?
| mickmcq wrote:
| There is a TeX distribution that does what you say,
| TinyTeX. For some reason it is obscure and only really
| used by R and R Studio users. That may be because it is
| used to render R Markdown documents to pdf.
| kccqzy wrote:
| Totally practical and supported. See MikTeX:
| https://miktex.org/kb/just-enough-tex
|
| But the reason I don't use it is because I don't always
| have Internet when I need to write my document. I'd
| rather download all the packages and their documentation
| beforehand.
| rdlw wrote:
| MikTeX also supports downloading individual packages,
| though I wish there was an option to download, say, the
| most commonly used 1GB or something.
| mananaysiempre wrote:
| TeX distributions are enormous, but the binaries themselves
| are actually not big--the overhead imposed by Knuth's
| obnoxious license, while nonzero (you can't modify the
| original Pascal-in-WEB source, only patch it, so a manual
| source port to a different language is painful enough that
| nobody tried, we're all using an automatic Pascal-to-C
| translation with lipstick on it), is not huge, and aside
| from a smattering of utilities that's it for the binary
| part.
|
| It's just that the distros also include oodles of (plain
| text!) macro packages for everything under the sun. There
| are some legitimately large things such as fonts, but
| generally speaking a full TeXLive or MiKTeX distribution is
| bloat by ten thousand 100-kilobyte files, like a Python
| distribution with the whole of PyPI included.
|
| If you know what you want, you can probably fit a
| comprehensive LaTeX workbench in under 50M, but it takes an
| inordinate amount of time.
| jfk13 wrote:
| Sounds like he was looking at downloading a complete TeX Live
| distribution; XeTeX itself isn't anything like that size (by a
| couple orders of magnitude, at least).
| ducktective wrote:
| I think distros package it like TeX-full, TeX-minimal,
| etc... The one with documentation files is a couple of GiB
| on Ubuntu...
|
| I wonder what distro or editor he is using...
| naves wrote:
| Two years ago, he used macOS on a 13" MacBook Air and an
| iMac, as per his conversation with Lex Fridman:
| https://youtu.be/O9upVbGSBFo?t=2523
| zimpenfish wrote:
| MacTeX is 4.7GB which matches the 5GB he's talking about
| and...
|
| "MacTeX installs TeX Live, which contains TeX, LaTeX, AMS-
| TeX, and virtually every TeX-related style file and font.
| [...] MacTeX also installs the GUI programs TeXShop, LaTeXiT,
| TeX Live Utility, and BibDesk. MacTeX installs Ghostscript,
| an open source version of Postscript."
|
| Which is, as you say, considerably more than just "XeTeX".
|
| (Also those are universal binaries containing both Intel and
| ARM versions which probably adds some heft.)
| JadeNB wrote:
| > (Also those are universal binaries containing both Intel
| and ARM versions which probably adds some heft.)
|
| Heh, I remember when "universal binaries" meant "PowerPC
| and Intel". Different universes ....
| jhbadger wrote:
| In the early/mid 1990s there were even "fat binaries"
| that had 68000 (the original Mac platform) and PowerPC
| binaries back when PowerPC was the new thing.
| maxnoe wrote:
| The problem is that TeXLive still defaults to doing a full
| install.
|
| A full install means installing ~4000 packages, including their
| source files (tens of thousands of tex files), built
| documentation (thousands of PDF files), and hundreds of free
| fonts (OTFs, TTFs, TeX's own format).
|
| This is _huge_ (>7 GB, not just the 5 GB claimed here).
|
| However, you don't need 99% of this for any given document.
|
| Not installing the source files and documentation PDFs will
| alone reduce the size by roughly half.
|
| Only installing the packages you really need from a minimal
| installation gives you a few hundred megabytes at most for even
| complex documents.
|
| It's a bit annoying to get the list of packages needed though,
| since there is not really any working dependency management.
|
| I wrote a Python wrapper around the TeX Live installer [1] to
| make this easy for CI jobs; see e.g. [2].
|
| On a side note: I'd recommend LuaTeX over XeTeX.
|
| - [1] https://github.com/maxnoe/texlive-batch-installation/
|
| - [2] https://github.com/pep-dortmund/toolbox-
| workshop/blob/8b00f0...
| jcelerier wrote:
| On archlinux there's the texlive-core package which does not
| ship the PDF docs (most of the size). It should install 500 MB
| (most of which is fonts...) and already provide enough to
| build normal documents, including lualatex for Unicode
| support.
| JadeNB wrote:
| TeXLive also comes with installation schemes that will give
| you (if I remember the names correctly) bare, medium, and
| full installations, if you prefer not to pick packages
| yourself. Alternately, although I don't use it myself, I'm
| sure you could use MikTeX, which is much better about on-
| demand package installation. (Or even Overleaf, if you don't
| want to put anything on your local device!)
| cfiggers wrote:
| Watching this interview inspired me to start playing around
| with groff. It has a very steep learning curve... And being as
| old/niche as it is, I've found it very hard to find any active
| community to get newbie questions answered. If anybody knows
| where I could find that sort of thing, I'd be very grateful.
| samatman wrote:
| > _the absolute state of the tooling of the modern world_
|
| Hah, TeX Live is... not that.
|
| It's been enormous since I installed it off a CD in the 90s.
| The idea, and it works, is that you can just compile anyone's
| stuff out of the TeX ecosystem.
|
| There is just... a lot... in it. You don't need a package
| manager if you install the whole universe locally. Like I said:
| not what _I_ would call a modern approach to tooling.
|
| On the other hand, I have latex files from the mid-Noughties
| and, I don't even need to check: they'll compile if I want them
| to.
|
| But yeah, if you want just a little piece of TeX here and
| there, you're off the beaten track. That's not how TUG rolls.
| rdlw wrote:
| TeX Live can also be configured to install the bare minimum
| TeX ecosystem (or just TeX+LaTeX), which only takes a few
| minutes to download and install but results in hunting down
| dependencies and manually installing them whenever you want
| to use a new package.
|
| It also seems quite slow to update, and a recent (?) name
| change of `tools' to `latex-tools' seems to have broken
| multicol, which drove me to MikTeX. Internet connection
| required, but far less headache.
| stjohnswarts wrote:
| He could master it in a week if he put his mind to it. I don't
| have one doubt of that. He just doesn't really need to.
| svnpenn wrote:
| Looks like it's not really done yet:
|
| https://github.com/onetrueawk/awk/compare/master...unicode-s...
| arduinomancer wrote:
| For context, this is 37 years after it was released (1985).
| YesThatTom2 wrote:
| Of course he did. Aho has better things to do and Weinberger is
| too rich to write code any more.
| [deleted]
| ducktective wrote:
| [off-topic]
|
| Following the spirit of UNIX, I did a little analysis on the
| upvotes this post got over time (fish-shell):
|     while true
|         curl -sL 'https://news.ycombinator.com/item?id=32534173' \
|             | pup '#score_32534173 text{}' \
|             | awk -F'[^0-9]*' '{print $1}' \
|             | tee -a points
|         sleep 15s
|     end
|
| (Initially I used `grep -Po '\d+'` but switched to an awk
| solution due to... context!)
|
| I started it approx. when I posted it. Now ~2 hours have passed
| since. Using `gnuplot`:
|     f(x) = a*x + b
|     fit f(x) "points" via a,b
|     set terminal png size 1920,1080 enhanced font "Inconsolata,20"
|     set output "HN-analysis.png"
|     set grid
|     set ylabel "points"
|     set key bottom right
|     set xlabel "sample # (15s interval)"
|     plot 'points' w linesp lt 7 lw 3 lc rgb "orange", \
|          f(x) lc rgb 'blue' lw 2
|
| We generate the plot: https://i.imgur.com/pS6AaI5.png
|
| (The jump at sample #100 is due to a network error on my side.)
|
| And here are the coefficients of a linear fit over the data
| (note that every 4 samples is 1 minute, so this post got ~1.52
| upvotes per minute):
|     a = 0.380809, b = 19.8437
| [deleted]
| etaioinshrdlu wrote:
| I liked this interview of Brian by Lex Fridman:
| https://www.youtube.com/watch?v=O9upVbGSBFo
| neilpanchal wrote:
| +1. Also, I like your username.
| pid_0 wrote:
| timakro wrote:
| I believe no distro actually ships this version of awk by
| default. They ship GNU awk, which has Unicode support anyway.
| svnpenn wrote:
| Debian:
|
| https://distrowatch.com/table.php?distribution=debian&pkglis...
| timakro wrote:
| So it turns out the default on Debian is mawk, which does NOT
| support Unicode. Thanks for pointing that out. This simple
| test gives different results for gawk and mawk:
|     $ echo 'ö' | awk '{print length}'
| layer8 wrote:
| ...only if the current locale is set to use UTF-8 (or some
| other variable-width encoding). Which nowadays the default
| locale usually does, but in principle it doesn't need to
| be.
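|
| A minimal C sketch of the same locale dependence (this uses
| standard C's multibyte API, not anything from awk itself):
|
|     #include <locale.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|
|     int main(void) {
|         setlocale(LC_ALL, "");       /* adopt environment locale */
|         const char *s = "\xC3\xB6";  /* "ö": two bytes in UTF-8 */
|         size_t chars = mbstowcs(NULL, s, 0);  /* count characters */
|         if (chars == (size_t)-1)
|             printf("invalid in this locale (%zu bytes)\n",
|                    strlen(s));
|         else
|             printf("%zu char(s) in %zu byte(s)\n",
|                    chars, strlen(s));
|         return 0;
|     }
|
| Under a UTF-8 locale this reports 1 character in 2 bytes; under
| LC_ALL=C the bytes typically aren't even a valid character.
| That's the same split the gawk/mawk length test falls on.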
| chasil wrote:
| OpenBSD uses "The One True AWK."
|     $ awk -V
|     awk version 20211208
|
| Kernighan's version is likely used in other places where the
| GPL is eschewed.
| fanf2 wrote:
| I think the other BSDs do too, including macOS.
| lelandfe wrote:
| > _Once I figure out how... I will try to submit a pull request.
| I wish I understood git better, but in spite of your help, I
| still don't have a proper understanding, so this may take a
| while._
|
| Even Kernighan struggles with git.
| brudgers wrote:
| Torvalds is a better programmer than that.
|
| Pull requests are a feature of GitHub, not a part of git.
|
| https://docs.github.com/en/pull-requests/collaborating-with-...
| Blikkentrekker wrote:
| The culture around PRs is truly a high barrier to entry for
| many people.
|
| Figuring out how all of this works is, I find, substantially
| more difficult in practice than fixing many longstanding
| trivial bugs in a great deal of software.
| umanwizard wrote:
| What's the alternative? The old way (which is still used by
| many projects) is to send patches to mailing lists, which I
| find more difficult: you need to learn how to generate the
| patch from your source code repo, send the patch as an e-mail
| (needing weird hacks like `git imap-send`), and then
| configure your MUA not to mangle it somehow. Then you also
| don't have a centralized search/tracking interface.
|
| Some good reasons not to use GitHub are that you're familiar
| with standard/traditional tools, or that you prefer not to use
| centralized services. Both of those are fine reasons! But "the
| traditional way is easier" isn't.
| [deleted]
| mordechai9000 wrote:
| This reminds me of the relevant xkcd: https://xkcd.com/1597/
| [deleted]
| stakkur wrote:
| It's somewhat comforting to hear even Brian K. say he doesn't
| understand Git well.
| rustqt6 wrote:
| Git really is a mess. The fact that commits and not diffs have
| hashes should be lampooned, despite arguably a few small
| benefits. Geniuses make mistakes too, and git is Linus's. The
| only reason git is respected is because it came from Linus. If
| it were from Microsoft it would get all the criticism it
| deserves, and then 20 times more.
| layer8 wrote:
| I came to the comments to say that it's reassuring. :)
| cafard wrote:
| Likewise.
| tialaramex wrote:
| The choice to use UTF-32 (i.e. Unicode code points as integers,
| which might as well be 32-bit since your CPU definitely doesn't
| have a suitably sized integer type) is unexpected, as I had seen
| so many other systems just choose to work entirely in UTF-8 for
| this problem.
|
| Now, Brian obviously has much better instincts about performance
| than I do and may even have tried some things and benchmarked
| them, but my guess would have been that you should stay in UTF-8
| because it's always faster for the typical cases.
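|
| To illustrate the guess, a hypothetical C helper (not awk's
| actual code): in the common all-ASCII case, UTF-8 needs no
| decoding at all, since every byte below 0x80 is already a
| complete code point:
|
|     #include <stddef.h>
|
|     /* Length of the leading pure-ASCII run of a UTF-8 buffer. */
|     size_t ascii_prefix_len(const unsigned char *s, size_t n) {
|         size_t i = 0;
|         while (i < n && s[i] < 0x80)
|             i++;
|         return i;
|     }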
| bombcar wrote:
| Is UTF-32 fixed size per char? Because then it allows simple
| math that you can't do on UTF-8.
| moomin wrote:
| A "character" can be of fairly arbitrary length in Unicode,
| so no.
| fooster wrote:
| Not to be contradictory, but Unicode is not a specific
| encoding. UTF-8 is an encoding (with a non-fixed length)
| and UTF-32 is an encoding of a Unicode code point with a
| fixed length.
| valleyer wrote:
| It's a fixed size per codepoint. Many clusters that appear
| atomic in a text editor are made up of multiple codepoints.
| The flag emojis are among the many examples.
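|
| A small C11 illustration (assuming a compiler with char32_t
| string literals): even in fixed-width UTF-32, one visible flag
| is two code points:
|
|     #include <stdio.h>
|     #include <uchar.h>
|
|     int main(void) {
|         /* U+1F1EB U+1F1F7: regional indicators F and R,
|          * rendered as a single French flag glyph. */
|         char32_t flag[] = U"\U0001F1EB\U0001F1F7";
|         printf("%zu code points\n",
|                sizeof flag / sizeof flag[0] - 1);  /* prints 2 */
|         return 0;
|     }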
| simias wrote:
| It's always a tradeoff: some operations are simpler on
| UTF-32, but it has additional memory (and therefore cache)
| footprint, and since you typically don't want to use UTF-32
| externally, you have to convert back and forth, which is not
| free.
|
| I think these days people don't bother with UTF-32 too much
| because it's not even like you have a clean "one 32bit int,
| one character" relation anyway since some characters can be
| built from multiple codepoints. Since most code manipulating
| character strings is generally interested in characters and
| not codepoints, UTF-32 is effectively a variable-length
| encoding too...
| layer8 wrote:
| Another factor is that nowadays machine code execution is
| much faster than memory accesses, so the trade-off of
| requiring more program logic to process a more compact
| format makes a lot of sense.
| tialaramex wrote:
| Right, somebody else might have actual metrics but I'd have
| guessed actual regular expression patterns are split
| something like:
|
| 90% Only care about ASCII, thus individual bytes in UTF-8,
| and so UTF-32 just wastes memory
|
| 1% Care about individual code points, but spread over
| multiple bytes (e.g. the double dagger ‡), UTF-32 is
| perfect
|
| 9% Care about multiple code points (to form e.g. a Flag, or
| é written in combining form, or two women kissing) and so
| UTF-32 doesn't really help again
| happytoexplain wrote:
| It is fixed size per code point, which are what developers
| (and programming languages) sometimes casually call a
| character, but in practice a character is a grapheme, which
| can be multiple code points once you're outside the ASCII
| range. But it can still be useful to count code points, which
| would be faster in UTF-32.
|
| Edit: Mixed up code units and code points.
| thayne wrote:
| And even then, in some languages at least, what constitutes
| a grapheme isn't always well defined.
| happytoexplain wrote:
| True - I was thinking of Unicode's definition
| ("[extended] grapheme clusters").
| a1369209993 wrote:
| > in some languages at least, what constitutes a grapheme
| isn't always well defined.
|
| Can you provide some examples? People _say_ this a lot,
| but the cases I've been able to find tend to be things
| like U+01F1 LATIN CAPITAL LETTER DZ, which is only not
| well defined in the sense that Unicode defines it wrong
| (as one character rather than two) presumably-on-purpose,
| for compatibility with one or more older character
| encodings.
| happytoexplain wrote:
| Is DZ "wrong" because it's not considered a digraph by
| professionals, or because people don't agree that
| digraphs should be considered single characters?
| a1369209993 wrote:
| "DZ" isn't 'wrong', it's a perfectly valid two-character
| string consisting of "D" followed by "Z". Assigning to a
| multi-character string a encoded representation that
| isn't the concatenation of representations of each
| character _in_ the string (especially while insisting
| that that makes it a distinct character in its own right)
| is what 's wrong.
| masklinn wrote:
| > Because then it allows simple math that you can't do on
| UTF-8.
|
| That's not actually useful, because Unicode itself is a
| variable-length encoding.
|
| So it mostly blows up the size of your data.
|
| Though it might have been selected for implementation
| simplicity and / or backwards compatibility (e.g. same reason
| why Python did it, then had to invent "flexible string
| representation" because strings had become way too big to be
| acceptable).
| moefh wrote:
| UTF-32 is a fixed-length encoding of Unicode[1], so it does
| simplify things a lot for a regex engine.
|
| [1] At least when talking about code points, which is what
| matters for regular expressions (unless you want stuff like
| \X, which is not universally supported).
| dotancohen wrote:
| Unicode is not an encoding, despite MS Notepad calling some
| encoding "Unicode".
| tialaramex wrote:
| Unicode isn't a _storage_ encoding and so yeah, Notepad
| shouldn 't do that. However Unicode does encode
| essentially all extant human writing systems into
| integers called "code points" between zero and 0x10FFFF.
| The Latin "capital A" is 65 for example.
|
| However you'd probably like to store something more
| compact than, say, JSON arrays of integers. So there are
| also a bunch of encodings which turn the integers into
| bytes. These encodings would work for any integers, but
| they make most sense to encode Unicode's code points.
| UTF-8 turns each code point into 1-4 bytes; the pair of
| UTF-16 encodings turns them into one or two "code units"
| of 2 bytes each, either little or big endian. And UTF-32
| just encodes them as 32-bit integers, again either little
| or big endian.
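|
| A sketch of that code-point-to-bytes step in C (a hypothetical
| helper using the standard UTF-8 bit layout; error handling
| kept minimal):
|
|     #include <stdint.h>
|
|     /* Encode one code point (0..0x10FFFF, minus surrogates) as
|      * 1-4 UTF-8 bytes; returns bytes written, 0 if invalid. */
|     int utf8_encode(uint32_t cp, unsigned char out[4]) {
|         if (cp < 0x80) {              /* e.g. "capital A" = 65 */
|             out[0] = (unsigned char)cp;
|             return 1;
|         } else if (cp < 0x800) {
|             out[0] = 0xC0 | (cp >> 6);
|             out[1] = 0x80 | (cp & 0x3F);
|             return 2;
|         } else if (cp >= 0xD800 && cp <= 0xDFFF) {
|             return 0;                 /* surrogates are invalid */
|         } else if (cp < 0x10000) {
|             out[0] = 0xE0 | (cp >> 12);
|             out[1] = 0x80 | ((cp >> 6) & 0x3F);
|             out[2] = 0x80 | (cp & 0x3F);
|             return 3;
|         } else if (cp <= 0x10FFFF) {
|             out[0] = 0xF0 | (cp >> 18);
|             out[1] = 0x80 | ((cp >> 12) & 0x3F);
|             out[2] = 0x80 | ((cp >> 6) & 0x3F);
|             out[3] = 0x80 | (cp & 0x3F);
|             return 4;
|         }
|         return 0;
|     }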
| formerly_proven wrote:
| > Q: What is Unicode?
|
| > A: Unicode is the universal character encoding,
| maintained by the Unicode Consortium. This encoding
| standard provides the basis for processing, storage and
| interchange of text data in any language in all modern
| software and information technology protocols.
|
| https://home.unicode.org/basic-info/faq/
| brewmarche wrote:
| In your quote encoding refers to assigning numbers (code
| points in Unicode parlance) to characters (I am
| simplifying here, I know the definition of character in
| Unicode is not that easy).
|
| It's like a catalogue of scripts. We have to extend it
| when we encounter new scripts that are not catalogued yet
| (or when we create new emojis).
|
| Converting a byte sequence to a Unicode code point
| sequence and vice-versa is called transformation format
| (or more generally an encoding form, but then it might not
| be deterministic) by Unicode (see
| <https://www.unicode.org/faq/utf_bom.html#gen2>). Unicode
| specifies UTF-8, -16 and -32. We do not have to change
| these formats unless the catalogue hits the limits of 32
| bits (not a big problem for UTF-8 but for the other two
| formats). These formats are already able to encode code
| points that are not assigned yet.
|
| And the confusion now is that a lot of people call what
| Unicode calls transformation format (i.e. the byte to
| code point mapping) encoding as well. The term charset is
| also used sometimes.
|
| PS: Note that a goal of Unicode is to be able to
| accommodate legacy encoding/charsets by having a broad
| enough catalogue. This is so that these legacy encodings,
| which may come with their own catalogue, can be mapped to
| the Unicode catalogue. So we have control codes (even
| though not part of any "proper" human script),
| precomposed letters (there is a code point for à although
| it could be represented by a + combining `), things like
| the Greek terminal form of sigma separately encoded,
| although that could be done in font-rendering (like
| generally done for Arabic), and a lot more to aid with
| mapping and roundtrips.
| jll29 wrote:
| Regardless of official terminology, there are two levels:
|
| 1. Map a character to a unique number in a character set
| (in Unicode: called codepoint)
|
| 2. Map a number that represents a character in a
| character set to a bit pattern for storage (transiently
| or persistently, internally or externally). Unicode code
| points can be bit-encoded in various ways: UTF8, UCS2 and
| UCS4/UTF32.
|
| The original code points permit the same character to be
| represented in various ways, which makes equality checks
| non-trivial: for instance a character like "ä" can be
| represented as a single character or alternatively as a
| composition of "a" + umlaut accent (2 characters).
|
| So far, this is all about plain text, so we are not
| talking about font families or character properties
| (bold, italics, underlined) or orientation (super-script,
| sub-script).
|
| Ken Lunde's magnum opus is the standard book on
| representing text in various languages other than
| English, with a focus on Asian languages:
| https://www.oreilly.com/library/view/cjkv-information-
| proces...
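|
| A tiny C demonstration of that equality pitfall, with the two
| byte sequences written out explicitly:
|
|     #include <stdio.h>
|     #include <string.h>
|
|     int main(void) {
|         const char *nfc = "\xC3\xA4";   /* U+00E4, precomposed */
|         const char *nfd = "a\xCC\x88";  /* U+0061 + U+0308 */
|         /* Same rendered character, different bytes, so a naive
|          * comparison fails without normalization first. */
|         printf("equal bytes? %s\n",
|                strcmp(nfc, nfd) == 0 ? "yes" : "no");
|         return 0;
|     }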
| layer8 wrote:
| Unicode uses the term "character encoding form" or
| "character encoding scheme" for what is normally referred
| to or abbreviated as "character encoding" or "charset"
| (see e.g. RFC 8187), and uses "character encoding" or
| "coded character set" for the abstract assignment of
| natural numbers to the abstract characters in a character
| repertoire, which is more usually referred to as just
| "[coded] character set" (cf. also UCS = Unicode Character
| Set). This different use of terminology can cause
| confusion. The GP is correct that Unicode as a whole is
| not what is colloquially meant by "encoding".
| rustqt6 wrote:
| At the rate emojis are being added, in a few decades it won't
| be. Unless Biden mistakes his nuclear briefcase for
| children's toys in his perpetual confusion (although
| thankfully the American msm is keeping people calm by never
| showing his regular gaffes).
| alganet wrote:
| You are right.
|
| Unicode in UTF-8 will have variable char length. Plain ASCII
| will be one byte for each char, but others might have up to 4
| bytes. Anything dealing with it will have to be aware of
| leading bytes.
|
| UTF-32 on the other hand will encode all chars, even plain
| ASCII ones, using 4 bytes.
|
| Take the "length of a string" function, for example. Porting
| that from ASCII to UTF-32 is just dividing the length in
| bytes by 4. For UTF-8, you'd have to iterate over the bytes
| and count only those that start a character, skipping the
| continuation bytes.
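|
| A sketch of that counting loop in C: continuation bytes all
| match the bit pattern 10xxxxxx, so you count only the bytes
| that start a character:
|
|     #include <stddef.h>
|
|     /* Count code points in a NUL-terminated UTF-8 string. */
|     size_t utf8_length(const char *s) {
|         size_t n = 0;
|         for (; *s != '\0'; s++)
|             if (((unsigned char)*s & 0xC0) != 0x80)
|                 n++;
|         return n;
|     }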
| simias wrote:
| He mentions that "The amount of actual change isn't too great,
| so I think this might be ok" so I wonder if part of the
| equation has more to do with avoiding messing with legacy code
| rather than raw performance. If the current code expects all
| codepoints to have a constant-width representation, it may be
| complicated to add UTF-8 into the mix.
|
| A complete guess on my part though, I never looked into AWK's
| source code.
| xonix wrote:
| This sounds reasonable. When the GoAWK creator tried to add
| Unicode support through UTF-8, he discovered that this had
| drastic performance implications if done naively (rendering
| some algorithms O(N^2) instead of O(N)):
| https://github.com/benhoyt/goawk/issues/35. Therefore the
| change was reverted until a more efficient implementation
| can be found.
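|
| A plausible shape of the problem, sketched in C (a
| hypothetical helper, not GoAWK's actual code): UTF-8 has no
| random access, so finding "the character at index i" costs
| O(i), and calling that inside a loop over the string becomes
| O(N^2):
|
|     #include <stddef.h>
|
|     /* Scan from the start to the i-th code point: O(i) per
|      * call, so index-based loops degrade to O(N^2) unless
|      * byte offsets are cached somewhere. */
|     const char *utf8_char_at(const char *s, size_t i) {
|         while (*s != '\0' && i > 0) {
|             s++;
|             if (((unsigned char)*s & 0xC0) != 0x80)
|                 i--;
|         }
|         return s;
|     }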
| fpoling wrote:
| The code only uses UTF-32 in regular expressions, where I
| suppose it was much simpler to adapt the older code. The rest
| uses UTF-8.
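|
| For illustration, the UTF-8-to-UTF-32 step looks roughly like
| this in C (a simplified sketch, not the actual awk source; it
| does not reject overlong or surrogate encodings):
|
|     #include <stdint.h>
|
|     /* Decode one UTF-8 sequence into a UTF-32 code point;
|      * returns bytes consumed (1-4), or 0 on malformed input. */
|     int utf8_decode(const unsigned char *s, uint32_t *cp) {
|         if (s[0] < 0x80) {
|             *cp = s[0];
|             return 1;
|         } else if ((s[0] & 0xE0) == 0xC0 &&
|                    (s[1] & 0xC0) == 0x80) {
|             *cp = (uint32_t)(s[0] & 0x1F) << 6 | (s[1] & 0x3F);
|             return 2;
|         } else if ((s[0] & 0xF0) == 0xE0 &&
|                    (s[1] & 0xC0) == 0x80 &&
|                    (s[2] & 0xC0) == 0x80) {
|             *cp = (uint32_t)(s[0] & 0x0F) << 12 |
|                   (uint32_t)(s[1] & 0x3F) << 6 | (s[2] & 0x3F);
|             return 3;
|         } else if ((s[0] & 0xF8) == 0xF0 &&
|                    (s[1] & 0xC0) == 0x80 &&
|                    (s[2] & 0xC0) == 0x80 &&
|                    (s[3] & 0xC0) == 0x80) {
|             *cp = (uint32_t)(s[0] & 0x07) << 18 |
|                   (uint32_t)(s[1] & 0x3F) << 12 |
|                   (uint32_t)(s[2] & 0x3F) << 6 | (s[3] & 0x3F);
|             return 4;
|         }
|         return 0;  /* stray continuation or truncated sequence */
|     }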
| cyocum wrote:
| Here is Brian Kernighan mentioning the Unicode work in an
| interview: https://www.youtube.com/watch?v=GNyQxXw_oMQ
___________________________________________________________________
(page generated 2022-08-20 23:00 UTC)