[HN Gopher] Coreutils Gotchas (2015)
       ___________________________________________________________________
        
       Coreutils Gotchas (2015)
        
       Author : tosh
       Score  : 87 points
       Date   : 2021-06-18 13:05 UTC (9 hours ago)
        
 (HTM) web link (www.pixelbeat.org)
 (TXT) w3m dump (www.pixelbeat.org)
        
       | derefr wrote:
       | Is there a way to automatically set dd's ibs= and obs= depending
       | on the block and/or stripe sizes of the disks involved? It's
       | weird that optimizing a large transfer involves manually looking
       | up and then setting parameters that the computer itself already
       | knows.
       | 
       | (Though I suppose it only "knows" them through proprietary APIs
       | that a program like dd might not want to build into itself.
       | Still, that's what wrapper scripts are for, no?)
       | 
       | Speaking of, is this something sendfile(2) thinks about? Trying
       | to optimize between time spent reading blocks from the disk and
       | time spent writing packets to the NIC?
        
         | kevincox wrote:
         | A lot of use cases of `dd` are better served by `head -c
         | $bytes`. `dd` does provide a lot more control if you need it,
         | but when you don't, just use `head`.
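For the simple "first N bytes" case, the two spell out roughly as follows. This is a minimal sketch, assuming a `head` that supports `-c` (GNU and the BSDs do); the function names are made up for illustration.

```shell
# Take the first N bytes of stdin, two ways.
# first_bytes_head / first_bytes_dd are illustrative names, not standard tools.
first_bytes_head() {
    head -c "$1"
}

first_bytes_dd() {
    # bs=1 reads one byte per block, so short reads from a pipe
    # cannot cut the result short (slow, but exact).
    dd bs=1 count="$1" 2>/dev/null
}
```

Usage would look like `printf 'hello world' | first_bytes_head 5`.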
        
         | throwaway09223 wrote:
         | The better answer to your question is that you don't /want/ dd
         | to use the filesystem block size because this isn't optimal.
         | 
         | Generally speaking, the larger bs= you use, the faster things
         | will be as it minimizes the syscall overhead. Don't worry about
         | the particulars of the underlying filesystem or block device.
         | Just set the buffers as big as you can reasonably afford given
         | available memory.
         | 
         | And yes, sendfile or io_uring are designed to avoid this.
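In practice that advice amounts to something like the following. A sketch only: the 4 MiB figure is an arbitrary assumption rather than a tuned value, and `copy_big_blocks` is a made-up name.

```shell
# Copy a file with dd using one large block size for both reads and writes.
# $1 = source, $2 = destination
copy_big_blocks() {
    # bs is given in plain bytes because size suffixes (M vs. m)
    # differ between dd implementations.
    dd if="$1" of="$2" bs=$((4 * 1024 * 1024)) 2>/dev/null
}
```

The larger block size means fewer read/write calls for the same amount of data; the result is byte-for-byte the same as with a small `bs=`.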
        
       | arendtio wrote:
       | This reminds me of the time I tried to replace a region in an
       | image I created with dd and forgot to add the `notrunc` option.
       | The image file was truncated, and instead of the few hundred
       | gigabytes only a few megabytes were left.
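The safe form of that operation can be tried on a scratch file. A small sketch with a hypothetical helper name (`patch_region`); without `conv=notrunc`, dd truncates the output file instead of leaving the rest of it alone.

```shell
# Overwrite bytes of an existing file in place.
# $1 = file to patch, $2 = byte offset, stdin = replacement bytes
patch_region() {
    # conv=notrunc keeps everything after the written region;
    # bs=1 makes seek= count in bytes.
    dd of="$1" bs=1 seek="$2" conv=notrunc 2>/dev/null
}
```

For example, `printf 'XY' | patch_region disk.img 4` replaces two bytes at offset 4 and leaves the file size unchanged (`disk.img` is a placeholder name).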
        
       | rascul wrote:
       | Worth considering that shells might have builtins for things like
       | kill, echo, test, time, and others. As an example, bash has its
       | own builtin kill command and it isn't documented by the kill(1)
       | man page.
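One way to see which implementation a name resolves to. A sketch: `type` is specified by POSIX, but the wording of its output varies between shells, so treat it as human-readable only. `which_impl` is a made-up helper name.

```shell
# Report how the current shell resolves each given command name
# (builtin, function, alias, or a path on disk).
which_impl() {
    for name in "$@"; do
        type "$name"
    done
}
```

Running `which_impl kill echo test time` in your own shell shows which of these are builtins there; the builtin's documentation then lives in the shell's manual rather than in the utility's man page.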
        
       | caymanjim wrote:
       | Over 30 years of using GNU chmod, and only today did I learn
       | about "chmod -R a-x,a+X" to avoid constantly doing two "find |
       | chmod" commands.
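What that invocation does, as a sketch. It assumes GNU chmod, where `X` is evaluated after the preceding `a-x` has taken effect; other chmod implementations may evaluate `X` against the original mode and behave differently. `strip_file_exec` is a made-up name.

```shell
# Remove execute permission from regular files under a tree while
# keeping directories searchable.
# $1 = top of the tree
strip_file_exec() {
    # a-x : drop execute/search from everything
    # a+X : add it back only for directories (after a-x, no regular
    #       file has an execute bit left to qualify)
    chmod -R a-x,a+X "$1"
}
```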
        
       | fbcpck wrote:
       | Somewhat relevant: the default sed that ships with macOS is
       | FreeBSD sed.
       | 
       | The basic feature set and syntax are about the same, but the
       | more advanced options and features have different syntax (or
       | are not implemented) compared to GNU sed.
       | 
       | This caused me quite a headache in the past, figuring out why
       | certain scripts worked (or broke!) on my local machine but not
       | on others; I have paid closer attention to tool versions ever
       | since.
       | 
       | (The solution is to just use gnu sed with
       | https://formulae.brew.sh/formula/gnu-sed)
        
         | asicsp wrote:
         | See https://stackoverflow.com/q/24275070 for a list of
         | differences between various sed versions. Just seeing the
         | length of the answer is intimidating.
         | 
         | Perl is another great option for portability.
        
           | NegativeLatency wrote:
           | Another plus of Perl is that, in my experience, it's a little
           | easier to specify tabs, newlines, and other escaped characters
           | than in sed. In Perl I'll usually get it right the first time,
           | but in sed I might have to do something dumb like sticking a
           | literal newline in a variable.
        
         | 1vuio0pswjnm7 wrote:
         | "The solution is to just use gnu sed with
         | https://formulae.brew.sh/formula/gnu-sed)"
         | 
         | Problem for me is I use more than just MacOS and Linux. I would
         | have to install GNU sed on every OS I use. I make small OS images
         | for booting from USB. I use a non-busybox multi-call binary as
         | the initial userland to conserve space, part of filesystem
         | embedded in the kernel (sort of like initrd on Linux I guess).
         | Early in the boot and setup process I use sed in scripts. Thus
         | I would have to make sure there was a copy of GNU sed
         | installed, preferably added to the multi-call binary. Not sure
         | the work is really worth it just for a few convenience
         | features. I can do anything I need to do with BSD sed just as
         | well as with GNU sed.
         | 
         | Instead, the solution I chose is to write sed scripts for
         | NetBSD's sed. These work on Linux, MacOS, FreeBSD, OpenBSD,
         | Plan9, etc., old or new. There is one feature it has that GNU
         | does not: "-a"; it can be used to do true (but limited) "in-
         | place editing" without creating a temp file. Not sure why but
         | NetBSD has since changed their sed to be more FreeBSD and GNU-
         | compatible, e.g., adding "-i" and "-g/-G", to enable automatic
         | temp file creation then removal^1 and GNU regex, respectively.
         | However if we start using these "features" everywhere, then our
         | scripts may not work with older userlands.
         | 
         | 1. This is often called "in-place" editing, but if one examines
         | the source code one will see there is still a temp file
         | (tmpfname) created. We are being spared the step of manually
         | creating then removing a temp file, but we still need the
         | requisite filesystem space available for one.
        
         | tyingq wrote:
         | Similar for other tools. "Awk" on MacOS is mawk, which is
         | significantly different from the gawk that Linux uses as "awk".
         | 
         | Edit: As mentioned below, awk is mawk for some Linux distros,
         | gawk for others, and something else on still others :)
        
           | version_five wrote:
           | Which linux are you referring to? I'm almost certain that
           | Ubuntu 16.04 and 18.04 both ship with mawk as the default.
           | 
           | Edit: just checked and my vps that runs 20.10 uses gawk
           | 
           | Edit 2: 18.04 also comes with gawk, at least the install I
           | have. I know I have worked with mawk on a linux machine;
           | could it have been Ubuntu 16.04?
        
             | spinax wrote:
             | IIRC (no Ubuntu VM handy right now) mawk is the default
             | with a minimal LTS install, using alternatives to point awk
             | at it. When you apt install gawk it has a higher priority
             | in alternatives and flips the awk symlinks over to gawk
             | automagically.
        
               | version_five wrote:
               | > gawk it has a higher priority in alternatives and flips
               | the awk symlinks over to gawk automagically
               | 
               | That must be what happened. I just checked my other
               | machine that had a pretty new install of 20.04 on it, and
               | it does have mawk. I'm happy to see I'm not crazy.
        
             | shakna wrote:
             | Debian & Ubuntu have moved back and forth between using
             | mawk or gawk several times.
             | 
             | Debian Jessie (circa 2016) used gawk.
             | 
             | Debian Stretch (circa 2017) used mawk.
             | 
             | Today, Debian uses gawk.
        
               | JNRowe wrote:
               | Have you flipped the final one? Both my Buster and
               | Bullseye installations only have mawk.
               | 
               | Trying to dig in deeper makes my head explode though.
               | mawk is marked required yet with update-alternatives
               | priority of 5, and gawk is marked optional but with a
               | priority 10. I _guess_ the intention is that mawk is good
               | enough for base installs, but you probably want gawk.
        
               | tyingq wrote:
               | It's definitely hard to track down. Mawk is often way
               | faster than gawk, but gawk has more features (sockets
               | even...crazy) and maybe fewer arbitrary limits.
        
             | tyingq wrote:
             | Ah...yeah, I was probably overgeneralizing when I said
             | that, though it's probably fair to say that gawk is the
             | default awk for many popular Linux distros.
        
           | pletnes wrote:
           | True, although sometimes mawk is installed, as others
           | mention.
           | 
           | If you need gawk, put gawk in your shebang, and if nothing
           | else the end user has a chance at understanding why they're
           | seeing a crash. Also it will crash on line 1, nice and loud.
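For an awk script the shebang would be something like `#!/usr/bin/gawk -f` (the path is an assumption and varies by system). For a shell wrapper, a similar fail-early check might look like this sketch; `need` is a made-up helper name.

```shell
# Fail loudly if a required tool is missing.
# $1 = command name
need() {
    command -v "$1" >/dev/null 2>&1 || {
        printf 'error: %s is required but was not found in PATH\n' "$1" >&2
        return 1
    }
}
```

A script would call `need gawk || exit 1` near the top, so the failure shows up before any work is done.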
        
         | remram wrote:
         | The biggest problem with this is using the -i flag ("in place")
         | which expects an argument on BSD/MacOS but not on GNU/Linux.
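One way to sidestep the incompatibility is to skip `-i` entirely and manage the temp file yourself. A sketch with a made-up helper name (`sed_inplace`), assuming `mktemp` is available:

```shell
# GNU sed:  sed -i 's/a/b/' file
# BSD sed:  sed -i '' 's/a/b/' file
#
# Apply a sed script to a file without relying on -i.
# $1 = sed script, $2 = file
sed_inplace() {
    tmp=$(mktemp) || return 1
    # Write the result to a temp file, then copy it back over the
    # original so the original's permissions are kept.
    sed "$1" "$2" > "$tmp" && cat "$tmp" > "$2"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

For example, `sed_inplace 's/foo/bar/' notes.txt` (file name is a placeholder). Like `-i`, this still needs enough free space for a second copy of the file.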
        
         | CraigJPerry wrote:
         | There's a bunch of tools that trip me up between the BSD and
         | GNU versions.
         | 
         | Find is the big one - so much so that i "cargo install fd-find"
         | on my boxes these days and use the rust version across all
         | hosts.
         | 
         | I briefly flirted with using zsh's enhanced glob abilities,
         | e.g.
         | 
         |     find . -type f -size +1m -ls
         | 
         | vs.
         | 
         |     ls **/*(.Lm+1)
        
       | euske wrote:
       | I really want to add this bit:
       | 
       |     cut -d delimiter -f fields
       |     sort -t delimiter -k fields
       |     awk -F delimiter
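A concrete instance with `:` as the delimiter; the sample data and variable names are invented for illustration.

```shell
data='carol:30
alice:25
bob:35'

# Same delimiter, three different option letters.
cut_out=$(printf '%s\n' "$data" | cut -d : -f 1)                # first field
sort_out=$(printf '%s\n' "$data" | sort -t : -k 2,2n)           # sort by second field, numerically
awk_out=$(printf '%s\n' "$data" | awk -F : '{ print $2 }')      # second field
```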
        
         | rwmj wrote:
         | Padraig's a nice guy, why not email him!
        
       | ChrisArchitect wrote:
       | some previous discussion:
       | 
       |  _6 years ago_ https://news.ycombinator.com/item?id=10648255
        
       | asicsp wrote:
       | > _cut doesn't work with fields separated by arbitrary
       | whitespace. It's often better to use awk_
       | 
       | I was working on a script to provide cut like syntax using awk,
       | mainly for awk's default field separation and regexp delimiters.
       | It isn't ready, but I just created a repo with what I have so
       | far: https://github.com/learnbyexample/regexp-cut
        
         | unhammer wrote:
         | >cut doesn't work with fields separated by arbitrary
         | whitespace. It's often better to use awk
         | 
         | I use `cut` precisely when I _want_ this behaviour. If
         | something actually is tab-separated, I don't want `cut -f3` to
         | give me the 4th column just because the value in column 1 was
         | the empty string.
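The difference can be seen on a single row. A small sketch with invented sample data:

```shell
# A tab-separated row whose first column is empty.
row=$(printf '\tb\tc\td')

# cut counts every tab, so field 3 is "c".
cut_field=$(printf '%s\n' "$row" | cut -f 3)

# awk's default splitting skips leading whitespace and merges runs of it,
# so $3 is "d", which is really the 4th column.
awk_field=$(printf '%s\n' "$row" | awk '{ print $3 }')

# Telling awk the separator is exactly one tab gives the same answer as cut.
awk_tab_field=$(printf '%s\n' "$row" | awk -F '\t' '{ print $3 }')
```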
        
         | theiasson wrote:
         | Nice tool. I just found that if you have fields separated by
         | arbitrary whitespaces, piping that through xargs trims multiple
         | whitespaces into a single whitespace.
         | 
         | ```$ echo "words   are    far   apart" | xargs```
         | 
         | Although, it does have its caveats.
         | https://stackoverflow.com/a/12973694
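As a sketch of the trick (`squeeze_blanks` is a made-up name): with no command given, xargs runs echo on the words it read, so runs of blanks come out as single spaces.

```shell
# Collapse runs of blanks on stdin into single spaces.
# Caveat: xargs also interprets quotes and backslashes in its input,
# so this is only safe for plain words.
squeeze_blanks() {
    xargs
}
```

For example, `printf 'words   are    far   apart\n' | squeeze_blanks`.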
        
       | uplime wrote:
       | > echo is non portable and its behaviour diverges between systems
       | and shell builtins
       | 
       | `echo` is portable:
       | https://pubs.opengroup.org/onlinepubs/009695399/utilities/ec...
       | Any options passed to `echo` are not. Although I agree `printf`
       | should be used instead.
        
         | chrisseaton wrote:
         | > Any options passed to `echo` are not
         | 
         | Using common sense, this is probably what they mean by 'non
         | portable' isn't it?
        
           | uplime wrote:
           | It probably is what they meant. However, what was written is
           | that `echo` itself is non-portable, which isn't correct.
        
             | tssva wrote:
             | The context of the article is the Gnu coreutils. The
             | coreutils version of echo accepts options which is not
             | allowed according to the link you referenced. Therefore
             | within the context of the article stating echo is non-
             | portable is correct.
        
               | uplime wrote:
               | GNU coreutils isn't a standard. It conforms to and
               | extends the POSIX standard. If it was just talking about
               | GNU echo, then that `echo` would work the same on every
               | machine. It even explicitly mentions other `echo`
               | implementations, like the shell builtins. `echo` itself
               | is portable, none of its flags are.
        
               | burntsushi wrote:
               | You seem to have a very narrow definition of "portable."
               | Can you link to the definition of the jargon term you
               | appear to be using here?
               | 
               | Also, have you considered that some words have colloquial
               | meanings too, and that it doesn't make sense to label
               | them as "not correct"?
        
               | uplime wrote:
               | > You seem to have a very narrow definition of
               | "portable."
               | 
               | Can you explain how wanting to include more programs into
               | the definition of portable is narrow?
               | 
               | > Can you link to the definition of the jargon term you
               | appear to be using here?
               | 
               | The same definition I've always seen in every community
               | discussing shell-scripting: if it's specified by the
               | POSIX (Portable Operating System Interface) standard.
               | POSIX does have a definition of `echo`. GNU echo does not
               | conform to it by default, but can be configured to.
               | 
               | > Also, have you considered that some words have
               | colloquial meanings too, and that it doesn't make sense
               | to label them as "not correct"?
               | 
               | But it makes sense to label `echo` itself as not
               | portable? The argument can be made that `GNU echo` is not
               | portable (even though it can be with an environment
               | variable), however that should be specified, instead of
               | mentioning there are several implementations where it is
               | non-standard.
        
               | burntsushi wrote:
               | You sound like a zealot to be honest. You misconstrued my
               | post and gave a disingenuous response IMO. And you didn't
               | answer my questions.
        
         | kevincox wrote:
         | Doesn't that make echo non-portable in practice?
         | 
         | `echo -e` should print "-e\n" according to POSIX but the most
         | common implementation will print "". As far as I can tell there
         | is no way to work around this.
        
           | uplime wrote:
           | It makes relying on meaningful options passed to `echo` non-
           | portable. You can always rely on `echo` being on the system
           | (thanks to POSIX), which is what makes it portable. Relying
           | on the behavior of -e/-n/etc is not.
        
             | kevincox wrote:
             | But that is my point. Print the string "-e\n" in a portable
             | way. I can't think of a way. For GNU coreutils you would
             | need to do `echo - -e` but on a compliant echo you would
             | need to do `echo -e`. So maybe it is portable for some
             | strings. But it can't be used for all cases. Even something
             | as simple as outputting a user entered string can't be done
             | because you don't know if it is handled specially.
        
               | uplime wrote:
               | You can with the POSIXLY_CORRECT environment variable:
               | 
               | $ /bin/echo --version
               | echo (GNU coreutils) 8.30
               | Copyright (C) 2018 Free Software Foundation, Inc.
               | License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
               | This is free software: you are free to change and redistribute it.
               | There is NO WARRANTY, to the extent permitted by law.
               | 
               | Written by Brian Fox and Chet Ramey.
               | $ POSIXLY_CORRECT=1 /bin/echo -e
               | -e
               | 
               | Which is why I say relying on `echo` is fine, since it's
               | standard. Relying on `echo` to use options is not. That's
               | also why the author is correct that you should use
               | `printf`: if POSIXLY_CORRECT is not set, you can run into
               | non-standard behavior, like you showed.
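The `printf` route handles the "-e" case from earlier in the thread without any environment variable. A minimal sketch; `say` is a made-up wrapper name.

```shell
# Print the arguments followed by a newline, with no option parsing
# and no escape processing of the arguments themselves.
say() {
    # The format string is the first argument, so a leading "-e" or "-n"
    # in the data is just data.
    printf '%s\n' "$*"
}
```

With this, `say -e` prints `-e`, and a user-supplied string can be passed as `say "$input"` without worrying about whether it looks like an option.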
        
       ___________________________________________________________________
       (page generated 2021-06-18 23:01 UTC)