[HN Gopher] Coreutils Gotchas (2015)
___________________________________________________________________
Coreutils Gotchas (2015)
Author : tosh
Score : 87 points
Date : 2021-06-18 13:05 UTC (9 hours ago)
(HTM) web link (www.pixelbeat.org)
(TXT) w3m dump (www.pixelbeat.org)
| derefr wrote:
| Is there a way to automatically set dd's ibs= and obs= depending
| on the block and/or stripe sizes of the disks involved? It's
| weird that optimizing a large transfer involves manually looking
| up and then setting parameters that the computer itself already
| knows.
|
| (Though I suppose it only "knows" them through proprietary APIs
| that a program like dd might not want to build into itself.
| Still, that's what wrapper scripts are for, no?)
|
| Speaking of, is this something sendfile(2) thinks about? Trying
| to optimize between time spent reading blocks from the disk and
| time spent writing packets to the NIC?
| [deleted]
| kevincox wrote:
| A lot of use cases of `dd` are better served by `head -c
| $bytes`. `dd` does provide a lot more control if you need it,
| but when you don't, just use head.
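|
| A minimal sketch of that comparison, assuming a `head` that
| supports `-c` (GNU and the BSDs do). The helper names are made up
| for illustration.

```shell
# Hypothetical helpers: both print the first N bytes of a file.
first_bytes_head() {
    # $1 = byte count, $2 = file
    head -c "$1" "$2"
}

first_bytes_dd() {
    # bs=1 makes count= a byte count, at the cost of one
    # read/write pair per byte.
    dd if="$2" bs=1 count="$1" 2>/dev/null
}
```

| Both produce the same bytes; the `head` form is shorter and does
| not need dd's statistics silenced.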
| throwaway09223 wrote:
| The better answer to your question is that you don't /want/ dd
| to use the filesystem block size because this isn't optimal.
|
| Generally speaking, the larger bs= you use, the faster things
| will be as it minimizes the syscall overhead. Don't worry about
| the particulars of the underlying filesystem or block device.
| Just set the buffers as big as you can reasonably afford given
| available memory.
|
| And yes, sendfile or io_uring are designed to avoid this.
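|
| A rough sketch of the "just use a big bs=" advice. The 1 MiB
| figure is an arbitrary assumption for illustration, and it is
| written in bytes because size suffixes differ between dd
| implementations. `copy_big` is a hypothetical name.

```shell
# copy_big SRC DST: copy with a large block size so that each
# read/write syscall moves more data.
copy_big() {
    dd if="$1" of="$2" bs=1048576 2>/dev/null
}
```

| Usage would look like `copy_big disk.img /mnt/backup/disk.img`;
| raise or lower the block size to fit available memory.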
| arendtio wrote:
| This reminds me of the time I tried to replace a region in an
| image I created with dd and forgot to add the `notrunc` option.
| As a result, the image file was truncated: instead of a few
| hundred gigabytes, only a few megabytes were left.
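|
| A small sketch of the safe form, assuming 512-byte blocks and a
| hypothetical helper name. The point is only the `conv=notrunc`
| flag.

```shell
# patch_region IMAGE PATCH OFFSET_BLOCKS
# Overwrite part of IMAGE with PATCH, starting OFFSET_BLOCKS
# 512-byte blocks into the file.
# Without conv=notrunc, dd truncates IMAGE at the end of the
# written region and everything after it is lost.
patch_region() {
    dd if="$2" of="$1" bs=512 seek="$3" conv=notrunc 2>/dev/null
}
```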
| rascul wrote:
| Worth considering that shells might have builtins for things like
| kill, echo, test, time, and others. As an example, bash has its
| own builtin kill command and it isn't documented by the kill(1)
| man page.
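|
| One way to check which version you are getting, sketched with
| `command -v`. The function name and its three labels are invented
| for this example.

```shell
# which_kind NAME: report whether NAME resolves inside the shell
# or to a file on disk.
which_kind() {
    case "$(command -v "$1")" in
        /*) printf '%s\n' external ;;        # a path: separate program
        "") printf '%s\n' missing ;;         # not found at all
        *)  printf '%s\n' shell-internal ;;  # builtin, function, or alias
    esac
}
```

| In bash, `which_kind kill` reports shell-internal, so `help kill`
| rather than kill(1) describes what actually runs.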
| caymanjim wrote:
| Over 30 years of using GNU chmod, and only today did I learn
| about "chmod -R a-x,a+X" to avoid constantly doing two "find |
| chmod" commands.
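|
| A sketch of that one-liner wrapped in a hypothetical helper. It
| assumes GNU chmod, where `X` is evaluated after the `a-x` clause
| has already been applied.

```shell
# normalize_exec DIR: clear execute bits everywhere, then restore
# them on directories only (X adds execute to directories and to
# files that still have an execute bit, which none do after a-x).
normalize_exec() {
    chmod -R a-x,a+X "$1"
}
```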
| fbcpck wrote:
| Somewhat relevant: the default sed that ships on macOS is
| FreeBSD sed.
|
| The basic feature set and syntax are about the same, but the more
| advanced parameters and features have different syntax (or are
| not implemented) when compared to GNU sed.
|
| This caused me quite a headache in the past, figuring out why
| certain scripts work (or break!) on my local machine but not on
| others; I have taken a closer look at all tool versions since
| then.
|
| (The solution is to just use gnu sed with
| https://formulae.brew.sh/formula/gnu-sed)
| asicsp wrote:
| See https://stackoverflow.com/q/24275070 for a list of
| differences between various sed versions. Just seeing the
| length of the answer is intimidating.
|
| Perl is another great option for portability.
| NegativeLatency wrote:
| Another plus of Perl is that it's a little easier to specify
| tabs, newlines, and other escaped characters than in sed, in my
| experience. In Perl I'll usually get it right the first time,
| but in sed I might have to do something dumb like sticking a
| literal newline in a variable or something.
| 1vuio0pswjnm7 wrote:
| "(The solution is to just use gnu sed with
| https://formulae.brew.sh/formula/gnu-sed)"
|
| Problem for me is I use more than just MacOS and Linux. I would
| have to install GNU sed on every OS I use. I make small OS images
| for booting from USB. I use a non-busybox multi-call binary as
| the initial userland to conserve space, part of filesystem
| embedded in the kernel (sort of like initrd on Linux I guess).
| Early in the boot and setup process I use sed in scripts. Thus
| I would have to make sure there was a copy of GNU sed
| installed, preferably added to the multi-call binary. Not sure
| the work is really worth it just for a few convenience
| features. I can do anything I need to do with BSD sed just as
| well as with GNU sed.
|
| Instead, the solution I chose is to write sed scripts for
| NetBSD's sed. These work on Linux, MacOS, FreeBSD, OpenBSD,
| Plan9, etc., old or new. There is one feature it has that GNU
| does not: "-a"; it can be used to do true (but limited) "in-
| place editing" without creating a temp file. Not sure why but
| NetBSD has since changed their sed to be more FreeBSD and GNU-
| compatible, e.g., adding "-i" and "-g/-G", to enable automatic
| temp file creation then removal^1 and GNU regex, respectively.
| However if we start using these "features" everywhere, then our
| scripts may not work with older userlands.
|
| 1. This is often called "in-place" editing, but if one examines
| the source code one will see there is still a temp file
| (tmpfname) created. We are being spared the step of manually
| creating then removing a temp file, but we still need the
| requisite filesystem space available for one.
| tyingq wrote:
| Similar for other tools. "Awk" on MacOS is mawk, which is
| significantly different from the gawk that Linux uses as "awk".
|
| Edit: As mentioned below, awk is mawk for some Linux distros,
| gawk for others, and something else on still others :)
| version_five wrote:
| Which Linux are you referring to? I'm almost certain that
| Ubuntu 16.04 and 18.04 both ship with mawk as the default.
|
| Edit: just checked, and my VPS that runs 20.10 uses gawk.
|
| Edit 2: 18.04 also comes with gawk, at least the install I
| have. I know I have worked with mawk on a Linux machine;
| could it have been Ubuntu 16.04?
| spinax wrote:
| IIRC (no Ubuntu VM handy right now) mawk is the default
| with a minimal LTS install, using alternatives to point awk
| at it. When you apt install gawk it has a higher priority
| in alternatives and flips the awk symlinks over to gawk
| automagically.
| version_five wrote:
| > gawk it has a higher priority in alternatives and flips
| the awk symlinks over to gawk automagically
|
| That must be what happened. I just checked my other
| machine that had a pretty new install of 20.04 on it, and
| it does have mawk. I'm happy to see I'm not crazy.
| shakna wrote:
| Debian & Ubuntu have moved back and forth between using
| mawk or gawk several times.
|
| Debian Jessie (circa 2016) used gawk.
|
| Debian Stretch (circa 2017) used mawk.
|
| Today, Debian uses gawk.
| JNRowe wrote:
| Have you flipped the final one? Both my Buster and
| Bullseye installations only have mawk.
|
| Trying to dig in deeper makes my head explode though.
| mawk is marked required yet with update-alternatives
| priority of 5, and gawk is marked optional but with a
| priority 10. I _guess_ the intention is that mawk is good
| enough for base installs, but you probably want gawk.
| tyingq wrote:
| It's definitely hard to track down. Mawk is often way
| faster than gawk, but gawk has more features (sockets
| even...crazy) and maybe fewer arbitrary limits.
| tyingq wrote:
| Ah...yeah, I was probably overgeneralizing when I said
| that, though it's probably fair to say that gawk is the
| default awk for many popular Linux distros.
| pletnes wrote:
| True, although sometimes mawk is installed, as others
| mention.
|
| If you need gawk, put gawk in your shebang, and if nothing
| else the end user has a chance at understanding why they're
| seeing a crash. Also it will crash on line 1, nice and loud.
| remram wrote:
| The biggest problem with this is using the -i flag ("in place")
| which expects an argument on BSD/MacOS but not on GNU/Linux.
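|
| One way to sidestep the `-i` difference is to skip it entirely. A
| minimal sketch, assuming `mktemp` is available (it is not in
| POSIX); `sed_inplace` is a made-up name.

```shell
# sed_inplace SCRIPT FILE: edit FILE without relying on -i, whose
# argument handling differs between GNU sed (-i[SUFFIX]) and
# BSD/macOS sed (-i SUFFIX).
sed_inplace() {
    _tmp=$(mktemp) || return 1
    # Write to a temp file first; only touch the original if sed
    # succeeded.
    if sed "$1" "$2" > "$_tmp"; then
        cat "$_tmp" > "$2"   # rewrites FILE in place, keeping its inode and mode
        rm -f "$_tmp"
    else
        rm -f "$_tmp"
        return 1
    fi
}
```

| This is not atomic, and like `-i` it still needs space for a
| temporary copy.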
| CraigJPerry wrote:
| There's a bunch of tools that trip me up between the BSD and
| GNU versions.
|
| Find is the big one - so much so that I "cargo install fd-find"
| on my boxes these days and use the Rust version across all
| hosts.
|
| I briefly flirted with using zsh's enhanced glob abilities,
| e.g. find . -type f -size +1m -ls
|
| vs. ls **/*(.Lm+1)
| euske wrote:
| I really want to add this bit:
|
|   cut -d delimiter -f fields
|   sort -t delimiter -k fields
|   awk -F delimiter
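|
| To make the three spellings concrete, here is a small sketch on
| made-up colon-separated input. All function names are
| hypothetical.

```shell
# The same input, handled by three tools with three different
# delimiter flags.
sample() { printf 'b:2\na:1\n'; }

second_with_cut() { sample | cut -d: -f2; }           # cut:  -d delimiter, -f fields
sorted_by_first() { sample | sort -t: -k1,1; }        # sort: -t delimiter, -k key
second_with_awk() { sample | awk -F: '{print $2}'; }  # awk:  -F delimiter
```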
| rwmj wrote:
| Padraig's a nice guy, why not email him!
| ChrisArchitect wrote:
| some previous discussion:
|
| _6 years ago_ https://news.ycombinator.com/item?id=10648255
| asicsp wrote:
| > _cut doesn't work with fields separated by arbitrary
| whitespace. It's often better to use awk_
|
| I was working on a script to provide cut like syntax using awk,
| mainly for awk's default field separation and regexp delimiters.
| It isn't ready, but I just created a repo with what I have so
| far: https://github.com/learnbyexample/regexp-cut
| unhammer wrote:
| >cut doesn't work with fields separated by arbitrary
| whitespace. It's often better to use awk
|
| I use `cut` precisely when I _want_ this behaviour. If
| something actually is tab-separated, I don't want `cut -f3` to
| give me the 4th column just because the value in column 1 was
| the empty string.
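|
| A sketch of that difference on a made-up row whose first column
| is empty. The function names are only for illustration.

```shell
# A tab-separated row with three columns: "", "b", "c".
row() { printf '\tb\tc\n'; }

# cut counts the empty first field, so field 3 is "c".
third_cut() { row | cut -f3; }

# awk's default splitting skips leading whitespace, so the row
# has only two fields and $3 is empty.
third_awk() { row | awk '{print $3}'; }

# With an explicit tab separator awk agrees with cut.
third_awk_tab() { row | awk -F'\t' '{print $3}'; }
```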
| theiasson wrote:
| Nice tool. I just found that if you have fields separated by
| arbitrary whitespaces, piping that through xargs trims multiple
| whitespaces into a single whitespace.
|
| ```$ echo "words   are   far   apart" | xargs```
|
| Although, it does have its caveats.
| https://stackoverflow.com/a/12973694
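|
| For comparison, a sketch of the xargs trick next to `tr -s`,
| which is offered here as an alternative and is not from the
| linked answer. Both helper names are hypothetical.

```shell
# squeeze_xargs: xargs with no command runs echo, which joins the
# words it read with single spaces. It also interprets quotes and
# backslashes, so it is unsafe on arbitrary text.
squeeze_xargs() { xargs; }

# squeeze_tr: collapses runs of spaces only and leaves quotes
# alone; it does not trim leading or trailing spaces.
squeeze_tr() { tr -s ' '; }
```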
| uplime wrote:
| > echo is non portable and its behaviour diverges between systems
| and shell builtins
|
| `echo` is portable:
| https://pubs.opengroup.org/onlinepubs/009695399/utilities/ec...
| Any options passed to `echo` are not. Although I agree `printf`
| should be used instead.
| chrisseaton wrote:
| > Any options passed to `echo` are not
|
| Using common sense, this is probably what they mean by 'non
| portable' isn't it?
| uplime wrote:
| It probably is what they meant. However, what was written is
| that `echo` itself is non-portable, which isn't correct.
| tssva wrote:
| The context of the article is the Gnu coreutils. The
| coreutils version of echo accepts options which is not
| allowed according to the link you referenced. Therefore
| within the context of the article stating echo is non-
| portable is correct.
| uplime wrote:
| GNU coreutils isn't a standard. It conforms to and
| extends the POSIX standard. If it was just talking about
| GNU echo, then that `echo` would work the same on every
| machine. It even explicitly mentions other `echo`
| implementations, like the shell builtins. `echo` itself
| is portable, none of its flags are.
| burntsushi wrote:
| You seem to have a very narrow definition of "portable."
| Can you link to the definition of the jargon term you
| appear to be using here?
|
| Also, have you considered that some words have colloquial
| meanings too, and that it doesn't make sense to label
| them as "not correct"?
| uplime wrote:
| > You seem to have a very narrow definition of
| "portable."
|
| Can you explain how wanting to include more programs into
| the definition of portable is narrow?
|
| > Can you link to the definition of the jargon term you
| appear to be using here?
|
| The same definition I've always seen in every community
| discussing shell-scripting: if it's specified by the
| POSIX (Portable Operating System Interface) standard.
| POSIX does have a definition of `echo`. GNU echo does not
| conform to it by default, but can be configured to.
|
| > Also, have you considered that some words have
| colloquial meanings too, and that it doesn't make sense
| to label them as "not correct"?
|
| But it makes sense to label `echo` itself as not
| portable? The argument can be made that `GNU echo` is not
| portable (even though it can be with an environment
| variable), however that should be specified, instead of
| mentioning there are several implementations where it is
| non-standard.
| burntsushi wrote:
| You sound like a zealot to be honest. You misconstrued my
| post and gave a disingenuous response IMO. And you didn't
| answer my questions.
| kevincox wrote:
| Doesn't that make echo non-portable in practice?
|
| `echo -e` should print "-e\n" according to POSIX but the most
| common implementation will print "". As far as I can tell there
| is no way to work around this.
| uplime wrote:
| It makes relying on meaningful options passed to `echo` non-
| portable. You can always rely on `echo` being on the system
| (thanks to POSIX), which is what makes it portable. Relying
| on the behavior of -e/-n/etc is not.
| kevincox wrote:
| But that is my point. Print the string "-e\n" in a portable
| way. I can't think of a way. For GNU coreutils you would
| need to do `echo - -e` but on a compliant echo you would
| need to do `echo -e`. So maybe it is portable for some
| strings. But it can't be used for all cases. Even something
| as simple as outputting a user entered string can't be done
| because you don't know if it is handled specially.
| uplime wrote:
| You can with the POSIXLY_CORRECT environment variable:
|
| $ /bin/echo --version
| echo (GNU coreutils) 8.30
| Copyright (C) 2018 Free Software Foundation, Inc.
| License GPLv3+: GNU GPL version 3 or later
| <https://gnu.org/licenses/gpl.html>.
| This is free software: you are free to change and redistribute it.
| There is NO WARRANTY, to the extent permitted by law.
|
| Written by Brian Fox and Chet Ramey.
| $ POSIXLY_CORRECT=1 /bin/echo -e
| -e
|
| Which is why I say relying on `echo` is fine, since it's
| standard. Relying on `echo` to use options is not. That's
| also why the author is correct that you should use
| `printf`: if POSIXLY_CORRECT is not set, then you
| can run into non-standard behavior, like you showed.
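|
| A minimal sketch of the `printf` route for the "-e\n" case
| discussed above. `say` is a hypothetical wrapper name.

```shell
# say STRING...: print each argument on its own line, exactly as
# given. The data is passed as an argument to the %s format, so
# printf never treats it as an option or expands escapes in it.
say() {
    printf '%s\n' "$@"
}
```

| With this, `say -e` prints the two characters `-e` followed by a
| newline, whatever echo implementation the system has.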
___________________________________________________________________
(page generated 2021-06-18 23:01 UTC)