[HN Gopher] Why not parse `ls` and what to do instead
       ___________________________________________________________________
        
       Why not parse `ls` and what to do instead
        
       Author : nomilk
       Score  : 121 points
       Date   : 2024-06-23 05:24 UTC (2 days ago)
        
 (HTM) web link (unix.stackexchange.com)
 (TXT) w3m dump (unix.stackexchange.com)
        
       | fellerts wrote:
       | The title omits the final '?' which is important, because the
       | rant and its replies didn't settle the matter.
       | 
       | Shellcheck's page on parsing ls links to the article the author
       | is nitpicking on, but it also links to the answer to "what to do
       | instead": use find(1), unless you really can't.
       | https://mywiki.wooledge.org/BashFAQ/020
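The FAQ's advice boils down to two safe patterns, sketched here (the directory and filenames are illustrative): let the shell glob, or let find(1) hand each path to a command itself.

```shell
#!/bin/sh
# Set up a directory containing an awkward filename.
mkdir -p demo
touch "demo/plain.txt" "demo/with space.txt"

# Fragile: $(ls demo) word-splits "with space.txt" into two items.
# Robust: a glob expands each match to exactly one word.
for f in demo/*; do
    printf 'got: %s\n' "$f"
done

# Robust and recursive: find runs the command once per file,
# with the path passed as a single argument.
find demo -type f -exec printf 'found: %s\n' {} \;

rm -r demo
```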
        
       | noobermin wrote:
       | Posts like these are like the main character threads on twitter
       | where someone says, "men don't do x" or "women aren't like y." It
       | just feels like people outside of you who have no understanding
       | of your context seem intent on making up rules for how you should
       | code things.
       | 
       | Perhaps it would help to translate this into something more like,
       | "what pitfalls do you run into if you parse `ls`" but it's hard
       | to get past the initial language.
        
         | qsort wrote:
         | When we say "don't do X" we mean "the obvious way is wrong". If
         | you have enough knowledge to ignore the advice, you likely are
         | already aware of the problems with the obvious solution.
         | 
         | I'm pretty sure you can come up with scenarios where parsing
         | the output of "ls" is indeed the simplest solution, but that
          | kind of article is supposed to discourage people who _don't_
          | know better from going "oh, I know, I'll just parse the output
          | of ls". As general advice, people should indeed be pointed
          | towards "man find" or "man 3 opendir".
        
       | g15jv2dp wrote:
       | What to do instead: use pwsh to completely obviate all these
       | issues.
        
         | Qwertious wrote:
         | Or, once it's API-stable, use nushell.
        
           | DonHopkins wrote:
           | Python has had an API-stable module for listing directories
           | for decades, you know.
        
         | Aerbil313 wrote:
         | Do you recommend it? I feel like I'd get RSI from pressing
         | shift when using it. https://learn.microsoft.com/en-
         | us/powershell/module/microsof...
        
           | gfv wrote:
           | Powershell is mostly case-insensitive, and most of the core
           | cmdlets have short aliases. Try `Get-Alias` (or `gal`) to
           | learn more.
        
           | shiandow wrote:
           | It's case insensitive for what it's worth. My main problem is
           | trying to figure out which utilities they've bundled into
           | which command.
        
           | g15jv2dp wrote:
           | Yes, I absolutely recommend it. I use it every day.
           | 
           | Commands and flags are case-insensitive.
        
         | DonHopkins wrote:
          | Isn't it ironic that Powershell from Microsoft is so vastly
          | superior to bash, not because it's great or even
         | better than Python, but because bash is such a terribly low bar
         | to beat, that it totally undermines the "Unix Philosophy"?
         | 
         | Who would have thought that little old Microsoft, purveyors of
         | MSDOS CMD.EXE, would have leapfrogged Unix and come out with
         | something so important and fundamental as a shell that was
         | superior to all of Unix's "standard" sh/csh/bash/whatever
         | shells in so many ways, all of which historically used to be
         | and ridiculously still are touted by Unix Supremacists as one
         | of its greatest strengths?
         | 
         | You see, Microsoft is willing to look at the flaws in their own
         | software, and the virtues of their competitors' software, then
         | admit that they made mistakes, and their competitors did
         | something right, and finally fix their own shit, unlike so many
         | fanatical monolinguistic Unix evangelists.
         | 
         | They did the exact same thing to Java and JavaScript, leaving
         | Visual Basic and CMD.EXE behind in the dustbin of history --
         | just like Unix should leave bash behind -- resulting in great
         | cross platform languages like C# and TypeScript.
         | 
         | Edit: that reinforces my point that taking so long to get there
         | is a hell of a lot better than taking MUCH LONGER to NOT get
         | there.
         | 
          | Maybe bash's legacy inertia is a problem, not a virtue. It
          | certainly isn't getting a JSON parser in the foreseeable
          | future. The ironic point is that even Microsoft's PowerShell
          | has much less legacy inertia, and is therefore so much better,
          | after a much shorter amount of time.
        
           | crispyambulance wrote:
            | > Isn't it ironic that Powershell from Microsoft is so
            | vastly superior to bash
           | 
           | I agree that powershell is _now_ better than bash. But it
           | took SO LONG to get there. Moreover, bash has had a 12 year
           | head-start (ok, 30 if you count earlier unix shells). Bash
            | has legacy inertia. Even though you can now supposedly run
            | powershell in linux, I don't know anyone who does. Does
           | anybody?
           | 
           | That said, I think powershell is great for utility-knife uses
           | on windows machines.
        
             | g15jv2dp wrote:
             | > Even though you can now supposedly run powershell in
             | linux, I don't know anyone who does. Does anybody?
             | 
             | I do. I replaced all of the automation scripts on my rpi
             | with pwsh scripts, and I'm not regretting it. Not having to
             | deal with decades of cruft in argument parsing and string
             | handling, learning little DSLs for every command, etc. is
             | so worth it.
        
           | kbolino wrote:
           | At this point, all PowerShell has accomplished is creating a
           | separate ecosystem. The designers set out to make a "better"
           | shell and yet refused to ever learn the things they were
           | allegedly "improving".
           | 
           | Basic features are still lacking from PowerShell that have
           | been in UNIX shells since the very beginning:
           | https://github.com/PowerShell/PowerShell/issues/3316
           | 
           | But hey, that's a fixable problem, right? No, because
           | PowerShell is so suffused with arrogance about its
           | superiority that anything, no matter how simple it was to do
           | in a UNIX shell, has to be cross-examined, re-imagined, and
           | bent over the wheel of PowerShell's superiority, before
           | ultimately getting ignored or rejected anyway.
           | 
           | PowerShell is a language unto itself. It is not a replacement
           | for bash/zsh/etc because nobody who knows the latter well can
           | easily migrate to the former, and that's _by design_.
        
             | g15jv2dp wrote:
             | Some very strong sentiments about a shell...
        
               | kbolino wrote:
               | I want there to be something better than the UNIX shells,
               | at least when it comes to error handling and data
               | parsing. PowerShell was _supposed_ to be that tool, but
               | it seems to have lost sight of that goal somewhere along
               | the way.
        
         | alganet wrote:
         | Alternative shells or higher languages don't solve _all_ the
         | issues.
         | 
         | I won't install a new shell to generate a file list on my CI
         | server. I won't install a new shell on remote machines. Ever.
         | 
         | These structured shells also require commands to be aware of
         | them, either via some plugin that structures their raw I/O
         | output or some convention. They solve _some_ command output
         | structuring but not _all_ the general problem.
         | 
         | So, the answer is good. It promotes the idea that one should be
         | careful when machine parsing output meant for humans.
        
           | g15jv2dp wrote:
           | > I won't install a new shell to generate a file list on my
           | CI server. I won't install a new shell on remote machines.
           | Ever.
           | 
           | Uh... that's on you? Why do you intentionally hinder
           | yourself?
           | 
           | > These structured shells also require commands to be aware
           | of them, either via some plugin that structures their raw I/O
           | output or some convention. They solve _some_ command output
           | structuring but not _all_ the general problem.
           | 
           | Okay. It doesn't solve literally every single problem, that
           | is true. It's still miles ahead. And when interfacing with
           | non-pwsh commands, you just fall back to text parsing/output.
        
             | alganet wrote:
             | > Uh... that's on you? Why do you intentionally hinder
             | yourself?
             | 
              | Hinder myself? An ephemeral cloud machine would not keep my
              | custom shell anyway. By having to install it _every single
              | time I connect_ I just lose precious time.
             | 
             | I want to be familiar with tools that are _already_
             | installed everywhere.
             | 
             | The shell is supposed to be a bottom feeder, lowest common
              | denominator, barely usable tool. That way, it can be built
              | anywhere and get stable real fast. That (unintentional)
             | strategy placed it as a core infrastructural piece...
             | everywhere.
             | 
             | Of course, there's scripting and using it on the terminal.
             | But we're talking about scripting, right? Parsing ls and
             | stuff. I want the fast, lean, simple `dash` to parse my
             | fast, lean simple scripts. pwsh is fine for the terminal
             | leather seats.
        
               | Kwpolska wrote:
               | Ephemeral cloud machines are created from images. Build
               | your own image with the tools you need.
        
         | eikenberry wrote:
         | If you're going to skip using the standard shell that is
         | installed everywhere by default, then you should go ahead and
         | use a full language with easily distributed binaries.
        
       | hawski wrote:
       | I think that when someone uses ls instead of a glob it means they
       | most probably don't understand shell. I don't see any advantage
       | of parsing ls output when glob is available. Shell is finicky
       | enough to not invite more trouble. Same with word splitting, one
       | of the reasons to use shell functions, because then you have "$@"
       | which makes sense and any other way to do it is something I can't
       | comprehend.
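The point about "$@" can be sketched in a few lines: inside a function, quoted "$@" re-expands to the original arguments, one word each, while an unquoted $@ (or $*) re-splits them.

```shell
#!/bin/sh
count_args() { echo "$#"; }

# "$@" forwards every argument as its own word, spaces intact.
forward_quoted()   { count_args "$@"; }
# Unquoted $@ re-splits on whitespace, corrupting the arguments.
forward_unquoted() { count_args $@; }

forward_quoted   one "two words" three   # prints 3
forward_unquoted one "two words" three   # prints 4
```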
       | 
       | Maybe I also don't understand shell, but as it was said before:
       | when in doubt switch to a better defined language. Thank heavens
       | for awk.
        
         | DonHopkins wrote:
         | When you don't want to waste your time and sanity and happiness
         | being in doubt and then throwing away all you've done and
         | switching to a new language in mid stream, just don't even
         | start using a terribly crippled shell scripting language in the
         | first place, also and especially including awk.
         | 
         | The tired old "stick to bash because it's already installed
         | everywhere" argument is just as weak and misleading and
         | pernicious as the "stick to Internet Explorer because it's
         | already installed everywhere" argument.
         | 
         | It's not like it isn't trivial to install Python on any system
         | you'll encounter, unless you're programming an Analytical
         | Engine or Jacquard Loom with punched cards.
        
           | cassianoleal wrote:
           | In most places where I run shell scripts, there is no Python.
           | There could be if I really wanted it but it's generally
           | unnecessary waste.
           | 
           | On top of it, shell is better than Python for many things,
           | not to mention faster.
           | 
           | It's also, as you mentioned, ubiquitous.
           | 
           | In the end, choose the tool that makes more sense. For me, a
           | lot of the time, that's a shell script. Other times it may be
           | Python, or Go, or Ruby, or any of the other tools in the box.
        
             | DonHopkins wrote:
             | A waste of what, disk space? I'd much rather waste a few
             | megabytes of disk space than hours or days of my time,
             | which is much more precious. And what are you doing on
             | those servers, anyway? Installing huge amounts of software,
             | I bet. So install a little more!
             | 
             | For decades, on most Windows computers I run web browsers,
             | there's always Internet Explorer. So do you still always
             | use IE because installing Chrome is "wasteful"? It's a hell
             | of a lot bigger and more wasteful than Python. As I already
             | said, that is a weak and misleading and pernicious
             | argument.
             | 
             | So what exactly is bash better than Python at, besides just
             | starting up, which only matters if you write millions of
             | little bash and awk and sed and find and tr and jq and curl
             | scripts that all call each other, because none of them are
              | powerful or integrated enough to solve the problem on their
              | own?
             | 
             | Bash forces you to represent everything as strings, parsing
             | and serializing and re-parsing them again and again. Even
             | something as simple as manipulating json requires forking
             | off a ridiculous number of processes, and parsing and
             | serializing the JSON again and again, instead of simply
             | keeping and manipulating it as efficient native data
             | structures.
             | 
             | It makes absolutely no sense to choose a tool that you know
             | is going to hit the wall soon, so you have to throw out
             | everything you've done and rewrite it in another language.
             | And you don't seem to realize that when you're duct-taping
             | together all these other half-assed languages with their
             | quirky non-standard incompatible byzantine flourishes of
             | command line parameters and weak antique domain specific
             | languages, like find, awk, sed, jq, curl, etc, you're ping-
             | ponging between many different inadequate half-assed
             | languages, and paying the price for starting up and
             | shutting down each of their interpreters many times over,
             | and serializing and deserializing and escaping and
             | unescaping their command line parameters, stdin, and
             | stdout, which totally blows away bash's quick start-up
             | advantage.
             | 
             | You're arguing for learning and cobbling together a dozen
             | or so different half-assed languages and flimsy tools, none
             | of which you can also use to do general purpose
             | programming, user interfaces, machine learning, web servers
             | and clients, etc.
             | 
             | Why learn the quirks and limitations of all those shitty
             | complex tools, and pay the cognitive price and resource
             | overhead of stringing them all together, when you can
             | simply learn one tool that can do all of that much more
             | efficiently in one process, without any quirks and
             | limitations and duct tape, and is much easier to debug and
             | maintain?
        
               | nolist_policy wrote:
               | Plus jq and curl might not even be installed. And I never
               | got warm with jq, so if I need to parse json from shell I
               | reach for... python. Really.
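That fallback fits in a one-liner; a minimal sketch, assuming python3 is on the PATH (the JSON payload is illustrative):

```shell
#!/bin/sh
json='{"name": "example", "sizes": [1, 2, 3]}'

# Read JSON from stdin, pick out fields with ordinary Python.
echo "$json" | python3 -c '
import json, sys
data = json.load(sys.stdin)
print(data["name"])        # example
print(sum(data["sizes"]))  # 6
'
```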
        
               | radiator wrote:
               | Alternatively, maybe you can get warmer with JMESPath,
               | which has jp as its command line interface
               | https://github.com/jmespath/jp .
               | 
               | The good thing about the JMESPath syntax is that it is
               | the standard one when processing JSON in software like
               | Ansible, Grafana, perhaps some more.
        
               | mywittyname wrote:
               | I'm an avid jq user. There are certainly situations where
               | it's better to use python because it's just more sane and
               | easier to read/write, but jq does a few things extremely
               | well, namely, compressing json, and converting json files
               | consisting of big-ass arrays into line delimited json
               | files.
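Both of those jq tasks are one-liners (a sketch, assuming jq is installed; the input file is illustrative):

```shell
#!/bin/sh
echo '[{"a": 1}, {"a": 2}]' > big.json

# Compress: -c emits compact, whitespace-free JSON.
jq -c . big.json      # [{"a":1},{"a":2}]

# Array to line-delimited JSON: .[] yields one object per line.
jq -c '.[]' big.json  # {"a":1}
                      # {"a":2}

rm big.json
```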
        
               | cassianoleal wrote:
               | > For decades, on most Windows computers I run web
               | browsers, there's always Internet Explorer. As I already
               | said, that is a weak and misleading and pernicious
               | argument.
               | 
               | On its own, I agree. But you glossed over everything else
               | I said, so I'm not going to entertain your weak argument.
               | 
               | You seem to ignore that different users, different use
               | cases, different environments, etc. all need to be taken
               | into account when choosing a tool.
               | 
               | Like I said, for most of my use cases where I use shell
               | scripting, it's the best tool for the job. If you don't
               | believe me, or think you know better about my
               | circumstances than I do, all the power to you.
        
               | sqeaky wrote:
               | > You seem to ignore that different users, different use
               | cases, different environments, etc. all need to be taken
               | into account when choosing a tool.
               | 
               | I have worked on projects that are extremely sensitive to
               | extra dependencies and projects that aren't.
               | 
               | Sometimes I am in an underground bunker and each
               | dependency goes through an 18 month Department of Defense
               | vetting process, and "Just install python" is equivalent
               | to "just don't do the project". Other times I have worked
               | on projects where tech debt was an afterthought because
               | we didn't know if the code would still be around in a
               | week and re-writing was a real option, so bringing in a
               | dependency for a single command was worthwhile if we
               | could solve the problem _now_.
               | 
               | There is appetite for risk, desire for control, need for
               | flexibility, and many other factors just as you stated
               | that DonHopkins is ignoring or unaware of.
        
         | Pesthuf wrote:
         | People new to *nix make the mistake of thinking this stuff is
         | well designed, makes sense and that things work well together.
         | 
         | They learn. We all do.
        
           | larodi wrote:
           | People new to the internet think alike. Still, not a day
           | passes and we are once again reminded how fragile yet amazing
           | this all information theory stuff is.
        
           | lukan wrote:
           | Coincidently I discovered the unix haters handbook today:
           | 
           | https://web.mit.edu/~simsong/www/ugh.pdf
        
             | pikzel wrote:
             | "The Macintosh on which I type this has 64MB: Unix was not
             | designed for the Mac. What kind of challenge is there when
             | you have that much RAM?"
             | 
             | Love it.
        
               | Narishma wrote:
               | I don't understand what they mean in that quote. Neither
               | Unix nor the Mac were designed for that much RAM.
        
               | II2II wrote:
               | Judging from the context, the user interface was fine in
               | the days of limited resources (a 16 kiloword PDP-11 was
               | cited) but then modern computers have the resources for
               | better user interfaces.
               | 
               | They clearly didn't realize that even more modern Unix
               | kernels would require hundreds of megabytes just to boot.
        
             | nsguy wrote:
             | OT ... I worked with Simson briefly ages ago. Smart dude.
             | This book happened later and I've never seen it before.
             | Small world I guess.
        
           | boricj wrote:
           | People new to *nix don't realize that it's a 55 year old
           | design that keeps accumulating cruft.
        
             | hifromwork wrote:
             | Of course, but the same (with a bit lower number of years)
             | can be said about Windows, or HTTP, or the web with its
             | HTML+JS+CSS unholy trinity, or email, or anything old and
             | important really. It's scary how much of our modern
             | infrastructure hinges on hacks made tens of years ago.
        
           | com2kid wrote:
           | One of the original demos showing off PowerShell was well
           | structured output from its version of ls.
           | 
           | That was 17 years ago!
        
         | CJefferson wrote:
         | Sometimes I want all filenames from a subdirectory, without the
         | subdirectory name.
         | 
          | I can do (ignoring parsing issues):
          | 
          |     for name in $(cd subdir; ls); do echo "$name"; done
         | 
          | This isn't easy to do with globbing (as far as I know).
        
           | rascul wrote:
            | One alternative:
            | 
            |     for name in subdir/*; do basename "$name"; done
        
             | Izkata wrote:
              | Also, since subdir is hardcoded, you can reliably type it a
              | second time to chop off however much of the start you want:
              | 
              |     for name in subdir/subsubdir/*; do
              |         echo "${name#subdir/}"  # subsubdir/foo
              |     done
        
               | xolox wrote:
               | Note this string replacement is not anchored (right?)
               | which can end up biting you badly (depending on
               | circumstances of course).
        
               | chuckadams wrote:
               | It's anchored on the left. ${name#subdir/} will turn
               | 'subdir/abc' into 'abc', but will not touch
               | foo/subdir/bar. I don't think bash even has syntax to
               | replace in the middle of an expansion, I always pull out
               | sed for that.
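A quick demonstration of that left-anchoring (the paths are illustrative):

```shell
#!/bin/sh
name='subdir/abc'
echo "${name#subdir/}"    # abc  (prefix stripped)

other='foo/subdir/bar'
echo "${other#subdir/}"   # foo/subdir/bar  (no match at the front, untouched)
```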
        
               | xolox wrote:
               | Thanks for clarifying, I learned something new today!
               | 
               | Edit: It turns out that Bash does substitutions in the
               | middle of strings using the
               | ${string/substring/replacement} and
               | ${string//substring/replacement} syntax, for more details
                | see https://tldp.org/LDP/abs/html/string-manipulation.html
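The two forms differ only in how many matches they replace; a sketch (Bash-specific, not POSIX sh; the path is illustrative):

```shell
#!/bin/bash
path='foo/subdir/bar/subdir/baz'

echo "${path/subdir/X}"    # foo/X/bar/subdir/baz  (first match only)
echo "${path//subdir/X}"   # foo/X/bar/X/baz       (every match)
```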
        
           | silvestrov wrote:
            | I'd really like it if the "find" command made this easier,
            | so that if I write
            | 
            |     find some/dir/here -name '*.gz'
           | 
           | then I could get the filenames without the "some/dir/here"
           | prefix.
           | 
           | It would also be nice if "find" (and "stat") could output the
           | full info for a file in JSON format so I could use "jq" to
           | filter and extract the needed info safely instead of having
              | to split whitespace-separated columns.
        
             | ykonstant wrote:
             | Why would you do this work when stat (and GNU find) can
             | `printf` the exact needed information without any parsing?
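A sketch of that approach (GNU-specific: -printf is not in POSIX find, and --printf is not in BSD stat; the filename is illustrative). A newline-terminated format can still be confused by newlines in filenames, so \0 is the safer terminator for machine consumption:

```shell
#!/bin/sh
printf 'abc' > sample.txt

# Size in bytes (%s) and path (%p), one file per line.
find . -maxdepth 1 -name 'sample.txt' -printf '%s %p\n'   # 3 ./sample.txt

# The same for a single file via GNU stat (%s size, %n name).
stat --printf '%s %n\n' sample.txt                        # 3 sample.txt

rm sample.txt
```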
        
               | silvestrov wrote:
               | If I need filesize and filename then I still need to
               | parse a filename that might contain all kinds of weird
               | ascii control characters or weird unicode.
               | 
               | JSON makes that a lot less fragile.
        
               | ykonstant wrote:
               | I don't get it; I need a concrete example.
        
             | williamcotton wrote:
              | What about:
              | 
              |     find . -name '*.hs' -exec basename {} \;
        
               | rascul wrote:
               | You could get mixed up here because find is recursive by
               | default and basename won't show that files might be in
               | different subdirectories.
        
           | hawski wrote:
            | If you are gonna do a subshell (cd subdir; ls) you can wrap
            | the whole loop:
            | 
            |     (cd subdir
            |     for name in *; do
            |         echo "$name"
            |     done)
            | 
            | But I prefer:
            | 
            |     for name in subdir/*; do
            |         name="${name#*/}"
            |         echo "$name"
            |     done
        
           | chasil wrote:
            | This is really easy to do with a shell pattern.
            | 
            |     $ x=/some/really/long/path/to/my/file.txt
            |     $ echo "${x##*/}"
            |     file.txt
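The four related operators are easiest to see side by side (shortest vs. longest match, stripped from the front vs. the back):

```shell
#!/bin/sh
x=/some/path/to/file.txt

echo "${x##*/}"   # file.txt               (longest */ from front: basename)
echo "${x#*/}"    # some/path/to/file.txt  (shortest */ from front)
echo "${x%/*}"    # /some/path/to          (shortest /* from back: dirname)
echo "${x%%/*}"   #                        (longest /* from back: empty here)
```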
        
         | englishspot wrote:
         | I just use find. it's a little longer but gives me the full
         | paths and is more consistent. also works well if you need to
         | recurse. something like `find . -type f | while read -r
         | filepath; do whatever "${filepath}"; done`
        
           | gnuvince wrote:
           | Is there a reason to prefer `while read; ...;done` over
           | find's -exec or piping into xargs?
        
             | xolox wrote:
             | Both `find -exec` and xargs expect an executable command
             | whereas `while read; ...; done` executes inline shell code.
             | 
             | Of course you can pass `sh -c '...'` (or Bash or $SHELL) to
             | `find -exec` or xargs but then you easily get into quoting
             | hell for anything non-trivial, especially if you need to
             | share state from the parent process to the (grand) child
             | process.
             | 
             | You can actually get `find -exec` and xargs to execute a
             | function defined in the parent shell script (the one that's
             | running the `find -exec` or xargs child process) using
             | `export -f` but to me this feels like a somewhat obscure
             | use case versus just using an inline while loop.
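The `export -f` route mentioned above looks roughly like this (a sketch; bash-specific, and the function and filename are illustrative):

```shell
#!/bin/bash
# Define a function in the parent script and export it so that
# child bash processes (spawned by find) can call it.
shout() { echo "processing: $1"; }
export -f shout

touch sample.txt
# The inner bash -c sees the exported function; "$1" is the path.
find . -maxdepth 1 -name 'sample.txt' -exec bash -c 'shout "$1"' _ {} \;
rm sample.txt
```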
        
             | retrogeek wrote:
             | I will sometimes use the "| while read" syntax with find.
             | One reason for doing so is that the "-exec" option to find
             | uses {} to represent the found path, and it can only be
             | used ONCE. Sometimes I need to use the found path more than
             | once in what I'm executing, and capturing it via a read
             | into a reusable variable is the easiest option for that.
             | I'd say I use "-exec" and "| while read" about equally,
             | actually. And I admittedly almost NEVER use xargs.
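Another way to reuse the found path several times is to hand it to an inline `sh -c` script as a positional parameter, where "$1" can then appear as often as needed (a sketch; the commands and filename are illustrative):

```shell
#!/bin/sh
touch example.txt

# "_" fills $0 of the inner shell; {} becomes its $1.
find . -maxdepth 1 -name 'example.txt' \
    -exec sh -c 'echo "backing up $1"; cp "$1" "$1.bak"' _ {} \;

ls example.txt.bak   # the copy exists
rm example.txt example.txt.bak
```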
        
           | ykonstant wrote:
           | This will fail for files with newlines.
        
             | michaeldh wrote:
             | How common are they?
        
               | anamexis wrote:
               | This whole post is about uncommon things that can break
               | naive file parsing.
        
           | chasil wrote:
           | Wow, you people are really young.
           | 
           | http://www.etalabs.net/sh_tricks.html
        
           | stouset wrote:
           | I love this example, because it highlights how absolutely
           | cursed shell is if you ever want to do anything correctly or
           | robustly.
           | 
           | In your example, newlines and spaces in your filenames will
            | ruin things. Better is:
            | 
            |     find ... -print0 | while read -r -d $'\0'; do ...; done
           | 
           | This works in most cases, but it can still run into problems.
           | Let's say you want to modify a variable inside the loop (this
           | is a toy example, please don't nit that there are easier ways
            | of doing _this specific_ task).
            | 
            |     declare -a list=()
            | 
            |     find ... -print0 | while read -r -d $'\0' filename; do
            |         list+=("$(unknown)")
            |     done
           | 
           | The variable `list` isn't updated at the end of the loop,
           | because the loop is done in a subshell and the subshell
           | doesn't propagate its environment changes back into the outer
           | shell. So we have to avoid the subshell by reading in from
            | process substitution instead.
            | 
            |     declare -a list=()
            | 
            |     while read -r -d $'\0' filename; do
            |         list+=("$(unknown)")
            |     done < <(find ... -print0)
           | 
           | Even this isn't perfect. If the command inside the process
           | substitution exits with an error, that error will be
           | swallowed and your script won't exit even with `set -o
           | errexit` or `shopt -s inherit_errexit` (both of which you
           | should always use). The script will continue on as if the
            | command inside the subshell succeeded, just with no output.
            | What you have to do is read it into a variable first, and
            | then use that variable as standard input.
            | 
            |     files="$(find ... -print0)"
            |     declare -a list=()
            | 
            |     while read -r -d $'\0' filename; do
            |         list+=("$(unknown)")
            |     done <<< "${files}"
           | 
           | I think there's an alternative to this that lets you keep the
           | original pipe version when `shopt -s lastpipe` is set, but I
           | couldn't get it to work with a little experimentation.
           | 
           | Also be aware that in all of these, standard input inside the
           | loop is redirected. So if you want to prompt a user for
           | input, you need to _explicitly_ read from ` /dev/tty`.
           | 
           | My point with all this isn't that you should use the above
           | example every single time, but that all of the (mis)features
           | of shell compose _extremely_ badly. Even piping to a loop
           | causes weird changes in the environment that you now have to
            | work around with other approaches. I wouldn't be surprised
           | if there's something still terribly broken about that last
           | example.
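For reference, the `lastpipe` variant alluded to above can be made to work; the catch is that it only takes effect when job control is off (non-interactive scripts, or after `set +m`), which is easy to miss when experimenting interactively. A minimal sketch, with `printf` standing in for `find`:

```shell
#!/usr/bin/env bash
# Sketch of the 'shopt -s lastpipe' variant: the final pipeline stage
# runs in the current shell, so the array assignments survive the loop.
# Requires job control to be off (scripts, or 'set +m' interactively).
set +m
shopt -s lastpipe

declare -a list=()
printf '%s\0' 'a' 'b' 'c d' | while read -r -d '' filename; do
  list+=("$filename")
done

printf '%s\n' "${list[@]}"
```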
        
         | dotancohen wrote:
         | > I think that when someone uses ls instead of a glob it means
         | they most probably don't understand shell.
         | 
         | In 25 years of using Bash, I've picked up the knowledge that I
         | shouldn't parse the output of ls. I suppose that it has
         | something to do with spaces, newlines, and non-printing
         | characters in file names. I really don't know.
         | 
         | But I do know that when I'm scripting, I'm generally wrapping
         | what I do by hand, in a file. I'm codifying my decisions with
         | ifs and such, but I'm using the same tools that I use by hand.
         | And ls is the only tool that I use to list files by hand - so I
         | find it natural that people would (naively) pick ls as the tool
         | to do that in scripts.
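To make the usual objection concrete, here is a minimal sketch of the failure mode (scratch directory and filename are made up): a newline in a filename makes parsed `ls` output count two files where the glob correctly sees one.

```shell
#!/usr/bin/env bash
# One file whose name contains an embedded newline:
cd "$(mktemp -d)"
touch "$(printf 'bad\nname.txt')"

ls | wc -l          # reports 2 "files": ls output is newline-separated
files=(*)           # the glob yields one element per file, any name
echo "${#files[@]}" # reports 1
```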
        
           | notnmeyer wrote:
           | exactly--well said
        
         | PaulHoule wrote:
         | I went through a phase when I really enjoyed writing shell
          | scripts like
          |
          |     ls *.jpg \
          |       | awk '{print "resize 200x200 " $1 " thumbnails/" $1}' \
          |       | bash
         | 
         | because I never got to the point where I could remember the
         | strange punctuation that the shell requires for loops without
         | looking up the info pages for bash whereas I've thoroughly
         | internalized awk syntax.
         | 
         | Word is you should never write something like that because
         | you'll never get the escaping right and somebody could craft
         | inputs that would cause arbitrary code execution. I mean, they
         | try to scare you into using xargs, but I find xargs so foreign
         | I have to read the whole man page every time I want to do
         | something with it.
        
           | hifromwork wrote:
           | I encourage you to give it a try again. Almost every use of
           | xargs that I ever did looked like this:
           | 
           | ls *.jpg | xargs -i,, resize 200x200 ,, thumbnails/,,
           | 
           | I just always define the placeholder to ,, (you can pick
           | something else but ,, is nice and unique) and write commands
           | like you do.
        
             | kstrauser wrote:
              | I'm more likely to write that like:
              |
              |     for i in *.jpg; resize 200x200 "$i" "thumbnails/$i"; end
        
               | TylerE wrote:
               | Does that not fail when you hit the maximum command line
               | length? Doesn't the entirety of the directory get
               | splatted? Isn't this the whole reason xargs exists?
        
               | PaulHoule wrote:
               | Exactly, there are so many limits in the shell that I
               | don't want to be bothered to think about. When I get
               | serious I just write Python.
        
               | genrilz wrote:
               | The for loop only runs resize once per file. So no, the
               | entire directory does not get splatted. It is unlikely
               | you'd hit maximum command length.
               | 
               | At least on mac, the max command length is 1048576 bytes,
               | while the maximum path length in the home directory is
               | 1024 bytes. There might be some unix variant where the
               | max path length is close enough to the max command length
               | to cause an overflow, but I doubt that is the case for
               | common ones.
               | 
                | xargs exists to build command lines from the output of
                | other commands. You could for instance have awk emit
                | xargs-formatted file names to build up a single command
               | invocation from arbitrary records read by awk. Note that
               | xargs still has to obey the command line length limit
               | though, because the command line needs to get passed to
               | the program. Thus, in a situation where this for loop
               | overflows the command line, it would cause xargs to also
               | fail. Thus I would always use globbing if I have the
               | choice.
               | 
               | EDIT: If you mean that the directory is splatted in the
               | for loop, then in a theoretical sense it is. However,
               | since "for" is a shell builtin, it does not have to care
               | about command line length limits to my knowledge.
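The limits in question are inspectable, and xargs' whole job is to split its input across as many invocations as needed to stay under them. A quick demonstration (GNU userland assumed for `seq` and xargs behavior; `getconf ARG_MAX` itself is POSIX):

```shell
#!/usr/bin/env bash
getconf ARG_MAX   # the OS limit on total exec() argument size

# Feed xargs more bytes than fit in one command line: it transparently
# runs echo several times, giving several lines of output.
seq 1 1000000 | xargs echo | wc -l
```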
        
               | TylerE wrote:
               | Yes, this is an issue, absolutely.
               | 
               | I've seen some image directories with more than a million
               | _files_ in them.
        
               | genrilz wrote:
               | This shouldn't overrun the command line length for
               | resize, since resize only gets fed one filename at a
               | time. I do think that the for loop would need to hold all
               | the filenames in a naive shell implementation. (I would
               | assume most shells are naive in this respect) The for
               | loop's length limit is probably the amount of ram
               | available though. I find it improbable that one could
               | overflow ram with purely pathnames on a PC, since a
               | million files times 100 chars per file is still less than
               | a gig of ram. If that was an issue though, one would
               | indeed have to use "find" with "-exec" instead to make
               | sure that one was never holding all file names in memory
               | at the same time.
        
               | lelandbatey wrote:
               | No, it does not fail. Maximum command line length exists
               | in the operating system, not the shell; you can't launch
                | a program with too many argv entries, and you can't
                | launch a program whose arguments are too long in total.
               | 
               | But when you execute a for loop in bash/sh, the 'for'
               | command is not a program that is launched; it's a keyword
               | that's interpreted, and the glob is also interpreted.
               | 
                | Thus, no, that does not fail at the maximum command
                | line length (POSIX only guarantees 4096 bytes; modern
                | Linux allows around 2 MiB). It'll fail at other
                | limits, but those limits exist in bash and
               | are much larger. If you want to move to a stream-
               | processing approach to avoid any limits, then that is
               | possible, while probably also being a sign you should not
               | use the shell.
        
           | projektfu wrote:
           | Better is something like                 find . -maxdepth 1
           | -name "*.jpg" -exec resize 200x200 "{}" "thumbnails/{}" \;
           | 
           | which works for spaces and probably quotes in filenames I am
           | not sure about other special characters.
        
             | mjevans wrote:
              | It's tough to be portable and have a one-liner. See
             | https://stackoverflow.com/questions/45181115/portable-way-
             | to...
             | 
             | I switched the command to a graphics magick based resize
             | since that's the tool these days, default quality is 75%
             | (for JPEG), but is included as a commonly desired
             | customization. ,, is from a different comment in this
             | thread; it seems better self-documenting than the single ,
              | I'd traditionally use.
              |
              |     find . -maxdepth 1 -name "*.jpg" -print0 \
              |       | xargs -0 -P "$(nproc --all)" -I,, \
              |           gm convert ,, -resize '200x200^>' -quality 75 "thumbnails/,,"
        
         | chrsig wrote:
         | Commands can have a maximum number of arguments. Try globbing
         | on a directory with millions of files.
        
           | anthk wrote:
           | Sane people will just use find and/or xargs.
        
         | stephenr wrote:
         | One advantage: `ls -i` gives you the file's inode in a POSIX
         | portable way. If you glob and then look it up individually for
         | each file, you'll need to be aware of which tool (and whether
         | it's GNU or BSD in origin) you use on which platform.
         | 
         | In _general_ yes globbing is better for iterating through
          | files. But parsing `ls` doesn't necessarily mean the author
         | doesn't know shell. It might mean they know it well enough to
         | use the tools that are made available to them.
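A sketch of that point: `ls -di` works identically everywhere, whereas the `stat` equivalent is `stat -c %i` on GNU and `stat -f %i` on BSD/macOS. (Still only safe for paths you control, like `mktemp` output, which contains no whitespace.)

```shell
#!/bin/sh
# POSIX-portable inode lookup: 'ls -di FILE' prints "<inode> <name>".
f=$(mktemp)
set -- $(ls -di "$f")   # word-split: $1 is the inode number
echo "inode: $1"
```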
        
       | amelius wrote:
       | I feel like Unix utilities should provide a standardized way to
       | generate machine-readable output, perhaps using JSON.
        
         | db48x wrote:
         | The same information is already available in a machine-readable
         | format. Just call readdir. You don't need to run ls, have ls
         | call readdir and convert the output into JSON, and then finally
         | parse the JSON back into a data structure. You can just call
         | readdir!
        
           | amelius wrote:
           | I know, but it would be so great if __every__ Unix utility
           | just had the same type of output. By the way, ls does more
           | than just readdir.
        
           | masklinn wrote:
           | `find` is also an option, or shell globs.
        
             | db48x wrote:
             | Right, globs are syntactic sugar on top of readdir.
             | Definitely use them when you are in a shell. But in general
             | the solution is to call readdir, or some language facility
             | built directly on top of it. Calling ls and asking it for
             | JSON is the stupid way to do things.
        
               | amelius wrote:
               | Just curious, how would you approach getting output from
               | utilities like "df", "mount" and "parted"?
        
               | solardev wrote:
               | Generally speaking, can't you limit/define the output of
               | those commands and parse them that way? like df
               | --portability or --total or --output
               | 
               | And/or use their return codes to verify that something
               | worked or didn't
               | 
               | Or hope your higher level programming language contains
               | built-ins for file system manipulations
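Concretely, on a GNU/util-linux system (an assumption; none of these flags are POSIX): coreutils `df` can pin exactly which columns it prints, and several util-linux tools already grew `--json`:

```shell
#!/bin/sh
# Stable, chosen-by-you columns from GNU df:
df --output=source,pcent,target | tail -n +2

# util-linux tools with native JSON output:
findmnt --json | head -c 120; echo
lsblk --json 2>/dev/null | head -c 120; echo
```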
        
               | amelius wrote:
               | How is that any easier than just giving a standardized
               | --json flag?
        
               | solardev wrote:
               | It doesn't require trying to organize a small revolution
               | across dozens of GNU tools, many authors, and numerous
               | distros...?
               | 
               | I'd love to see standard JSON output across these tools.
               | I just don't see a realistic way to get that to happen in
               | my lifetime.
               | 
               | Maybe a unified parsing layer is more realistic, like an
               | open source command output to JSON framework that would
               | automatically identify the command variant you're running
               | based on its version and your shell settings, parse the
               | output for you, and format it in a standard JSON schema?
               | Even that would be a huge undertaking though.
               | 
               | There are a lot, LOT of command variants out there. It's
               | one thing to tweak the output to make it parseable for
               | your one-off script on your specific machine. Not so easy
               | to make it reusable across the entire *nix world.
        
               | xolox wrote:
               | With regards to parted, if you only want to query for
               | information, there is "partx" whose output was
               | purposefully designed to be parsed. I have good
               | experiences with it.
        
           | emmelaich wrote:
           | Can you call readdir() from a shell easily?
           | 
           | WRT format, I'd prefer csv.
        
             | db48x wrote:
             | Certainly. Just do `for f in *`. See how easy that is?
        
             | zokier wrote:
              | here is a trivial program to dump dents to stdout,
              | suitable for shell pipelines. example usage:
              | `./getdents64 . | xargs -0 printf "%q\n"`
              |
              |     #define _GNU_SOURCE
              |     #include <dirent.h>
              |     #include <fcntl.h>
              |     #include <malloc.h>
              |     #include <stdio.h>
              |     #include <stdlib.h>
              |     #include <string.h>
              |     #include <unistd.h>
              |
              |     #define BUF_SIZE 32768
              |
              |     struct linux_dirent64 {
              |       ino64_t d_ino;           /* 64-bit inode number */
              |       off64_t d_off;           /* Not an offset; see getdents() */
              |       unsigned short d_reclen; /* Size of this dirent */
              |       unsigned char d_type;    /* File type */
              |       char d_name[];           /* Filename (null-terminated) */
              |     };
              |
              |     int writeall(char *buf, size_t len) {
              |       ssize_t wres = write(1, buf, len);
              |       if (wres == -1) {
              |         perror("write");
              |         return -1;
              |       }
              |       if (((size_t)wres) < len) {
              |         return writeall(buf + wres, len - wres);
              |       }
              |       return 0;
              |     }
              |
              |     int main(int argc, char **argv) {
              |       if (argc != 2) {
              |         return EXIT_FAILURE;
              |       }
              |       int fd = open(argv[1], O_DIRECTORY | O_RDONLY);
              |       if (fd == -1) {
              |         perror("open");
              |         return EXIT_FAILURE;
              |       }
              |       void *buf = malloc(BUF_SIZE);
              |       ssize_t res = 0;
              |       do {
              |         res = getdents64(fd, buf, BUF_SIZE);
              |         if (res == -1) {
              |           perror("getdents64");
              |           return EXIT_FAILURE;
              |         }
              |         void *it = buf;
              |         while (it < (buf + res)) {
              |           struct linux_dirent64 *elem = it;
              |           it += elem->d_reclen;
              |           size_t len = strlen(elem->d_name);
              |           if (writeall(elem->d_name, len + 1) == -1) {
              |             return EXIT_FAILURE;
              |           }
              |         }
              |       } while (res > 0);
              |       return EXIT_SUCCESS;
              |     }
        
               | db48x wrote:
               | You're still doing unnecessary work. You're turning a
               | list of files into a string, then parsing the string back
               | into words.
               | 
               | Your shell already provides a nice abstraction over
               | calling readdir directly. A glob gives you a list, with
               | no intermediate stage as a string that needs to be
               | parsed. You can iterate directly over that list.
               | 
               | Every language provides either direct access to the C
               | library, so that you can call readdir, or it provides
               | some abstraction over it to make the process less
               | annoying. In Common Lisp the function `directory` takes a
               | pathname and returns a list of pathnames for the files in
               | the named directory. In Rust there is the
               | `std::fs::read_dir` that gives you an iterator that
               | yields `io::Result<std::fs::DirEntry>`, allowing easy
               | handling of io errors and also neatly avoiding an extra
               | allocation. Raku has a function `dir` that returns a
               | similar iterator, but with the added feature that it can
               | match the names against a regex for you and only yield
               | the matches. You can fill in more examples from your
               | favorite languages if you want.
        
             | emmelaich wrote:
             | Wow, these replies. I was being a little sarcastic as there
             | is no 'readdir' shell command. That is all.
        
         | DonHopkins wrote:
         | That doesn't solve the problem that bash is completely useless
         | for manipulating JSON.
         | 
         | It certainly would make writing Python scripts that need to
         | interact with other programs easier. But Python doesn't
         | desperately NEED to interact with so many other programs for
         | such simple tasks like enumerating files or making http
         | requests or parsing json, the way bash does.
        
           | Kwpolska wrote:
            | Bash is useless at JSON _now_. There's nothing stopping Bash
           | from introducing native JSON parsing.
        
         | bpshaver wrote:
         | https://kellyjonbrazil.github.io/jc/docs/parsers/ls
        
       | Aerbil313 wrote:
       | What to do instead: Use Nushell.
       | 
       | I finally started really using my shell after switching to it. I
       | casually write multiple scripts and small functions per day to
       | automate my stuff. I'm writing scripts I'd otherwise write in
       | python in nu. All because the data needs no parsing. I'm not even
       | annotating my data with types even though Nushell supports it
       | because it turns out structured data with inferred types is more
       | than you need day-to-day. I'm not even talking about all the
        | other nice features other shells simply don't have. See this
        | custom command definition:
        |
        |     # A greeting command that can greet the caller
        |     def greet [
        |       name: string      # The name of the person to greet
        |       --age (-a): int   # The age of the person
        |     ] {
        |       [$name $age]
        |     }
       | 
        | Here's the auto-generated output when you run `help greet`:
        |
        |     A greeting command that can greet the caller
        |
        |     Usage:
        |       > greet <name> {flags}
        |
        |     Parameters:
        |       <name> The name of the person to greet
        |
        |     Flags:
        |       -h, --help: Display this help message
        |       -a, --age <integer>: The age of the person
       | 
        | It's one of those pieces of software that only empower you, immediately,
       | without a single downside. Except the time spent learning it, but
       | that was about a week for me. Bash or fish is there if I ever
       | need it to paste some shell commands.
        
         | db48x wrote:
         | Parsing, or the lack thereof, is not the point. The point is
         | that standard shells already provide all the tools you need for
         | dealing with lists of files. Want to do something for every
          | file? Write this:
          |
          |     shopt -s nullglob
          |     for f in *; do
          |       ...
          |     done
         | 
          | But never this:
          |
          |     for f in $(ls); do
          |       ...
          |     done
         | 
         | They look similar, but the latter runs ls to turn the list of
         | files into a string, then has the shell parse the string back
         | into a list. Even if the parsing was done correctly (and it
         | isn't), this is still extra work. Looping over the glob avoids
         | the extra work.
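The `shopt -s nullglob` line above is doing real work: without it, a glob that matches nothing expands to the literal pattern, so the loop body runs once on a name that doesn't exist.

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"   # empty scratch directory

shopt -u nullglob
for f in *.jpg; do echo "default: $f"; done    # prints "default: *.jpg"

shopt -s nullglob
for f in *.jpg; do echo "nullglob: $f"; done   # prints nothing
```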
        
           | Aerbil313 wrote:
            | I have to say this is very unintuitive. In Nushell, you'd do:
            |
            |     ls | each { ... }
           | 
            | More examples I don't need to explain, which would be far
            | harder in stringly typed shells:
            |
            |     ls | where type == file and size <= 5MiB |
            |        sort-by size | reverse | first 10
            |
            |     ps | where cpu > 10 and mem > 1GB | each { kill $in.pid }
           | 
            | It's immediately obvious what you need to do when you can
            | easily visualize your data:
            |
            |     > ls
            |     +----+-----------------------+------+-----------+-------------+
            |     | #  |         name          | type |   size    |  modified   |
            |     +----+-----------------------+------+-----------+-------------+
            |     |  0 | 404.html              | file |     429 B | 3 days ago  |
            |     |  1 | CONTRIBUTING.md       | file |     955 B | 8 mins ago  |
            |     |  2 | Gemfile               | file |   1.1 KiB | 3 days ago  |
            |     |  3 | Gemfile.lock          | file |   6.9 KiB | 3 days ago  |
            |     |  4 | LICENSE               | file |   1.1 KiB | 3 days ago  |
            |     |  5 | README.md             | file |     213 B | 3 days ago  |
            |     ...
        
             | db48x wrote:
             | I didn't say that nushell is bad, I said that it's not
             | relevant to the discussion. nushell provides typed data in
             | pipelines, which is cool. But standard shells already have
             | typed data for this particular use case, thus parsing
             | untyped data is unnecessary. Of course it would be nice if
             | that typed data could be used in a pipeline, but everything
             | had to start somewhere.
        
               | acureau wrote:
               | Who are you to decide what's relevant to the discussion?
               | It's very clearly on topic. I had never heard of nushell
               | and I'm glad it was mentioned
        
           | CJefferson wrote:
            | How do I replace:
            |
            |     for f in $(cd subdir; ls); do
            |       ...
            |     done
            |
            | ?
        
             | ykonstant wrote:
              | Either
              |
              |     for f in subdir/*; do
              |       ...
              |     done
              |
              | or
              |
              |     (
              |       cd subdir || exit 1
              |       for f in *; do
              |         ...
              |       done
              |     )
             | 
             | work fine. However, I must insist against using `for` loops
             | in favor of `find`.
        
       | probably_wrong wrote:
       | I think there's a middle point where you want to do something
       | that's complex enough that a glob won't cut it but simple enough
       | that switching languages is not worth it.
       | 
       | I think the example of "exclude these two types of files" is a
       | good case. I often have to write stuff like `ls P* | grep -Ev
       | "wav|draft"` which doesn't solve a problem I don't have (such as
       | filenames with newlines in them) but does solve the one I do
       | (keeping a subset of files that would be tricky to glob
       | properly).
       | 
       | In my experience 95% of those scripts are going to be discarded
       | in a week, and bringing Python into it means I need to deal with
       | `os.path` and `subprocess.run`. My rule of thumb: if it's not
       | going to be version controlled then Bash is fine.
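For what it's worth, bash's extended globs can express that particular exclusion without parsing `ls`. A sketch of the `ls P* | grep -Ev "wav|draft"` equivalent, with made-up sample files:

```shell
#!/usr/bin/env bash
shopt -s extglob nullglob
cd "$(mktemp -d)"
touch P1.txt P2.wav Pdraft.txt Q.txt   # hypothetical sample files

# Names starting with P whose remainder contains neither "wav" nor "draft":
printf '%s\n' P!(*wav*|*draft*)        # prints only P1.txt
```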
        
         | OrderlyTiamat wrote:
         | You might enjoy a variety of `find` based commands, e.g. `find
         | -maxdepth 1 -iregex ".*\\.(wav|draft)" | xargs echo "found
         | file:"`
         | 
         | This uses regex to match files ending in .wav or .draft (which
         | is what I interpreted you to want). Xargs then processes the
         | file. You could use flags to have xargs pass the file names in
         | a specific place in the command, which can even be a one liner
         | shell call or some script.
         | 
          | So the "find <regex> | xargs <command>" pattern is almost fully
         | generally applicable to any problem where you want to execute a
         | oneliner on a number of files with regular names. (I think gnu
          | find has no extended regex, which is just as well - that's not a
         | "regular expression" at that point)
        
           | bheadmaster wrote:
           | > You might enjoy a variety of `find` based commands, e.g.
           | `find -maxdepth 1 -iregex ".*\\.(wav|draft)" | xargs echo
           | "found file:"`
           | 
            | Find can even execute commands itself without using `xargs`:
            |
            |     find -maxdepth 1 -iregex '.*\.\(wav\|draft\)' \
            |       -exec echo "found file:" {} \;
        
             | Izkata wrote:
             | Definitely do it this way if you want to stick to the pre-
             | filtered version (I recommend the cousin comment, filter
              | inside the loop). GP's version is buggy in exactly the
              | way the article warns about, particularly with files that
             | somehow got newlines in the filename (xargs is newline-
             | delimited by default).
             | 
             | If for some reason you do need the "find | xargs" combo
             | (maybe for concurrency), you can get it to work with "find
             | -print0" and "xargs -0". Nulls can't be in filenames so a
             | null-delimited list should work.
        
               | ykonstant wrote:
               | As an addendum, note that `-print0` and `-0` for find and
               | xargs respectively are now in the latest POSIX standard,
               | so their use is compliant.
        
               | genrilz wrote:
               | The latest standard I know of is SuS 2018, which I have
               | the docs for, and does not include either switch. I
               | searched around a bit and it doesn't seem like there is a
               | new one. Are you referring to some draft? I sure wish
               | this was true.
               | 
               | That being said, I would interpret "-exec printf '%s\0'
               | {} +" as being a posix compliant way for find to output
               | null delimited files. I say this since the docs for the
               | octal escape for printf allows zero digits. However, most
               | posix tools operate on "text" input "files", which are
               | defined as not having null characters. Thus I don't think
                | outputting nulls could be easily used in a POSIX-
                | compliant way. In practice, I would expect many POSIX
               | implementations to also not handle nulls well because C
               | uses null to mean end of string, so lots of C library
               | calls for dealing with strings will not correctly deal
               | with null characters.
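Spelled out, that `-exec printf '%s\0' {} +` suggestion looks like the following; it reproduces the `-print0` stream using only facilities the older standards do guarantee (the consuming `xargs -0` remains an extension):

```shell
#!/bin/sh
# NUL-delimited file list without -print0: printf's '\0' octal escape
# emits the delimiter, so names with spaces or newlines pass through.
find . ! -name . -prune -type f -exec printf '%s\0' {} + |
  xargs -0 -n1 echo got:
```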
        
               | PhilipRoman wrote:
               | >xargs is newline-delimited by default
               | 
               | Even worse, it is whitespace delimited (with its own
               | rules for escaping with quotes and backslashes)
        
               | hifromwork wrote:
               | It is not, but (for reasons unknown to me) it doesn't
                | quote parameters in the default mode. Consider:
                |
                |     touch "a b"
                |     ls | xargs rm         # this won't work, rm gets two parameters
                |     ls | xargs -i,, rm ,, # this will work
        
               | PhilipRoman wrote:
               | https://pubs.opengroup.org/onlinepubs/9699919799/utilitie
               | s/x...
               | 
               | >[..] arguments in the standard input are separated by
               | unquoted <blank> characters [..]
               | 
               | As for -i, it is documented to be the same as -I, which,
               | among other things, makes it so that "unquoted blanks do
               | not terminate input items; instead the separator is the
               | newline character."
        
               | hifromwork wrote:
               | >GP's version is buggy in the same way as the post
               | misunderstands, particularly with files that somehow got
               | newlines in the filename
               | 
               | I understand this caveat, but I never had a file with
               | newline that I cared about. Everyone keeps repeating this
               | gotcha but I literally don't care. When I do "ls | grep
               | [.]png\$ | xargs -i,, rm ,," (yes, stupid example) there
               | is 0% chance that a png file with a newline in the name
               | found itself in my Downloads folder. Or my project's
               | source code. Or my photo library. It just won't happen,
               | and the bash oneliner only needs to run once. In my 20
               | years of using xargs I didn't have to use -0 even once.
        
         | bheadmaster wrote:
         | It's not _necessary_ to bring Python into it, Bash can handle
         | filenames with weird characters properly if you know how to use
         | it.
         | 
         | E.g. instead of `ls | grep -Ev 'wav|draft'`, you'd have to do
          | something like
          |
          |     for filename in *; do
          |       if grep -E 'wav|draft' >/dev/null <<< "$filename"
          |       then : # ...
          |       fi
          |     done
         | 
         | Of course, it's more convoluted, but when you're writing
         | scripts that might be used for a long time and by many people,
         | it helps to know that it is _possible_ to write robust things.
         | Tools like shellcheck certainly help.
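One more option in the same robust spirit: a `case` statement does the substring test inside the shell, avoiding a `grep` fork per file, which starts to matter in large directories. A sketch:

```shell
#!/usr/bin/env bash
# Same filter as the grep version, with no subprocess per file.
for filename in *; do
  case $filename in
    *wav*|*draft*) ;;                       # excluded, do nothing
    *) printf 'keep: %s\n' "$filename" ;;   # the real action goes here
  esac
done
```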
        
           | PeterWhittaker wrote:
           | grep -q and you won't need the redirect of stdout.
        
           | ykonstant wrote:
           | The above is perfectly fine for small directories, but in
           | general the preferred way to loop over files is with find:
            |     find . ! -name . -prune \
            |       -exec grep -qE 'wav|draft' {} \; \
            |       -exec "${action}" \; ;
           | 
           | Edit: I missed the herestring in the original code, so the
           | above is wrong as mentioned in the comments; if your find has
            | regex, you can use it to save one grep:
            |
            |     find . ! -name . -prune \
            |       -regex '.*wav.*\|.*draft.*' \
            |       -exec "${action}" \; ;
           | 
           | Otherwise you can call sh to printf the filename into a grep.
           | 
           | However, the point of my post is that find can perform _seek_
           | , _filter_ and _execute_ , and should be used for all three
           | unless it is really impossible (which is unlikely).
        
             | tux1968 wrote:
             | Your example is grepping the file contents, where GP is
             | using grep to select the filenames.
        
               | ykonstant wrote:
               | D'oh!
        
           | some_random wrote:
           | At that point I think you need to ask yourself why you're
           | using Bash to begin with. If it's just meant to be a quick
           | script that's run occasionally then this is good but probably
           | overkill. If it's going into prod to be run regularly as part
           | of business critical, then it should be in a language that
           | has a less convoluted way to _ls a directory_. There's an
           | inflection point somewhere in there, where it is depends on
           | you.
        
             | mywittyname wrote:
             | Am I monitoring the execution?
             | 
             | Yes: bash is probably fine.
             | 
             | No: real programming language time.
        
         | some_random wrote:
         | Before you write anything, you need to think about the cost of
         | it breaking and the chance of it breaking, and Bash scripts in
         | VC tend to maximize both. I like that heuristic a lot.
        
       | mcc1ane wrote:
       | https://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_....
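       | 
       | The linked pitfall in miniature: word-splitting the output of ls
       | mangles any name containing whitespace, while a plain glob hands
       | over one argument per file (a small bash illustration):

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"
touch "a b"    # a single file whose name contains a space

# Wrong: $(ls) is word-split on whitespace, so one file
# becomes two loop iterations ("a", then "b").
for f in $(ls); do echo "split: $f"; done

# Right: pathname expansion yields one word per file, intact.
for f in *; do echo "glob: $f"; done
</antml;>```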
        
       | teddyh wrote:
       | Many people turn to globbing to save them, which is usually
       | better, but has some problems in case of no matches. But, for
       | Bash, you can do this to fix it:
       | 
       |     shopt -s failglob
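       | 
       | A sketch of the difference (bash-specific; nullglob shown as the
       | other common fix):

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"    # empty directory, so *.txt matches nothing

# Default: an unmatched glob is passed through literally, so this
# loop still runs once with the bogus name "*.txt".
for f in *.txt; do echo "literal: $f"; done

# failglob: the expansion itself fails, bash prints an error to
# stderr, and echo is never invoked at all.
shopt -s failglob
echo *.txt

# nullglob: unmatched globs expand to nothing, so the loop body
# simply never executes.
shopt -u failglob; shopt -s nullglob
for f in *.txt; do echo "never printed: $f"; done
</antml;>```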
        
       | billpg wrote:
       | Why do you want to put LF bytes into filenames?
       | 
       | Using magic, I've renamed any files you have to remove control
       | characters in the name and made it impossible to make any new
       | ones. (You can thank me later.)
       | 
       | What can't you do now?
        
         | tredre3 wrote:
         | Preventing certain characters in filenames would solve a _lot_
         | of issues, from security issues to wasted time all around.
         | 
         | But for whatever reason, when it is suggested, you get many
         | people chiming in that "filenames should be dumb bytes,
         | anything allowed except / !"
        
       | badgersnake wrote:
       | Some front end clown is about to suggest all tools should output
       | JSON by default, aren't they? They'll probably add that it's fine
       | to start up 17 V8 engines just to parse all this JSON because
       | modern laptops have loads of RAM, and anyway it's modern and
       | cross-platform.
        
         | wslh wrote:
         | An extension of this could be to also input everything in
         | JSON: {"command": "ls", "parameters": ["-l"]}, etc.
        
           | solardev wrote:
           | Didn't Microsoft try to define something like that with
           | Powershell, with parameters being objects (though not JSON)?
        
         | solardev wrote:
         | But hey, at least it's not YAML!
        
           | emmelaich wrote:
           | The funny thing is that so many bits of info come in formats
           | very much like, but not quite, YAML.
           | 
           | e.g. /proc/cpuinfo
        
           | badgersnake wrote:
           | Of course, with some k8s yaml we can run all our cli tools in
           | separate containers each with their own userland.
        
         | lukan wrote:
         | I do not think parsing json requires a full blown javascript
         | engine.
        
         | williamcotton wrote:
         | Ever heard of jq?
        
         | amelius wrote:
         | What do you suggest? Have 100 different ways to parse output?
         | Think about the resulting code bloat.
         | 
         | And no, you don't need V8 to parse JSON.
        
         | afiori wrote:
         | Sadly, JSON does not handle non-UTF-8 strings.
        
           | scintill76 wrote:
           | Is that a problem for this application, though? Don't most
           | people encode their file names in utf8, or is that an ASCII-
           | centric falsehood?
        
         | MaxMatti wrote:
         | JSON is maybe a bit heavy, but using a machine-readable format
         | such as tsv or csv (including configuring your terminal
         | emulator to properly display it) would be a big step up from
         | the status quo.
        
         | CJefferson wrote:
         | Do you really think outputting a stream of JSON, as opposed to
         | plain text, would add any measurable overhead to all command
         | line tools?
         | 
         | Honestly, I'd love this. Output one JSON object per file; bash
         | already has hash tables and lists, so it has all the types we
         | need for JSON.
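         | 
         | Bash (4+) does ship both container types mentioned; a quick
         | sketch with made-up keys and values:

```shell
#!/usr/bin/env bash
# Associative array ("hash table") holding one file's attributes.
declare -A meta=([name]="my song.wav" [size]=1234)

# Indexed array ("list") of filenames.
declare -a names=("my song.wav" "draft 2.txt")

echo "${meta[name]} is ${meta[size]} bytes"
echo "${#names[@]} names tracked"
</antml;>```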
        
       | badsectoracula wrote:
       | I guess this is for shell scripts that need to work with "unsafe"
       | filenames?
       | 
       | I've been using Linux since 1999 and I never came across a
       | filename with newlines. On the other hand, pretty much all "ls
       | parsing" I've done was on the command line, piping it to other
       | stuff on files I was 100.1% sure would be fine.
        
         | kemitchell wrote:
         | When teaching beginners shell, it's natural to teach `ls` for
         | listing directory contents. It's also natural to extend from
         | `ls` to `ls | ...` for processing lists of files.
         | 
         | The important point to get across is that pipes let us build
         | bigger commands _from the commands we already know_. If needed,
         | you can back up later to teach patterns like `find [...]
         | -exec`, `find [...] -print0 | xargs -0 [...]`, `find [...] |
         | while read -r file; do [...] done` and so on.
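         | 
         | Of those, only the NUL-delimited pattern survives every legal
         | filename; a sketch (read -d '' is a bash/zsh feature):

```shell
#!/usr/bin/env bash
# NUL-delimited handoff: safe even for names containing spaces,
# quotes, or embedded newlines. The plain newline-delimited
# "find | while read -r file" variant breaks on the last of those.
find . -type f -print0 |
while IFS= read -r -d '' file; do
    printf 'found: %s\n' "$file"
done
</antml;>```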
         | 
         | There are all kinds of prerequisites to creating files with
         | unusual names. Those barriers tend to mean beginners won't run
         | into file name processing edge cases for a while. The exception
         | will be files they download from the Internet. But the
         | complexity there will usually be quotes and non-ASCII Unicode
         | characters, not newlines or other control codes.
         | 
         | In teaching, the one filename complexity I would try to get
         | ahead of, preventively, is spaces. There was a time, way back
         | when, when newbies seemed content to stick with short, simple
         | filenames. These days, the people I've helped tend to be used
         | to putting spaces in file names in Finder and Explorer for
         | office or school work.
        
       | zokier wrote:
       | I wonder if anyone has implemented a kernel module or some such
       | to limit filenames to a sane set. Just ensuring that they are
       | valid UTF-8 and contain no non-printables would be a huge
       | improvement. Sure, some niche applications might break, so it's
       | not something that can be made the default, but I still think it
       | would help on systems I control.
        
       | tmtvl wrote:
       | Today I learned how neat find is:
       | 
       |     find ~/Music -iname 'p*' -not -iname '*age*' -not -iname '*etto*'
       |     find ~/Music -iname 'p*' -not -iregex '.*\(age\|etto\).*'
       |     find ~/Music -regextype posix-extended -iname 'p*' -not -iregex '.*(age|etto).*'
       | 
       | Not that I'm likely to ever use any of that in anger, but it's
       | good to know if ever I do wind up needing it.
        
       | InsideOutSanta wrote:
       | Files and directories, once a reference to them is obtained,
       | should not be identified by their path. This causes all kinds of
       | problems, like the reference breaking when the user moves or
       | renames things, and issues like the ones described in the
       | article, where some "edge case" (and I'm using that term very
       | loosely, because it includes common situations like a space in a
       | file name) causes problems down the line.
       | 
       | You might say that people don't move or rename things while files
       | are open, but they absolutely do, and it absolutely breaks
       | things. Even something as simple as starting to copy a directory
       | in Explorer to a different drive, and then moving it while the
       | copy is ongoing, doesn't work. That's pathetic! There is no
       | technical reason this should not be possible.
       | 
       | And who can forget the case where an Apple installer deleted
       | people's hard disk contents when they had two drives, one with a
       | space character, and another one whose name was the string before
       | the first drive's space character?
       | 
       | Files and directories need to have a unique ID, and references to
       | files need to be that ID, not their path, in almost all cases.
       | MFS got that right in 1984, it's insane that we have failed to
       | properly replicate this simple concept ever since, and actually
       | gone backwards in systems like Mac OS X, which used to work
       | correctly, and now no longer consistently do.
        
         | Kwpolska wrote:
         | IDs don't really solve many problems. The issues with scripts
         | removing all your files were either caused by the absurd bash
         | spaces and quotes rules, or by bash silently ignoring
         | nonexistent variables. Those scripts would still need paths,
         | since the ID of ~/.steam will be different for everyone.
         | Scripts that need to work on more than one system, and human-
         | authored config files, would still have paths. There are cases
         | where you want to depend on the path, not the identity of the
         | folder, and potentially swap the folder with something else
         | without editing configuration.
         | 
         | Explorer needs to support local drives, with a lot of
         | filesystems, including possibly third-party ones, but also
         | network drives, FTP, WebDAV, and a bunch of other niche things.
         | Not all of them have IDs, and some may not be extensible. The
         | cost is massive, solving it everywhere is
         | impossible, and the benefit seems negligible to me (even though
         | I fairly recently managed to eject a disk image (vhdx) in the
         | middle of copying files onto it...)
        
           | InsideOutSanta wrote:
           | Earlier versions of Mac OS had APIs to retrieve the IDs of
           | directories and files relevant for things like installing
           | applications (such as the System directory). It
           | effectively never used paths to identify any files; if users
           | opened a file, they'd use the system file picker, which would
           | provide the application a file ID, not a path.
           | 
           | Similarly, things like config files would be identified by
           | their name, not their path, because the directory containing
           | configs was a directory the system knew about. As a result,
           | no application needed to know the path to its own config
           | files.
           | 
           | This meant there was no action that the system prevented you
           | from doing to an open file, other than actually deleting that
           | file. There was also no way for an installer to accidentally
           | break your system because its code didn't take your drive,
           | file, or directory names into account.
           | 
           | And, of course, there _are_ file systems that don't use
           | paths at all, like HashFS, a bunch of modern document
           | management systems, or the Newton's Soup.
           | 
           | I get your point about interoperability with existing file
           | systems, but I think it's perfectly acceptable to offer
           | better solutions where possible, and fall back to paths for
           | situations where that is not possible.
        
       | 7bit wrote:
       | Or use PowerShell where LS returns a bunch of objects, and say
       | goodbye to string parsing forever.
        
         | chickenimprint wrote:
         | nushell is the superior structured data shell and it's cross-
         | platform. https://www.nushell.sh/
        
           | redserk wrote:
           | I've only used Powershell a little bit on Linux and Mac but
           | it seems reasonably cross-platform.
           | 
           | On the surface, it looks like I'd be giving up the decently
           | sized ecosystem of Powershell libraries for a new ecosystem
           | without much support?
           | 
           | I'm interested in knowing what Nushell does differently since
           | I'm wanting to find a better shell.
        
             | ericfrederich wrote:
             | Wait until you realize that "giving up the decently sized
             | ecosystem of Powershell libraries" is a net positive ;-)
        
             | chickenimprint wrote:
             | I'm probably not the best person to ask, since the last
             | time I touched Powershell, it was Windows only, but I'd say
             | nushell is likely a lot more platform-agnostic, has sane
             | syntax and follows a functional paradigm. Plugins are
             | written in Rust. It's probably not worth it if all you do
             | is Windows sysadmin work, as you'd have to serialize and
             | deserialize data when interacting with Powershell from nu.
        
       | TacticalCoder wrote:
       | Now, of course, scripts and pre-commit hooks enforcing simple
       | rules, so that filenames _must_ use only a subset of Unicode,
       | are a thing and do help.
       | 
       | Do you really think that, say, all music streaming services are
       | storing their songs with names allowing Unicode HANGUL fillers
       | and control characters allowing to modify the direction of
       | characters?
       | 
       | Or... maybe, just maybe, Unicode characters belong in metadata,
       | and a strict rule of "only visible ASCII chars are allowed and
       | nothing else or you're fired" does make sense.
       | 
       | I'm not saying you always have control on every single filename
       | you'll ever encounter. But when you've got power over that and
       | can enforce saner rules, sometimes it's a good idea to use it.
       | 
       | You'll thank me later.
        
       | jcalvinowens wrote:
       | Not sure how portable it is, but GNU ls has a flag to solve this
       | problem trivially:
       | 
       |     --zero    end each output line with NUL, not newline
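       | 
       | It composes cleanly with other NUL-aware tools; for example
       | (GNU coreutils 9.0 or newer, where --zero was introduced):

```shell
# Each entry is NUL-terminated, so xargs -0 recovers every name
# exactly, whatever bytes it contains.
ls --zero | xargs -0 -n1 printf 'entry: [%s]\n'
</antml;>```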
        
       | renewiltord wrote:
       | I just solve this by not having files like that on my computer.
       | No spaces. No null chars.
        
       | waffletower wrote:
       | Borkdude has a wonderful Clojure/Babashka solution in this space:
       | https://github.com/babashka/fs
        
       | geophile wrote:
       | I wrote a pipe-objects-instead-of-strings shell:
       | https://marceltheshell.org.
       | 
       | Not piping strings avoids this issue completely. Marcel's ls
       | produces a stream of File objects, which can be processed without
       | worrying about whitespace, EOL, etc.
       | 
       | In general, this approach avoids parsing the output of any
       | command. You always get a stream of Python values.
        
         | fooker wrote:
         | > In general, this approach avoids parsing the output of any
         | command.
         | 
           | Somewhere, there has to be a validation phase. Just because
           | you have objects doesn't mean they are well formed.
         | 
         | https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
         | 
         | It turns out proper validation is way harder than parsing.
         | There is a reason text based interfaces and formats are so
         | pervasive.
        
           | geophile wrote:
           | As long as Python does the right thing with globs, there is
           | really no room for marcel to get it wrong. Not sure what
           | additional validation you are thinking of.
        
       | lostmsu wrote:
       | This is a problem I faced recently on Linux. You can use ip addr
       | to see the list of your IPv6 addresses and their types (temporary
       | or not, etc). But doing it programmatically from a non-C codebase
       | is way more involved.
        
       | Nimitz14 wrote:
       | These sorts of pedantic exchanges are so pointless to me. We are
       | programmers. We can control what characters are used in
       | filenames. Then you can use the simplest tool for the job and
       | move on with your life to focus on the stuff that actually
       | matters. Fix the root cause instead of creating workarounds for
       | the symptom.
        
       | tremon wrote:
       | Most of the time I avoid parsing ls, but I haven't found a
       | reliable way to do this one:
       | 
       |     latest="$(ls -1 $pattern | sort --reverse --version-sort | head -1)"
       | 
       | Anyone got a better solution?
        
         | chenxiaolong wrote:
         | This should work with any arbitrary filename:
         | 
         |     latest=$(printf '%s\0' <glob> | sort -zrV | head -zn1)
         | 
         | or with long args:
         | 
         |     latest=$(printf '%s\0' <glob> | sort --zero-terminated --reverse --version-sort | head --zero-terminated --lines 1)
        
           | genrilz wrote:
           | What unix is this on? Neither the Mac nor GNU manpages have
           | a -z or --zero-terminated option for head.
        
             | tremon wrote:
             | Debian's head (from GNU coreutils) does: https://manpages.d
             | ebian.org/bookworm/coreutils/head.1.en.htm...
        
               | genrilz wrote:
               | Yay! Glad to see zero termination flags in more places.
               | 
               | EDIT: The Linux manpages I read were from die.net, which
               | it looks like are from 2010, so I guess I'll have to
               | avoid them in the future. I checked the FreeBSD, OpenBSD,
               | and Mac man pages to make sure, and unfortunately none of
               | them support the -z flag yet.
        
         | genrilz wrote:
         | This one's a hard one. Since "--version-sort" isn't standard
         | anyway, let's assume we can use flags which are common to BSD
         | and GNU. Furthermore, let's assume bash or zsh so we can use
         | "read -d ''".
         | 
         | In that case, how about:
         | 
         |     IFS='' read -d '' latest < <(find $pattern -prune -print0 | sort -z --reverse --version-sort)
        
       | midjji wrote:
       | The bash code that creates a C file which gets the list of
       | null-terminated files in a directory, compiles it, and runs it
       | is easier to write and understand. Bash is a lousy language to
       | do anything in; Python is almost always available, and if not,
       | then a C compiler is.
        
       | cess11 wrote:
       | I don't know, this seems like a lot of words to avoid coming to
       | the conclusion that there are many ways to skin a directory.
       | 
       | Most of the time it's fine to just suck in ls and split it on \n
       | and iterate away, which I do a lot because it's just a nice and
       | simple way forward when names are well-formed. Sometimes it's
       | nicer to figure out a 'find at-place thing -exec do-the-stuff {}
       | \;'. And sometimes one needs some other tool that scours the file
       | system directly and doesn't choke on absolutely bizarre file
       | names and gives a representation that doesn't explode in the
       | subsequent context, whatever that may be, which is quite rare.
       | 
       | A more common issue than file names consisting of line breaks is
       | unclean encodings, non-UTF-8 text that seeps in from lesser
       | operating systems. Renaming makes the problem go away, so one
       | should absolutely do that and then crude techniques are likely
       | very viable again.
        
       ___________________________________________________________________
       (page generated 2024-06-25 23:01 UTC)