[HN Gopher] You shouldn't parse the output of ls(1)
       ___________________________________________________________________
        
       You shouldn't parse the output of ls(1)
        
       Author : tosh
       Score  : 113 points
       Date   : 2021-12-31 11:32 UTC (11 hours ago)
        
 (HTM) web link (mywiki.wooledge.org)
 (TXT) w3m dump (mywiki.wooledge.org)
        
       | Tepix wrote:
       | Very valuable points that are too easily forgotten. Thanks
        
       | gorgoiler wrote:
       | Greg aka greycat was a real IRC legend 20 years ago. I learned so
       | much from him.
       | 
       | Many a happy hour did I watch him flaming lazy newcomers looking
       | for a quick fix in #debian, right about the time when Linux as a
       | commercially viable server platform was taking off.
       | 
       | Almost every admonishment was accompanied by sound technical
       | advice which was useful to lurkers as well as the unfortunate
       | noob who dared ask.
       | 
       | Thanks :)
        
         | ByThyGrace wrote:
         | Greg Wooledge's bash wiki is my go-to resource for bash
         | scripting. Everything I always need to find out is in there
         | (Bash Guide + FAQ). I didn't know about his IRC persona which
         | only improves my appreciation of him, so thanks for sharing.
        
         | dsr_ wrote:
         | On occasion I have posted something in debian-user, adding "but
         | Greg will have a better approach".
         | 
         | Then he shows up and offers a better approach.
         | 
         | Thanks to Greg, my bashrc contains:
         | 
         |     >   stat=( )
         |     >   statcolor=("$Green" "$Red")
         |     >   ...
         |     >   PS1=...
         |     >   ${statcolor[!!$?]}\]${stat[!!$?]}$
         | 
         | Which, if it's not entirely clear, puts a green checkmark or a
         | red x in my prompt depending on the exit status of the last
         | command line run.
        
           | marcosdumay wrote:
           | Oh, a long time ago (but not so long as that) I got this line
           | from a HN thread on bash tricks:
           | 
           |     export PS1="\h:\w \$(if [ \$? = 0 ]; then echo :\\\); else echo :\\\(; fi) \$ "
           | 
           | It's a non-colored version of it, with a happy or sad smiley.
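
A minimal sketch of the same idea written as a function, so the prompt string needs less backslash escaping. The name `status_mark` is hypothetical and does not come from either comment; the smileys follow the non-colored version above.

```shell
# status_mark (hypothetical name): print a smiley for a given exit status.
status_mark() {
    if [ "$1" -eq 0 ]; then
        printf ':)'
    else
        printf ':('
    fi
}

# Possible use in ~/.bashrc. Single quotes defer the expansion until
# bash draws each prompt, so $? is the status of the command just run:
#   PS1='\h:\w $(status_mark $?) \$ '
```

Passing `$?` as an argument keeps the function testable outside a prompt, e.g. `status_mark 1`.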
        
         | Fnoord wrote:
         | Ah, greycat. Yeah I remember him from #debian on Freenode some
         | 20 years ago. Smart, helpful fellow.
        
       | thibran wrote:
       | If common CLI programs had a --json option this would be no
       | problem.
        
       | jmnicolas wrote:
       | I have a bash alias that creates a random playlist of videos or
       | music with ls. I noticed that sometimes there were duplicates in
       | the list.
       | 
       | If I can't use ls, it's not going to be a one liner anymore, so I
       | have to create a file, store it somewhere, assign execute
       | privileges and link my alias to it. Much more complicated.
        
         | theamk wrote:
         | Um, no? As examples show, "find -maxdepth" can do the same
         | things, but safely. And in most cases, it'd still be a
         | one-liner, even if a longer one.
        
         | rascul wrote:
         | > so I have to create a file, store it somewhere, assign
         | execute privileges and link my alias to it
         | 
         | Or put it in a function in the same file your alias is in.
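
A rough sketch combining the two replies: a function that can live next to the alias and uses find instead of ls. The name `random_playlist` is hypothetical, and the sketch assumes GNU find (`-maxdepth`, `-print0`) and GNU shuf (`-z`); it does not claim to reproduce the original alias.

```shell
# random_playlist (hypothetical name): print the regular files directly
# inside a directory in random order, each name terminated by NUL.
# Assumes GNU find (-maxdepth, -print0) and GNU shuf (-z).
random_playlist() {
    find "${1:-.}" -maxdepth 1 -type f -print0 | shuf -z
}

# Possible use, handing the names to a player ("mpv" is only an example):
#   random_playlist ~/Music | xargs -0 mpv
```

Because every name ends in NUL rather than a newline, spaces and newlines in filenames pass through to `xargs -0` unchanged.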
        
       | dvh wrote:
       | In a similar way, ftp's "dir" command is only for humans. Every
       | ftp library that gives programs access to ftp is only guessing
       | which part of the "dir" output is the filename.
        
       | renewiltord wrote:
       | My file system is _my_ file system. I solve this problem by just
       | not having weird file names on it.
        
       | ape4 wrote:
       | It's pretty trivial to make a C program that lists a directory in
       | the format you want.
        
       | yagop wrote:
       | Parsing UNIX command outputs is generally a pain and constantly a
       | source of errors. PowerShell mostly solves that; I wish we could
       | use it.
        
       | enriquto wrote:
       | I disagree.
       | 
       | Parsing the textual output of ls is such a natural idiom that I'm
       | happy to renounce any other thing that causes trouble. Give me a
       | "-o sanenames" option for mount, instead.
        
         | spicybright wrote:
         | I think the point is ls has so many options, it's not safe to
         | parse as is.
         | 
         | I always use `find .` if I need a list of files from a
         | directory for this reason
        
           | johnisgood wrote:
           | Yeah. I often do "find . | grep 'foo'". Perhaps "find" can do
           | it without the "| grep" bit, but I have not RTFM. :P
        
           | rightbyte wrote:
           | Ye that is what the maintainer said for the -z option for ls
           | too.
        
           | beermonster wrote:
           | I use `echo *`
        
             | vidarh wrote:
             | `echo *` has many of the downsides of ls (doesn't escape
             | e.g. space in filenames) and _additionally_ breaks on
             | directories where the expansion fills the command line
             | buffer.
             | 
             | EDIT: Also note that "find ..." is also not safe from all
             | the quoting issues without "-print0" or equivalent options
             | to make it separate the names with ASCII NUL rather than
             | linefeed or otherwise taking steps to handle filenames with
             | actual linefeeds in them.
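
A small sketch of the difference described above: counting files by newline versus by NUL terminator. Both function names are hypothetical, and the second assumes a find that supports `-print0` (GNU and BSD find do).

```shell
# Both function names are hypothetical.

# Counts newline-terminated records: a name containing a newline
# is counted more than once.
count_by_lines() {
    find "$1" -type f | wc -l
}

# Counts NUL terminators: NUL cannot occur in a filename, so each file
# is counted exactly once. Assumes find supports -print0.
count_by_nul() {
    find "$1" -type f -print0 | tr -cd '\0' | wc -c
}
```

In a directory holding one ordinary file and one file with a newline in its name, the two counts disagree; only the NUL-based one matches the number of files.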
        
               | xorcist wrote:
               | Indeed. Those backticks only work in a shell, but in a
               | shell why not just write
               | 
               |     *
               | 
               | which is how filename expansion is supposed to work.
        
               | beermonster wrote:
               | The back ticks were supposed to be quotes, I meant to say
               | 
               | echo *
               | 
               | However as per replies, this suffers from the same issues
               | as ls.
        
               | vidarh wrote:
               | The problem is not the backticks. They were just used as
               | quote characters. The problem is that shell expansion
               | doesn't escape the characters. E.g. this is a cut-down
               | output from my system now after I did an echo >'/tmp/ space ':
               | 
               |     $ echo /tmp/*
               |     /tmp/bspwm_0_0-socket /tmp/config-err-Q667kI /tmp/foolog /tmp/ space  /tmp/...
               | 
               | Parse that output and you get a broken list of filenames.
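
A minimal sketch of using the glob directly, as suggested earlier in this subthread, instead of parsing the output of `echo`: each match arrives as one word, so no splitting happens as long as the variable stays quoted. The name `list_entries` is hypothetical.

```shell
# list_entries (hypothetical name): print each entry of a directory on
# its own line, wrapped in <> so leading/trailing spaces stay visible.
list_entries() {
    for entry in "$1"/*; do
        # An unmatched glob is left as the literal pattern; skip it.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '<%s>\n' "${entry##*/}"
    done
}
```

The printed lines are for humans only: a name containing a newline would still span two lines. The point is that inside the loop `"$entry"` is always one intact filename. Names starting with a dot are not matched by `*`.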
        
           | traceroute66 wrote:
           | find or indeed, the Rust-based fd[1] which is infinitesimally
           | faster.
           | 
           | [1]https://github.com/sharkdp/fd
        
             | vidarh wrote:
             | If you're listing just the filenames, the things that make
             | fd fast (the parallelised directory traversal when you do
             | something that requires stat calls or similar) are
             | irrelevant, as the getdents() calls are going to be more
             | affected by your buffer size.
             | 
             | So for the limited subset of tasks where you're ok with
             | using a tool that might not be installed _and_ need options
             | that require stat calls _and_ the directories may be large
             | enough, it might make a difference.
        
             | mellavora wrote:
             | infinitesimally
             | 
             | https://en.wikipedia.org/wiki/Infinitesimal
             | 
             | "In mathematics, an infinitesimal or infinitesimal number
             | is a quantity that is closer to zero than any standard real
             | number, but that is not zero."
             | 
             | I'll blame this one on auto-correct.
        
         | pkrumins wrote:
         | It's not an idiom. It's a fireable offense.
        
         | rightbyte wrote:
         | Ye I mean when doing sysadmin stuff you know to avoid asking
         | for it with, like, filenames with spaces. Why even bother
         | handling newlines or what not.
        
       | mattowen_uk wrote:
       | Dead link (HN hug of death?)
       | 
       | Cached copy:
       | 
       | https://webcache.googleusercontent.com/search?q=cache:eZI_am...
        
       | tyingq wrote:
       | This is one reason Perl was very popular even before CGI was a
       | thing. You could get to things like stat() with an interpreted
       | language that was very portable. It also has the "-0" flag to
       | accept the null terminated output of "find -print0".
        
       | artemonster wrote:
       | I wonder what the world would look like if all standard unix
       | tools gave two outputs: human readable and structured, json-like.
        
         | int_19h wrote:
         | FreeBSD is trying for something like that.
         | 
         | https://libxo.readthedocs.io/en/latest/
        
         | xorcist wrote:
         | Perhaps take git as an example instead, with its plumbing and
         | porcelain commands.
         | 
         | They have historically been easier to keep backwards
         | compatibility with than the dict-like structures of json.
        
         | [deleted]
        
         | rnestler wrote:
         | I'd recommend taking a look at https://www.nushell.sh/ which
         | has structured output, but displays it neatly when printing to
         | the terminal.
        
         | Kim_Bruning wrote:
         | You don't need to wonder, because jc is a filter that does just
         | that!
         | 
         | https://kellyjonbrazil.github.io/jc/
        
           | robert_tweed wrote:
           | It even claims to parse ls output correctly (see caveat):
           | https://kellyjonbrazil.github.io/jc/docs/parsers/ls
        
           | kortex wrote:
           | > >>> import jc.parsers.dig
           | 
           | I think "jc" stands for "jesus christ!" because I just
           | exclaimed that out loud thinking about the amount of time
           | I've wasted trying to parse dig outputs, or something
           | similar. Spent a nontrivial amount of time looking for
           | lightweight tools to convert the typical "fwf" of coreutils
           | style programs.
           | 
           | Definitely running "pipx install jc" immediately (pipx is
           | great for managing python-based executable programs, avoid
           | the mess of venvs).
        
         | wayoutthere wrote:
         | Then it would be PowerShell.
        
           | disgruntledphd2 wrote:
           | But with shorter command names, and gnomic short options.
        
           | laumars wrote:
           | No. Powershell is a whole new CLI user land as well as a
           | shell. If you want something that's compatible with POSIX but
           | still has smart pipelines and native support for JSON then
           | you're better off with Elvish or Murex as shells.
        
       | nailer wrote:
       | Just add JSON output to ls like other tools have.
        
         | Tepix wrote:
         | NUL-terminated strings would be the more desirable option, as
         | some unix tools such as find and xargs have already offered
         | for decades.
        
       | yholio wrote:
       | These all seem to be ls bugs. It's a common pattern when
       | outputting data to format it such that the receiver can
       | unambiguously separate the data from the formatting. If you use
       | CR/LF in your output formatting, then those characters need to be
       | escaped in the data. If your attacker can deceive you into
       | printing fake output by crafting their filename as :
       | 
       | "\n -rw-r--r-- 1 user group 12 Dec 15:55 mostly_harmless_planet"
       | 
       | ...then you have already lost.
       | 
       | Violating this pattern always leads to problems like format
       | string vulnerabilities, SQL or executable injections etc. As the
       | long history of fighting against these problems shows, "banning
       | weird characters" without fixing the bugs will always lead to
       | problems, some apparently harmless characters find devious uses
       | etc. You can't unscramble eggs.
       | 
       | The only real solution is properly escaping the payload so that
       | it can be unambiguously interpreted. And you can't claim that the
       | authors of 'ls' don't expect their output to be consumed by other
       | programs.
        
         | josephcsible wrote:
         | Interestingly, ls does escape characters like \n in its output
         | when it's printing to a terminal, but not when it's being piped
         | into other programs. Try this by making a file with a newline
         | in its name, and then comparing "ls" with "ls | cat".
        
       | rvieira wrote:
       | Isn't parsing `ls` the whole backbone of Emacs' `dired`?
        
         | tyingq wrote:
         | If you <ctrl-f> and search for "newline", you can see some of
         | the hackery they do to get around newlines in file names:
         | 
         | https://github.com/emacs-mirror/emacs/blob/master/lisp/dired...
        
       | asicsp wrote:
       | See also: Why _not_ parse `ls` (and what to do instead)?
       | 
       | https://unix.stackexchange.com/questions/128985/why-not-pars...
        
       | NumberWangMan wrote:
       | "Quoted string notation"
       | (https://www.oilshell.org/release/latest/doc/qsn.html) seems like
       | a good way to solve this problem.
        
         | chubot wrote:
         | (author here) Yes thanks, that is exactly the point!
         | 
         | As I point out at the end of the doc, coreutils ls actually
         | started quoting the names in 2016. However the format is
         | confusing for people who can't read 2 or 3 types of shell
         | strings, and not that readable.
         | 
         | In contrast, QSN is simply Rust string literal syntax, which
         | is a cleaned up version of C string literal syntax.
         | 
         |     $ touch $'foo\nbar' 'dq"dq' "sq'sq"    # create 3 files with newline, double quote, single quote
         | 
         |     # coreutils is correct, though I'm not sure people will understand $'\n'
         |     $ ls
         |     'dq"dq'   eggs  'foo'$'\n''bar'  "sq'sq"
         | 
         | Pipe through cat mangles the name:
         | 
         |     $ ls|cat
         |     dq"dq
         |     foo
         |     bar
         |     sq'sq
         | 
         | In Oil, write --qsn will ALWAYS give you 5 lines if you have 5
         | names, no matter what they are
         | 
         |     $ oil -c 'write --qsn -- *'
         |     'dq"dq'
         |     'foo\nbar'  # more familiar encoding
         |     'sq\'sq'
         | 
         | Without --qsn it's like ls|cat:
         | 
         |     $ oil -c 'write -- *'
         |     dq"dq
         |     foo
         |     bar
         |     sq'sq
         | 
         | I think it's important for something like QSN to be built into
         | the shell, because quoting issues arise in many places, not
         | just filenames and ls.
         | 
         | Although this makes me think that we should have the inverse of
         | `printf %q` to parse the output of coreutils ls. Oil does
         | implement printf %q, but most people don't know about it.
         | 
         |     $ printf '%q\n' -- *
         |     dq\"dq
         |     $'foo\nbar'
         |     sq\'sq
         | 
         | Again it is actually correct, but sort of a grab bag of formats
         | derived from shell strings. QSN strings will be familiar to
         | anyone using Python, Rust, etc. consistent with Oil's slogan:
         | _It's for Python and JavaScript users who avoid shell!_
         | 
         | -----
         | 
         | edit: Also reminds me that I wrote this page before designing
         | QSN:
         | https://github.com/oilshell/oil/wiki/Shell-Almost-Has-a-JSON...
         | 
         | So printf %q and %b are inverses in bash, but this doesn't work
         | in other shells. QSN can represent NUL bytes, which are illegal
         | in filenames, but are useful elsewhere.
        
       | unilynx wrote:
       | More importantly, we need to get rid of the ability to put line
       | feeds, tabs in file names and also disallow odd starting
       | characters such as tab, dash and $
       | 
       | I wish someone would add a mount option for that and have eg
       | fedora be a trailblazer to fix the few apps that break
        
         | Tepix wrote:
         | It would be almost trivial to create a fuse filesystem which
         | completely hides these files if they exist (and doesn't allow
         | the creation of new ones).
        
           | patrec wrote:
           | And how would that be even remotely useful? Unless something
           | changed recently, FUSE has so much overhead it's only useful
           | for niche applications and prototyping.
        
           | laumars wrote:
           | It would be trivial to code but any such wrapper would add
           | overhead to file system operations. FUSE is a fantastic set
           | of APIs (I've used it personally) and performs remarkably
           | well considering it is constantly swapping memory between
           | kernel and user space, but for widespread adoption any
           | feature like this would need to be part of the native file
           | system options.
        
         | ezoe wrote:
         | I hope you won't be in the position of handling the non-ascii
         | file names. Whitespaces, symbols and other complicated glyphs
         | are widely used in file name since Windows 95.
        
         | ravenstine wrote:
         | Or... ls could just escape filenames.
         | 
         | I'm not sure why it seemingly doesn't in the year 2021.
        
           | dspillett wrote:
           | From "man ls":
           | 
           |     -b, --escape
           |            print C-style escapes for nongraphic characters
           | 
           | I'd also add:
           | 
           |     -1  list one file per line.  Avoid '\n' with -q or -b
           | 
           | to make sure you can easily split the list by just the EOL,
           | in case for some reason it thinks it is talking to a terminal
           | and tries to format things for a human.
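
A small sketch of what `-b` changes when the output is piped, assuming GNU ls. The function names are hypothetical; the sketch only compares how many lines ls emits for a directory and does not decode the escapes afterwards.

```shell
# Function names are hypothetical. -b is the GNU ls option quoted above.

# Lines printed for a directory without escaping.
ls_lines_raw() {
    ls -1 "$1" | wc -l
}

# Lines printed with -b: a newline in a name is written as the two
# characters \n, so the name stays on one line.
ls_lines_escaped() {
    ls -1b "$1" | wc -l
}
```

For a directory containing a single file whose name includes a newline, the first count is off by one while the second matches the number of files. A consumer still has to undo the C-style escapes to get the real names back.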
        
           | harry8 wrote:
           | ls -Q
           | 
           | --quoting-style=xxx
           | 
           | ?
        
         | patrec wrote:
         | A thousand times this. There is absolutely no reason to allow
         | newlines in filenames, and it is pathetic that there isn't even
         | yet a mount option to disallow totally idiotic filenames (at
         | the minimum I don't want programs to create filenames with
         | newlines or invalid utf-8).
        
           | throwawaylinux wrote:
           | It's great that file names in the user/kernel ABI are treated
           | as unstructured NUL terminated byte streams so I can do what
           | I want with my file names even if you don't like it. And you
           | can do what you want with yours, including not creating ones
           | you think are idiotic, or using filesystems with code or
           | options that restrict what names can be used.
        
             | CorrectHorseBat wrote:
             | But is there any real use case for that? For me I've only
             | encountered this when something else went wrong, I'd rather
             | have an error at that time than later trying to find out
             | what this garbage is and how to remove it.
             | 
             | So why not give a mount option for this behavior?
        
               | throwawaylinux wrote:
               | > But is there any real use case for that?
               | 
               | Compatibility, at least. Which is actually a big one and
               | is basically never broken in Linux.
               | 
               | > So why not give a mount option for this behavior?
               | 
               | Not sure, maybe just nobody yet cared enough to code it
               | up and submit it for inclusion. It's never caused me
               | problems.
        
             | patrec wrote:
             | Can you give a plausible use case? Filenames can't be
             | arbitrary bytes anyway, since they cannot contain '\0' and
             | '/'. What's a realistic example where it's really useful to
             | be able to stuff arbitrary bytes into a filename, just not
             | '/' or '\0' and where C or URL escaping would somehow be
             | onerous enough to justify all the other problems these
             | pathological filenames create?
             | 
             | Do you really think disallowing pathological filenames (at
             | least as a mount option) would be more expensive than the
             | countless security exploits allowing them has already
             | caused, or the massive tax that nearly all software that
             | tries to deal with filenames robustly has to pay for it?
             | 
             | Forget shell scripting, almost no software can afford to
             | just pretend filenames are arbitrary bytes.
             | 
             | They typically still somehow need to be displayed to and be
             | editable by end users somewhere along the line, and this
             | means (in unix-based systems) some conversion to-and-from
             | utf-8. Which is going to cause problems[1].
             | 
             | And even if you don't directly need to handle this yourself
             | (but you do, even for a simple shell script or command-line
             | utility or a library that wants to provide an error message
             | with a filename), there is now a whole lot of extra bloat
             | and complexity and edge cases no one handles in practice.
             | With weirdo types like special filename strings, which are
             | neither bytes nor proper unicode, like python's unicode
             | surrogate encoding (which effectively leaks into all text
             | handling). And of course different languages and
             | ecosystems solve it differently (e.g. whereas python bends
             | its general unicode string for this, Rust has an OsString).
             | 
             | [1] Even the utf-8 compatible subset causes problems of
             | course. E.g. if you have a terminal program that needs to
             | display untrusted filenames to an end user, you now have to
             | deal with problems like terminal escape injection via
             | filenames.
        
               | throwawaylinux wrote:
               | Compatibility. And a mount option seems fine if you don't
               | need compatibility.
               | 
               | That does not relieve applications of the requirement to
               | robustly handle paths and file names though.
               | 
               | > Forget shell scripting, almost no software can afford
               | to just pretend filenames are arbitrary bytes.
               | 
               | Much non-script software can actually treat file names as
               | arbitrary bytes and just pass them through its typical
               | input and output mechanisms. Shells and terminals are
               | very special classes of application, and they need a lot
               | of I/O sanitization whether or not the filesystem
               | restricts file names.
        
         | laumars wrote:
         | That's not a bad suggestion per se but it would only work for Linux
         | and thus you still have other POSIX systems that wouldn't
         | follow suit.
         | 
         | So the advice here of not parsing ls is still prudent.
        
         | cranekam wrote:
         | Why? The name of a file is none of the filesystem's business.
         | If users choose names that make using software difficult it's
         | on them. It's not like there aren't ways to handle any kind of
         | "weird" character in a file name, as the linked article states.
         | 
         | Furthermore, if the kernel/filesystem starts prohibiting
         | certain characters this is more code to maintain and test. User
         | space programs that previously worked fine will stop working.
         | All of this just to prevent someone shooting themselves in the
         | foot by misunderstanding how filenames should be manipulated.
        
           | unilynx wrote:
           | I already can't use NUL and slashes in a file name. And win32
           | limits me even more. It's always been a compromise
           | 
           | And the amount of feet that have been shot by weird file
           | names is staggering.
           | 
           | Programs will stop working, but that's why we need a bleeding
           | edge distribution to find them. In the short term things will
           | break, in the long term quality will increase. Just like
           | memory protection broke some DOS apps in the short term
        
           | laumars wrote:
           | If you're looking at file systems in the purest possible
           | sense, you'd be right. But equally a lot of other meta data
           | is stored beyond the inode that file systems "understand".
           | And there is already precedent for file systems having code
           | to intelligently handle file names (eg options for case
           | sensitivity). So pragmatically it's not an unreasonable
           | suggestion to house any code defining legal file names in the
           | file system driver too.
           | 
           | The biggest argument against that in my view isn't down to
           | testing but DRY methodologies: if the code sits in the kernel
           | then it should work against all file systems and not just
           | supported ones.
        
           | tomrod wrote:
           | In practical reality, the name of the file is the
           | filesystem's business. It would be nice if it operated
           | similar to cloud filesystems, where you could version based
           | on GUID that is disconnected from the filename, but the
           | practical reality is that users and developers have long
           | accepted the local operational mode.
        
           | charcircuit wrote:
           | Try and name a file "(//^-^//)"
           | 
           | You can't because certain characters are prohibited.
        
         | jbverschoor wrote:
         | Nah.. we need to use object graphs as streams instead of
         | whitespace "(un)parsable" text. The output to the console (ui)
         | or gui (ui) can be different, but the data should be structured
        
           | notreallyserio wrote:
           | Sounds like Powershell to me. I'm down, as long as the syntax
           | is as simple and terse as on UNIX-based systems and not what
           | Microsoft did (were they paid by the character for flag
           | names?)
        
             | gerdesj wrote:
             | Absolutely. For example: why can't
             | "Get-TrustAuthorityKeyProviderClientCertificateCSR" simply
             | be "takpccc" as $DEITY intended?
             | 
             | If your keyboard's tab key is still legibly labelled then
             | you aren't trying hard enough or have an eidetic memory and
             | fast typing skills!
        
               | notreallyserio wrote:
               | Amusingly HN cuts off the end of the command you typed, I
               | assume using css overflow attributes (don't have an easy
               | way to tell on my UA). I assume it stops at "cate"[0]. I
               | see this sort of chopping a lot, which naturally makes
               | sharing PS commands frustrating -- although there may be
               | workarounds like using `backticks`.
               | 
               | 0: Nope, had to paste it to see it ends with "cateCSR".
        
               | marcosdumay wrote:
               | They could at least change the names order and start with
               | the specific part
               | (TrustAuthorityKeyProviderClientCertificateCSR-Get), so
               | the (braindead) MS version of tab completion would be
               | useful.
        
               | emj wrote:
               | That is basically integrations, there is never going to
               | be nice integrations to my Cobol mainframe linked to a
               | Springboot fuzzbuzz. As is stated in other comments the
               | big issue is usually about being cross platform, and that
               | is a subset of the ls problem: Most of the time you have
               | control over your inputs, until you haven't. This is true
               | for every language even Python which is obnoxious about
               | that. What I mean is that you will always hit edgecases
               | in integrations and you never have time to write new
               | ones.
               | 
               | I always felt that powershell was tab unfriendly, the
               | Get- prefix is hard to get used to. I may be wrong that
               | they have a good way to deal with one-off integrations in
               | a sane manner.
        
           | theamk wrote:
           | If you do this, this will break every program which takes
           | text based filenames on command line.. which is most of them.
           | It is an interesting idea, but I don't think it would be Unix
           | anymore.
        
             | jbverschoor wrote:
             | 'Unix' is too low level anyway.. Unix is about
             | reading/writing byte streams..
             | 
             | I don't think the "interactive shell" was meant for
             | scripting anyway. It's like writing your scripts in
             | Selenium or similar tools. Someone only needs to change the
             | structure or order of the webpage, and you have a problem,
             | depending on how you do your scraping / interacting with
             | the output.
             | 
             | No experience with powershell, but sounds great.
        
               | conradludgate wrote:
               | I don't think unix was ever truly about byte streams.
               | Make each program do one thing well. To do a new job,
               | build afresh rather than complicate old programs by
               | adding new "features".
               | 
               | Is the core point. Later editions went on to specify text
               | as the preferred language for these programs to
               | communicate in but I don't think that's key to upholding
               | the unix philosophy. It was just the easiest to work with
               | at the time
               | 
               | We just need to agree upon a common framework for these
               | programs to communicate with. There will definitely be a
               | lot of churn though
        
               | jbverschoor wrote:
               | > Make each program do one thing well. To do a new job,
               | build afresh rather than complicate old programs by
               | adding new "features".
               | 
               | That's a quote from '78, by Doug McIlroy, the
               | inventor of Unix pipes. Pipes are exactly that... reading
               | and writing bytestreams.
               | 
               | Also him:
               | 
               | - (ii) Expect the output of every program to become the
               | input to another, as yet unknown, program. Don't clutter
               | output with extraneous information. Avoid stringently
               | columnar or binary input formats. Don't insist on
               | interactive input.
               | 
               | Later:
               | 
               | Write programs that do one thing and do it well. Write
               | programs to work together. Write programs to handle text
               | streams, because that is a universal interface.
               | 
               | (https://homepage.cs.uri.edu/~thenry/resources/unix_art/c
               | h01s...)
               | 
               | I mean, come on. "ls" grew from a simple tool with 11
               | parameters to 58 parameters. source:
               | https://news.ycombinator.com/item?id=29568042 /
               | https://danluu.com/cli-complexity/
               | 
               | This only happens because people are scripting in their
               | ui. They shouldn't. "Unix admins" laugh at people who do
               | the exact same things within office or other GUI
               | solutions.
               | 
               | Using "text streams".. yes, for performance, a stream is
               | better. Same with SAX vs DOM. nobody likes SAX
        
           | geophile wrote:
           | Yes, exactly. A number of newer shells take this approach.
           | The one I wrote pipes File objects out of its ls command:
           | https://marceltheshell.org
        
             | jbverschoor wrote:
             | looks great
        
           | lordgroff wrote:
           | This can work with something like nushell, but obviously
           | breaks the entire current universe of coreutils.
           | 
           | In the normal world we can solve this problem without
           | breaking everything by adding --jsonout or similar to all the
           | coreutils and then we can have sanity by piping to jq.
        
             | [deleted]
        
             | AnIdiotOnTheNet wrote:
             | > This can work with something like nushell, but obviously
             | breaks the entire current universe of coreutils
             | 
             | Good, because these utilities suck. Half of them only
             | exist because the data is unstructured in the first place,
             | the other half are mostly made of parameters that only
             | exist for the same reason, and most of their names have no
             | apparent relation to what they do. It is time to move out
             | of the 1970s.
        
               | ilyash wrote:
               | Hi. Author of Next Generation Shell here. Totally agree.
               | Also UI of the shell is stuck and ignores pretty much
               | everything that happened in last decades.
               | 
               | Here is my plan for the UI:
               | https://github.com/ngs-lang/ngs/wiki/UI-Design
               | 
               | Edit: but I do try to keep interoperability with existing
               | bullshit.
        
             | epse wrote:
             | Not necessarily though, as filenames aren't required to be
             | valid strings, so that would break json syntax. And json
             | doesn't have a syntax for "just a blob of bytes", besides
             | the fact that wrapping bytes in text just to be decoded
             | back to bytes seems silly to me, but that's an opinion
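
A rough sketch of the problem epse describes, not something coreutils provides: JSON strings must be Unicode text, so a filename that is not valid UTF-8 needs a side encoding. The helper name `name_to_json` and the `name_b64` key are hypothetical choices made for this illustration.

```python
import base64
import json


def name_to_json(name: bytes) -> str:
    """Serialize one raw filename as a JSON object (hypothetical format)."""
    try:
        # Valid UTF-8 names can be emitted as ordinary JSON strings;
        # json.dumps escapes control characters such as newlines.
        return json.dumps({"name": name.decode("utf-8")})
    except UnicodeDecodeError:
        # Anything else is wrapped in base64, which is exactly the
        # "bytes in text" round trip the comment calls silly.
        encoded = base64.b64encode(name).decode("ascii")
        return json.dumps({"name_b64": encoded})


print(name_to_json(b"notes.txt"))
print(name_to_json(b"two\nlines"))
print(name_to_json(b"\xff\n"))
```

A consumer then has to check which key is present, which shows why a JSON flag alone would not make arbitrary filenames painless.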
        
       | amptorn wrote:
       | Why in the world does Unix allow newlines in a filename in the
       | first place? That's just such an obviously brain-damaged idea.
       | There's not a single rational use case for it, yet it breaks
       | nearly every text-based tool you could possibly imagine...
        
         | marcosdumay wrote:
         | Why would Unix go and add random restrictions to filenames?
         | 
         | And what text protocol requires you to just insert user data
         | without escaping or re-encoding? That looks badly broken. The
         | kind of broken that will give your entire system to a hacker
         | for encrypting and demanding ransom.
        
         | Latty wrote:
         | Unix filenames are just sequences of bytes, not defined as
         | strings. Most programs parse them as utf-8, but there is
         | nothing mandating that. Obviously that leads to problems.
        
           | amptorn wrote:
           | > Unix filenames are just sequences of bytes, not defined as
           | strings
           | 
           | "Write programs to handle text streams, because that is a
           | universal interface except for filenames which are opaque
           | binary"
        
           | ninkendo wrote:
           | One pedantic qualification: any byte except 0x2f (`/`) or
           | 0x00.
           | 
           | This actually rules out nearly any non-UTF8 character set
           | (besides ASCII.)
           | 
           | Quote from Linus, which reminds me of Henry Ford's "you can
           | have any color you want, so long as it's black":
           | 
           | > And that one true format is UTF-8. End of story. If you try
           | to talk to the kernel in UCS-2 or anything else, you _will_
           | fail.
           | 
           | https://lore.kernel.org/all/Pine.LNX.4.58.0402141827200.1402.
           | ..
        
             | jcranmer wrote:
             | > This actually rules out nearly any non-UTF8 character set
             | (besides ASCII.)
             | 
             | It doesn't--pretty much any character set that has seen
             | widespread use in the past few decades would be compatible.
             | Any single-byte charsets that are ASCII compatible (such as
             | most Windows CP* sets or the entire ISO-8859-* suite) would
             | work. Most Asiatic charsets (e.g., EUC-JP, Shift-JIS, Big5,
             | GBK) that use variable-width encodings follow the rule that
             | a lead byte in the 0x00-0x7f range is ASCII and trailing
             | bytes stay in the 0x40-0xff range, and so are themselves
             | compatible as well.
             | 
             | So actually the list of notable _incompatible_ charsets is
             | easier to write out: UTF-16, UTF-32, EBCDIC, and ISO-2022-*
             | charsets (which are mode-switching).
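
A small sketch of the compatibility rule discussed above, assuming the only bytes a path component must avoid are 0x00 and 0x2f. The function name `unsafe_bytes` and the sample characters are illustrative, and testing one character per charset is a spot check, not a proof for the whole charset.

```python
# Bytes that cannot appear inside a single path component.
FORBIDDEN = {0x00, 0x2F}


def unsafe_bytes(text: str, encoding: str) -> bool:
    """Return True if encoding `text` produces a NUL or '/' byte."""
    return any(b in FORBIDDEN for b in text.encode(encoding))


# One sample character per charset; none of them is '/' or NUL itself.
samples = {
    "iso-8859-1": "\u00e9",  # e with acute accent
    "koi8-r": "\u044f",      # Cyrillic ya
    "shift_jis": "\u3042",   # Hiragana a
    "euc_jp": "\u3042",
    "big5": "\u4e2d",        # CJK "middle"
    "gbk": "\u4e2d",
    "utf-16-le": "A",
    "utf-32-le": "A",
}

for encoding, char in samples.items():
    verdict = "unsafe" if unsafe_bytes(char, encoding) else "ok"
    print(f"{encoding:12} {verdict}")
```

The UTF-16 and UTF-32 samples fail because even a plain "A" is padded with NUL bytes, which matches the list of incompatible charsets given in the comment.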
        
               | ninkendo wrote:
               | Eh, fair enough. While you're correct, character sets
               | that are "ascii, but something custom when the high bit
               | is 1" are all just "ascii" to me, in that they are all
               | mutually incompatible for anything other than the first
               | 127 characters, and 8-bit encoding in general has been
               | ubiquitous for nearly as long as ascii has been defined.
               | (Meaning that when most people say "ascii", they're
               | actually referring to one of those encodings in
               | practice.)
               | 
               | Asiatic character sets are an interesting point though. I
               | wonder how common they were at the time of what Linus
               | wrote...
        
               | jcranmer wrote:
               | > While you're correct, character sets that are "ascii,
               | but something custom when the high bit is 1" are all just
               | "ascii" to me
               | 
               | Don't call them just "ASCII"--that only serves to confuse
               | people. Call them 8-bit ASCII-compatible charsets if you
               | need a collective noun, but note that they are very
               | different.
               | 
               | > (Meaning that when most people say "ascii", they're
               | actually referring to one of those encodings in
               | practice.)
               | 
               | Having actually worked on charset handling, when most
               | people say "ASCII", they _mean_ "ASCII" and not anything
               | else. If a document is _labeled_ as ASCII, then generally
               | it should be handled as Windows-1252. If a conversion
               | function claims to convert ASCII to something else, and
               | doesn't provide any error mechanism (which it really
               | should), then it usually means ISO-8859-1 aka Latin-1 aka
               | map each byte to the first 256 Unicode characters.
               | 
               | But I'd never see, e.g., a KOI8-R document referred to as
               | ASCII, nor anything that claimed to be ASCII assumed to
               | be a KOI8-R document.
               | 
               | > Asiatic character sets are an interesting point though.
               | I wonder how common they were at the time of what Linus
               | wrote...
               | 
               | https://4.bp.blogspot.com/-O4jXmTm7WWI/Tyw1As8jt7I/AAAAAA
               | AAI...
               | 
               | At the time he wrote that, the main Asiatic charsets for
               | Chinese and Japanese would have been more common than
               | UTF-8. Maybe Korean as well, although Linus's message is
               | around the time that UTF-8 overtook EUC-KR. In any case,
               | anyone who knew anything about character sets at the time
               | would have been well aware of Asiatic variable-width
               | character sets.
        
               | ninkendo wrote:
               | I appreciate your insight, but I just want to expand on
               | one point:
               | 
               | > Having actually worked on charset handling, when most
               | people say "ASCII", they mean "ASCII" and not anything
               | else.
               | 
               | Approximately zero people are referring to a true,
               | packed, 7-bit encoding when they say "ASCII". They're
               | nearly always talking about an 8-bit character set, and
               | in such cases, _something_ must happen when the high bit
               | is 1. (I've never seen one that plain ignores or uses
               | error glyphs for characters >127, although you likely
               | have more experience with this than I do.) This is why I
               | said people are referring to one of these encodings _in
               | practice_... because ascii is 7-bit, and approximately
               | everyone is talking about _some_ 8-bit encoding of one
               | form or another.
               | 
               | I would definitely agree that most wouldn't call KOI8-R
               | "ascii", but they may use the term "ascii" to describe
               | the first 128 characters of KOI8-R. (Notwithstanding if
               | it uses weird replacement characters like Shift_JIS does
               | with the backslash and the yen sign.) This is the reason
               | for my comment about how the weird "ascii + custom" all
               | just feels like ascii to me... if you stay below 128 it
               | literally is.
               | 
               | I'll modify my original statement thusly:
               | 
               | > This actually rules out nearly any character set that
               | isn't compatible with ASCII.
               | 
               | And add an addendum that if you don't use UTF-8, you
               | can't use unicode and will be stuck in code page/locale
               | hell.
        
               | int_19h wrote:
               | > I've never seen one that plain ignores or uses error
               | glyphs for characters >127
               | 
               | Reporting an error is the default behavior if you try to
               | decode such a string with the ASCII codec in Python and
               | .NET, at the very least.
               | 
               | The first 128 characters of KOI8-R are, of course, ASCII
               | (the "weird replacement characters" are, in fact,
               | explicitly allowed!). But a file encoded in KOI8-R is
               | only ASCII if it contains only those first 128 chars.
               | 
               | > if you don't use UTF-8, you can't use unicode and will
               | be stuck in code page/locale hell.
               | 
               | UTF-7 was a thing. It just turned out that nobody really
               | needed it.
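
For the Python half of that claim, a minimal sketch: the built-in `ascii` codec raises `UnicodeDecodeError` on any byte above 0x7f instead of guessing a code page. The helper name `try_ascii` and the sample text are made up for this example.

```python
def try_ascii(raw: bytes) -> str:
    """Decode strictly as ASCII, reporting failure instead of guessing."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return f"error at byte {exc.start}: {exc.reason}"


# A Russian word encoded as KOI8-R: every byte is above 0x7f.
koi8_text = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")

print(try_ascii(b"plain text"))  # pure ASCII decodes fine
print(try_ascii(koi8_text))      # rejected at the first byte
# Decoding with the codec the bytes were written in recovers the word.
print(koi8_text.decode("koi8-r") == "\u041f\u0440\u0438\u0432\u0435\u0442")
```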
        
             | dylan604 wrote:
             | I see your pedantic and raise you: UTF-8 isn't a font
             | though. It's a text encoding.
        
               | marklgr wrote:
               | String bets not allowed, whatever their encoding ;)
        
         | jl6 wrote:
         | I can't think of why you'd ever want a newline in a filename,
         | but it does make for easier reasoning about what characters (or
         | perhaps I should say bytes) could be found in filenames, as
         | opposed to having to remember a long list of exceptions.
        
         | tyingq wrote:
         | It is odd. Though tools like find have "-print0" for this
         | purpose. And corresponding input flags for xargs, perl, sort,
         | uniq, cut, head, etc, that accept NUL terminated vs newline
         | terminated lists.
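
To make the difference concrete, here is a sketch that parses a NUL-terminated list of the kind `find -print0` emits. The byte string is simulated rather than captured from a real command, and `split_nul` is a hypothetical helper name.

```python
from typing import List


def split_nul(raw: bytes) -> List[bytes]:
    """Split NUL-terminated records into raw filenames."""
    # The trailing terminator leaves one empty piece, which is dropped.
    return [name for name in raw.split(b"\0") if name]


# Simulated output for two files, one of which has a newline in its name.
simulated = b"./plain.txt\0./two\nlines.txt\0"

nul_names = split_nul(simulated)
# The same list with newline terminators, split the naive way.
newline_names = simulated.replace(b"\0", b"\n").splitlines()

print(len(nul_names), nul_names)
print(len(newline_names), newline_names)
```

The NUL-based split recovers two names, while the newline-based split reports three, because NUL is one of the two bytes a filename can never contain.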
        
         | bayindirh wrote:
         | I'm against limiting the character set allowed for file names.
         | macOS is also in the same boat with Linux, going one step
         | forward and allowing \null terminator even in the filenames.
         | 
         | If we're going to limit filenames' character sets, I can offer
         | a simpler solution:
         | 
         | Why allow file names? OS should provide a UUID for all files.
         | No names, nothing. We can just write which file is what to
         | another file, noting its UUIDs to sticky notes.
        
           | crispyambulance wrote:
           | > Why allow file names? OS should provide a UUID for all
           | files. No names, nothing.
           | 
           | On an application level that's sort of starting to happen.
           | It's annoying though. Sometimes you just need to know where
           | the actual F Apple put your photos (it's not obvious). If
           | different applications need to work with the same files, then
           | there's an annoying coordination problem if one application
           | tries to pretend that "files" don't exist and another needs a
           | file path.
           | 
           | Autodesk Fusion 360 tucks your projects into a cloud. I know
           | there's some local cache, but there's no need to think about
           | it because only Fusion-360 handles those "files" and I just
           | worry about my project assets as presented to me by the UI.
           | In that case, it's OK, but it also suggests a "walled-garden"
           | of files for each application.
        
           | pklausler wrote:
           | We could use SHA-256 for the UUIDs, map names to hashes in
           | special directory files, and build a source code control
           | system out of it too while we're at it.
        
             | jdblair wrote:
             | git outta here!
        
           | dragonwriter wrote:
           | > Why allow file names? OS should provide a UUID for all
           | files. No names, nothing. We can just write which file is
           | what to another file, noting its UUIDs to sticky notes.
           | 
           | But... isn't that what filesystems, in effect, already do?
           | Files have IDs, which are mapped to names in a separate
           | record. Having it in one common shared place for the whole
           | filesystem, and a common OS API that provides access to it
           | for all mounted filesystems, just makes things like useful,
           | user-friendly shells (graphical and text), and common
           | controls possible without everything user-facing needing a
           | separate UI constructed from scratch for each app's files.
        
           | feldrim wrote:
           | This is an old solution to a problem that does not exist.
           | Yes, in that case the file system can be a key-value store.
           | It would eliminate the need for a tree structure. But the
           | tree structure has a meaning: it adds context. The
           | directories are containers of files that add a semantic
           | abstraction to the files within.
           | 
           | https://devblogs.microsoft.com/oldnewthing/20110228-00/?p=11.
           | ..
        
             | gglitch wrote:
             | I'm with you on the directory tree, but like the idea of
             | files having both names and unique, autogenerated IDs.
             | 
             | Edit: _optionally_ having IDs.
        
               | feldrim wrote:
               | Windows allows you to have optional IDs.
        
             | wlib wrote:
             | Why do we impose hierarchy so much in file systems? We
             | already allow hard and soft links, so it's not even a tree
             | anyways. Why not just allow any reference types you want;
             | no name with extensions, but a set of tags. Why not
             | identify files the same way a graph database query
             | identifies nodes?
        
               | feldrim wrote:
               | So you propose a graph database for data structures,
               | without the persistence layer provided by the file
               | system, right?
        
               | sitharus wrote:
               | Because hierarchical structures and names are easy to
               | explain to most people. macOS has supported tagging for
               | ages, but I've never seen it used extensively or as a
               | complete alternative to tree structure.
        
         | mistrial9 wrote:
         | my imagined reason is -- because when that terrible day
         | happens, and an important file with some new name, does in fact
         | get a newline in it, the rest of the system now has predictable
         | code paths. Q. Is this related to perl, who knows
        
         | jagrsw wrote:
         | > yet it breaks nearly every text-based tool you could possibly
         | imagine
         | 
         | It breaks badly designed text protocols - some can argue that
         | it's a good idea - "crash early, crash loud" etc.
         | 
         | Also if your protocol breaks with newlines, it probably breaks
         | with other non-literals - brackets, quotes, NUL-bytes, control
         | characters, carriage return char, multibyte chars etc etc.
        
         | jlarocco wrote:
         | > That's just such an obviously brain-damaged idea.
         | 
         | Is it, though? "Every character except '/' because it's the
         | directory delimiter" seems pretty straightforward to me...
         | 
         | > There's not a single rational use case for it, yet it breaks
         | nearly every text-based tool you could possibly imagine...
         | 
         | You don't have a use case, but that doesn't mean nobody else
         | has one.
         | 
         | And as far as "text-based tools" go, their developers should
         | RTFM. I'm fairly sure UNIX existed before almost all of them,
         | and it's accepted newlines all along.
        
       | hericium wrote:
       | First example suggests that `ls` should not be used but `ls -l` -
       | the same program author advises against in the title, but with a
       | parameter - works as expected and in this case would not result
       | in "you can't tell".
       | 
       | > The problem is that from the output of ls, neither you or the
       | computer can tell what parts of it constitute a filename.
       | 
       | Computer does not use console output of ls(1) to determine the
       | list of files. It's for the user. The computer can tell what is a
       | file here.
       | 
       | The title could also be stricter with s/ls/"GNU coreutils ls"/g,
       | too. I could not reproduce all the issues with FreeBSD's ls(1)
       | under zsh.
        
         | Tepix wrote:
         | "ls -l" has other issues, now it will show you user and group
         | names which can contain unexpected characters, too.
        
           | [deleted]
        
         | vidarh wrote:
         | > First example suggests that `ls` should not be used but `ls
         | -l` - the same program author advises against in the title, but
         | with a parameter - works as expected and in this case would not
         | result in "you can't tell".
         | 
         | The first example is used to demonstrate the issue _and_ to
         | demonstrate that "-l" introduces other issues (inconsistent
         | escaping).
         | 
         | > Computer does not use console output of ls(1) to determine
         | the list of files. It's for the user. The computer can tell
         | what is a file here.
         | 
         | But if you try to use the output of ls in a script to find
         | filenames, the computer _will_ be using ls to determine the
         | list of files. Hence the advice not to do so.
         | 
         | > The title could also be stricter with s/ls/"GNU coreutils
         | ls"/g, too. I could not reproduce all the issues with FreeBSD's
         | ls(1) under zsh.
         | 
         | I think that just emphasises why you shouldn't, as it
         | demonstrates you can't trust the output of ls to be consistent
         | between systems either. If you are sure you'll never need to
         | run your scripts on another system, you might not care, but
         | when it's so easy to prevent this by e.g. using find with
         | "-print0" or equivalent, it seems silly to not just unlearn the
         | bad habit of using ls for this.
        
       | [deleted]
        
       | salmo wrote:
       | Some of these are why I bail for a "real language" in many
       | seemingly simple scenarios.
       | 
       | As soon as I care about datetimes, it's just easier to use stat()
       | and a proper datetime API.
       | 
       | I can treat filenames as byte arrays and translate to Unicode or
       | let the language do it for me.
       | 
       | In dire circumstances, find ... -print0 | xargs -0 second_script
       | is usually my fallback, but that has pitfalls as well.
       | 
       | Go has been a blessing there for me, not having to rely on a
       | runtime across diverse hosts. But that's a preference and doesn't
       | help on old kernels w/o epoll().
       | 
       | So many battle scars from inconsistency in Bash and GNU utilities
       | over the years, especially on Unixes' bundled versions (Solaris,
       | etc) or supporting GNU, BSD, SysV, and HP-UX in the same script.
       | Used to deploy a ksh88(ish) on all for SOME consistency.
       | 
       | Luckily now I'm not supporting anything but Linux anymore. When I
       | can't Go, then I just hijack some tool's bundled Ruby (eg
       | Puppet), Python, etc when I have to handle that and stick to the
       | standard library.
       | 
       | I am too lazy to C these days like I used to. I'm usually dealing
       | with an emergency (looking at you log4j) and don't have the
       | cycles to cover the gotchas there.
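
The "stat() plus a proper datetime API" approach might look roughly like this in Python's standard library (the comment itself prefers Go; Python is used here only for illustration). Passing a bytes path to `os.scandir` keeps the names as raw bytes. The function name `list_with_mtime` is an assumption of this sketch.

```python
import os
from datetime import datetime, timezone


def list_with_mtime(directory: bytes):
    """Yield (raw_name, mtime) pairs for the entries of `directory`.

    Names stay as bytes because the path argument is bytes; times are
    timezone-aware datetimes rather than text scraped from `ls -l`.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            info = entry.stat(follow_symlinks=False)
            mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            yield entry.name, mtime


if __name__ == "__main__":
    for name, mtime in list_with_mtime(b"."):
        # repr() shows newlines and non-UTF-8 bytes unambiguously.
        print(repr(name), mtime.isoformat())
```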
        
         | chasil wrote:
         | Any POSIX system has an easy way to remove a file with a
         | malformed/hostile name.
         | 
         | Determine the inode number with "ls -li", then remove the file
         | with "find . -inum # -delete".
         | 
         | The GNU stat utility makes finding the inode slightly easier.
         | This method is preferable when there is any doubt of wildcard
         | expansion.
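
The same idea can be sketched without depending on any particular find implementation. This assumes a Unix-like filesystem where inode numbers are meaningful; `remove_by_inode` is a hypothetical helper, and unlike `find .` it inspects only one directory and does not recurse.

```python
import os


def remove_by_inode(directory: str, inode: int) -> int:
    """Unlink non-directory entries in `directory` with the given inode.

    Returns the number of entries removed. The filename is never typed
    or parsed, so odd characters in it cannot cause trouble.
    """
    removed = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.inode() == inode and not entry.is_dir(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
    return removed
```

A call such as `remove_by_inode(".", 123456)` would use the number read from the first column of `ls -li`; the value here is a placeholder.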
        
           | scbrg wrote:
           | Unless I'm mistaken, POSIX _find_ does not have _-delete_
        
             | 10000truths wrote:
             | It might not be in the standard, but pretty much every
             | implementation that I know of supports it - GNU, BSD,
             | Solaris, even busybox and toybox.
        
             | chasil wrote:
             | Alas, you are right, my mistake. It doesn't have inum
             | either. Those GNUisms do creep in.
             | 
             | https://pubs.opengroup.org/onlinepubs/9699919799/utilities/
             | f...
        
               | cogburnd02 wrote:
               | GNU ls also has options to automatically add quoting to
               | its output.
        
       ___________________________________________________________________
       (page generated 2021-12-31 23:01 UTC)