[HN Gopher] You shouldn't parse the output of ls(1)
___________________________________________________________________
You shouldn't parse the output of ls(1)
Author : tosh
Score : 113 points
Date : 2021-12-31 11:32 UTC (11 hours ago)
(HTM) web link (mywiki.wooledge.org)
(TXT) w3m dump (mywiki.wooledge.org)
| Tepix wrote:
| Very valuable points that are too easily forgotten. Thanks
| gorgoiler wrote:
| Greg aka greycat was a real IRC legend 20 years ago. I learned so
| much from him.
|
| Many a happy hour did I watch him flaming lazy newcomers looking
| for a quick fix in #debian, right about the time when Linux as a
| commercially viable server platform was taking off.
|
| Almost every admonishment was accompanied by sound technical
| advice which was useful to lurkers as well as the unfortunate
| noob who dared ask.
|
| Thanks :)
| ByThyGrace wrote:
| Greg Wooledge's bash wiki is my go-to resource for bash
| scripting. Everything I always need to find out is in there
| (Bash Guide + FAQ). I didn't know about his IRC persona which
| only improves my appreciation of him, so thanks for sharing.
| dsr_ wrote:
| On occasion I have posted something in debian-user, adding "but
| Greg will have a better approach".
|
| Then he shows up and offers a better approach.
|
| Thanks to Greg, my bashrc contains:
|
|   > stat=( )
|   > statcolor=("$Green" "$Red")
|   > ...
|   > PS1=...
|   > ${statcolor[!!$?]}\]${stat[!!$?]}$
|
| Which, if it's not entirely clear, puts a green checkmark or a
| red x in my prompt depending on the error value of the last run
| commandline.
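
The `!!$?` subscript above works because bash evaluates array subscripts as arithmetic, and double negation maps 0 to 0 and any other exit status to 1. A minimal sketch of the same trick in plain sh, assuming hypothetical helper names (`status_index`, `status_mark`) and plain-text marks in place of the glyphs and colors from the bashrc:

```sh
#!/bin/sh
# Collapse an exit status to 0 (success) or 1 (any failure).
# status_index is a hypothetical helper, not part of the quoted bashrc.
# The space between the two "!" avoids history expansion if pasted
# into an interactive bash.
status_index() {
    echo $(( ! ! $1 ))
}

# Pick a plain-text mark for a given exit status.
status_mark() {
    case $(status_index "$1") in
        0) echo ok ;;
        *) echo FAIL ;;
    esac
}

status_mark 0    # prints ok
status_mark 2    # prints FAIL
```

Normalizing first means the lookup only ever needs two entries, however many distinct error codes commands return.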
| marcosdumay wrote:
| Oh, a long time ago (but not so long as that) I got this line
| from a HN thread on bash tricks:
|
|   export PS1="\h:\w \$(if [ \$? = 0 ]; then echo :\\\); else echo :\\\(; fi) \$ "
|
| It's a non-colored version of it, with a happy or sad smiley.
| Fnoord wrote:
| Ah, greycat. Yeah I remember him from #debian on Freenode some
| 20 years ago. Smart, helpful fellow.
| thibran wrote:
| If common CLI programs had a --json option this would be
| no problem.
| jmnicolas wrote:
| I have a bash alias that creates a random playlist of videos or
| music with ls. I noticed that sometimes there were duplicates in
| the list.
|
| If I can't use ls, it's not going to be a one liner anymore, so I
| have to create a file, store it somewhere, assign execute
| privileges and link my alias to it. Much more complicated.
| theamk wrote:
| Um, no? As the examples show, "find -maxdepth" can do the same
| things, but safely. And in most cases it'd still be a one-liner,
| even if a longer one.
| rascul wrote:
| > so I have to create a file, store it somewhere, assign
| execute privileges and link my alias to it
|
| Or put it in a function in the same file your alias is in.
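
A sketch of what such a function might look like, for the random-playlist case. It assumes GNU-style `find -maxdepth`/`-print0`, `shuf -z` and `xargs -0 -r`; `play_shuffled` is a made-up name, and the player is whatever command the caller passes in.

```sh
#!/bin/sh
# Run a command with the regular files of a directory in random order.
# play_shuffled is a hypothetical name; the command to run is given by
# the caller. Names are passed NUL-terminated end to end, so spaces and
# newlines in filenames survive.
# Assumes find -maxdepth/-print0, shuf -z and xargs -0 -r are available.
play_shuffled() {
    dir=$1
    shift
    find "$dir" -maxdepth 1 -type f -print0 | shuf -z | xargs -0 -r "$@"
}

# Usage (mpv is only an example player):
#   play_shuffled ~/Music mpv --
```

The function can sit next to the alias in the same rc file, so no separate script or execute bit is needed. One caveat: a player started by xargs does not get the terminal on its stdin, which matters for players with keyboard controls.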
| dvh wrote:
| In a similar way, ftp's "dir" command is only for humans. Every
| ftp library that programs use to talk to an FTP server is only
| guessing which part of the "dir" output is the filename.
| renewiltord wrote:
| My file system is _my_ file system. I solve this problem by just
| not having weird file names on it.
| ape4 wrote:
| It's pretty trivial to make a C program that lists a directory in
| the format you want.
| yagop wrote:
| Parsing UNIX command outputs is generally a pain and constantly a
| source of errors. PowerShell mostly solves that; I wish we could
| use it.
| enriquto wrote:
| I disagree.
|
| Parsing the textual output of ls is such a natural idiom that I'm
| happy to renounce any other thing that causes trouble. Give me a
| "-o sanenames" option for mount, instead.
| spicybright wrote:
| I think the point is ls has so many options, it's not safe to
| parse as is.
|
| I always use `find .` if I need a list of files from a
| directory for this reason
| johnisgood wrote:
| Yeah. I often do "find . | grep 'foo'". Perhaps "find" can do
| it without the "| grep" bit, but I have not RTFM. :P
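
find can do the filtering itself: `-name` tests the last path component and `-path` tests the whole path, both with shell patterns rather than regular expressions. A small sketch, with hypothetical helper names:

```sh
#!/bin/sh
# List entries under directory $1 whose name contains the substring $2,
# using find's own -name test instead of piping to grep.
# find_by_name and find_by_path are hypothetical helper names.
find_by_name() {
    find "$1" -name "*$2*"
}

# Same idea, but matched against the whole path, which is closer to
# what `find . | grep foo` does.
find_by_path() {
    find "$1" -path "*$2*"
}

# Usage:
#   find_by_name . foo
#   find_by_path . foo
```

Because the match happens inside find, a newline in a filename cannot split one entry into two candidate lines the way it can with a grep pipeline.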
| rightbyte wrote:
| Ye that is what the maintainer said for the -z option for ls
| too.
| beermonster wrote:
| I use `echo *`
| vidarh wrote:
| `echo *` has many of the downsides of ls (doesn't escape
| e.g. space in filenames) and _additionally_ breaks on
| directories where the expansion fills the command line
| buffer.
|
| EDIT: Also note that "find ..." is also not safe from all
| the quoting issues without "-print0" or equivalent options
| to make it separate the names with ASCII NUL rather than
| linefeed or otherwise taking steps to handle filenames with
| actual linefeeds in them.
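
To illustrate the point about NUL separators, here is a minimal sketch that counts files by counting NUL terminators rather than lines. `count_files` is a hypothetical helper, and the sketch assumes a find that supports `-print0`.

```sh
#!/bin/sh
# Count regular files under a directory by counting NUL terminators,
# so a newline inside a name cannot inflate the result.
# count_files is a hypothetical helper; assumes find supports -print0.
count_files() {
    find "$1" -type f -print0 | tr -cd '\000' | wc -c
}

# Usage:
#   count_files /some/dir
```

With newline-separated output (`find ... | wc -l`), a single file whose name contains a linefeed would be counted twice.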
| xorcist wrote:
| Indeed. Those backticks only work in a shell, but in a
| shell why not just write *
|
| which is how filename expansion is supposed to work.
| beermonster wrote:
| The back ticks were supposed to be quotes, I meant to say
|
| echo *
|
| However as per replies, this suffers from the same issues
| as ls.
| vidarh wrote:
| The problem is not the backticks. They were just used as
| quote characters. The problem is that shell expansion
| doesn't escape the characters. E.g. this is a cut down
| output from my system now after I did an echo >'/tmp/ space ':
|
|   $ echo /tmp/*
|   /tmp/bspwm_0_0-socket /tmp/config-err-Q667kI /tmp/foolog /tmp/ space /tmp/...
|
| Parse that output and you get a broken list of filenames.
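
The glob itself is not the problem: the expansion is only lossy once it is flattened into text by echo. Kept inside the shell, each match stays one word. A minimal sketch, with a hypothetical `count_entries` helper:

```sh
#!/bin/sh
# Count directory entries with a glob. Each match stays a separate word,
# whatever bytes the name contains. count_entries is a hypothetical helper.
# Like `echo *`, this skips names that start with a dot.
count_entries() {
    set -- "$1"/*
    # An unmatched glob is left as the literal pattern; treat that as empty.
    if [ ! -e "$1" ] && [ ! -L "$1" ]; then
        echo 0
        return
    fi
    echo "$#"
}

# The same idea as a loop; "$f" must stay quoted:
#   for f in /tmp/*; do printf 'entry: %s\n' "$f"; done
```

The '/tmp/ space ' file from the example above counts as exactly one entry here, because no whitespace-separated text is ever re-parsed.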
| traceroute66 wrote:
| find or indeed, the Rust-based fd[1] which is infinitesimally
| faster.
|
| [1]https://github.com/sharkdp/fd
| vidarh wrote:
| If you're listing just the filenames, the things that make
| fd fast (the parallelised directory traversal when you do
| something that requires stat calls or similar) are
| irrelevant, as the getdents() calls are going to be more
| affected by your buffer size.
|
| So for the limited subset of tasks where you're ok with
| using a tool that might not be installed _and_ need options
| that require stat calls _and_ the directories may be large
| enough, it might make a difference.
| mellavora wrote:
| infinitesimally
|
| https://en.wikipedia.org/wiki/Infinitesimal
|
| "In mathematics, an infinitesimal or infinitesimal number
| is a quantity that is closer to zero than any standard real
| number, but that is not zero."
|
| I'll blame this one on auto-correct.
| pkrumins wrote:
| It's not an idiom. It's a fireable offense.
| rightbyte wrote:
| Ye I mean when doing sysadmin stuff you know to avoid asking
| for it with, like, filenames with spaces. Why even bother
| handling newlines or what not.
| mattowen_uk wrote:
| Dead link (HN hug of death?)
|
| Cached copy:
|
| https://webcache.googleusercontent.com/search?q=cache:eZI_am...
| tyingq wrote:
| This is one reason Perl was very popular even before CGI was a
| thing. You could get to things like stat() with an interpreted
| language that was very portable. It also has the "-0" flag to
| accept the null terminated output of "find -print0".
| artemonster wrote:
| I wonder what the world would look like if all standard unix tools
| gave two outputs: human readable and structured, json-like.
| int_19h wrote:
| FreeBSD is trying for something like that.
|
| https://libxo.readthedocs.io/en/latest/
| xorcist wrote:
| Perhaps take git as an example instead, with its plumbing and
| porcelain commands.
|
| They have historically been easier to keep backwards
| compatibility with than the dict-like structures of json.
| [deleted]
| rnestler wrote:
| I'd recommend taking a look at https://www.nushell.sh/ which
| has structured output, but displays it neatly when printing to
| the terminal.
| Kim_Bruning wrote:
| You don't need to wonder, because jc is a filter that does just
| that!
|
| https://kellyjonbrazil.github.io/jc/
| robert_tweed wrote:
| It even claims to parse ls output correctly (see caveat):
| https://kellyjonbrazil.github.io/jc/docs/parsers/ls
| kortex wrote:
| > >>> import jc.parsers.dig
|
| I think "jc" stands for "jesus christ!" because I just
| exclaimed that out loud thinking about the amount of time
| I've wasted trying to parse dig outputs, or something
| similar. Spent a nontrivial amount of time looking for
| lightweight tools to convert the typical "fwf" of coreutils
| style programs.
|
| Definitely running "pipx install jc" immediately (pipx is
| great for managing python-based executable programs, avoid
| the mess of venvs).
| wayoutthere wrote:
| Then it would be PowerShell.
| disgruntledphd2 wrote:
| But with shorter command names, and gnomic short options.
| laumars wrote:
| No. Powershell is a whole new CLI user land as well as a
| shell. If you want something that's compatible with POSIX but
| still has smart pipelines and native support for JSON then
| you're better off with Elvish or Murex as shells.
| nailer wrote:
| Just add JSON output to ls like other tools have.
| Tepix wrote:
| Nul-terminated strings would be the more desirable option, as
| some unix tools such as find and xargs have already offered for
| decades.
| yholio wrote:
| These all seem to be ls bugs. It's a common pattern when
| outputting data to format it such that the receiver can
| unambiguously separate the data from the formatting. If you use
| CR/LF in your output formatting, then those characters need to be
| escaped in the data. If your attacker can deceive you into
| printing fake output by crafting their filename as :
|
| "\n -rw-r--r-- 1 user group 12 Dec 15:55 mostly_harmless_planet"
|
| ...then you have already lost.
|
| Violating this pattern always leads to problems like format
| string vulnerabilities, SQL or executable injections etc. As the
| long history of fighting against these problems shows, "banning
| weird characters" without fixing the bugs will always lead to
| problems, some apparently harmless characters find devious uses
| etc. You can't unscramble eggs.
|
| The only real solution is properly escaping the payload so that
| it can be unambiguously interpreted. And you can't claim that the
| authors of 'ls' don't expect their output to be consumed by other
| programs.
| josephcsible wrote:
| Interestingly, ls does escape characters like \n in its output
| when it's printing to a terminal, but not when it's being piped
| into other programs. Try this by making a file with a newline
| in its name, and then comparing "ls" with "ls | cat".
| rvieira wrote:
| Isn't parsing `ls` the whole backbone of Emacs' `dired`?
| tyingq wrote:
| If you <ctrl-f> and search for "newline", you can see some of
| the hackery they do to get around newlines in file names:
|
| https://github.com/emacs-mirror/emacs/blob/master/lisp/dired...
| asicsp wrote:
| See also: Why _not_ parse `ls` (and what to do instead)?
|
| https://unix.stackexchange.com/questions/128985/why-not-pars...
| NumberWangMan wrote:
| "Quoted string notation"
| (https://www.oilshell.org/release/latest/doc/qsn.html) seems like
| a good way to solve this problem.
| chubot wrote:
| (author here) Yes thanks, that is exactly the point!
|
| As I point out at the end of the doc, coreutils ls actually
| started quoting the names in 2016. However the format is
| confusing for people who can't read 2 or 3 types of shell
| strings, and not that readable.
|
| In contrast, QSN is simply Rust string literal syntax, which
| are a cleaned up version of C string literal syntax.
|
|   $ touch $'foo\nbar' 'dq"dq' "sq'sq"  # create 3 files with newline, double quote, single quote
|
|   # coreutils is correct, though I'm not sure people will understand $'\n'
|   $ ls
|   'dq"dq'  eggs  'foo'$'\n''bar'  "sq'sq"
|
| Pipe through cat mangles the name:
|
|   $ ls|cat
|   dq"dq
|   foo
|   bar
|   sq'sq
|
| In Oil, write --qsn will ALWAYS give you 5 lines if you have 5
| names, no matter what they are
|
|   $ oil -c 'write --qsn -- *'
|   'dq"dq'
|   'foo\nbar'  # more familiar encoding
|   'sq\'sq'
|
| Without --qsn it's like ls|cat:
|
|   $ oil -c 'write -- *'
|   dq"dq
|   foo
|   bar
|   sq'sq
|
| I think it's important for something like QSN to be built into
| the shell, because quoting issues arise in many places, not
| just filenames and ls.
|
| Although this makes me think that we should have the inverse of
| `printf %q` to parse the output of coreutils ls. Oil does
| implement printf %q, but most people don't know about it.
|
|   $ printf '%q\n' -- *
|   dq\"dq
|   $'foo\nbar'
|   sq\'sq
|
| Again it is actually correct, but sort of a grab bag of formats
| derived from shell strings. QSN strings will be familiar to
| anyone using Python, Rust, etc. consistent with Oil's slogan:
| _It's for Python and JavaScript users who avoid shell!_
|
| -----
|
| edit: Also reminds me that I wrote this page before designing
| QSN:
| https://github.com/oilshell/oil/wiki/Shell-Almost-Has-a-JSON...
|
| So printf %q and %b are inverses in bash, but this doesn't work
| in other shells. QSN can represent NUL bytes, which are illegal
| in filenames, but are useful elsewhere.
| unilynx wrote:
| More importantly, we need to get rid of the ability to put line
| feeds, tabs in file names and also disallow odd starting
| characters such as tab, dash and $
|
| I wish someone would add a mount option for that and have eg
| fedora be a trailblazer to fix the few apps that break
| Tepix wrote:
| It would be almost trivial to create a fuse filesystem which
| completely hides these files if they exist (and doesn't allow
| the creation of new ones).
| patrec wrote:
| And how would that be even remotely useful? Unless something
| changed recently, FUSE has so much overhead it's only useful
| for niche applications and prototyping.
| laumars wrote:
| It would be trivial to code but any such wrapper would add
| overhead to file system operations. FUSE is a fantastic set
| of APIs (I've used it personally) and performs remarkably
| well considering it is constantly swapping memory between
| kernel and user space but for wide spread adoption any
| feature like this would need to be part of the native file
| system options.
| ezoe wrote:
| I hope you won't be in the position of handling the non-ascii
| file names. Whitespaces, symbols and other complicated glyphs
| are widely used in file name since Windows 95.
| ravenstine wrote:
| Or... ls could just escape filenames.
|
| I'm not sure why it seemingly doesn't in the year 2021.
| dspillett wrote:
| From "man ls":
|
|   -b, --escape
|          print C-style escapes for nongraphic characters
|
| I'd also add:
|
|   -1     list one file per line. Avoid '\n' with -q or -b
|
| to make sure you can easily split the list by just the EOL,
| in case for some reason it thinks it is talking to a terminal
| and tries to format things for a human.
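
A small sketch of that combination, assuming an ls that supports `-b` as in the man page quoted above (GNU coreutils does). `count_with_ls` is a hypothetical helper. The names come back escaped, so anything beyond counting would still need an unescaping step, which is part of why the linked article recommends globs or find instead.

```sh
#!/bin/sh
# Count entries using ls -1b: one entry per line, with a newline inside a
# name printed as a C-style escape instead of a raw line break.
# count_with_ls is a hypothetical helper; assumes ls supports -b.
count_with_ls() {
    ls -1b -- "$1" | wc -l
}

# Usage:
#   count_with_ls /some/dir
```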
| harry8 wrote:
| ls -Q
|
| --quoting-style=xxx
|
| ?
| patrec wrote:
| A thousand times this. There is absolutely no reason to allow
| newlines in filenames, and it is pathetic that there isn't even
| yet a mount option to disallow totally idiotic filenames (at
| the minimum I don't want programs to create filenames with
| newlines or invalid utf-8).
| throwawaylinux wrote:
| It's great that file names in the user/kernel ABI are treated
| as unstructured NUL terminated byte streams so I can do what
| I want with my file names even if you don't like it. And you
| can do what you want with yours, including not creating ones
| you think are idiotic, or using filesystems with code or
| options that restrict what names can be used.
| CorrectHorseBat wrote:
| But is there any real use case for that? For me I've only
| encountered this when something else went wrong, I'd rather
| have an error at that time than later trying to find out
| what this garbage is and how to remove it.
|
| So why not give a mount option for this behavior?
| throwawaylinux wrote:
| > But is there any real use case for that?
|
| Compatibility, at least. Which is actually a big one and
| is basically never broken in Linux.
|
| > So why not give a mount option for this behavior?
|
| Not sure, maybe just nobody yet cared enough to code it
| up and submit it for inclusion. It's never caused me
| problems.
| patrec wrote:
| Can you give a plausible use case? Filenames can't be
| arbitrary bytes anyway, since they cannot contain '\0' and
| '/'. What's a realistic example where it's really useful to
| be able to stuff arbitrary bytes into a filename, just not
| '/' or '\0' and where C or URL escaping would somehow be
| onerous enough to justify all the other problems these
| pathological filenames create?
|
| Do you really think disallowing pathological filenames (at
| least as a mount option) would be more expensive than the
| countless security exploits that allowing them has already
| caused, or the massive tax that nearly all software that tries
| to deal with filenames robustly needs to pay for it?
|
| Forget shell scripting, almost no software can afford to
| just pretend filenames are arbitrary bytes.
|
| They typically still somehow need to be displayed to and be
| editable by end users somewhere along the line, and this
| means (in unix-based systems) some conversion to-and-from
| utf-8. Which is going to cause problems[1].
|
| And even if you don't directly need to handle this yourself
| (but you do, even for a simple shell script or command-line
| utility or a library that wants to provide an error message
| with a filename), there is now a whole lot of extra bloat
| and complexity and edge cases no one handles in practice.
| With weirdo types like special filename strings, which are
| neither bytes nor proper unicode, like python's unicode
| surrogate encoding (which effectively leaks into all text
| handling). And of course different languages and ecosystems
| solve it differently (e.g. whereas python bends its
| general unicode string for this, Rust has an OsString).
|
| [1] Even the utf-8 compatible subset causes problems of
| course. E.g. if you have a terminal program that needs to
| display untrusted filenames to an end user, you now have to
| deal with problems like terminal escape injection via
| filenames.
| throwawaylinux wrote:
| Compatibility. And a mount option seems fine if you don't
| need compatibility.
|
| That does not relieve applications of the requirement to
| robustly handle paths and file names though.
|
| > Forget shell scripting, almost no software can afford
| to just pretend filenames are arbitrary bytes.
|
| Much non-script software can actually treat file names as
| arbitrary bytes and just pass them through its typical
| input and output mechanisms. Shells and terminals are
| very special classes of application, and they need a lot
| of I/O sanitization whether or not the filesystem
| restricts file names.
| laumars wrote:
| That's not a bad suggestion per se but it would only work for Linux
| and thus you still have other POSIX systems that wouldn't
| follow suit.
|
| So the advice here of not parsing ls is still prudent.
| cranekam wrote:
| Why? The name of a file is none of the filesystem's business.
| If users choose names that make using software difficult it's
| on them. It's not like there aren't ways to handle any kind of
| "weird" character in a file name, as the linked article states.
|
| Furthermore, if the kernel/filesystem starts prohibiting
| certain characters this is more code to maintain and test. User
| space programs that previously worked fine will stop working.
| All of this just to prevent someone shooting themselves in the
| foot by misunderstanding how filenames should be manipulated.
| unilynx wrote:
| I already can't use NUL and slashes in a file name. And win32
| limits me even more. It's always been a compromise
|
| And the number of feet that have been shot by weird file
| names is staggering.
|
| Programs will stop working, but that's why we need a bleeding
| edge distribution to find them. In the short term things will
| break, in the long term quality will increase. Just like
| memory protection broke some DOS apps in the short term
| laumars wrote:
| If you're looking at file systems in the purest possible
| sense, you'd be right. But equally a lot of other meta data
| is stored beyond the inode that file systems "understand".
| And there is already precedent for file systems having code
| to intelligently handle file names (eg options for case
| sensitivity). So pragmatically it's not an unreasonable
| suggestion to house any code defining legal file names in the
| file system driver too.
|
| The biggest argument against that in my view isn't down to
| testing but DRY methodologies: if the code sits in the kernel
| then it should work against all file systems and not just
| supported ones.
| tomrod wrote:
| In practical reality, the name of the file is the
| filesystem's business. It would be nice if it operated
| similar to cloud filesystems, where you could version based
| on GUID that is disconnected from the filename, but the
| practical reality is that users and developers have long
| accepted the local operational mode.
| charcircuit wrote:
| Try and name a file "(//^-^//)"
|
| You can't because certain characters are prohibited.
| jbverschoor wrote:
| Nah.. we need to use object graphs as streams instead of
| whitespace "(un)parsable" text. The output to the console (ui)
| or gui (ui) can be different, but the data should be structured
| notreallyserio wrote:
| Sounds like Powershell to me. I'm down, as long as the syntax
| is as simple and terse as on UNIX-based systems and not what
| Microsoft did (were they paid by the character for flag
| names?)
| gerdesj wrote:
| Absolutely. For example: why can't
| "Get-TrustAuthorityKeyProviderClientCertificateCSR" simply be
| "takpccc" as $DEITY intended?
|
| If your keyboard's tab key is still legibly labelled then
| you aren't trying hard enough or have an eidetic memory and
| fast typing skills!
| notreallyserio wrote:
| Amusingly HN cuts off the end of the command you typed, I
| assume using css overflow attributes (don't have an easy
| way to tell on my UA). I assume it stops at "cate"[0]. I
| see this sort of chopping a lot, which naturally makes
| sharing PS commands frustrating -- although there may be
| workarounds like using `backticks`.
|
| 0: Nope, had to paste it to see it ends with "cateCSR".
| marcosdumay wrote:
| They could at least change the names order and start with
| the specific part
| (TrustAuthorityKeyProviderClientCertificateCSR-Get), so
| the (braindead) MS version of tab completion would be
| useful.
| emj wrote:
| That is basically integrations, there is never going to
| be nice integrations to my Cobol mainframe linked to a
| Springboot fuzzbuzz. As is stated in other comments the
| big issue is usually about being cross platform, and that
| is a subset of the ls problem: Most of the time you have
| control over your inputs, until you haven't. This is true
| for every language even Python which is obnoxious about
| that. What I mean is that you will always hit edgecases
| in integrations and you never have time to write new
| ones.
|
| I always felt that powershell was tab unfriendly, the
| Get- prefix is hard to get used to. I may be wrong that
| they have a good way to deal with one-off integrations in
| a sane manner.
| theamk wrote:
| If you do this, this will break every program which takes
| text based filenames on command line.. which is most of them.
| It is an interesting idea, but I don't think it would be Unix
| anymore.
| jbverschoor wrote:
| 'Unix' is too low level anyway.. Unix is about
| reading/writing byte streams..
|
| I don't think the "interactive shell" was meant for
| scripting anyway. It's like writing your scripts in
| Selenium or similar tools. Someone only needs to change the
| structure or order of the webpage, and you have a problem,
| depending on how you do your scraping / interacting with
| the output.
|
| No experience with powershell, but sounds great.
| conradludgate wrote:
| I don't think unix was ever truly about byte streams.
|
| > Make each program do one thing well. To do a new job,
| > build afresh rather than complicate old programs by
| > adding new "features".
|
| is the core point. Later editions went on to specify text
| as the preferred language for these programs to
| communicate in but I don't think that's key to upholding
| the unix philosophy. It was just the easiest to work with
| at the time
|
| We just need to agree upon a common framework for these
| programs to communicate with. There will definitely be a
| lot of churn though
| jbverschoor wrote:
| > Make each program do one thing well. To do a new job,
| build afresh rather than complicate old programs by
| adding new "features".
|
| That's a quote from '78, by Doug McIlroy, the
| inventor of Unix pipes. Pipes are exactly that... reading
| and writing bytestreams.
|
| Also him:
|
| - (ii) Expect the output of every program to become the
| input to another, as yet unknown, program. Don't clutter
| output with extraneous information. Avoid stringently
| columnar or binary input formats. Don't insist on
| interactive input.
|
| Later:
|
| Write programs that do one thing and do it well. Write
| programs to work together. Write programs to handle text
| streams, because that is a universal interface.
|
| (https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch01s...)
|
| I mean, come on. "ls" grew from a simple tool with 11
| parameters to 58 parameters. source:
| https://news.ycombinator.com/item?id=29568042 /
| https://danluu.com/cli-complexity/
|
| This only happens because people are scripting in their
| ui. They shouldn't. "Unix admins" laugh at people who do
| the exact same things within office or other GUI
| solutions.
|
| Using "text streams".. yes, for performance, a stream is
| better. Same with SAX vs DOM. nobody likes SAX
| geophile wrote:
| Yes, exactly. A number of newer shells take this approach.
| The one I wrote pipes File objects out of its ls command:
| https://marceltheshell.org
| jbverschoor wrote:
| looks great
| lordgroff wrote:
| This can work with something like nushell, but obviously
| breaks the entire current universe of coreutils.
|
| In the normal world we can solve this problem without
| breaking everything by adding --jsonout or similar to all the
| coreutils and then we can have sanity by piping to jq.
| [deleted]
| AnIdiotOnTheNet wrote:
| > This can work with something like nushell, but obviously
| breaks the entire current universe of coreutils
|
| Good, because these utilities suck. Half of them only
| exist because the data is unstructured in the first place,
| the other half are mostly made of parameters that only
| exist for the same reason, and most of their names have no
| apparent relation to what they do. It is time to move out
| of the 1970s.
| ilyash wrote:
| Hi. Author of Next Generation Shell here. Totally agree.
| Also UI of the shell is stuck and ignores pretty much
| everything that happened in last decades.
|
| Here is my plan for the UI:
| https://github.com/ngs-lang/ngs/wiki/UI-Design
|
| Edit: but I do try to keep interoperability with existing
| bullshit.
| epse wrote:
| Not necessarily though, as filenames aren't required to be
| valid strings, so that would break json syntax. And json
| doesn't have a syntax for "just a blob of bytes", besides
| the fact that wrapping bytes in text just to be decoded
| back to bytes seems silly to me, but that's an opinion
| amptorn wrote:
| Why in the world does Unix allow newlines in a filename in the
| first place? That's just such an obviously brain-damaged idea.
| There's not a single rational use case for it, yet it breaks
| nearly every text-based tool you could possibly imagine...
| marcosdumay wrote:
| Why would Unix go and add random restrictions to filenames?
|
| And what text protocol requires you to just insert user data
| without escaping or re-encoding? That looks badly broken. The
| kind of broken that will give your entire system to a hacker
| for encrypting and demanding ransom.
| Latty wrote:
| Unix filenames are just sequences of bytes, not defined as
| strings. Most programs parse them as utf-8, but there is
| nothing mandating that. Obviously that leads to problems.
| amptorn wrote:
| > Unix filenames are just sequences of bytes, not defined as
| strings
|
| "Write programs to handle text streams, because that is a
| universal interface except for filenames which are opaque
| binary"
| ninkendo wrote:
| One pedantic qualification: any byte except 0x2f (`/`) or
| 0x00.
|
| This actually rules out nearly any non-UTF8 character set
| (besides ASCII.)
|
| Quote from Linus, which reminds me of Henry Ford's "you can
| have any color you want, so long as it's black":
|
| > And that one true format is UTF-8. End of story. If you try
| to talk to the kernel in UCS-2 or anything else, you _will_
| fail.
|
| https://lore.kernel.org/all/Pine.LNX.4.58.0402141827200.1402...
| jcranmer wrote:
| > This actually rules out nearly any non-UTF8 character set
| (besides ASCII.)
|
| It doesn't--pretty much any character set that has seen
| widespread use in the past few decades would be compatible.
| Any single-byte charsets that are ASCII compatible (such as
| most Windows CP* sets or the entire ISO-8859-* suite) would
| work. Most Asiatic charsets (e.g., EUC-JP, Shift-JIS, Big5,
| GBK) that use variable-width encodings follow the rule that
| single-byte characters in the 0x00-0x7f range are ASCII and the
| subsequent bytes of a multi-byte character are in the 0x40-0xff
| range, and so are themselves compatible as well.
|
| So actually the list of notable _incompatible_ charsets is
| easier to write out: UTF-16, UTF-32, EBCDIC, and ISO-2022-*
| charsets (which are mode-switching).
| ninkendo wrote:
| Eh, fair enough. While you're correct, character sets
| that are "ascii, but something custom when the high bit
| is 1" are all just "ascii" to me, in that they are all
| mutually incompatible for anything other than the first
| 128 characters, and 8-bit encoding in general has been
| ubiquitous for nearly as long as ascii has been defined.
| (Meaning that when most people say "ascii", they're
| actually referring to one of those encodings in
| practice.)
|
| Asiatic character sets are an interesting point though. I
| wonder how common they were at the time of what Linus
| wrote...
| jcranmer wrote:
| > While you're correct, character sets that are "ascii,
| but something custom when the high bit is 1" are all just
| "ascii" to me
|
| Don't call them just "ASCII"--that only serves to confuse
| people. Call them 8-bit ASCII-compatible charsets if you
| need a collective noun, but note that they are very
| different.
|
| > (Meaning that when most people say "ascii", they're
| actually referring to one of those encodings in
| practice.)
|
| Having actually worked on charset handling, when most
| people say "ASCII", they _mean_ "ASCII" and not anything
| else. If a document is _labeled_ as ASCII, then generally
| it should be handled as Windows-1252. If a conversion
| function claims to convert ASCII to something else, and
| doesn't provide any error mechanism (which it really
| should), then it usually means ISO-8859-1 aka Latin-1 aka
| map each byte to the first 256 Unicode characters.
|
| But I'd never see, e.g., a KOI8-R document referred to as
| ASCII, nor anything that claimed to be ASCII assumed to
| be a KOI8-R document.
|
| > Asiatic character sets are an interesting point though.
| I wonder how common they were at the time of what Linus
| wrote...
|
| https://4.bp.blogspot.com/-O4jXmTm7WWI/Tyw1As8jt7I/AAAAAAAAI...
|
| At the time he wrote that, the main Asiatic charsets for
| Chinese and Japanese would have been more common than
| UTF-8. Maybe Korean as well, although Linus's message is
| around the time that UTF-8 overtook EUC-KR. In any case,
| anyone who knew anything about character sets at the time
| would have been well aware of Asiatic variable-width
| character sets.
| ninkendo wrote:
| I appreciate your insight, but I just want to expand on
| one point:
|
| > Having actually worked on charset handling, when most
| people say "ASCII", they mean "ASCII" and not anything
| else.
|
| Approximately zero people are referring to a true,
| packed, 7-bit encoding when they say "ASCII". They're
| nearly always talking about an 8-bit character set, and
| in such cases, _something_ must happen when the high bit
| is 1. (I've never seen one that plain ignores or uses
| error glyphs for characters >127, although you likely
| have more experience with this than I do.) This is why I
| said people are referring to one of these encodings _in
| practice_... because ascii is 7-bit, and approximately
| everyone is talking about _some_ 8-bit encoding of one
| form or another.
|
| I would definitely agree that most wouldn't call KOI8-R
| "ascii", but they may use the term "ascii" to describe
| the first 128 characters of KOI8-R. (Notwithstanding if
| it uses weird replacement characters like Shift_JIS does
| with the backslash and the yen sign.) This is the reason
| for my comment about how the weird "ascii + custom" all
| just feels like ascii to me... if you stay below 128 it
| literally is.
|
| I'll modify my original statement thusly:
|
| > This actually rules out nearly any character set that
| isn't compatible with ASCII.
|
| And add an addendum that if you don't use UTF-8, you
| can't use unicode and will be stuck in code page/locale
| hell.
| int_19h wrote:
| > I've never seen one that plain ignores or uses error
| glyphs for characters >127
|
| Reporting an error is the default behavior if you try to
| decode such a string with the ASCII codec in Python and
| .NET, at the very least.
|
| The first 128 characters of KOI8-R are, of course, ASCII
| (the "weird replacement characters" are, in fact,
| explicitly allowed!). But a file encoded in KOI8-R is
| only ASCII if it contains only those first 128 chars.
|
| > if you don't use UTF-8, you can't use unicode and will
| be stuck in code page/locale hell.
|
| UTF-7 was a thing. It just turned out that nobody really
| needed it.
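|
| As a rough illustration of the Python half of that claim (a sketch,
| assuming the standard "ascii" and "koi8_r" codecs; the sample strings
| are made up): the strict ASCII codec refuses bytes above 0x7F, while
| text that stays below 128 decodes identically under both codecs.
|
```python
# Cyrillic text encoded as KOI8-R: every byte has the high bit set.
koi8_bytes = "привет".encode("koi8_r")

try:
    koi8_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # The strict ASCII codec reports an error rather than guessing.
    ascii_ok = False

# The same bytes decode cleanly once the real charset is named.
roundtrip = koi8_bytes.decode("koi8_r")

# Bytes below 0x80 mean the same thing in both encodings.
ascii_subset = b"plain text".decode("koi8_r") == b"plain text".decode("ascii")

print(ascii_ok, roundtrip, ascii_subset)
```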
| dylan604 wrote:
| I see your pedantic and raise you: UTF-8 isn't a font
| though. It's a text encoding.
| marklgr wrote:
| String bets not allowed, whatever their encoding ;)
| jl6 wrote:
| I can't think of why you'd ever want a newline in a filename,
| but it does make for easier reasoning about what characters (or
| perhaps I should say bytes) could be found in filenames, as
| opposed to having to remember a long list of exceptions.
| tyingq wrote:
| It is odd. Though tools like find have "-print0" for this
| purpose. And corresponding input flags for xargs, perl, sort,
| uniq, cut, head, etc, that accept NUL terminated vs newline
| terminated lists.
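|
| A small Python sketch of why the NUL-terminated form is easier to
| consume. It does not run find; it builds byte streams shaped like
| newline-terminated and "-print0"-style output from made-up names, and
| split_nul is a hypothetical helper, not part of any tool.
|
```python
from __future__ import annotations


def split_nul(stream: bytes) -> list[bytes]:
    """Split a NUL-terminated list of names into individual names."""
    items = stream.split(b"\0")
    # A terminated list ends with NUL, which leaves one empty trailing item.
    if items and items[-1] == b"":
        items.pop()
    return items


names = [b"report.txt", b"two\nlines.txt", b" leading space"]
nul_stream = b"".join(name + b"\0" for name in names)
newline_stream = b"".join(name + b"\n" for name in names)

parsed_nul = split_nul(nul_stream)            # recovers the three names
parsed_newline = newline_stream.splitlines()  # splits the second name in two

print(len(parsed_nul), len(parsed_newline))
```
|
| NUL works as a terminator because it is one of the two bytes (with
| "/") that cannot appear inside a file name.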
| bayindirh wrote:
| I'm against limiting the character set allowed for file names.
| macOS is also in the same boat with Linux, going one step
| forward and allowing \null terminator even in the filenames.
|
| If we're going to limit filenames' character sets, I can offer
| a simpler solution:
|
| Why allow file names? OS should provide a UUID for all files.
| No names, nothing. We can just write which file is what to
| another file, noting its UUIDs to sticky notes.
| crispyambulance wrote:
| > Why allow file names? OS should provide a UUID for all
| files. No names, nothing.
|
| On an application level that's sort-of starting to happen. It's
| annoying though. Sometimes you just need to know where the
| actual F Apple put your photos (it's not obvious). If
| different applications need to work with the same files, then
| there's an annoying coordination problem if one application
| tries to pretend that "files" don't exist and another needs a
| file path.
|
| Autodesk Fusion 360 tucks your projects into a cloud. I know
| there's some local cache, but there's no need to think about
| it because only Fusion-360 handles those "files" and I just
| worry about my project assets as presented to me by the UI.
| In that case, it's OK, but it also suggests a "walled-garden"
| of files for each application.
| pklausler wrote:
| We could use SHA-256 for the UUIDs, map names to hashes in
| special directory files, and build a source code control
| system out of it too while we're at it.
| jdblair wrote:
| git outta here!
| dragonwriter wrote:
| > Why allow file names? OS should provide a UUID for all
| files. No names, nothing. We can just write which file is
| what to another file, noting its UUIDs to sticky notes.
|
| But... isn't that what filesystems, in effect, already do?
| Files have IDs, which are mapped to names in a separate
| record. Having it in one common shared place for the whole
| filesystem, and a common OS API that provides access to it
| for all mounted filesystems, just makes things like useful,
| user-friendly shells (graphical and text), and common
| controls possible without everything user-facing needing a
| separate UI constructed from scratch for each app's files.
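|
| A minimal Python sketch of that ID-to-name mapping on a POSIX-style
| filesystem, assuming hard links are supported where the temporary
| directory lives; the file names are arbitrary. Two directory entries
| point at one inode, and removing one name leaves the data reachable
| through the other.
|
```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first-name")
    second = os.path.join(tmp, "second-name")

    with open(first, "w") as fh:
        fh.write("same file\n")
    os.link(first, second)  # add a second directory entry (hard link)

    # Both names resolve to the same inode number.
    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    link_count = os.stat(first).st_nlink

    os.unlink(first)  # removes one name, not the data
    with open(second) as fh:
        survived = fh.read()

print(same_inode, link_count, repr(survived))
```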
| feldrim wrote:
| This is an old solution to a problem that does not exist.
| Yes, in that case the file system can be a key-value store.
| It would eliminate the need for a tree structure. But the
| tree structure has a meaning: it adds context. The
| directories are containers of files that add a semantic
| abstraction to the files within.
|
| https://devblogs.microsoft.com/oldnewthing/20110228-00/?p=11.
| ..
| gglitch wrote:
| I'm with you on the directory tree, but like the idea of
| files having both names and unique, autogenerated IDs.
|
| Edit: _optionally_ having IDs.
| feldrim wrote:
| Windows allows you to have optional IDs.
| wlib wrote:
| Why do we impose hierarchy so much in file systems? We
| already allow hard and soft links, so it's not even a tree
| anyways. Why not just allow any reference types you want;
| no name with extensions, but a set of tags. Why not
| identify files the same way a graph database query
| identifies nodes?
| feldrim wrote:
| So you propose a graph database for data structures,
| without the persistence layer provided by the file
| system, right?
| sitharus wrote:
| Because hierarchical structures and names are easy to
| explain to most people. macOS has supported tagging for
| ages, but I've never seen it used extensively or as a
| complete alternative to tree structure.
| mistrial9 wrote:
| my imagined reason is: when that terrible day happens and an
| important file with some new name does in fact get a newline in
| it, the rest of the system already has predictable code paths.
| Q. Is this related to Perl? Who knows.
| jagrsw wrote:
| > yet it breaks nearly every text-based tool you could possibly
| imagine
|
| It breaks badly designed text protocols - some can argue that
| it's a good idea - "crash early, crash loud" etc.
|
| Also if your protocol breaks with newlines, it probably breaks
| with other non-literals - brackets, quotes, NUL-bytes, control
| characters, carriage return char, multibyte chars etc etc.
| jlarocco wrote:
| > That's just such an obviously brain-damaged idea.
|
| Is it, though? "Every character except '/' because it's the
| directory delimiter" seems pretty straightforward to me...
|
| > There's not a single rational use case for it, yet it breaks
| nearly every text-based tool you could possibly imagine...
|
| You don't have a use case, but that doesn't mean nobody else
| has one.
|
| And as far as "text-based tools" go, their developers should
| RTFM. I'm fairly sure UNIX existed before almost all of them,
| and it's accepted newlines all along.
| hericium wrote:
| First example suggests that `ls` should not be used but `ls -l` -
| the same program author advises against in the title, but with a
| parameter - works as expected and in this case would not result
| in "you can't tell".
|
| > The problem is that from the output of ls, neither you or the
| computer can tell what parts of it constitute a filename.
|
| Computer does not use console output of ls(1) to determine the
| list of files. It's for the user. The computer can tell what is a
| file here.
|
| The title could also be stricter with s/ls/"GNU coreutils ls"/g,
| too. I could not reproduce all the issues with FreeBSD's ls(1)
| under zsh.
| Tepix wrote:
| "ls -l" has other issues, now it will show you user and group
| names which can contain unexpected characters, too.
| [deleted]
| vidarh wrote:
| > First example suggests that `ls` should not be used but `ls
| -l` - the same program author advises against in the title, but
| with a parameter - works as expected and in this case would not
| result in "you can't tell".
|
| The first example is used to demonstrate the issue _and_ to
| demonstrate that "-l" introduces other issues (inconsistent
| escaping).
|
| > Computer does not use console output of ls(1) to determine
| the list of files. It's for the user. The computer can tell
| what is a file here.
|
| But if you try to use the output of ls in a script to find
| filenames, the computer _will_ be using ls to determine the
| list of files. Hence the advice not to do so.
|
| > The title could also be stricter with s/ls/"GNU coreutils
| ls"/g, too. I could not reproduce all the issues with FreeBSD's
| ls(1) under zsh.
|
| I think that just emphasises why you shouldn't, as it
| demonstrates you can't trust the output of ls to be consistent
| between systems either. If you are sure you'll never need to
| run your scripts on another system, you might not care, but
| when it's so easy to prevent this by e.g. using find with
| "-print0" or equivalent, it seems silly to not just unlearn the
| bad habit of using ls for this.
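|
| One "or equivalent" is to ask the OS for directory entries directly.
| A hedged Python sketch, assuming a filesystem that permits spaces and
| newlines in names; the file names are invented for the demonstration.
|
```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    awkward = ["plain.txt", "has space.txt", "has\nnewline.txt"]
    for name in awkward:
        open(os.path.join(tmp, name), "w").close()

    # os.scandir yields one entry per file; names are never split or escaped.
    with os.scandir(tmp) as entries:
        listed = sorted(entry.name for entry in entries)

    # What a line-oriented consumer would see if the same names were
    # printed one per line and then split on newlines again.
    as_lines = "\n".join(sorted(awkward)).split("\n")

print(len(listed), len(as_lines))
```
|
| The directory API reports three files; the line-based view reports
| four, because the embedded newline is indistinguishable from a
| separator.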
| [deleted]
| salmo wrote:
| Some of these are why I bail for a "real language" in many
| seemingly simple scenarios.
|
| As soon as I care about datetimes, it's just easier to use stat()
| and a proper datetime API.
|
| I can treat filenames as byte arrays and translate to Unicode or
| let the language do it for me.
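|
| A short Python sketch of both points, as one possible "real language"
| version: stat() plus a datetime API for timestamps, and bytes paths
| for names. The file name and timestamp are arbitrary choices for the
| example.
|
```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.log")
    open(path, "w").close()
    os.utime(path, (1_640_000_000, 1_640_000_000))  # set atime and mtime

    # No parsing of ls date columns: stat() returns seconds since the epoch.
    mtime = datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)

    # Passing a bytes path makes listdir return raw bytes names.
    raw_names = os.listdir(os.fsencode(tmp))
    # fsdecode converts them to str using the filesystem encoding.
    text_names = [os.fsdecode(name) for name in raw_names]

print(mtime.isoformat(), raw_names, text_names)
```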
|
| In dire circumstances, find ... -print0 | xargs -0 second_script
| is usually my fallback, but that has pitfalls as well.
|
| Go has been a blessing there for me, not having to rely on a
| runtime across diverse hosts. But that's a preference and doesn't
| help on old kernels w/o epoll().
|
| So many battle scars from inconsistency in Bash and GNU utilities
| over the years, especially on Unixes' bundled versions (Solaris,
| etc) or supporting GNU, BSD, SysV, and HP-UX in the same script.
| Used to deploy a ksh88(ish) on all for SOME consistency.
|
| Luckily now I'm not supporting anything but Linux anymore. When I
| can't Go, then I just hijack some tool's bundled Ruby (eg
| Puppet), Python, etc when I have to handle that and stick to the
| standard library.
|
| I am too lazy to C these days like I used to. I'm usually dealing
| with an emergency (looking at you log4j) and don't have the
| cycles to cover the gotchas there.
| chasil wrote:
| Any POSIX system has an easy way to remove a file with a
| malformed/hostile name.
|
| Determine the inode number with "ls -li", then remove the file
| with "find . -inum # -delete".
|
| The GNU stat utility makes finding the inode slightly easier.
| This method is preferable when there is any doubt of wildcard
| expansion.
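|
| The same idea can be sketched in Python without typing the name at
| all. This is an illustration only: unlink_by_inode is a hypothetical
| helper, and the hostile file name is made up.
|
```python
import os
import tempfile


def unlink_by_inode(directory: str, inode: int) -> int:
    """Remove entries in `directory` whose inode number equals `inode`."""
    with os.scandir(directory) as entries:
        # Collect matches first so nothing is removed while scanning.
        matches = [
            entry.path
            for entry in entries
            if entry.stat(follow_symlinks=False).st_ino == inode
        ]
    for path in matches:
        os.unlink(path)
    return len(matches)


with tempfile.TemporaryDirectory() as tmp:
    hostile = os.path.join(tmp, "-rf *\n")  # awkward to type or glob safely
    keep = os.path.join(tmp, "keep.txt")
    for p in (hostile, keep):
        open(p, "w").close()

    target = os.stat(hostile).st_ino
    removed = unlink_by_inode(tmp, target)
    remaining = os.listdir(tmp)

print(removed, remaining)
```
|
| Like the find approach, this only looks in one directory; inode
| numbers are unique per filesystem, not globally.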
| scbrg wrote:
| Unless I'm mistaken, POSIX _find_ does not have _-delete_
| 10000truths wrote:
| It might not be in the standard, but pretty much every
| implementation that I know of supports it - GNU, BSD,
| Solaris, even busybox and toybox.
| chasil wrote:
| Alas, you are right, my mistake. It doesn't have -inum
| either. Those GNUisms do creep in.
|
| https://pubs.opengroup.org/onlinepubs/9699919799/utilities/
| f...
| cogburnd02 wrote:
| GNU ls also has options to automatically add quoting to
| its output.
___________________________________________________________________
(page generated 2021-12-31 23:01 UTC)