[HN Gopher] You can list a directory containing 8M files, but no...
___________________________________________________________________
You can list a directory containing 8M files, but not with ls
Author : _wldu
Score : 50 points
Date : 2021-08-15 19:21 UTC (3 hours ago)
(HTM) web link (be-n.com)
(TXT) w3m dump (be-n.com)
| tyingq wrote:
| perl -E 'opendir(my $d,".");say while readdir $d'
| marcodiego wrote:
| Makes me think that findfirst and findnext were not that bad
| after all.
| Y_Y wrote:
| If you haven't prepared for this eventuality then odds are you're
| going to run out of inodes first. And it's probably not useful to
| just dump all those filenames to your terminal. And don't even
| say you were piping the output of `ls` to something else!
|
| Anyway the coreutils shouldn't have arbitrary limits like this,
| at least if they do then the limits should be so high that you
| have to be really trying hard in order to reach them.
| cmeacham98 wrote:
| There isn't actually an arbitrary limit, it's just that glibc's
| readdir() implementation is really really slow with millions of
| files according to the article. Presumably if you waited awhile
| `ls` would eventually get the whole list.
| matheusmoreira wrote:
| The glibc functions are just bad wrappers for the real system
| calls which no doubt work much more efficiently. I fully
| expected to find the system call solution in the article and
| was not disappointed.
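|
| For reference, a minimal sketch of that approach (not the
| article's exact code): call SYS_getdents64 directly with a large
| buffer and just dump names as they arrive, with no sorting and no
| per-file stat(). Error handling is kept minimal.
|
|     #define _GNU_SOURCE
|     #include <fcntl.h>
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <sys/syscall.h>
|     #include <unistd.h>
|
|     /* Kernel dirent layout used by getdents64. */
|     struct linux_dirent64 {
|         unsigned long long d_ino;
|         long long          d_off;
|         unsigned short     d_reclen;
|         unsigned char      d_type;
|         char               d_name[];
|     };
|
|     #define BUF_SIZE (5 * 1024 * 1024)  /* much bigger than 32K */
|
|     int main(int argc, char **argv)
|     {
|         const char *dir = argc > 1 ? argv[1] : ".";
|         int fd = open(dir, O_RDONLY | O_DIRECTORY);
|         if (fd < 0) { perror("open"); return 1; }
|
|         char *buf = malloc(BUF_SIZE);
|         for (;;) {
|             long n = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
|             if (n <= 0) break;  /* 0 = end of dir, <0 = error */
|             for (long off = 0; off < n; ) {
|                 struct linux_dirent64 *d =
|                     (struct linux_dirent64 *)(buf + off);
|                 puts(d->d_name);  /* just dump the name */
|                 off += d->d_reclen;
|             }
|         }
|         free(buf);
|         close(fd);
|         return 0;
|     }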
| mercurialuser wrote:
| Did you try ls -1? In the far past I had the same problem listing
| millions of files in a dir. Edit: if I remember correctly, ls
| buffers the results for sorting; with -1 it just dumps the
| values.
| acdha wrote:
| It's not "ls -1" but "--sort=none" or "-U".
| loeg wrote:
| "ls -f" in POSIX ls (which GNU ls also implements). Also,
| avoid "-F", which will stat each file.
| the_arun wrote:
| `ls | more` works too, right?
| acdha wrote:
| No - it still does all of the work to sort the entries,
| which is the slow part since it prevents the first entry
| from being displayed until the last has been retrieved.
| loeg wrote:
| At least with GNU ls, 'ls | more' does not disable sorting.
| It disables automatic color (which is important -- coloring
| requires 'stat(2)'ing every file in the directory).
| innagadadavida wrote:
| I think it also tries to neatly format into columns, and this
| requires it to know the name lengths for all files. If you do -l
| it basically outputs one file per line, so it can be done more
| efficiently.
| sigmonsays wrote:
| Seems the author didn't read the man page. ls -1f, as others have
| pointed out, is a much better solution.
|
| Additionally, having 8 million of anything in a single directory
| screams bad planning. It's common to plan for some kind of hashed
| directory structure.
| unwind wrote:
| Meta, if the author is around: there seems to be some kind of
| encoding problem, on my system (Linux, Firefox) I see a lot of
| strange characters where there should probably be punctuation.
|
| The first section header reads "Why doesnâ€™t ls work?", for
| instance.
| FpUser wrote:
| Same here
| cmeacham98 wrote:
| This is because the page has no doctype, thus putting the
| browser in "quirks mode", defaulting to a charset of ISO-8859-1
| (as the page does not specify one). The author can fix this
| either by specifying the charset, or adding the HTML5 doctype
| (HTML5 defaults to UTF-8).
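|
| Concretely, that looks something like this (a generic example,
| not the site's actual markup); per the spec the charset
| declaration has to appear within the first 1024 bytes:
|
|     <!DOCTYPE html>
|     <meta charset="utf-8">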
| dheera wrote:
| Maybe browsers should default to UTF-8 already. It's 2021.
| lxgr wrote:
| Why? Defaulting to UTF-8 for modern HTML, and to ISO-8859-1
| for legacy pages, makes a lot of sense.
|
| Pages that haven't been adapted to HTML 5 in the last 10
| years or so are exceedingly unlikely to do so in year 11.
| dheera wrote:
| ISO-8859-1 is a subset of UTF-8 isn't it? No harm done by
| defaulting to the superset.
| [deleted]
| CodesInChaos wrote:
| No. ASCII is a subset of UTF-8, ISO-8859-1 is not. The
| first 256 codepoints of unicode match ISO-8859-1, which
| is probably the source of your confusion. However,
| codepoints 128-255 are encoded differently in UTF-8. They
| are represented by a single byte when encoded as
| ISO-8859-1, while they turn into two bytes encoded in
| UTF-8.
|
| Plus "ISO-8859-1" is treated as Windows-1252 by browsers,
| while unicode uses ISO-8859-1 extended with the ISO 6429
| control characters for its initial 256 codepoints.
| dheera wrote:
| Ah I see, thanks.
| anyfoo wrote:
| If it were, the characters in question would already
| display correctly for this website, since they are within
| ISO-8859-1. ASCII is a subset of UTF-8.
| magicalhippo wrote:
| We need to handle a lot of crappy data-in-text-files at
| work, and for most of them using the UTF-8 duck test seems
| to be the most reliable.
|
| If it decodes successfully as UTF-8 it's probably UTF-8.
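|
| A rough sketch of that check (structural validation only; it
| ignores overlong encodings and surrogates, so treat it as an
| illustration rather than a full validator):
|
|     #include <stddef.h>
|
|     /* Return 1 if buf[0..len) is structurally valid UTF-8. */
|     int looks_like_utf8(const unsigned char *buf, size_t len)
|     {
|         size_t i = 0;
|         while (i < len) {
|             unsigned char b = buf[i];
|             size_t extra;
|             if (b < 0x80)                extra = 0;  /* ASCII   */
|             else if ((b & 0xe0) == 0xc0) extra = 1;  /* 2 bytes */
|             else if ((b & 0xf0) == 0xe0) extra = 2;  /* 3 bytes */
|             else if ((b & 0xf8) == 0xf0) extra = 3;  /* 4 bytes */
|             else return 0;               /* invalid lead byte   */
|             if (i + extra >= len)
|                 return 0;                /* truncated sequence  */
|             for (size_t k = 1; k <= extra; k++)
|                 if ((buf[i + k] & 0xc0) != 0x80)
|                     return 0;            /* bad continuation    */
|             i += extra + 1;
|         }
|         return 1;
|     }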
| wolfgang42 wrote:
| That requires scanning the whole file before guessing the
| encoding, which browsers don't do for performance reasons
| (and also because an HTML document _may never end_ , it's
| perfectly valid for the server to keep appending to the
| document indefinitely). The HTML5 spec does recommend
| doing this on the first 1024 bytes, though.
| magicalhippo wrote:
| Browsers are quite happy to re-render the whole
| document multiple times though, so they could just switch
| and re-decode when UTF-8 decoding fails. Sure, it wouldn't
| be the fast path, but it sure beats looking at hieroglyphs.
|
| And yeah, add some sensible limits to this logic, of
| course. Most web pages aren't never-ending, nor multi-GB
| of text.
| wolfgang42 wrote:
| _> HTML5 defaults to UTF-8_
|
| I'm not sure this is correct, though the WHATWG docs[1] are
| kind of confusing. From what I can tell, it seems like HTML5
| documents are required to be UTF-8, but also this is required
| to be explicitly declared either in the Content-Type header,
| a leading BOM, or a <meta> tag in the first 1024 bytes of the
| file. Reading this blog post[2] it sounds like there is a
| danger that if you don't do this then heuristics will kick in
| and try to guess the charset instead; the documented
| algorithm for this doesn't seem to consider the doctype at
| all.
|
| [1]: https://html.spec.whatwg.org/dev/semantics.html#charset
|
| [2]: https://blog.whatwg.org/the-road-to-html-5-character-
| encodin...
| Aardwolf wrote:
| The single quote for doesn't is an ASCII character though, so
| why does that one become â€™?
| iudqnolq wrote:
| Here's the heuristic-based hypothesis of the python package ftfy:
|
|     >>> import ftfy
|     >>> ftfy.fix_and_explain("â€™")
|     ExplainedText(text="'",
|                   explanation=[('encode', 'sloppy-windows-1252'),
|                                ('decode', 'utf-8'),
|                                ('apply', 'uncurl_quotes')])
| wolfgang42 wrote:
| Note that uncurl_quotes is a FTFY fix unrelated to character
| encoding; it's basically just s/’/'/. (FTFY
| turns all of its fixes on by default, which sometimes
| results in it doing more than you might want it to.)
|
| You can play around with FTFY here (open the "Decoding
| steps" to see the explanation of what it did and why):
| https://www.linestarve.com/tools/mojibake/?mojibake=â€™
| bdowling wrote:
| It's not. It's a Unicode 'RIGHT SINGLE QUOTATION MARK'
| (U+2019), which in UTF-8 is encoded as 0xe2 0x80 0x99.
|
| 0xe2 is â in iso8859-1. 0x80 is not in iso8859-1, but is
| € in windows-1252. 0x99 is not in iso8859-1, but is ™
| in windows-1252.
|
| So, the browser here appears to be defaulting to
| windows-1252.
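|
| As an illustrative sketch (not anything from the article), you
| can see those bytes for yourself by dumping the UTF-8 encoding
| of U+2019:
|
|     #include <stdio.h>
|
|     int main(void)
|     {
|         /* UTF-8 encoding of U+2019 RIGHT SINGLE QUOTATION MARK */
|         const unsigned char q[] = { 0xe2, 0x80, 0x99 };
|         for (int i = 0; i < 3; i++)
|             printf("0x%02x\n", q[i]);
|         /* windows-1252 renders these bytes as 'â', '€', '™' */
|         return 0;
|     }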
| kccqzy wrote:
| Use your browser to override the encoding. For example in
| Firefox choose "View > Repair Text Encoding" from the menu or
| in Safari choose "View > Text Encoding > Unicode (UTF-8)" from
| the menu. Many browsers still default to Latin 1, but this page
| is using UTF-8.
|
| (This used to happen a lot ~15 years ago. Did the dominance of
| UTF-8 make people forget about these encoding issues?)
| fintler wrote:
| https://github.com/hpc/mpifileutils handles this pretty well --
| with SYS_getdents64. It has a few other tricks in there in
| addition to this one.
| scottlamb wrote:
| tl;dr: try "ls -1 -f". It's fast.
|
| This doesn't pass my smell test:
|
| > Putting two and two together I could see that the reason it was
| taking forever to list the directory was because ls was reading
| the directory entries file 32K at a time, and the file was 513M.
| So it would take around 16416 system calls of getdents() to list
| the directory. That is a lot of calls, especially on a slow
| virtualized disk.
|
| 16,416 system calls is a little inefficient, but not that
| noticeable in human terms. And the author is talking as if each
| one waits 10 ms for a disk head to move to the correct position.
| That's not true. The OS and drive both do readahead, and they're
| both quite effective. I recently tried to improve performance of
| a long-running sequential read on an otherwise-idle old-fashioned
| spinning disk by tuning the former ("sudo blockdev --setra 6144
| /path/to/device"). I found it made no real difference: "iostat"
| showed OS-level readahead reduces the number of block operations
| (as expected) but also that total latency doesn't decrease. It
| turns out in this scenario the disk's cache is full of the
| upcoming bytes so those extra operations are super fast anyway.
|
| The real reason "ls" takes a while to print stuff is that by
| default it will buffer everything before printing anything so
| that it can sort it and (when stdout is a terminal) place it into
| appropriately-sized columns. It also (depending on the options
| you are using) will stat every file, which obviously will dwarf
| the number of getdents calls and access the inodes (which are
| more scattered across the filesystem).
|
| "ls -1 -f" disables both those behaviors. It's reasonably fast
| without changing the buffer size. moonfire-
| nvr@nuc:/media/14tb/sample$ time ls -1f | wc -l 1042303
| real 0m0.934s user 0m0.403s sys
| 0m0.563s
|
| That's on Linux with ext4.
| loeg wrote:
| Agree re smell test. Those directory blocks are cached, even in
| front of a slow virtualized disk, and most of those syscalls
| are hitting in cache. Author is likely running into (1) stat
| calls and (2) buffer and sort behavior, exactly as you
| describe.
| iso1210 wrote:
| Interesting, tried myself on a test VM
|
|     ~/test$ time for I in `seq -w 1 1000000`; do touch $I; done
|
|     real    27m8.663s
|     user    14m15.410s
|     sys     12m24.411s
|
| OK
|
|     ~/test$ time ls -1f | wc -l
|     1000002
|
|     real    0m0.604s
|     user    0m0.180s
|     sys     0m0.422s
|
|     ~/test$ time ls -f | wc -l
|     1000002
|
|     real    0m0.574s
|
|     ~/test$ time perl -E 'opendir(my $d,".");say while readdir $d' | wc -l
|     1000002
|
|     real    0m0.597s
|
| All seems reasonable. Directory size alone is 23M, somewhat
| larger than the typical 4096 bytes.
| osswid wrote:
| ls -f
| wolfgang42 wrote:
|     -f          do not sort, enable -aU, disable -ls --color
|     -a          do not ignore entries starting with .
|     -U          do not sort; list entries in directory order
|     -l          use a long listing format
|     -s          print the allocated size of each file, in blocks
|     --color     colorize the output
|
| I assume you mean to imply that by turning off
| sorting/filtering/formatting ls will run in a more optimized
| mode where it can avoid buffering and just dump the dentries as
| described in the article?
| jjgreen wrote:
| Seems that way: https://github.com/wertarbyte/coreutils/blob/
| master/src/ls.c...
| loeg wrote:
| Yeah, exactly. OP is changing 3 variables and concluding that
| getdirent buffer size was the significant one, but actually
| the problem was likely (1) stat calls, for --color, and (2)
| buffer and sort, which adds O(N log N) sorting time to the
| total run+print time. (Both of which are avoided by using
| getdirent directly.)
| majkinetor wrote:
| Yeah ...
|
| However, let's just accept that regular people don't know those
| tricks and that we should keep files in subfolders. I have that
| logic in any app that has the potential to spam a directory. You
| can still show them as a single folder (sometimes called a branch
| view) if you like, but every other tool that uses ls will work
| like a charm (such as your backup shell script).
| yjftsjthsd-h wrote:
| Then anything working on it needs to recurse.
| bifrost wrote:
| Interesting point, though this does appear to be Linux- and
| situation-specific.
|
| It's interesting enough that I'm going to run my own test now.
| bifrost wrote:
| It's going to take me a bit to generate several million files,
| but so far I've got a single directory with 550k files in it; it
| takes 30s to ls it on a very busy system running FreeBSD.
|
| 1.1M files -> 120 seconds
|
| 1.8M files -> 270 seconds (this could be related to system load
| being over 90 heh)
| ipaddr wrote:
| At 3,000 my windows 7 os freezes. Not bad for a million.
| ygra wrote:
| You may want to disable short name generation on windows
| when putting many files in one directory.
| loeg wrote:
| Try "ls -f" (don't sort)?
|
| Which filesystem you use will also make a big difference
| here. You could imagine some filesystem that uses the
| getdirentries(2) binary format for dirents, and that could
| literally memcpy cached directory pages for a syscall. In
| FreeBSD, UFS gets somewhat close, but 'struct direct' differs
| from the ABI 'struct dirent'. And the FS attempts to validate
| the disk format, too.
|
| FWIW, FreeBSD uses 4kB (x86 system page size) where glibc
| uses 32kB in this article[1]. To the extent libc is actually
| the problem (I'm not confident of that yet based on the
| article), this will be worse than glibc's larger buffer.
|
| [1]: https://github.com/freebsd/freebsd-
| src/blob/main/lib/libc/ge...
| bifrost wrote:
| with "ls -f" on 1.9M files its 45 seconds, much better than
| regular ls (and system load of 94)
|
| 2.25M and its 60 seconds
|
| I'm also spamming about 16-18 thousand new files per second
| to disk using a very inefficient set of csh scripts...
| scottlamb wrote:
| A more efficient one-liner:
|
|     seq 1 8000000 | xargs touch
| avaika wrote:
| If you are going to have a directory with millions of files,
| there's probably one more interesting thing to consider.
|
| As you might know, ext* and some other filesystems store filenames
| right in the directory file. That means the more files you have in
| the directory, the bigger the directory file gets. In the majority
| of cases nothing unusual happens, because people have maybe a few
| dozen dirs / files.
|
| However, if you put millions of files in it, the directory grows
| to a few megabytes in size. If you decide to clean up later, you'd
| probably expect the directory to shrink. But that never happens
| unless you run fsck or re-create the directory.
|
| That's because nobody believes the implementation effort is really
| worth it. Here's a link to the lkml discussion:
| https://lkml.org/lkml/2009/5/15/146
|
| PS. Here's a previous discussion of the very same article posted
| in this submission. It's been 10 years already :)
| https://news.ycombinator.com/item?id=2888820
|
| upd. Here's a code example:
|
| $ mkdir niceDir && cd niceDir
|
| # this might take a few moments
|
| $ for ((i=1;i<133700;i++)); do touch
| long_long_looong_man_sakeru_$i ; done
|
| $ ls -lhd .
|
| drwxr-xr-x 2 user user 8.1M Aug 2 13:37 .
|
| $ find . -type f -delete
|
| $ ls -l
|
| total 0
|
| $ ls -lhd .
|
| drwxr-xr-x 2 user user 8.1M Aug 2 13:37 .
| kalmi10 wrote:
| I once had a directory on OpenZFS with more than a billion files,
| and after cleaning it up, with only a handful of folders remaining,
| running ls in it still took a few seconds. I guess some large but
| almost empty tree structure remained.
|
| https://0kalmi.blogspot.com/2020/02/quick-moving-of-billion-...
___________________________________________________________________
(page generated 2021-08-15 23:00 UTC)