[HN Gopher] Git ls-files is Faster Than Fd and Find
___________________________________________________________________
Git ls-files is Faster Than Fd and Find
Author : todsacerdoti
Score : 91 points
Date : 2021-11-21 17:33 UTC (5 hours ago)
(HTM) web link (cj.rs)
(TXT) w3m dump (cj.rs)
| rmetzler wrote:
| A small typo occurs twice: git ls-file instead of git ls-file
| _s_.
|
| Thanks for bringing it up, I didn't know about the command. I
| also might have a use case for it.
| c_joly wrote:
| Thanks for pointing this out and for your kind words. A fix for
| the typos should be live soon.
| tyingq wrote:
| A warm buffer cache makes a big difference too, so if you're
| benchmarking things like find vs other tools, be sure to empty
| the cache between runs.
|
| For Linux: echo 3 > /proc/sys/vm/drop_caches
|
| In this case, the author is using hyperfine with --warmup 10, so
| the numbers are all using a warm buffer cache. A cold cache
| probably would have been more realistic for comparison, since the
| benchmark is traversing lots of directories.
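| For a cold-cache comparison, hyperfine's --prepare flag can run
| the cache-dropping command before every timed run; something
| like this (needs sudo inside the timing loop):
|
|   $ hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
|       'find .' 'fd .' 'git ls-files'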
| [deleted]
| cb321 wrote:
| I got run times from the simplest single-threaded directory walk
| that are only 1.8x slower than git ls-files. (Min time of 10 runs
| with the git repo housed by /dev/shm on Linux 5.15.)
|
| The "simple" code is in
| https://github.com/c-blake/cligen/blob/master/cligen/dents.n...
| (just `dents find` does not require the special kernel batch
| system call module to be fast. That kernel module is more about
| statx batching but IO uring can also do that these days. For
| those unfamiliar with Nim, it's a high productivity, high
| performance systems language.)
|
| I believe that GNU find is slow because it is specifically
| written to allow arbitrary filesystem depth as opposed to "open
| file descriptor limit-limited depth". (I personally consider
| this over-/mis-engineered, a holdover from bygone systems with
| default ulimits of, say, only 64 open fd's. There should at
| least be a "fast mode" since, let's be real, file hierarchies
| deeper than 256..1024 levels, while possible, are rare; one
| should optimize for the common case, or at least allow manual
| instruction that this case prevails. AFAIK there is no such
| "fast mode". Maybe on some BSD?)
|
| Meanwhile, I think the Rust fd is slow because of (probably
| counterproductive) multi-threading (at least it does 11,000 calls
| to futex).
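|
| For anyone who wants to check a count like that, strace can
| tally it; something along these lines, with -f to follow fd's
| worker threads:
|
|   $ strace -f -c -e trace=futex fd . >/dev/null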
| aaaaaaaaaaab wrote:
| >Meanwhile, I think the Rust fd is slow because of (probably
| counterproductive) multi-threading (at least it does 11,000
| calls to futex).
|
| Afaik rustaceans call this "fearless concurrency".
| R0b0t1 wrote:
| From experience benchmarking this in the past, Window's poor
| NTFS implementation sees speedup from multithreading (which I
| can't make sense of) whereas Linux usually had a penalty for
| multithreading directory traversals.
|
| But, as mentioned below, this is faster because git ls-files is
| using a pregenerated index.
| cb321 wrote:
| No question/disagreement that it's faster from having an
| index. It just turns out to be only 1.8x faster, not the much
| higher ratio reported in the article (on Linux, anyway).
|
| I was mostly just adding some performance color: the article's
| points of comparison are, in some sense, not very fast.
| R0b0t1 wrote:
| Hm, yes, I reread the part about the performance of find.
| GNU find is one of the fastest directory traversals out of
| the box compared to many other implementations, e.g. the
| ones that come in stdlibs. I was under the impression the
| slowness came from applying the tests, when any are given.
| cb321 wrote:
| Running the tests like `-type l` or `-perm` or whatnot
| surely can be slow, but that is not in play in the
| article's use case.
|
| In my experience, fast vs. slow "reputations" are sadly
| very unreliable.
|
| Here is some system call color.
|
|   $ find | wc -l
|   79348
|   $ strace -c find >/dev/null
|   % time     seconds  usecs/call     calls    errors syscall
|   ------ ----------- ----------- --------- --------- ----------------
|    29.57    0.039172           1     24453           fcntl
|    24.51    0.032477           1     19615           close
|    19.87    0.026329           1     14724           newfstatat
|    16.78    0.022229           2      9786           getdents64
|     7.48    0.009909           2      4948         6 openat
|     0.86    0.001137           1       755           write
|      ...
|   ------ ----------- ----------- --------- --------- ----------------
|   100.00    0.132487           1     74336         9 total
|
|   $ dents find | wc -l
|   79348
|   $ strace -c dents f >/dev/null
|   % time     seconds  usecs/call     calls    errors syscall
|   ------ ----------- ----------- --------- --------- ----------------
|    42.43    0.018491           3      5084           getdents64
|    28.69    0.012503           2      4896           openat
|    25.40    0.011071           2      4896           close
|     3.47    0.001514           2       755           write
|      ...
|   ------ ----------- ----------- --------- --------- ----------------
|   100.00    0.043579           2     15697         2 total
|
| So, you know, almost 5x the system call count. (EDIT: and
| yes, yes, not all syscalls are created equal. In this
| hot cache case there are so many that syscall overhead
| alone could be a large fraction of costs, though.)
|
| But you don't need to trust me. It's probably about 50
| lines of C to code up a file tree walk recursion and test
| it yourself.
| bscphil wrote:
| > I believe that GNU find is slow because it is specifically
| written to allow arbitrary filesystem depth as opposed to "open
| file descriptor limit-limited depth".
|
| I haven't benchmarked find specifically, but I believe the most
| common Rust library for the purpose, walkdir[1], also allows
| arbitrary file system recursion depth, and is extremely fast.
| It was fairly close to some "naive" limited depth code I wrote
| in C for the same purpose. A lot of go-to C approaches seem to
| needlessly call stat on every file, so they're even slower.
|
| I'd be curious to see benchmarks of whether this actually makes
| a difference.
|
| [1] https://github.com/BurntSushi/walkdir
| masklinn wrote:
| > Meanwhile, I think the Rust fd is slow because of (probably
| counterproductive) multi-threading (at least it does 11,000
| calls to futex).
|
| There's probably a switch to run single-threaded so that should
| be testable.
| cb321 wrote:
| Agreed. Adding -j1 still leaves a very large (and wildly
| varying actually) number of futex calls shown by strace and
| slows down execution by another 1.5x. So, a little boost from
| threads but not much. Someone more interested (than I am) in
| "fixing" fd should study it more. Often people do not
| "special case" things like "-j1" to actually be single
| threaded with (no MT runtime overheads) but instead "launch
| only 1 worker thread" and such. Might taking hacking on fd to
| really test what is going on.
| psykotic wrote:
| I've only profiled fd on Windows but one thing that stood
| out was that it performed 2 stat syscalls per file via
| NtQueryInformationFile (that number should be 0 per file on
| Windows since stat metadata comes for free with
| NtQueryDirectoryFile from the directory enumeration). When
| I mentioned this finding on Twitter, someone confirmed that
| it's also doubling up the stat syscalls on Linux. But if
| the OP is actually trying to benchmark raw directory
| enumeration speed vs git ls-files, they should make sure
| they're benchmarking against something that's not making
| per-file stat calls at all.
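|
| On Linux, one way to check for per-file stat calls is to
| tally the stat-family syscalls and compare against the file
| count; a rough sketch:
|
|   $ fd | wc -l
|   $ strace -f -c -e trace=newfstatat,statx fd >/dev/null
|
| If the stat call count scales with the file count, the tool
| is statting each file.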
| filmor wrote:
| Well, first doing `find > .my-index` and then measuring `cat .my-
| index` would give you even better results... I don't find it
| noteworthy that reading from an index is faster than actually
| recursively walking the filesystem.
| yakubin wrote:
| Yes. I once wrote a tool which spends a lot of time traversing
| the filesystem and to my surprise in one scenario most of its
| time is spent on-CPU (!) in kernel-mode in the _readdir_r(2)_
| syscall implementation. I still haven't dived into what it's
| doing on the CPU, but it sure is interesting.
| rakoo wrote:
| No, it's not surprising, so why do we still not use indexes for
| this?
|
| NTFS maintains a journal of all files modification
| (https://en.wikipedia.org/wiki/USN_Journal). This is used by
| Everything (https://www.voidtools.com/support/everything/) to
| quickly and efficiently index _all_ files and folders. Thanks
| to that, searching for a file is instantaneous because it's
| "just an index read".
|
| The feature is common: listing files. We know that indexes help
| solve the issue. But we still use `find` because there's no
| efficient file-indexing strategy. Yes, there's
| updatedb/mlocate, but updating the db requires a full traversal
| of the filesystem because there's no efficient listing of
| changes in Linux, only filesystem-specific stuff.
|
| So we will still have articles like this until the situation
| changes, and it will still be relevant to the discussion
| because it's not "solved"
| roywiggins wrote:
| WizTree also uses NTFS metadata to perform _shockingly_ fast
| reports on space usage. Just much, much faster than anything
| else I've seen, and I'm not sure there are any equivalents for
| other filesystems.
| rakoo wrote:
| WizTree is so fast it makes me yearn for the day we have
| the same functionality on Linux.
|
| Not so much because I'm particularly interested in
| visualizing free disk space on the regular, but because I
| want to back up changed files as soon as possible, and the
| only reliable way to do that is to have some kind of journal.
| ryantriangles wrote:
| Everything's approach has important tradeoffs: the indexing
| service must be run as administrator/root, and by reading the
| USN journal it gives users a listing of _all_ files and
| mtimes on the filesystem regardless of directory permissions.
| That means that any user who can run a file search can also
| see every other users' files, which includes their partial
| web history (since most browsers, including Firefox and
| Chrome, cache data in files named for the website they're
| from, and the mtime will often be the last visit time). If
| you want to avoid this, your only option is to switch to
| Everything's traditional directory-traversal-based index,
| which has the same performance as updatedb/mlocate, Baloo, or
| fsearch.
|
| I think these tradeoffs are why there hasn't been as much
| interest in replicating it, combined with the fact that
| mlocate/fd/find/fsearch are already a lot faster in most
| circumstances than the default Windows search and good enough
| for the most common use cases (although there are certainly
| use cases where they're not).
| sillysaurusx wrote:
| It feels like the index update should be a part of the
| filesystem layer itself. Not a separate process like you're
| saying.
| rakoo wrote:
| That's an extremely important tradeoff, but it's not inherent
| to the approach: it is possible to split things, with indexing
| and searching in a root process and querying in a user process
| that asks the first one, with the root process filtering
| results according to each user's rights. Nothing we haven't
| heard of before.
| ww520 wrote:
| Updating an index adds time: a random seek to the index file
| location. It also requires some transaction to group the
| file-metadata update with the index update. It just adds more
| complexity with little benefit.
| ziml77 wrote:
| As someone who uses Everything many times per day, I can
| say that there is a significant amount of benefit. I don't
| think I'd be able to function at work without being able to
| instantly search all my files. The lack of a similar
| solution on Linux is one of the big barriers to me using
| it. The best options I've seen there all refresh the index
| on a schedule.
| c_joly wrote:
| > I don't find it noteworthy that reading from an index is
| faster than actually recursively walking the filesystem
|
| I agree; what I found noteworthy, though, is that git ls-files
| uses this index you already have.
|
| (author here, "proof": https://cj.rs/contact/ &
| https://keybase.io/leowzukw)
| cseleborg wrote:
| Yeah, that's not really the interesting part. It's clever
| because if you're a developer editing source files, the Git
| index is relatively up-to-date. So, as the author says, why not
| take advantage of it?
| nmz wrote:
| Walk is faster than find. https://github.com/google/walk
| gorgoiler wrote:
| One thing I've never understood about Linux filesystems: given
| how small and bounded the sets of directories and directory
| entries are, why is filesystem traversal not instantaneous?
| spicybright wrote:
| Lack of caches I'd assume. Not because it's not easy, but
| because no one has taken the time to implement it.
| toxik wrote:
| They say there are two hard problems in computer science:
| cache invalidation, naming, and off-by-one errors.
| spicybright wrote:
| Hehe. You're right, I forgot one of the golden rules.
| There's probably a better reason why no one has implemented
| it.
| amelius wrote:
| Also, why can tab-completion in Bash hang my shell?
| chrsig wrote:
| This is actually an incredibly frustrating thing -- when
| doing disk i/o, the process goes into an uninterruptible
| sleep state[0].
|
| So if your shell is doing some tab completion that goes into
| an uninterruptible sleep, there's no escape sequence you can
| send it to cancel that operation.
|
| Where I've seen this be the most vexing is in AWS when an EBS
| volume dies (a more frequent occurrence than you'd think!). It
| effectively prevents the machine from cleanly shutting down,
| and you wind up having to wait for the instance to be
| forcefully terminated.
|
| [0] https://eklitzke.org/uninterruptible-sleep
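|
| You can spot processes stuck like this by looking for the D
| state in ps; a quick check along these lines:
|
|   $ ps axo pid,stat,wchan:20,comm | awk 'NR==1 || $2 ~ /^D/'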
| m000 wrote:
| What you describe is an over-specialized optimization that very
| few users would benefit from, but would still introduce
| significant complexity.
|
| Linux already transparently caches filesystem metadata. You
| already get a good speedup if you attempt the same directory
| walk twice, and not much has changed in the filesystem.
| gorgoiler wrote:
| My desktop has ~2MB of basenames on it. Memory isn't free,
| and there's more to an inode than a filename, but it seems
| odd that this data that's the size of a cat photo doesn't get
| special treatment over other vm cache data.
| thanatos519 wrote:
| That's tunable, and apparently the default is reasonable:
| https://sysctl-explorer.net/vm/vfs_cache_pressure/
|
| My workstation has 64GB of RAM and only ~7M directory
| entries so I have 'vm.vfs_cache_pressure = 1' in
| /etc/sysctl.conf and cache everything with a full directory
| traversal via find. The first time it takes 52s; subsequent
| times take 5s. It has never given me memory problems.
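|
| For anyone wanting to replicate, a sketch (assumes root, and
| that one filesystem holds everything of interest):
|
|   $ sudo sysctl vm.vfs_cache_pressure=1
|   $ find / -xdev >/dev/null 2>&1   # pre-warm the dentry/inode caches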
| spicybright wrote:
| I'm assuming the cache isn't kept on disk tho, right?
|
| As in 52s the first time since boot, 5+ for subsequent
| runs (incorporating time for changed files/directories)
| topspin wrote:
| > why is filesystem traversal not instantaneous
|
| Because a mounted file system isn't a simple indexed list of
| paths. File systems are shared, mutable state.
|
| The mechanism you're asking about is called the dentry cache[1]
| and a decent narrative of its operation is found here[2]. It
| has the fabulously complex job of atomically brokering the
| state of an arbitrary number of local and remote file systems
| between an arbitrary number of processes using arbitrary access
| patterns without consuming all available RAM. Expecting the
| dentry cache to yield 'instantaneous' results is unreasonable.
| Comparing its performance to that of a VCS index is naive.
|
| [1]
| https://www.kernel.org/doc/Documentation/filesystems/vfs.txt
| [2] https://www.halolinux.us/kernel-reference/the-dentry-
| cache.h... (no endorsement of these people, but the dentry
| cache description is both concise and sufficient.)
| wvh wrote:
| Probably having to check if the actual disk entries changed is
| what slows it down. I wonder if it would be possible with
| today's memory sizes to keep all metadata in memory as a
| write-through cache. Not sure if it'd be worth it though, my
| system has close to half a million files, but I'm only
| interested in about a hundred or so. I don't think file systems
| are slow in practice for typical human-scale operations though,
| with the exception of non-indexed "search all my files" type of
| operations.
| bee_rider wrote:
| Of course scanning an index is faster than traversing the
| filesystem.
|
| Is locate/mlocate some obscure command? It works pretty well for
| this sort of thing (and has the advantage that you wouldn't need
| to put git repos everywhere, or something silly like that).
|
| I often forget what I've named a pdf that I've downloaded, but
| usually I'll put something related to the topic of a paper in
| the file name, so a command like:
|
|   $ locate -i "*svd*.pdf"
|
| will have a decent chance of finding any papers about svd's that
| I have, and is more or less instantaneous.
|
| Although -- I think I did come across the locate command
| because I was searching for alternatives to 'find.' I cannot
| for the life of me remember find's correct argument order for
| the starting path and the filename.
| cb321 wrote:
| plocate [1] is even faster (but of course with either you need
| to re-scan the whole FS and rebuild the index to have an up-to-
| date view).
|
| [1] https://www.linuxuprising.com/2021/09/plocate-is-much-
| faster...
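|
| Usage mirrors mlocate: rebuild the index, then query; e.g.
| something like:
|
|   $ sudo updatedb
|   $ plocate -i '*svd*.pdf'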
| quicklime wrote:
| The problem with locate is that it requires the index db to be
| updated periodically (I think this happens daily by default?).
| For some use cases, especially those where I'm searching for
| files in a tree that I'm actively working in, this forces me to
| fall back to find (or maybe git ls-files now).
|
| I feel like it should be the job of the filesystem to maintain
| an index and incrementally update it whenever files are created
| or removed, but afaict no modern filesystem offers it.
| spicybright wrote:
| It really is strange we don't have that yet.
|
| We already pretend files are on the physical disk and sync them
| in the background. Adding smarter indexing to that shouldn't
| be a huge deal.
| bee_rider wrote:
| I suppose this is sort of a problem depending on how you use
| it. I find that I don't really need to locate things that
| I've recently used, because I already know where they are as
| a result of having recently used them. But, other people have
| different workflows of course (I can imagine, for example, if
| somebody's workflow involved creating lots of files, only
| some of which are immediately useful, they might want the
| ability to scan through them for particular outputs).
| ziml77 wrote:
| As mentioned elsewhere in these comments, NTFS actually does
| do that. There are a couple tools out there that take
| advantage of it like Everything Search and WizTree. I don't
| know why no open source filesystem has done the same thing.
| Karellen wrote:
| Sarcasm?
| cryptonector wrote:
| This is why I have a git alias `ls` for `ls-files`.
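|
| For anyone wanting the same, it's one line of config:
|
|   $ git config --global alias.ls ls-files
|   $ git ls    # now runs git ls-files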
| njharman wrote:
| > source code management system (SCM), it's main business[3]
| is not to help you list your files!
|
| This seems an utterly bizarre statement. What does the author
| imagine an SCM does? Many things, but listing the files that
| are the source it is managing is up there.
|
| Like saying an FPS is a game, so it's not about pushing
| triangles through the GPU.
|
| The larger purpose is facilitated by, and requires, the lesser
| purpose.
| mikewhy wrote:
| FYI to anyone looking to try this out: the `--others` flag, used
| to include untracked files, will also print out files included in
| your .gitignore. So if you have eg. a node_modules/ or venv/
| folder, all its contents will be listed.
|
| This is often unwanted noise. I haven't been able to find if
| there's some combination of flags that would get the desired
| behaviour, but it's been a while since I've messed around with
| this.
| xyzzy_plugh wrote:
| What's the desired behaviour you want? --cached --others
| --exclude-standard will show tracked and untracked but not
| ignored.
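|
| i.e., the full command for "tracked plus untracked, minus
| ignored" would be:
|
|   $ git ls-files --cached --others --exclude-standard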
| [deleted]
___________________________________________________________________
(page generated 2021-11-21 23:00 UTC)