[HN Gopher] Show HN: Dut - a fast Linux disk usage calculator
___________________________________________________________________
Show HN: Dut - a fast Linux disk usage calculator
"dut" is a disk usage calculator that I wrote a couple months ago
in C. It is multi-threaded, making it one of the fastest such
programs. It beats normal "du" in all cases, and beats all other
similar programs when Linux's caches are warm (so, not on the first
run). I wrote "dut" as a challenge to beat similar programs that I
used a lot, namely pdu[1] and dust[2]. "dut" displays a tree of
the biggest things under your current directory, and it also shows
the size of hard links under each directory. The hard-link
tallying was inspired by ncdu[3], but I don't like how unintuitive
the readout is. Anyone have ideas for a better format? There are
installation instructions in the README. dut is a single source
file, so you only need to download it, copy-paste the compiler
command, and then copy the binary somewhere on your path, like
/usr/local/bin.
I went through a few different approaches writing it, and you can
see most of them in the git history. At the core of the program is
a data structure that holds the directories that still need to be
traversed, and binary heaps to hold statted files and directories.
I started off using C++ std::queues with mutexes, but the
performance was awful, so I took it as a learning opportunity and
wrote all the data structures from scratch. That was the hardest
part of the program to get right. These are the other techniques I
used to improve performance:

* Using fstatat(2) with the parent directory's fd instead of
  lstat(2) with an absolute path. (10-15% performance increase)

* Using statx(2) instead of fstatat. (perf showed fstatat running
  statx code in the kernel.) (10% performance increase)

* Using getdents(2) to get directory contents instead of
  opendir/readdir/closedir. (also around 10%)

* Limiting inter-thread communication. I originally had fs-traversal
  results accumulated in a shared binary heap, but giving each thread
  its own binary heap and then merging them all at the end was faster.

I couldn't find any information online about fstatat and statx being
significantly faster than plain old stat, so maybe this info will
help someone in the future.

[1]: https://github.com/KSXGitHub/parallel-disk-usage
[2]: https://github.com/bootandy/dust
[3]: https://dev.yorhel.nl/doc/ncdu2, see "Shared Links"
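
As a rough illustration of the getdents/statx points above (a
minimal sketch, not dut's actual code; error handling and the
recursion into subdirectories via d_type == DT_DIR are elided):

    /* Read raw dirents with getdents64(2) and stat each entry
     * relative to the parent directory's fd with statx(2). */
    #define _GNU_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void scan(int dirfd) {
        char buf[16384];
        long n;
        while ((n = syscall(SYS_getdents64, dirfd, buf, sizeof buf)) > 0) {
            for (long off = 0; off < n;) {
                struct dirent64 *d = (struct dirent64 *)(buf + off);
                off += d->d_reclen;
                if (d->d_name[0] == '.' && (d->d_name[1] == '\0' ||
                    (d->d_name[1] == '.' && d->d_name[2] == '\0')))
                    continue; /* skip "." and ".." */
                struct statx stx;
                /* no absolute path needed: statx resolves d_name
                 * relative to dirfd, and we only request blocks */
                if (statx(dirfd, d->d_name, AT_SYMLINK_NOFOLLOW,
                          STATX_BLOCKS, &stx) == 0)
                    printf("%llu\t%s\n",
                           (unsigned long long)stx.stx_blocks * 512,
                           d->d_name);
            }
        }
    }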
Author : 201984
Score : 346 points
Date : 2024-07-10 23:29 UTC (23 hours ago)
(HTM) web link (codeberg.org)
(TXT) w3m dump (codeberg.org)
| kccqzy wrote:
| I'm away from my Linux machine now but I'm curious whether/how
| you handle reflinks. On a supported file system such as Btrfs
| which I use, how does `cp --reflink` get counted? Similar to
| hard links? I'm curious because I use this feature extensively.
| 201984 wrote:
| I've actually never heard of --reflink, so I had to look it up.
| `cp` from coreutils uses the FICLONE ioctl to clone the file on
| btrfs instead of copying it with regular syscalls.
|
| I don't handle them specifically in dut, so it will total up
| whatever statx(2) reports for any reflink files.
| vlovich123 wrote:
| You'll probably end up with dupes (and removing these files
| won't have the effect you intend), but I don't know that
| there's a good way to handle and report such reflinks
| anyway.
| namibj wrote:
| Btdu will be your friend.
| hsbauauvhabzb wrote:
| This looks handy. Do you have any tips for stuff like queued 'mv'
| or similar? If I'm moving data around on 3-4 drives, it's common
| that I'll stack commands where the 3rd command may free up space
| for the 4th to run successfully - I use && to ensure a halt on
| failure, but I need to mentally calculate the free space when I'm
| writing the commands, as the free space after the third mv will be
| different from the output of 'df' before any of the commands have
| run.
| 201984 wrote:
| I haven't run into a situation like that, but if I did, I'd be
| doing mental math like you. `dut` would only be useful as a
| (quicker) replacement for `du` for telling you how large the
| source of a `cp -r` is.
| timrichard wrote:
| Looks nice, although a feature I like in ncdu is the 'd' key to
| delete the currently highlighted file or directory.
| 201984 wrote:
| This isn't an interactive program, so ncdu would be better for
| interactively going around and freeing up space. If you just
| want an overview, though, then dut runs much quicker than ncdu
| and will show large files deep down in subdirectories without
| having to go down manually.
| notarealllama wrote:
| If this accurately shows hidden stuff, such as docker build cache
| and old kernels, then it will become my go-to!
| 201984 wrote:
| As long as it has permissions, it totals up everything under
| the directory you give it including names that start with a
| '.'. It won't follow symlinks though.
| nottorp wrote:
| Uh, even basic du "shows hidden stuff" accurately, doesn't it?
|
| dot files are just a convention on unix.
| jepler wrote:
| did you consider using io_uring? if not, was there a reason other
| than portability?
| 201984 wrote:
| io_uring doesn't support the getdents syscall, so there's no
| way to traverse the filesystem with it. I considered using it
| for statx(2) to get the disk usage of each file, but decided
| not to because (a) it would be complicated to mix normal
| syscalls and io_uring and (b) perf showed the kernel spending
| most of its time doing actual work and not syscall boilerplate.
| jepler wrote:
| I appreciate the explanation!
| nh2 wrote:
| Are you sure the perf output isn't misleading?
|
| E.g. memory accesses might show up as slower due to CPU
| caches being flushed when switching between user and kernel
| space.
|
| I would be extremely interested in a quick (standalone?)
| benchmark of e.g. 1M stats with vs without uring.
|
| Also https://github.com/tdanecker/iouring-getdents reports
| big uring speedups for getdents, which makes it surprising to
| get no speedups for stat.
|
| If uring turns out fast, you might ignore (a), just doing the
| getdents first and then all the stats afterwards, since getdents
| is a "batch" syscall covering many files anyway, but stat isn't.
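|
| A hedged sketch of that batching idea, assuming liburing is
| available (io_uring does provide an IORING_OP_STATX op); the
| names, dirfd, and error handling here are placeholders:
|
|     #include <fcntl.h>
|     #include <liburing.h>
|     #include <sys/stat.h>
|
|     /* Queue one statx SQE per name, submit them all at once,
|      * then reap one CQE per completion. */
|     void batch_statx(int dirfd, char *const names[],
|                      struct statx out[], unsigned n) {
|         struct io_uring ring;
|         io_uring_queue_init(n, &ring, 0);
|         for (unsigned i = 0; i < n; i++) {
|             struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
|             io_uring_prep_statx(sqe, dirfd, names[i],
|                                 AT_SYMLINK_NOFOLLOW,
|                                 STATX_BLOCKS, &out[i]);
|         }
|         io_uring_submit_and_wait(&ring, n);
|         for (unsigned i = 0; i < n; i++) {
|             struct io_uring_cqe *cqe;
|             io_uring_wait_cqe(&ring, &cqe);
|             io_uring_cqe_seen(&ring, cqe); /* real code checks cqe->res */
|         }
|         io_uring_queue_exit(&ring);
|     }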
| montroser wrote:
| Nice work. Sometimes I wonder if there's any way to trade away
| accuracy for speed? Like, often I don't care _exactly_ how many
| bytes the biggest user of space takes up, I just want to see some
| orders of magnitude.
|
| Maybe there could be an iterative breadth-first approach, where
| first you quickly identify and discard the small unimportant
| items, passing over anything that can't be counted quickly. Then
| with what's left you identify the smallest of those and discard,
| and then with what's left the smallest of those, and repeat and
| repeat. Each pass through, you get a higher resolution picture of
| which directories and files are using the most space, and you
| just wait until you have the level of detail you need, but you
| get to see the tally as it happens across the board. Does this
| exist?
| 201984 wrote:
| Thanks!
|
| What you described is a neat idea, but it's not possible with
| any degree of accuracy AFAIK. To give you a picture of the
| problem, calculating the disk usage of a directory requires
| calling statx(2) on every file in that directory, summing up
| the reported sizes, and then recursing into every subdirectory
| and starting over. The problem with doing a partial search is
| that all the data is at the leaves of the tree, so you'll miss
| some potentially very large files.
|
| Picture if your program only traversed the first, say, three
| levels of subdirectories to get a rough estimate. If there was
| a 1TB file down another level, your program would miss it
| completely and get a _very_ inaccurate estimate of the disk
| usage, so it wouldn't be useful at all for finding the biggest
| culprits. You have the same problem if you decide to stop
| counting after seeing N files, since file N+1 could be gigantic
| and you'd never know.
| montroser wrote:
| Yeah, maybe approximation is not really possible. But it
| still seems like if you could do say, up to 1000 stats per
| directory per pass, then running totals could be accumulated
| incrementally and reported along the way.
|
| So after just a second or two, you might be able to know with
| certainty that a bunch of small directories are small, and
| then that a handful of others are at least however big has
| been counted so far. And that could be all you need, or else
| you could wait longer to see how the bigger directories play
| out.
| geertj wrote:
| You would still have to getdents() everything, but this way
| you may indeed save on stat() operations, which access
| information that is stored separately on disk; eliminating
| these would likely help uncached runs.
|
| You could sample files in a directory or across directories
| to get an average file size and use the total number of
| files from getdents to estimate a total size. This does
| require you to know if a directory entry is a file or
| directory, which the d_type field gives you depending on
| the OS, file system and other factors. An average file size
| could also be obtained from statvfs().
|
| Another trick is based on the fact that the link count of a
| directory is 2 + the number of subdirectories. Once you
| have seen the corresponding number of subdirectories, you
| know that there are no more subdirectories you need to
| descend into. This could allow you to abort a getdents for
| a very large directory, using e.g. the directory size to
| estimate the total number of entries.
| olddustytrail wrote:
| > Another trick is based on the fact that the link count
| of a directory is 2 + the number of subdirectories.
|
| For anyone who doesn't know why this is: when you create a
| directory, it has 2 hard links to it, which are "dirname" and
| "dirname/.".
|
| When you add a new subdirectory, it adds one more link, which
| is "dirname/subdir/..".
|
| So each subdirectory adds one more to the original 2.
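|
| A tiny sketch of how a scanner could exploit this (a
| hypothetical helper, not from dut):
|
|     #include <stdbool.h>
|     #include <sys/stat.h>
|
|     /* A directory's st_nlink is 2 + its subdirectory count, so
|      * once we've seen that many subdirectories, every remaining
|      * entry must be a non-directory. */
|     bool more_subdirs_possible(const struct stat *dir_st,
|                                unsigned subdirs_seen) {
|         return subdirs_seen + 2 < dir_st->st_nlink;
|     }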
| BeeOnRope wrote:
| This seems difficult since I'm not aware of any way to get
| approximate file sizes, at least with the usual FS-agnostic
| system calls: to get any size info you are pretty much calling
| something in the `stat` family and at that point you have the
| exact size.
| fsckboy wrote:
| I thought files can be sparse and have holes in the middle
| where nothing is allocated, so the file size is not what is
| used to calculate usage, it's the sum of the extents or some
| such.
| BeeOnRope wrote:
| Yes, files can be sparse but the actual disk usage
| information is also returned by these stat-family calls, so
| there is no special cost to handling sparse files.
| mos_basik wrote:
| Something like that exists for btrfs; it's called btdu. It has
| the accuracy/time trade-off you're interested in, but the
| implementation is quite different. It samples random points on
| the disk and finds out what file path they belong to. The
| longer it runs the more accurate it gets. The readme is good at
| explaining why this approach makes sense for btrfs and what its
| limitations are.
|
| https://github.com/CyberShadow/btdu
| jszymborski wrote:
| That's so cool.
| renewiltord wrote:
| Damn, `ext4` is organized entirely differently. You can't get
| anything useful from:
|
|     sudo debugfs -R "icheck $RANDOM" /dev/nvme1
|     sudo debugfs -R "ncheck $res" /dev/nvme1
|
| and recursing. That's a clever technique given btrfs structs.
| oehpr wrote:
| That is so cool!!! I have always wanted something like this!
| Arg I wish other filesystems supported a strategy like this!
| lenkite wrote:
| Wish modern filesystems maintained usage per dir as a directory
| file attribute instead of mandating tools to do this basic job.
| hsbauauvhabzb wrote:
| This is an excellent point and I wholeheartedly agree!
| masklinn wrote:
| Is it? That would require any update to any file to cascade
| into a bunch of directory updates, amplifying the write, and
| for what? Do you "du" in your shell prompt?
|
| Not to mention it would likely be unable to handle the
| hardlink problem so it would consistently be wrong.
| IsTom wrote:
| > That would require any update to any file to cascade
| into a bunch of directory updates amplifying the write
| and for what?
|
| You can be a little lazy about updating parents this way and
| have O(1) updates and O(1) amortized reads with O(n) worst
| case (same as now anyway).
| hsbauauvhabzb wrote:
| Disks have improved in I/O and write speed metrics
| substantially, to the point where Windows will literally
| index your file system so you can search faster, and
| antivirus will scan files in the background before you
| open them. I don't think maintaining size state on
| directories would be all that much of a challenge.
| Someone wrote:
| I expect performance would suffer quite a lot. In a
| system with high I/O, there would be a lot of contention
| on updating the size of such directories as _/home_ or
| _/tmp_, let alone _/_.
|
| Also, are you going to update a file's size for every
| write (could easily be a thousand times if you're copying
| over a 10MB file) or are you going to coalesce updates to
| file sizes? If the latter, how do you recover after a
| crash?
|
| Virtual directories such as _/dev_ and _/proc_ would
| require special-casing.
|
| Mounting and unmounting disks probably would require
| special-casing.
| nh2 wrote:
| CephFS does that.
|
| You can use getfattr to ask it for the recursive number of
| entries or bytes in a given directory.
|
| Querying it is constant time; updates propagate with a few
| seconds' delay.
|
| Extremely useful when you have billions of files on spinning
| disks, where running du/ncdu would take a month just for the
| stat()s.
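|
| For example (a sketch, assuming a CephFS mount; the recursive
| byte count is exposed as the virtual xattr "ceph.dir.rbytes"):
|
|     #include <stdlib.h>
|     #include <sys/types.h>
|     #include <sys/xattr.h>
|
|     /* Returns the recursive size in bytes, or -1 if the xattr
|      * is unavailable (e.g. not a CephFS directory). */
|     long long cephfs_rbytes(const char *dir) {
|         char buf[32] = {0};
|         ssize_t n = getxattr(dir, "ceph.dir.rbytes",
|                              buf, sizeof buf - 1);
|         return n > 0 ? atoll(buf) : -1;
|     }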
| chomp wrote:
| GPLv3, you love to see it. Great work.
| jeffbee wrote:
| It's also interesting that the perf report for running dut on my
| homedir shows that it spends virtually all of the time looking
| for, not finding, and inserting entries in dentry cache slabs,
| where the entries are never found again, only inserted :-/ Great
| cache management by the kernel there.
|
| ETA: Apparently the value in /proc/sys/vm/vfs_cache_pressure
| makes a _huge_ difference. With the default of 100, my dentry and
| inode caches never grow large enough to contain the ~15M entries
| in my homedir. Dentry slabs get reclaimed to stay < 1% of system
| RAM, while the xfs_inode slab cache grows to the correct size.
| The threads in dut are pointless in this case because the access
| to the xfs inodes serializes.
|
| If I set this kernel param to 15, then the caches grow to
| accommodate the tens of millions of inodes in my homedir.
| Ultimately the slab caches occupy 20GB of RAM! When the caches
| are working the threading in dut is moderately effective, job
| finishes in 5s with 200% CPU time.
| 201984 wrote:
| Are you referring to the kmem_cache_alloc calls in the profile?
| If so, that's all in kernel space and there's nothing I can do
| about it.
|
| https://share.firefox.dev/3XT9L7P
| jeffbee wrote:
| No, see how your profiles have `lookup_fast` at the leaves?
| Mine has `__lookup_slow` and it is slow indeed.
| 201984 wrote:
| I just saw your edit. You have WAY more stuff under your
| home directory than I do. I only have ~2.5M inodes on both
| my laptop drives combined. The difference in the buff/cache
| output of `free` before and after running `dut` is only 1
| GB for me.
|
| Also, TIL about that kernel parameter, thanks!
| jeffbee wrote:
| Yeah I have a TB of bazel outputs in my cache directory.
| Unfortunately automatically deleting old bazel outputs is
| beyond the frontier of computer science and has been
| pushed out to future releases for 6 years and still
| going: https://github.com/bazelbuild/bazel/issues/5139
| tambourine_man wrote:
| > https://dev.yorhel.nl/doc/ncdu2
|
| I wasn't aware that there was a rewrite of ncdu in Zig. That link
| is a nice read.
| jftuga wrote:
| You should include the "How to build" instructions near the
| beginning of the main.c file.
| 201984 wrote:
| Done
| teamspirit wrote:
| Nice job. I've been using dua[0] and have found it to be quite
| fast on my MacBook Pro. I'm interested to see how this compares.
|
| [0] https://github.com/Byron/dua-cli
| 201984 wrote:
| I benchmarked against dua while developing, and the results are
| in the README. Note that dut uses Linux-specific syscalls, so
| it won't run on MacOS.
|
| TL;DR: dut is 3x faster with warm caches, slightly faster on
| SSD, slightly slower on HDD.
| IAmLiterallyAB wrote:
| I'm surprised statx was that much faster than fstatat. fstatat
| looks like a very thin wrapper around statx, it just calls
| vfs_statx and copies out the result to user space.
| 201984 wrote:
| Out of curiosity, I switched it back to fstatat and compared,
| and found no significant difference. Must've been some other
| change I made at the time, although I could've sworn this was
| true. Could be a system update changed something in the three
| months since I did that. I can't edit my post now though, so
| that wrong info is stuck there.
| jmakov wrote:
| Would be great to have a TUI interface for browsing like ncdu.
| jonhohle wrote:
| Did you consider the fts[0] family of functions for traversal? I
| use that along with a work queue for filtered entries to get
| pretty good performance with dedup[1]. For my use case I could
| avoid any separate stat call altogether, the FTSENT already
| provided everything I needed.
|
| 0 - https://linux.die.net/man/3/fts_read
|
| 1 - https://github.com/ttkb-oss/dedup/blob/6a906db5a940df71deb4f...
| nh2 wrote:
| fts is just wrapper functions.
|
| You cannot get around getdents and the stat family of syscalls
| on Linux if you need file sizes.
| 201984 wrote:
| Those are single threaded, so they would have kneecapped
| performance pretty badly. 'du' from coreutils uses them, and
| you can see the drastic speed difference between that and my
| program in the README.
| dima55 wrote:
| Ideas for a better format: do what xdiskusage does.
| metadat wrote:
| What specifically do you feel xdiskusage does well?
| dima55 wrote:
| It graphically displays the relative sizes of things, and
| allows you to interactively zoom into any particular
| subdirectory to see the relative sizes of the things inside
| it.
| shellfishgene wrote:
| What I need is a du that caches the results somewhere and then
| does not rescan the 90% of dirs that have not changed when I run
| it again a month later...
| kstrauser wrote:
| And it would know they did not change without scanning them
| because how?
| svpg wrote:
| It could hash the contents of a dir, along the lines of git.
| Galanwe wrote:
| Except hashing requires... reading.
|
| There is not much to be done here. Directory entries are
| just names, with no guarantee that the files were not modified
| or replaced.
|
| The best you could do is something similar to the
| strategies of rsync: rely on metadata (modified date, etc.)
| and cross your fingers that nobody did `cp -a`.
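|
| A minimal sketch of that metadata check (hypothetical cache
| record; note a directory's mtime only changes when entries are
| added or removed, hence the crossed fingers):
|
|     #include <stdbool.h>
|     #include <sys/stat.h>
|
|     struct cache_rec {
|         struct timespec mtime; /* mtime at the time of the scan */
|         long long bytes;       /* cached recursive size */
|     };
|
|     /* Trust the cached size only if the mtime is unchanged. */
|     bool cache_still_valid(const struct stat *st,
|                            const struct cache_rec *rec) {
|         return st->st_mtim.tv_sec == rec->mtime.tv_sec &&
|                st->st_mtim.tv_nsec == rec->mtime.tv_nsec;
|     }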
| shellfishgene wrote:
| I would be fine with the latter, the program could
| display a warning like "Results may be inaccurate, full
| scan required" or something.
|
| I guess I'm just annoyed that for Windows/NTFS really
| fast programs are available but not for Linux
| filesystems.
| legends2k wrote:
| And to hash something you need to read all of its data. I
| think reading just the file size would actually be faster in
| some file systems and never slower with any.
| mort96 wrote:
| Faster in all file systems, I'd guess: stat is fast;
| opening the file, reading its contents, and updating a
| checksum is slow, and gets slower the larger the file is.
| shellfishgene wrote:
| Maybe it could run in the background and use inotify to just
| update the database all the time, or at least keep track of
| what needs rescanning?
| shellfishgene wrote:
| Thinking about this some more, does this system not already
| exist for the disk quota calculation in the kernel? How
| does that work? Would it be possible for a tool to scan the
| disk once, and then get information about file
| modifications from the system that's used to update quota
| info?
| anon-3988 wrote:
| I have been using diskonaut; it's fast enough given that it also
| produces a nice visual output.
| rafaelgoncalves wrote:
| neat tool, congrats on the release and thank you for this and the
| analysis/comparison.
| mg wrote:
| I have this in my bashrc:
|
|     alias duwim='du --apparent-size -c -s -B1048576 * | sort -g'
|
| It produces a similar output, showing a list of directories and
| their sizes under the current dir.
|
| The name "duwim" stands for "du what I mean". It came naturally
| after I dabbled for quite a while to figure out how to make du do
| what I mean.
| mixmastamyk wrote:
| Not as featureful, but what I've been using. If you can't install
| this tool for some reason, it's still useful. I call it usage:
|     #!/bin/bash
|     du -hs * .??* 2> /dev/null | sort -h | tail -22
| tamimio wrote:
| Someone, please create a Gdut, a fork that will produce graphs
| for a quick and easy read; it's almost impossible to read the
| output on small vertical screens.
| sandreas wrote:
| Nice work! There is also gdu[1], where the UI is heavily inspired
| by ncdu and somehow feels way faster...
|
| 1: https://github.com/dundee/gdu
| nh2 wrote:
| > I don't know why one ordering is better than the other, but the
| difference is pretty drastic.
|
| I have the suspicion that some file systems store stat info next
| to the getdents entries.
|
| Thus cache locality would kick in if you stat a file after
| receiving it via getdents (and counterintuitively, smaller
| getdents buffers make it faster then). Also in such cases it
| would be important to not sort combined getdents outputs before
| statting them (which would destroy the locality again).
|
| I found such a situation with CephFS but don't know what the
| layout is for common local file systems.
| miew wrote:
| Why C and not Rust or even Zig?
| Galanwe wrote:
| Why Rust or Zig and not C?
| nottorp wrote:
| Better for the author's resume, if they want to make it hype
| driven.
|
| Also some nebulous "being more secure". Never mind that this
| tool does not have elevated privileges. You gotta watch out
| for those remote root exploits even for a local only app,
| man.
| _flux wrote:
| I mean you could extract an archive you've downloaded to
| your filesystem and said archive could have funky file
| names and then you use this tool..
|
| But I suppose it's not a very likely bug to have in this
| kind of tool.
| nottorp wrote:
| Or "they" could enter your house at 2 am, drug you and
| hit you with a $5 wrench until they get access to your
| files :)
| gsck wrote:
| Never understood this line of thought. Not everything
| needs to be super secure. Not everything is going to be
| an attack vector. No one is going to deploy this onto a
| production server where this program specifically is
| going to be the attack vector.
|
| Memory safety is cool and all, but a program that
| effectively sums a bunch of numbers together isn't going
| to cause issues. Worst case the program segfaults
| craftoman wrote:
| Because C is like a cult Toyota Supra with twin turbo and Rust
| or Zig is like another cool boring Corvette roadster.
| 201984 wrote:
| Because (a) I felt like it, (b) it would be more difficult to
| make raw syscalls in Rust, and (c) I use[1] flexible array
| members to minimize allocations, whereas Rust support for those
| is quite bad[2]. It doesn't even look possible to allocate one
| with a size known only at runtime.
|
| [1]:
| https://codeberg.org/201984/dut/src/branch/master/main.c#L16...
|
| [2]: https://doc.rust-lang.org/nomicon/exotic-sizes.html#dynamica...
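|
| For reference, the flexible-array-member pattern being described
| looks like this (an illustrative sketch, not dut's actual struct):
|
|     #include <stdlib.h>
|     #include <string.h>
|
|     struct entry {
|         struct entry *next;
|         size_t name_len;
|         char name[]; /* C99 flexible array member */
|     };
|
|     /* One malloc holds both the struct and the runtime-sized
|      * name, avoiding a second allocation per entry. */
|     struct entry *entry_new(const char *name) {
|         size_t len = strlen(name);
|         struct entry *e = malloc(sizeof *e + len + 1);
|         if (!e) return NULL;
|         e->next = NULL;
|         e->name_len = len;
|         memcpy(e->name, name, len + 1);
|         return e;
|     }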
| estebank wrote:
| > it would be more difficult to make raw syscalls in Rust
|
| Would you like to expand on this? Is it because of type
| conversions that you'd have to do?
|
| > I use[1] flexible array members to minimize allocations
|
| I was under the impression that FAM was a non-standard
| extension, but alas it is part of C99.
|
| From what I'm seeing you have an intrusive list where each
| `entry` points to the previous and next element, and where the
| path itself is a bag of bytes as part of the entry itself.
| I'm assuming that what you'd want from Rust is something akin
| to the following when it comes to the path?
|
|     struct Entry<const NAME_LEN: usize> {
|         ..
|         mode: Mode,
|         name: InlineName<NAME_LEN>,
|     }
|
|     struct InlineName<const NAME_LEN: usize> {
|         value: [u8; NAME_LEN],
|     }
| 201984 wrote:
| For syscalls, I would have needed to either pull in
| dependencies or write FFI bindings, and neither of those
| options is appealing when I could simply write the program
| in Linux's native language.
|
| For the FAM, your example looks like it requires a compile-
| time constant size. That's the same as hardcoding an array
| size in the struct, defeating the whole point. Short names
| will waste space, and long ones will get truncated.
| estebank wrote:
| > For the FAM, your example looks like it requires a
| compile-time constant size. That's the same as hardcoding
| an array size in the struct, defeating the whole point.
| Short names will waste space, and long ones will get
| truncated.
|
| You made me realize that the const generics stabilization
| work hasn't advanced enough to do what I was proposing
| (at least not in as straightforward a way):
| https://play.rust-lang.org/?version=nightly&mode=debug&editi...
|
| Those are const arguments, not const _values_, which
| means that you can operate on values _as if_ they didn't
| have a size, while the compiler does keep track of the
| size throughout.
|
| I'd have to take a more detailed look to see if this
| level of dynamism is _enough_ to work with runtime
| provided strings, which they might not be, unless you
| started encoding things like "take every CStr, create a
| StackString<1> for each char, and then add them
| together".
| estebank wrote:
| This is not an appropriate question to ask, though one I see
| sometimes in these threads. "Because the author wanted to" is a
| good enough reason for them to write a program in C. It being a
| new project written in C can also be a good enough reason for you
| not to use it: dust already exists and is written in Rust,
| which you can use instead.
| trustno2 wrote:
| Does it depend on Linux functionality or can I use it on macOS?
|
| Well I can just try :)
| Ringz wrote:
| From the author:
|
| "Note that dut uses Linux-specific syscalls, so it won't run
| on MacOS."
| INTPenis wrote:
| Reminds me of someone's script I have been using for over a
| decade.
|
|     #!/bin/sh
|     du -k --max-depth=1 "$@" | sort -nr | awk '
|     BEGIN {
|         split("KB,MB,GB,TB", Units, ",");
|     }
|     {
|         u = 1;
|         while ($1 >= 1024) {
|             $1 = $1 / 1024;
|             u += 1
|         }
|         $1 = sprintf("%.1f %s", $1, Units[u]);
|         print $0;
|     }
|     '
| mshook wrote:
| I don't understand the point of the script, it's nothing more
| than:
|
|     du -h --max-depth=1 "$@" | sort -hr
| out-of-ideas wrote:
| `-h` is not available in all `sort` implementations
| BlackLotus89 wrote:
| Even the busybox port has it. The only sort implementation
| I know of that doesn't have -h is toybox (I guess older
| busybox implementations are missing it as well), but I've
| been using -h for well over a decade and have seldom had it
| missing.
| dotancohen wrote:
| The point is that it is faster.
| zamadatix wrote:
| A bash script post-processing the sort output is certainly
| slower than just having sort do it correctly in the first
| place.
| INTPenis wrote:
| I found this online a long time ago, and it's been with me
| across BSD, Macintosh and Linux. So I can't say why it is
| that way, and I didn't know about sort -h before today.
| matzf wrote:
| Any particular reason for doing the human readable units
| "manually"? `du -h | sort -h` works just fine.
| kelahcim wrote:
| I will definitely try this one and compare with my daily stuff:
|
|     du -s -k * | sort -r -n -k1,1 -t" "
| laixintao wrote:
| > Anyone have ideas for a better format?
|
| Hi, how about a flamegraph? I've always wanted to display the
| file hierarchy in a flamegraph-like format.
|
| - previous discussion:
| https://x.com/laixintao/status/1744012609983295816
|
| - my project that displays flamegraphs in the terminal:
| https://github.com/laixintao/flameshow
| robocat wrote:
| > but I don't like how unintuitive the readout is
|
| The best disk usage UI I ever saw was this one:
| https://www.trishtech.com/2013/10/scanner-display-hard-disk-...
| The inner circle is the top-level directories, and each ring
| outwards is one level deeper in the directory hierarchy. You
| would mouse over large subdirectories to see what they were, or
| double click to drilldown into a subdirectory. Download it and
| try it - it is quite spectacularly useful on Windows (although
| I'm not sure how well it handles Terabyte size drives - I haven't
| used Windows for a long time).
|
| Hard to do a circular graph in a terminal...
|
| It is very similar to a flame graph? Perhaps look at how flame
| graphs are drawn by other terminal performance tools.
| pricechild wrote:
| "Disk Usage Analyser" / "Baobab" on Linux is awesome with the
| same UI: https://apps.gnome.org/en-GB/Baobab/
| bmicraft wrote:
| Also Filelight (KDE)
| haskman wrote:
| And Filelight from KDE - https://apps.kde.org/filelight/
| rlue wrote:
| That's called a ring chart or a sunburst chart.
| krackers wrote:
| DaisyDisk on mac does that. Also it's blazing fast, it seems to
| even beat "du" so I don't know what tricks they're pulling.
| supernes wrote:
| I think they're reading some of the info from Spotlight
| metadata already collected by the OS for indexing, but I
| could be wrong.
| radicality wrote:
| That's probably it. It's likely powered by whatever thing
| gives you quick directory sizes in Finder after you do View
| Options (cmd+j), and select "Show All Sizes". I have that
| setting always on for all directories and pretty sure it's
| cached as it's fast.
| OskarS wrote:
| I've used graphical tools very similar to this, and I always
| come back to this:
|
|     du -h | sort -rh | less
|
| (might have to sudo that du depending on current folder. on
| macos, use gsort)
|
| You just immediately see exactly what you need to delete, and
| you can so quickly scan the list. I'm not a terminal die-hard
| "use it for everything" kinda guy, I like GUIs for lots of
| stuff. But when it comes to "what's taking up all my space?"
| this is honestly the best solution I've found.
| pnutjam wrote:
| I like to use:
|
|     du -shx *
|
| I used to pipe that to:
|
|     | grep G
|
| to find anything gig-sized, but I like your:
|
|     | sort -rh
|
| Thanks!
| ajnin wrote:
| I don't like radial charts because the outer rings have a
| larger area, which makes it look like files deep in the
| hierarchy take more space than they really do. It also leaves
| the majority of screen space unused.
|
| I prefer the more classic treemap view, my personal favorite
| being the classic version of SpaceMonger but it's Windows only
| and very slow.
| roelschroeven wrote:
| WizTree also uses a treemap view and is very fast. It's also
| Windows-only though.
| entropicdrifter wrote:
| Linux and MacOS have QDirStat:
| https://github.com/shundhammer/qdirstat
| kreyenborgi wrote:
| duc http://duc.zevv.nl/ does this
| Neil44 wrote:
| On Windows I always used to use Windirstat but it was slow, then
| I found Wiztree which is many orders of magnitude faster. I
| understand it works by directly reading the NTFS tables rather
| than spidering through the directories laboriously. I wonder if
| this approach would work for ext4 or whatever.
| Filligree wrote:
| NTFS is pointlessly slow, so bypassing the VFS provides a
| decent speedup in exchange for the ridiculous fragility.
|
| Linux doesn't have the same issue, and I'd be quite concerned
| if an application like this needed root access to function.
| luma wrote:
| I think you underestimate how much of a speedup we're talking
| about: it can pull in the entire filesystem in a couple of
| seconds on a multi-TB volume with billions of files. I have yet
| to see anything in the Linux world (including the OP) that comes
| anywhere near this performance level via tree walking.
| ablob wrote:
| I want to take this opportunity to recommend the talk "NTFS
| isn't that bad"
| (https://www.youtube.com/watch?v=qbKGw8MQ0i8). NTFS prefers a
| different access pattern than most usual file systems. I
| remember that a part of the talk was about speed-ups on Linux
| as well. So even if it doesn't sway your opinion it should
| enhance your perspective on how file systems work.
| Gormo wrote:
| If you do like WinDirStat, there's a good Linux equivalent
| called QDirStat: https://github.com/shundhammer/qdirstat
| fsfod wrote:
| There is a fork of WinDirStat that also reads the NTFS MFT:
| https://github.com/ariccio/altWinDirStat
| utensil4778 wrote:
| > it works by directly reading the NTFS tables rather than
| spidering through the directories
|
| Maybe I'm just ignorant of Linux filesystems, but this seems
| like the obvious thing to do. Do ext and friends not have a
| file table like this?
| tiku wrote:
| Ncdu is easy to remember and use, clicking through etc. It would
| be cool to find a faster replacement with the same usage, instead
| of a new tool with parameters to remember...
| bArray wrote:
| I think that 'ls' should also be evaluating the size of the files
| contained within. The size and number of contained files/folders
| really does reveal a lot about the contents of a directory
| without peeking inside. The speed of this is what would be most
| concerning though.
| frumiousirc wrote:
| dut looks very nice.
|
| One small surprise I found came when I have a symlink to a
| directory and refer to that with a trailing "/". dut doesn't
| follow the link in order to scan the real directory. I.e. I have
| this symlink:
|
|     ln -s /big/disk/dev ~/dev
|
| then
|
|     ./dut ~/dev/
|
| returns zero size while
|
|     du -sh ~/dev/
|
| returns the full size.
|
| I'm not sure how widespread this convention is to resolve
| symlinks to their target directories if named with a trailing "/"
| but it's one my fingers have memorized.
|
| In any case, this is another tool for my toolbox. Thank you for
| sharing it.
| classified wrote:
| I get boatloads of "undefined reference" errors. Where's the list
| of dependencies?
| 201984 wrote:
| The only dependency is a recent Linux C standard library. What
| are the missing symbols? On older versions of glibc you do have
| to add -pthread.
| israrkhan wrote:
| The author should have included a Makefile. You need to add
| -lpthread to the command provided in README.md.
| imiric wrote:
| I've been using my own function with `du` for ages now, similar
| to others here, but I appreciate new tools in this space.
|
| I gave `dut` a try, but I'm confused by its output. For example:
|     3.2G    0B |- .pyenv
|     3.4G    0B |  /- toolchains
|     3.4G    0B |- .rustup
|     4.0G    0B |  |- <censored>
|     4.4G    0B |  /- <censored>
|     9.2G    0B |- Work
|     3.7G    0B |  /- flash
|     3.8G    0B |  /- <censored>
|      16G  4.0K |- Downloads
|     5.1G    0B |  |- <censored>
|     5.2G    0B |  /- <censored>
|      16G    0B |- Projects
|     3.2G   42M |  /- <censored>
|      17G  183M |- src
|      17G    0B |  /- <censored>
|      17G    0B |- Videos
|     3.7G    0B |  /- Videos
|      28G    0B |- Music
|     6.9G    0B |  |- tmp
|     3.4G    0B |  |  /- tmp
|     8.8G    0B |  |- go
|     3.6G    0B |  |  /- .versions
|     3.9G    0B |  |  |- go
|     8.5G    0B |  |  |  /- dir
|     8.5G    0B |  |  |  /- vfs
|     8.5G    0B |  |  |  /- storage
|     8.5G    0B |  |  /- containers
|      15G  140M |  /- share
|      34G  183M /- .local
|     161G    0B .
|
| - I expected the output to be sorted by the first column, yet
| some items are clearly out of order. I don't use hard links much,
| so I wouldn't expect this to be because of shared data.
|
| - The tree rendering is very confusing. Some directories are
| several levels deep, but in this output they're all jumbled, so
| it's not clear where they exist on disk. Showing the full path
| with the `-p` option, and removing indentation with `-i 0`
| somewhat helps, but I would almost remove tree rendering
| entirely.
| 201984 wrote:
| It is being sorted by the first column, but it also keeps
| subdirectories with each other. Look at the order of your top-
| level directories:
|
|     3.2G    0B |- .pyenv
|     3.4G    0B |- .rustup
|     9.2G    0B |- Work
|      16G  4.0K |- Downloads
|      17G  183M |- src
|      28G    0B |- Music
|      34G  183M /- .local
|
| If you don't want the tree output and only want the top
| directories, you can use `-d 1` to limit to depth=1.
| imiric wrote:
| Ah, I see, that makes sense.
|
| But still, the second `Videos` directory of 3.7G is a
| subdirectory of `Music`, so it should appear below it, no?
| Same for the two `tmp` directories, they're subdirectories of
| `.local`, so I would expect them to be listed under it. Right
| now there doesn't seem to be a clear order in either case.
| bitwize wrote:
| Neat, a new C program! I get a little frisson of good vibes
| whenever someone announces a new project in C, as opposed to Rust
| or Python or Go. Even though C is pretty much a lost cause at
| this point. It looks like it has some real sophisticated
| performance optimizations going on too.
| inbetween wrote:
| I often want to know where there is sudden growth in disk usage
| over the last month/week/etc, what suddenly takes up space. In
| those cases I find myself wishing that du and friends would cache
| their last few runs and would offer a diff against them, thus
| easily listing the new disk-eating files or directories. Could
| dut evolve to do something like that?
| beepbooptheory wrote:
|     du[t] > .disk-usage-"`date +"%d-%m-%Y"`"
|
| And then use diff later?
| inbetween wrote:
| Almost all of them will have some difference. What is needed
| is to parse the previous state, calculate the difference in
| size, and show only the "significant" differences.
| thesh4d0w wrote:
| gt5 does this - https://gt5.sourceforge.net/
| inbetween wrote:
| That looks perfect, thanks!
| namibj wrote:
| btdu's extra (or expert?) mode with snapshots kinda does that:
| you can see what's only in the new version and not in a
| snapshot, and vice-versa. Also it offers attributing size to
| folders only for extents that aren't shared with a different
| folder (snapshots are essentially just special folders), to
| kinda get a diff between the two (stuff only present in the old
| snapshot is shown there; stuff only present in the new version
| is shown there).
| kseistrup wrote:
| Now available from AUR: https://aur.archlinux.org/packages/dut-git
| tonymet wrote:
| great app. very fast at scanning nested dirs. I often need
| recursive disk usage when I suddenly run out of space and
| scramble to clean up while everything is crashing.
| cycomanic wrote:
| This looks awesome!
|
| One comment: I find the benchmark results really cumbersome to
| read. Why not make a graph (e.g. a bar plot) that would make the
| results obvious at a quick glance? I'm a strong believer in
| presenting numerical data graphically whenever possible; it
| avoids many mistakes and misunderstandings.
| oigursh wrote:
| ncdu has been my go-to for years. Pleased to have a modern
| alternative.
| jbaber wrote:
| I always want treemaps.
|
| console:
|     rust: cargo install diskonaut
|     python: pip install ohmu
| GUI: gdmap
| windows: windirstat
| mac: grand perspective (I seem to recall)
___________________________________________________________________
(page generated 2024-07-11 23:01 UTC)