[HN Gopher] Show HN: Filesystem Watcher
       ___________________________________________________________________
        
       Show HN: Filesystem Watcher
        
       An arbitrary filesystem event watcher which is:
       - simple
       - efficient
       - dependency free
       - runnable anywhere with a filesystem
       - header only

       Watcher is extremely efficient. In most cases, even when scanning
       millions of paths, this library uses a near-zero amount of
       resources.

       Watcher is simple. The library exposes a single function and a
       single object. That is all.

       Happy hacking.
        
       Author : e-dant
       Score  : 81 points
       Date   : 2022-10-18 13:54 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mikulas_florek wrote:
       | Looking at Win32, it scans the whole directory periodically,
       | right? I must be missing something, but how can that be called
       | efficient?
        
         | MichaelCollins wrote:
         | I don't know about efficient, but at least it sounds reliable.
         | I'm at my wit's end with trying to figure out why KDE's Dolphin
         | can't reliably watch a directory for new files, frequently (but
         | not always) forcing me to F5 to see new files.
        
         | rasz wrote:
         | Wait, it doesn't hook the OS file handling routines? It actually
         | manually rescans the filesystem?
        
           | e-dant wrote:
           | It does one or the other. There are concerns about OS
           | filesystem event hooks.
           | 
           | The current solution isn't ideal, and is being addressed
           | here: https://github.com/e-dant/watcher/issues/10
        
             | mikulas_florek wrote:
             | Not on Windows (the only platform I checked/care about).
        
         | e-dant wrote:
         | It's efficient because it beats kqueue while reporting events
         | accurately.
         | 
         | A proper benchmarking program is in the works. However, manual
         | testing does show only minimal resource usage.
         | 
         | For more, see this issue:
         | https://github.com/e-dant/watcher/issues/10
        
           | mikulas_florek wrote:
           | I only mentioned Win32. There this library is very
           | inefficient compared to ReadDirectoryChangesW, which consumes
           | no CPU time when nothing changes.
        
         | dwringer wrote:
         | See here: https://news.ycombinator.com/item?id=33247735
        
       | bigmattystyles wrote:
       | When I last tried to implement this, by far the toughest part was
       | making sure the file that's been newly detected is done being
       | written to. On NTFS I couldn't find a good technique; even last
       | modified time was not reliable. I had to watch it for changes
       | myself.
        
         | nomel wrote:
         | Is "last modified" the time of the beginning of the write?
        
           | bigmattystyles wrote:
           | Never even thought of that; I don't know. I assumed it was
           | when a write was done. Whatever that means, I don't know
           | either.
        
             | 0xJRS wrote:
             | Would this mean that fs event based antivirus scanners
             | could be side-stepped by writing a payload to a file and
             | then never closing the handle?
        
         | infogulch wrote:
         | I've done this by watching the NTFS journal, which is
         | surprisingly efficient. First I scanned the whole journal for
         | filesystem metadata and dumped it into a SQLite database (which
         | took about a minute), then kept it up to date, which took
         | virtually no resources. This was an absurdly faster way to
         | search by file name: a search across the whole FS came back in
         | milliseconds instead of Explorer's multiple minutes.
        
       | roeles wrote:
       | Should this work over NFS or SMB?
        
       | jonhohle wrote:
       | How does this compare to something like famd(8) and if it's a
       | marked improvement, could the techniques used here be backported
       | to FAM?
        
       | rasz wrote:
       | Looks like it does exactly what you can get out of Everything
       | (https://www.voidtools.com) Index Journal
       | 
       | https://www.voidtools.com/forum/viewtopic.php?t=9792
       | 
       | but programmatically, and it's highly scriptable. Pretty cool. Will
       | definitely add it to my arsenal of troubleshooting tools.
       | 
       | Edit: never mind, this tool is manually scanning the filesystem
       | instead of listening to OS events
       | https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...
        
         | e-dant wrote:
         | It's being addressed here:
         | 
         | https://github.com/e-dant/watcher/issues/10
        
         | mellosouls wrote:
         | Voidtools gives security warnings in my browser, fwiw.
        
           | rasz wrote:
           | Of course it does, it's competing with Microsoft by providing
           | actually working instant local search.
        
             | mellosouls wrote:
             | My _chrome_ browser.
        
               | rasz wrote:
               | > instant local search
               | 
               | still checks out :) But I checked just to be sure and no
               | warning in Version 106.0.5249.91 (Official Build)
               | (32-bit). Maybe it's your corporate baby content web
               | filter?
        
               | mellosouls wrote:
               | No, it's a certificate issue; probably minor, but a
               | server-side thing to attend to by the looks of it.
        
               | rasz wrote:
               | There are no certificate issues in my Chrome. Are you
               | sure you're not on some MITMing VPN?
        
         | who23 wrote:
         | For clarity, the author did say in another comment that for
         | Windows they plan to implement system API calls to watch files
         | instead of manually scanning the filesystem. For macOS and
         | Linux it is listening to OS events.
        
           | rasz wrote:
           | Everything uses https://en.wikipedia.org/wiki/USN_Journal for
           | fast NTFS monitoring.
        
       | cmovq wrote:
       | Any reason for making delay_ms a template parameter? A compiler
       | should be able to optimize passing a constant as a regular
       | function argument. And if it's not optimized I assume a variable
       | delay wouldn't affect much?
        
         | e-dant wrote:
         | No perfectly good reason. I will look into that before version
         | 1.
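
The trade-off cmovq raises can be sketched as follows. The function names here are hypothetical and are not the library's API; the point is only that a template parameter fixes the delay at compile time, while a regular argument keeps one function and lets the delay come from run-time configuration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical names for illustration; these are not the library's API.

// Delay fixed at compile time. Each distinct value instantiates a separate
// function, and the value cannot be chosen at run time.
template <long long delay_ms>
constexpr std::chrono::milliseconds delay_as_template() {
    static_assert(delay_ms >= 0, "delay must be non-negative");
    return std::chrono::milliseconds{delay_ms};
}

// Delay passed as a regular argument. There is one function, and the value
// can be read from configuration. When the argument is a constant, optimizing
// compilers can usually inline the call and fold the constant anyway.
inline std::chrono::milliseconds delay_as_argument(long long delay_ms) {
    assert(delay_ms >= 0);
    return std::chrono::milliseconds{delay_ms};
}
```

Both forms yield the same duration for the same value, so the choice mostly affects API flexibility and the number of template instantiations rather than the cost of a polling loop.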
        
       | diffxx wrote:
       | I have personally written a similar tool and I am very curious
       | about how this could be using a near-zero amount of resources
       | while maintaining accuracy. As far as I know, there are two ways
       | to implement this functionality:
       | 
       | 1) store an in-memory representation of the file system and
       | periodically refresh it by polling the paths under watch,
       | emitting events when differences are detected
       | 
       | 2) hook into the underlying kernel events like kqueue, inotify,
       | fsevents, ReadDirectoryChangesW, etc. and report events
       | 
       | Option 1 uses a lot of CPU and memory (the map storing the paths
       | being monitored could easily grow to be tens or even hundreds of
       | megabytes if many files are being monitored, which is often the
       | case in large source projects). I have seen tools that use
       | polling with a 100ms interval continuously burn 50% of cpu
       | monitoring a modest sized directory with tens of thousands of
       | files.
       | 
       | Option 2 theoretically would use less memory and little to no
       | cpu, but in practice, the story is more complicated. If you are
       | using an inotify or kqueue like api, you will have to store
       | handles for all of the paths that are being monitored, which can
       | take a significant amount of memory. On macOS, the file system
       | events are not accurate in the sense that you can't trust the
       | type of event. It doesn't reliably distinguish between creation
       | and modification events. So if you want to know specifically what
       | kind of event happened, you end up back in case 1, where you have
       | to store an in-memory representation of the file system and diff
       | it against the current file system state when you detect an
       | event. For some use cases, you may not care to distinguish
       | between creations and modifications and can get away with a
       | lower-memory, but less accurate, solution.
       | 
       | In my experience, getting all of this right is much more
       | difficult than it appears at first glance. Good luck to you.
        
         | e-dant wrote:
         | More technically, here's what we have:
         | 
         | A "baseline" filesystem watcher which uses only the standard
         | library. It has been made to beat kqueue. And it does.
         | 
         | A platform filesystem watcher for Darwin is used, but certain
         | event properties are handled by the standard library. Namely,
         | the event time and the path type.
         | 
         | A platform filesystem watcher is scheduled for Windows. Work
         | hasn't been started.
         | 
         | A platform filesystem watcher for Linux (> 2.4 or so) was toyed
         | with but ultimately rejected out of accuracy concerns. It was
         | far more efficient than the cross-platform implementation
         | "warthog", no doubt, but it lacked accuracy. Work is being done
         | to get most of the benefits from both worlds.
         | 
         | There are problems with the "baseline" watcher (which I've
         | named "warthog" because it's sturdy and reliable). But those
         | are potential efficiency losses when watching more than a few
         | million paths. They are, thankfully, not accuracy or safety
         | problems.
         | 
         | Maybe you can see the solution emerging here?
         | 
         | Here's where we're going next:
         | 
         | The most efficient kernel watchers can be used on most
         | platforms, but checked for their accuracy periodically by the
         | "warthog" watcher.
        
           | diffxx wrote:
           | What do you mean by beat kqueue? Is it faster than kqueue?
           | Does it use less memory than kqueue?
           | 
           | How does the baseline filesystem watcher work? If it doesn't
           | use kqueue, does it poll the filesystem periodically and diff
           | against an in memory representation? If yes, see my other
           | comments. If not, I am genuinely curious what you are doing
           | because you know something that I do not.
        
             | e-dant wrote:
             | When I began this project, I started with kqueue. The
             | performance was wanting and there were bugs with very large
             | file trees.
             | 
             | I moved to a minimal std::filesystem-based watcher and
             | optimized it from there.
             | 
             | There hasn't been a formal head-to-head test between the
             | two. That should be about halfway down my todo list. It's
             | worth revisiting more formally.
             | 
             | My response to this question should help here:
             | https://news.ycombinator.com/item?id=33247155#33251437
             | 
             | In short, there's no secret sauce. There's an efficiency
             | spread in (what I consider) edge-cases.
             | 
             | Every potential gain over other naive watchers implemented
             | with kqueue is likely algorithmic. I store events in a
             | historical map, compare differences to the current state of
             | the file tree, prune them, and send events when they
             | change. That's the whole implementation: scan paths, record
             | their attributes, check for differences in the map, and
             | send events when they happen. I haven't given much thought
             | to exactly why it beats kqueue, nor are there any good
             | tests showing by how much. (Again, this is worth doing.)
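
The scan, record, compare, and send loop described above can be sketched with std::filesystem. This is a minimal illustration under stated assumptions, not the library's actual code: the names `Change`, `Event`, `Snapshot`, `scan`, and `diff` are hypothetical, and only the last-write time is recorded per path.

```cpp
#include <filesystem>
#include <string>
#include <system_error>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Illustrative types; these names are assumptions, not the library's API.
enum class Change { created, modified, removed };

struct Event {
    std::string path;
    Change what;
};

using Snapshot = std::unordered_map<std::string, fs::file_time_type>;

// Scan: record the last-write time of every path below `root`.
// Paths that cannot be read are skipped rather than reported.
Snapshot scan(const fs::path& root) {
    Snapshot snap;
    std::error_code ec;
    fs::recursive_directory_iterator it(
        root, fs::directory_options::skip_permission_denied, ec);
    fs::recursive_directory_iterator end;
    while (!ec && it != end) {
        std::error_code time_ec;
        auto t = fs::last_write_time(it->path(), time_ec);
        if (!time_ec) snap.emplace(it->path().string(), t);
        it.increment(ec);
    }
    return snap;
}

// Diff: compare the previous snapshot with the current one and emit events.
std::vector<Event> diff(const Snapshot& before, const Snapshot& after) {
    std::vector<Event> events;
    for (const auto& entry : after) {
        auto old = before.find(entry.first);
        if (old == before.end()) {
            events.push_back({entry.first, Change::created});
        } else if (old->second != entry.second) {
            events.push_back({entry.first, Change::modified});
        }
    }
    for (const auto& entry : before) {
        if (after.find(entry.first) == after.end()) {
            events.push_back({entry.first, Change::removed});
        }
    }
    return events;
}
```

A watcher built this way would call `scan` on a timer, pass the old and new snapshots to `diff`, and keep the new snapshot for the next round. Memory grows with the number of paths because every path string is held in the map, which is the cost diffxx describes for the polling approach.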
        
               | diffxx wrote:
               | Makes sense. I have only used kqueue on macOS to monitor
               | a small number of files. I found it quite painful to
               | use and the semantics confusing. Not sure if it is
               | different on, say, FreeBSD.
               | 
               | Just as a heads up, one of the strange fsevents issues is
               | that it fails if you register two directories where one
               | directory is a prefix of the other. So say that you want
               | to monitor directories $ROOT/foo and $ROOT/fo and you
               | register an event stream first with $ROOT/foo and then
               | $ROOT/fo, you will only receive events for paths in
               | $ROOT/fo and no events for paths in $ROOT/foo (I just
               | double checked that this is still the case in Monterey at
               | least). I never bothered to report this to apple but
               | worked around it by just registering a stream with $ROOT
               | if I detected that one path name was a substring of
               | another.
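
The workaround described above (fall back to a stream on the shared root when one path name is a string prefix of another) could be sketched like this. No FSEvents calls are made here; the helper names are hypothetical, and the sketch only decides which paths to register.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// True if `a` is a plain string prefix of `b`, e.g. "/r/fo" and "/r/foo".
// This is a string comparison, not a path-component comparison.
bool is_string_prefix(const std::string& a, const std::string& b) {
    return a.size() <= b.size() && b.compare(0, a.size(), a) == 0;
}

// Longest leading run of path components shared by two paths.
fs::path common_ancestor(const fs::path& a, const fs::path& b) {
    fs::path out;
    auto ia = a.begin();
    auto ib = b.begin();
    for (; ia != a.end() && ib != b.end() && *ia == *ib; ++ia, ++ib) {
        out /= *ia;
    }
    return out;
}

// Hypothetical helper: if any two requested paths collide as string prefixes,
// fall back to a single stream rooted at the common ancestor of all paths.
// Otherwise return the requested paths unchanged.
std::vector<fs::path> streams_to_register(const std::vector<fs::path>& wanted) {
    bool collision = false;
    for (std::size_t i = 0; i < wanted.size() && !collision; ++i) {
        for (std::size_t j = 0; j < wanted.size(); ++j) {
            if (i != j &&
                is_string_prefix(wanted[i].string(), wanted[j].string())) {
                collision = true;
                break;
            }
        }
    }
    if (!collision) return wanted;

    // A collision implies at least two paths, so front() is safe here.
    fs::path root = wanted.front();
    for (const auto& p : wanted) root = common_ancestor(root, p);
    return std::vector<fs::path>{root};
}
```

Registering the common ancestor means events arrive for sibling directories too, so a caller would still filter incoming event paths against the directories it originally asked for.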
        
         | commandlinefan wrote:
         | > hook into the underlying kernel events like kqueue...
         | 
         | I'm really surprised that this sort of functionality isn't
         | built into OS's/filesystems. I recently had to do this for
         | HDFS, and I finally "gave up" and polled the file system like
         | you suggest as your first option. Event notification seems like
         | something that ought to be a fundamental feature and is best
         | owned by the file system itself.
        
           | nomel wrote:
           | > I'm really surprised that this sort of functionality isn't
           | built into OS's/filesystems
           | 
           | It appears to be built into macOS [1]?
           | 
           | > Whenever the filesystem is changed, the kernel passes
           | notifications via the special device file /dev/fsevents to a
           | userspace process called fseventsd
           | 
           | Which I assume is what they're referring to here:
           | 
           | > A platform filesystem watcher for Darwin is used, but
           | certain event properties are handled by the standard library.
           | Namely, the event time and the path type.
           | 
           | 1. https://en.wikipedia.org/wiki/FSEvents
        
             | diffxx wrote:
             | > It appears to be built into macOS [1]?
             | 
             | It is, but it's badly implemented and buggy. But the real
             | problem is that there is no POSIX-like specification for
             | file system events, so every platform does it differently.
             | Even if every platform implementation were perfect and bug
             | free, it is a huge pain to write wrappers for each one.
        
           | diffxx wrote:
           | Completely agree. That is why having built a tool similar to
           | this one, I'm not even linking to it. The complexity involved
           | in working around the OS limitations is maddening and
           | convinced me that it would be better to think of a different
           | approach to writing software that wouldn't require monitoring
           | files to achieve the fast feedback loop that these tools are
           | designed to facilitate.
           | 
           | The magic file approach described by kevincox below is
           | probably the best way to get > 95% of the benefit with < 1%
           | of the work.
        
         | e-dant wrote:
         | It's difficult to get it perfectly right.
         | 
         | There is ongoing work attempting to make it more perfect.
         | 
         | I expect a year or two before this is complete.
         | 
         | For now though, it does do what it says. The tests I've run
         | show that it is accurate over large amounts of events and time.
         | For under 1 million files and/or directories, it uses a near-
         | zero amount of resources. Testing on older processors shows
         | similarly positive results.
         | 
         | But this is so far from perfect. This is only the groundwork.
         | Most of the bugs have yet to be discovered. The platform
         | support, more often than not, uses the safe "baseline" watcher
         | in favor of accuracy.
         | 
         | Ned14 of Boost fame has given the project some expert advice
         | which will help it along smoothly.
        
           | diffxx wrote:
           | What do you mean near-zero? You said that inotify doesn't
           | work (and ned14 offers his comments about it). If you are
           | using polling, I do not understand how your approach could be
           | using a near-zero amount of resources. Let's say you are
           | monitoring a directory with 1 million files, how can you
           | store the state in less than 20MB of memory (which is about
           | the most optimistic lower bound that I can think of)? What is
           | your secret sauce? Do you mean there is no overhead beyond
           | the baseline watcher? But what about the overhead of the
           | baseline watcher itself?
           | 
           | For what it's worth, in spite of ned14's comments, I have
           | never seen inotify fail in practice (except when it hits
           | the OS file descriptor limits, in which case it does fail
           | noisily). The tool I wrote uses inotify on Linux. It is used
           | by thousands of developers every day as part of an editor
           | integration and there are no open issues about dropped file
           | events.
           | 
           | Your time frame is probably about right. It took me about a
           | year to work through all the edge cases.
        
             | e-dant wrote:
             | Near-zero is a bit loose. It keeps a relatively compact
             | in-memory representation. You're about right with your
             | estimate. Having measured just now, it's about 30 MB for 1
             | million directories.
             | 
             | The baseline Watcher's efficiency has a wide spread. When
             | there are many thousands of nested subdirectories, the CPU
             | approaches the limit of the thread it's on. Flatter
             | directories, or many files without nested subdirectories,
             | do not have nearly as much of an effect. I've seen it run
             | on around 10 million paths on a very flat test directory.
             | 
             | So, near-zero is somewhat misleading. There's a wide spread
             | in efficiency. It was my judgement that deeply nested
             | directory trees were far less common in practice then, so I
             | wrote "near-zero" in the optimistic case.
             | 
             | It uses polling under the hood (at least, I'm sure it does.
             | It uses whatever std::filesystem uses, which is almost
             | certainly polling).
        
       | remram wrote:
       | Does it keep working when a file gets overwritten by moving another
       | over it, like some Linux text editors do?
        
         | e-dant wrote:
         | I should test this. I haven't seen a problem with that so far
         | in my personal usage, so I'm inclined to say probably.
         | 
         | That's a good test case. I'll make an issue.
        
       | queuebert wrote:
       | In the early days of Linux, there was a tool that saved file info
       | to a floppy. Then you would write protect the floppy and leave it
       | in a drive, and the tool would periodically compare OS files with
       | that to detect alteration. I can't for the life of me remember
       | the name, though. It was great for hardened systems.
        
         | stonogo wrote:
         | Tripwire. https://www.linuxjournal.com/article/8758
        
       | widdershins wrote:
       | Looks great! I'm wondering what operating systems are supported.
       | I'm assuming Linux. What about macOS and Windows?
        
         | OnlyMortal wrote:
         | It uses FSEvents on the Mac. It can do dumb polling too.
        
         | dljsjr wrote:
         | https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...
         | 
         | Looks like it works on quite a few systems including Android
         | and iOS.
        
         | e-dant wrote:
         | All are supported.
         | 
         | Although, to be more efficient, I need to write system API
         | calls for Windows.
         | 
         | That will be the 1.0 release.
        
           | hackyhacky wrote:
        
             | sk0g wrote:
             | They were responding to a comment asking about Linux,
             | MacOS, and Windows support.
             | 
             | Needless pedantry is one thing, but deliberately
             | misconstruing a discussion to support said pedantry is sad.
             | 
             | https://news.ycombinator.com/newsguidelines.html
        
       | GordonS wrote:
       | How does this work under the covers on Linux? Is it using eBPF,
       | or is it simply an abstraction over inotify?
       | 
       | I'm particularly interested in something like this, but which
       | will include information about what process made the change, and
       | which user it was running as at the time.
        
         | bertman wrote:
         | >or is it simply an abstraction over inotify?
         | 
         | Looks like it:
         | 
         | https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...
        
           | tleb_ wrote:
           | `scan_directory` in the same file recursively iterates
           | directories and no calls to `inotify_*` functions seem to be
           | made; no grep matches in the project directory.
        
             | e-dant wrote:
             | https://github.com/e-dant/watcher/issues/10
        
         | e-dant wrote:
         | I've gone back and forth with inotify on Linux. Ned14 gave a
         | great rundown of the ideal next steps for a best-possible
         | implementation.
         | 
         | You can check out issue/10 for a full description of how it
         | works now, why neither inotify nor our current solution is
         | ideal, and where the project will be going next.
        
           | GordonS wrote:
           | I'm sorry, I don't know what issue/10 means? Is it an e-zine
           | or something? (apologies if this is obvious, I'm extremely
           | tired!)
        
             | e-dant wrote:
             | https://github.com/e-dant/watcher/issues/10
        
               | GordonS wrote:
               | Ah, sorry, that really should have been obvious :D
               | 
               | Have you looked into using eBPF for tracking file system
               | changes at any point? (I don't mean for this project, as
               | it's clear you're taking a particular approach that will
               | work across platforms).
        
       | dyerjohn wrote:
       | "Watcher is extremely efficient. In most cases, even when
       | scanning millions of paths, this library uses a near-zero amount
       | of resources." Yea, maybe or maybe not and my first guess is
       | maybe not.
       | 
       | This needs at least some bullet points on HOW it does this so
       | efficiently so that I'll keep looking. A blanket statement like
       | this means "they hope it is efficient" or "They want it to be
       | efficient" or "It's good in some scenarios but not others".
       | 
       | With those additional bits, I have a reason to dig around the
       | source.
        
         | chasil wrote:
         | Is this just using inotify on Linux?
         | 
         | If so, there are equivalent options, including systemd path
         | units, incron, and the inotifywait utility, in addition to the
         | C API.
         | 
         | The "man systemd.path" page does list explicit limitations of
         | this kernel system call:
         | 
         | "Internally, path units use the inotify(7) API to monitor file
         | systems. Due to that, it suffers by the same limitations as
         | inotify, and for example cannot be used to monitor files or
         | directories changed by other machines on remote NFS file
         | systems." (Files modified by mmap() also don't trigger events.)
         | 
         | https://www.linuxjournal.com/content/linux-filesystem-events...
         | 
         | Windows busybox also has an inotifyd, which appears to do
         | something similar.
        
         | e-dant wrote:
         | You are right. I'll make sure to give a deeper breakdown in the
         | readme.
        
       | waynesonfire wrote:
       | forgot "written in C++" .. unless you're trying to get views,
       | better to leave that part out.
        
         | e-dant wrote:
         | What do you mean?
        
       | awestroke wrote:
       | See also:
       | 
       | sane - for node
       | 
       | watchexec - rust based, static binary
        
         | christophilus wrote:
         | Reflex for go: https://github.com/cespare/reflex
        
         | kevincox wrote:
         | For CLI usage I found that the best option was instead of
         | watching my source directory just to watch a magic file. Then I
         | configured my editor to touch that file when saving. This has a
         | few benefits:
         | 
         | 1. No need to worry about which files to watch or ignoring
         | build outputs.
         | 
         | 2. Works with every project with no setup.
         | 
         | 3. Easy to trigger a re-run without actually changing a file.
         | 
         | 4. Always runs after all files are saved instead of starting
         | after the first file is saved and racing the rest.
         | 
         | 5. Infinitely scalable.
        
           | matijs wrote:
           | Out of curiosity, what editor do you use and how do you make
           | sure 4. happens when for example using 'save all'?
        
             | kevincox wrote:
             | I'm currently using neovim so it is pretty trivial to add
             | save hooks. Although the approach I am currently using is
             | just a custom shortcut that saves all files and touches the
             | file. This way only explicit saves by me trigger the rerun.
             | 
             | My current setup is documented here but it's easy to tweak
             | to your preferred workflow.
             | https://kevincox.ca/2022/06/14/small-tools/#w
        
       ___________________________________________________________________
       (page generated 2022-10-18 23:02 UTC)