[HN Gopher] Show HN: Filesystem Watcher
___________________________________________________________________
Show HN: Filesystem Watcher
An arbitrary filesystem event watcher which is:
- simple
- efficient
- dependency free
- runnable anywhere with a filesystem
- header only

Watcher is extremely efficient. In most cases, even when scanning
millions of paths, this library uses a near-zero amount of
resources.

Watcher is simple. The library exposes a single function and a
single object. That is all.

Happy hacking.
Author : e-dant
Score : 81 points
Date : 2022-10-18 13:54 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| [deleted]
| mikulas_florek wrote:
| Looking at Win32, it scans the whole directory periodically,
| right? I must miss something, but how can that be called
| efficient?
| MichaelCollins wrote:
| I don't know about efficient, but at least it sounds reliable.
| I'm at my wit's end with trying to figure out why KDE's Dolphin
| can't reliably watch a directory for new files, frequently (but
| not always) forcing me to F5 to see new files.
| rasz wrote:
| Wait, it doesn't hook the OS file handling routines? It actually
| manually rescans the filesystem?
| e-dant wrote:
| It does one or the other. There are concerns about OS
| filesystem event hooks.
|
| The current solution isn't ideal, and is being addressed
| here: https://github.com/e-dant/watcher/issues/10
| mikulas_florek wrote:
| Not on Windows (the only platform I checked/care about).
| [deleted]
| e-dant wrote:
| It's efficient because it beats kqueue while reporting events
| accurately.
|
| A proper benchmarking program is in the works, however manual
| testing does show only minimal resource usage.
|
| For more, see this issue:
| https://github.com/e-dant/watcher/issues/10
| mikulas_florek wrote:
| I only mentioned Win32. There this library is very
| inefficient compared to ReadDirectoryChangesW, which consumes
| no CPU time when nothing changes.
| dwringer wrote:
| See here: https://news.ycombinator.com/item?id=33247735
| bigmattystyles wrote:
| When I last tried to implement this, by far the toughest part was
| making sure the file that's been newly detected is done being
| written to. On ntfs I couldn't find a good technique, even last
| modified time was not reliable. I had to watch it for changes
| myself.
| nomel wrote:
| Is "last modified" the time of the beginning of the write?
| bigmattystyles wrote:
| Never even thought of that; I don't know. I assumed it was
| when a write was done. Whatever that means, I don't know
| either.
| 0xJRS wrote:
| Would this mean that fs event based antivirus scanners
| could be side-stepped by writing a payload to a file and
| then never closing the handle?
| infogulch wrote:
| I've done this by watching the NTFS journal which is
| surprisingly efficient. First I scanned the whole journal for
| filesystem metadata and dumped it into a SQLite database (which
| took about a minute), then kept it up to date which took
| virtually no resources. This was an absurdly faster way to
| search by file name: a search across the whole FS came back in
| milliseconds instead of Explorer's multiple minutes.
| roeles wrote:
| Should this work over NFS or SMB?
| jonhohle wrote:
| How does this compare to something like famd(8) and if it's a
| marked improvement, could the techniques used here be backported
| to FAM?
| rasz wrote:
| Looks like it does exactly what you can get out of Everything
| (https://www.voidtools.com) Index Journal
|
| https://www.voidtools.com/forum/viewtopic.php?t=9792
|
| but programmatically and it's highly scriptable, pretty cool. Will
| definitely add it to my arsenal of troubleshooting tools.
|
| Edit: never mind, this tool is manually scanning the filesystem
| instead of listening to OS events
| https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...
| e-dant wrote:
| It's being addressed here:
|
| https://github.com/e-dant/watcher/issues/10
| mellosouls wrote:
| Voidtools gives security warnings in my browser, fwiw.
| rasz wrote:
| Of course it does, it's competing with Microsoft by providing
| actually working instant local search.
| mellosouls wrote:
| My _chrome_ browser.
| rasz wrote:
| > instant local search
|
| still checks out :) But I checked just to be sure and no
| warning in Version 106.0.5249.91 (Official Build)
| (32-bit). Maybe it's your corporate baby content web
| filter?
| mellosouls wrote:
| No, it's a certificate issue; probably minor, but a
| server-side thing to attend to by the looks of it.
| rasz wrote:
| There are no certificate issues in my Chrome. Are you
| sure you're not on some MITMing VPN?
| who23 wrote:
| For clarity, the author did say in another comment that for
| Windows they plan to implement system API calls to watch files
| instead of manually scanning the filesystem. For macOS and
| Linux it is listening to OS events.
| rasz wrote:
| Everything uses https://en.wikipedia.org/wiki/USN_Journal for
| fast NTFS monitoring.
| cmovq wrote:
| Any reason for making delay_ms a template parameter? A compiler
| should be able to optimize passing a constant as a regular
| function argument. And if it's not optimized I assume a variable
| delay wouldn't affect much?
| e-dant wrote:
| No perfectly good reason. I will look into that before version
| 1.
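To make the question about `delay_ms` concrete, here is a small sketch of the two shapes being compared. The function names are hypothetical stand-ins for a polling loop's pause and are not the library's API.

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-ins for a polling loop's delay; not the library's API.

// Delay fixed at compile time: every distinct value instantiates a separate
// function, and the value cannot come from a config file or CLI flag.
template <int delay_ms>
inline void pause_templated() {
  std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
}

// Delay passed at run time: one function for all values. When the argument
// is a constant at the call site, an optimizer can typically propagate it.
inline void pause_runtime(std::chrono::milliseconds delay) {
  std::this_thread::sleep_for(delay);
}
```

Call sites look like `pause_templated<16>()` versus `pause_runtime(std::chrono::milliseconds(16))`. Since the sleep itself dominates, any difference in call overhead is unlikely to be measurable.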
| diffxx wrote:
| I have personally written a similar tool and I am very curious
| about how this could be using a near-zero amount of resources
| while maintaining accuracy. As far as I know, there are two ways
| to implement this functionality:
|
| 1) store an in-memory representation of the file system and
| periodically refresh the in-memory state by polling the paths
| under watch and emitting events when differences are detected
|
| 2) hook into the underlying kernel events like kqueue, inotify,
| fsevents, ReadDirectoryChangesW, etc. and report events
|
| Option 1 uses a lot of CPU and memory (the map storing the paths
| being monitored could easily grow to be tens or even hundreds of
| megabytes if many files are being monitored, which is often the
| case in large source projects). I have seen tools that use
| polling with a 100ms interval continuously burn 50% of cpu
| monitoring a modest sized directory with tens of thousands of
| files.
|
| Option 2 theoretically would use less memory and little to no
| cpu, but in practice, the story is more complicated. If you are
| using an inotify or kqueue like api, you will have to store
| handles for all of the paths that are being monitored, which can
| take a significant amount of memory. On macos, the file system
| events are not accurate in the sense that you can't trust the
| type of event. It doesn't reliably distinguish between creation
| and modification events. So if you want to know specifically what
| kind of event happened, you end up back in case 1, where you have
| to store an in-memory representation of the file system and diff
| it against the current file system state when you detect an
| event. For some use cases, you may not
| care to distinguish between creations and modifications and can
| get away with a lower memory, but less accurate, solution.
|
| In my experience, getting all of this right is much more
| difficult than it appears at first glance. Good luck to you.
| e-dant wrote:
| More technically, here's what we have:
|
| A "baseline" filesystem watcher which uses only the standard
| library. It has been made to beat kqueue. And it does.
|
| A platform filesystem watcher for Darwin is used, but certain
| event properties are handled by the standard library. Namely,
| the event time and the path type.
|
| A platform filesystem watcher is scheduled for Windows. Work
| hasn't been started.
|
| A platform filesystem watcher for Linux (> 2.4 or so) was toyed
| with but ultimately rejected out of accuracy concerns. It was
| far more efficient than the cross-platform implementation
| "warthog", no doubt, but it lacked accuracy. Work is being done
| to get most of the benefits from both worlds.
|
| There are problems with the "baseline" watcher (which I've
| named "warthog" because it's sturdy and reliable). But those
| are potential efficiency losses when watching more than a few
| million paths. They are, thankfully, not accuracy or safety
| problems.
|
| Maybe you can see the solution emerging here?
|
| Here's where we're going next:
|
| The most efficient kernel watchers can be used on most
| platforms, but checked for their accuracy periodically by the
| "warthog" watcher.
| diffxx wrote:
| What do you mean by beat kqueue? Is it faster than kqueue?
| Does it use less memory than kqueue?
|
| How does the baseline filesystem watcher work? If it doesn't
| use kqueue, does it poll the filesystem periodically and diff
| against an in memory representation? If yes, see my other
| comments. If not, I am genuinely curious what you are doing
| because you know something that I do not.
| e-dant wrote:
| When I began this project, I started with kqueue. The
| performance was wanting and there were bugs with very large
| file trees.
|
| I moved to a minimal std::filesystem-based watcher and
| optimized it from there.
|
| There hasn't been a formal head-to-head test between the
| two. That should be about halfway down my todo list. It's
| worth revisiting more formally.
|
| My response to this question should help here:
| https://news.ycombinator.com/item?id=33247155#33251437
|
| In short, there's no secret sauce. There's an efficiency
| spread in (what I consider) edge-cases.
|
| Every potential gain over other naive watchers implemented
| with kqueue is likely algorithmic. I store events in a
| historical map, compare differences to the current state of
| the file tree, prune them, and send events when they
| change. That's the whole implementation: scan paths, record
| their attributes, check for differences in the map, and
| send events when they happen. I haven't given much thought
| to exactly why it beats kqueue, nor are there any good
| tests showing by how much. (Again, this is worth doing.)
| diffxx wrote:
| Makes sense. I have only used kqueue on macos to monitor
| a small number of files and I find it quite painful to
| use and the semantics were confusing, not sure if it is
| different on say freebsd.
|
| Just as a heads up, one of the strange fsevents issues is
| that it fails if you register two directories where one
| directory is a prefix of the other. So say that you want
| to monitor directories $ROOT/foo and $ROOT/fo and you
| register an event stream first with $ROOT/foo and then
| $ROOT/fo, you will only receive events for paths in
| $ROOT/fo and no events for paths in $ROOT/foo (I just
| double checked that this is still the case in Monterey at
| least). I never bothered to report this to apple but
| worked around it by just registering a stream with $ROOT
| if I detected that one path name was a substring of
| another.
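The workaround described above amounts to a string-prefix check before registering streams. A minimal sketch, assuming the caller already knows a common parent (`root`) for the requested directories; both function names are hypothetical and no FSEvents calls are made here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when one path string is a leading substring of the other,
// e.g. "/root/fo" and "/root/foo".
inline bool is_string_prefix(const std::string& a, const std::string& b) {
  const std::string& shorter = a.size() <= b.size() ? a : b;
  const std::string& longer = a.size() <= b.size() ? b : a;
  return longer.compare(0, shorter.size(), shorter) == 0;
}

// Hypothetical helper: returns the directories to register. If any two
// requested directories collide by prefix, fall back to a single stream on
// `root`; otherwise register the directories as requested.
inline std::vector<std::string> streams_to_register(
    const std::string& root, const std::vector<std::string>& dirs) {
  for (std::size_t i = 0; i < dirs.size(); ++i) {
    for (std::size_t j = i + 1; j < dirs.size(); ++j) {
      if (is_string_prefix(dirs[i], dirs[j])) return {root};
    }
  }
  return dirs;
}
```

When the fallback is taken, events arrive for everything under `root`, so the caller would still need to filter them down to the directories originally requested.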
| commandlinefan wrote:
| > hook into the underlying kernel events like kqueue...
|
| I'm really surprised that this sort of functionality isn't
| built into OS's/filesystems. I recently had to do this for
| HDFS, and I finally "gave up" and polled the file system like
| you suggest as your first option. Event notification seems like
| something that ought to be a fundamental feature and is best
| owned by the file system itself.
| nomel wrote:
| > I'm really surprised that this sort of functionality isn't
| built into OS's/filesystems
|
| It appears to be built into macOS [1]?
|
| > Whenever the filesystem is changed, the kernel passes
| notifications via the special device file /dev/fsevents to a
| userspace process called fseventsd
|
| Which I assume is what they're referring to here:
|
| > A platform filesystem watcher for Darwin is used, but
| certain event properties are handled by the standard library.
| Namely, the event time and the path type.
|
| 1. https://en.wikipedia.org/wiki/FSEvents
| diffxx wrote:
| > It appears to be built into macOS [1]?
|
| It is, but it's badly implemented and buggy. But the real
| problem is that there is no posix like specification for
| file system events so every platform does it differently.
| Even if every platform implementation were perfect and bug
| free, it is a huge pain to write wrappers for each one.
| diffxx wrote:
| Completely agree. That is why having built a tool similar to
| this one, I'm not even linking to it. The complexity involved
| in working around the OS limitations is maddening and
| convinced me that it would be better to think of a different
| approach to writing software that wouldn't require monitoring
| files to achieve the fast feedback loop that these tools are
| designed to facilitate.
|
| The magic file approach described by kevincox below is
| probably the best way to get > 95% of the benefit with < 1%
| of the work.
| e-dant wrote:
| It's difficult to get it perfectly right.
|
| There is ongoing work attempting to make it more perfect.
|
| I expect a year or two before this is complete.
|
| For now though, it does do what it says. The tests I've run
| show that it is accurate over large amounts of events and time.
| For under 1 million files and/or directories, it uses a near-
| zero amount of resources. Testing on older processors shows
| similarly positive results.
|
| But this is so far from perfect. This is only the groundwork.
| Most of the bugs have yet to be discovered. The platform
| support, more often than not, uses the safe "baseline" watcher
| in favor of accuracy.
|
| Ned14 of Boost fame has given the project some expert advice
| which will help it along smoothly.
| diffxx wrote:
| What do you mean near-zero? You said that inotify doesn't
| work (and ned14 offers his comments about it). If you are
| using polling, I do not understand how your approach could be
| using a near-zero amount of resources. Let's say you are
| monitoring a directory with 1 million files, how can you
| store the state in less than 20MB of memory (which is about
| the most optimistic lower bound that I can think of)? What is
| your secret sauce? Do you mean there is no overhead beyond
| the baseline watcher? But what about the overhead of the
| baseline watcher itself?
|
| For what it's worth, in spite of ned14's comments, I have
| never seen inotify fail in practice (except for if it hits
| the os file descriptor limits in which case it does fail
| noisily). The tool I wrote uses inotify for linux. It is used
| by thousands of developers every day as part of an editor
| integration and there are no open issues about dropped file
| events.
|
| Your time frame is probably about right. It took me about a
| year to work through all the edge cases.
| e-dant wrote:
| Near-zero is a bit loose. It keeps a relatively compact in-
| memory representation. You're about right with your
| estimate. Having measured just now, it's about 30 MB for 1
| million directories.
|
| The baseline Watcher's efficiency has a wide spread. When
| there are many thousands of nested subdirectories, the CPU
| approaches the limit of the thread it's on. Flatter
| directories, or many files without nested subdirectories,
| do not have nearly as much of an effect. I've seen it run
| on around 10 million paths on a very flat test directory.
|
| So, near-zero is somewhat misleading. There's a wide spread
| in efficiency. It was my judgement that deeply nested
| directory trees were far less common in practice then, so I
| wrote "near-zero" in the optimistic case.
|
| It uses polling under the hood (at least, I'm sure it does.
| It uses whatever std::filesystem uses, which is almost
| certainly polling).
| remram wrote:
| Does it keep working when files get overwritten by moving another
| over it, like some Linux text editors do?
| e-dant wrote:
| I should test this. I haven't seen a problem with that so far
| in my personal usage, so I'm inclined to say probably.
|
| That's a good test case. I'll make an issue.
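For anyone wanting to reproduce the case asked about here, this is a sketch of the "write a sibling file, then rename it over the target" save pattern that some editors use. The helper name and the `.tmp` suffix are assumptions for illustration. After such a save the path refers to a different file object, so a watcher that tracks the original handle or inode may stop seeing changes, whereas a path-based scan should simply see new attributes.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper mimicking an editor's "safe save": write the new
// contents to a sibling file, then rename it over the target.
inline void replace_by_rename(const fs::path& target,
                              const std::string& contents) {
  fs::path tmp = target;
  tmp += ".tmp";  // sibling file, so the rename stays on one filesystem
  {
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
    out << contents;
  }  // stream closed (and flushed) before the rename
  fs::rename(tmp, target);  // replaces the target if it already exists
}
```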
| queuebert wrote:
| In the early days of Linux, there was a tool that saved file info
| to a floppy. Then you would write protect the floppy and leave it
| in a drive, and the tool would periodically compare OS files with
| that to detect alteration. I can't for the life of me remember
| the name, though. It was great for hardened systems.
| stonogo wrote:
| Tripwire. https://www.linuxjournal.com/article/8758
| widdershins wrote:
| Looks great! I'm wondering what operating systems are supported.
| I'm assuming Linux. What about macOS and Windows?
| OnlyMortal wrote:
| It uses FSEvents on the Mac. It can do dumb polling too.
| dljsjr wrote:
| https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...
|
| Looks like it works on quite a few systems including Android
| and iOS.
| e-dant wrote:
| All are supported.
|
| Although, to be more efficient, I need to write system API
| calls for Windows.
|
| That will be the 1.0 release.
| hackyhacky wrote:
| sk0g wrote:
| They were responding to a comment asking about Linux,
| MacOS, and Windows support.
|
| Needless pedantry is one thing, but deliberately
| misconstruing a discussion to support said pedantry is sad.
|
| https://news.ycombinator.com/newsguidelines.html
| GordonS wrote:
| How does this work under the covers on Linux? Is it using eBPF,
| or is it simply an abstraction over inotify?
|
| I'm particularly interested in something like this, but which
| will include information about what process made the change, and
| which user it was running as at the time.
| bertman wrote:
| >or is it simply an abstraction over inotify?
|
| Looks like it:
|
| https://github.com/e-dant/watcher/blob/989147b183ee0547d71a1...
| tleb_ wrote:
| `scan_directory` in the same file recursively iterates
| directories and no calls to `inotify_*` functions seem to be
| made; no grep matches in the project directory.
| e-dant wrote:
| https://github.com/e-dant/watcher/issues/10
| e-dant wrote:
| I've gone back and forth with inotify on Linux. Ned14 gave a
| great rundown of the ideal next steps for a best-possible
| implementation.
|
| You can check out issue/10 for a full description of how it
| works now, why neither inotify nor our current solution is
| ideal, and where the project will be going next.
| GordonS wrote:
| I'm sorry, I don't know what issue/10 means? Is it an e-zine
| or something? (apologies if this is obvious, I'm extremely
| tired!)
| e-dant wrote:
| https://github.com/e-dant/watcher/issues/10
| GordonS wrote:
| Ah, sorry, that really should have been obvious :D
|
| Have you looked into using eBPF for tracking file system
| changes at any point? (I don't mean for this project, as
| it's clear you're taking a particular approach that will
| work across platforms).
| dyerjohn wrote:
| "Watcher is extremely efficient. In most cases, even when
| scanning millions of paths, this library uses a near-zero amount
| of resources." Yea, maybe or maybe not and my first guess is
| maybe not.
|
| This needs at least some bullet points on HOW it does this so
| efficiently so that I'll keep looking. A blanket statement like
| this means "they hope it is efficient" or "They want it to be
| efficient" or "It's good in some scenarios but not others".
|
| With those additional bits, I have a reason to dig around the
| source.
| chasil wrote:
| Is this just using inotify on Linux?
|
| If so, there are equivalent options, including systemd path
| units, incron, and the inotifywait utility, in addition to the
| C API.
|
| The "man systemd.path" page does list explicit limitations of
| this kernel system call:
|
| "Internally, path units use the inotify(7) API to monitor file
| systems. Due to that, it suffers by the same limitations as
| inotify, and for example cannot be used to monitor files or
| directories changed by other machines on remote NFS file
| systems." (Files modified by mmap() also don't trigger events.)
|
| https://www.linuxjournal.com/content/linux-filesystem-events...
|
| Windows busybox also has an inotifyd, which appears to do
| something similar.
| e-dant wrote:
| You are right. I'll make sure to give a deeper breakdown in the
| readme.
| [deleted]
| waynesonfire wrote:
| forgot "written in C++" .. unless you're trying to get views,
| better to leave that part out.
| e-dant wrote:
| What do you mean?
| awestroke wrote:
| See also:
|
| sane - for node
|
| watchexec - rust based, static binary
| christophilus wrote:
| Reflex for go: https://github.com/cespare/reflex
| kevincox wrote:
| For CLI usage I found that the best option was instead of
| watching my source directory just to watch a magic file. Then I
| configured my editor to touch that file when saving. This has a
| few benefits:
|
| 1. No need to worry about which files to watch or ignoring
| build outputs.
|
| 2. Works with every project with no setup.
|
| 3. Easy to trigger a re-run without actually changing a file.
|
| 4. Always runs after all files are saved instead of starting
| after the first file is saved and racing the rest.
|
| 5. Infinitely scalable.
| matijs wrote:
| Out of curiosity, what editor do you use and how do you make
| sure 4. happens when for example using 'save all'?
| kevincox wrote:
| I'm currently using neovim so it is pretty trivial to add
| save hooks. Although the approach I am currently using is
| just a custom shortcut that saves all files and touches the
| file. This way only explicit saves by me trigger the rerun.
|
| My current setup is documented here but it's easy to tweak
| to your preferred workflow.
| https://kevincox.ca/2022/06/14/small-tools/#w
___________________________________________________________________
(page generated 2022-10-18 23:02 UTC)