hngopher.com

       [HN Gopher] Show HN: VectorVFS, your filesystem as a vector data...
       ___________________________________________________________________
        
       Show HN: VectorVFS, your filesystem as a vector database
        
       Author : perone
       Score  : 191 points
       Date   : 2025-05-05 15:17 UTC (7 hours ago)
        
 (HTM) web link (vectorvfs.readthedocs.io)
 (TXT) w3m dump (vectorvfs.readthedocs.io)
        
       | malcolmgreaves wrote:
       | Fun idea storing embeddings in inodes! Very clever!
       | 
       | I want to point out that this isn't suitable for any kind of
       | actual things you'd use a vector database for. There's no notion
       | of a search index. It's always an O(N) linear search through all
       | of your files:
       | https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
       | 
       | Still, fun idea :)
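
The O(N) scan described above can be sketched in a few lines. This is a minimal illustration, not the VectorVFS code: the toy `embeddings` dict, the file names, and the helpers `cosine` and `linear_search` are all assumptions made for the example, and cosine similarity stands in for whatever score the project actually computes.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def linear_search(query, embeddings, k=3):
    """Score every stored vector against the query: O(N) in the number of files."""
    scored = ((cosine(query, vec), path) for path, vec in embeddings.items())
    return heapq.nlargest(k, scored)


# Hypothetical toy data: file path -> embedding vector.
embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.8, 0.3, 0.1],
    "invoice.pdf": [0.0, 0.2, 0.9],
}
results = linear_search([1.0, 0.0, 0.0], embeddings, k=2)
for score, path in results:
    print(f"{score:.3f}  {path}")
```

There is no index here at all: every query touches every vector, which is fine for a working folder and slow for millions of files.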
        
         | perone wrote:
         | Thanks. There is a bit of a nuance there, for example: you can
         | build an index in first pass which will indeed be linear, but
         | then later keep it in an open prompt for subsequent queries,
         | I'm planning to implement that mode soon. But agree, it is not
         | intended to search 10 million files, but you seldom have this
         | use case in local use anyways.
        
         | binarymax wrote:
         | O(n) is still OK for vector search if n isn't too large.
         | Filesystem search solutions are currently terrible, with
         | background indexing jobs and poor relevance. This won't scale
         | for every file on your system but anything in your working
         | documents folder would easily work well.
        
         | PaulHoule wrote:
         | The lack of an index is not bad at all if you have it stored
         | contiguously in RAM: the mechanical sympathy is great, SIMD
         | will spin like a top not to mention multithreaded programming,
         | etc. Circa 2014 or so I worked on a search engine that scanned
         | maybe 2GB worth of vectors for 10 million documents, queries
         | were turned around in much less than a second, nobody
         | complained about the speed.
         | 
         | If you gotta gather the data from a lot of different inodes, it
         | is a different story.
        
           | ori_b wrote:
           | It's not stored contiguously in RAM. It's stored in extended
           | attributes.
        
         | esafak wrote:
         | thanks for saving readers time. If so this is not a viable tool
         | for production.
        
         | int_19h wrote:
         | An index could be built on top of this though if desired. No
         | need to have it in the FS itself.
        
       | natas wrote:
       | this is actually a great idea
        
         | iugtmkbdfil834 wrote:
         | Assuming I understand it correctly, the idea is to be able to
         | have LLMs get through file systems more easily with some
         | interesting benefits to human users as well. The idea is
         | interesting and I want to try it out.
        
           | perone wrote:
           | Hi, there are no LLMs involved, it is all local and an
           | embedding (vector representation) of the data is created and
           | then that is used for search later, nothing is sent to cloud
           | from your files and there are no local LLMs running as well,
           | only the encoders (I use the Perception Encoder from Meta
           | released a few weeks ago).
        
       | anotherpaul wrote:
       | Great idea indeed. The documentation needs a bit more information
       | to be useful. What GPU backends are supported for example? How do
       | I delete the embedding information after I decide to uninstall
       | it? Will give it a try though.
        
         | perone wrote:
         | Thanks, I'm working on implementing the commands to clean the
         | embeddings (you can now do that with Linux xattr command-line
         | tool). I'm supporting CPU or GPU (NVIDIA) for the encoders and
         | it only supports Linux at the moment.
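
For readers who want to inspect or remove such attributes from Python rather than from the command line, here is a rough sketch using the standard library's Linux-only xattr calls. The attribute name `user.example.embedding` and the byte layout (a count followed by float32 values) are assumptions made for illustration; the thread does not say what name or format VectorVFS uses.

```python
import os
import struct

# Hypothetical attribute name; the name VectorVFS really uses is not given here.
ATTR = "user.example.embedding"


def encode_vector(vec):
    """Pack floats as little-endian float32 bytes, prefixed by the element count."""
    return struct.pack(f"<I{len(vec)}f", len(vec), *vec)


def decode_vector(blob):
    """Inverse of encode_vector."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}f", blob, 4))


def store_embedding(path, vec):
    """Attach the encoded vector to the file as an extended attribute (Linux only)."""
    os.setxattr(path, ATTR, encode_vector(vec))


def load_embedding(path):
    """Read the vector back from the file's extended attribute (Linux only)."""
    return decode_vector(os.getxattr(path, ATTR))


def remove_embedding(path):
    """Delete the attribute; return False if it was absent or xattrs are unsupported."""
    try:
        os.removexattr(path, ATTR)
        return True
    except OSError:
        return False
```

The three file-level functions only work on Linux and on a filesystem mounted with user extended attributes enabled; the encode and decode helpers are plain byte packing and run anywhere.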
        
           | 3abiton wrote:
           | I am curious why Python, and not rust for example?
        
             | danudey wrote:
             | Not OP, but despite working in an all-Go shop I just wrote
             | a utility in Python the other week and caught some flak for
             | it.
             | 
             | The reason I gave (which was accepted) was that the process
             | of creating a proof of concept and iterating on it rapidly
             | is vastly easier in Python (for me) than it is in Go. In
             | essence, it would have taken me at least a week, possibly
             | more, to write the program I ended up with in Golang, but
             | it only took me a day to write it in Python, and, now that
             | I understand the problem and have a working (production-
             | ready) prototype, it would probably only take me another
             | day to rewrite it in Golang.
             | 
             | Also, a large chunk of the functionality in this Python
             | script seems to be libraries - pillow for image processing,
             | but also pytorch and related vision/audio/codec libraries.
             | Even if similar production-ready Rust crates are available
             | (I'm not sure if they are), this kind of thing is something
             | Python excels at and which these modules are already
             | optimized for. Most of the "work" happening here isn't
             | happening in Python, by and large.
        
               | hadlock wrote:
               | Sure, but your all-go shop now needs to support two
               | languages, two sets of linters, ci/cd etc for a single
               | utility. It might be faster for you but if the utility is
               | going to be used for more than a couple of weeks now it's
               | a real hassle to get a go developer to make sure they
               | have the right version of the interpreter, remember all
               | the ins and outs of python etc.
        
             | perone wrote:
             | Hi, I think Rust won't bring much benefit here to be
             | honest, the bottleneck is mainly the model and model
             | loading. It would probably be a nightmare to load these
             | models from Rust, I would have to use torch bindings and
             | then convert everything from the preprocessing already in
             | Python to Rust.
        
       | badmonster wrote:
       | interesting
        
       | Ericson2314 wrote:
       | The idea that filesystems are not just a flavor of database
       | management systems was always a mistake.
       | 
       | Maybe with micro-kernels we'll finally fix this.
        
         | 7qW24A wrote:
         | I'm a database guy, not an OS guy, so I agree, obviously... But
         | what is the micro-kernel angle?
        
           | packetlost wrote:
           | Likely the idea that filesystems should run as userspace /
           | unprivileged (or at least limited privilege) processes which
           | would make them, ultimately, indistinguishable from a form of
           | database engine.
           | 
           | Persistent file systems are essentially key-value stores,
           | usually with optimizations for enumerating keys under a
           | namespace (also known as listing the files in a directory).
           | IMO a big problem with POSIX filesystems is the lack of
           | atomicity and lock guarantees when editing a file. This and a
           | complete lack of consistent networked API are the key reasons
           | few treat file systems as KV stores. It's a pity, really.
        
             | mrlongroots wrote:
             | > "Likely the idea that filesystems should run as userspace
             | / unprivileged (or at least limited privilege) processes
             | which would make them, ultimately, indistinguishable from a
             | form of database engine."
             | 
             | "Userspace vs not" is a different argument from
             | "consistency vs not" or "atomicity vs not" or "POSIX vs
             | not". Someone still needs to solve that problem. Sure
             | instead of SQLite over POSIX you could implement POSIX over
             | SQLite over raw blocks. But you haven't gained anything
             | meaningful.
             | 
             | > Persistent file systems are essentially key-value stores
             | 
             | I think this is reductive enough to be equivalent to "a
             | key-value store is a thin wrapper over the block
             | abstraction, as it already provides a key-value interface,
             | which is just a thin layer over taking a magnet and
             | pointing it at an offset".
             | 
             | Persistent filesystems can be built over key-value stores.
             | This is especially common in distributed filesystems. But
             | they also circumvent a key-value abstraction entirely.
             | 
             | > IMO a big problem with POSIX filesystems is the lack of
             | atomicity
             | 
             | Atomicity requires write-ahead logging + flushing a cache.
             | I fail to see why this needs to be mandatory, when it can
             | be effectively implemented at a higher layer.
             | 
             | > This and a complete lack of consistent networked API
             | 
             | A consistent networked API would require you to hit the
             | metadata server for every operation. No caching. Your
             | system would grind to a halt.
             | 
             | Finally, nothing in the POSIX spec prohibits an atomic
             | filesystem or consistency guarantees. It is just that no
             | one wants to implement these things that way because it
             | overprovisions for one property at the expense of others.
        
               | packetlost wrote:
               | > "Userspace vs not" is a different argument from
               | "consistency vs not" or "atomicity vs not" or "POSIX vs
               | not". Someone still needs to solve that problem. Sure
               | instead of SQLite over POSIX you could implement POSIX
               | over SQLite over raw blocks. But you haven't gained
               | anything meaningful.
               | 
               | This was an attempt to possibly explain the microkernel
               | point GP made, which only really matters _below_ the FS.
               | 
               | > I think this is reductive enough to be equivalent to "a
               | key-value store is a thin wrapper over the block
               | abstraction, as it already provides a key-value
               | interface, which is just a thin layer over taking a
               | magnet and pointing it at an offset".
               | 
               | I disagree with this premise. Key-value stores are an
               | API, not an abstraction over block storage (though many
               | are or can be configured to be so). File systems are
               | essentially a superset of a KV API with a multitude of
               | "backing stores". Saying KV stores are always backed by
               | blocks is overly reductive, no?
               | 
               | > Atomicity requires write-ahead logging + flushing a
               | cache. I fail to see why this needs to be mandatory, when
               | it can be effectively implemented at a higher layer.
               | 
               | You're confusing durability for atomicity. You don't need
               | a log to implement atomicity, you just need a way to lock
               | one or more entities (whatever the unit of atomic updates
               | is). A CoW filesystem in direct mode (zero page caching)
               | would need neither but could still support atomic updates
               | to file (names).
               | 
               | > A consistent networked API would require you to hit the
               | metadata server for every operation. No caching. Your
               | system would grind to a halt.
               | 
               | Sorry, I don't mean consistent in the ACID context, I
               | mean consistent in the loosely defined API shape context.
               | Think NFS or 9P.
               | 
               | I also disagree with this to some degree: pipelined
               | operations would certainly still be possible and
               | performant but would be rather clunky. End-to-end latency
               | for get->update-write, the common mode of operation,
               | would be pretty awful.
               | 
               | > Finally, nothing in the POSIX spec prohibits an atomic
               | filesystem or consistency guarantees. It is just that no
               | one wants to implement these things that way because it
               | overprovisions for one property at the expense of others.
               | 
               | I didn't say it did, but it doesn't require it which
               | means it effectively doesn't exist as far as the users of
               | FS APIs are concerned. Rename operations are the only API
               | for which POSIX requires atomicity. However, without a
               | CAS-like operation you can't safely implement a lock
               | without several extra syscalls.
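
The atomic rename mentioned above is what the common "write a temporary file, then rename it over the target" pattern relies on. A minimal sketch, assuming a local POSIX filesystem; the helper name `atomic_write` is made up for this example:

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of path so readers see the old or the new bytes, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # durability is a separate concern from atomicity
        os.replace(tmp_path, path)  # rename(2) on POSIX systems
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This gives an atomic swap of a whole file but not a compare-and-swap: two writers can still overwrite each other, which is the gap the comment above points at.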
        
           | Ericson2314 wrote:
           | The filesystem interface is the only privileged interface
           | because it is the one the kernel knows about. E.g. you can
           | already use FUSE and NFS to roll your own FS
           | _implementations_, but those do not a microkernel make,
           | because the OS is still in the way dictating the
           | implementation.
           | 
           | The safest way to put the FS on a level-playing field with
           | other interfaces is to make the kernel not know about it, just
           | as it doesn't know about, say, SQL.
        
         | qwertox wrote:
         | I can't agree with this. I like it that I can have all these
         | tools which work with files and are tools which are not db-
         | oriented, and the fact that there are different filesystems for
         | different scenarios, that I can sandwich LVM between a FS and
         | the block device. That /proc/ can pretend to be a FS because
         | else we'd possibly end up with something like the Windows
         | Registry for these operations, only managed through a database.
         | 
         | Would you store all your ~/ in something like SQLite database?
        
           | hdevalence wrote:
           | Yes, I would
        
           | 90s_dev wrote:
           | > Would you store all your ~/ in something like SQLite
           | database?
           | 
           | Actually yeah that sounds pretty good.
           | 
           | For Desktop/Finder/Explorer you'd just need a nice UI.
           | 
           | Searching Documents/projects/etc would be the same just maybe
           | faster?
           | 
           | All the arbitrary stuff like ~/.npm/**/* would stop
           | cluttering up my ls -la in ~ and could be stored in their own
           | tables whose names I genuinely don't care about. (This was
           | the dream of ~/Library, no?)
           | 
           | [edit] Ooooh, I get it now. This doesn't solve namespacing or
           | traversal.
        
         | foobiekr wrote:
         | Every single time this has been tried it has gone wrong, but
         | sure.
         | 
         | Almost all of the operations done on actual filesystems are not
         | database like, they are close to the underlying hardware for
         | practical reasons. If you want a database view, add one in an
         | upper layer.
        
           | jonhohle wrote:
           | BeOS got it right with BeFS. An Email client was just a
           | folder. MP3s could be sorted and filtered in the file system.
           | https://news.ycombinator.com/item?id=12309686
        
             | int_19h wrote:
             | Windows does something similar with Explorer today when you
             | open a folder that has mostly music files in it.
        
             | foobiekr wrote:
             | BeFS wasn't a database. It had indexed queries on EAs and
             | they had the habit of asking application files to add their
             | indexable content to the EAs. Internally it was just a
             | mostly-not-transactional collection of btrees.
             | 
             | There was no query language for updating files, or even
             | inspecting anything about a file that was not published in
             | the EAs (or implicitly do as with adapters), there were no
             | multi-file transactions, no joins, nothing. Just rich
             | metadata support in the FS.
        
               | Ericson2314 wrote:
               | Yeah I am talking more deep architecture, and BeOS is
               | more notable here mostly on just the user-interface
               | level.
               | 
               | However, I think it is reasonable to think that with way
               | more time and money, these things would meet up. Think
               | about it as digging a tunnel from both sides of the
               | mountain.
        
           | adolph wrote:
           | > they are close to the underlying hardware for practical
           | reasons
           | 
           | Could you provide reference information to support this
           | background assertion? I'm not totally familiar with
           | filesystems under the hood, but at this point doesn't storage
           | hardware maintain an electrical representation relatively
           | independent from the logical given things like wear leveling?
        
             | Ericson2314 wrote:
             | Yes I agree, that assertion doesn't pass muster.
             | 
             | Mature database implementations also bypass a lot of kernel
             | machinery to get closer to the underlying block devices.
             | The layering of DB on top of FS is a failure.
        
               | foobiekr wrote:
               | You are confusing that databases implement their own
               | filesystem equivalent functionality in an application-
               | specific way with the idea that FS's can or should be
               | databases.
        
               | Ericson2314 wrote:
               | I am not confusing any such thing. You need to define
               | "database" such that "file system" doesn't include it.
               | 
               | Common usage does this by convention, but that's just
               | sloppy thinking and populist _extensional_ defining.
               | I posit that any rigorous, thought-out, not overfit
               | _intensional_ definition of a database will, as a matter
               | of course, also include file systems.
        
             | mrlongroots wrote:
             | Some examples off the top of my head:
             | 
             | - You can reason about block offsets. If your writes are
             | 512B-aligned, you can be ensured minimal write
             | amplification.
             | 
             | - If your writes are append-only, log-structured, that
             | makes SSD compaction a lot more straightforward
             | 
             | - No caching guarantees by default. Again, even SSDs cache
             | writes. Block writes are not atomic even with SSDs. The
             | only way to guarantee atomicity is via write-ahead logs.
             | 
             | - The NVMe layer exposes async submission/completion
             | queues, to control the io_depth the device is subjected to,
             | which is essential to get max perf from modern NVMe SSDs.
             | Although you need to use the right interface to leverage it
             | (libaio/io_uring/SPDK).
        
           | packetlost wrote:
           | I don't see how file systems aren't _some_ sort of DBMS,
           | definitely not _relational_ but that wasn't a stated
           | requirement.
        
         | 01HNNWZ0MV43FF wrote:
         | You could do a loopback network filesystem and make any user-
         | space FS you want. That's what WSL does, and there's a Rust
         | crate for it. Can't recall the name at all.
        
           | Ericson2314 wrote:
           | There is NFS and FUSE so you can write your own
           | _implementation_ , but you are still stuck with the
           | _interface_ that the kernel understands.
        
         | runlaszlorun wrote:
         | I've heard this mentioned a couple times but what would this
         | look like functionality wise? A single "files" table with
         | columns? Different tables for different categories of files?
         | FTS? Something else?
        
           | Ericson2314 wrote:
           | See the other comments. The point is not a specific new
           | interface, but a separation of concerns, and leveling the
           | playing field.
           | 
           | I'll try to do an example. The kernel doesn't currently know
           | about SQL. Instead, you e.g. connect to a socket, and start
           | talking to postgres. Imagine if FS stuff was the same thing:
           | you connect to a socket, and then issue various commands to
           | read and write files. Ignore perf for a moment, it works
           | right?
           | 
           | Now, one counter-argument might be "hold up, what is this
           | socket you need to connect to, isn't that part of a file
           | system? Is there now an all-userspace inner filesystem, still
           | kernel-supported 'meta filesystem'?" Well, the answer to that
           | is maybe the Unix idea of making communication channels like
           | pipes and (to a lesser extent) sockets, was a _bad_ idea. Or
           | rather, there may be nothing wrong with saying a directory
           | can have a child which may be such a communication channel,
           | but there _is_ a problem with saying that every such
           | communication channel should live inside some directory.
        
         | mrlongroots wrote:
         | Thoughts:
         | 
         | 1. Distributed filesystems do often use databases for metadata
         | (FoundationDB for 3FS being a recent example)
         | 
         | 2. Using a B+ tree for metadata is not much different from
         | having a sorted index
         | 
         | 3. Filesystems are a common enough usecase that skipping the
         | abstraction complexity to co-optimize the stack is warranted
        
       | b0a04gl wrote:
       | If VectorVFS obscures retrieval logic behind opaque embeddings,
       | how do users debug why a file surfaced--or worse, why one didn't?
        
         | refulgentis wrote:
         | What is a non-opaque embedding?
         | 
         | Does VectorVFS do retrieval, or store embeddings in EXT4?
         | 
         | Is retrieval logic obscured by VectorVFS?
         | 
         | If VectorVFS did retrieval with non-opaque embeddings, how
         | would one debug why a file surfaced?
        
         | perone wrote:
         | Hi, not sure if I understood what you meant by opaque
         | embeddings as well, but the reason why files surface or not is
         | due to the similarity score (which is basically the dot product
         | of embeddings).
        
           | jlhawn wrote:
           | How much work do you think it would be to also have a
           | separate xattr which has a human-readable description of the
           | file contents? I wonder if that might already be an
           | intermediate product of some of the embedding tools, like
           | "arbitrary media" -> "text description of media" ->
           | "embedding vector". You could store both of those as xattrs
           | and you could debug by comparing your text query with the
           | text description of the file contents as they should produce
           | similar embedding vectors. You could even audit any file,
           | assuming you know what its contents are, by checking the text
           | description xattr generated by this program.
        
       | esafak wrote:
       | Files-as-vector stores is LanceDB's value proposition. How do you
       | compare in performance, etc.?
        
         | perone wrote:
         | This is quite different than LanceDB. In VectorVFS I'm using
         | the inodes directly to store the embeddings, there is no
         | external file with metadata and db, the db is your filesystem
         | itself, that's the key difference.
        
           | esafak wrote:
           | That's an implementation detail, and it sounds more like a
           | liability than a selling point, to have such tight coupling.
           | (Why) do you see not using files as a good thing?
           | 
           | Let me ask another question: is this intended for production
           | use, or is it more of a research project? Because as a user I
           | care about things like speed, simplicity, flexibility, and
           | robustness.
        
       | adenta wrote:
       | I wonder if I could use this locally on my MacBook. The Finder
       | application's built-in search is kinda meh.
        
         | perone wrote:
         | I'm planning to support MacOS, the only issue is with the
         | encoders that I'm using now, I will probably work more on it
         | next week to try to make a release that works on MacOS as well.
         | Thanks !
        
       | tzury wrote:
       | I've found that starting with a plain old filesystem often
       | outperforms fancy services - just as the Unix philosophy
       | ("everything is a file" [1]) has preached for decades [2].
       | 
       | When BigQuery was still in alpha I had to ingest ~15 billion HTTP
       | requests a day (headers, bodies, and metadata). None of the
       | official tooling was ready, so I wrote a tiny bash script that:
       | 
       | 1. uploaded the raw logs to Cloud Storage, and
       | 2. tracked state with three folders: `pending/`, `processing/`,
       |    `done/`.
       | 
       | A cron job cycled through those directories and quietly pushed
       | petabytes every week without dropping a byte. Later, Google's own
       | pipelines--and third-party stacks like Logstash--never matched
       | that script's throughput or reliability.
       | 
       | Lesson: reach for the filesystem first; add services only once
       | you've proven you actually need them.
       | 
       | [1] https://en.wikipedia.org/wiki/Everything_is_a_file [2]
       | https://en.wikipedia.org/wiki/Unix_philosophy
        
         | dominicq wrote:
         | Can you say more about the use case? What problem were you
         | solving? How did it work exactly? Sounds interesting so I'd
         | like to learn more.
        
           | tzury wrote:
           | Sure.
           | 
           | We were building Reblaze (started 2011), a cloud WAF / DDoS-
           | mitigation platform. Every HTTP request--good, bad, or ugly--
           | had to be stored for offline anomaly-detection and
           | clustering.
           | 
           | Traffic profile
           | 
           | - Baseline: ~15 B requests/day
           | - Under attack: the same 15 B can arrive in 2-3 hours
           | 
           | Why BigQuery (even in alpha)?
           | 
           | It was the only thing that could swallow that firehose and
           | stay query-able minutes later -- crucial when you're under
           | attack and your data source must _not_ melt down.
           | 
           | Pipeline (all shell + cron)
           | 
           | Edge nodes - write JSON logs locally and a local cron push to
           | Cloud Storage
           | 
           | Tiny VM with a cron loop
           | 
           | - Scans `pending/`, composes many small blobs into one
           |   "max-size" blob in `processing/`.
           | - Executes `bq load ...` into the customer's isolated dataset.
           | - On success, moves the blob to `done/`; on failure, drops it
           |   back to `pending/`.
           | 
           | Downstream ML/alerting pulls straight from BigQuery
           | 
           | That handful of `gsutil`, `bq`, and `mv` commands moved
           | multiple petabytes a week without losing a byte. Later
           | pipelines--Dataflow, Logstash, etc.--never matched its
           | throughput or reliability.
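
The three-folder state machine described above can be approximated on a local disk as follows. This is only a sketch of the idea in Python, not the original shell script: it works on local directories instead of Cloud Storage, skips the step that composes small blobs into one large one, and takes a hypothetical `load` callback where the real pipeline ran `bq load`.

```python
import os


def run_once(root, load):
    """Move each file pending -> processing, call load(path), then file it under done/ or back under pending/."""
    dirs = {name: os.path.join(root, name) for name in ("pending", "processing", "done")}
    for directory in dirs.values():
        os.makedirs(directory, exist_ok=True)
    outcome = {"done": [], "retry": []}
    # Snapshot the listing first so items returned to pending/ wait for the next run.
    for name in sorted(os.listdir(dirs["pending"])):
        claimed = os.path.join(dirs["processing"], name)
        os.rename(os.path.join(dirs["pending"], name), claimed)  # claim the work item
        try:
            load(claimed)
        except Exception:
            os.rename(claimed, os.path.join(dirs["pending"], name))
            outcome["retry"].append(name)
        else:
            os.rename(claimed, os.path.join(dirs["done"], name))
            outcome["done"].append(name)
    return outcome
```

Because each transition is a single rename within one filesystem, the current state of every item is visible with a plain `ls`, and a crashed run leaves its in-flight items sitting in `processing/` where they can be found.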
        
         | cratermoon wrote:
         | Command line tools can be 225x faster than a Hadoop cluster.
         | https://news.ycombinator.com/item?id=17135841
        
         | ryanianian wrote:
         | Not sure if it's still in use, but for a very long time, AWS
         | billing relied on getting usage data via rsync.
        
         | sunshine-o wrote:
         | Absolutely.
         | 
         | I would add that filesystems are superior to data formats (XML,
         | JSON, YAML, TOML) for many use cases such as configuration or
         | just storing data.
         | 
         | - Hierarchy are dirs,
         | 
         | - Keys are file names,
         | 
         | - Value is the content of the file.
         | 
         | - Other metadata are in hidden files
         | 
         | It will work forever, you can leverage ZFS, Git, rsync,
         | syncthing much better. If you want, a fancy shell like Nushell
         | will bring the experience pretty close to a database.
         | 
         | Most important you don't need fancy editor plugins or to learn
         | XPath, jq or yq.
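
The mapping listed above (directories as hierarchy, file names as keys, file contents as values, hidden files as metadata) is easy to read back into a program. A small sketch, with the function name `load_tree` and the choice to strip surrounding whitespace from values both being assumptions for the example:

```python
import os


def load_tree(root):
    """Read a directory into a nested dict: subdirectories become dicts, files become stripped strings."""
    tree = {}
    for entry in sorted(os.listdir(root)):
        if entry.startswith("."):
            continue  # hidden files are reserved for metadata in this scheme
        full = os.path.join(root, entry)
        if os.path.isdir(full):
            tree[entry] = load_tree(full)
        else:
            with open(full, encoding="utf-8") as fh:
                tree[entry] = fh.read().strip()
    return tree
```

A layout such as `server/port` containing `8080` would come back as `{"server": {"port": "8080"}}`; every value is a string, so any typing has to be layered on top.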
        
           | drob518 wrote:
           | Yes, but a couple downsides:
           | 
           | 1. For config, it spreads the config across a bunch of nested
           | directories, making it hard to read and write it without some
           | sort of special tool that shows it all to you at once. Sure,
           | you can easily edit 50 files from all sorts of directories in
           | your text editor, but that's pretty painful.
           | 
           | 2. For data storage, lots of smaller files will waste
           | partial storage blocks in many file systems. Some do coalesce
           | small files, but many don't.
           | 
           | 3. For both, it's often going to be higher performance to
           | read a single file from start to finish than a bunch of
           | files. Most file systems will try to keep file blocks in
           | mostly sequential order (defrag'd), whereas they don't
           | typically do that for multiple files in different
           | directories. SSD makes this mostly a non-issue these days,
           | however. You still have the issue of openings, closings, and
           | more read calls, however.
        
             | sunshine-o wrote:
             | > 1. For config, it spreads the config across a bunch of
             | nested directories, making it hard to read and write it
             | without some sort of special tool that shows it all to you
             | at once. Sure, you can easily edit 50 files from all sorts
             | of directories in your text editor, but that's pretty
             | painful.
             | 
             | It really depends how comfortable you are using the shell
             | and which one you use.
             | 
             | cat, tree, sed, grep, etc. will get you quite far, and one
             | might argue that they are simpler to master than vim and
             | various formats. Actually, mastering VSCode also takes a
             | lot of effort.
             | 
             | > 2. For data storage, lots of smaller files will waste
             | partial storage blocks in many file systems. Some do
             | coalesce small files, but many don't.
             | 
             | > 3. For both, it's often going to be higher performance to
             | read a single file from start to finish than a bunch of
             | files. Most file systems will try to keep file blocks in
             | mostly sequential order (defrag'd), whereas they don't
             | typically do that for multiple files in different
             | directories. SSD makes this mostly a non-issue these days,
             | however. You still have the issue of openings, closings,
             | and more read calls, however.
             | 
             | Agreed, but for most use cases here it really doesn't
             | matter, and if I need to optimise storage I will need a
             | database anyway.
             | 
             | And I sincerely believe that most micro optimisations at
             | the filesystem level are cancelled by running most editors
             | with data format support enabled....
        
           | cryptonector wrote:
           | Except that when you do need a tool like XSLT/XPath, jq, or
           | yq, now you need bash. I use bash lots, but still I'd
           | rather use a better language, like the ones you listed.
           | 
           | I'm being slightly hypocritical because I've made plenty of
           | use of the filesystem as a configuration store. In code it's
           | quite easy to stat one path relative to a directory, or open
           | it and read it, so it's very tempting.
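A minimal sketch of the directory-as-config idea discussed in this subthread, assuming a layout where directories are sections and each file holds one value. The function names `read_key` and `load_config_tree` are hypothetical, chosen only for illustration. The first reads a single path relative to a directory; the second gathers the whole tree into one nested dict, which is the "see it all at once" view that a spread-out config otherwise lacks.

```python
from pathlib import Path


def read_key(root, key):
    """Read one value by its path relative to the config directory."""
    return (Path(root) / key).read_text().strip()


def load_config_tree(root):
    """Read a whole directory tree into a nested dict.

    Directories become dicts and regular files become stripped strings.
    """
    tree = {}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            tree[entry.name] = load_config_tree(entry)
        elif entry.is_file():
            tree[entry.name] = entry.read_text().strip()
    return tree
```

With a directory containing `name` and `server/port`, `read_key(root, "server/port")` returns the port as a string, and `load_config_tree(root)` returns both values in one structure. Every value costs its own open, read, and close, which is the overhead drob518 mentions in point 3.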
        
       | bullen wrote:
       | I did something similar, but I use these EXT4 requirements:
       | - hard links (only tar works for backup)
       | - small file size (or inodes run out before disk space)
       | 
       | http://root.rupy.se
       | 
       | It's very useful for global distributed real-time data that don't
       | need the P in CAP for writes.
       | 
       | (no new data can be created if one node is offline = you can
       | login, but not register)
        
       | jlhawn wrote:
       | If I understand correctly, this is attaching metadata to files in
       | a format that LLMs (or any tool that can understand the semantic
       | embedding vector) can leverage to understand what a file is
       | without having to actually read the contents of the file.
       | 
       | That obviously has a lot of interesting use cases, but my first
       | assumption was that this could be used to quickly/easily search
       | your filesystem with some prompt like "Play the video from last
       | month where we went camping and saw a flock of turkeys". But that
       | would require having an actual vector DB running on your system
       | which you could use to quickly look up files using an embedding
       | of your query, no?
        
         | lstodd wrote:
         | so, like magic(5)?
        
           | mywittyname wrote:
           | What is magic(5) and how is it similar to what was described?
        
             | danudey wrote:
             | magic(5) is a system for determining the type of a file by
             | examining the 'magic bytes' at or near the start of a file.
             | 
             | For example, POSIX tar files have a defined file format
             | that starts with a header struct: https://www.gnu.org/softw
             | are/tar/manual/html_node/Standard.h...
             | 
             | You can see that at byte offset 257 is `char magic[6]`,
             | which contains `TMAGIC`, which is the byte string
             | "ustar\0". Thus, if a file has the bytes 'ustar\0' at
             | offset 257 we can reasonably assume that it's a tar file.
             | Almost every defined file type has some kind of string of
             | 'magic' predefined bytes at a predefined location that lets
             | a program know "yes, this is in fact a JPEG file" rather
             | than just asserting "it says .jpg so let's try to interpret
             | this bytestring and see what happens".
             | 
             | As for how it's similar: I don't think it actually is, I
             | think that's a misunderstanding. The metadata that this
             | vector FS is storing is more than "this is a JPEG" or
             | "this is a word document", as I understand it, so comparing
             | it to magic(5) is extremely reductionist. I could be
             | mistaken, however.
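The tar check described above can be sketched in a few lines of Python. This is an illustration of the idea, not how file(1) is implemented: it only tests for the POSIX `ustar\0` magic at offset 257, and the helper names are made up for this example. Archives in the older GNU format carry a slightly different magic string and would not match this exact comparison.

```python
import io
import tarfile

USTAR_OFFSET = 257          # offset of `char magic[6]` in the tar header
USTAR_MAGIC = b"ustar\x00"  # TMAGIC, including the trailing NUL


def looks_like_posix_tar(data):
    """Return True if the bytes carry the POSIX ustar magic at offset 257."""
    return data[USTAR_OFFSET:USTAR_OFFSET + len(USTAR_MAGIC)] == USTAR_MAGIC


def make_sample_tar():
    """Build a tiny in-memory ustar archive to test the check against."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Here `looks_like_posix_tar(make_sample_tar())` is true, while arbitrary bytes such as a JPEG header fail the check. Either way, the answer says what kind of file it is and nothing about what the file is about.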
        
             | yjftsjthsd-h wrote:
             | https://manpages.org/magic/5 is a database of file types,
             | used by the file(1) command. I don't exactly follow how
             | it's the same though; it would let you say "what files are
             | videos" but not "what files are videos of a cat". Which is
             | sort of related but unless I missed something there is a
             | difference.
        
             | simcop2387 wrote:
             | I think they're referring to this,
             | https://linux.die.net/man/5/magic given the notation. That
             | said I don't really see how it'd be all that relevant to
             | the discussion so maybe i'm missing something else.
        
             | 0x457 wrote:
             | magic(5) means `man 5 magic`:
             | https://linux.die.net/man/5/magic
             | 
             | It's just a tool that can read "magic bytes" to figure out
             | what a file contains. Very different from what VectorVFS is.
        
             | lstodd wrote:
             | four people answered strictly correctly as to what magic(5)
             | is, but not a single one realized that storing some aux
             | data as xattr in linux FS is not in any way different from
             | just storing the exact same data as a file header. which is
             | how magic(5) works.
             | 
             | how come?
             | 
             | (besides good luck not forgetting to rsync those xattrs)
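For readers unfamiliar with xattrs, here is a rough sketch of storing an embedding as an extended attribute on Linux. It is not taken from the VectorVFS source: the attribute name `user.demo.embedding` and the float32 little-endian layout are assumptions made for this example. `os.setxattr` and `os.getxattr` are Linux-only and raise `OSError` on filesystems without user xattr support.

```python
import os
import struct

# Hypothetical attribute name, used here for illustration only.
XATTR_NAME = "user.demo.embedding"


def pack_embedding(vec):
    """Serialize a list of floats as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob):
    """Inverse of pack_embedding: 4 bytes per float32 value."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


def store_embedding(path, vec):
    """Attach the packed vector to the file as an extended attribute."""
    os.setxattr(path, XATTR_NAME, pack_embedding(vec))


def load_embedding(path):
    """Read the vector back from the file's extended attribute."""
    return unpack_embedding(os.getxattr(path, XATTR_NAME))
```

Under this layout a 384-dimensional vector packs to 1536 bytes. On the rsync remark: rsync copies extended attributes only when asked to with `-X` (`--xattrs`).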
        
         | perone wrote:
         | Hi, it is quite different: there is no LLM involved. We can
         | certainly use it for RAG, for example, but what is currently
         | implemented is basically a way to generate embeddings (vector
         | representations), which are then used for search later. It is
         | all offline and local (no data from your files is ever sent to
         | the cloud).
        
           | jlhawn wrote:
           | I understand that LLMs aren't involved in generating the
           | embeddings and adding the xattrs. I was just wondering what
           | the value add of this is if there's no other background
           | process (like mds on macOS) which is using it to build a
           | search index.
           | 
           | I guess what I'm asking is: how does VectorVFS enable search
           | besides iterating through all files and iteratively comparing
           | file embeddings with the embedding of a search query? The
           | project description says "efficient and semantically
           | searchable" and "eliminating the need for external index
           | files or services" but I can't think of any more efficient
           | way to do a search without literally walking the entire
           | filesystem tree to look for the file with the most similar
           | vector.
           | 
           | Edit: reading the docs [1] confirmed this. The `vfs search
           | TERM DIRECTORY` command:
           | 
           | > will automatically iterate over all files in the folder,
           | look for supported files and then embed the file or load
           | existing embeddings directly from the filesystem.
           | 
           | [1]:
           | https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-
           | se...
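The kind of search being described here, comparing the query embedding against every file's embedding in turn, can be sketched as a brute-force scan. This is an assumption-laden illustration rather than the VectorVFS implementation: embeddings are passed in as a plain dict of path to vector, cosine similarity is assumed as the metric, and the function names are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def linear_search(query, embeddings, top_k=3):
    """Score every stored vector against the query: O(N) in the file count.

    `embeddings` maps a file path to its vector. Returns up to `top_k`
    (score, path) pairs, best match first.
    """
    scored = (
        (cosine_similarity(query, vec), path)
        for path, vec in embeddings.items()
    )
    return heapq.nlargest(top_k, scored)
```

Each query touches every vector once, so the cost grows linearly with the number of files. Nothing here avoids walking the tree; an index would be needed for that.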
        
           | pilooch wrote:
           | Using it for a RAG is smart indeed, especially with a
           | multimodal encoder (vision-rag), as the implementation would
           | be straightforward from what you already have.
        
         | lstodd wrote:
         | if you go look up how xattrs work, you will understand it's no
         | different than just reading a chunk of the file in question,
         | and in fact can be slower.
         | 
         | xattrs are better off forgotten already. it was just as dumb
         | an idea as macos resource forks.
        
       | colordrops wrote:
       | Rt
        
       | pseudosavant wrote:
       | This immediately made me nostalgic for BeOS's BeFS or Windows
       | Longhorn's WinFS database filesystems, and how this kind of thing
       | would have fit them perfectly. So much cool stuff you could do with
       | vectors for everything. Smart folders that include files for a
       | project based on a description of the project. Show me all of my
       | config files for appXYZ. Images of a black dog at the beach. At
       | the OS-level for any other app to easily tap into.
       | 
       | I'd be surprised if cloud storage services like OneDrive don't
       | already do some kind of vector for every file you store. But an
       | online web service isn't the same as being built into the core of
       | the OS.
        
         | perone wrote:
         | I share the same feeling. I think filesystems will have to
         | reinvent themselves given how quickly ML models have become
         | useful in the past few years.
        
           | didgetmaster wrote:
           | I built a local object store that was designed to replace
           | file systems. You can create hundreds of millions of objects
           | (e.g. files) and attach a variety of metadata tags to each
           | one. A tag could be a number, string, or other data type
           | (including vector info). Searching for objects with certain
           | tags is exceptionally fast.
           | 
           | I invented it because I found searching conventional file
           | systems that support extended attributes to be unbearably
           | slow.
        
             | tugdual wrote:
             | Got a demo ?
        
               | didgetmaster wrote:
               | Tons of demo videos on my YouTube channel. Free beta
               | available for download on my website. Links in my
               | profile.
        
         | p_ing wrote:
         | WinFS wasn't a file system laid down on hardware, it was just a
         | SQL database that stored arbitrary data.
        
           | didgetmaster wrote:
           | I think that is one of the main reasons it failed to launch.
           | It was just too easy for the metadata stored in the separate
           | database to become out of sync with the actual file data.
           | 
           | Microsoft saw the tech support nightmare this could generate,
           | and abandoned the project.
        
             | pseudosavant wrote:
             | They just weren't able to pull it off for whatever reason.
             | I actually ran BeOS as my daily driver for quite a while
             | (way) back in the day. BeFS was genuinely amazing, and not
             | something I've seen replicated elsewhere yet. There hasn't
             | really been anything interesting done in filesystems used
             | by users on devices in a really long time.
        
             | p_ing wrote:
             | It was abandoned due to The Cloud. There was no need for
             | WinFS as a tech when you could store everything in The
             | Cloud.
             | 
             | It was also complex, ran poorly, and would have required
             | developers to integrate their applications.
             | 
             | Microsoft had long solved the problem of blobs and metadata
             | in ESE and SharePoint's use of MS SQL for binary + metadata
             | storage.
        
               | WalterGR wrote:
               | > it was just a SQL database that stored arbitrary data.
               | 
               | I mean, for some definitions of "just", "SQL database",
               | and "arbitrary data." :) It was a schematised graph database
               | implemented on top of a slimmed-down version of SQL
               | Server. The query language was not SQL-based.
               | 
               | > It was abandoned due to The Cloud.
               | 
               | It was discontinued circa 2007. The cloud was much less
               | of a Thing back then. I don't recall that factoring into
               | the decision to cancel the project at all, though it
               | would have been prescient.
               | 
               | (Disclaimer: I was on the WinFS team at Microsoft.)
        
       | asadawadia wrote:
       | is the embedding for the whole file? or each 1024/512 byte chunk?
        
       | javier2 wrote:
       | i looked into something similar a few years ago, where i stored
       | embeddings in xattrs
        
       | quantadev wrote:
       | I've been wondering for about 20 years why File Systems basically
       | died and stopped innovating. For example we have lots of
       | hierarchical data structures in the world, and no one seems to
       | have figured out how to let a folder be the storage, instead of
       | always just databases.
       | 
       | For example, if we simply had the ability to have "ordered" files
       | inside folders, that would instantly make it practical for a
       | folder structure to represent "Documents". After all, documents
       | are nothing but a list of paragraphs and images, so if we simply
       | had ordering in file systems we could have document editors which
       | are using individual files for each paragraph of text or image.
       | It would be amazing.
       | 
       | Also think about use cases like Jupyter Notebooks. We could stop
       | using the JSON file format, and just make it a folder structure
       | instead. Each cell (node) being in a file. All social media
       | messages and chatbot conversations could be easily saved as
       | folder structures.
       | 
       | I've heard many file copy tools ignore XATTR so I've never tried
       | to use it for this purpose, so maybe we've had the capability all
       | along and just nobody thought to use it in a big way that became
       | popular yet. Maybe I should consider XATTR and take it seriously.
        
       | thirdtrigger wrote:
       | Might be interesting to add an optional embedded Weaviate [1]
       | with a flat-index [2] to the project. It wouldn't use external
       | services and is fully disk-based. Would allow you to search the
       | whole filesystem (about 1.5kb per file (384 dimensions) which
       | would be added to the metadata as well).
       | 
       | 1.
       | https://weaviate.io/developers/weaviate/installation/embedde...
       | 2. https://weaviate.io/developers/academy/py/vector_index/flat
        
         | binarymax wrote:
         | Why weaviate and not FAISS? The latter is faster and lighter.
        
       | gitroom wrote:
       | Gotta say, the old school debate on filesystems vs databases will
       | never get old for me - I always end up with more questions than
       | answers after reading stuff like this.
        
         | j45 wrote:
         | Everything's old school, everything's new.
         | 
         | It's important to remember that the cloud was also invented by
         | the old school. Understanding the oscillation between
         | client/server and local architectures, and its implications
         | for data and files, is interesting too.
         | 
         | More questions means more learning until I learned there's no
         | one right or wrong, just what works best, where, when, for how
         | long, and what the tradeoffs are.
         | 
         | Quick wins/decisions are often bandaids that pile up in a
         | different way.
        
       ___________________________________________________________________
       (page generated 2025-05-05 23:00 UTC)