[HN Gopher] Show HN: VectorVFS, your filesystem as a vector data...
___________________________________________________________________
Show HN: VectorVFS, your filesystem as a vector database
Author : perone
Score : 191 points
Date : 2025-05-05 15:17 UTC (7 hours ago)
(HTM) web link (vectorvfs.readthedocs.io)
(TXT) w3m dump (vectorvfs.readthedocs.io)
| malcolmgreaves wrote:
| Fun idea storing embeddings in inodes! Very clever!
|
| I want to point out that this isn't suitable for any kind of
| actual things you'd use a vector database for. There's no notion
| of a search index. It's always an O(N) linear search through all
| of your files:
| https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
|
| Still, fun idea :)
| perone wrote:
| Thanks. There is a bit of a nuance there, for example: you can
| build an index in first pass which will indeed be linear, but
| then later keep it in an open prompt for subsequent queries,
| I'm planning to implement that mode soon. But agree, it is not
| intended to search 10 million files, but you seldom have this
| use case in local use anyways.
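A minimal sketch of the mode described above, assuming embeddings are plain lists of floats: one linear pass collects every vector into memory, and later queries scan that in-memory list without touching the filesystem again. The names `build_index` and `search` are hypothetical and are not part of VectorVFS.

```python
import heapq
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_index(embeddings: Dict[str, Vector]) -> List[Tuple[str, Vector]]:
    """First pass: collect (path, vector) pairs once, in O(N)."""
    return list(embeddings.items())


def search(index: List[Tuple[str, Vector]], query: Vector, k: int = 3) -> List[Tuple[float, str]]:
    """Each query is still a linear scan, but over memory only."""
    scored = ((dot(vec, query), path) for path, vec in index)
    return heapq.nlargest(k, scored)


# Toy data standing in for embeddings that would be read from files.
index = build_index({
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "notes.txt": [0.0, 0.1, 0.9],
})
top = search(index, [1.0, 0.0, 0.0], k=2)
```

Each query remains O(N); the only saving in this sketch is that the per-file reads happen once instead of on every search.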
| binarymax wrote:
| O(n) is still OK for vector search if n isn't too large.
| Filesystem search solutions are currently terrible, with
| background indexing jobs and poor relevance. This won't scale
| for every file on your system but anything in your working
| documents folder would easily work well.
| PaulHoule wrote:
| The lack of an index is not bad at all if you have it stored
| contiguously in RAM: the mechanical sympathy is great, SIMD
| will spin like a top not to mention multithreaded programming,
| etc. Circa 2014 or so I worked on a search engine that scanned
| maybe 2GB worth of vectors for 10 million documents, queries
| were turned around in much less than a second, nobody
| complained about the speed.
|
| If you gotta gather the data from a lot of different inodes, it
| is a different story.
| ori_b wrote:
| It's not stored contiguously in RAM. It's stored in extended
| attributes.
| esafak wrote:
| Thanks for saving readers time. If so, this is not a viable tool
| for production.
| int_19h wrote:
| An index could be built on top of this though if desired. No
| need to have it in the FS itself.
| natas wrote:
| this is actually a great idea
| iugtmkbdfil834 wrote:
| Assuming I understand it correctly, the idea is to be able to
| have LLMs get through file systems more easily with some
| interesting benefits to human users as well. The idea is
| interesting and I want to try it out.
| perone wrote:
| Hi, there are no LLMs involved, it is all local and an
| embedding (vector representation) of the data is created and
| then that is used for search later, nothing is sent to cloud
| from your files and there are no local LLMs running as well,
| only the encoders (I use the Perception Encoder from Meta
| released a few weeks ago).
| anotherpaul wrote:
| Great idea indeed. The documentation needs a bit more information
| to be useful. What GPU backends are supported for example? How do
| I delete the embedding information after I decide to uninstall
| it? Will give it a try though.
| perone wrote:
| Thanks, I'm working on implementing the commands to clean the
| embeddings (you can now do that with Linux xattr command-line
| tool). I'm supporting CPU or GPU (NVIDIA) for the encoders and
| it only supports Linux at the moment.
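As a rough illustration of storing and cleaning up an embedding in an extended attribute, here is a sketch using Python's `os.setxattr`, `os.getxattr` and `os.removexattr` (available on Linux only). The attribute name `user.example.embedding` and the float32 encoding are assumptions for illustration; the name and format VectorVFS actually uses may differ.

```python
import os
import struct
from typing import List

# Hypothetical attribute name; the one VectorVFS really uses may differ.
ATTR = "user.example.embedding"


def pack_vector(vec: List[float]) -> bytes:
    """Serialize floats as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> List[float]:
    """Inverse of pack_vector."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def store_embedding(path: str, vec: List[float]) -> None:
    """Attach the vector to the file as an extended attribute."""
    os.setxattr(path, ATTR, pack_vector(vec))


def load_embedding(path: str) -> List[float]:
    """Read the vector back from the extended attribute."""
    return unpack_vector(os.getxattr(path, ATTR))


def remove_embedding(path: str) -> None:
    """Delete the attribute, e.g. before uninstalling the tool."""
    os.removexattr(path, ATTR)
```

Whether these calls succeed depends on the filesystem and mount options, since not every filesystem supports user extended attributes.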
| 3abiton wrote:
| I am curious why Python, and not rust for example?
| danudey wrote:
| Not OP, but despite working in an all-Go shop I just wrote
| a utility in Python the other week and caught some flak for
| it.
|
| The reason I gave (which was accepted) was that the process
| of creating a proof of concept and iterating on it rapidly
| is vastly easier in Python (for me) than it is in Go. In
| essence, it would have taken me at least a week, possibly
| more, to write the program I ended up with in Golang, but
| it only took me a day to write it in Python, and, now that
| I understand the problem and have a working (production-
| ready) prototype, it would probably only take me another
| day to rewrite it in Golang.
|
| Also, a large chunk of the functionality in this Python
| script seems to be libraries - pillow for image processing,
| but also pytorch and related vision/audio/codec libraries.
| Even if similar production-ready Rust crates are available
| (I'm not sure if they are), this kind of thing is something
| Python excels at and which these modules are already
| optimized for. Most of the "work" happening here isn't
| happening in Python, by and large.
| hadlock wrote:
| Sure, but now your all-Go shop needs to support two
| languages, two sets of linters, CI/CD etc. for a single
| utility. It might be faster for you, but if the utility is
| going to be used for more than a couple of weeks, it's
| a real hassle to get a Go developer to make sure they
| have the right version of the interpreter, remember all
| the ins and outs of Python etc.
| perone wrote:
| Hi, I think Rust won't bring much benefit here to be
| honest, the bottleneck is mainly the model and model
| loading. It would probably be a nightmare to load these
| models from Rust, I would have to use torch bindings and
| then convert everything from the preprocessing already in
| Python to Rust.
| badmonster wrote:
| interesting
| Ericson2314 wrote:
| The idea that filesystems are not just a flavor of database
| management systems was always a mistake.
|
| Maybe with micro-kernels we'll finally fix this.
| 7qW24A wrote:
| I'm a database guy, not an OS guy, so I agree, obviously... But
| what is the micro-kernel angle?
| packetlost wrote:
| Likely the idea that filesystems should run as userspace /
| unprivileged (or at least limited privilege) processes which
| would make them, ultimately, indistinguishable from a form of
| database engine.
|
| Persistent file systems are essentially key-value stores,
| usually with optimizations for enumerating keys under a
| namespace (also known as listing the files in a directory).
| IMO a big problem with POSIX filesystems is the lack of
| atomicity and lock guarantees when editing a file. This and a
| complete lack of consistent networked API are the key reasons
| few treat file systems as KV stores. It's a pity, really.
| mrlongroots wrote:
| > "Likely the idea that filesystems should run as userspace
| / unprivileged (or at least limited privilege) processes
| which would make them, ultimately, indistinguishable from a
| form of database engine."
|
| "Userspace vs not" is a different argument from
| "consistency vs not" or "atomicity vs not" or "POSIX vs
| not". Someone still needs to solve that problem. Sure
| instead of SQLite over POSIX you could implement POSIX over
| SQLite over raw blocks. But you haven't gained anything
| meaningful.
|
| > Persistent file systems are essentially key-value stores
|
| I think this is reductive enough to be equivalent to "a
| key-value store is a thin wrapper over the block
| abstraction, as it already provides a key-value interface,
| which is just a thin layer over taking a magnet and
| pointing it at an offset".
|
| Persistent filesystems can be built over key-value stores.
| This is especially common in distributed filesystems. But
| they also circumvent a key-value abstraction entirely.
|
| > IMO a big problem with POSIX filesystems is the lack of
| atomicity
|
| Atomicity requires write-ahead logging + flushing a cache.
| I fail to see why this needs to be mandatory, when it can
| be effectively implemented at a higher layer.
|
| > This and a complete lack of consistent networked API
|
| A consistent networked API would require you to hit the
| metadata server for every operation. No caching. Your
| system would grind to a halt.
|
| Finally, nothing in the POSIX spec prohibits an atomic
| filesystem or consistency guarantees. It is just that no
| one wants to implement these things that way because it
| overprovisions for one property at the expense of others.
| packetlost wrote:
| > "Userspace vs not" is a different argument from
| "consistency vs not" or "atomicity vs not" or "POSIX vs
| not". Someone still needs to solve that problem. Sure
| instead of SQLite over POSIX you could implement POSIX
| over SQLite over raw blocks. But you haven't gained
| anything meaningful.
|
| This was an attempt to possibly explain the microkernel
| point GP made, which only really matters _below_ the FS.
|
| > I think this is reductive enough to be equivalent to "a
| key-value store is a thin wrapper over the block
| abstraction, as it already provides a key-value
| interface, which is just a thin layer over taking a
| magnet and pointing it at an offset".
|
| I disagree with this premise. Key-value stores are an
| API, not an abstraction over block storage (though many
| are or can be configured to be so). File systems are
| essentially a superset of a KV API with a multitude of
| "backing stores". Saying KV stores are always backed by
| blocks is overly reductive, no?
|
| > Atomicity requires write-ahead logging + flushing a
| cache. I fail to see why this needs to be mandatory, when
| it can be effectively implemented at a higher layer.
|
| You're confusing durability for atomicity. You don't need
| a log to implement atomicity, you just need a way to lock
| one or more entities (whatever the unit of atomic updates
| are). A CoW filesystem in direct mode (zero page caching)
| would need neither but could still support atomic updates
| to file (names).
|
| > A consistent networked API would require you to hit the
| metadata server for every operation. No caching. Your
| system would grind to a halt.
|
| Sorry, I don't mean consistent in the ACID context, I
| mean consistent in the loosely defined API shape context.
| Think NFS or 9P.
|
| I also disagree with this to some degree: pipelined
| operations would certainly still be possible and
| performant but would be rather clunky. End-to-end latency
| for get->update->write, the common mode of operation,
| would be pretty awful.
|
| > Finally, nothing in the POSIX spec prohibits an atomic
| filesystem or consistency guarantees. It is just that no
| one wants to implement these things that way because it
| overprovisions for one property at the expense of others.
|
| I didn't say it did, but it doesn't require it which
| means it effectively doesn't exist as far as the users of
| FS APIs are concerned. Rename operations are the only API
| for which POSIX requires atomicity. However without a
| CAS-like operation you can't safely implement a lock
| without several extra syscalls.
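The usual way applications lean on the atomic rename mentioned above is to write a temporary file and rename it over the target. The sketch below shows that pattern; `atomic_write` is a hypothetical helper name, and the `fsync` call addresses durability, which is a separate concern from atomicity.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # durability, not atomicity
        os.replace(tmp, path)  # rename(2) on POSIX systems
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

This covers whole-file replacement only; it does not provide the compare-and-swap style operation the comment says is missing.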
| Ericson2314 wrote:
| The filesystem interface is the only privileged interface because
| it is the one the kernel knows about. E.g. you can already use FUSE
| and NFS to roll your own FS _implementations_ , but those do
| not a microkernel make, because the OS is still in the way
| dictating the implementation.
|
| The safest way to put the FS on a level-playing field with
| other interfaces is to make the kernel not know about it, just
| as it doesn't know about, say, SQL.
| qwertox wrote:
| I can't agree with this. I like it that I can have all these
| tools which work with files and are tools which are not db-
| oriented, and the fact that there are different filesystems for
| different scenarios, that I can sandwich LVM between a FS and
| the block device. That /proc/ can pretend to be a FS because
| else we'd possibly end up with something like the Windows
| Registry for these operations, only managed through a database.
|
| Would you store all your ~/ in something like SQLite database?
| hdevalence wrote:
| Yes, I would
| 90s_dev wrote:
| > Would you store all your ~/ in something like SQLite
| database?
|
| Actually yeah that sounds pretty good.
|
| For Desktop/Finder/Explorer you'd just need a nice UI.
|
| Searching Documents/projects/etc would be the same just maybe
| faster?
|
| All the arbitrary stuff like ~/.npm/**/* would stop
| cluttering up my ls -la in ~ and could be stored in their own
| tables whose names I genuinely don't care about. (This was
| the dream of ~/Library, no?)
|
| [edit] Ooooh, I get it now. This doesn't solve namespacing or
| traversal.
| foobiekr wrote:
| Every single time this has been tried it has gone wrong, but
| sure.
|
| Almost all of the operations done on actual filesystems are not
| database like, they are close to the underlying hardware for
| practical reasons. If you want a database view, add one in an
| upper layer.
| jonhohle wrote:
| BeOS got it right with BeFS. An Email client was just a
| folder. MP3s could be sorted and filtered in the file system.
| https://news.ycombinator.com/item?id=12309686
| int_19h wrote:
| Windows does something similar with Explorer today when you
| open a folder that has mostly music files in it.
| foobiekr wrote:
| BeFS wasn't a database. It had indexed queries on EAs and
| they had the habit of asking application files to add their
| indexable content to the EAs. Internally it was just a
| mostly-not-transactional collection of btrees.
|
| There was no query language for updating files, or even
| inspecting anything about a file that was not published in
| the EAs (or implicitly do as with adapters), there were no
| multi-file transactions, no joins, nothing. Just rich
| metadata support in the FS.
| Ericson2314 wrote:
| Yeah I am talking more deep architecture, and BeOS is
| more notable here mostly on just the user-interface
| level.
|
| However, I think it is reasonable to think that with way
| more time and money, these things would meet up. Think
| about it as digging a tunnel from both sides of the
| mountain.
| adolph wrote:
| > they are close to the underlying hardware for practical
| reasons
|
| Could you provide reference information to support this
| background assertion? I'm not totally familiar with
| filesystems under the hood, but at this point doesn't storage
| hardware maintain an electrical representation relatively
| independent from the logical given things like wear leveling?
| Ericson2314 wrote:
| Yes I agree, that assertion doesn't pass muster.
|
| Mature database implementations also bypass a lot of kernel
| machinery to get closer to the underlying block devices.
| The layering of DB on top of FS is a failure.
| foobiekr wrote:
| You are confusing that databases implement their own
| filesystem equivalent functionality in an application-
| specific way with the idea that FS's can or should be
| databases.
| Ericson2314 wrote:
| I am not confusing any such thing. You need to define
| "database" such that "file system" doesn't include it.
|
| Common usage does this by convention, but that's just
| sloppy thinking and populist _extensional_ defining.
| I posit that any rigorous, thought-out, not overfit
| _intensional_ definition of a database will, as a matter
| of course, also include file systems.
| mrlongroots wrote:
| Some examples off the top of my head:
|
| - You can reason about block offsets. If your writes are
| 512B-aligned, you can be assured of minimal write
| amplification.
|
| - If your writes are append-only, log-structured, that
| makes SSD compaction a lot more straightforward
|
| - No caching guarantees by default. Again, even SSDs cache
| writes. Block writes are not atomic even with SSDs. The
| only way to guarantee atomicity is via write-ahead logs.
|
| - The NVMe layer exposes async submission/completion
| queues, to control the io_depth the device is subjected to,
| which is essential to get max perf from modern NVMe SSDs.
| Although you need to use the right interface to leverage it
| (libaio/io_uring/SPDK).
| packetlost wrote:
| I don't see how file systems aren't _some_ sort of DBMS,
| definitely not _relational_ but that wasn't a stated
| requirement.
| 01HNNWZ0MV43FF wrote:
| You could do a loopback network filesystem and make any user-
| space FS you want. That's what WSL does, and there's a Rust
| crate for it. Can't recall the name at all.
| Ericson2314 wrote:
| There is NFS and FUSE so you can write your own
| _implementation_ , but you are still stuck with the
| _interface_ that the kernel understands.
| runlaszlorun wrote:
| I've heard this mentioned a couple times but what would this
| look like functionality wise? A single "files" table with
| columns? Different tables for different categories of files?
| FTS? Something else?
| Ericson2314 wrote:
| See the other comments. The point is not a specific new
| interface, but a separation of concerns, and leveling the
| playing field.
|
| I'll try to do an example. The kernel doesn't currently know
| about SQL. Instead, you e.g. connect to a socket, and start
| talking to postgres. Imagine if FS stuff was the same thing:
| you connect to a socket, and then issue various commands to
| read and write files. Ignore perf for a moment, it works
| right?
|
| Now, one counter-argument might be "hold up, what is this
| socket you need to connect to, isn't that part of a file
| system? Is there now an all-userspace inner filesystem, still
| kernel-supported 'meta filesystem'?" Well, the answer to that
| is maybe the Unix idea of making communication channels like
| pipes and (to a lesser extent) sockets, was a _bad_ idea. Or
| rather, there may be nothing wrong with saying a directory
| can have a child which may be such a communication channel,
| but there _is_ a problem with saying that every such
| communication channel should live inside some directory.
| mrlongroots wrote:
| Thoughts:
|
| 1. Distributed filesystems do often use databases for metadata
| (FoundationDB for 3FS being a recent example)
|
| 2. Using a B+ tree for metadata is not much different from
| having a sorted index
|
| 3. Filesystems are a common enough usecase that skipping the
| abstraction complexity to co-optimize the stack is warranted
| b0a04gl wrote:
| If VectorVFS obscures retrieval logic behind opaque embeddings,
| how do users debug why a file surfaced--or worse, why one didn't?
| refulgentis wrote:
| What is a non-opaque embedding?
|
| Does VectorVFS do retrieval, or store embeddings in EXT4?
|
| Is retrieval logic obscured by VectorVFS?
|
| If VectorVFS did retrieval with non-opaque embeddings, how
| would one debug why a file surfaced?
| perone wrote:
| Hi, not sure if I understood what you meant by opaque
| embeddings as well, but the reason why files surface or not is
| due to the similarity score (which is basically the dot product
| of embeddings).
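For reference, the dot product relates to the more familiar cosine similarity as follows. The two coincide when both embeddings have unit length, which is an assumption here and not something the comment states.

```latex
\[
\mathrm{score}(q, d) = q \cdot d = \sum_{i=1}^{n} q_i d_i
  = \lVert q \rVert \, \lVert d \rVert \cos\theta ,
\qquad
\lVert q \rVert = \lVert d \rVert = 1
  \;\Rightarrow\; \mathrm{score}(q, d) = \cos\theta .
\]
```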
| jlhawn wrote:
| How much work do you think it would be to also have a
| separate xattr which has a human-readable description of the
| file contents? I wonder if that might already be an
| intermediate product of some of the embedding tools, like
| "arbitrary media" -> "text description of media" ->
| "embedding vector". You could store both of those as xattrs
| and you could debug by comparing your text query with the
| text description of the file contents as they should produce
| similar embedding vectors. You could even audit any file,
| assuming you know what its contents are, by checking the text
| description xattr generated by this program.
| esafak wrote:
| Files-as-vector stores is LanceDB's value proposition. How do you
| compare in performance, etc.?
| perone wrote:
| This is quite different than LanceDB. In VectorVFS I'm using
| the inodes directly to store the embeddings, there is no
| external file with metadata and db, the db is your filesystem
| itself, that's the key difference.
| esafak wrote:
| That's an implementation detail, and it sounds more like a
| liability than a selling point, to have such tight coupling.
| (Why) do you see not using files as a good thing?
|
| Let me ask another question: is this intended for production
| use, or is it more of a research project? Because as a user I
| care about things like speed, simplicity, flexibility, and
| robustness.
| adenta wrote:
| I wonder if I could use this locally on my macbook. The Finder
| application's built-in search is kinda meh.
| perone wrote:
| I'm planning to support MacOS, the only issue is with the
| encoders that I'm using now, I will probably work more on it
| next week to try to make a release that works on MacOS as well.
| Thanks !
| tzury wrote:
| I've found that starting with a plain old filesystem often
| outperforms fancy services - just as the Unix philosophy
| ("everything is a file" [1]) has preached for decades [2].
|
| When BigQuery was still in alpha I had to ingest ~15 billion HTTP
| requests a day (headers, bodies, and metadata). None of the
| official tooling was ready, so I wrote a tiny bash script that:
|
| 1. uploaded the raw logs to Cloud Storage, and
| 2. tracked state with three folders: `pending/`, `processing/`,
|    `done/`.
|
| A cron job cycled through those directories and quietly pushed
| petabytes every week without dropping a byte. Later, Google's own
| pipelines--and third-party stacks like Logstash--never matched
| that script's throughput or reliability.
|
| Lesson: reach for the filesystem first; add services only once
| you've proven you actually need them.
|
| [1] https://en.wikipedia.org/wiki/Everything_is_a_file [2]
| https://en.wikipedia.org/wiki/Unix_philosophy
| dominicq wrote:
| Can you say more about the use case? What problem were you
| solving? How did it work exactly? Sounds interesting so I'd
| like to learn more.
| tzury wrote:
| Sure.
|
| We were building Reblaze (started 2011), a cloud WAF / DDoS-
| mitigation platform. Every HTTP request--good, bad, or ugly--
| had to be stored for offline anomaly-detection and
| clustering.
|
| Traffic profile
|
| - Baseline: ~15 B requests/day
| - Under attack: the same 15 B can arrive in 2-3 hours
|
| Why BigQuery (even in alpha)?
|
| It was the only thing that could swallow that firehose and
| stay query-able minutes later -- crucial when you're under
| attack and your data source must _not_ melt down.
|
| Pipeline (all shell + cron)
|
| Edge nodes
|
| - Write JSON logs locally; a local cron pushes them to Cloud
|   Storage.
|
| Tiny VM with a cron loop
|
| - Scans `pending/`, composes many small blobs into one
|   "max-size" blob in `processing/`.
| - Executes `bq load ...` into the customer's isolated dataset.
| - On success, moves the blob to `done/`; on failure, drops it
|   back to `pending/`.
|
| Downstream ML/alerting pulls straight from BigQuery
|
| That handful of `gsutil`, `bq`, and `mv` commands moved
| multiple petabytes a week without losing a byte. Later
| pipelines--Dataflow, Logstash, etc.--never matched its
| throughput or reliability.
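The three-folder state tracking described above can be sketched on a local filesystem as follows. This is an analogue under stated assumptions, not the original script: the original used `gsutil` and `bq` against Cloud Storage and also composed small blobs into larger ones, which is omitted here. `process_once` and its `load` callback (standing in for `bq load ...`) are hypothetical names.

```python
import os
import tempfile
from typing import Callable

STATES = ("pending", "processing", "done")


def process_once(root: str, load: Callable[[str], bool]) -> int:
    """Run one cron-style pass; return how many files reached done/."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)
    finished = 0
    for name in sorted(os.listdir(os.path.join(root, "pending"))):
        claimed = os.path.join(root, "processing", name)
        # A rename within one filesystem is atomic, so each file is in
        # exactly one state directory at any time.
        os.rename(os.path.join(root, "pending", name), claimed)
        try:
            ok = bool(load(claimed))
        except Exception:
            ok = False
        target = "done" if ok else "pending"  # failures go back for a retry
        os.rename(claimed, os.path.join(root, target, name))
        if ok:
            finished += 1
    return finished


# Usage: two fake log files and a loader that rejects one of them.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "pending"))
for name in ("a.json", "b.json"):
    with open(os.path.join(root, "pending", name), "w") as f:
        f.write("{}")
loaded = process_once(root, load=lambda path: path.endswith("a.json"))
```

The directory a file sits in is the only state, so a crashed run can be diagnosed by listing `processing/`.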
| cratermoon wrote:
| Command line tools can be 225x faster than a Hadoop cluster.
| https://news.ycombinator.com/item?id=17135841
| ryanianian wrote:
| Not sure if it's still in use, but for a very long time, AWS
| billing relied on getting usage data via rsync.
| sunshine-o wrote:
| Absolutely.
|
| I would add that filesystems are superior to data formats (XML,
| JSON, YAML, TOML) for many use cases such as configuration or
| just storing data.
|
| - Hierarchy are dirs,
|
| - Keys are file names,
|
| - Value is the content of the file.
|
| - Other metadata are in hidden files
|
| It will work forever, you can leverage ZFS, Git, rsync,
| syncthing much better. If you want, a fancy shell like Nushell
| will bring the experience pretty close to a database.
|
| Most important you don't need fancy editor plugins or to learn
| XPath, jq or yq.
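A small sketch of the mapping listed above, assuming values are short UTF-8 text files: directories become nested dictionaries, file names become keys, and file contents become values. `load_config` is a hypothetical helper, and skipping dot-files is one possible reading of "metadata are in hidden files".

```python
import os
import tempfile
from typing import Any, Dict


def load_config(root: str) -> Dict[str, Any]:
    """Directories become nested dicts, file names keys, contents values."""
    config: Dict[str, Any] = {}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.name.startswith("."):
            continue  # hidden files are reserved for metadata in this scheme
        if entry.is_dir():
            config[entry.name] = load_config(entry.path)
        else:
            with open(entry.path, encoding="utf-8") as f:
                config[entry.name] = f.read().strip()
    return config


# Usage: build a tiny tree and read it back.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "server"))
with open(os.path.join(root, "server", "port"), "w", encoding="utf-8") as f:
    f.write("8080\n")
with open(os.path.join(root, "name"), "w", encoding="utf-8") as f:
    f.write("demo\n")
config = load_config(root)
```

All values come back as strings in this sketch; any typing would have to be layered on top.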
| drob518 wrote:
| Yes, but a couple downsides:
|
| 1. For config, it spreads the config across a bunch of nested
| directories, making it hard to read and write it without some
| sort of special tool that shows it all to you at once. Sure,
| you can easily edit 50 files from all sorts of directories in
| your text editor, but that's pretty painful.
|
| 2. For data storage is that lots of smaller files will waste
| partial storage blocks in many file systems. Some do coalesce
| small files, but many don't.
|
| 3. For both, it's often going to be higher performance to
| read a single file from start to finish than a bunch of
| files. Most file systems will try to keep file blocks in
| mostly sequential order (defrag'd), whereas they don't
| typically do that for multiple files in different
| directories. SSD makes this mostly a non-issue these days,
| however. You still have the issue of openings, closings, and
| more read calls, however.
| sunshine-o wrote:
| > 1. For config, it spreads the config across a bunch of
| nested directories, making it hard to read and write it
| without some sort of special tool that shows it all to you
| at once. Sure, you can easily edit 50 files from all sorts
| of directories in your text editor, but that's pretty
| painful.
|
| It really depends how comfortable you are using the shell
| and which one you use.
|
| cat, tree, sed, grep, etc will get you quite far and one
| might argue that it is simpler to master than vim and
| various formats. Actually mastering VSCode also takes a lot
| of effort.
|
| > 2. For data storage is that lots of smaller files will
| waste partial storage blocks in many file systems. Some do
| coalesce small files, but many don't.
|
| > 3. For both, it's often going to be higher performance to
| read a single file from start to finish than a bunch of
| files. Most file systems will try to keep file blocks in
| mostly sequential order (defrag'd), whereas they don't
| typically do that for multiple files in different
| directories. SSD makes this mostly a non-issue these days,
| however. You still have the issue of openings, closings,
| and more read calls, however.
|
| Agreed but for most use cases here it really doesn't matter
| and if I need to optimise storage I will need a database
| anyway.
|
| And I sincerely believe that most micro optimisations at
| the filesystem level are cancelled by running most editors
| with data format support enabled....
| cryptonector wrote:
| Except that when you do need a tool like XSLT/XPath, jq,
| or yq, now you need bash. I use bash lots, but still I'd
| rather use a better language, like the ones you listed.
|
| I'm being slightly hypocritical because I've made plenty of
| use of the filesystem as a configuration store. In code it's
| quite easy to stat one path relative to a directory, or open
| it and read it, so it's very tempting.
| bullen wrote:
| I did something similar, but I use these EXT4 requirements:
|
| - hard links (only tar works for backup)
| - small file size (or inodes run out before disk space)
|
| http://root.rupy.se
|
| It's very useful for global distributed real-time data that don't
| need the P in CAP for writes.
|
| (no new data can be created if one node is offline = you can
| login, but not register)
| jlhawn wrote:
| If I understand correctly, this is attaching metadata to files in
| a format that LLMs (or any tool that can understand the semantic
| embedding vector) can leverage to understand what a file is
| without having to actually read the contents of the file.
|
| That obviously has a lot of interesting use cases, but my first
| assumption was that this could be used to quickly/easily search
| your filesystem with some prompt like "Play the video from last
| month where we went camping and saw a flock of turkeys". But that
| would require having an actual vector DB running on your system
| which you could use to quickly look up files using an embedding
| of your query, no?
| lstodd wrote:
| so, like magic(5)?
| mywittyname wrote:
| What is magic(5) and how is it similar to what was described?
| danudey wrote:
| magic(5) is a system for determining the type of a file by
| examining the 'magic bytes' at or near the start of a file.
|
| For example, POSIX tar files have a defined file format
| that starts with a header struct:
| https://www.gnu.org/software/tar/manual/html_node/Standard.h...
|
| You can see that at byte offset 257 is `char magic[6]`,
| which contains `TMAGIC`, which is the byte string
| "ustar\0". Thus, if a file has the bytes 'ustar\0' at
| offset 257 we can reasonably assume that it's a tar file.
| Almost every defined file type has some kind of string of
| 'magic' predefined bytes at a predefined location that lets
| a program know "yes, this is in fact a JPEG file" rather
| than just asserting "it says .jpg so let's try to interpret
| this bytestring and see what happens".
|
| As for how it's similar: I don't think it actually is, I
| think that's a misunderstanding. The metadata that this
| vector FS is storing is more than "this is a JPEG" or
| "this is a word document", as I understand it, so comparing
| it to magic(5) is extremely reductionist. I could be
| mistaken, however.
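The tar check described above can be sketched in a few lines: read the bytes at offset 257 and compare them with "ustar\0". The helper name `looks_like_tar` is hypothetical, and the archive is built in memory with Python's `tarfile` module in `USTAR_FORMAT` so that the example is self-contained.

```python
import io
import tarfile

USTAR_OFFSET = 257         # offset of `char magic[6]` in the tar header
USTAR_MAGIC = b"ustar\x00"  # TMAGIC, including the trailing NUL


def looks_like_tar(data: bytes) -> bool:
    """True if the POSIX 'ustar\\0' magic appears at byte offset 257."""
    return data[USTAR_OFFSET:USTAR_OFFSET + len(USTAR_MAGIC)] == USTAR_MAGIC


# Usage: build a small archive in memory and inspect its raw bytes.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    payload = b"hello"
    info = tarfile.TarInfo(name="hello.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()
is_tar = looks_like_tar(archive)
```

This only identifies the container format; it says nothing about what the archived files depict, which is the gap the comment points out.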
| yjftsjthsd-h wrote:
| https://manpages.org/magic/5 is a database of file types,
| used by the file(1) command. I don't exactly follow how
| it's the same though; it would let you say "what files are
| videos" but not "what files are videos of a cat". Which is
| sort of related but unless I missed something there is a
| difference.
| simcop2387 wrote:
| I think they're referring to this,
| https://linux.die.net/man/5/magic given the notation. That
| said I don't really see how it'd be all that relevant to
| the discussion so maybe i'm missing something else.
| 0x457 wrote:
| magic(5) means `man 5 magic`:
| https://linux.die.net/man/5/magic
|
| It's just a tool that can read "magic bytes" to figure out
| what files contain. Very different from what VectorVFS is.
| lstodd wrote:
| four people answered strictly correctly as to what magic(5)
| is, but not a single one realized that storing some aux
| data as xattr in linux FS is not in any way different from
| just storing the exact same data as a file header. which is
| how magic(5) works.
|
| how come?
|
| (besides good luck not forgetting to rsync those xattrs)
| perone wrote:
| Hi, it is quite different, there is no LLM involved, we can
| certainly use it for a RAG for example, but what is currently
| implemented is basically a way to generate embeddings (vector
| representation) which are then used for search later, it is all
| offline and local (no data is ever sent to cloud from your
| files).
| jlhawn wrote:
| I understand that LLMs aren't involved in generating the
| embeddings and adding the xattrs. I was just wondering what
| the value add of this is if there's no other background
| process (like mds on macOS) which is using it to build a
| search index.
|
| I guess what I'm asking is: how does VectorVFS enable search
| besides iterating through all files and iteratively comparing
| file embeddings with the embedding of a search query? The
| project description says "efficient and semantically
| searchable" and "eliminating the need for external index
| files or services" but I can't think of any more efficient
| way to do a search without literally walking the entire
| filesystem tree to look for the file with the most similar
| vector.
|
| Edit: reading the docs [1] confirmed this. The `vfs search
| TERM DIRECTORY` command:
|
| > will automatically iterate over all files in the folder,
| look for supported files and then embed the file or load
| existing embeddings directly from the filesystem.
|
| [1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
| pilooch wrote:
| Using it for a RAG is smart indeed, especially with a
| multimodal encoder (vision-rag), as the implementation would
| be straightforward from what you already have.
| lstodd wrote:
| if you go look up how xattrs work, you will understand it's no
| different than just reading a chunk of the file in question,
| and in fact can be slower.
|
| xattrs are better forgotten already. they were just as dumb an
| idea as macos resource forks.
| colordrops wrote:
| Rt
| pseudosavant wrote:
| This immediately made me nostalgic for BeOS's BeFS or Windows
| Longhorn's WinFS database filesystems, and how this kind of thing
| would have fit them perfectly. So much cool stuff you could do with
| vectors for everything. Smart folders that include files for a
| project based on a description of the project. Show me all of my
| config files for appXYZ. Images of a black dog at the beach. At
| the OS-level for any other app to easily tap into.
|
| I'd be surprised if cloud storage services like OneDrive don't
| already do some kind of vector for every file you store. But an
| online web service isn't the same as being built into the core of
| the OS.
| perone wrote:
| I share the same feeling, I think filesystems will have to
| reinvent themselves given the pace of how useful ML models
| became in the past years.
| didgetmaster wrote:
| I built a local object store that was designed to replace
| file systems. You can create hundreds of millions of objects
| (e.g. files) and attach a variety of metadata tags to each
| one. A tag could be a number, string, or other data type
| (including vector info). Searching for objects with certain
| tags is exceptionally fast.
|
| I invented it because I found searching conventional file
| systems that support extended attributes to be unbearably
| slow.
| tugdual wrote:
| Got a demo ?
| didgetmaster wrote:
| Tons of demo videos on my YouTube channel. Free beta
| available for download on my website. Links in my
| profile.
| p_ing wrote:
| WinFS wasn't a file system laid down on hardware, it was just a
| SQL database that stored arbitrary data.
| didgetmaster wrote:
| I think that is one of the main reasons it failed to launch.
| It was just too easy for the metadata stored in the separate
| database to become out of sync with the actual file data.
|
| Microsoft saw the tech support nightmare this could generate,
| and abandoned the project.
| pseudosavant wrote:
| They just weren't able to pull it off for whatever reason.
| I actually ran BeOS as my daily driver for quite a while
| (way) back in the day. BeFS was genuinely amazing, and not
| something I've seen replicated elsewhere yet. There hasn't
| really been anything interesting done in filesystems used
| by users on devices in a really long time.
| p_ing wrote:
| It was abandoned due to The Cloud. There was no need for
| WinFS as a tech when you could store everything in The
| Cloud.
|
| It was also complex, ran poorly, and would have required
| developers to integrate their applications.
|
| Microsoft had long solved the problem of blobs and metadata
| in ESE and SharePoint's use of MS SQL for binary + metadata
| storage.
| WalterGR wrote:
| > it was just a SQL database that stored arbitrary data.
|
| I mean, for some definitions of "just", "SQL database", and
| "arbitrary data." :) It was a schematised graph database
| implemented on top of a slimmed-down version of SQL
| Server. The query language was not SQL-based.
|
| > It was abandoned due to The Cloud.
|
| It was discontinued circa 2007. The cloud was much less
| of a Thing back then. I don't recall that factoring into
| the decision to cancel the project at all, though it
| would have been prescient.
|
| (Disclaimer: I was on the WinFS team at Microsoft.)
| asadawadia wrote:
| is the embedding for the whole file? or each 1024/512 byte chunk?
| javier2 wrote:
| i looked into something similar a few years ago, where i stored
| embeddings in xattrs
| quantadev wrote:
| I've been wondering for about 20 years why File Systems basically
| died and stopped innovating. For example we have lots of
| hierarchical data structures in the world, and no one seems to
| have figured out how to let a folder be the storage, instead of
| always just databases.
|
| For example, if we simply had the ability to have "ordered" files
| inside folders, that would instantly make it practical for a
| folder structure to represent "Documents". After all, documents
| are nothing but a list of paragraphs and images, so if we simply
| had ordering in file systems we could have document editors which
| are using individual files for each paragraph of text or image.
| It would be amazing.
|
| Also think about use cases like Jupyter Notebooks. We could stop
| using the JSON file format, and just make it a folder structure
| instead. Each cell (node) being in a file. All social media
| messages and chatbot conversations could be easily saved as
| folder structures.
|
| I've heard many file copy tools ignore XATTR so I've never tried
| to use it for this purpose, so maybe we've had the capability all
| along and just nobody thought to use it in a big way that became
| popular yet. Maybe I should consider XATTR and take it seriously.
| thirdtrigger wrote:
| Might be interesting to add an optional embedded Weaviate [1]
| with a flat-index [2] to the project. It wouldn't use external
| services and is fully disk-based. Would allow you to search the
| whole filesystem (about 1.5kb per file (384 dimensions) which
| would be added to the metadata as well).
|
| 1.
| https://weaviate.io/developers/weaviate/installation/embedde...
| 2. https://weaviate.io/developers/academy/py/vector_index/flat
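The "about 1.5kb per file" figure in the comment above can be checked with a short sketch. It assumes each of the 384 dimensions is stored as a 32-bit float, which the comment does not state; other encodings (float16, quantized bytes) would give a different size.

```python
import struct

DIMENSIONS = 384  # figure taken from the comment above

# Assumption: one little-endian 32-bit float per dimension.
embedding = [0.0] * DIMENSIONS
packed = struct.pack(f"<{DIMENSIONS}f", *embedding)

size_bytes = len(packed)      # 384 * 4 bytes
size_kib = size_bytes / 1024  # size in KiB
```

Under that assumption the raw vector alone is 1536 bytes, so any per-file metadata would come on top of it.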
| binarymax wrote:
| Why weaviate and not FAISS? The latter is faster and lighter.
| gitroom wrote:
| Gotta say, the old school debate on filesystems vs databases will
| never get old for me - I always end up with more questions than
| answers after reading stuff like this.
| j45 wrote:
| Everything's old school, everything's new.
|
| It's important to remember that the cloud is also invented by
| the old school and understanding the oscillation between
| client/server architectures vs local, and its implications on
| topics of data and files is interesting too.
|
| More questions means more learning until I learned there's no
| one right or wrong, just what works best, where, when, for how
| long, and what the tradeoffs are.
|
| Quick wins/decisions are often bandaids that pile up in a
| different way.
___________________________________________________________________
(page generated 2025-05-05 23:00 UTC)