[HN Gopher] An intro to DeepSeek's distributed file system
___________________________________________________________________
An intro to DeepSeek's distributed file system
Author : sebg
Score : 436 points
Date : 2025-04-17 12:50 UTC (10 hours ago)
(HTM) web link (maknee.github.io)
(TXT) w3m dump (maknee.github.io)
| vFunct wrote:
| Can we replicate this with ZFS drives distributed across multiple
| machines?
| eatonphil wrote:
| As far as I'm aware ZFS does not scale out.
|
| https://unix.stackexchange.com/a/99218
| db48x wrote:
| Yea, it wasn't designed to scale that way.
|
| In principle you could use Fibre Channel to connect a really
| large number (2^24, iirc) of disks to a single server and then
| create a single ZFS pool using all of them. This lets you
| scale the _storage_ as high as you want.
|
| But that still limits you to however many requests per second
| your single server can handle. You can scale that pretty high
| too, but probably not by a factor of 2^24.
| jack_pp wrote:
| I don't have direct experience with distributed file systems but
| it so happens I did a tiny bit of research in the past month
| and... there are quite a few open-source ones available. It
| would've been nice for the authors to explain why the existing
| solutions didn't work for them.
| dboreham wrote:
| They have an HFT background, so it was probably developed long
| ago for that workload (which tends to be outside the design
| envelope for off-the-shelf solutions).
| londons_explore wrote:
| This seems like a pretty complex setup with lots of features
| which aren't obviously important for a deep learning workload.
|
| Presumably the key necessary features are PBs' worth of
| storage, read/write parallelism (which can be achieved by
| splitting a 1PB file into, say, 10,000 100GB shards and having
| each client read only the shards it needs), and redundancy.
|
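| As a minimal sketch of that sharding idea (hypothetical shard
| naming, plain local files rather than any particular DFS), each
| client computes which fixed-size shards cover the byte range it
| needs and opens only those:
|
|     import os
|
|     SHARD_SIZE = 100 * 10**9  # 100 GB per shard, as in the example above
|
|     def shards_for_range(offset, length, shard_size=SHARD_SIZE):
|         """Return the shard indices covering [offset, offset + length)."""
|         first = offset // shard_size
|         last = (offset + length - 1) // shard_size
|         return range(first, last + 1)
|
|     def read_range(shard_dir, offset, length, shard_size=SHARD_SIZE):
|         """Read a byte range by touching only the shards that contain it."""
|         chunks = []
|         for idx in shards_for_range(offset, length, shard_size):
|             path = os.path.join(shard_dir, f"data.shard{idx:05d}")
|             start = max(offset, idx * shard_size) - idx * shard_size
|             end = min(offset + length, (idx + 1) * shard_size) - idx * shard_size
|             with open(path, "rb") as f:
|                 f.seek(start)
|                 chunks.append(f.read(end - start))
|         return b"".join(chunks)
|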
| Consistency is hard to achieve and seems to have no use here -
| your programmers can manage to make sure different processes are
| writing to different filenames.
| sungam wrote:
| I wonder whether it may have been originally developed for the
| quantitative hedge fund.
| huntaub wrote:
| Yes, I think this is probably true. I've worked with a lot of
| different hedge funds who have a similar problem -- lots of
| shared data that they need in a file system so that they can
| do backtesting of strategies with things like kdb+.
| Generally, these folks are using NFS, which is kind of a pain
| -- especially for scalability -- so building your own for
| that specific use case (which happens to have a usage pattern
| similar to AI training) makes a lot of sense.
| threeseed wrote:
| > Consistency is hard to achieve and seems to have no use here
|
| Famous last words.
|
| It is very common when operating data platforms at this scale
| to lose a lot of nodes over time, especially in the cloud. So
| having a robust consistency/replication mechanism is vital to
| making sure your training job doesn't need to be restarted
| just because a block it needs isn't on a particular node.
| londons_explore wrote:
| Indeed, redundancy is fairly important (although for the
| largest part, the training data, it actually doesn't matter if
| chunks are missing).
|
| But the type of consistency they were talking about is strong
| ordering - the type of thing you might want on a database
| with lots of people reading and writing tiny bits of data,
| potentially the _same_ bits of data, where you need to make
| sure a user's writes are rejected if they are impossible to
| fulfil and reads never return an impossible intermediate
| state. That isn't needed for machine learning.
| ted_dunning wrote:
| Sadly, these are often Famous First words.
|
| What follows is a long period of saying "see, distributed
| systems are easy for genius developers like me"
|
| The last words are typically "oh shit", shortly followed
| oxymoronically by "bye! gotta go"
| jamesblonde wrote:
| Architecturally, it is a scale-out metadata filesystem [ref].
| Other related distributed file systems are Collosus, Tectonic
| (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think
| PolarFS (Alibaba). They all use different distributed row-
| oriented DBs for storing metadata. 3FS uses FoundationDB,
| Collosus uses BigTable, Tectonic some KV store, ADLSv2 (not
| sure), HopsFS uses RonDB.
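|
| As a toy sketch of the "metadata in a distributed DB" idea
| (not any of these systems' actual schemas), directory entries
| and inodes become rows in a transactional KV store; a plain
| dict stands in for FoundationDB/BigTable/RonDB here:
|
|     import json
|
|     kv = {}  # stand-in for the distributed, transactional KV store
|
|     def dentry_key(parent_inode, name):
|         return f"dentry/{parent_inode}/{name}"
|
|     def inode_key(inode):
|         return f"inode/{inode}"
|
|     def create(parent_inode, name, inode, size=0):
|         # In a real system both writes happen in one transaction.
|         kv[dentry_key(parent_inode, name)] = inode
|         kv[inode_key(inode)] = json.dumps({"size": size, "chunks": []})
|
|     def lookup(parent_inode, name):
|         inode = kv[dentry_key(parent_inode, name)]
|         return inode, json.loads(kv[inode_key(inode)])
|
|     create(parent_inode=1, name="train.bin", inode=42, size=10**12)
|     print(lookup(1, "train.bin"))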
|
| What's important here with 3FS is that it supports (1) a FUSE
| client - it just makes life so much easier - and (2) NVMe
| storage - so that training pipelines aren't disk I/O bound
| (you can't always split files small enough, or parallelize
| reading/writing enough, against an S3 object store).
|
| Disclaimer: I worked on HopsFS. HopsFS adds tiered storage - NVMe
| for recent data and S3 for archival.
|
| [ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-
| bree...
| nickfixit wrote:
| I've been using JuiceFS since the start for my AI stacks.
| It's similar, and I used PostgreSQL for the metadata.
| jamesblonde wrote:
| JuiceFS is very good. I didn't have it down as a scale-out
| metadata FS - it supports lots of DBs (both single-host and
| distributed).
| threeseed wrote:
| Tiered storage and FUSE have existed with Alluxio for years.
|
| And NVMe optimisations, e.g. NVMe-oF in OpenEBS (Mayastor).
|
| None of it is particularly groundbreaking, just a lot of
| pieces brought together.
| jamesblonde wrote:
| The difference is scale-out metadata in the filesystem.
| Alluxio uses Raft, I believe, for metadata - so the metadata
| has to fit on a single server.
| rfoo wrote:
| 3FS isn't particularly fast in mdbench, though. Maybe our
| FDB tuning skills are to blame, or FUSE, I don't know, but
| it doesn't really matter.
|
| The truly amazing part for me is combining NVMe SSDs + RDMA
| + support for efficiently reading a huge batch of random
| offsets from a few already-opened huge files. This is how you
| get your training boxes consuming 20~30 GiB/s (and roughly 4
| million IOPS).
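|
| A rough sketch of that access pattern (plain POSIX pread in
| threads; the real thing uses RDMA and 3FS's native batched-read
| interface rather than this): issue many small random reads
| against a handful of already-open file descriptors.
|
|     import os, random
|     from concurrent.futures import ThreadPoolExecutor
|
|     READ_SIZE = 4096  # 4 KiB random reads
|
|     def random_read(fd, file_size):
|         offset = random.randrange(0, file_size - READ_SIZE)
|         offset -= offset % READ_SIZE            # align to 4 KiB
|         return os.pread(fd, READ_SIZE, offset)  # positional read, no shared seek
|
|     def run(paths, total_reads=100_000, workers=64):
|         fds = [os.open(p, os.O_RDONLY) for p in paths]  # open once, reuse
|         sizes = [os.fstat(fd).st_size for fd in fds]
|         with ThreadPoolExecutor(max_workers=workers) as pool:
|             futs = [pool.submit(random_read, fds[i % len(fds)],
|                                 sizes[i % len(fds)])
|                     for i in range(total_reads)]
|             return sum(len(f.result()) for f in futs)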
| rjzzleep wrote:
| FUSE has traditionally been famously slow. I remember
| there were some changes that supposedly made it faster,
| but maybe that was just a certain FUSE implementation.
| jamesblonde wrote:
| The block size is 4KB by default, which is a killer. We
| set it to 1MB or so by default - makes a huge difference.
|
| https://github.com/logicalclocks/hopsfs-go-mount
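|
| A quick way to see the effect on any FUSE mount (hypothetical
| path; with a tiny transfer size every read becomes another
| round trip through the FUSE daemon):
|
|     import time
|
|     def throughput(path, block_size):
|         start, total = time.perf_counter(), 0
|         with open(path, "rb", buffering=0) as f:  # unbuffered: block_size is the real read size
|             while chunk := f.read(block_size):
|                 total += len(chunk)
|         return total / (time.perf_counter() - start) / 2**20  # MiB/s
|
|     for bs in (4 * 2**10, 2**20):  # 4 KiB vs 1 MiB
|         print(bs, f"{throughput('/mnt/fuse/somefile', bs):.1f} MiB/s")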
| objectivefs wrote:
| There is also ObjectiveFS, which supports FUSE and uses S3 for
| both data and metadata storage, so there is no need to run any
| metadata nodes. Using S3 instead of a separate database also
| allows scaling both data and metadata with the performance of
| the S3 object store.
| joatmon-snoo wrote:
| nit: Colossus* for Google.
| MertsA wrote:
| >Tectonic some KV store,
|
| Tectonic is built on ZippyDB, which is a distributed DB built
| on RocksDB.
|
| >What's important here with 3FS is that it supports (1) a FUSE
| client - it just makes life so much easier
|
| Tectonic also has a FUSE client built for GenAI workloads on
| clusters backed by 100% NVMe storage.
|
| https://engineering.fb.com/2024/03/12/data-center-engineerin...
|
| Personally what stands out to me for 3FS isn't just that it
| has a FUSE client, but that they made it more of a hybrid of a
| FUSE client and a native IO path. You open the file just like
| normal, but once you have an fd you use their native library
| to do the actual IO. You still need to adapt your AI training
| code to use 3FS natively if you want to avoid the FUSE
| overhead, but now the FUSE client handles all the metadata
| operations that a native client would otherwise have needed to
| implement.
|
| https://github.com/deepseek-ai/3FS/blob/ee9a5cee0a85c64f4797...
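|
| A hypothetical sketch of that split (the real 3FS user-space
| IO API in the linked header differs; native_client and its
| functions are placeholders): metadata operations go through
| the FUSE mount as ordinary POSIX calls, and bulk IO is handed
| off to a native library that takes over the fd.
|
|     import os
|
|     import native_client  # placeholder for 3FS's user-space IO library
|
|     # 1. Path lookup, permissions, open(): plain POSIX via the FUSE mount.
|     fd = os.open("/mnt/3fs/dataset/train.bin", os.O_RDONLY)
|
|     # 2. Register the fd with the native library, which then does the
|     #    actual reads itself (RDMA to the storage nodes), bypassing the
|     #    FUSE data path and its extra copies/context switches.
|     handle = native_client.register_fd(fd)
|     buf = native_client.pread(handle, length=1 << 20, offset=0)
|
|     # 3. Tear down through the normal POSIX path again.
|     native_client.deregister_fd(handle)
|     os.close(fd)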
| randomtoast wrote:
| Why not use CephFS instead? It has been thoroughly tested in
| real-world scenarios and has demonstrated reliability even at
| petabyte scale. As an open-source solution, it can run on the
| fastest NVMe storage, achieving very high IOPS with 10 Gigabit or
| faster interconnect.
|
| I think their "Other distributed filesystem" section does not
| answer this question.
| tempest_ wrote:
| We have a couple ceph clusters.
|
| If my systems guys are telling me the truth, it is a real time
| sink to run and can require an awful lot of babysitting at
| times.
| huntaub wrote:
| IMO this is the problem with _all_ storage clusters that you
| run yourself, not just Ceph. Ultimately, keeping data alive
| through instance failures is just a lot of maintenance that
| needs to happen (even with automation).
| _joel wrote:
| I admin'd a cluster about 10 years back and it was 'ok' then,
| around the time of BlueStore. One issue was definitely my
| mistake, but it wasn't all that bad.
| elashri wrote:
| CERN uses CephFS with ~50PB across different applications, and
| they are happy with it.
| dfc wrote:
| I thought they used Ceph too. But I started looking around
| and it seems like they have switched to CernVM-FS and an
| in-house solution. I'm not sure what changed.
| elashri wrote:
| They didn't switch; they use both for different needs. EOS
| (CVMFS) is used mainly for physics data storage and user
| data. Ceph is used for many other things like
| infrastructure, self-hosted apps, etc.
| charleshn wrote:
| Because it's actually fairly slow.
|
| Among other things, the OSD was not designed with NVMe drives
| in mind - which is fair, given how old it is - so it's nowhere
| close to being able to handle modern NVMe IO throughput and
| IOPS.
|
| For that you need zero-copy, RDMA etc.
|
| Note that there is a next-generation OSD project called
| Crimson [0]; however, it's been a while and I'm not sure how
| well it's going. It's based on the awesome Seastar framework
| [1], which backs ScyllaDB.
|
| Achieving such performance would also require many changes to
| the client (RDMA, etc).
|
| Something like Weka [2] has a much better design for this kind
| of performance.
|
| [0] https://ceph.io/en/news/crimson/
|
| [1] https://seastar.io/
|
| [2] https://www.weka.io/
| __turbobrew__ wrote:
| With the latest Ceph releases I am able to saturate modern
| NVMe devices with 2 OSDs per NVMe. It is kind of a hack to
| have multiple OSDs per NVMe, but it works.
|
| I do agree that NVMe-oF is the next hurdle for Ceph
| performance.
| skrtskrt wrote:
| DigitalOcean uses Ceph underneath their S3 and block volume
| products. When I was there they had 2 teams just managing Ceph,
| not even any of the control plane stuff built on top.
|
| It is a complete bear to manage and tune at scale. And DO never
| greenlit offering anything based on CephFS either because it
| was going to be a whole other host of things to manage.
|
| Then of course you have to fight with the maintainers (Red Hat
| devs) to get any improvements contributed, assuming you even
| have team members with the requisite C++ expertise.
| Andys wrote:
| Ceph is massively over-complicated; if I had two teams I'd
| probably try to write one from scratch instead.
| skrtskrt wrote:
| Most of the legitimate datacenter-scale direct Ceph
| alternatives are unfortunately proprietary, in part because
| it takes so much money and so many human-expertise-hours to
| even prove out that scale; the vendors want to recoup costs
| and stay ahead.
|
| MinIO is absolutely not datacenter-scale and I would not
| expect anything in Go to really reach that point. Garbage
| collection is a rough thing at such enormous scale.
|
| I bet we'll get one in Rust eventually. Maybe from Oxide
| computer company? Though despite doing so much OSS, they
| seem to be focused on their specific server rack OS, not
| general-purpose solutions.
| steveklabnik wrote:
| > I bet we'll get one in Rust eventually. Maybe from
| Oxide computer company?
|
| Crucible is our storage service:
| https://github.com/oxidecomputer/crucible
|
| RFD 60, linked in the README, contains a bit of info
| about Ceph, which we did evaluate:
| https://rfd.shared.oxide.computer/rfd/0060
| huntaub wrote:
| I think that the author is spot on: there are a few
| dimensions along which you should evaluate these systems:
| theoretical limits, efficiency, and practical limits.
|
| From a theoretical point of view, like others have pointed out,
| parallel distributed file systems have existed for years -- most
| notably Lustre. These file systems should be capable of scaling
| out their storage and throughput to, effectively, infinity -- if
| you add enough nodes.
|
| Then you start to ask how much storage and throughput you can
| get from a node with X TiB of disk -- that is, you start
| evaluating efficiency. I ran some calculations (against FSx
| for Lustre, since I'm an AWS guy), and it appears that you can
| run 3FS in AWS for about 12-30% less than FSxL, depending on
| the replication factor you choose (which is good, but not
| great considering that you're now managing the cluster
| yourself).
|
| Then the third thing you ask is, anecdotally, whether people
| are actually able to configure these file systems at the
| deployment size you want (this is where you hear things like
| "oh, it's hard to get Ceph to 1 TiB/s") -- and that remains to
| be seen for something like 3FS.
|
| Ultimately, I obviously believe that storage and data are really
| important keys to how these AI companies operate -- so it makes
| sense that DeepSeek would build something like this in-house to
| get the properties that they're looking for. My hope is that we,
| at Archil, can find a better set of defaults that work for most
| people without needing to manage a giant cluster or even worry
| about how things are replicated.
| jamesblonde wrote:
| Maybe AWS could start by making fast NVMes available - without
| requiring multi-TB disks just to get 1 GB/s. The 3FS
| experiments were run on 14 GB/s NVMe disks - an order of
| magnitude more throughput than anything available in AWS
| today.
|
| SSDs Have Become Ridiculously Fast, Except in the Cloud:
| https://news.ycombinator.com/item?id=39443679
| kridsdale1 wrote:
| On my home LAN connected with 10gbps fiber between MacBook
| Pro and server, 10 feet away, I get about 1.5gbps vs the non-
| network speed of the disks of ~50 gbps. (Bits, not bytes)
|
| I traced this to the macOS SMB implementation really sucking.
| I set up an NFS driver and it got about twice as fast, but
| it's annoying to mount and use, and still far from the disks'
| capabilities.
|
| I've mostly resorted to abandoning the network (after large
| expense) and using Thunderbolt and physical transport of the
| drives.
| greenavocado wrote:
| Is NFS out of the question?
| kridsdale1 wrote:
| I have set it up but it's not easy to get drivers working
| on a Mac.
| dundarious wrote:
| SMB/CIFS is an incredibly chatty, synchronous protocol.
| There are/were massive products built around mitigating and
| working around this when trying to use it over high latency
| satellite links (US military did/does this).
| __turbobrew__ wrote:
| There are i4i instances in AWS which can get you a lot of
| IOPS with a smaller disk.
| stapedium wrote:
| I'm just a small business & homelab guy, so I'll probably never
| use one of these big distributed file systems. But when people
| start talking petabytes, I always wonder whether these things
| are actually backed up, and what you use for backup and
| recovery.
| huntaub wrote:
| Well, for active data, the idea is that the replication within
| the system is enough to keep the data alive from instance
| failure (assuming that you're doing the proper maintenance and
| repairing hosts pretty quickly after failure). Backup and
| recovery, in that case, is used more for saving yourself
| against fat-fingering an "rm -rf /" type command. Since it's
| just a file system, you should be able to use any backup and
| recovery solution that works with regular files.
| shermantanktop wrote:
| Backup and recovery is a process with a non-zero failure rate.
| The more you test it, the lower the rate, but there is always a
| failure mode.
|
| With these systems, the runtime guarantees of data integrity
| are very high and the failure rate is very low. And best of
| all, failure is constantly happening as a normal activity in
| the system.
|
| So once you have data integrity guarantees that are better in
| your runtime system than in your backup process, why back up?
|
| There are still reasons, but they become more specific to the
| data being stored and less important as a general datastore
| feature.
| Eikon wrote:
| > why backup?
|
| Because of mistakes and malicious actors...
| overfeed wrote:
| ...and the "Disaster" in "Disaster recovery" may have been
| localized and extensive (fire, flooding, major earthquake,
| brownouts due to a faulty transformer, building collapse, a
| solvent tanker driving through the wall into the server
| room, a massive sinkhole, etc)
| shermantanktop wrote:
| Yes, the dreaded fiber vs. backhoe. But if your
| distributed file system is geographically redundant,
| you're not exposed to that, at least from an integrity
| POV. It sucks that 1/3 or 1/5 or whatever of your serving
| fleet just disappeared, but backup won't help with that.
| overfeed wrote:
| > But if your distributed file system is geographically
| redundant
|
| Redundancy and backups are not the same thing! There's
| some overlap, but treating them as interchangeable will
| occasionally result in terrible outcomes, like when a
| config change makes all 5/5 datacenters fragment and fail
| to form a quorum, and you then discover your services have
| circular dependencies while trying to bootstrap foundational
| services. Local backups would solve this (each DC would load
| the last known good config), but rebuilding the consensus
| necessary for redundancy requires coordination with
| now-unreachable hosts.
| ted_dunning wrote:
| It is common for the backup of these systems to be a secondary
| data center.
|
| Remember that there are two purposes for backup. One is
| hardware failures, the second is fat fingers. Hardware failures
| are dealt with by redundancy which always involves keeping
| redundant information across multiple failure domains. Those
| domains can be as small as a cache line or as big as a data
| center. These failures can be dealt with transparently and
| automagically in modern file systems.
|
| With fat fingers, the failure domain has no natural boundaries
| other than time. As such, snapshots kept in the file system
| are the best choice, especially if you have a copy-on-write
| filesystem that can keep snapshots with very little overhead.
|
| There is also the special case of adversarial fat fingering
| which appears in ransomware. The answer is snapshots, but the
| core problem is timely detection since otherwise you may not
| have a single point in time to recover from.
| mertleee wrote:
| What are the odds 3FS is backdoored?
| huntaub wrote:
| I think that's a pretty odd concern to have. What would you
| imagine that looks like? If you're running these kinds of
| things securely, you should be locking down the network access
| to the hosts (they don't need outbound internet access, and
| they shouldn't need inbound access from anything except your
| application).
| xpe wrote:
| > I think that's a pretty odd concern to have.
|
| Thinking about security risks is an odd concern to have?
| huntaub wrote:
| I think that worrying that a self-hosted file system has a
| backdoor to exfiltrate data is an odd concern. Security
| concerns are (obviously) normal, but you should not be
| exposing these kinds of services to the public internet (or
| giving them access to the public internet), eliminating the
| concern that it's giving your data away.
| xpe wrote:
| > I think that worrying that a self-hosted file system
| has a backdoor to exfiltrate data is an odd concern.
|
| Great security teams get paid to "worry", by which I mean
| "make a plan" for a given attack tree.
|
| Your use of "odd" looks like a rhetorical technique to
| downplay legitimate security considerations. From my
| point of view, you are casually waving away an essential
| area of security. See:
|
| https://www.paloaltonetworks.com/cyberpedia/data-
| exfiltratio...
|
| > but you should not be exposing these kinds of services
| to the public internet (or giving them access to the
| public internet), eliminating the concern that it's
| giving your data away.
|
| Let's evaluate the following claim: "a 'properly'
| isolated system would, in theory, have no risk of data
| exfiltration." Now we have to define what we mean.
| Isolated from the network? Running in a sandbox? What
| happens when that system can interact with a compromised
| system? What happens when people mess up?
|
| From a security POV, any software could have some
| component that is malicious, compromised, or backdoored.
| It might be in the project itself or in the supply chain.
| And so on.
|
| Defense in depth matters. None of the above concerns are
| "odd". A good security team is going to factor these in.
|
| P.S. If you mean "low probability" then just say that.
| MaxPock wrote:
| By the NSA or Britain's GCHQ, which want all software backdoored?
| robinhoodexe wrote:
| I'm interested in how it compares to SeaweedFS [1], which we
| use for storing weather data (about 3 PB) for ML training.
|
| [1] https://github.com/seaweedfs/seaweedfs
| huntaub wrote:
| My guess is that performance is pretty comparable, but it
| looks like SeaweedFS has a lot more management features (such
| as tiered storage), which you may or may not be using.
| rfoo wrote:
| IMO they look similar at a glance, but actually serve very
| different use cases.
|
| SeaweedFS is more about amazing small object read performance
| because you effectively have no metadata to query to read an
| object. You just distribute volume id, file id (+cookie) to
| clients.
|
| 3FS is less extreme in this regard, supports an actual POSIX
| interface, and isn't particularly good at how fast you can
| open() files. On the other hand, it shards files into smaller
| (e.g. 512KiB) chunks, demands RDMA NICs, and makes reading
| randomly from large files scary fast [0]. If your dataset is
| immutable you can emulate what SeaweedFS does, but if it
| isn't, then SeaweedFS is better.
|
| [0] By scary fast I mean being able to completely saturate 12
| PCIe Gen 4 NVMe SSDs with 4K random reads on a single storage
| server, and you can scale that horizontally.
| jszymborski wrote:
| I wonder how close to something like 3FS you can get by
| mounting SeaweedFS with S3FS, which mounts using FUSE.
|
| https://github.com/s3fs-fuse/s3fs-fuse
| seethishat wrote:
| How easy is it to disable DeepSeek's distributed FS? Say for
| example a US college has been authorized to use DeepSeek for
| research, but must ensure no data leaves the local research
| cluster filesystem?
|
| Edit: I am a DeepSeek newbie BTW, so if this question makes no
| sense at all, that's why ;)
| ikeashark wrote:
| I might need more clarification, but if one is paranoid or
| dealing with information this sensitive, the DeepSeek model
| and 3FS can both be deployed locally, offline, with no
| connection to the internet.
| seethishat wrote:
| Thank you, that answers my question.
| snthpy wrote:
| Similar to the SeaweedFS question in sibling comment, how does
| this compare to JuiceFS?
|
| In particular, for my homelab setup I'm planning to run JuiceFS
| on top of S3 Garage. I know Garage only does replication,
| without any erasure coding or sharding, so it's not really
| comparable, but I don't need all that and it looked a lot
| simpler to set up to me.
| huntaub wrote:
| It's a very different architecture. 3FS is storing everything
| on SSDs, which makes it extremely expensive but also low
| latency (think ~100-300us for access). JuiceFS stores data in
| S3, which is extremely cheap but very high latency (~20-60ms
| for access). The performance _scalability_ should be pretty
| similar, if you 're able to tolerate the latency numbers. Of
| course, they both use databases for the metadata layer, so
| assuming you pick the same one -- the metadata performance
| should also be similar.
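|
| A back-of-envelope view of what that latency gap means for
| serial small reads (taking rough midpoints of the numbers
| above, 200us vs 40ms):
|
|     for name, latency_s in [("NVMe-backed (3FS-style)", 200e-6),
|                             ("S3-backed (JuiceFS-style)", 40e-3)]:
|         per_thread_iops = 1 / latency_s
|         inflight_for_100k = 100_000 / per_thread_iops  # concurrency for 100k IOPS
|         print(f"{name}: {per_thread_iops:,.0f} IOPS per thread, "
|               f"~{inflight_for_100k:,.0f} requests in flight for 100k IOPS")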
| nodesocket wrote:
| Does this get us closer to fully distributed and performant
| LLMs?
|
| Instead of spending ludicrous amounts on Nvidia GPUs, we could
| just deploy commodity clusters of AMD EPYC servers with tons
| of memory, NVMe disks, and 40G or 100G networking, which is
| vastly less expensive.
|
| Goodbye Nvidia moat!
| dang wrote:
| Related. Others?
|
| _Understanding Smallpond and 3FS_ -
| https://news.ycombinator.com/item?id=43232410 - March 2025 (47
| comments)
|
| _Smallpond - A lightweight data processing framework built on
| DuckDB and 3FS_ - https://news.ycombinator.com/item?id=43200793 -
| Feb 2025 (73 comments)
|
| _Fire-Flyer File System (3FS)_ -
| https://news.ycombinator.com/item?id=43200572 - Feb 2025 (101
| comments)
___________________________________________________________________
(page generated 2025-04-17 23:00 UTC)