[HN Gopher] An intro to DeepSeek's distributed file system
___________________________________________________________________
An intro to DeepSeek's distributed file system
Author : sebg
Score : 436 points
Date : 2025-04-17 12:50 UTC (10 hours ago)
(HTM) web link (maknee.github.io)
(TXT) w3m dump (maknee.github.io)
| vFunct wrote:
| Can we replicate this with ZFS drives distributed across multiple
| machines?
| eatonphil wrote:
| As far as I'm aware ZFS does not scale out.
|
| https://unix.stackexchange.com/a/99218
| db48x wrote:
| Yea, it wasn't designed to scale that way.
|
| In principle you could use Fibre Channel to connect a really
| large number (2^24, iirc) of disks to a single server and then
| create a single ZFS pool using all of them. This lets you
| scale the _storage_ as high as you want.
|
| But that still limits you to however many requests per second
| your single server can handle. You can scale that pretty high
| too, but probably not by a factor of 2^24.
| jack_pp wrote:
| I don't have direct experience with distributed file systems but
| it so happens I did a tiny bit of research in the past month
| and... there are quite a few open-source ones available. It
| would've been nice for the authors to explain why the existing
| solutions didn't work for them.
| dboreham wrote:
| They have an HFT background, so it was probably developed long
| ago for that workload (which tends to be outside the design
| envelope for off-the-shelf solutions).
| londons_explore wrote:
| This seems like a pretty complex setup with lots of features
| which aren't obviously important for a deep learning workload.
|
| Presumably the key necessary features are PBs' worth of
| storage, read/write parallelism (which can be achieved by
| splitting a 1PB file into, say, 10,000 100GB shards and having
| each client read only the shards it needs), and redundancy.
|
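| As a minimal sketch of that sharding idea (hypothetical shard
| naming, plain local files rather than any particular DFS), each
| client computes which fixed-size shards cover the byte range it
| needs and opens only those:
|
|     import os
|
|     SHARD_SIZE = 100 * 10**9  # 100 GB per shard, as in the example above
|
|     def shards_for_range(offset, length, shard_size=SHARD_SIZE):
|         """Return the shard indices covering [offset, offset + length)."""
|         first = offset // shard_size
|         last = (offset + length - 1) // shard_size
|         return range(first, last + 1)
|
|     def read_range(shard_dir, offset, length, shard_size=SHARD_SIZE):
|         """Read a byte range by touching only the shards that contain it."""
|         chunks = []
|         for idx in shards_for_range(offset, length, shard_size):
|             path = os.path.join(shard_dir, f"data.shard{idx:05d}")
|             start = max(offset, idx * shard_size) - idx * shard_size
|             end = min(offset + length, (idx + 1) * shard_size) - idx * shard_size
|             with open(path, "rb") as f:
|                 f.seek(start)
|                 chunks.append(f.read(end - start))
|         return b"".join(chunks)
|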
| Consistency is hard to achieve and seems to have no use here -
| your programmers can manage to make sure different processes are
| writing to different filenames.
| sungam wrote:
| I wonder whether it may have been originally developed for the
| quantitative hedge fund.
| huntaub wrote:
| Yes, I think this is probably true. I've worked with a lot of
| different hedge funds who have a similar problem -- lots of
| shared data that they need in a file system so that they can
| do backtesting of strategies with things like kdb+.
| Generally, these folks are using NFS, which is kind of a pain
| -- especially for scalability -- so building your own for
| that specific use case (which happens to have a usage pattern
| similar to AI training) makes a lot of sense.
| threeseed wrote:
| > Consistency is hard to achieve and seems to have no use here
|
| Famous last words.
|
| It is very common when operating data platforms at this scale
| to lose a lot of nodes over time, especially in the cloud. So
| having a robust consistency/replication mechanism is vital to
| making sure your training job doesn't need to be restarted
| just because a block it needs isn't on a particular node.
| londons_explore wrote:
| Indeed, redundancy is fairly important (although for the
| largest part, the training data, it actually doesn't matter if
| chunks are missing).
|
| But the type of consistency they were talking about is strong
| ordering - the type of thing you might want on a database
| with lots of people reading and writing tiny bits of data,
| potentially the _same_ bits of data, where you need to make
| sure a user's writes are rejected if they are impossible to
| fulfil and reads never return an impossible intermediate
| state. That isn't needed for machine learning.
| ted_dunning wrote:
| Sadly, these are often Famous First words.
|
| What follows is a long period of saying "see, distributed
| systems are easy for genius developers like me"
|
| The last words are typically "oh shit", shortly followed
| oxymoronically by "bye! gotta go"
| jamesblonde wrote:
| Architecturally, it is a scale-out metadata filesystem [ref].
| Other related distributed file systems are Collosus, Tectonic
| (Meta), ADLSv2 (Microsoft), HopsFS (Hopsworks), and I think
| PolarFS (Alibaba). They all use different distributed row-
| oriented DBs for storing metadata. 3FS uses FoundationDB,
| Collosus uses BigTable, Tectonic some KV store, ADLSv2 (not
| sure), HopsFS uses RonDB.
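|
| As a toy sketch of the "metadata in a distributed DB" idea
| (not any of these systems' actual schemas), directory entries
| and inodes become rows in a transactional KV store; a plain
| dict stands in for FoundationDB/BigTable/RonDB here:
|
|     import json
|
|     kv = {}  # stand-in for the distributed, transactional KV store
|
|     def dentry_key(parent_inode, name):
|         return f"dentry/{parent_inode}/{name}"
|
|     def inode_key(inode):
|         return f"inode/{inode}"
|
|     def create(parent_inode, name, inode, size=0):
|         # In a real system both writes happen in one transaction.
|         kv[dentry_key(parent_inode, name)] = inode
|         kv[inode_key(inode)] = json.dumps({"size": size, "chunks": []})
|
|     def lookup(parent_inode, name):
|         inode = kv[dentry_key(parent_inode, name)]
|         return inode, json.loads(kv[inode_key(inode)])
|
|     create(parent_inode=1, name="train.bin", inode=42, size=10**12)
|     print(lookup(1, "train.bin"))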
|
| What's important here with 3FS is that it supports (1) a FUSE
| client - it just makes life so much easier - and (2) NVMe
| storage - so that training pipelines aren't disk I/O bound
| (you can't always split files small enough, or parallelize
| reading/writing enough, against an S3 object store).
|
| Disclaimer: I worked on HopsFS. HopsFS adds tiered storage - NVMe
| for recent data and S3 for archival.
|
| [ref]: https://www.hopsworks.ai/post/scalable-metadata-the-new-
| bree...
| nickfixit wrote:
| I've been using JuiceFS since the start for my AI stacks.
| It's similar, and I used PostgreSQL for the metadata.
| jamesblonde wrote:
| JuiceFS is very good. I didn't have it down as a scale-out
| metadata FS - it supports lots of DBs (both single-host and
| distributed).
| threeseed wrote:
| Tiered storage and FUSE have existed with Alluxio for years.
|
| And NVMe optimisations, e.g. NVMe-oF in OpenEBS (Mayastor).
|
| None of it is particularly groundbreaking, just a lot of
| pieces brought together.
| jamesblonde wrote:
| The difference is scale-out metadata in the filesystem.
| Alluxio uses Raft, I believe, for metadata - so the metadata
| has to fit on a single server.
| rfoo wrote:
| 3FS isn't particularly fast in mdbench, though. Maybe our
| FDB tuning skills are to blame, or FUSE, I don't know, but
| it doesn't really matter.
|
| The truly amazing part for me is combining NVMe SSDs + RDMA
| + support for efficiently reading a huge batch of random
| offsets from a few already-opened huge files. This is how you
| get your training boxes consuming 20~30 GiB/s (and roughly 4
| million IOPS).
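|
| A rough sketch of that access pattern (plain POSIX pread in
| threads; the real thing uses RDMA and 3FS's native batched-read
| interface rather than this): issue many small random reads
| against a handful of already-open file descriptors.
|
|     import os, random
|     from concurrent.futures import ThreadPoolExecutor
|
|     READ_SIZE = 4096  # 4 KiB random reads
|
|     def random_read(fd, file_size):
|         offset = random.randrange(0, file_size - READ_SIZE)
|         offset -= offset % READ_SIZE            # align to 4 KiB
|         return os.pread(fd, READ_SIZE, offset)  # positional read, no shared seek
|
|     def run(paths, total_reads=100_000, workers=64):
|         fds = [os.open(p, os.O_RDONLY) for p in paths]  # open once, reuse
|         sizes = [os.fstat(fd).st_size for fd in fds]
|         with ThreadPoolExecutor(max_workers=workers) as pool:
|             futs = [pool.submit(random_read, fds[i % len(fds)],
|                                 sizes[i % len(fds)])
|                     for i in range(total_reads)]
|             return sum(len(f.result()) for f in futs)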
| rjzzleep wrote:
| FUSE has traditionally been famously slow. I remember
| there were some changes that supposedly made it faster,
| but maybe that was just a certain FUSE implementation.
| jamesblonde wrote:
| The block size is 4KB by default, which is a killer. We
| set it to 1MB or so by default - makes a huge difference.
|
| https://github.com/logicalclocks/hopsfs-go-mount
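|
| A quick way to see the effect on any FUSE mount (hypothetical
| path; with a tiny transfer size every read becomes another
| round trip through the FUSE daemon):
|
|     import time
|
|     def throughput(path, block_size):
|         start, total = time.perf_counter(), 0
|         with open(path, "rb", buffering=0) as f:  # unbuffered: block_size is the real read size
|             while chunk := f.read(block_size):
|                 total += len(chunk)
|         return total / (time.perf_counter() - start) / 2**20  # MiB/s
|
|     for bs in (4 * 2**10, 2**20):  # 4 KiB vs 1 MiB
|         print(bs, f"{throughput('/mnt/fuse/somefile', bs):.1f} MiB/s")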
| objectivefs wrote:
| There is also ObjectiveFS, which supports FUSE and uses S3 for
| both data and metadata storage, so there is no need to run any
| metadata nodes. Using S3 instead of a separate database also
| allows scaling both data and metadata with the performance of
| the S3 object store.
| joatmon-snoo wrote:
| nit: Colossus* for Google.
| MertsA wrote:
| >Tectonic some KV store,
|
| Tectonic is built on ZippyDB, which is a distributed DB built
| on RocksDB.
|
| >What's important here with 3FS is that it supports (1) a FUSE
| client - it just makes life so much easier
|
| Tectonic also has a FUSE client built for GenAI workloads on
| clusters backed by 100% NVMe storage.
|
| https://engineering.fb.com/2024/03/12/data-center-engineerin...
|
| Personally what stands out to me for 3FS isn't just that it
| has a FUSE client, but that they made it more of a hybrid of a
| FUSE client and a native IO path. You open the file just like
| normal, but once you have an fd you use their native library
| to do the actual IO. You still need to adapt your AI training
| code to use 3FS natively if you want to avoid the FUSE
| overhead, but now the FUSE client handles all the metadata
| operations that a native client would otherwise have needed to
| implement.
|
| https://github.com/deepseek-ai/3FS/blob/ee9a5cee0a85c64f4797...
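|
| A hypothetical sketch of that split (the real 3FS user-space
| IO API in the linked header differs; native_client and its
| functions are placeholders): metadata operations go through
| the FUSE mount as ordinary POSIX calls, and bulk IO is handed
| off to a native library that takes over the fd.
|
|     import os
|
|     import native_client  # placeholder for 3FS's user-space IO library
|
|     # 1. Path lookup, permissions, open(): plain POSIX via the FUSE mount.
|     fd = os.open("/mnt/3fs/dataset/train.bin", os.O_RDONLY)
|
|     # 2. Register the fd with the native library, which then does the
|     #    actual reads itself (RDMA to the storage nodes), bypassing the
|     #    FUSE data path and its extra copies/context switches.
|     handle = native_client.register_fd(fd)
|     buf = native_client.pread(handle, length=1 << 20, offset=0)
|
|     # 3. Tear down through the normal POSIX path again.
|     native_client.deregister_fd(handle)
|     os.close(fd)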
| randomtoast wrote:
| Why not use CephFS instead? It has been thoroughly tested in
| real-world scenarios and has demonstrated reliability even at
| petabyte scale. As an open-source solution, it can run on the
| fastest NVMe storage, achieving very high IOPS with 10 Gigabit or
| faster interconnect.
|
| I think their "Other distributed filesystem" section does not
| answer this question.
| tempest_ wrote:
| We have a couple ceph clusters.
|
| If my systems guys are telling me the truth, it is a real time
| sink to run and can require an awful lot of babysitting at
| times.
| huntaub wrote:
| IMO this is the problem with _all_ storage clusters that you
| run yourself, not just Ceph. Ultimately, keeping data alive
| through instance failures is just a lot of maintenance that
| needs to happen (even with automation).
| _joel wrote:
| I admin'd a cluster about 10 years back and it was 'ok' then,
| around the time of BlueStore. One issue was definitely my
| mistake, but it wasn't all that bad.
| elashri wrote:
| CERN uses CephFS with ~50PB across different applications, and
| they are happy with it.
| dfc wrote:
| I thought they used Ceph too. But I started looking around
| and it seems like they have switched to CernVM-FS and an
| in-house solution. I'm not sure what changed.
| elashri wrote:
| They didn't switch; they use both for different needs. EOS
| (CVMFS) is used mainly for physics data storage and user
| data. Ceph is used for many other things like
| infrastructure, self-hosted apps, etc.
| charleshn wrote:
| Because it's actually fairly slow.
|
| Among other things, the OSD was not designed with NVMe drives
| in mind - which is fair, given how old it is - so it's nowhere
| close to being able to handle modern NVMe IO throughput and
| IOPS.
|
| For that you need zero-copy, RDMA etc.
|
| Note that there is a next-generation OSD project called
| Crimson [0]; however, it's been a while and I'm not sure how
| well it's going. It's based on the awesome Seastar framework
| [1], which backs ScyllaDB.
|
| Achieving such performance would also require many changes to
| the client (RDMA, etc).
|
| Something like Weka [2] has a much better design for this kind
| of performance.
|
| [0] https://ceph.io/en/news/crimson/
|
| [1] https://seastar.io/
|
| [2] https://www.weka.io/
| __turbobrew__ wrote:
| With the latest Ceph releases I am able to saturate modern
| NVMe devices with 2 OSDs per NVMe. It is kind of a hack to
| have multiple OSDs per NVMe, but it works.
|
| I do agree that NVMe-oF is the next hurdle for Ceph
| performance.
| skrtskrt wrote:
| DigitalOcean uses Ceph underneath their S3 and block volume
| products. When I was there they had 2 teams just managing Ceph,
| not even any of the control plane stuff built on top.
|
| It is a complete bear to manage and tune at scale. And DO never
| greenlit offering anything based on CephFS either because it
| was going to be a whole other host of things to manage.
|
| Then of course you have to fight with the maintainers (Red Hat
| devs) to get any improvements contributed, assuming you even
| have team members with the requisite C++ expertise.
| Andys wrote:
| Ceph is massively over-complicated; if I had two teams I'd
| probably try to write one from scratch instead.
| skrtskrt wrote:
| Most of the legitimate datacenter-scale direct Ceph
| alternatives are unfortunately proprietary, in part because
| it takes so much money and so many human-expertise-hours to
| even prove out that scale; the vendors want to recoup costs
| and stay ahead.
|
| MinIO is absolutely not datacenter-scale and I would not
| expect anything in Go to really reach that point. Garbage
| collection is a rough thing at such enormous scale.
|
| I bet we'll get one in Rust eventually. Maybe from Oxide
| computer company? Though despite doing so much OSS, they
| seem to be focused on their specific server rack OS, not
| general-purpose solutions.
| steveklabnik wrote:
| > I bet we'll get one in Rust eventually. Maybe from
| Oxide computer company?
|
| Crucible is our storage service:
| https://github.com/oxidecomputer/crucible
|
| RFD 60, linked in the README, contains a bit of info
| about Ceph, which we did evaluate:
| https://rfd.shared.oxide.computer/rfd/0060
| huntaub wrote:
| I think that the author is spot on: there are a few
| dimensions along which you should evaluate these systems:
| theoretical limits, efficiency, and practical limits.
|
| From a theoretical point of view, like others have pointed out,
| parallel distributed file systems have existed for years -- most
| notably Lustre. These file systems should be capable of scaling
| out their storage and throughput to, effectively, infinity -- if
| you add enough nodes.
|
| Then you start to ask how much storage and throughput you can
| get from a node with X TiB of disk -- that is, you start
| evaluating efficiency. I ran some calculations (against FSx
| for Lustre, since I'm an AWS guy), and it appears that you can
| run 3FS in AWS for about 12-30% less than FSxL, depending on
| the replication factor you choose (which is good, but not
| great considering that you're now managing the cluster
| yourself).
|
| Then the third thing you ask is, anecdotally, whether people
| are actually able to configure these file systems at the
| deployment size you want (this is where you hear things like
| "oh, it's hard to get Ceph to 1 TiB/s") -- and that remains to
| be seen for something like 3FS.
|
| Ultimately, I obviously believe that storage and data are really
| important keys to how these AI companies operate -- so it makes
| sense that DeepSeek would build something like this in-house to
| get the properties that they're looking for. My hope is that we,
| at Archil, can find a better set of defaults that work for most
| people without needing to manage a giant cluster or even worry
| about how things are replicated.
| jamesblonde wrote:
| Maybe AWS could start by making fast NVMes available - without
| requiring multi-TB disks just to get 1 GB/s. The 3FS
| experiments were run on 14 GB/s NVMe disks - an order of
| magnitude more throughput than anything available in AWS
| today.
|
| SSDs Have Become Ridiculously Fast, Except in the Cloud:
| https://news.ycombinator.com/item?id=39443679
| kridsdale1 wrote:
| On my home LAN connected with 10gbps fiber between MacBook
| Pro and server, 10 feet away, I get about 1.5gbps vs the non-
| network speed of the disks of ~50 gbps. (Bits, not bytes)
|
| I traced this to the macOS SMB implementation really sucking.
| I set up an NFS driver and it got about twice as fast, but
| it's annoying to mount and use, and still far from the disks'
| capabilities.
|
| I've mostly resorted to abandoning the network (after large
| expense) and using Thunderbolt and physical transport of the
| drives.
| greenavocado wrote:
| Is NFS out of the question?
| kridsdale1 wrote:
| I have set it up but it's not easy to get drivers working
| on a Mac.
| dundarious wrote:
| SMB/CIFS is an incredibly chatty, synchronous protocol.
| There are/were massive products built around mitigating and
| working around this when trying to use it over high latency
| satellite links (US military did/does this).
| __turbobrew__ wrote:
| There are i4i instances in AWS which can get you a lot of
| IOPS with a smaller disk.
| stapedium wrote:
| I'm just a small business & homelab guy, so I'll probably never
| use one of these big distributed file systems. But when people
| start talking petabytes, I always wonder whether these things
| are actually backed up, and what you use for backup and
| recovery.
| huntaub wrote:
| Well, for active data, the idea is that the replication within
| the system is enough to keep the data alive from instance
| failure (assuming that you're doing the proper maintenance and
| repairing hosts pretty quickly after failure). Backup and
| recovery, in that case, is used more for saving yourself
| against fat-fingering an "rm -rf /" type command. Since it's
| just a file system, you should be able to use any backup and
| recovery solution that works with regular files.
| shermantanktop wrote:
| Backup and recovery is a process with a non-zero failure rate.
| The more you test it, the lower the rate, but there is always a
| failure mode.
|
| With these systems, the runtime guarantees of data integrity
| are very high and the failure rate is very low. And best of
| all, failure is constantly happening as a normal activity in
| the system.
|
| So once you have data integrity guarantees that are better in
| your runtime system than in your backup process, why back up?
|
| There are still reasons, but they become more specific to the
| data being stored and less important as a general datastore
| feature.
| Eikon wrote:
| > why backup?
|
| Because of mistakes and malicious actors...
| overfeed wrote:
| ...and the "Disaster" in "Disaster recovery" may have been
| localized and extensive (fire, flooding, major earthquake,
| brownouts due to a faulty transformer, building collapse, a
| solvent tanker driving through the wall into the server
| room, a massive sinkhole, etc)
| shermantanktop wrote:
| Yes, the dreaded fiber vs. backhoe. But if your
| distributed file system is geographically redundant,
| you're not exposed to that, at least from an integrity
| POV. It sucks that 1/3 or 1/5 or whatever of your serving
| fleet just disappeared, but backup won't help with that.
| overfeed wrote:
| > But if your distributed file system is geographically
| redundant
|
| Redundancy and backups are not the same thing! There's
| some overlap, but treating them as interchangeable will
| occasionally result in terrible outcomes, like when a
| config change makes all 5/5 datacenters fragment and fail
| to form a quorum, and you then discover your services have
| circular dependencies while trying to bootstrap foundational
| services. Local backups would solve this (each DC would load
| the last known good config), but rebuilding the consensus
| necessary for redundancy requires coordination with
| now-unreachable hosts.
| ted_dunning wrote:
| It is common for the backup of these systems to be a secondary
| data center.
|
| Remember that there are two purposes for backup. One is
| hardware failures, the second is fat fingers. Hardware failures
| are dealt with by redundancy which always involves keeping
| redundant information across multiple failure domains. Those
| domains can be as small as a cache line or as big as a data
| center. These failures can be dealt with transparently and
| automagically in modern file systems.
|
| With fat fingers, the failure domain has no natural boundaries
| other than time. As such, snapshots kept in the file system
| are the best choice, especially if you have a copy-on-write
| filesystem that can keep snapshots with very little overhead.
|
| There is also the special case of adversarial fat fingering
| which appears in ransomware. The answer is snapshots, but the
| core problem is timely detection since otherwise you may not
| have a single point in time to recover from.
| mertleee wrote:
| What are the odds 3FS is backdoored?
| huntaub wrote:
| I think that's a pretty odd concern to have. What would you
| imagine that looks like? If you're running these kinds of
| things securely, you should be locking down the network access
| to the hosts (they don't need outbound internet access, and
| they shouldn't need inbound access from anything except your
| application).
| xpe wrote:
| > I think that's a pretty odd concern to have.
|
| Thinking about security risks is an odd concern to have?
| huntaub wrote:
| I think that worrying that a self-hosted file system has a
| backdoor to exfiltrate data is an odd concern. Security
| concerns are (obviously) normal, but you should not be
| exposing these kinds of services to the public internet (or
| giving them access to the public internet), eliminating the
| concern that it's giving your data away.
| xpe wrote:
| > I think that worrying that a self-hosted file system
| has a backdoor to exfiltrate data is an odd concern.
|
| Great security teams get paid to "worry", by which I mean
| "make a plan" for a given attack tree.
|
| Your use of "odd" looks like a rhetorical technique to
| downplay legitimate security considerations. From my
| point of view, you are casually waving away an essential
| area of security. See:
|
| https://www.paloaltonetworks.com/cyberpedia/data-
| exfiltratio...
|
| > but you should not be exposing these kinds of services
| to the public internet (or giving them access to the
| public internet), eliminating the concern that it's
| giving your data away.
|
| Let's evaluate the following claim: "a 'properly'
| isolated system would, in theory, have no risk of data
| exfiltration." Now we have to define what we mean.
| Isolated from the network? Running in a sandbox? What
| happens when that system can interact with a compromised
| system? What happens when people mess up?
|
| From a security POV, any software could have some
| component that is malicious, compromised, or backdoored.
| It might be in the project itself or in the supply chain.
| And so on.
|
| Defense in depth matters. None of the above concerns are
| "odd". A good security team is going to factor these in.
|
| P.S. If you mean "low probability" then just say that.
| MaxPock wrote:
| By the NSA or Britain's GCHQ, which want all software backdoored?
| robinhoodexe wrote:
| I'm interested in how it compares to SeaweedFS [1], which we
| use for storing weather data (about 3 PB) for ML training.
|
| [1] https://github.com/seaweedfs/seaweedfs
| huntaub wrote:
| My guess is that performance is pretty comparable, but it
| looks like SeaweedFS has a lot more management features (such
| as tiered storage), which you may or may not be using.
| rfoo wrote:
| IMO they look similar at a glance, but actually serve very
| different use cases.
|
| SeaweedFS is more about amazing small object read performance
| because you effectively have no metadata to query to read an
| object. You just distribute volume id, file id (+cookie) to
| clients.
|
| 3FS is less extreme in this regard, supports an actual POSIX
| interface, and isn't particularly good at how fast you can
| open() files. On the other hand, it shards files into smaller
| (e.g. 512KiB) chunks, demands RDMA NICs, and makes reading
| randomly from large files scary fast [0]. If your dataset is
| immutable you can emulate what SeaweedFS does, but if it
| isn't, then SeaweedFS is better.
|
| [0] By scary fast I mean being able to completely saturate 12
| PCIe Gen 4 NVMe SSDs with 4K random reads on a single storage
| server, and you can scale that horizontally.
| jszymborski wrote:
| I wonder how close to something like 3FS you can get by
| mounting SeaweedFS with S3FS, which mounts using FUSE.
|
| https://github.com/s3fs-fuse/s3fs-fuse
| seethishat wrote:
| How easy is it to disable DeepSeek's distributed FS? Say for
| example a US college has been authorized to use DeepSeek for
| research, but must ensure no data leaves the local research
| cluster filesystem?
|
| Edit: I am a DeepSeek newbie BTW, so if this question makes no
| sense at all, that's why ;)
| ikeashark wrote:
| I might need more clarification, but if one is paranoid or
| dealing with information this sensitive, the DeepSeek model
| and 3FS can both be deployed locally, offline, with no
| connection to the internet.
| seethishat wrote:
| Thank you, that answers my question.
| snthpy wrote:
| Similar to the SeaweedFS question in sibling comment, how does
| this compare to JuiceFS?
|
| In particular, for my homelab setup I'm planning to run JuiceFS
| on top of S3 Garage. I know Garage only does replication,
| without any erasure coding or sharding, so it's not really
| comparable, but I don't need all that and it looked a lot
| simpler to set up to me.
| huntaub wrote:
| It's a very different architecture. 3FS is storing everything
| on SSDs, which makes it extremely expensive but also low
| latency (think ~100-300us for access). JuiceFS stores data in
| S3, which is extremely cheap but very high latency (~20-60ms
| for access). The performance _scalability_ should be pretty
| similar, if you 're able to tolerate the latency numbers. Of
| course, they both use databases for the metadata layer, so
| assuming you pick the same one -- the metadata performance
| should also be similar.
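|
| A back-of-envelope view of what that latency gap means for
| serial small reads (taking rough midpoints of the numbers
| above, 200us vs 40ms):
|
|     for name, latency_s in [("NVMe-backed (3FS-style)", 200e-6),
|                             ("S3-backed (JuiceFS-style)", 40e-3)]:
|         per_thread_iops = 1 / latency_s
|         inflight_for_100k = 100_000 / per_thread_iops  # concurrency for 100k IOPS
|         print(f"{name}: {per_thread_iops:,.0f} IOPS per thread, "
|               f"~{inflight_for_100k:,.0f} requests in flight for 100k IOPS")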
| nodesocket wrote:
| Does this get us closer to fully distributed and performant
| LLMs?
|
| Instead of spending ludicrous amounts on Nvidia GPUs, we could
| just deploy commodity clusters of AMD EPYC servers with tons
| of memory, NVMe disks, and 40G or 100G networking, which is
| vastly less expensive.
|
| Goodbye Nvidia moat!
| dang wrote:
| Related. Others?
|
| _Understanding Smallpond and 3FS_ -
| https://news.ycombinator.com/item?id=43232410 - March 2025 (47
| comments)
|
| _Smallpond - A lightweight data processing framework built on
| DuckDB and 3FS_ - https://news.ycombinator.com/item?id=43200793 -
| Feb 2025 (73 comments)
|
| _Fire-Flyer File System (3FS)_ -
| https://news.ycombinator.com/item?id=43200572 - Feb 2025 (101
| comments)
___________________________________________________________________
(page generated 2025-04-17 23:00 UTC)