[HN Gopher] Ceph: A Journey to 1 TiB/s
       ___________________________________________________________________
        
       Ceph: A Journey to 1 TiB/s
        
       Author : davidmr
       Score  : 369 points
       Date   : 2024-01-19 20:02 UTC (1 day ago)
        
 (HTM) web link (ceph.io)
 (TXT) w3m dump (ceph.io)
        
       | riku_iki wrote:
       | What router/switch would one use for such speeds?
        
         | KeplerBoy wrote:
         | 800Gbps via OSFP and QSFP-DD are already a thing. Multiple
         | vendors have NICs and switches for that.
        
           | _zoltan_ wrote:
           | can you show me a 800G NIC?
           | 
           | the switch is fine, I'm buying 64x800G switches, but NIC wise
           | I'm limited to 400Gbit.
        
             | KeplerBoy wrote:
             | fair enough, it seems I was mistaken about the NIC. I guess
             | that has to wait for PCIe 6 and should arrive soon-ish.
        
           | CyberDildonics wrote:
           | 16x PCIe 4.0 is 32 GB/s and 16x PCIe 5.0 should be 64 GB/s,
           | so how is any computer using 100 GB/s?
        
             | KeplerBoy wrote:
             | I was talking about Gigabit/s, not Gigabyte/s.
             | 
             | The article however actually talks about Terabyte/s scale,
             | albeit not over a single node.
        
               | CyberDildonics wrote:
               | 800 gigabits is 100 gigabytes which is still more than
               | PCIe 5.0 16x 64 gigabyte per second bandwidth.
               | 
               | You said there were 800 gigabit network cards, I'm
               | wondering how that much bandwidth makes it to the card in
               | the first place.
               | 
               |  _The article however actually talks about Terabyte/s
               | scale, albeit not over a single node._
               | 
               | This does not have anything to do with what you
               | originally said, you were talking about 800gb single
               | ports.
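               | 
               | As a rough sanity check (per-direction numbers, ignoring
               | protocol overhead; my own arithmetic, not from the
               | article):
               | 
               |     # raw per-direction bandwidth, in GB/s
               |     GBIT_PER_LANE = {"gen4": 16, "gen5": 32}
               | 
               |     def pcie_gbyte_s(gen, lanes):
               |         return GBIT_PER_LANE[gen] * lanes / 8
               | 
               |     print(pcie_gbyte_s("gen4", 16))  # 32.0
               |     print(pcie_gbyte_s("gen5", 16))  # 64.0
               |     print(800 / 8)                   # 100.0 for 800GbE
               | 
               | So a single 800G port would indeed need something like a
               | gen5 x32 or gen6 x16 link behind it.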
        
               | KeplerBoy wrote:
               | Yes, apparently I was mistaken about the NICs. They don't
               | seem to be available yet.
               | 
               | But it's not a PCIe limitation. There are PCIe devices
               | out there which use 32 lanes, so you could achieve the
               | bandwidth even on PCIe5.
               | 
               | https://www.servethehome.com/ocp-nic-3-0-form-factors-
               | quick-...
        
               | NavinF wrote:
               | I'm not aware of any 800G cards, but FYI a single
               | Mellanox card can use two PCIe x16 slots to avoid NUMA
               | issues on dual-socket servers: https://www.nvidia.com/en-
               | us/networking/ethernet/socket-dire...
               | 
               | So the software infra for using multiple slots already
               | exists and doesn't require any special config. Oh and
               | some cards can use PCIe slots across multiple hosts. No
               | idea why you'd want to do that, but you can.
        
         | epistasis wrote:
         | Given their configuration of just 4U per rack spread across 17
         | racks, there's likely a bunch of compute in the rest of each
         | rack, and 1-2 top-of-rack switches like this:
         | 
         | https://www.qct.io/product/index/Switch/Ethernet-Switch/T700...
         | 
         | And then you connect the TOR switches to higher level switches
         | in something like a Clos distribution to get the desired
         | bandwidth between any two nodes:
         | 
         | https://www.techtarget.com/searchnetworking/definition/Clos-...
        
         | NavinF wrote:
         | Linked article says they used 68 machines with 2 x 100GbE
         | Mellanox ConnectX-6 cards. So any 100G pizza box switches
         | should work.
         | 
         | Note that 36 port 56G switches are dirt cheap on eBay and 4tbps
         | is good enough for most homelab use cases
        
           | riku_iki wrote:
           | > So any 100G pizza box switches should work.
           | 
           | but will it be able to handle combined TB/s traffic?
        
             | baq wrote:
             | any switch which can't handle full load on all ports isn't
             | worthy of the name 'switch', it's more like 'toy network
             | appliance'
        
               | birdman3131 wrote:
               | I will forever be scarred by the "Gigabit" switches of
               | old that had 2 gigabit ports and 22 100Mb ports. A
               | coworker bought one, missing the nuance.
        
               | bombcar wrote:
               | Still happens, gotta see if the top speed mentioned is an
               | uplink or normal ports.
        
             | aaronax wrote:
             | Yes. Most network switches can handle all ports at 100%
             | utilization in both directions simultaneously.
             | 
             | Take for example the Mellanox SX6790 available for less
             | than $100 on eBay. It has 36 56gbps ports. 36 * 2 * 56 =
             | 4032gbps and it is stated to have a switching capacity of
             | 4.032Tbps.
             | 
             | Edit: I guess you are asking how one would possibly sip
             | 1TiB/s of data into a given client. You would need multiple
             | clients spread across several switches to generate such
             | load. Or maybe some freaky link aggregation. 10x 800gbps
             | links for your client, plus at least 10x 800gbps links out
             | to the servers.
        
             | bombcar wrote:
             | Even the bargain Mikrotik can do 1.2Tbps
             | https://mikrotik.com/product/crs518_16xs_2xq
        
               | margalabargala wrote:
               | For those curious, a "bargain" on a 100gbps switch means
               | about $1350
        
               | epistasis wrote:
               | On a cluster with more than $1M of NVMe disks, that does
               | actually seem like a bargain.
               | 
               | (Note that the linked MikroTik switch only has 100gbe on
               | a few ports, and wouldn't really classify as a full
               | 100gbe switch to most people)
        
               | margalabargala wrote:
               | Sure- I don't mean to imply that it isn't. I can
               | absolutely see how that's inexpensive for 100gbe
               | equipment.
               | 
               | That was more for the benefit of others like myself, who
               | were wondering if "bargain" was comparative, or
               | inexpensive enough that it might be worth buying one next
               | time they upgraded switches. For me personally it's still
               | an order of magnitude away from that.
        
               | bombcar wrote:
               | https://mikrotik.com/product/crs305_1g_4s_in is the sweet
               | spot right now for home users. Four 10g ports and a 1g,
               | you can use the 1g for "uplink" to the internet and one
               | of the 10g for your "big old Nortel gigabit switch with
               | 10g uplink" and one for your Mac and two for your NAS and
               | VM server. ;)
               | 
               | Direct cables are moderately cheap, and modules for 10g
               | Ethernet aren't insanely expensive.
        
               | Palomides wrote:
               | there's usually some used dx010 (32x100gbe) on ebay for
               | less than $500
               | 
               | the cheapest new 100gbe switch I know of is the mikrotik
               | CRS504-4XQ-IN (4x100gbe, around $650)
        
               | riku_iki wrote:
               | TB != Tb..
        
       | matheusmoreira wrote:
       | Does anyone have experience running ceph in a home lab? Last time
       | I looked into it, there were quite significant hardware
       | requirements.
        
         | nullwarp wrote:
         | There still are. As someone who has done both production and
         | homelab deployments: unless you are specifically just looking
         | for experience with it and just setting up a demo - don't
         | bother.
         | 
         | When it works, it works great - when it goes wrong it's a huge
         | headache.
         | 
         | Edit: if distributed storage is just something you are
         | interested in, there are much better options for a
         | homelab setup:
         | 
         | - seaweedfs has been rock solid for me for years in both small
         | and huge scales. we actually moved our production ceph setup to
         | this.
         | 
         | - longhorn was solid for me when i was in the k8s world
         | 
         | - glusterfs is still fine as long as you know what you are
         | getting into.
        
           | reactordev wrote:
           | I'd throw minio [1] in the list there as well for homelab k8s
           | object storage.
           | 
           | [1] https://min.io/
        
             | speedgoose wrote:
             | Also garage. https://garagehq.deuxfleurs.fr/
        
               | BlackLotus89 wrote:
               | Garage seems to only do duplication:
               | https://garagehq.deuxfleurs.fr/documentation/design/goals/
               | 
               | > Storage optimizations: erasure coding or any other
               | coding technique both increase the difficulty of placing
               | data and synchronizing; we limit ourselves to
               | duplication.
               | 
               | This is probably a no-go for most use cases where you
               | work with large datasets.
        
             | plagiarist wrote:
             | Minio doesn't make any sense to me in a homelab. Unless I'm
             | reading it wrong it sounds like a giant pain to add more
             | capacity while it is already in use. There's basically no
             | situation where I'm more likely to add capacity over time
             | than a homelab.
        
               | reactordev wrote:
               | You get a new nas (minio server pool) and you plug it
               | into your home lab (site replication) and now it's part
               | of the distributed minio storage layer (k8s are happy).
               | How is that hard? It's the same basic thing for Ceph or
               | any distributed JBOD mass storage engine. Minio has some
               | funkiness with how you add more storage but it's totally
               | capable of doing it while in use. Everything is atomic.
        
           | dataangel wrote:
           | I really wish there was a benchmark comparing all of these +
           | MinIO and S3. I'm in the market for a key value store, using
           | S3 for now but eyeing moving to my own hardware in the future
           | and having to do all the work to compare these is one of the
           | major things making me procrastinate.
        
             | woopwoop24 wrote:
             | minio is good but you really need fast disks. They also
             | really don't like it when you want to change the size of
             | your cluster setup. There's no plan to add cache disks;
             | they just say to use faster disks. I have it running and it
             | goes smoothly, but it's not really user friendly to
             | optimize.
        
             | rglullis wrote:
             | Minio gives you "only" S3 object storage. I've setup a
             | 3-node Minio cluster for object storage on Hetzner, each
             | server having 4x10TB, for ~50EUR/month each. This means
             | 80TB usable data for ~150EUR/month. It can be worth it if
             | you are trying to avoid egress fees, but if I were building
             | a data lake or anything where the data was used mostly for
             | internal services, I'd just stick with S3.
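             | 
             | To put rough numbers on that (ballpark assumptions: about
             | $23 per TB-month for S3 Standard storage, ignoring egress
             | and durability differences):
             | 
             |     usable_tb = 80
             |     hetzner_eur_month = 3 * 50   # three servers
             |     s3_usd_per_tb_month = 23     # rough list price
             | 
             |     print(hetzner_eur_month / usable_tb)    # ~1.9 EUR/TB
             |     print(s3_usd_per_tb_month * usable_tb)  # ~1840 USD
             | 
             | The gap narrows once you count your own time, but egress
             | is where S3 really hurts.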
        
           | rglullis wrote:
           | > glusterfs is still fine as long as you know what you are
           | going into.
           | 
           | Does that include storage volumes for databases? I was using
           | glusterFS as a way to scale my swarm cluster horizontally and
           | I am reasonably sure that it corrupted one database to the
           | point I lost more than a few hours of data. I was quite
           | satisfied with the setup until I hit that.
           | 
           | I know that I am considered crazy for sticking with Docker
           | Swarm until now, but aside from this lingering issue with how
           | to manage stateful services, I honestly don't feel the need
           | to move to k8s yet. My cluster is ~10 nodes running <
           | 30 stacks and it's not like I have tens of people working
           | with me on it.
        
             | camkego wrote:
             | Docker Swarm seems to be underrated, from a simplicity and
             | reliability perspective, IMHO.
        
           | bityard wrote:
           | Ceph is sort of a storage all-in-one: it provides object
           | storage, block storage, and network file storage. May I ask,
           | which of these are you using seaweedfs for? Is it as
           | performant as Ceph claims to be?
        
           | asadhaider wrote:
           | I thought it was popular for people running Proxmox clusters
        
             | geerlingguy wrote:
             | It is, and if you have a few nodes with at least 10 GbE
             | networking, it's certainly the best clustered storage
             | option I can think of.
        
           | matheusmoreira wrote:
           | I just want to hoard data. I hate having to delete stuff to
           | make space. Things disappear from the web every day. I should
           | hold onto them.
           | 
           | My requirements for a storage solution are:
           | 
           | > Single root file system
           | 
           | > Storage device failure tolerance
           | 
           | > Gradual expansion capability
           | 
           | The problem with every storage solution I've ever seen is the
           | lack of gradual expandability. I'm not a corporation, I'm
           | just a guy. I don't have the money to buy 200 hard disks all
           | at once. I need to gradually expand capacity as needed.
           | 
           | I was attracted to Ceph because it apparently allows you
           | to throw a bunch of drives of any make and model at it and it
           | just pools them all up without complaining. The complexity is
           | nightmarish though.
           | 
           | ZFS is nearly perfect but when it comes to expanding capacity
           | it's just as bad as RAID. Expansion features seem to be
           | _just_ about to land for quite a few years now. I remember
           | getting excited about it after seeing news here only for
           | people to deflate my expectations. Btrfs has a flexible block
           | allocator which is just what I need, but... it's btrfs.
        
             | chromatin wrote:
             | > ZFS is nearly perfect but when it comes to expanding
             | capacity it's just as bad as RAID.
             | 
             | if you don't mind the overhead of a "pool of mirrors"
             | approach [1], then it is easy to expand storage by adding
             | pairs of disks! This is how my home NAS is configured.
             | 
             | [1] https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-
             | vdevs...
        
               | roygbiv2 wrote:
               | This is also exactly how mine is done. Started off with a
               | bunch of 2TB disks. I've now got a mixture of 16TB down
               | to 4TB, all in the original pool.
        
               | Snow_Falls wrote:
               | 50% storage efficiency is a tough pill to swallow, but
               | drives are pretty big and the ability to expand as you go
               | means it can be cheaper in the long run to just buy the
               | larger, new drives coming out than pay upfront for a
               | bunch of drives in a raidz config.
        
             | bityard wrote:
             | On a single host, you could do this with LVM. Add a pair of
             | disks, make them a RAID 1, create a physical volume on
             | them, then a volume group, then a logical volume with XFS
             | on top. To expand, you add a pair of disks, RAID 1 them,
             | and add them to the LVM. It's a little stupid, but it would
             | work.
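             | 
             | A rough sketch of that sequence (device names and mount
             | point are made up; wrapped in Python purely for
             | illustration, run as root and adapt to your disks):
             | 
             |     import subprocess
             | 
             |     def run(cmd):
             |         subprocess.run(cmd, shell=True, check=True)
             | 
             |     # initial pair: RAID1 -> PV -> VG -> LV -> XFS
             |     run("mdadm --create /dev/md0 --level=1 "
             |         "--raid-devices=2 /dev/sda /dev/sdb")
             |     run("pvcreate /dev/md0")
             |     run("vgcreate data /dev/md0")
             |     run("lvcreate -n bulk -l 100%FREE data")
             |     run("mkfs.xfs /dev/data/bulk")
             | 
             |     # expansion: new mirror, grow the VG, LV, then XFS
             |     run("mdadm --create /dev/md1 --level=1 "
             |         "--raid-devices=2 /dev/sdc /dev/sdd")
             |     run("vgextend data /dev/md1")
             |     run("lvextend -l +100%FREE /dev/data/bulk")
             |     run("xfs_growfs /mnt/bulk")  # grows online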
             | 
             | If multiple nodes are not off the table, also look into
             | seaweedfs.
             | 
             | Also consider how (or if) you are going to back up your
             | hoard of data.
        
               | matheusmoreira wrote:
               | > Also consider how (or if) you are going to back up your
               | hoard of data.
               | 
               | I actually emailed backblaze years ago about their
               | supposedly unlimited consumer backup plan. Asked them if
               | they would _really_ allow me to dump into their systems
               | dozens of terabytes of encrypted undeduplicable data.
               | They responded that yes, they would. I still didn't
               | believe them; these corporations never really mean it
               | when they say unlimited. Plus they had no Linux software.
        
               | nijave wrote:
               | > these corporations never really mean it when they say
               | unlimited. Plus they had no Linux software
               | 
               | Afaik they rely on the latter to mitigate the risk of the
               | former.
        
               | Snow_Falls wrote:
               | Considering the fact that most data-heavy servers are
               | Linux, that would be a pretty clever way of staying true
               | to their word.
        
             | deadbunny wrote:
             | ZFS using mirrors is extremely easy to expand. Need more
             | space and you have small drives? Replace the drives in a
             | mirror one by one with bigger ones. Need more space and
             | already have huge drives? Just add another vdev mirror. And
             | the added benefit of not living in fear of drive failure
             | while resilvering as it is much faster with mirrors than
             | raidX.
             | 
             | Sure the density isn't great as you're essentially running
             | at 50% of raw storage but - touches wood - my home zpool
             | has been running strong for about a decade doing the above
             | from 6x 6tb drives (3x 6tb mirrors) to 16x 10-20tb drives
             | (8x mirrors, differing sized drives but matched per mirror
             | like a 10tb x2 mirror, a 16tb x2 mirror etc).
             | 
             | Edit: Just realised someone else has already mentioned a
             | pool of mirrors. Consider this another +1.
        
               | matheusmoreira wrote:
               | > Replace the drives in a mirror one by one with bigger
               | ones.
               | 
               | That's exactly what I meant by "just as bad as RAID".
               | Expanding an existing array is analogous to every single
               | drive in the array failing and getting replaced with
               | higher capacity drives.
               | 
               | When a drive fails, the array is in a degraded state.
               | Additional drive failures put the entire system in danger
               | of data loss. The rebuilding process generates enormous
               | I/O loads on all the disks. Not only does it take an
               | insane amount of time, according to my calculations the
               | probability of read errors happening _during the error
               | recovery process_ is about 3%. Such expansion operations
               | have a real chance of destroying the entire array.
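               | 
               | For what it's worth, a number in that range falls out of
               | the usual spec-sheet math (assuming a 1-in-10^15 bit
               | unrecoverable-read-error rate and rereading one full
               | drive; the result is very sensitive to those
               | assumptions):
               | 
               |     def p_ure(drive_bytes, ure_per_bit=1e-15):
               |         bits = drive_bytes * 8
               |         return 1 - (1 - ure_per_bit) ** bits
               | 
               |     print(p_ure(4e12))   # ~0.03 for a 4 TB drive
               |     print(p_ure(20e12))  # ~0.15 for a 20 TB drive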
        
               | deadbunny wrote:
               | That's not the case with mirrored vdevs. There is no
               | degradation of the array with a failed drive in a
               | mirrored vdev; it continues humming along perfectly fine.
               | 
               | Also, resilvers are not as intensive when rebuilding a
               | mirror, as you are just copying from one disk in the vdev
               | to the other, not reading all X other drives and
               | recalculating parity at the same time. This means fewer
               | reads across the
               | entire array and much much quicker resilver times, thus
               | less window for drive failure.
               | 
               | But don't just take my word for it. This blog post
               | goes into much more detail:
               | https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-
               | vdevs...
        
             | amadio wrote:
             | EOS (https://cern.ch/eos, https://github.com/cern-eos/eos)
             | is probably a bit more complicated than other solutions to
             | set up and manage, but it does allow adding/removing disks
             | and nodes serving data on the fly. This is essential to let
             | us upgrade hardware of the clusters serving experimental
             | data with minimal to no downtime.
        
             | sekh60 wrote:
             | I've run Ceph at home since the jewel release. I migrated
             | to it after running FreeNAS.
             | 
             | I use it for RBD volumes for my OpenStack cluster and for
             | CephFS. With a total raw capacity of around 350TiB. Around
             | 14TiB of that is NVMe storage for RBD and CephFS metadata.
             | The
             | rest is rust. This is spread across 5 nodes.
             | 
             | I currently am only buying 20TB Exos drives for rust. SMR
             | and I think HSMR are both no-gos for Ceph, as are non-
             | enterprise SSDs, so storage is expensive. I do have a mix
             | of disks though, as the cluster has grown organically. So I
             | have a few 6TB WD Reds in there, from before their SMR
             | shift.
             | 
             | My networks for OpenStack, Ceph and Ceph backend are all
             | 10Gbps. With the flash storage, when repairing I get about
             | 8GiB/s. With rust it is around 270MiB/s. The bottleneck I
             | think is due to 3 of the nodes running on first-gen Xeon-D
             | boards; the few Reds do slow things
             | down too. The 4th node runs an AMD Rome CPU, and the newest
             | an AMD Genoa cpu. So I am looking at about 5k CAD a node
             | before disks. I colocate the MDS, OSDs and MONs, with 64GiB
             | of ram each. Each node gets 6 rust, and 2 nvme drives.
             | 
             | Complexity-wise, it's pretty simple. I deployed the initial
             | iteration by hand, and then when cephadm was released I
             | converted it daemon by daemon smoothly. I find on the
             | mailing lists and Reddit most of the people encountering
             | problems deploy it via Proxmox and don't really understand
             | Ceph because of it.
        
             | nijave wrote:
             | Not sure what the multidisk consensus is for btrfs
             | nowadays, but adding/removing devices is trivial, you can do
             | "offline" dedupe, and you can rebalance data if you change
             | the disk config.
             | 
             | As an added bonus it's also in-tree so you don't have to
             | worry about kernel updates breaking things
             | 
             | I think you can also potentially do btrfs+LVM and let LVM
             | manage multi device. Not sure what performance looks like
             | there, though
        
               | matheusmoreira wrote:
               | That's all great but btrfs parity striping is still
               | unusable. How many more decades will it take?
        
             | Snow_Falls wrote:
             | If you're willing to use mirror vdevs, expansions can be
             | done two drives at a time. Also, depending on how often
             | your data changes, you should check out snapraid. It
             | doesn't have all the features of ZFS but it's perfect for
             | stuff that rarely changes (media or, in your case,
             | archiving).
             | 
             | Also unionfs or similar can let you merge zfs and snapraid
             | into one unified filesystem so you can place important data
             | in zfs and unchanging archive data in snapraid.
        
           | sob727 wrote:
           | Curious, what do you mean by "know what you go into" re
           | glusterfs?
           | 
           | I recently tried ceph in a homelab setup, gave up because of
           | complexity, and settled on glusterfs. I'm not a pro though,
           | so I'm not sure if there's any shortcomings that are clear to
           | everybody but me, hence why your comment caught my attention.
        
           | cholmon wrote:
           | GlusterFS support looks to be permanently ending later this
           | year.
           | 
           | https://access.redhat.com/support/policy/updates/rhs
           | 
           |  _Note that the Red Hat Gluster Storage product has a defined
           | support lifecycle through to 31-Dec-24, after which the Red
           | Hat Gluster Storage product will have reached its EOL.
           | Specifically, RHGS 3.5 represents the final supported RHGS
           | series of releases._
           | 
           | For folks using GlusterFS currently, what's your plan after
           | this year?
        
         | loeg wrote:
         | Why would you bother with a distributed filesystem when you
         | don't have to?
        
           | iwontberude wrote:
           | It's cool to cluster everything for some people (myself
           | included). I see it more like a design constraint than a pure
           | benefit.
        
           | imiric wrote:
           | For the same reason you would use one in enterprise
           | deployments: if setup properly, it's easier to scale. You
           | don't need to invest in a huge storage server upfront, but
           | could build it out as needed with cheap nodes. Assuming it
           | works painlessly as a single-node filesystem, which I'm
           | not yet convinced the existing solutions do.
        
             | loeg wrote:
             | > if setup properly, it's easier to scale
             | 
             | For home use/needs, I think vertical scaling is much
             | easier.
        
               | imiric wrote:
               | Not really. Most consumer motherboards have a limited
               | number of SATA ports, and server hardware is more
               | expensive, noisy and requires a lot of space. Consumers
               | usually go with branded NAS appliances, which are also
               | expensive and limited at scaling.
               | 
               | Setting up a cluster of small heterogeneous nodes is
               | cheaper, more flexible, and can easily be scaled as
               | needed, _assuming_ that the distributed storage software
               | is easy to work with and trouble-free. This last part is
               | what makes it difficult to setup and maintain, but if the
               | software is stable, I would prefer this approach for home
               | use.
        
           | m463 wrote:
           | lol, wrong place to ask questions of such practicality.
           | 
           | that said, I played with virtualization and I didn't need to.
           | 
           | but then I retired a machine or two and it has been very
           | helpful.
           | 
           | And I used to just use physical disks and partitions. But
           | with the VMs I started using volume manager. It became easier
           | to grow and shrink storage.
           | 
           | and...
           | 
           | well, now a lot of this is second nature. I can spin up a new
           | "machine" for a project and it doesn't affect anything else.
           | I have better backups. I can move a virtual machine.
           | 
           | yeah, there are extra layers of abstraction but hey.
        
           | erulabs wrote:
           | So that when you _do_ have to, you know how to do it.
        
             | loeg wrote:
             | I think most of us will go our whole lives never having to
             | deploy Ceph, especially at home.
        
               | erulabs wrote:
               | You're absolutely not wrong - but asking a devops
               | engineer why they over engineered their home cluster is
               | sort of like asking a mechanic "why is your car so fast?
               | Couldn't you just take the bus?"
        
           | matheusmoreira wrote:
           | I'm indifferent towards the distributed nature thing. What I
           | want is ceph's ability to pool any combination of drives of
           | any make, model and capacity into organized redundant fault
           | tolerant storage, and its ability to add arbitrary drives to
           | that pool at any point in the system's lifetime. RAID-like
           | solutions require identical drives and can't be easily
           | expanded.
        
             | loeg wrote:
             | ZFS and BtrFS have some capability for this.
        
           | nh2 wrote:
           | One reason for using Ceph instead of other RAID solutions on
           | a single machine is that it supports disk failures more
           | flexibly.
           | 
           | In most RAIDs (including ZFS's, to my knowledge), the set of
           | disks that can fail together is static.
           | 
           | Say you have physical disks A B C D E F; common setup is to
           | group RAID1'd disks into a pool such as `mirror(A, B) +
           | mirror(C, D) + mirror(E, F)`.
           | 
           | With that, if disk A fails, and then later B fails before you
           | replace A, your data is lost.
           | 
           | But with Ceph, and replication `size = 2`, when A fails, Ceph
           | will (almost) immediately redistribute your data so that it
           | has 2 replicas again, across all remaining disks B-F. So then
           | B can fail and you still have your data.
           | 
           | So in Ceph, you give it a pool of disks and tell it to
           | "figure out the replication" iself. Most other systems don't
           | offer that; the human defines a static replication structure.
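           | 
           | A toy way to see the difference (just an illustration of the
           | re-replication idea, not Ceph's actual CRUSH placement):
           | 
           |     import random
           | 
           |     disks = {"A", "B", "C", "D", "E", "F"}
           |     objects = {i: set(random.sample(sorted(disks), 2))
           |                for i in range(1000)}
           | 
           |     def fail(disk):
           |         disks.discard(disk)
           |         for replicas in objects.values():
           |             replicas.discard(disk)
           |             if len(replicas) < 2:  # heal onto survivors
           |                 replicas.add(random.choice(
           |                     sorted(disks - replicas)))
           | 
           |     fail("A")  # healed onto B..F before the next failure
           |     fail("B")  # second failure: nothing is lost
           |     print(min(len(r) for r in objects.values()))  # 2
           | 
           | With the static mirror layout, the same A-then-B sequence
           | loses every object that lived on that pair.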
        
         | bluedino wrote:
         | Related question, how does someone get into working with Ceph?
         | Other than working somewhere that already uses it.
        
           | SteveNuts wrote:
           | You could start by installing Proxmox on old machines you
           | have, it uses Ceph for its distributed storage, if you choose
           | to use it.
        
           | candiddevmike wrote:
           | Look into the Rook project
        
           | hathawsh wrote:
           | The recommended way to set up Ceph is cephadm, a single-file
           | Python script that is a multi-tool for both creating and
           | administering clusters.
           | 
           | https://docs.ceph.com/en/latest/cephadm/
           | 
           | To learn about Ceph, I recommend you create at least 3 KVM
           | virtual machines (using virt-manager) on a development box,
           | network them together, and use cephadm to set up a cluster
           | between the VMs. The RAM and storage requirements aren't huge
           | (Ceph can run on Raspberry Pis, after all) and I find it a
           | lot easier to figure things out when I have a desktop window
           | for every node.
           | 
           | I recently set up Ceph twice. Now that Ceph (specifically
           | RBD) is providing the storage for virtual machines, I can
           | live-migrate VMs between hosts and reboot hosts (with zero
           | guest downtime) anytime I need. I'm impressed with how well
           | it works.
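           | 
           | For reference, the happy path on a 3-VM setup like that looks
           | roughly like this (hostnames and IPs are invented; see the
           | cephadm docs for the full walkthrough):
           | 
           |     import subprocess
           | 
           |     def run(cmd):
           |         subprocess.run(cmd, shell=True, check=True)
           | 
           |     # on the first VM, as root
           |     run("cephadm bootstrap --mon-ip 192.168.122.11")
           | 
           |     # hand the cluster's SSH key to the other VMs, enroll
           |     # them, then let the orchestrator turn every empty disk
           |     # it finds into an OSD
           |     for host, ip in [("ceph-2", "192.168.122.12"),
           |                      ("ceph-3", "192.168.122.13")]:
           |         run("ssh-copy-id -f -i /etc/ceph/ceph.pub "
           |             f"root@{host}")
           |         run(f"ceph orch host add {host} {ip}")
           |     run("ceph orch apply osd --all-available-devices")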
        
         | ianlevesque wrote:
         | I played around with it and it has a very cool web UI, object
         | storage & file storage, but it was very hard to get decent
         | performance and it was possible to get the metadata daemons
         | stuck pretty easily with a small cluster. Ultimately when the
         | fun wore off I just put zfs on a single box instead.
        
         | reactordev wrote:
         | There's a blog post they did where they set up Ceph on some
         | RPi 4's. I'd say that's not significant hardware at all. [1]
         | 
         | [1] https://ceph.io/en/news/blog/2022/install-ceph-in-a-
         | raspberr...
        
           | m463 wrote:
           | I think "significant" turns out to mean the number of nodes
           | required.
        
         | m463 wrote:
         | I think you need 3 or was it 5 machines?
         | 
         | proxmox will use it - just click to install
        
         | victorhooi wrote:
         | I have some experience with Ceph, both for work, and with
         | homelab-y stuff.
         | 
         | First, bear in mind that Ceph is a _distributed_ storage system
         | - so the idea is that you will have multiple nodes.
         | 
         | For learning, you can definitely virtualise it all on a single
         | box - but you'll have a better time with discrete physical
         | machines.
         | 
         | Also, Ceph does prefer physical access to disks (similar to
         | ZFS).
         | 
         | And you do need decent networking connectivity - I think that's
         | the main thing people think of, when they think of high
         | hardware requirements for Ceph. Ideally 10Gbe at the minimum -
         | although more if you want higher performance - there can be a
         | lot of network traffic, particularly with things like backfill.
         | (25Gbps if you can find that gear cheap for homelab - 50Gbps is
         | a technological dead-end. 100Gbps works well).
         | 
         | But honestly, for a homelab, a cheap mini PC or NUC with 10Gbe
         | will work fine, and you should get acceptable performance, and
         | it'll be good for learning.
         | 
         | You can install Ceph directly on bare-metal, or if you want to
         | do the homelab k8s route, you can use Rook (https://rook.io/).
         | 
         | Hope this helps, and good luck! Let me know if you have any
         | other questions.
        
           | eurekin wrote:
           | NUC with 10gbit eth - can you recommend any?
        
             | justinclift wrote:
             | If you want something cheap, you could go with Lenovo
             | M720q's:
             | 
             | https://www.servethehome.com/lenovo-
             | thinkcentre-m720q-tinymi...
             | 
             | They have a PCIe slot and can take 8th/9th gen intel cpus
             | (6 core, etc). That PCIe slot should let you throw in a
             | decent network card (eg 10GbE, 25GbE, etc).
        
         | chomp wrote:
         | I've run Ceph in my home lab since Jewel (~8 years ago).
         | Currently up to 70TB storage on a single node. Have been pretty
         | successful vertically scaling, but will have to add a 2nd node
         | here in a bit.
         | 
         | Ceph isn't the fastest, but it's incredibly resilient and
         | scalable. Haven't needed any crazy hardware requirements, just
         | ram and an i7.
        
         | willglynn wrote:
         | The hardware minimums are real, and the complexity floor is
         | significant. Do _not_ deploy Ceph unless you mean it.
         | 
         | I started considering alternatives when my NAS crossed 100 TB
         | of HDDs, and when a scary scrub prompted me to replace all the
         | HDDs, I finally pulled the trigger. (ZFS resilvered everything
         | fine, but replacing every disk sequentially gave me a lot of
         | time to think.) Today I have far more HDD capacity and a few
         | hundred terabytes of NVMe, and despite its challenges, I
         | wouldn't dare run anything like it without Ceph.
        
           | samcat116 wrote:
           | Can I ask what you use all that storage for on your NAS?
        
         | mcronce wrote:
         | I run Ceph in my lab. It's pretty heavy on CPU, but it works
         | well as long as you're willing to spring for fast networking
         | (at least 10Gb, ideally 40+) and at least a few nodes with 6+
         | disks each if you're using spinners. You can probably get away
         | with far fewer disks per node if you're going all-SSD.
        
         | aaronax wrote:
         | I just set up a three-node Proxmox+Ceph cluster a few weeks
         | ago. Three Optiplex desktops 7040, 3060, and 7060 and 4x SSDs
         | of 1TB and 2TB mix (was 5 until I noticed one of my scavenged
         | SSDs was failed). Single 1gbps network on each so I am seeing
         | 30-120MB/s disk performance depending on things. I think in a
         | few months I will upgrade to 10gbps for about $400.
         | 
         | I'm about 1/2 through the process of moving my 15 virtual
         | machines over. It is a little slow but tolerable. Not having to
         | decide on RAIDs or a NAS ahead of time is amazing. I can throw
         | disks and nodes at it whenever.
        
         | sgarland wrote:
         | Yes. I first tried it with Rook, and that was a disaster, so I
         | shifted to Longhorn. That has had its own share of problems,
         | and is quite slow. Finally, I let Proxmox manage Ceph for me,
         | and it's been a dream. So far I haven't migrated my K8s
         | workloads to it, but I've used it for RDBMS storage (DBs in
         | VMs), and it works flawlessly.
         | 
         | I don't have an incredibly great setup, either: 3x Dell R620s
         | (Ivy Bridge-era Xeons), and 1GBe. Proxmox's corosync has a
         | dedicated switch, but that's about it. The disks are nice to be
         | fair - Samsung PM863 3.84 TB NVMe. They are absolutely
         | bottlenecked by the LAN at the moment.
         | 
         | I plan on upgrading to 10GBe as soon as I can convince myself
         | to pay for an L3 10G switch.
        
           | sixdonuts wrote:
           | Just get a 25G switch and MM fiber. 25G switches are cheaper,
           | use less power and can work with 10 and 25G SFPs.
        
             | sgarland wrote:
             | The main blocker (other than needing to buy new NICs, since
             | everything I have already came with quad 1/1/10/10) is I'm
             | heavily invested into the Ubiquiti ecosystem, and since
             | they killed off the USW-Leaf (and the even more brief UDC-
             | Leaf), they don't have anything that fits the bill.
             | 
             | I'm not entirely opposed to getting a Mikrotik or something
             | and it just being the oddball out, but it's nice to have
             | everything centrally managed.
             | 
             | EDIT: They do have the PRO-Aggregation, but there are only
             | 4x 25G ports. Technically it _would_ meet my needs for
             | Ceph, and Ceph only.
        
         | mmerlin wrote:
         | Proxmox makes Ceph easy, even with just one single server if
         | you are homelabbing...
         | 
         | I had 4 NUCs running Proxmox+Ceph for a few years, and apart
         | from slightly annoying slowness syncing after spinning the
         | machines up from cold start, it all ran very smoothly.
        
         | louwrentius wrote:
         | If you want decent performance, you need a lot of OSDs,
         | especially if you use HDDs. But a lot of consumer SSDs will
         | suffer terrible performance degradation with writes, depending
         | on the circumstances and workloads.
        
         | mikecoles wrote:
         | Works great, depending on what you want to do. Running on SBCs
         | or computers with cheap sata cards will greatly reduce the
         | performance. It's been running well for years after I found out
         | the issues regarding SMR drives and the SATA card bottlenecks.
         | 
         | 45Drives has a homelab setup if you're looking for a canned
         | solution.
        
         | antongribok wrote:
         | I run Ceph on some Raspberry Pi 4s. It's super reliable, and
         | with cephadm it's very easy[1] to install and maintain.
         | 
         | My household is already 100% on Linux, so having a native
         | network filesystem that I can just mount from any laptop is
         | very handy.
         | 
         | Works great over Tailscale too, so I don't even have to be at
         | home.
         | 
         | [1] I run a large install of Ceph at work, so "easy" might be a
         | bit relative.
        
           | dcplaya wrote:
           | What are your speeds? Do you rub ceph FS too?
           | 
           | I'm trying to do similar.
        
             | antongribok wrote:
             | It's been a while since I've done some benchmarks, but it
             | can definitely do 40MB/s sustained writes, which is very
             | good given the single 1GbE links on each node, and 5TB SMR
             | drives.
             | 
             | Latency is hilariously terrible though. It's funny to open
             | a text file over the network in vi, paste a long blob of
             | text and watch it sync that line by line over the network.
             | 
             | If by "rub" you mean scrub, then yes, although I increased
             | the scrub intervals. There's no need to scrub everything
             | every week.
        
       | stuff4ben wrote:
       | I used to love doing experiments like this. I was afforded that
       | luxury as a tech lead back when I was at Cisco setting up
       | Kubernetes on bare metal and getting to play with setting up
       | GlusterFS and Ceph just to learn and see which was better. This
       | was back in 2017/2018 if I recall. Good ole days. Loved this
       | writeup!
        
         | knicholes wrote:
         | I had to run a bunch of benchmarks to compare speeds of not
         | just AWS instance types, but actual individual instances in
         | each type, as some NVMe SSDs have seen more wear than others,
         | in order to lube up some Aerospike response times. Crazy.
        
           | j33zusjuice wrote:
           | Ad-tech, or?
        
             | knicholes wrote:
             | Yeah. Serving profiles for customized ad selection.
        
         | redrove wrote:
         | A Heketi man! I had the same experience around the same years,
         | what a blast. Everything was so new..and broken!
        
           | CTrox wrote:
           | Same here, still remember that time our Heketi DB partially
           | corrupted and we had to fix it up by exporting it to a
           | massive JSON file, fixing it by looking at the Gluster state
           | and importing it again. I can't quite remember the details
           | but I think it had to do with Gluster snapshots being out of
           | sync with the state in the DB.
        
       | amluto wrote:
       | I wish someone would try to scale the nodes down. The system
       | described here is ~300W/node for 10 disks/node, so 30W or so per
       | disk. That's a fair amount of overhead, and it also requires
       | quite a lot of storage to get any redundancy at all.
       | 
       | I bet some engineering effort could divide the whole thing by 10.
       | Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+
       | sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an
       | eMMC chip or SD slot for boot.
       | 
       | This could scale down to just a few nodes, and it reduces the
       | exposure to a single failure taking out 10 disks at a time.
       | 
       | I bet a lot of copies of this system could fit in a 4U enclosure.
       | Optionally the same enclosure could contain two entirely
       | independent switches to aggregate the internal nodes.
        
         | jeffbee wrote:
         | I think the chief source of inefficiency in this architecture
         | would be the NVMe controller. When the operating system and the
         | NVMe device are at arm's length, there is natural inefficiency,
         | as the controller needs to infer the intent of the request and
         | do its best in terms of placement and wear leveling. The new
         | FDP (flexible data placement) features try to address this by
         | giving the operating system more control. The best thing would
         | be to just hoist it all up into the host operating system and
         | present the flash, as nearly as possible, as a giant field of
         | dumb transistors that happens to be a PCIe device. With layers
         | of abstraction removed, the hardware unit could be something
         | like an Atom with integrated 100gbps NICs and a proportional
         | amount of flash to achieve the desired system parallelism.
        
         | booi wrote:
         | Is that a lot of overhead? The disk itself uses about 10W, and
         | high speed controllers use about 75W, which leaves pretty much
         | 100W for the rest of the system, including overhead of about
         | 10%. Scale up the system to 16 disks and there's not a lot of
         | room for improvement.
        
         | kbenson wrote:
         | There probably is a sweet spot for power to speed, but I think
         | it's possibly a bit larger than you suggest. There's overhead
         | from the other components as well. For example, the Mellanox
         | NIC seems to utilize about 20W itself, and while the reduced
         | numbers of drives might allow for a single port NIC which seems
         | to use about half the power, if we're going to increase the
         | number of cables (3 per 12 disks instead of 2 per 5), we're not
         | just increasing the power usage of the nodes themselves but
         | also possibly increasing the power usage or changing the type
         | of switch required to combine the nodes.
         | 
         | If looked at as a whole, it appears to be more about whether
         | you're combining resources at a low level (on the PCI bus on
         | nodes) or a high level (in the switching infrastructure), and
         | we should be careful not to push power (or complexity, as is
         | often a similar goal) to a separate part of the system that is
         | out of our immediate thoughts but still very much part of the
         | system. Then again, sometimes parts of the system are much
         | better at handling the complexity for certain cases, so in
         | those cases that can be a definite win.
        
         | Palomides wrote:
         | here's a weird calculation:
         | 
         | this cluster does something vaguely like 0.8 gigabits per
         | second per watt (1 terabyte/s * 8 bits per byte * 1024 Gb per
         | TB / 34 nodes / 300 watts)
         | 
         | a new mac mini (super efficient arm system) runs around 10
         | watts in interactive usage and can do 10 gigabits per second
         | network, so maybe 1 gigabit per second per watt of data
         | 
         | so OP's cluster, back of the envelope, is basically the same
         | bits per second per watt that a very efficient arm system can
         | do
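         | 
         | spelling that arithmetic out (same assumptions as above, 34
         | nodes at ~300 watts each):
         | 
         |     cluster_gbit_s = 1 * 8 * 1024      # ~1 TiB/s in Gbit/s
         |     cluster_watts = 34 * 300
         |     print(cluster_gbit_s / cluster_watts)  # ~0.8 Gbit/s/W
         | 
         |     mac_mini_gbit_per_watt = 10 / 10   # 10GbE at ~10 W
         |     print(mac_mini_gbit_per_watt)      # ~1 Gbit/s/W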
         | 
         | I don't think running tiny nodes would actually get you any
         | more efficiency, and would probably cost more! performance per
         | watt is quite good on powerful servers now
         | 
         | anyway, this is all open source software running on off-the-
         | shelf hardware, you can do it yourself for a few hundred bucks
        
           | 3abiton wrote:
           | Trusting your maths, damn Apple did a great job on their M
           | design.
        
             | hirako2000 wrote:
             | Didn't ARM (the company that originally designed ARM
             | processors) do most of that job, with Apple pushing perf
             | per watt even further?
        
           | amluto wrote:
           | I think the Mac Mini has massively more compute than needed
           | for this kind of work. It also has a power supply, and
           | computer power supplies are generally not amazing at low
           | output.
           | 
           | I'm imagining something quite specialized. Use a low
           | frequency CPU with either vector units or even DMA engines
           | optimized for the specific workloads needed, or go all out
           | and arrange for data to be DMAed directly between the disk
           | and the NIC.
        
             | Palomides wrote:
             | sounds like a DPU (mellanox bluefield for example), they're
             | entire ARM systems with a high speed NIC all on a PCIe
             | card, I think the bluefield ones can even directly
             | interface over the bus to nvme drives without the host
             | system involved
        
               | amluto wrote:
               | That Bluefield hardware looks neat, although it also
               | sounds like a real project to program it :).
               | 
               | I can imagine two credible configurations for high
               | efficiency:
               | 
               | 1. A motherboard with a truly minimal CPU for
               | bootstrapping but a bit beefy PCIe root complex. 32 lanes
               | to the DPU and a bunch of lanes for NVMe. The CPU doesn't
               | touch the data at all. I wonder if anyone makes a
               | motherboard optimized like this -- a 64-lane mobo with a
               | Xeon in it would be quite wasteful but fine for
               | prototyping I suppose.
               | 
               | 2. Wire up the NVMe ports directly to the Bluefield DPU,
               | letting the DPU be the root complex. At least 28 of the
               | lanes are presumably usable for this or maybe even all
               | 32. It's not entirely clear to me that the Bluefield DPU
               | can operate without a host computer, though.
        
           | hirako2000 wrote:
           | I checked selling prices of those racks + top end SSDs: this
           | 1 TiB/s achievement runs on roughly $4 million worth of
           | hardware. Or more, since I didn't check the cost of the
           | networking gear.
           | 
           | But yeah, it could run on commodity hardware. Not sure those
           | highly efficient ARM chips packaged at a premium by Apple
           | would beat the Dell racks on throughput relative to hardware
           | investment cost, though.
        
             | amluto wrote:
             | Dell's list prices have essentially nothing to do with the
             | prices that any competent buyer would actually pay,
             | _especially_ when storage is involved. Look at the prices
             | of Dell disks, which are nothing special compared to name
             | brand disks of equal or better spec and _much_ lower list
             | price.
             | 
             | I don't know what discount large buyers get, but I wouldn't
             | be surprised if it's around 75%.
        
           | georgyo wrote:
           | You're comparing one machine with many machines.
           | 
           | You're comparing raw disks with shards and erasure
           | encoding.
           | 
           | Lastly, you're comparing only network bandwidth and not
           | storage capacity.
        
         | evanreichard wrote:
         | I used to run a 5 node Ceph cluster on a bunch of ODROID-HC2's
         | [0]. Was a royal pain to get installed (armhf processor). But
         | once it was running it worked great. Just slow with the single
         | 1Gb NIC.
         | 
         | Was just a learning experience at the time.
         | 
         | [0] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
        
           | eurekin wrote:
           | Same here, but on Pi 4B's. 6-node cluster with a 2TB HDD and
           | 512GB SSD per node. Ceph made a huge impression on me, as in
           | I hadn't realized how extensive the package was. I got up
           | to 122MB/s and thought it was too little for my hack-NAS
           | replacement :)
           | 
           | The functionality was very impressive: mixing various pool
           | types on the same set of SSDs, different redundancy types
           | (erasure coded, replicated). Now I can't help but look
           | down at a RAID NAS in comparison. Still, some extra packages
           | like the NFS exporter were not ready for the ARM
           | architecture.
        
         | somat wrote:
         | I have always wanted to set up a ceph system with one drive per
         | node. The ideal form factor would be a drive with a couple
         | network interfaces built in. Western Digital had a press
         | release about an experiment they did that was exactly this,
         | but it never ended up as a drive you could buy.
         | 
         | The hardkernel HC2 SOC was a nearly ideal form factor for this,
         | and I still have a stack of them laying around that I bought to
         | make a ceph cluster, but I ran out of steam when I figured out
         | they were 32-bit. Not to say it would be impossible; I just
         | never did it.
        
           | progval wrote:
           | I used to use Ceph Luminous (v12) on these, they worked fine.
           | Unfortunately, a bug in Nautilus (v14) prevented 32-bit and
           | 64-bit archs from talking to each other. Pacific (v16)
           | allegedly solves this, but I didn't try it:
           | https://ceph.com/en/news/blog/2021/v16-2-5-pacific-released/
           | 
           | If you want to try it with a more modern (and 64-bits)
           | device, the hardkernel HC4 might do it for you. It's
           | conceptually similar to the HC2 but has two drives.
           | Unfortunately it only has double the RAM (4GB), which is
           | probably not enough anymore.
        
             | eurekin wrote:
             | Looks so good, wish for a > 1gbit version, since HDDs alone
             | can saturate that
        
               | progval wrote:
               | Did you look at their H3? It's pricier but it has two
               | 2.5Gbits ports (along with a NVMe slot and an Intel CPU)
        
               | eurekin wrote:
               | I have one and love it! It bravely holds together my
               | intranet dev services :)
               | 
               | For a ceph node would still consider a version with
               | 10gbit eth
        
           | narism wrote:
           | Sounds like the Seagate Kinetic HDD:
           | https://www.seagate.com/www-content/product-content/hdd-
           | fam/...
        
             | somat wrote:
             | That would be perfect. Unfortunately, going by the data
             | sheet it would not run Ceph; you would have to work with
             | Seagate's proprietary object store. I will note that as far
             | as I can tell it is unobtainium. None of the usual vendors
             | stock them; you probably have to prove to Seagate that you
             | are a "serious enterprise customer" and commit to a
             | thousand units before they will let you buy some.
        
         | walrus01 wrote:
         | 10 Gbps is increasingly obsolete with very low cost 100 Gbps
         | switches and 100Gbps interfaces. Something would have to be
         | really tiny and low cost to justify doing a ceph setup with
         | 10Gbps interfaces now... If you're at that scale of very small
         | stuff you are probably better off doing local NVME storage on
         | each server instead.
        
         | wildylion wrote:
         | IIRC, WD has experimented with placing Ethernet and some
         | compute directly onto hard drives some time back.
         | 
         |  _sigh_ I used to do some small-scale Ceph back in 2017 or
         | so...
        
       | mrb wrote:
       | I wanted to see how 1 TiB/s compares to the actual theoretical
       | limits of the hardware. So here is what I found:
       | 
       | The cluster has 68 nodes, each a Dell PowerEdge R6615
       | (https://www.delltechnologies.com/asset/en-
       | us/products/server...). The R6615 configuration they run is the
       | one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe
       | gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes
       | have negligible (~1.5%) overhead thanks to 128b/130b encoding.
       | 
       | This means each U.2 link has a maximum link bandwidth of 16 * 4 =
       | 64 Gbit/s or 8 Gbyte/s. However the U.2 NVMe drives they use are
       | Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to
       | be capable of 7 Gbyte/s read throughput (https://www.serversupply
       | .com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not
       | bottlenecked by the U.2 link (8 Gbyte/s).
       | 
       | Each node has 10 U.2 drives, so each node can do local read I/O at
       | a maximum of 10 * 7 = 70 Gbyte/s.
       | 
       | However each node has a network bandwidth of only 200 Gbit/s (2 x
       | 100GbE Mellanox ConnectX-6) which is only 25 Gbyte/s. This
       | implies that remote reads are under-utilizing the drives (capable
       | of 70 Gbyte/s). The network is the bottleneck.
       | 
       | Assuming no additional network bottlenecks (they don't describe
       | the network architecture), this implies the 68 nodes can provide
       | 68 * 25 = 1700 Gbyte/s of network reads. The author actually
       | benchmarked 1025 GiB/s = 1101 Gbyte/s, which is 65% of the maximum
       | theoretical 1700 Gbyte/s. That's pretty decent, but
       | in theory it's still possible to be doing a bit better assuming
       | all nodes can concurrently truly saturate their 200 Gbit/s
       | network link.
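       | 
       | (A minimal Python sketch of the arithmetic above; every input is
       | a figure quoted in this comment rather than something measured
       | independently:)
       | 
       |     GBIT = 1e9 / 8                      # bytes per Gbit
       |     u2_link = 4 * 16 * GBIT             # PCIe gen4 x4: 8 GB/s
       |     drive = 7e9                         # drive spec: 7 GB/s
       |     node_disk = 10 * min(u2_link, drive)       # 70 GB/s
       |     node_net = 200 * GBIT                      # 25 GB/s
       |     cluster = 68 * min(node_disk, node_net)    # 1700 GB/s
       |     measured = 1025 * 2**30                    # 1025 GiB/s
       |     print(f"ceiling {cluster/1e9:.0f} GB/s, "
       |           f"measured {measured/1e9:.0f} GB/s "
       |           f"({measured/cluster:.0%} of ceiling)")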
       | 
       | Reading this whole blog post, I got the impression ceph's
       | complexity hits the CPU pretty hard. Not compiling a module with
       | -O2 ("Fix Three": linked by the author:
       | https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can
       | reduce performance "up to 5x slower with some workloads"
       | (https://bugs.gentoo.org/733316) is pretty unexpected, for a pure
       | I/O workload. Also, what's up with OSD threads causing excessive
       | CPU waste grabbing the IOMMU spinlock? I agree with the
       | conclusion that the OSD threading model is suboptimal. A
       | relatively simple synthetic 100% read benchmark should not expose
       | threading contention if that part of ceph's software architecture
       | were well designed (this is fixable, though, so I hope the ceph
       | devs prioritize it).
        
         | wmf wrote:
         | I think PCIe TLP overhead and NVMe commands account for the
         | difference between 7 and 8 GB/s.
        
           | mrb wrote:
           | You are probably right. Reading some old notes of mine when I
            | was fine-tuning PCIe bandwidth on my ZFS server, I had
           | discovered back then that a PCIe Max_Payload_Size of 256
           | bytes limited usable bandwidth to about 74% of the link's
           | theoretical max. I had calculated that 512 and 1024 bytes
           | (the maximum) would raise it to respectively about 86% and
           | 93% (but my SATA controllers didn't support a value greater
           | than 256.)
        
             | _zoltan_ wrote:
             | Mellanox recommends setting this from the default 512 to
             | 4096 on their NICs.
        
         | magicalhippo wrote:
         | They're benchmarking random IO though, and the disks can "only"
          | do a bit over 1M random 4K read IOPS, which translates to
          | about 5 GiB/s. With 320 OSDs that's around 1.6 TiB/s.
         | 
          | At least that's the number I could find. Not exactly tons of
         | reviews on these enterprise NVMe disks...
         | 
         | Still, that seems like a good match to the NICs. At this scale
         | most workloads will likely appear as random IO at the storage
         | layer anyway.
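          | 
          | (A rough sketch of that conversion in Python; the 1.3M
          | per-drive figure is just my reading of "a bit over 1M", so
          | treat it as an assumption:)
          | 
          |     iops = 1.3e6                       # 4K read IOPS per drive
          |     per_drive = iops * 4096 / 2**30    # ~5 GiB/s per drive
          |     total = 320 * per_drive / 1024     # ~1.6 TiB/s for 320 OSDs
          |     print(f"{per_drive:.1f} GiB/s per OSD, "
          |           f"{total:.1f} TiB/s for 320 OSDs")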
        
           | mrb wrote:
            | The benchmark where they achieve 1025 GiB/s is for
            | sequential reads. For random reads they do 25.5M IOPS or ~100
           | GiB/s. See last table, column "630 OSDs (3x)".
        
             | magicalhippo wrote:
             | Oh wow how did I miss that table, cheers.
        
         | markhpc wrote:
         | I wanted to chime in and mention that we've never seen any
         | issues with IOMMU before in Ceph. We have a previous generation
         | of the same 1U chassis from Dell with AMD Rome processors in
         | the upstream ceph lab and they don't suffer from the same issue
         | despite performing similarly at the same scale (~30 OSDs). The
         | customer did say they've seen this in the past in their data
         | center. I'm hoping we can work with AMD to figure out what's
         | going on.
         | 
         | I did some work last summer kind of duct taping the OSD's
         | existing threading model (double buffering the hand-off between
         | async msgr and worker threads, adaptive thread wakeup, etc). I
         | could achieve significant performance / efficiency gains under
         | load, but at the expense of increased low-load latency (Ceph by
         | default is very aggressive about waking up threads when new IO
         | arrives for a given shard).
         | 
         | One of the other core developers and I discussed it and we both
         | came to the conclusion that it probably makes sense to do a
         | more thorough rewrite of the threading code.
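          | 
          | (To illustrate the general idea only: this is not Ceph code,
          | and the names and thresholds are made up. A toy
          | double-buffered hand-off where the worker is woken once per
          | batch instead of once per message, with a bounded wait to cap
          | the extra low-load latency:)
          | 
          |     import threading
          | 
          |     class BatchedHandoff:
          |         def __init__(self, wake_batch=32, max_wait=0.001):
          |             self._cv = threading.Condition()
          |             self._front = []              # producer-side buffer
          |             self._wake_batch = wake_batch
          |             self._max_wait = max_wait     # caps added latency
          | 
          |         def push(self, item):             # "msgr" thread side
          |             with self._cv:
          |                 self._front.append(item)
          |                 if len(self._front) >= self._wake_batch:
          |                     self._cv.notify()     # wake once per batch
          | 
          |         def drain(self):                  # worker thread side
          |             with self._cv:
          |                 if not self._front:
          |                     self._cv.wait(timeout=self._max_wait)
          |                 batch, self._front = self._front, []   # swap
          |             return batch          # processed outside the lock
          | 
          |     if __name__ == "__main__":
          |         q, done = BatchedHandoff(), []
          |         stop = threading.Event()
          | 
          |         def worker():
          |             while not stop.is_set():
          |                 done.extend(q.drain())
          |             done.extend(q.drain())   # pick up the leftovers
          | 
          |         t = threading.Thread(target=worker)
          |         t.start()
          |         for i in range(10_000):
          |             q.push(i)
          |         stop.set()
          |         t.join()
          |         print("processed", len(done), "messages")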
        
       | MPSimmons wrote:
       | The worst problems I've had with in-cluster dynamic storage were
       | never strictly IO related, and were more about the storage
       | controller software in Kubernetes struggling with real-world
       | events like pods dying and the PVCs not attaching until after
       | very long timeouts expired, with the pod sitting in
       | ContainerCreating until the PVC lock was freed.
       | 
       | This has happened in multiple clusters, using rook/ceph as well
       | as Longhorn.
        
       | einpoklum wrote:
       | Where can I read about the rationale for ceph as a project? I'm
       | not familiar with it.
        
         | jseutter wrote:
         | http://www.45drives.com/blog/ceph/what-is-ceph-why-our-custo...
         | is a pretty good introduction. Basically you can take off-the-
         | shelf hardware and keep expanding your storage cluster and ceph
         | will scale fairly linearly up through hundreds of nodes. It is
         | seeing quite a bit of use in things like Kubernetes and
         | OpenShift as a cheap and cheerful alternative to SANs. It is
         | not without complexity, so if you don't know you need it, it's
         | probably not worth the hassle.
        
         | jacobwg wrote:
         | Not sure how common the use-case is, but we're using Ceph to
         | effectively roll our own EBS inside AWS on top of i3en EC2
         | instances. For us it's about 30% cheaper than the base EBS
         | cost, but provides access to 10x the IOPS of base gp3 volumes.
         | 
         | The downside is durability and operations - we have to keep
         | Ceph alive and are responsible for making sure the data is
         | persistent. That said, we're storing cache from container
         | builds, so in the worst-case where we lose the storage cluster,
         | we can run builds without cache while we restore.
        
       | alberth wrote:
       | Ceph has an interesting history.
       | 
       | It was created at Dreamhost (DH), for their internal needs by the
       | founders.
       | 
       | DH was effectively doing IaaS & PaaS before those were
       | industry-coined terms (VPS, managed OS/database/app-servers).
       | 
       | They spun Ceph off and Red Hat bought it.
       | 
       | https://en.wikipedia.org/wiki/DreamHost
        
         | epistasis wrote:
         | A bit more to the story is that it was created also at UC Santa
         | Cruz, by Sage Weil, a Dreamhost founder, while he was doing
         | graduate work there. UCSC has had a lot of good storage
         | research.
        
           | dekhn wrote:
           | the fighting banana slugs
        
           | AdamJacobMuller wrote:
            | I remember the first time I deployed ceph, which would have
            | been around 2010 or 2011. I had some really major issues which
            | nearly resulted in data loss, and because someone else didn't
            | realize what "this cluster is experimental, do not store any
            | important data here" meant, the data on ceph was the only copy
            | of the irreplaceable data in the world; losing the data would
            | have been fairly catastrophic for us.
           | 
           | I ended up on the ceph IRC channel and eventually had Sage
           | helping me fix the issues directly, helping me find bugs and
           | writing patches to fix them in realtime.
           | 
            | Super amazingly nice guy: he was willing to help, never once
            | chastised me for being so stupid (even though I was), and he's
            | also wicked smart.
        
           | antongribok wrote:
           | Sage is one of the nicest, down to earth, super smart
           | individuals I've met.
           | 
           | I've talked to him at a few OpenStack and Ceph conferences,
           | and he's always very patient answering questions.
        
         | artyom wrote:
         | Yeah, as a customer (still one) I remember their "Hey, we're
         | going to build this Ceph thing, maybe it ends up being cool"
         | blog entry (or newsletter?) kinda just sharing what they were
         | toying with. It was a time of no marketing copy and not
         | crafting every sentence to sell you things.
         | 
         | I think it was the university project of one of the founders,
          | and the others jumped in to support it. Docker has a similar
          | origin story as far as I know.
        
           | pas wrote:
           | https://en.wikipedia.org/wiki/Sage_Weil right?
           | 
           | https://ceph.com/assets/pdfs/weil-crush-sc06.pdf
        
       | louwrentius wrote:
       | Remember, random IOPS without latency is a meaningless figure.
        
       | peter_d_sherman wrote:
       | Ceph is interesting... open source software whose only purpose is
       | to implement a distributed file system...
       | 
       | Functionally, Linux implements a file system (well, several!) as
       | well (in addition to many other OS features) -- but (usually!)
       | only on top of local hardware.
       | 
       | There seems to be some missing software here -- if we examine
       | these two paradigms side-by-side.
       | 
       | For example, what if I want a Linux (or more broadly, a general
       | OS) -- but one that doesn't manage a local file system or local
       | storage at all?
       | 
       | One that operates solely using the network, solely using a
       | distributed file system that Ceph, or software like Ceph, would
       | provide?
       | 
       | Conversely, what if I don't want to run a full OS on a network
       | machine, a network node that manages its own local storage?
       | 
       | The only thing I can think of to solve those types of problems --
       | is:
       | 
       |  _What if the Linux filesystem were written as a completely
       | separate piece of software, like a distributed file system such as
       | Ceph, and not dependent on the rest of the kernel source code_
       | (although still compilable into the kernel, as most Linux
       | components normally are)...
       | 
       | A lot of work? Probably!
       | 
       | But there seems to be some software need for something between a
       | solely distributed file system as Ceph is, and a completely
       | monolithic "everything baked in" (but not distributed!) OS/kernel
       | as Linux is...
       | 
       | Note that I am just thinking aloud here -- I probably am wrong
       | and/or misinformed on one or more fronts!
       | 
       | So, kindly take this random "thinking aloud" post -- with the
       | proverbial "grain of salt!" :-)
        
         | wmf wrote:
         | _what if I want a Linux ... that doesn 't manage a local file
         | system or local storage at all [but] operates solely using the
         | network, solely using a distributed file system_
         | 
         | Linux can boot from NFS although that's kind of lost knowledge.
         | Booting from CephFS might even be possible if you put the right
         | parts in the initrd.
        
           | lmz wrote:
           | NFS root docs here https://www.kernel.org/doc/Documentation/f
           | ilesystems/nfs/nfs...
        
             | peter_d_sherman wrote:
             | NFS is an excellent point!
             | 
             | NFS (now that I think about it!) -- brings up two
             | additional software engineering considerations:
             | 
             | 1) Distributed _file system protocol_.
             | 
             | 2) Software that implements that distributed (or at least
             | remote/network) file system -- via that _file system
             | protocol_.
             | 
             | NFS is both.
             | 
              | That's not a bad thing(!) -- but from a software
              | engineering "separation of concerns" perspective, this
              | future software layer would ideally be decoupled from the
              | underlying protocol -- that is, it might have a "plug-in"
              | protocol architecture, where various 3rd-party file system
              | protocols (somewhat analogous to drivers) could be "plugged
              | in"...
             | 
             | But NFS could definitely be used to boot/run Linux over the
             | network, and is definitely a step in the right direction,
             | and something worth evaluating for these purposes... its
             | source code is definitely worth looking at...
             | 
             | So, an excellent point!
        
         | plagiarist wrote:
          | It sounds like you want microkernels, and I agree, it would be
          | nice.
        
       | chx wrote:
       | There was a point in history when the total amount of digital
       | data stored worldwide reached 1TiB for the first time. It is
       | extremely likely this day was within the last sixty years.
       | 
       | And here we are, moving that amount of data every second on the
       | servers of a fairly random entity. We're not talking about a
       | nation state or a supranational research effort.
        
         | qingcharles wrote:
         | That reminds me of a calculation I did which showed that my
         | desktop PC would be more powerful than all of the computers on
         | the planet combined in like 1978 :D
        
           | plagiarist wrote:
           | My phone has more computation than anything I would have
           | imagined owning, and I sometimes turn on the screen just to
           | use as a quick flashlight.
        
             | qingcharles wrote:
             | Haha.. imagine taking it back to 1978 and showing how it
             | has more computing power than the entire planet and then
             | telling them that you mostly just use it to find that thing
             | you lost under the couch :D
        
         | fiddlerwoaroof wrote:
          | It was at least 20-ish years ago: I remember an old sysadmin
          | talking about managing petabytes before 2003.
        
           | aspenmayer wrote:
           | Those numbers seem reasonable in that context. I first
           | started using BitTorrent around that time as well, and it
           | wasn't uncommon to see many users long-term seeding multiple
           | hundreds of gigabytes of Linux ISOs alone.
           | 
           | Here's another usage scenario with data usage numbers I found
           | a while back.
           | 
           | > A 2004 paper published in ACM Transactions on Programming
           | Languages and Systems shows how Hancock code can sift calling
           | card records, long distance calls, IP addresses and internet
           | traffic dumps, and even track the physical movements of
           | mobile phone customers as their signal moves from cell site
           | to cell site.
           | 
           | > With Hancock, "analysts could store sufficiently precise
           | information to enable new applications previously thought to
           | be infeasible," the program authors wrote. AT&T uses Hancock
           | code to sift 9 GB of telephone traffic data a night,
           | according to the paper.
           | 
           | https://web.archive.org/web/20200309221602/https://www.wired.
           | ..
        
             | ComputerGuru wrote:
             | I archived Hancock here over a decade ago, stumbled upon it
             | via HN at the time if I'm not mistaken:
             | https://github.com/mqudsi/hancock
        
               | aspenmayer wrote:
                | That's pretty cool. I remember someone on that repo from
                | a while back and was surprised to see their name pop up
                | again. Thanks for archiving this!
               | 
               | Corinna Cortes et al wrote the paper(s) on Hancock and
               | also the Communities of Interest paper referenced in the
               | Wired article I linked to. She's apparently a pretty big
               | deal and went on to work at Google after her prestigious
               | work at AT&T.
               | 
               | Hancock: A Language for Extracting Signatures from Data
               | 
               | https://scholar.google.com/citations?view_op=view_citatio
               | n&h...
               | 
               | Hancock: A Language for Analyzing Transactional Data
               | Streams
               | 
               | https://scholar.google.com/citations?view_op=view_citatio
               | n&h...
               | 
               | Communities of Interest
               | 
               | https://scholar.google.com/citations?view_op=view_citatio
               | n&h...
        
             | fiddlerwoaroof wrote:
             | Yeah, at the other end of the scale, it sounds like Apple
             | is now managing exabytes:
             | https://read.engineerscodex.com/p/how-apple-built-icloud-
             | to-...
             | 
             | This is pretty mind-boggling to me.
        
           | chx wrote:
            | It must be much more than 20-ish years. Some 2400 ft reels
            | in the 60s stored a few megabytes, so you would need a few
            | hundred thousand of those to reach a terabyte.
            | https://en.wikipedia.org/wiki/IBM_7330
           | 
           | > a single 2400-foot tape could store the equivalent of some
           | 50,000 punched cards (about 4,000,000 six-bit bytes).
           | 
            | In 1964, with the introduction of System/360, you go an order
            | of magnitude higher:
           | https://www.core77.com/posts/108573/A-Storage-Cabinet-
           | Based-...
           | 
           | > It could store a maximum of 45MB on 2,400 feet
           | 
            | At this point you only need a few tens of thousands of reels
            | in existence to reach a terabyte. So I strongly suspect the
            | "terabyte point" came some time in the 1960s.
        
           | chx wrote:
            | I raised this on the Retrocomputing Stack Exchange, and
            | https://retrocomputing.stackexchange.com/a/28322/3722 notes
            | that a TiB of digital data was likely reached in the 1930s
            | with punch cards.
        
       | one_buggy_boi wrote:
       | Is modern Ceph appropriate for transactional database storage,
       | and how is the IO latency? I'd like to move to a cheaper clustered
       | file system that can compete with systems like Oracle's clustered
       | file system or DBs backed by something like Veritas. Veritas
       | supports multi-petabyte DBs and I haven't seen much outside of it
       | or OCFS that similarly scales with acceptable latency.
        
         | samcat116 wrote:
          | Latency is quite poor; I wouldn't recommend running
          | high-performance database loads there.
        
           | louwrentius wrote:
           | From my dated experience, Ceph is absolutely amazing but
           | latency is indeed a relative weak spot.
           | 
            | Everything has a trade-off, and with Ceph you get a ton of
            | capability, but latency is one of those trade-offs. Databases
            | - depending on requirements - may be better off on regular
            | NVMe and not on Ceph.
        
         | antongribok wrote:
         | Not sure about putting DBs on CephFS directly, but Ceph RBD can
         | definitely run RDBMS workloads.
         | 
         | You need to pay attention to the kind of hardware you use, but
         | you can definitely get Ceph down to 0.5-0.6 ms latency on block
         | workloads doing single thread, single queue, sync 4K writes.
         | 
          | Source: I run Ceph at work doing pretty much this.
        
           | patrakov wrote:
           | It is important to specify which kind of latency percentile
           | this is. Checking on a customer's cluster (made from 336 SATA
            | SSDs in 15 servers, so not the best one in the world):
            | 
            |       50th percentile = 1.75 ms
            |       90th percentile = 3.15 ms
            |       99th percentile = 9.54 ms
           | 
           | That's with 700 MB/s of reads and 200 MB/s of writes, or
            | approximately 7000 read IOPS and 9000 write IOPS.
        
             | louwrentius wrote:
              | These numbers may be good enough for your use case, but
              | compared to what's possible with SSDs they aren't great.
              | Please note, I mean well; it's still a cool setup.
              | 
              | I'd like to see much more latency consistency, with the 99th
              | percentile even sub-millisecond. You might want to set a
              | latency target with fio and see what kind of load is
              | possible before the 99th percentile hits 1 ms.
              | 
              | However, it's all about context, and depending on the
              | workload your figures may be totally fine.
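              | 
              | (A sketch of that kind of run, driving fio from Python;
              | the device path and the depth/runtime limits below are
              | placeholders, not recommendations. fio's latency_target
              | mode ramps queue depth while trying to keep the chosen
              | percentile under the target:)
              | 
              |     import subprocess
              | 
              |     subprocess.run([
              |         "fio", "--name=ceph-4k-sync-write",
              |         "--filename=/dev/rbd0",   # a mapped RBD image
              |         "--ioengine=libaio", "--direct=1", "--sync=1",
              |         "--rw=randwrite", "--bs=4k",
              |         "--iodepth=64",           # ramp ceiling
              |         "--latency_target=1ms",
              |         "--latency_window=10s",
              |         "--latency_percentile=99",
              |         "--time_based", "--runtime=120",
              |         "--output-format=json",
              |     ], check=True)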
        
       | rafaelturk wrote:
       | I'm playing a lot with MicroCeph. It's an opinionated, friendly
       | way to set up Ceph. Looking forward to additional comments.
       | Planning to use it in production and replace lots of NAS servers.
        
         | louwrentius wrote:
         | I think Ceph can be fine for NAS use cases, but be wary of
         | latency and do some benchmarking. You may need more nodes/osds
         | than you think to reach latency and throughput targets.
        
       | hinkley wrote:
       | Sure would be nice if you defined some acronyms.
        
       | nghnam wrote:
       | My old company ran public and private cloud with Openstack and
       | Ceph. We had 20 Supermicro (24 disks per server) storage nodes
       | and total capacity was 3PB. We learned some lessons, especially
       | that a flapping disk degraded the performance of the whole system.
       | The solution was to remove a disk with bad sectors as soon as
       | possible.
        
       | amadio wrote:
       | Nice article! We've also recently reached the mark of 1TB/s at
       | CERN, but with EOS (https://cern.ch/eos), not ceph:
       | https://www.home.cern/news/news/computing/exabyte-disk-stora...
       | 
       | Our EOS clusters have a lot more nodes, however, and use mostly
       | HDDs. CERN also uses ceph extensively.
        
         | theyinwhy wrote:
         | Great! What's your take on ceph? Is the idea to migrate to EOS
         | long term?
        
           | amadio wrote:
           | EOS and ceph have different use cases at CERN. EOS holds
           | physics data and user data in CERNBox, while ceph is used for
           | a lot of the rest (e.g. storage for VMs, and other
           | applications). So both will continue to be used as they are
           | now. CERN has over 100PB on ceph.
        
             | ComputerGuru wrote:
             | Is there a reason you run both and don't converge on one or
             | the other?
        
       | brobinson wrote:
       | I'm curious what the performance difference would be on a modern
       | kernel.
        
         | PiratesScorn wrote:
         | For context, I've been leading the work on this cluster client-
         | side (not the engineer that discovered the IOMMU fix) with
         | Clyso.
         | 
         | There was no significant difference when testing between the
         | latest HWE on Ubuntu 20.04 and kernel 6.2 on Ubuntu 22.04. In
         | both cases we ran into the same IOMMU behaviour. Our tooling is
          | all very much geared around Ubuntu, so testing newer kernels
         | with other distros just wasn't feasible in the timescale we had
         | to get this built. The plan was < 2 months from initial design
         | to completion.
         | 
         | Awesome to see this on HN, we're a pretty under-the-radar
         | operation so there's not much more I can say but proud to have
         | worked on this!
        
       | kylegalbraith wrote:
       | This is a fascinating read. We run a Ceph storage cluster for
       | persisting Docker layer cache [0]. We went from using EBS to Ceph
       | and saw a massive difference in throughput. Went from a write
       | throughput of 146 MB/s and 3,000 IOPS to 900 MB/s and 30,000
       | IOPS.
       | 
       | The best part is that it pretty much just works. Very little
       | babysitting with the exception of the occasional fs trim or
       | something.
       | 
       | It's been a massive improvement for our caching system.
       | 
       | [0] https://depot.dev/blog/cache-v2-faster-builds
        
         | guywhocodes wrote:
          | We did something very similar almost 10 years ago; EBS cost
          | 10x+ what a Ceph cluster with the same performance on the node
          | disks did. Eventually we switched to our own racks and cut costs
          | almost tenfold again. We developed the in-house expertise for
          | how to do it and we were free.
        
       | louwrentius wrote:
       | I wrote an intro to Ceph[0] for those who are new to it.
       | 
       | It briefly featured in a recent Jeff Geerling video :-)
       | 
       | [0]: Understanding Ceph: open-source scalable storage
       | https://louwrentius.com/understanding-ceph-open-source-scala...
        
         | justinclift wrote:
         | Has anything important changed since 2018, when you wrote that?
         | :)
        
           | louwrentius wrote:
           | Conceptually not as far as I know.
        
       | mobilemidget wrote:
       | Cool benchmark, and interesting; however, it would have read a
       | lot better if abbreviations were explained at first use. Not
       | everybody is familiar with all the terminology used in the post.
       | Nonetheless, congrats on the results.
        
         | markhpc wrote:
         | Thanks (truly) for the feedback! I'll try to remember for
         | future articles. It's easy to forget how much jargon we use
         | after being in the field for so long.
        
       | up2isomorphism wrote:
       | This is an insanely expensive cluster built to show a benchmark.
       | 68 node cluster serving only 15TB storage in total.
        
         | PiratesScorn wrote:
         | The purpose of the benchmarking was to validate the design of
         | the cluster and to identify any issues before going into
         | production, so it achieved exactly that objective. Without
         | doing this work a lot of performance would have been left on
         | the table before the cluster could even get out the door.
         | 
         | As per the blog, the cluster is now in a 6+2 EC configuration
         | for production which gives ~7PiB usable. Expensive yes, but
         | well worth it if this is the scale and performance required.
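          | 
          | (A quick cross-check of that figure in Python, using the
          | drive size and node/drive counts quoted elsewhere in the
          | thread; 6+2 EC keeps 6 data chunks out of every 8 stored:)
          | 
          |     drives = 68 * 10                      # nodes x NVMe per node
          |     raw_pib = drives * 15.36e12 / 2**50   # ~9.3 PiB raw
          |     usable = raw_pib * 6 / 8              # data / total chunks
          |     print(f"~{usable:.1f} PiB usable")    # ~7.0 PiB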
        
       | francoismassot wrote:
       | Does anyone know how Ceph compares to other object storage
       | engines like MinIO/Garage/...?
       | 
       | I would love to see some benchmarks there.
        
         | matesz wrote:
          | It would be great to have a universal benchmark of all
          | available open source solutions for self-hosting. Links
          | appreciated!
        
       ___________________________________________________________________
       (page generated 2024-01-20 23:01 UTC)