[HN Gopher] Ceph: A Journey to 1 TiB/s
       ___________________________________________________________________
        
       Ceph: A Journey to 1 TiB/s
        
       Author : davidmr
       Score  : 146 points
       Date   : 2024-01-19 20:02 UTC (2 hours ago)
        
 (HTM) web link (ceph.io)
 (TXT) w3m dump (ceph.io)
        
       | riku_iki wrote:
        | What router/switch would one use for such speeds?
        
         | KeplerBoy wrote:
          | 800Gbps via OSFP and QSFP-DD is already a thing. Multiple
         | vendors have NICs and switches for that.
        
           | _zoltan_ wrote:
            | can you show me an 800G NIC?
            | 
            | the switch is fine, I'm buying 64x800G switches, but NIC-wise
            | I'm limited to 400Gbit.
        
           | CyberDildonics wrote:
            | 16x PCIe 4.0 is 32 GB/s and 16x PCIe 5.0 should be 64 GB/s,
            | so how is any computer using 100 GB/s?
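            | 
            | A rough sanity check in Python, with approximate figures and
            | ignoring PCIe encoding and protocol overhead:
            | 
            |   # usable bandwidth of a x16 slot, per PCIe generation
            |   per_lane_GBps = {"PCIe 4.0": 2.0, "PCIe 5.0": 4.0}
            |   for gen, per_lane in per_lane_GBps.items():
            |       print(gen, "x16 ~", per_lane * 16, "GB/s")
            | 
            |   # NIC line rates converted from Gbit/s to GB/s
            |   for nic_gbps in (100, 200, 400, 800):
            |       print(nic_gbps, "GbE ~", nic_gbps / 8, "GB/s")
            | 
            |   # an 800G NIC (~100 GB/s) is more than even a PCIe 5.0
            |   # x16 slot (~64 GB/s) can feed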
        
         | epistasis wrote:
          | Given their configuration of just 4U of these nodes per rack,
          | spread across 17 racks, there's likely a bunch of compute in
          | the rest of each rack, and 1-2 top-of-rack switches like this:
         | 
         | https://www.qct.io/product/index/Switch/Ethernet-Switch/T700...
         | 
         | And then you connect the TOR switches to higher level switches
         | in something like a Clos distribution to get the desired
         | bandwidth between any two nodes:
         | 
         | https://www.techtarget.com/searchnetworking/definition/Clos-...
        
         | NavinF wrote:
          | The linked article says they used 68 machines with 2 x 100GbE
          | Mellanox ConnectX-6 cards. So any 100G pizza box switches
          | should work.
          | 
          | Note that 36-port 56G switches are dirt cheap on eBay, and 4
          | Tbps is good enough for most homelab use cases.
        
           | riku_iki wrote:
           | > So any 100G pizza box switches should work.
           | 
            | but will it be able to handle the combined 1 TiB/s of
            | traffic?
        
             | baq wrote:
             | any switch which can't handle full load on all ports isn't
             | worthy of the name 'switch', it's more like 'toy network
             | appliance'
        
               | birdman3131 wrote:
                | I will forever be scarred by the "Gigabit" switches of
                | old that had 2 gigabit ports and 22 100 Mb ports. A
                | coworker bought one, missing that nuance.
        
               | bombcar wrote:
                | Still happens - gotta check whether the top speed
                | mentioned is for an uplink or for the normal ports.
        
             | aaronax wrote:
              | Yes. Most network switches can handle all ports at 100%
              | utilization in both directions simultaneously.
              | 
              | Take for example the Mellanox SX6790, available for less
              | than $100 on eBay. It has 36 ports at 56 Gbps each. 36 * 2
              | * 56 = 4032 Gbps, and it is stated to have a switching
              | capacity of 4.032 Tbps.
              | 
              | Edit: I guess you are asking how one would possibly sink
              | 1 TiB/s of data into a given client. You would need
              | multiple clients spread across several switches to
              | generate such load. Or maybe some freaky link aggregation:
              | 10x 800 Gbps links for your client, plus at least 10x 800
              | Gbps links out to the servers.
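              | 
              | The aggregation math as a quick Python sketch (line rates
              | only, no protocol overhead):
              | 
              |   import math
              | 
              |   # non-blocking capacity of a 36-port 56 Gbps switch,
              |   # counting both directions
              |   ports, port_gbps = 36, 56
              |   print(ports * port_gbps * 2, "Gbps")           # 4032
              | 
              |   # links needed for one client to sink 1 TiB/s
              |   target_gbps = 1024**4 * 8 / 1e9                # ~8796
              |   print(math.ceil(target_gbps / 800), "x 800G")  # 11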
        
             | bombcar wrote:
              | Even the bargain MikroTik can do 1.2 Tbps:
             | https://mikrotik.com/product/crs518_16xs_2xq
        
               | margalabargala wrote:
               | For those curious, a "bargain" on a 100gbps switch means
               | about $1350
        
               | epistasis wrote:
               | On a cluster with more than $1M of NVMe disks, that does
               | actually seem like a bargain.
               | 
                | (Note that the linked MikroTik switch only has 100GbE on
                | a few ports, and wouldn't really qualify as a full
                | 100GbE switch to most people)
        
               | margalabargala wrote:
               | Sure- I don't mean to imply that it isn't. I can
               | absolutely see how that's inexpensive for 100gbe
               | equipment.
               | 
               | That was more for the benefit of others like myself, who
               | were wondering if "bargain" was comparative, or
               | inexpensive enough that it might be worth buying one next
               | time they upgraded switches. For me personally it's still
               | an order of magnitude away from that.
        
               | riku_iki wrote:
               | TB != Tb..
        
       | matheusmoreira wrote:
       | Does anyone have experience running ceph in a home lab? Last time
       | I looked into it, there were quite significant hardware
       | requirements.
        
         | nullwarp wrote:
          | There still are. As someone who has done both production and
          | homelab deployments: unless you are specifically looking for
          | experience with it and just setting up a demo, don't bother.
         | 
         | When it works, it works great - when it goes wrong it's a huge
         | headache.
         | 
          | Edit: if distributed storage is something you are interested
          | in, there are much better options for a homelab setup:
          | 
          | - seaweedfs has been rock solid for me for years at both small
          | and huge scales. We actually moved our production Ceph setup
          | to it.
          | 
          | - longhorn was solid for me when I was in the k8s world
          | 
          | - glusterfs is still fine as long as you know what you are
          | getting into.
        
           | reactordev wrote:
           | I'd throw minio [1] in the list there as well for homelab k8s
           | object storage.
           | 
           | [1] https://min.io/
        
             | speedgoose wrote:
             | Also garage. https://garagehq.deuxfleurs.fr/
        
           | dataangel wrote:
            | I really wish there were a benchmark comparing all of these
            | plus MinIO and S3. I'm in the market for a key-value store,
            | using S3 for now but eyeing a move to my own hardware in the
            | future, and having to do all the work to compare these is
            | one of the major things making me procrastinate.
        
             | woopwoop24 wrote:
              | MinIO is good, but you really need fast disks. They also
              | really don't like it when you want to change the size of
              | your cluster setup. There is no plan to add cache disks;
              | they just say to use faster disks. I have it running and
              | it goes smoothly, but it's not really user friendly to
              | optimize.
        
             | rglullis wrote:
              | MinIO gives you "only" S3 object storage. I've set up a
              | 3-node MinIO cluster for object storage on Hetzner, each
              | server having 4x10TB, for ~50EUR/month each. This means
              | 80TB of usable data for ~150EUR/month. It can be worth it if
             | you are trying to avoid egress fees, but if I were building
             | a data lake or anything where the data was used mostly for
             | internal services, I'd just stick with S3.
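              | 
              | As a rough cost sketch in Python - the S3 figure is an
              | assumption of roughly $0.023/GB-month for S3 Standard
              | storage, before any egress, so check current pricing:
              | 
              |   # self-hosted: three 4x10TB nodes at ~50 EUR/month each
              |   hetzner_eur_month = 3 * 50
              |   usable_tb = 80   # after MinIO erasure coding
              |   print(round(hetzner_eur_month / usable_tb, 2),
              |         "EUR per usable TB per month")   # ~1.88
              | 
              |   # S3 Standard storage, assumed ~0.023 USD/GB-month
              |   print(0.023 * 1000, "USD per TB per month")  # 23.0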
        
           | rglullis wrote:
            | > glusterfs is still fine as long as you know what you are
            | getting into.
           | 
            | Does that include storage volumes for databases? I was using
            | GlusterFS as a way to scale my swarm cluster horizontally,
            | and I am reasonably sure that it corrupted one database to
            | the point that I lost more than a few hours of data. I was
            | quite satisfied with the setup until I hit that.
            | 
            | I know that I am considered crazy for sticking with Docker
            | Swarm until now, but aside from this lingering issue with
            | how to manage stateful services, I honestly don't feel the
            | need to move to k8s yet. My cluster is ~10 nodes running <
            | 30 stacks, and it's not like I have tens of people working
            | with me on it.
        
           | bityard wrote:
           | Ceph is sort of a storage all-in-one: it provides object
           | storage, block storage, and network file storage. May I ask,
           | which of these are you using seaweedfs for? Is it as
           | performant as Ceph claims to be?
        
           | asadhaider wrote:
           | I thought it was popular for people running Proxmox clusters
        
         | loeg wrote:
         | Why would you bother with a distributed filesystem when you
         | don't have to?
        
           | iwontberude wrote:
           | It's cool to cluster everything for some people (myself
           | included). I see it more like a design constraint than a pure
           | benefit.
        
           | imiric wrote:
            | For the same reason you would use one in enterprise
            | deployments: if set up properly, it's easier to scale. You
            | don't need to invest in a huge storage server upfront, but
            | can build it out as needed with cheap nodes. That assumes it
            | works painlessly as a single-node filesystem, which I'm not
            | yet convinced the existing solutions do.
        
           | m463 wrote:
           | lol, wrong place to ask questions of such practicality.
           | 
           | that said, I played with virtualization and I didn't need to.
           | 
           | but then I retired a machine or two and it has been very
           | helpful.
           | 
           | And I used to just use physical disks and partitions. But
           | with the VMs I started using volume manager. It became easier
           | to grow and shrink storage.
           | 
           | and...
           | 
           | well, now a lot of this is second nature. I can spin up a new
           | "machine" for a project and it doesn't affect anything else.
           | I have better backups. I can move a virtual machine.
           | 
           | yeah, there are extra layers of abstraction but hey.
        
           | erulabs wrote:
           | So that when you _do_ have to, you know how to do it.
        
         | bluedino wrote:
         | Related question, how does someone get into working with Ceph?
         | Other than working somewhere that already uses it.
        
           | SteveNuts wrote:
            | You could start by installing Proxmox on old machines you
            | have; it uses Ceph for its distributed storage, if you
            | choose to use it.
        
           | candiddevmike wrote:
           | Look into the Rook project
        
         | ianlevesque wrote:
          | I played around with it, and it has a very cool web UI, object
          | storage, and file storage, but it was very hard to get decent
          | performance, and it was easy to get the metadata daemons stuck
          | with a small cluster. Ultimately, when the fun wore off, I
          | just put ZFS on a single box instead.
        
         | reactordev wrote:
          | There's a blog post where they set up Ceph on some Raspberry
          | Pi 4s. I'd say that's not significant hardware at all. [1]
         | 
         | [1] https://ceph.io/en/news/blog/2022/install-ceph-in-a-
         | raspberr...
        
           | m463 wrote:
           | I think "significant" turns out to mean the number of nodes
           | required.
        
         | m463 wrote:
          | I think you need 3 machines, or was it 5?
         | 
         | proxmox will use it - just click to install
        
         | victorhooi wrote:
         | I have some experience with Ceph, both for work, and with
         | homelab-y stuff.
         | 
         | First, bear in mind that Ceph is a _distributed_ storage system
         | - so the idea is that you will have multiple nodes.
         | 
         | For learning, you can definitely virtualise it all on a single
         | box - but you'll have a better time with discrete physical
         | machines.
         | 
          | Also, Ceph does prefer direct access to raw disks (similar to
          | ZFS).
         | 
          | And you do need decent network connectivity - I think that's
          | the main thing people have in mind when they think of high
          | hardware requirements for Ceph. Ideally 10GbE at a minimum -
          | more if you want higher performance - as there can be a lot of
          | network traffic, particularly with things like backfill.
          | (25Gbps if you can find that gear cheap for a homelab - 50Gbps
          | is a technological dead-end. 100Gbps works well.)
         | 
          | But honestly, for a homelab, a cheap mini PC or NUC with 10GbE
         | will work fine, and you should get acceptable performance, and
         | it'll be good for learning.
         | 
         | You can install Ceph directly on bare-metal, or if you want to
         | do the homelab k8s route, you can use Rook (https://rook.io/).
         | 
         | Hope this helps, and good luck! Let me know if you have any
         | other questions.
        
         | chomp wrote:
          | I've run Ceph in my home lab since Jewel (~8 years ago).
          | Currently I'm up to 70TB of storage on a single node. I have
          | been pretty successful scaling vertically, but will have to
          | add a 2nd node here in a bit.
          | 
          | Ceph isn't the fastest, but it's incredibly resilient and
          | scalable. I haven't needed any crazy hardware, just RAM and an
          | i7.
        
         | willglynn wrote:
         | The hardware minimums are real, and the complexity floor is
         | significant. Do _not_ deploy Ceph unless you mean it.
         | 
         | I started considering alternatives when my NAS crossed 100 TB
         | of HDDs, and when a scary scrub prompted me to replace all the
         | HDDs, I finally pulled the trigger. (ZFS resilvered everything
         | fine, but replacing every disk sequentially gave me a lot of
         | time to think.) Today I have far more HDD capacity and a few
         | hundred terabytes of NVMe, and despite its challenges, I
         | wouldn't dare run anything like it without Ceph.
        
         | mcronce wrote:
         | I run Ceph in my lab. It's pretty heavy on CPU, but it works
         | well as long as you're willing to spring for fast networking
         | (at least 10Gb, ideally 40+) and at least a few nodes with 6+
         | disks each if you're using spinners. You can probably get away
         | with far fewer disks per node if you're going all-SSD.
        
         | aaronax wrote:
         | I just set up a three-node Proxmox+Ceph cluster a few weeks
          | ago: three OptiPlex desktops (a 7040, a 3060, and a 7060) and
          | 4x SSDs in a mix of 1TB and 2TB (it was 5 until I noticed one
          | of my scavenged SSDs had failed). Single 1 Gbps network on
          | each, so I am seeing 30-120 MB/s of disk performance depending
          | on the workload. I think in a few months I will upgrade to 10
          | Gbps for about $400.
         | 
         | I'm about 1/2 through the process of moving my 15 virtual
         | machines over. It is a little slow but tolerable. Not having to
         | decide on RAIDs or a NAS ahead of time is amazing. I can throw
         | disks and nodes at it whenever.
        
       | stuff4ben wrote:
        | I used to love doing experiments like this. I was afforded that
        | luxury as a tech lead back at Cisco, setting up Kubernetes on
        | bare metal and getting to play with GlusterFS and Ceph just to
        | learn and see which was better. This was back in 2017/2018, if I
        | recall. Good ole days. Loved this writeup!
        
         | knicholes wrote:
          | I had to run a bunch of benchmarks to compare the speeds of
          | not just AWS instance types but actual individual instances of
          | each type, as some NVMe SSDs have seen more wear than others,
          | in order to lube up some Aerospike response times. Crazy.
        
       | amluto wrote:
       | I wish someone would try to scale the nodes down. The system
       | described here is ~300W/node for 10 disks/node, so 30W or so per
       | disk. That's a fair amount of overhead, and it also requires
       | quite a lot of storage to get any redundancy at all.
       | 
       | I bet some engineering effort could divide the whole thing by 10.
       | Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+
       | sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an
       | eMMC chip or SD slot for boot.
       | 
       | This could scale down to just a few nodes, and it reduces the
       | exposure to a single failure taking out 10 disks at a time.
       | 
       | I bet a lot of copies of this system could fit in a 4U enclosure.
       | Optionally the same enclosure could contain two entirely
       | independent switches to aggregate the internal nodes.
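        | 
        | A quick back-of-the-envelope in Python on that overhead,
        | assuming roughly 10 W per NVMe drive under load (the figure a
        | reply below also uses):
        | 
        |   node_power_w   = 300   # approximate per-node draw, as above
        |   disks_per_node = 10
        |   disk_power_w   = 10    # assumed draw per NVMe drive, loaded
        | 
        |   overhead_w = node_power_w / disks_per_node - disk_power_w
        |   print(overhead_w, "W of non-disk power per disk")  # 20.0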
        
         | jeffbee wrote:
         | I think the chief source of inefficiency in this architecture
         | would be the NVMe controller. When the operating system and the
         | NVMe device are at arm's length, there is natural inefficiency,
         | as the controller needs to infer the intent of the request and
         | do its best in terms of placement and wear leveling. The new
         | FDP (flexible data placement) features try to address this by
         | giving the operating system more control. The best thing would
         | be to just hoist it all up into the host operating system and
         | present the flash, as nearly as possible, as a giant field of
         | dumb transistors that happens to be a PCIe device. With layers
         | of abstraction removed, the hardware unit could be something
         | like an Atom with integrated 100gbps NICs and a proportional
         | amount of flash to achieve the desired system parallelism.
        
         | booi wrote:
          | Is that a lot of overhead? The disks themselves use about 10W
          | each, and high-speed controllers use about 75W, which leaves
          | pretty much 100W for the rest of the system, including
          | overhead of about 10%. Scale the system up to 16 disks and
          | there's not a lot of room for improvement.
        
         | kbenson wrote:
          | There probably is a sweet spot for power to speed, but I think
          | it's possibly a bit larger than you suggest. There's overhead
          | from the other components as well. For example, the Mellanox
          | NIC seems to use about 20W itself, and while the reduced
          | number of drives might allow for a single-port NIC, which
          | seems to use about half the power, if we're going to increase
          | the number of cables (3 per 12 disks instead of 2 per 5),
          | we're not just increasing the power usage of the nodes
          | themselves but also possibly increasing the power usage or
          | changing the type of switch required to combine the nodes.
         | 
         | If looked at as a whole, it appears to be more about whether
         | you're combining resources at a low level (on the PCI bus on
         | nodes) or a high level (in the switching infrastructure), and
         | we should be careful not to push power (or complexity, as is
         | often a similar goal) to a separate part of the system that is
         | out of our immediate thoughts but still very much part of the
         | system. Then again, sometimes parts of the system are much
         | better at handling the complexity for certain cases, so in
         | those cases that can be a definite win.
        
       | mrb wrote:
       | I wanted to see how 1 TiB/s compares to the actual theoretical
       | limits of the hardware. So here is what I found:
       | 
       | The cluster has 68 nodes, each a Dell PowerEdge R6615
       | (https://www.delltechnologies.com/asset/en-
       | us/products/server...). The R6615 configuration they run is the
       | one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe
        | gen4 lanes. Each PCIe gen4 lane runs at 16 GT/s, roughly 16
        | Gbit/s, with negligible ~2% overhead thanks to 128b/130b
        | encoding.
       | 
        | This means each U.2 link has a maximum link bandwidth of 16 * 4 =
       | 64 Gbit/s or 8 Gbyte/s. However the U.2 NVMe drives they use are
       | Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to
       | be capable of 7 Gbyte/s read throughput (https://www.serversupply
       | .com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not
       | bottlenecked by the U.2 link (8 Gbyte/s).
       | 
        | Each node has 10 U.2 drives, so each node can do local read I/O at
       | a maximum of 10 * 7 = 70 Gbyte/s.
       | 
        | However each node has a network bandwidth of only 200 Gbit/s (2
        | x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This
       | implies that remote reads are under-utilizing the drives (capable
       | of 70 Gbyte/s). The network is the bottleneck.
       | 
       | Assuming no additional network bottlenecks (they don't describe
       | the network architecture), this implies the 68 nodes can provide
        | 68 * 25 = 1700 Gbyte/s of network reads. The author benchmarked
        | 1 TiB/s, or more precisely 1025 GiB/s = 1101 Gbyte/s, which is
        | 65% of the theoretical maximum of 1700 Gbyte/s. That's pretty
        | decent, but in theory it could still do a bit better, assuming
        | all nodes can truly saturate their 200 Gbit/s network links
        | concurrently.
       | 
        | Reading this whole blog post, I got the impression Ceph's
        | complexity hits the CPU pretty hard. That not compiling a module
        | with -O2 ("Fix Three", linked by the author:
        | https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can
        | make it "up to 5x slower with some workloads"
        | (https://bugs.gentoo.org/733316) is pretty unexpected for a pure
        | I/O workload. Also, what's up with the OSD's threads wasting CPU
        | time grabbing the IOMMU spinlock? I agree with the conclusion
        | that the OSD threading model is suboptimal. A relatively simple
        | synthetic 100% read benchmark should not expose threading
        | contention if that part of Ceph's software architecture were
        | well designed (this is fixable, so I hope the Ceph devs
        | prioritize it).
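        | 
        | The arithmetic above as a small Python sketch, using the numbers
        | from the post (all figures approximate):
        | 
        |   # per drive: U.2 link (4x PCIe gen4 lanes) vs. drive spec
        |   u2_link_GBps    = 16 * 4 / 8   # 8 GB/s per U.2 link
        |   drive_read_GBps = 7            # quoted 15.36TB drive spec
        |   drives_per_node = 10
        |   node_disk_GBps  = drives_per_node * min(u2_link_GBps,
        |                                           drive_read_GBps)  # 70
        | 
        |   # per node, the 2x 100GbE network is the real ceiling
        |   node_net_GBps = 2 * 100 / 8    # 25 GB/s
        | 
        |   nodes = 68
        |   cluster_GBps = nodes * min(node_disk_GBps, node_net_GBps)
        | 
        |   measured_GBps = 1025 * 1.073741824  # 1025 GiB/s ~ 1101 GB/s
        |   print(round(100 * measured_GBps / cluster_GBps),
        |         "% of the 1700 GB/s network ceiling")   # ~65%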
        
       | MPSimmons wrote:
        | The worst problems I've had with in-cluster dynamic storage were
        | never strictly IO related; they were more about the storage
        | controller software in Kubernetes struggling with real-world
        | events like pods dying and their PVCs not reattaching until very
        | long timeouts expired, with the pod sitting in ContainerCreating
        | until the PVC lock was freed.
       | 
       | This has happened in multiple clusters, using rook/ceph as well
       | as Longhorn.
        
       | einpoklum wrote:
        | Where can I read about the rationale for Ceph as a project? I'm
       | not familiar with it.
        
       ___________________________________________________________________
       (page generated 2024-01-19 23:00 UTC)