[HN Gopher] Ceph: A Journey to 1 TiB/s
___________________________________________________________________
Ceph: A Journey to 1 TiB/s
Author : davidmr
Score : 146 points
Date : 2024-01-19 20:02 UTC (2 hours ago)
(HTM) web link (ceph.io)
(TXT) w3m dump (ceph.io)
| riku_iki wrote:
| What router/switch one would use for such speed?
| KeplerBoy wrote:
| 800Gbps via OSFP and QSFP-DD are already a thing. Multiple
| vendors have NICs and switches for that.
| _zoltan_ wrote:
| can you show me an 800G NIC?
|
| the switch is fine, I'm buying 64x800G switches, but NIC-wise
| I'm limited to 400Gbit.
| CyberDildonics wrote:
| 16x PCIe 4.0 is ~32 GB/s and 16x PCIe 5.0 should be ~64 GB/s,
| so how is any computer using 100 GB/s?
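|
| A rough sketch of where those x16 numbers come from, assuming
| the standard 16/32 GT/s per lane and 128b/130b encoding:
|
|   def pcie_x16_gbytes_per_s(gt_per_s):
|       # usable bits per lane after 128b/130b encoding overhead
|       usable_gbit = gt_per_s * 128 / 130
|       return usable_gbit * 16 / 8   # 16 lanes, bits -> bytes
|
|   print(pcie_x16_gbytes_per_s(16))  # ~31.5 GB/s, PCIe 4.0 x16
|   print(pcie_x16_gbytes_per_s(32))  # ~63.0 GB/s, PCIe 5.0 x16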
| epistasis wrote:
| Given their configuration of just 4U spread across 17 racks,
| there's likely a bunch of compute in the rest of the rack, and
| 1-2 top of rack switches like this:
|
| https://www.qct.io/product/index/Switch/Ethernet-Switch/T700...
|
| And then you connect the TOR switches to higher level switches
| in something like a Clos distribution to get the desired
| bandwidth between any two nodes:
|
| https://www.techtarget.com/searchnetworking/definition/Clos-...
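|
| As a purely hypothetical sizing sketch (68 nodes with 2 x 100GbE
| each are from the article; the 32-port 100GbE leaf switches are
| an assumption, not a description of their actual fabric):
|
|   import math
|
|   nodes, ports_per_node, port_gbps = 68, 2, 100
|   leaf_ports = 32                       # assumed leaf switch size
|
|   host_ports = nodes * ports_per_node   # 136 host-facing ports
|   host_tbps = host_ports * port_gbps / 1000    # 13.6 Tb/s total
|
|   # Non-blocking Clos: half of each leaf faces hosts, half spines.
|   leaves = math.ceil(host_ports / (leaf_ports // 2))   # 9 leaves
|   uplink_tbps = leaves * (leaf_ports // 2) * port_gbps / 1000
|
|   print(host_tbps, leaves, uplink_tbps)  # 13.6 Tb/s, 9, 14.4 Tb/s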
| NavinF wrote:
| Linked article says they used 68 machines with 2 x 100GbE
| Mellanox ConnectX-6 cards. So any 100G pizza box switches
| should work.
|
| Note that 36-port 56G switches are dirt cheap on eBay and 4 Tbps
| is good enough for most homelab use cases
| riku_iki wrote:
| > So any 100G pizza box switches should work.
|
| but will it be able to handle the combined TB/s of traffic?
| baq wrote:
| any switch which can't handle full load on all ports isn't
| worthy of the name 'switch', it's more like 'toy network
| appliance'
| birdman3131 wrote:
| I will forever be scarred by the "Gigabit" switches of
| old that had 2 gigabit ports and 22 100Mb ports. A
| coworker bought one, missing the nuance.
| bombcar wrote:
| Still happens, gotta see if the top speed mentioned is an
| uplink or normal ports.
| aaronax wrote:
| Yes. Most network switches can handle all ports at 100%
| utilization in both directions simultaneously.
|
| Take for example the Mellanox SX6790, available for less
| than $100 on eBay. It has 36 56Gbps ports: 36 * 2 * 56 =
| 4032 Gbps, and it is stated to have a switching capacity
| of 4.032 Tbps.
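|
| The same arithmetic in Python, using the port count and
| per-port speed from that spec:
|
|   ports, gbps_per_port = 36, 56
|   capacity_gbps = ports * 2 * gbps_per_port  # both directions
|   print(capacity_gbps)                       # 4032, i.e. 4.032 Tbps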
|
| Edit: I guess you are asking how one would possibly sip
| 1TiB/s of data into a given client. You would need multiple
| clients spread across several switches to generate such
| load. Or maybe some freaky link aggregation. 10x 800gbps
| links for your client, plus at least 10x 800gbps links out
| to the servers.
| bombcar wrote:
| Even the bargain Mikrotik can do 1.2Tbps
| https://mikrotik.com/product/crs518_16xs_2xq
| margalabargala wrote:
| For those curious, a "bargain" on a 100gbps switch means
| about $1350
| epistasis wrote:
| On a cluster with more than $1M of NVMe disks, that does
| actually seem like a bargain.
|
| (Note that the linked MikroTik switch only has 100GbE on
| a few ports, and wouldn't really qualify as a full
| 100GbE switch to most people)
| margalabargala wrote:
| Sure- I don't mean to imply that it isn't. I can
| absolutely see how that's inexpensive for 100gbe
| equipment.
|
| That was more for the benefit of others like myself, who
| were wondering if "bargain" was comparative, or
| inexpensive enough that it might be worth buying one next
| time they upgraded switches. For me personally it's still
| an order of magnitude away from that.
| riku_iki wrote:
| TB != Tb..
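|
| A quick conversion, ignoring protocol overhead:
|
|   bytes_per_s = 1 * 2**40             # 1 TiB/s
|   tbit_per_s = bytes_per_s * 8 / 1e12
|   print(round(tbit_per_s, 2))         # ~8.8 Tb/s on the wire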
| matheusmoreira wrote:
| Does anyone have experience running ceph in a home lab? Last time
| I looked into it, there were quite significant hardware
| requirements.
| nullwarp wrote:
| There still are. As someone who has done both production and
| homelab deployments: unless you are specifically just looking
| for experience with it and just setting up a demo - don't
| bother.
|
| When it works, it works great - when it goes wrong it's a huge
| headache.
|
| Edit: if distributed storage is just something you are
| interested in, there are much better options for a
| homelab setup:
|
| - seaweedfs has been rock solid for me for years in both small
| and huge scales. we actually moved our production ceph setup to
| this.
|
| - longhorn was solid for me when i was in the k8s world
|
| - glusterfs is still fine as long as you know what you are
| getting into.
| reactordev wrote:
| I'd throw minio [1] in the list there as well for homelab k8s
| object storage.
|
| [1] https://min.io/
| speedgoose wrote:
| Also garage. https://garagehq.deuxfleurs.fr/
| dataangel wrote:
| I really wish there were a benchmark comparing all of these +
| MinIO and S3. I'm in the market for a key-value store; I'm
| using S3 for now but eyeing a move to my own hardware in the
| future, and having to do all the work to compare these is one
| of the major things making me procrastinate.
| woopwoop24 wrote:
| minio is good but you really need fast disks. They also
| really don't like it when you want to change the size of your
| cluster setup. There's no plan to add cache disks; they just
| say to use faster disks. I have it running and it goes
| smoothly, but it's not really user friendly to optimize.
| rglullis wrote:
| Minio gives you "only" S3 object storage. I've set up a
| 3-node Minio cluster for object storage on Hetzner, each
| server having 4x10TB, for ~50EUR/month each. This means
| 80TB usable data for ~150EUR/month. It can be worth it if
| you are trying to avoid egress fees, but if I were building
| a data lake or anything where the data was used mostly for
| internal services, I'd just stick with S3.
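|
| Back-of-the-envelope on those numbers (the ~2/3 raw-to-usable
| ratio is presumably erasure-coding parity overhead):
|
|   servers, eur_per_server = 3, 50
|   drives, tb_per_drive = 4, 10
|   usable_tb = 80
|
|   raw_tb = servers * drives * tb_per_drive    # 120 TB raw
|   monthly_eur = servers * eur_per_server      # ~150 EUR/month
|   print(usable_tb / raw_tb)                   # ~0.67 usable/raw
|   print(monthly_eur / usable_tb)              # ~1.9 EUR/TB/month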
| rglullis wrote:
| > glusterfs is still fine as long as you know what you are
| getting into.
|
| Does that include storage volumes for databases? I was using
| glusterFS as a way to scale my swarm cluster horizontally and
| I am reasonably sure that it corrupted one database to the
| point I lost more than a few hours of data. I was quite
| satisfied with the setup until I hit that.
|
| I know that I am considered crazy for sticking with Docker
| Swarm until now, but aside from this lingering issue with how
| to manage stateful services, I honestly don't feel the need
| to move to k8s yet. My cluster is ~10 nodes running < 30
| stacks, and it's not like I have tens of people working
| with me on it.
| bityard wrote:
| Ceph is sort of a storage all-in-one: it provides object
| storage, block storage, and network file storage. May I ask,
| which of these are you using seaweedfs for? Is it as
| performant as Ceph claims to be?
| asadhaider wrote:
| I thought it was popular for people running Proxmox clusters
| loeg wrote:
| Why would you bother with a distributed filesystem when you
| don't have to?
| iwontberude wrote:
| It's cool to cluster everything for some people (myself
| included). I see it more like a design constraint than a pure
| benefit.
| imiric wrote:
| For the same reason you would use one in enterprise
| deployments: if set up properly, it's easier to scale. You
| don't need to invest in a huge storage server upfront, but
| can build it out as needed with cheap nodes. That assumes it
| works painlessly as a single-node filesystem, which I'm not
| yet convinced the existing solutions do.
| m463 wrote:
| lol, wrong place to ask questions of such practicality.
|
| that said, I played with virtualization and I didn't need to.
|
| but then I retired a machine or two and it has been very
| helpful.
|
| And I used to just use physical disks and partitions. But
| with the VMs I started using volume manager. It became easier
| to grow and shrink storage.
|
| and...
|
| well, now a lot of this is second nature. I can spin up a new
| "machine" for a project and it doesn't affect anything else.
| I have better backups. I can move a virtual machine.
|
| yeah, there are extra layers of abstraction but hey.
| erulabs wrote:
| So that when you _do_ have to, you know how to do it.
| bluedino wrote:
| Related question, how does someone get into working with Ceph?
| Other than working somewhere that already uses it.
| SteveNuts wrote:
| You could start by installing Proxmox on old machines you
| have; it uses Ceph for its distributed storage, if you choose
| to enable it.
| candiddevmike wrote:
| Look into the Rook project
| ianlevesque wrote:
| I played around with it and it has a very cool web UI, object
| storage & file storage, but it was very hard to get decent
| performance and it was possible to get the metadata daemons
| stuck pretty easily with a small cluster. Ultimately when the
| fun wore off I just put zfs on a single box instead.
| reactordev wrote:
| There's a blog post they did where they set up Ceph on some
| Raspberry Pi 4s. I'd say that's not significant hardware at
| all. [1]
|
| [1] https://ceph.io/en/news/blog/2022/install-ceph-in-a-
| raspberr...
| m463 wrote:
| I think "significant" turns out to mean the number of nodes
| required.
| m463 wrote:
| I think you need 3 or was it 5 machines?
|
| proxmox will use it - just click to install
| victorhooi wrote:
| I have some experience with Ceph, both for work, and with
| homelab-y stuff.
|
| First, bear in mind that Ceph is a _distributed_ storage system
| - so the idea is that you will have multiple nodes.
|
| For learning, you can definitely virtualise it all on a single
| box - but you'll have a better time with discrete physical
| machines.
|
| Also, Ceph does prefer physical access to disks (similar to
| ZFS).
|
| And you do need decent networking connectivity - I think that's
| the main thing people think of, when they think of high
| hardware requirements for Ceph. Ideally 10GbE at the minimum -
| although more if you want higher performance - there can be a
| lot of network traffic, particularly with things like backfill.
| (25Gbps if you can find that gear cheap for homelab - 50Gbps is
| a technological dead-end. 100Gbps works well).
|
| But honestly, for a homelab, a cheap mini PC or NUC with 10GbE
| will work fine, and you should get acceptable performance, and
| it'll be good for learning.
|
| You can install Ceph directly on bare-metal, or if you want to
| do the homelab k8s route, you can use Rook (https://rook.io/).
|
| Hope this helps, and good luck! Let me know if you have any
| other questions.
| chomp wrote:
| I've run Ceph in my home lab since Jewel (~8 years ago).
| Currently up to 70TB storage on a single node. Have been pretty
| successful vertically scaling, but will have to add a 2nd node
| here in a bit.
|
| Ceph isn't the fastest, but it's incredibly resilient and
| scalable. Haven't needed any crazy hardware requirements, just
| RAM and an i7.
| willglynn wrote:
| The hardware minimums are real, and the complexity floor is
| significant. Do _not_ deploy Ceph unless you mean it.
|
| I started considering alternatives when my NAS crossed 100 TB
| of HDDs, and when a scary scrub prompted me to replace all the
| HDDs, I finally pulled the trigger. (ZFS resilvered everything
| fine, but replacing every disk sequentially gave me a lot of
| time to think.) Today I have far more HDD capacity and a few
| hundred terabytes of NVMe, and despite its challenges, I
| wouldn't dare run anything like it without Ceph.
| mcronce wrote:
| I run Ceph in my lab. It's pretty heavy on CPU, but it works
| well as long as you're willing to spring for fast networking
| (at least 10Gb, ideally 40+) and at least a few nodes with 6+
| disks each if you're using spinners. You can probably get away
| with far fewer disks per node if you're going all-SSD.
| aaronax wrote:
| I just set up a three-node Proxmox+Ceph cluster a few weeks
| ago. Three Optiplex desktops (7040, 3060, and 7060) with 4x
| SSDs in a 1TB and 2TB mix (it was 5 until I noticed one of my
| scavenged SSDs had failed). Single 1Gbps network on each, so I
| am seeing 30-120MB/s disk performance depending on things. I
| think in a few months I will upgrade to 10Gbps for about $400.
|
| I'm about 1/2 through the process of moving my 15 virtual
| machines over. It is a little slow but tolerable. Not having to
| decide on RAIDs or a NAS ahead of time is amazing. I can throw
| disks and nodes at it whenever.
| stuff4ben wrote:
| I used to love doing experiments like this. I was afforded that
| luxury as a tech lead back when I was at Cisco setting up
| Kubernetes on bare metal and getting to play with setting up
| GlusterFS and Ceph just to learn and see which was better. This
| was back in 2017/2018 if I recall. Good ole days. Loved this
| writeup!
| knicholes wrote:
| I had to run a bunch of benchmarks to compare speeds of not
| just AWS instance types, but actual individual instances
| within each type, since some NVMe SSDs have seen more wear
| than others, in order to lube up some Aerospike response
| times. Crazy.
| amluto wrote:
| I wish someone would try to scale the nodes down. The system
| described here is ~300W/node for 10 disks/node, so 30W or so per
| disk. That's a fair amount of overhead, and it also requires
| quite a lot of storage to get any redundancy at all.
|
| I bet some engineering effort could divide the whole thing by 10.
| Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+
| sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an
| eMMC chip or SD slot for boot.
|
| This could scale down to just a few nodes, and it reduces the
| exposure to a single failure taking out 10 disks at a time.
|
| I bet a lot of copies of this system could fit in a 4U enclosure.
| Optionally the same enclosure could contain two entirely
| independent switches to aggregate the internal nodes.
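|
| Rough per-disk numbers from the figures above (the 10 W drive
| draw is an assumed figure, not measured):
|
|   node_watts, disks_per_node = 300, 10
|   per_disk = node_watts / disks_per_node   # 30 W per disk slot
|   drive_watts = 10                         # assumed NVMe draw
|   print(per_disk - drive_watts)            # ~20 W/disk of
|                                            # platform overhead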
| jeffbee wrote:
| I think the chief source of inefficiency in this architecture
| would be the NVMe controller. When the operating system and the
| NVMe device are at arm's length, there is natural inefficiency,
| as the controller needs to infer the intent of the request and
| do its best in terms of placement and wear leveling. The new
| FDP (flexible data placement) features try to address this by
| giving the operating system more control. The best thing would
| be to just hoist it all up into the host operating system and
| present the flash, as nearly as possible, as a giant field of
| dumb transistors that happens to be a PCIe device. With layers
| of abstraction removed, the hardware unit could be something
| like an Atom with integrated 100gbps NICs and a proportional
| amount of flash to achieve the desired system parallelism.
| booi wrote:
| Is that a lot of overhead? The disks themselves use about 10W
| each and high-speed controllers use about 75W, which leaves
| pretty much 100W for the rest of the system, including
| overhead of about 10%. Scale the system up to 16 disks and
| there's not a lot of room for improvement.
| kbenson wrote:
| There probably is a sweet spot for power to speed, but I think
| it's possibly a bit larger than you suggest. There's overhead
| from the other components as well. For example, the Mellanox
| NIC seems to use about 20W itself, and while the reduced
| number of drives might allow for a single-port NIC, which
| seems to use about half the power, if we're going to increase
| the number of cables (3 per 12 disks instead of 2 per 5),
| we're not just increasing the power usage of the nodes
| themselves but also possibly increasing the power usage, or
| changing the type, of switch required to combine the nodes.
|
| If looked at as a whole, it appears to be more about whether
| you're combining resources at a low level (on the PCI bus on
| nodes) or a high level (in the switching infrastructure), and
| we should be careful not to push power (or complexity, as is
| often a similar goal) to a separate part of the system that is
| out of our immediate thoughts but still very much part of the
| system. Then again, sometimes parts of the system are much
| better at handling the complexity for certain cases, so in
| those cases that can be a definite win.
| mrb wrote:
| I wanted to see how 1 TiB/s compares to the actual theoretical
| limits of the hardware. So here is what I found:
|
| The cluster has 68 nodes, each a Dell PowerEdge R6615
| (https://www.delltechnologies.com/asset/en-
| us/products/server...). The R6615 configuration they run is the
| one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe
| gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes
| have negligible ~1.5% overhead thanks to 128b/130b encoding.
|
| This means each U.2 link has a maximum link bandwidth of 16 * 4 =
| 64 Gbit/s or 8 Gbyte/s. However the U.2 NVMe drives they use are
| Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to
| be capable of 7 Gbyte/s read throughput (https://www.serversupply
| .com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not
| bottlenecked by the U.2 link (8 Gbyte/s).
|
| Each node has 10 U.2 drives, so each node can do local read I/O at
| a maximum of 10 * 7 = 70 Gbyte/s.
|
| However each node has a network bandwidth of only 200 Gbit/s (2 x
| 100GbE Mellanox ConnectX-6) which is only 25 Gbyte/s. This
| implies that remote reads are under-utilizing the drives (capable
| of 70 Gbyte/s). The network is the bottleneck.
|
| Assuming no additional network bottlenecks (they don't describe
| the network architecture), this implies the 68 nodes can provide
| 68 * 25 = 1700 Gbyte/s of network reads. The author actually
| benchmarked slightly more than 1 TiB/s: 1025 GiB/s = 1101
| Gbyte/s, which is 65% of the theoretical maximum of 1700
| Gbyte/s. That's pretty decent, but
| in theory it's still possible to be doing a bit better assuming
| all nodes can concurrently truly saturate their 200 Gbit/s
| network link.
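|
| The same chain of estimates as a small script (all figures are
| the ones quoted above):
|
|   lanes, gbit_per_lane = 4, 16
|   u2_link_gbyte = lanes * gbit_per_lane / 8   # 8 GB/s per U.2 link
|   drive_gbyte = 7                             # per-drive reads
|   node_local_gbyte = 10 * drive_gbyte         # 70 GB/s local
|   node_net_gbyte = 2 * 100 / 8                # 25 GB/s network
|   cluster_net_gbyte = 68 * node_net_gbyte     # 1700 GB/s ceiling
|
|   measured_gbyte = 1025 * 2**30 / 1e9         # 1025 GiB/s ~ 1101 GB/s
|   print(measured_gbyte / cluster_net_gbyte)   # ~0.65 of the ceiling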
|
| Reading this whole blog post, I got the impression ceph's
| complexity hits the CPU pretty hard. That not compiling a
| module with -O2 ("Fix Three", linked by the author:
| https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453)
| can make performance "up to 5x slower with some workloads"
| (https://bugs.gentoo.org/733316) is pretty unexpected for a
| pure I/O workload. Also, what's up with the OSD's threads
| causing excessive CPU waste grabbing the IOMMU spinlock? I
| agree with the conclusion that the OSD threading model is
| suboptimal. A relatively simple synthetic 100% read benchmark
| should not expose thread contention if that part of ceph's
| software architecture were well designed (this is fixable, so
| I hope the ceph devs prioritize it).
| MPSimmons wrote:
| The worst problems I've had with in-cluster dynamic storage
| were never strictly IO related; they were more about the
| storage controller software in Kubernetes struggling with
| real-world events like pods dying and their PVCs not
| reattaching until very long timeouts expired, with the pod
| sitting in ContainerCreating until the PVC lock was freed.
|
| This has happened in multiple clusters, using rook/ceph as well
| as Longhorn.
| einpoklum wrote:
| Where can I read about the rationale for ceph as a project? I'm
| not familiar with it.
___________________________________________________________________
(page generated 2024-01-19 23:00 UTC)