[HN Gopher] Ceph: A Journey to 1 TiB/s
___________________________________________________________________
Ceph: A Journey to 1 TiB/s
Author : davidmr
Score : 369 points
Date : 2024-01-19 20:02 UTC (1 day ago)
(HTM) web link (ceph.io)
(TXT) w3m dump (ceph.io)
| riku_iki wrote:
| What router/switch one would use for such speed?
| KeplerBoy wrote:
| 800Gbps via OSFP and QSFP-DD are already a thing. Multiple
| vendors have NICs and switches for that.
| _zoltan_ wrote:
| can you show me a 800G NIC?
|
| the switch is fine, I'm buying 64x800G switches, but NIC wise
| I'm limited to 400Gbit.
| KeplerBoy wrote:
| fair enough, it seems I was mistaken about the NIC. I guess
| that has to wait for PCIe 6 and should arrive soon-ish.
| CyberDildonics wrote:
| 16x PCIe 4.0 is 32 GB/s and 16x PCIe 5.0 should be 64 GB/s, so
| how is any computer using 100 GB/s?
| KeplerBoy wrote:
| I was talking about Gigabit/s, not Gigabyte/s.
|
| The article however actually talks about Terabyte/s scale,
| albeit not over a single node.
| CyberDildonics wrote:
| 800 gigabits is 100 gigabytes, which is still more than the
| 64 gigabytes per second of PCIe 5.0 x16 bandwidth.
|
| You said there were 800 gigabit network cards, I'm
| wondering how that much bandwidth makes it to the card in
| the first place.
|
| _The article however actually talks about Terabyte/s
| scale, albeit not over a single node._
|
| This does not have anything to do with what you
| originally said, you were talking about single 800Gb
| ports.
| KeplerBoy wrote:
| Yes, apparently I was mistaken about the NICs. They don't
| seem to be available yet.
|
| But it's not a PCIe limitation. There are PCIe devices
| out there which use 32 lanes, so you could achieve the
| bandwidth even on PCIe5.
|
| https://www.servethehome.com/ocp-nic-3-0-form-factors-
| quick-...
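|
| A rough sanity check of the numbers in this sub-thread, as a
| quick Python sketch (nominal per-lane rates, protocol overhead
| ignored):
|
|     # nominal PCIe throughput vs. an 800 Gbit/s NIC
|     GT_PER_LANE = {"gen4": 16, "gen5": 32}   # GT/s, ~1 bit per transfer
|
|     def pcie_gbyte_per_s(gen, lanes):
|         return GT_PER_LANE[gen] * lanes / 8  # GB/s
|
|     print(pcie_gbyte_per_s("gen5", 16))      # 64.0 GB/s: not enough for 800G
|     print(pcie_gbyte_per_s("gen5", 32))      # 128.0 GB/s: hence 32-lane devices
|     print(800 / 8)                           # 100.0 GB/s needed by the NIC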
| NavinF wrote:
| I'm not aware of any 800G cards, but FYI a single
| Mellanox card can use two PCIe x16 slots to avoid NUMA
| issues on dual-socket servers: https://www.nvidia.com/en-
| us/networking/ethernet/socket-dire...
|
| So the software infra for using multiple slots already
| exists and doesn't require any special config. Oh and
| some cards can use PCIe slots across multiple hosts. No
| idea why you'd want to do that, but you can.
| epistasis wrote:
| Given their configuration of just 4U per rack across 17 racks,
| there's likely a bunch of compute in the rest of the rack, and
| 1-2 top of rack switches like this:
|
| https://www.qct.io/product/index/Switch/Ethernet-Switch/T700...
|
| And then you connect the TOR switches to higher level switches
| in something like a Clos distribution to get the desired
| bandwidth between any two nodes:
|
| https://www.techtarget.com/searchnetworking/definition/Clos-...
| NavinF wrote:
| Linked article says they used 68 machines with 2 x 100GbE
| Mellanox ConnectX-6 cards. So any 100G pizza box switches
| should work.
|
| Note that 36 port 56G switches are dirt cheap on eBay and 4tbps
| is good enough for most homelab use cases
| riku_iki wrote:
| > So any 100G pizza box switches should work.
|
| but will it be able to handle combined TB/s traffic?
| baq wrote:
| any switch which can't handle full load on all ports isn't
| worthy of the name 'switch', it's more like 'toy network
| appliance'
| birdman3131 wrote:
| I will forever be scarred by the "Gigabit" switches of
| old that were 2 gigabit ports and 22 100mb ports.
| Coworker bought it missing the nuance.
| bombcar wrote:
| Still happens, gotta see if the top speed mentioned is an
| uplink or normal ports.
| aaronax wrote:
| Yes. Most network switches can handle all ports at 100%
| utilization in both directions simultaneously.
|
| Take for example the Mellanox SX6790 available for less
| than $100 on eBay. It has 36 56gbps ports. 36 * 2 * 56 =
| 4032gbps and it is stated to have a switching capacity of
| 4.032Tbps.
|
| Edit: I guess you are asking how one would possibly sip
| 1TiB/s of data into a given client. You would need multiple
| clients spread across several switches to generate such
| load. Or maybe some freaky link aggregation. 10x 800gbps
| links for your client, plus at least 10x 800gbps links out
| to the servers.
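|
| For reference, the switching-capacity arithmetic above as a
| one-liner (a sketch using the SX6790 figures quoted in the
| comment):
|
|     ports, gbps, duplex = 36, 56, 2
|     print(ports * gbps * duplex)  # 4032 Gbps, i.e. the quoted 4.032 Tbps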
| bombcar wrote:
| Even the bargain Mikrotik can do 1.2Tbps
| https://mikrotik.com/product/crs518_16xs_2xq
| margalabargala wrote:
| For those curious, a "bargain" on a 100gbps switch means
| about $1350
| epistasis wrote:
| On a cluster with more than $1M of NVMe disks, that does
| actually seem like a bargain.
|
| (Note that the linked MikroTik switch only has 100gbe on
| a few ports, and wouldn't really classify as a full
| 100gbe switch to most people)
| margalabargala wrote:
| Sure- I don't mean to imply that it isn't. I can
| absolutely see how that's inexpensive for 100gbe
| equipment.
|
| That was more for the benefit of others like myself, who
| were wondering if "bargain" was comparative, or
| inexpensive enough that it might be worth buying one next
| time they upgraded switches. For me personally it's still
| an order of magnitude away from that.
| bombcar wrote:
| https://mikrotik.com/product/crs305_1g_4s_in is the sweet
| spot right now for home users. Four 10g ports and a 1g,
| you can use the 1g for "uplink" to the internet and one
| of the 10g for your "big old Nortel gigabit switch with
| 10g uplink" and one for your Mac and two for your NAS and
| VM server. ;)
|
| Direct cables are moderately cheap, and modules for 10g
| Ethernet aren't insanely expensive.
| Palomides wrote:
| there's usually some used dx010 (32x100gbe) on ebay for
| less than $500
|
| the cheapest new 100gbe switch I know of is the mikrotik
| CRS504-4XQ-IN (4x100gbe, around $650)
| riku_iki wrote:
| TB != Tb..
| matheusmoreira wrote:
| Does anyone have experience running ceph in a home lab? Last time
| I looked into it, there were quite significant hardware
| requirements.
| nullwarp wrote:
| There still are. As someone who has done both production and
| homelab deployments: unless you are specifically just looking
| for experience with it and just setting up a demo - don't
| bother.
|
| When it works, it works great - when it goes wrong it's a huge
| headache.
|
| Edit: if distributed storage is just something you are
| interested in, there are much better options for a
| homelab setup:
|
| - seaweedfs has been rock solid for me for years in both small
| and huge scales. we actually moved our production ceph setup to
| this.
|
| - longhorn was solid for me when i was in the k8s world
|
| - glusterfs is still fine as long as you know what you are
| going into.
| reactordev wrote:
| I'd throw minio [1] in the list there as well for homelab k8s
| object storage.
|
| [1] https://min.io/
| speedgoose wrote:
| Also garage. https://garagehq.deuxfleurs.fr/
| BlackLotus89 wrote:
| Garage seems to only do duplication: https://garagehq.deux
| fleurs.fr/documentation/design/goals/
|
| > Storage optimizations: erasure coding or any other
| coding technique both increase the difficulty of placing
| data and synchronizing; we limit ourselves to
| duplication.
|
| This is probably a no-go for most use cases where you work
| with large datasets...
| plagiarist wrote:
| Minio doesn't make any sense to me in a homelab. Unless I'm
| reading it wrong it sounds like a giant pain to add more
| capacity while it is already in use. There's basically no
| situation where I'm more likely to add capacity over time
| than a homelab.
| reactordev wrote:
| You get a new nas (minio server pool) and you plug it
| into your home lab (site replication) and now it's part
| of the distributed minio storage layer (k8s are happy).
| How is that hard? It's the same basic thing for Ceph or
| any distributed JBOD mass storage engine. Minio has some
| funkiness with how you add more storage but it's totally
| capable of doing it while in use. Everything is atomic.
| dataangel wrote:
| I really wish there was a benchmark comparing all of these +
| MinIO and S3. I'm in the market for a key value store, using
| S3 for now but eyeing moving to my own hardware in the future
| and having to do all the work to compare these is one of the
| major things making me procrastinate.
| woopwoop24 wrote:
| minio is good but you really need fast disks. They also
| really don't like, when you want to change the size of your
| cluster setup. No plan to add cache disks, they just say
| use faster disks. I have it running, goes smoothly but not
| really user friendly to optimize
| rglullis wrote:
| Minio gives you "only" S3 object storage. I've set up a
| 3-node Minio cluster for object storage on Hetzner, each
| server having 4x10TB, for ~50EUR/month each. This means
| 80TB usable data for ~150EUR/month. It can be worth it if
| you are trying to avoid egress fees, but if I were building
| a data lake or anything where the data was used mostly for
| internal services, I'd just stick with S3.
| rglullis wrote:
| > glusterfs is still fine as long as you know what you are
| going into.
|
| Does that include storage volumes for databases? I was using
| glusterFS as a way to scale my swarm cluster horizontally and
| I am reasonably sure that it corrupted one database to the
| point I lost more than a few hours of data. I was quite
| satisfied with the setup until I hit that.
|
| I know that I am considered crazy for sticking with Docker
| Swarm until now, but aside from this lingering issue with how
| to manage stateful services, I honestly don't feel the
| need to move to k8s yet. My cluster is ~10 nodes running <
| 30 stacks and it's not like I have tens of people working
| with me on it.
| camkego wrote:
| Docker Swarm seems to be underrated, from a simplicity and
| reliability perspective, IMHO.
| bityard wrote:
| Ceph is sort of a storage all-in-one: it provides object
| storage, block storage, and network file storage. May I ask,
| which of these are you using seaweedfs for? Is it as
| performant as Ceph claims to be?
| asadhaider wrote:
| I thought it was popular for people running Proxmox clusters
| geerlingguy wrote:
| It is, and if you have a few nodes with at least 10 GbE
| networking, it's certainly the best clustered storage
| option I can think of.
| matheusmoreira wrote:
| I just want to hoard data. I hate having to delete stuff to
| make space. Things disappear from the web every day. I should
| hold onto them.
|
| My requirements for a storage solution are:
|
| > Single root file system
|
| > Storage device failure tolerance
|
| > Gradual expansion capability
|
| The problem with every storage solution I've ever seen is the
| lack of gradual expandability. I'm not a corporation, I'm
| just a guy. I don't have the money to buy 200 hard disks all
| at once. I need to gradually expand capacity as needed.
|
| I was attracted to Ceph because it apparently allows you
| to throw a bunch of drives of any make and model at it and it
| just pools them all up without complaining. The complexity is
| nightmarish though.
|
| ZFS is nearly perfect but when it comes to expanding capacity
| it's just as bad as RAID. Expansion features seem to be
| _just_ about to land for quite a few years now. I remember
| getting excited about it after seeing news here only for
| people to deflate my expectations. Btrfs has a flexible block
| allocator which is just what I need but... It's btrfs.
| chromatin wrote:
| > ZFS is nearly perfect but when it comes to expanding
| capacity it's just as bad as RAID.
|
| if you don't mind the overhead of a "pool of mirrors"
| approach [1], then it is easy to expand storage by adding
| pairs of disks! This is how my home NAS is configured.
|
| [1] https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-
| vdevs...
| roygbiv2 wrote:
| This is also exactly how mine is done. Started off with a
| bunch of 2TB disks. I've now got a mixture of 16TB down
| to 4TB, all in the original pool.
| Snow_Falls wrote:
| 50% storage efficiency is a tough pill to swallow, but
| drives are pretty big and the ability to expand as you go
| means it can be cheaper in the long run to just buy the
| larger, new drives coming out than pay upfront for a
| bunch of drives in a raidz config.
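|
| The usable-capacity trade-off being discussed, as a small
| sketch (assumes equal-sized drives, 2-way mirrors vs. a raidz2
| vdev):
|
|     def mirror_usable(raw_tb):
|         return raw_tb * 0.5                  # 2-way mirrors: 50% of raw
|
|     def raidz_usable(raw_tb, n_drives, parity):
|         return raw_tb * (n_drives - parity) / n_drives
|
|     print(mirror_usable(100))        # 50.0 TB, but grows two drives at a time
|     print(raidz_usable(100, 8, 2))   # 75.0 TB, but the vdev can't grow gradually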
| bityard wrote:
| On a single host, you could do this with LVM. Add a pair of
| disks, make them a RAID 1, create a physical volume on
| them, then a volume group, then a logical volume with XFS
| on top. To expand, you add a pair of disks, RAID 1 them,
| and add them to the LVM. It's a little stupid, but it would
| work.
|
| If multiple nodes are not off the table, also look into
| seaweedfs.
|
| Also consider how (or if) you are going to back up your
| hoard of data.
| matheusmoreira wrote:
| > Also consider how (or if) you are going to back up your
| hoard of data.
|
| I actually emailed backblaze years ago about their
| supposedly unlimited consumer backup plan. Asked them if
| they would _really_ allow me to dump into their systems
| dozens of terabytes of encrypted undeduplicable data.
| They responded that yes, they would. Still didn't
| believe them, these corporations never really mean it
| when they say unlimited. Plus they had no Linux software.
| nijave wrote:
| > these corporations never really mean it when they say
| unlimited. Plus they had no Linux software
|
| Afaik they rely on the latter to mitigate the risk of the
| former.
| Snow_Falls wrote:
| Considering the fact that most data heavy servers are
| Linux, that would be a pretty clever way of staying true
| to their word.
| deadbunny wrote:
| ZFS using mirrors is extremely easy to expand. Need more
| space and you have small drives? Replace the drives in a
| mirror one by one with bigger ones. Need more space and
| already have huge drives? Just add another vdev mirror. And
| the added benefit of not living in fear of drive failure
| while resilvering as it is much faster with mirrors than
| raidX.
|
| Sure the density isn't great as you're essentially running
| at 50% of raw storage but - touches wood - my home zpool
| has been running strong for about a decade doing the above
| from 6x 6tb drives (3x 6tb mirrors) to 16x 10-20tb drives
| (8x mirrors, differing sized drives but matched per mirror
| like a 10tb x2 mirror, a 16tb x2 mirror etc).
|
| Edit: Just realised someone else has already mentioned a
| pool of mirrors. Consider this another +1.
| matheusmoreira wrote:
| > Replace the drives in a mirror one by one with bigger
| ones.
|
| That's exactly what I meant by "just as bad as RAID".
| Expanding an existing array is analogous to every single
| drive in the array failing and getting replaced with
| higher capacity drives.
|
| When a drive fails, the array is in a degraded state.
| Additional drive failures put the entire system in danger
| of data loss. The rebuilding process generates enormous
| I/O loads on all the disks. Not only does it take an
| insane amount of time, according to my calculations the
| probability of read errors happening _during the error
| recovery process_ is about 3%. Such expansion operations
| have a real chance of destroying the entire array.
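|
| For context, here is one way a figure of roughly 3% can come
| out. The drive size and error rate below are illustrative
| assumptions, not numbers from the comment:
|
|     # P(at least one unrecoverable read error) during a rebuild,
|     # assuming independent errors at a constant per-bit rate
|     ure_per_bit = 1e-15          # typical enterprise HDD spec sheet value
|     bits_read = 4e12 * 8         # ~4 TB read back during the rebuild
|     p_err = 1 - (1 - ure_per_bit) ** bits_read
|     print(f"{p_err:.1%}")        # ~3.1%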
| deadbunny wrote:
| That's not the case with mirrored vdevs. There is no
| degradation of the array with a failed drive in a
| mirrored vdev; it continues humming along perfectly fine.
|
| Also resilvers are not as intensive when rebuilding a
| mirror as you are just copying from one disk in the vdev
| to the other, not all X other drives and recalculating
| parity at the same time. This means less reads across the
| entire array and much much quicker resilver times, thus
| less window for drive failure.
|
| But don't just take my word for it. This is a blog post
| that goes into much more detail:
| https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-
| vdevs...
| amadio wrote:
| EOS (https://cern.ch/eos, https://github.com/cern-eos/eos)
| is probably a bit more complicated than other solutions to
| set up and manage, but it does allow you to add/remove disks
| and nodes serving data on the fly. This is essential to let
| us upgrade hardware of the clusters serving experimental
| data with minimal to no downtime.
| sekh60 wrote:
| I've run Ceph at home since the jewel release. I migrated
| to it after running FreeNAS.
|
| I use it for RBD volumes for my OpenStack cluster and for
| CephFS. With a total raw capacity of around 350TiB. Around
| 14 of that is nvme storage for RBD and CephFS metadata. The
| rest is rust. This is spread across 5 nodes.
|
| I currently am only buying 20TB Exos drives for rust. SMR
| and I think HSMR are both no-gos for Ceph, as are non-
| enterprise SSDs, so storage is expensive. I do have a mix
| of disks though, as the cluster has grown organically. So I
| have a few 6TB WD Reds in there, from before their SMR shift.
|
| My networks for OpenStack, Ceph and Ceph backend are all
| 10Gbps. With the flash storage, when repairing I get about
| 8GiB/s. With rust it is around 270MiB/s. The bottleneck I
| think is due to 3 of the nodes running on first gen Xeon-D
| boards, and the few Reds do slow things down too. The 4th
| node runs an AMD Rome CPU, and the newest an AMD Genoa CPU.
| So I am looking at about 5k CAD a node
| before disks. I colocate the MDS, OSDs and MONs, with 64GiB
| of ram each. Each node gets 6 rust, and 2 nvme drives.
|
| Complexity-wise it is pretty manageable. I deployed the
| initial iteration by hand, and then when cephadm was released
| I converted it daemon by daemon smoothly. I find on the
| mailing lists and Reddit most of the people encountering
| problems deploy it via Proxmox and don't really understand
| Ceph because of it.
| nijave wrote:
| Not sure what the multi-disk consensus is for btrfs
| nowadays, but adding/removing devices is trivial, you can do
| "offline" dedupe, and you can rebalance data if you change
| the disk config.
|
| As an added bonus it's also in-tree so you don't have to
| worry about kernel updates breaking things
|
| I think you can also potentially do btrfs+LVM and let LVM
| manage multi device. Not sure what performance looks like
| there, though
| matheusmoreira wrote:
| That's all great but btrfs parity striping is still
| unusable. How many more decades will it take?
| Snow_Falls wrote:
| If you're willing to use mirror vdevs, expansions can be
| done two drives at a time. Also, depending on how often your
| data changes, you should check out snapraid. Doesn't have
| all the features of ZFS but it's perfect for stuff that
| rarely changes (media or, in your case, archiving).
|
| Also unionfs or similar can let you merge zfs and snapraid
| into one unified filesystem so you can place important data
| in zfs and unchanging archive data in snapraid.
| sob727 wrote:
| Curious, what do you mean by "know what you go into" re
| glusterfs?
|
| I recently tried ceph in a homelab setup, gave up because of
| complexity, and settled on glusterfs. I'm not a pro though,
| so I'm not sure if there's any shortcomings that are clear to
| everybody but me, hence why your comment caught my attention.
| cholmon wrote:
| GlusterFS support looks to be permanently ending later this
| year.
|
| https://access.redhat.com/support/policy/updates/rhs
|
| _Note that the Red Hat Gluster Storage product has a defined
| support lifecycle through to 31-Dec-24, after which the Red
| Hat Gluster Storage product will have reached its EOL.
| Specifically, RHGS 3.5 represents the final supported RHGS
| series of releases._
|
| For folks using GlusterFS currently, what's your plan after
| this year?
| loeg wrote:
| Why would you bother with a distributed filesystem when you
| don't have to?
| iwontberude wrote:
| It's cool to cluster everything for some people (myself
| included). I see it more like a design constraint than a pure
| benefit.
| imiric wrote:
| For the same reason you would use one in enterprise
| deployments: if set up properly, it's easier to scale. You
| don't need to invest in a huge storage server upfront, but
| could build it out as needed with cheap nodes. That assumes
| it works painlessly as a single-node filesystem, which I'm
| not yet convinced the existing solutions do.
| loeg wrote:
| > if setup properly, it's easier to scale
|
| For home use/needs, I think vertical scaling is much
| easier.
| imiric wrote:
| Not really. Most consumer motherboards have a limited
| number of SATA ports, and server hardware is more
| expensive, noisy and requires a lot of space. Consumers
| usually go with branded NAS appliances, which are also
| expensive and limited at scaling.
|
| Setting up a cluster of small heterogeneous nodes is
| cheaper, more flexible, and can easily be scaled as
| needed, _assuming_ that the distributed storage software
| is easy to work with and trouble-free. This last part is
| what makes it difficult to set up and maintain, but if the
| software is stable, I would prefer this approach for home
| use.
| m463 wrote:
| lol, wrong place to ask questions of such practicality.
|
| that said, I played with virtualization and I didn't need to.
|
| but then I retired a machine or two and it has been very
| helpful.
|
| And I used to just use physical disks and partitions. But
| with the VMs I started using volume manager. It became easier
| to grow and shrink storage.
|
| and...
|
| well, now a lot of this is second nature. I can spin up a new
| "machine" for a project and it doesn't affect anything else.
| I have better backups. I can move a virtual machine.
|
| yeah, there are extra layers of abstraction but hey.
| erulabs wrote:
| So that when you _do_ have to, you know how to do it.
| loeg wrote:
| I think most of us will go our whole lives never having to
| deploy Ceph, especially at home.
| erulabs wrote:
| You're absolutely not wrong - but asking a devops
| engineer why they over engineered their home cluster is
| sort of like asking a mechanic "why is your car so fast?
| Couldn't you just take the bus?"
| matheusmoreira wrote:
| I'm indifferent towards the distributed nature thing. What I
| want is ceph's ability to pool any combination of drives of
| any make, model and capacity into organized redundant fault
| tolerant storage, and its ability to add arbitrary drives to
| that pool at any point in the system's lifetime. RAID-like
| solutions require identical drives and can't be easily
| expanded.
| loeg wrote:
| ZFS and BtrFS have some capability for this.
| nh2 wrote:
| One reason for using Ceph instead of other RAID solutions on
| a single machine is that it supports disk failures more
| flexibly.
|
| In most RAIDs (including ZFS's, to my knowledge), the set of
| disks that can fail together is static.
|
| Say you have physical disks A B C D E F; common setup is to
| group RAID1'd disks into a pool such as `mirror(A, B) +
| mirror(C, D) + mirror(E, F)`.
|
| With that, if disk A fails, and then later B fails before you
| replace A, your data is lost.
|
| But with Ceph, and replication `size = 2`, when A fails, Ceph
| will (almost) immediately redistribute your data so that it
| has 2 replicas again, across all remaining disks B-F. So then
| B can fail and you still have your data.
|
| So in Ceph, you give it a pool of disks and tell it to
| "figure out the replication" iself. Most other systems don't
| offer that; the human defines a static replication structure.
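|
| A toy illustration of that difference in Python (hypothetical
| object placement with size=2; not how CRUSH actually places
| data):
|
|     import random
|
|     disks = {"A", "B", "C", "D", "E", "F"}
|     # place each object on 2 distinct disks (stand-in for CRUSH)
|     objects = {i: set(random.sample(sorted(disks), 2)) for i in range(1000)}
|
|     def fail(disk):
|         disks.discard(disk)
|         for replicas in objects.values():
|             replicas.discard(disk)
|             if len(replicas) == 1 and len(disks) > 1:
|                 # backfill: restore the second replica on a surviving disk
|                 replicas.add(random.choice(sorted(disks - replicas)))
|
|     fail("A")  # data is re-replicated across B-F
|     fail("B")  # a second failure still loses nothing
|     print(min(len(r) for r in objects.values()))  # 2
|
|     # with a static mirror(A, B), anything stored only on that pair is gone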
| bluedino wrote:
| Related question, how does someone get into working with Ceph?
| Other than working somewhere that already uses it.
| SteveNuts wrote:
| You could start by installing Proxmox on old machines you
| have, it uses Ceph for its distributed storage, if you choose
| to use it.
| candiddevmike wrote:
| Look into the Rook project
| hathawsh wrote:
| The recommended way to set up Ceph is cephadm, a single-file
| Python script that is a multi-tool for both creating and
| administering clusters.
|
| https://docs.ceph.com/en/latest/cephadm/
|
| To learn about Ceph, I recommend you create at least 3 KVM
| virtual machines (using virt-manager) on a development box,
| network them together, and use cephadm to set up a cluster
| between the VMs. The RAM and storage requirements aren't huge
| (Ceph can run on Raspberry Pis, after all) and I find it a
| lot easier to figure things out when I have a desktop window
| for every node.
|
| I recently set up Ceph twice. Now that Ceph (specifically
| RBD) is providing the storage for virtual machines, I can
| live-migrate VMs between hosts and reboot hosts (with zero
| guest downtime) anytime I need. I'm impressed with how well
| it works.
| ianlevesque wrote:
| I played around with it and it has a very cool web UI, object
| storage & file storage, but it was very hard to get decent
| performance and it was possible to get the metadata daemons
| stuck pretty easily with a small cluster. Ultimately when the
| fun wore off I just put zfs on a single box instead.
| reactordev wrote:
| There's a blog post they did where they set up Ceph on some
| RPi 4's. I'd say that's not significant hardware at all. [1]
|
| [1] https://ceph.io/en/news/blog/2022/install-ceph-in-a-
| raspberr...
| m463 wrote:
| I think "significant" turns out to mean the number of nodes
| required.
| m463 wrote:
| I think you need 3 or was it 5 machines?
|
| proxmox will use it - just click to install
| victorhooi wrote:
| I have some experience with Ceph, both for work, and with
| homelab-y stuff.
|
| First, bear in mind that Ceph is a _distributed_ storage system
| - so the idea is that you will have multiple nodes.
|
| For learning, you can definitely virtualise it all on a single
| box - but you'll have a better time with discrete physical
| machines.
|
| Also, Ceph does prefer physical access to disks (similar to
| ZFS).
|
| And you do need decent networking connectivity - I think that's
| the main thing people think of, when they think of high
| hardware requirements for Ceph. Ideally 10Gbe at the minimum -
| although more if you want higher performance - there can be a
| lot of network traffic, particularly with things like backfill.
| (25Gbps if you can find that gear cheap for homelab - 50Gbps is
| a technological dead-end. 100Gbps works well).
|
| But honestly, for a homelab, a cheap mini PC or NUC with 10Gbe
| will work fine, and you should get acceptable performance, and
| it'll be good for learning.
|
| You can install Ceph directly on bare-metal, or if you want to
| do the homelab k8s route, you can use Rook (https://rook.io/).
|
| Hope this helps, and good luck! Let me know if you have any
| other questions.
| eurekin wrote:
| NUC with 10gbit eth - can you recommend any?
| justinclift wrote:
| If you want something cheap, you could go with Lenovo
| M720q's:
|
| https://www.servethehome.com/lenovo-
| thinkcentre-m720q-tinymi...
|
| They have a PCIe slot and can take 8th/9th gen intel cpus
| (6 core, etc). That PCIe slot should let you throw in a
| decent network card (eg 10GbE, 25GbE, etc).
| chomp wrote:
| I've run Ceph in my home lab since Jewel (~8 years ago).
| Currently up to 70TB storage on a single node. Have been pretty
| successful vertically scaling, but will have to add a 2nd node
| here in a bit.
|
| Ceph isn't the fastest, but it's incredibly resilient and
| scalable. Haven't needed any crazy hardware requirements, just
| ram and an i7.
| willglynn wrote:
| The hardware minimums are real, and the complexity floor is
| significant. Do _not_ deploy Ceph unless you mean it.
|
| I started considering alternatives when my NAS crossed 100 TB
| of HDDs, and when a scary scrub prompted me to replace all the
| HDDs, I finally pulled the trigger. (ZFS resilvered everything
| fine, but replacing every disk sequentially gave me a lot of
| time to think.) Today I have far more HDD capacity and a few
| hundred terabytes of NVMe, and despite its challenges, I
| wouldn't dare run anything like it without Ceph.
| samcat116 wrote:
| Can I ask what you use all that storage for on your NAS?
| mcronce wrote:
| I run Ceph in my lab. It's pretty heavy on CPU, but it works
| well as long as you're willing to spring for fast networking
| (at least 10Gb, ideally 40+) and at least a few nodes with 6+
| disks each if you're using spinners. You can probably get away
| with far fewer disks per node if you're going all-SSD.
| aaronax wrote:
| I just set up a three-node Proxmox+Ceph cluster a few weeks
| ago. Three Optiplex desktops 7040, 3060, and 7060 and 4x SSDs
| of 1TB and 2TB mix (was 5 until I noticed one of my scavenged
| SSDs had failed). Single 1gbps network on each so I am seeing
| 30-120MB/s disk performance depending on things. I think in a
| few months I will upgrade to 10gbps for about $400.
|
| I'm about 1/2 through the process of moving my 15 virtual
| machines over. It is a little slow but tolerable. Not having to
| decide on RAIDs or a NAS ahead of time is amazing. I can throw
| disks and nodes at it whenever.
| sgarland wrote:
| Yes. I first tried it with Rook, and that was a disaster, so I
| shifted to Longhorn. That has had its own share of problems,
| and is quite slow. Finally, I let Proxmox manage Ceph for me,
| and it's been a dream. So far I haven't migrated my K8s
| workloads to it, but I've used it for RDBMS storage (DBs in
| VMs), and it works flawlessly.
|
| I don't have an incredibly great setup, either: 3x Dell R620s
| (Ivy Bridge-era Xeons), and 1GBe. Proxmox's corosync has a
| dedicated switch, but that's about it. The disks are nice to be
| fair - Samsung PM863 3.84 TB NVMe. They are absolutely
| bottlenecked by the LAN at the moment.
|
| I plan on upgrading to 10GBe as soon as I can convince myself
| to pay for an L3 10G switch.
| sixdonuts wrote:
| Just get a 25G switch and MM fiber. 25G switches are cheaper,
| use less power and can work with 10 and 25G SFPs.
| sgarland wrote:
| The main blocker (other than needing to buy new NICs, since
| everything I have already came with quad 1/1/10/10) is I'm
| heavily invested into the Ubiquiti ecosystem, and since
| they killed off the USW-Leaf (and the even more brief UDC-
| Leaf), they don't have anything that fits the bill.
|
| I'm not entirely opposed to getting a Mikrotik or something
| and it just being the oddball out, but it's nice to have
| everything centrally managed.
|
| EDIT: They do have the PRO-Aggregation, but there are only
| 4x 25G ports. Technically it _would_ meet my needs for
| Ceph, and Ceph only.
| mmerlin wrote:
| Proxmox makes Ceph easy, even with just one single server if
| you are homelabbing...
|
| I had 4 NUCs running Proxmox+Ceph for a few years, and apart
| from slightly annoying slowness syncing after spinning the
| machines up from cold start, it all ran very smoothly.
| louwrentius wrote:
| If you want decent performance, you need a lot of OSDs
| especially if you use HDDs. But a lot of consumer SSDs will
| suffer terrible performance degradation with writes depending
| on the circumstances and workloads.
| mikecoles wrote:
| Works great, depending on what you want to do. Running on SBCs
| or computers with cheap sata cards will greatly reduce the
| performance. It's been running well for years after I found out
| the issues regarding SMR drives and the SATA card bottlenecks.
|
| 45Drives has a homelab setup if you're looking for a canned
| solution.
| antongribok wrote:
| I run Ceph on some Raspberry Pi 4s. It's super reliable, and
| with cephadm it's very easy[1] to install and maintain.
|
| My household is already 100% on Linux, so having a native
| network filesystem that I can just mount from any laptop is
| very handy.
|
| Works great over Tailscale too, so I don't even have to be at
| home.
|
| [1] I run a large install of Ceph at work, so "easy" might be a
| bit relative.
| dcplaya wrote:
| What are your speeds? Do you rub ceph FS too?
|
| I'm trying to do similar.
| antongribok wrote:
| It's been a while since I've done some benchmarks, but it
| can definitely do 40MB/s sustained writes, which is very
| good given the single 1GbE links on each node, and 5TB SMR
| drives.
|
| Latency is hilariously terrible though. It's funny to open
| a text file over the network in vi, paste a long blob of
| text and watch it sync that line by line over the network.
|
| If by "rub" you mean scrub, then yes, although I increased
| the scrub intervals. There's no need to scrub everything
| every week.
| stuff4ben wrote:
| I used to love doing experiments like this. I was afforded that
| luxury as a tech lead back when I was at Cisco setting up
| Kubernetes on bare metal and getting to play with setting up
| GlusterFS and Ceph just to learn and see which was better. This
| was back in 2017/2018 if I recall. Good ole days. Loved this
| writeup!
| knicholes wrote:
| I had to run a bunch of benchmarks to compare speeds of not
| just AWS instance types, but actual individual instances in
| each type, as some NVME SSDs have been more used than others in
| order to lube up some Aerospike response times. Crazy.
| j33zusjuice wrote:
| Ad-tech, or?
| knicholes wrote:
| Yeah. Serving profiles for customized ad selection.
| redrove wrote:
| A Heketi man! I had the same experience around the same years,
| what a blast. Everything was so new... and broken!
| CTrox wrote:
| Same here, still remember that time our Heketi DB partially
| corrupted and we had to fix it by exporting it to a massive
| JSON file, patching it up by looking at the Gluster state
| and importing it again. I can't quite remember the details
| but I think it had to do with Gluster snapshots being out of
| sync with the state in the DB.
| amluto wrote:
| I wish someone would try to scale the nodes down. The system
| described here is ~300W/node for 10 disks/node, so 30W or so per
| disk. That's a fair amount of overhead, and it also requires
| quite a lot of storage to get any redundancy at all.
|
| I bet some engineering effort could divide the whole thing by 10.
| Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+
| sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an
| eMMC chip or SD slot for boot.
|
| This could scale down to just a few nodes, and it reduces the
| exposure to a single failure taking out 10 disks at a time.
|
| I bet a lot of copies of this system could fit in a 4U enclosure.
| Optionally the same enclosure could contain two entirely
| independent switches to aggregate the internal nodes.
| jeffbee wrote:
| I think the chief source of inefficiency in this architecture
| would be the NVMe controller. When the operating system and the
| NVMe device are at arm's length, there is natural inefficiency,
| as the controller needs to infer the intent of the request and
| do its best in terms of placement and wear leveling. The new
| FDP (flexible data placement) features try to address this by
| giving the operating system more control. The best thing would
| be to just hoist it all up into the host operating system and
| present the flash, as nearly as possible, as a giant field of
| dumb transistors that happens to be a PCIe device. With layers
| of abstraction removed, the hardware unit could be something
| like an Atom with integrated 100gbps NICs and a proportional
| amount of flash to achieve the desired system parallelism.
| booi wrote:
| Is that a lot of overhead? The disk itself uses about 10W and
| high speed controllers use about 75W, which leaves pretty much
| 100W for the rest of the system, including overhead of about
| 10%. Scale the system up to 16 disks and there's not a lot of
| room for improvement.
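|
| The power budget above, spelled out (a sketch using the parent
| comment's per-component estimates, which are themselves rough):
|
|     node_w, disks, disk_w, controller_w = 300, 10, 10, 75
|     overhead_w = 0.10 * node_w
|     print(node_w - disks * disk_w - controller_w - overhead_w)  # ~95 W left over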
| kbenson wrote:
| There probably is a sweet spot for power to speed, but I think
| it's possibly a bit larger than you suggest. There's overhead
| from the other components as well. For example, the Mellanox
| NIC seems to utilize about 20W itself, and while the reduced
| numbers of drives might allow for a single port NIC which seems
| to use about half the power, if we're going to increase the
| number of cables (3 per 12 disks instead of 2 per 5), we're not
| just increasing the power usage of the nodes themselves but
| also possibly increasing the power usage or changing the type
| of switch required to combine the nodes.
|
| If looked at as a whole, it appears to be more about whether
| you're combining resources at a low level (on the PCI bus on
| nodes) or a high level (in the switching infrastructure), and
| we should be careful not to push power (or complexity, as is
| often a similar goal) to a separate part of the system that is
| out of our immediate thoughts but still very much part of the
| system. Then again, sometimes parts of the system are much
| better at handling the complexity for certain cases, so in
| those cases that can be a definite win.
| Palomides wrote:
| here's a weird calculation:
|
| this cluster does something vaguely like 0.8 gigabits per
| second per watt (1 terabyte/s * 8 bits per byte * 1024 gb per
| tb / 34 nodes / 300 watts)
|
| a new mac mini (super efficient arm system) runs around 10
| watts in interactive usage and can do 10 gigabits per second
| network, so maybe 1 gigabit per second per watt of data
|
| so OP's cluster, back of the envelope, is basically the same
| bits per second per watt that a very efficient arm system can
| do
|
| I don't think running tiny nodes would actually get you any
| more efficiency, and would probably cost more! performance per
| watt is quite good on powerful servers now
|
| anyway, this is all open source software running on off-the-
| shelf hardware, you can do it yourself for a few hundred bucks
| 3abiton wrote:
| Trusting your maths, damn Apple did a great job on their M
| design.
| hirako2000 wrote:
| Didn't ARM (the company that originally designed ARM
| processors) do most of that job, with Apple pushing
| performance-per-watt even further?
| amluto wrote:
| I think the Mac Mini has massively more compute than needed
| for this kind of work. It also has a power supply, and
| computer power supplies are generally not amazing at low
| output.
|
| I'm imagining something quite specialized. Use a low
| frequency CPU with either vector units or even DMA engines
| optimized for the specific workloads needed, or go all out
| and arrange for data to be DMAed directly between the disk
| and the NIC.
| Palomides wrote:
| sounds like a DPU (mellanox bluefield for example), they're
| entire ARM systems with a high speed NIC all on a PCIe
| card, I think the bluefield ones can even directly
| interface over the bus to nvme drives without the host
| system involved
| amluto wrote:
| That Bluefield hardware looks neat, although it also
| sounds like a real project to program it :).
|
| I can imagine two credible configurations for high
| efficiency:
|
| 1. A motherboard with a truly minimal CPU for
| bootstrapping but a fairly beefy PCIe root complex. 32 lanes
| to the DPU and a bunch of lanes for NVMe. The CPU doesn't
| touch the data at all. I wonder if anyone makes a
| motherboard optimized like this -- a 64-lane mobo with a
| Xeon in it would be quite wasteful but fine for
| prototyping I suppose.
|
| 2. Wire up the NVMe ports directly to the Bluefield DPU,
| letting the DPU be the root complex. At least 28 of the
| lanes are presumably usable for this or maybe even all
| 32. It's not entirely clear to me that the Bluefield DPU
| can operate without a host computer, though.
| hirako2000 wrote:
| I checked selling prices of those racks + top end SSDs; this
| 1TiB/s achievement runs on roughly $4 million worth of
| hardware. Or more, since I didn't check the networking
| interface costs.
|
| But yeah, it could run on commodity hardware. Not sure those
| highly efficient ARM chips packaged at a premium by Apple
| would beat the Dell racks, though, in throughput relative to
| hardware investment cost.
| amluto wrote:
| Dell's list prices have essentially nothing to do with the
| prices that any competent buyer would actually pay,
| _especially_ when storage is involved. Look at the prices
| of Dell disks, which are nothing special compared to name
| brand disks of equal or better spec and _much_ lower list
| price.
|
| I don't know what discount large buyers get, but I wouldn't
| be surprised if it's around 75%.
| georgyo wrote:
| You're comparing one machine with many machines.
|
| You're comparing raw disks with shards and erasure
| encoding.
|
| Lastly, you're comparing only network bandwidth and not
| storage capacity.
| evanreichard wrote:
| I used to run a 5 node Ceph cluster on a bunch of ODROID-HC2's
| [0]. Was a royal pain to get installed (armhf processor). But
| once it was running it worked great. Just slow with the single
| 1Gb NIC.
|
| Was just a learning experience at the time.
|
| [0] https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/
| eurekin wrote:
| Same here, but on Pi 4B's. 6 node cluster with a 2TB HDD and
| 512GB SSD per node. Ceph made a huge impression on me, as in
| I didn't realize how extensive the package was. I got up
| to 122MB/s and thought it was too little for my hack-NAS
| replacement :)
|
| The functionality: mixing various pool types on the same set
| of SSD's, different redundancy types (erasure coded,
| replicated) was very impressive. Now I can't help but look
| down at a RAID NAS in comparison. Still, some extra packages
| like the NFS exporter were not ready for the ARM architecture.
| somat wrote:
| I have always wanted to set up a ceph system with one drive per
| node. The ideal form factor would be a drive with a couple
| network interfaces built in. Western Digital had a press
| release about an experiment they did that was exactly this, but
| it never ended up as a drive you could buy.
|
| The hardkernel HC2 SOC was a nearly ideal form factor for this,
| and I still have a stack of them laying around that I bought to
| make a ceph cluster, but I ran out of steam when I figured out
| they were 32-bit. Not to say it would be impossible; I just
| never did it.
| progval wrote:
| I used to use Ceph Luminous (v12) on these, they worked fine.
| Unfortunately, a bug in Nautilus (v14) prevented 32-bit and
| 64-bit archs from talking to each other. Pacific (v16)
| allegedly solves this, but I didn't try it:
| https://ceph.com/en/news/blog/2021/v16-2-5-pacific-released/
|
| If you want to try it with a more modern (and 64-bit)
| device, the hardkernel HC4 might do it for you. It's
| conceptually similar to the HC2 but has two drives.
| Unfortunately it only has double the RAM (4GB), which is
| probably not enough anymore.
| eurekin wrote:
| Looks so good, wish for a > 1gbit version, since HDDs alone
| can saturate that
| progval wrote:
| Did you look at their H3? It's pricier but it has two
| 2.5Gbits ports (along with a NVMe slot and an Intel CPU)
| eurekin wrote:
| I have one and love it! It bravely holds together my
| intranet dev services :)
|
| For a ceph node would still consider a version with
| 10gbit eth
| narism wrote:
| Sounds like the Seagate Kinetic HDD:
| https://www.seagate.com/www-content/product-content/hdd-
| fam/...
| somat wrote:
| That would be perfect. Unfortunately, going by the data
| sheet it would not run Ceph; you would have to work with
| Seagate's proprietary object store. I will note that as far
| as I can tell it is unobtainium. none of the usual vendors
| stock them, you probably have to prove to seagate that you
| are a "serious enterprise customer" and commit to a
| thousand units before they will let you buy some.
| walrus01 wrote:
| 10 Gbps is increasingly obsolete with very low cost 100 Gbps
| switches and 100Gbps interfaces. Something would have to be
| really tiny and low cost to justify doing a ceph setup with
| 10Gbps interfaces now... If you're at that scale of very small
| stuff you are probably better off doing local NVME storage on
| each server instead.
| wildylion wrote:
| IIRC, WD has experimented with placing Ethernet and some
| compute directly onto hard drives some time back.
|
| _sigh_ I used to do some small-scale Ceph back in 2017 or
| so...
| mrb wrote:
| I wanted to see how 1 TiB/s compares to the actual theoretical
| limits of the hardware. So here is what I found:
|
| The cluster has 68 nodes, each a Dell PowerEdge R6615
| (https://www.delltechnologies.com/asset/en-
| us/products/server...). The R6615 configuration they run is the
| one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe
| gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes
| have negligible ~2% overhead thanks to 128b/130b encoding.
|
| This means each U.2 link has a maximum link bandwidth of 16 * 4 =
| 64 Gbit/s or 8 Gbyte/s. However the U.2 NVMe drives they use are
| Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to
| be capable of 7 Gbyte/s read throughput (https://www.serversupply
| .com/SSD%20W-TRAY/NVMe/15.36TB/DELL/...). So they are not
| bottlenecked by the U.2 link (8 Gbyte/s).
|
| Each node has 10 U.2 drive, so each node can do local read I/O at
| a maximum of 10 * 7 = 70 Gbyte/s.
|
| However each node has a network bandwidth of only 200 Gbit/s (2 x
| 100GbE Mellanox ConnectX-6) which is only 25 Gbyte/s. This
| implies that remote reads are under-utilizing the drives (capable
| of 70 Gbyte/s). The network is the bottleneck.
|
| Assuming no additional network bottlenecks (they don't describe
| the network architecture), this implies the 68 nodes can provide
| 68 * 25 = 1700 Gbyte/s of network reads. The author actually
| benchmarked slightly over 1 TiB/s: exactly 1025 GiB/s = 1101
| Gbyte/s, which is 65% of the maximum theoretical 1700 Gbyte/s.
| That's pretty decent, but
| in theory it's still possible to be doing a bit better assuming
| all nodes can concurrently truly saturate their 200 Gbit/s
| network link.
|
| Reading this whole blog post, I got the impression ceph's
| complexity hits the CPU pretty hard. That not compiling a module
| with -O2 ("Fix Three", linked by the author:
| https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) can
| make performance "up to 5x slower with some workloads"
| (https://bugs.gentoo.org/733316) is pretty unexpected for a pure
| I/O workload. Also, what's up with OSD threads causing excessive
| CPU waste grabbing the IOMMU spinlock? I agree with the
| conclusion that the OSD threading model is suboptimal. A
| relatively simple synthetic 100% read benchmark should not expose
| a threading contention if that part of ceph's software
| architecture was well designed (which is fixable, so I hope the
| ceph devs prioritize this.)
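|
| Collecting that arithmetic in one place (a sketch reusing the
| comment's assumptions about drives, NICs, and node count):
|
|     lanes, gbit_per_lane = 4, 16
|     u2_link = lanes * gbit_per_lane / 8        # 8 GB/s per U.2 link
|     drive_read = 7                             # GB/s per NVMe drive (spec)
|     node_local = 10 * min(drive_read, u2_link) # 70 GB/s of local reads
|     node_net = 2 * 100 / 8                     # 2x100GbE -> 25 GB/s (bottleneck)
|     cluster_net = 68 * node_net                # 1700 GB/s theoretical ceiling
|     achieved = 1025 * 1.024**3                 # 1025 GiB/s expressed in GB/s
|     print(round(achieved), round(100 * achieved / cluster_net))  # ~1101 GB/s, ~65%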
| wmf wrote:
| I think PCIe TLP overhead and NVMe commands account for the
| difference between 7 and 8 GB/s.
| mrb wrote:
| You are probably right. Reading some old notes of mine when I
| was fine-tuning PCIe bandwidth on my ZFS server, I had
| discovered back then that a PCIe Max_Payload_Size of 256
| bytes limited usable bandwidth to about 74% of the link's
| theoretical max. I had calculated that 512 and 1024 bytes
| (the maximum) would raise it to respectively about 86% and
| 93% (but my SATA controllers didn't support a value greater
| than 256.)
| _zoltan_ wrote:
| Mellanox recommends setting this from the default 512 to
| 4096 on their NICs.
| magicalhippo wrote:
| They're benchmarking random IO though, and the disks can "only"
| do a bit over 1000k random 4k read IOPS, which translates to
| about 5 GiB/s. With 320 OSDs that's around 1.6 TiB/s.
|
| At least that's the number I could find. Not exactly tons of
| reviews on these enterprise NVMe disks...
|
| Still, that seems like a good match to the NICs. At this scale
| most workloads will likely appear as random IO at the storage
| layer anyway.
| mrb wrote:
| The benchmark where they achieve 1025 GiB/s is for
| sequential reads. For random reads they do 25.5M iops or ~100
| GiB/s. See last table, column "630 OSDs (3x)".
| magicalhippo wrote:
| Oh wow how did I miss that table, cheers.
| markhpc wrote:
| I wanted to chime in and mention that we've never seen any
| issues with IOMMU before in Ceph. We have a previous generation
| of the same 1U chassis from Dell with AMD Rome processors in
| the upstream ceph lab and they don't suffer from the same issue
| despite performing similarly at the same scale (~30 OSDs). The
| customer did say they've seen this in the past in their data
| center. I'm hoping we can work with AMD to figure out what's
| going on.
|
| I did some work last summer kind of duct taping the OSD's
| existing threading model (double buffering the hand-off between
| async msgr and worker threads, adaptive thread wakeup, etc). I
| could achieve significant performance / efficiency gains under
| load, but at the expense of increased low-load latency (Ceph by
| default is very aggressive about waking up threads when new IO
| arrives for a given shard).
|
| One of the other core developers and I discussed it and we both
| came to the conclusion that it probably makes sense to do a
| more thorough rewrite of the threading code.
| MPSimmons wrote:
| The worst problems I've had with in-cluster dynamic storage were
| never strictly IO related, and were more the storage controller
| software in Kubernetes having trouble with real-world problems
| like pods dying and the PVCs not attaching until after very long
| timeouts expired, with the pod sitting in ContainerCreating until
| the PVC lock was freed.
|
| This has happened in multiple clusters, using rook/ceph as well
| as Longhorn.
| einpoklum wrote:
| Where can I read about the rationale for ceph as a project? I'm
| not familiar with it.
| jseutter wrote:
| http://www.45drives.com/blog/ceph/what-is-ceph-why-our-custo...
| is a pretty good introduction. Basically you can take off-the-
| shelf hardware and keep expanding your storage cluster and ceph
| will scale fairly linearly up through hundreds of nodes. It is
| seeing quite a bit of use in things like Kubernetes and
| OpenShift as a cheap and cheerful alternative to SANs. It is
| not without complexity, so if you don't know you need it, it's
| probably not worth the hassle.
| jacobwg wrote:
| Not sure how common the use-case is, but we're using Ceph to
| effectively roll our own EBS inside AWS on top of i3en EC2
| instances. For us it's about 30% cheaper than the base EBS
| cost, but provides access to 10x the IOPS of base gp3 volumes.
|
| The downside is durability and operations - we have to keep
| Ceph alive and are responsible for making sure the data is
| persistent. That said, we're storing cache from container
| builds, so in the worst-case where we lose the storage cluster,
| we can run builds without cache while we restore.
| alberth wrote:
| Ceph has an interesting history.
|
| It was created at Dreamhost (DH) by the founders, for their
| internal needs.
|
| DH was doing effectively IaaS & PaaS before those were
| industry-coined terms (VPS, managed OS/database/app-servers).
|
| They spun Ceph off and Redhat bought it.
|
| https://en.wikipedia.org/wiki/DreamHost
| epistasis wrote:
| A bit more to the story is that it was created also at UC Santa
| Cruz, by Sage Weil, a Dreamhost founder, while he was doing
| graduate work there. UCSC has had a lot of good storage
| research.
| dekhn wrote:
| the fighting banana slugs
| AdamJacobMuller wrote:
| I remember the first time I deployed ceph, would have been
| around 2010 or 2011, had some really major issues which
| nearly resulted in data loss. Due to someone else not
| realizing what "this cluster is experimental, do not store
| any important data here" meant, the data on Ceph was the only
| copy of the irreplaceable data in the world; losing the data
| would have been fairly catastrophic for us.
|
| I ended up on the ceph IRC channel and eventually had Sage
| helping me fix the issues directly, helping me find bugs and
| writing patches to fix them in realtime.
|
| Super amazingly nice guy that he was willing to help, never
| once chastised me for being so stupid (even though I was),
| also wicked smart.
| antongribok wrote:
| Sage is one of the nicest, down to earth, super smart
| individuals I've met.
|
| I've talked to him at a few OpenStack and Ceph conferences,
| and he's always very patient answering questions.
| artyom wrote:
| Yeah, as a customer (still one) I remember their "Hey, we're
| going to build this Ceph thing, maybe it ends up being cool"
| blog entry (or newsletter?) kinda just sharing what they were
| toying with. It was a time of no marketing copy and not
| crafting every sentence to sell you things.
|
| I think it was the university project of one of the founders,
| and the others jumped in supporting it. Docker has a similar
| origin story as far as I know.
| pas wrote:
| https://en.wikipedia.org/wiki/Sage_Weil right?
|
| https://ceph.com/assets/pdfs/weil-crush-sc06.pdf
| louwrentius wrote:
| Remember, random IOPS without latency is a meaningless figure.
| peter_d_sherman wrote:
| Ceph is interesting... open source software whose only purpose is
| to implement a distributed file system...
|
| Functionally, Linux implements a file system (well, several!) as
| well (in addition to many other OS features) -- but (usually!)
| only on top of local hardware.
|
| There seems to be some missing software here -- if we examine
| these two paradigms side-by-side.
|
| For example, what if I want a Linux (or more broadly, a general
| OS) -- but one that doesn't manage a local file system or local
| storage at all?
|
| One that operates solely using the network, solely using a
| distributed file system that Ceph, or software like Ceph, would
| provide?
|
| Conversely, what if I don't want to run a full OS on a network
| machine, a network node that manages its own local storage?
|
| The only thing I can think of to solve those types of problems --
| is:
|
| _What if the Linux filesystem were written such that it was a
| completely separate piece of software, like a distributed file
| system such as Ceph, and not dependent on the other kernel source
| code_ (although still compilable into the kernel as most Linux
| components normally are)...
|
| A lot of work? Probably!
|
| But there seems to be a need for software somewhere between a
| purely distributed file system, as Ceph is, and a completely
| monolithic "everything baked in" (but not distributed!) OS/kernel,
| as Linux is...
|
| Note that I am just thinking aloud here -- I probably am wrong
| and/or misinformed on one or more fronts!
|
| So, kindly take this random "thinking aloud" post -- with the
| proverbial "grain of salt!" :-)
| wmf wrote:
| _what if I want a Linux ... that doesn 't manage a local file
| system or local storage at all [but] operates solely using the
| network, solely using a distributed file system_
|
| Linux can boot from NFS although that's kind of lost knowledge.
| Booting from CephFS might even be possible if you put the right
| parts in the initrd.
| lmz wrote:
| NFS root docs here https://www.kernel.org/doc/Documentation/f
| ilesystems/nfs/nfs...
| peter_d_sherman wrote:
| NFS is an excellent point!
|
| NFS (now that I think about it!) -- brings up two
| additional software engineering considerations:
|
| 1) Distributed _file system protocol_.
|
| 2) Software that implements that distributed (or at least
| remote/network) file system -- via that _file system
| protocol_.
|
| NFS is both.
|
| That's not a bad thing(!) -- but ideally from a software
| engineering "separation of concerns" perspective, this
| future software layer/level would be decoupled from
| the underlying protocol -- that is, it might have a "plug-
| in" protocol architecture, where various 3rd party file
| system protocols (somewhat analogous to drivers) could be
| "plugged-in"...
|
| But NFS could definitely be used to boot/run Linux over the
| network, and is definitely a step in the right direction,
| and something worth evaluating for these purposes... its
| source code is definitely worth looking at...
|
| So, an excellent point!
| plagiarist wrote:
| It sounds you want microkernels, and I agree, it would be nice.
| chx wrote:
| There was a point in history when the total amount of digital
| data stored worldwide reached 1TiB for the first time. It is
| extremely likely this day was within the last sixty years.
|
| And here we are moving that amount of data every second on the
| servers of a fairly random entity. We're not talking about a
| nation state or a supranational research effort.
| qingcharles wrote:
| That reminds me of a calculation I did which showed that my
| desktop PC would be more powerful than all of the computers on
| the planet combined in like 1978 :D
| plagiarist wrote:
| My phone has more computation than anything I would have
| imagined owning, and I sometimes turn on the screen just to
| use as a quick flashlight.
| qingcharles wrote:
| Haha.. imagine taking it back to 1978 and showing how it
| has more computing power than the entire planet and then
| telling them that you mostly just use it to find that thing
| you lost under the couch :D
| fiddlerwoaroof wrote:
| It's at least 20ish years ago: I remember an old sysadmin
| talking about managing petabytes before 2003
| aspenmayer wrote:
| Those numbers seem reasonable in that context. I first
| started using BitTorrent around that time as well, and it
| wasn't uncommon to see many users long-term seeding multiple
| hundreds of gigabytes of Linux ISOs alone.
|
| Here's another usage scenario with data usage numbers I found
| a while back.
|
| > A 2004 paper published in ACM Transactions on Programming
| Languages and Systems shows how Hancock code can sift calling
| card records, long distance calls, IP addresses and internet
| traffic dumps, and even track the physical movements of
| mobile phone customers as their signal moves from cell site
| to cell site.
|
| > With Hancock, "analysts could store sufficiently precise
| information to enable new applications previously thought to
| be infeasible," the program authors wrote. AT&T uses Hancock
| code to sift 9 GB of telephone traffic data a night,
| according to the paper.
|
| https://web.archive.org/web/20200309221602/https://www.wired.
| ..
| ComputerGuru wrote:
| I archived Hancock here over a decade ago, stumbled upon it
| via HN at the time if I'm not mistaken:
| https://github.com/mqudsi/hancock
| aspenmayer wrote:
| That's pretty cool. I remember someone on that repo from a
| while back and was surprised to see their name pop up
| again. Thanks for archiving this!
|
| Corinna Cortes et al wrote the paper(s) on Hancock and
| also the Communities of Interest paper referenced in the
| Wired article I linked to. She's apparently a pretty big
| deal and went on to work at Google after her prestigious
| work at AT&T.
|
| Hancock: A Language for Extracting Signatures from Data
|
| https://scholar.google.com/citations?view_op=view_citatio
| n&h...
|
| Hancock: A Language for Analyzing Transactional Data
| Streams
|
| https://scholar.google.com/citations?view_op=view_citatio
| n&h...
|
| Communities of Interest
|
| https://scholar.google.com/citations?view_op=view_citatio
| n&h...
| fiddlerwoaroof wrote:
| Yeah, at the other end of the scale, it sounds like Apple
| is now managing exabytes:
| https://read.engineerscodex.com/p/how-apple-built-icloud-
| to-...
|
| This is pretty mind-boggling to me.
| chx wrote:
| It must be much more than 20ish years. Some 2400 ft reels in the
| 60s stored a few megabytes each, so you only need a few hundred
| thousand of those to reach a terabyte.
| https://en.wikipedia.org/wiki/IBM_7330
|
| > a single 2400-foot tape could store the equivalent of some
| 50,000 punched cards (about 4,000,000 six-bit bytes).
|
| In 1964, with the introduction of System/360, you go an order of
| magnitude higher:
| https://www.core77.com/posts/108573/A-Storage-Cabinet-
| Based-...
|
| > It could store a maximum of 45MB on 2,400 feet
|
| At this point you only need a few tens of thousands of reels in
| existence to reach a terabyte. So I strongly suspect the
| "terabyte point" was some time in the 1960s.
| chx wrote:
| I raised this on Retrocomputing SE, and
| https://retrocomputing.stackexchange.com/a/28322/3722 notes that
| a TiB of digital data was likely reached in the 1930s with
| punch cards.
| one_buggy_boi wrote:
| Is modern Ceph appropriate for transactional database storage, and
| how is the IO latency? I'd like to move to a cheaper clustered FS
| that can compete with systems like Oracle's clustered file system,
| or with DBs backed by something like Veritas. Veritas supports
| multi-petabyte DBs, and I haven't seen much outside of it or OCFS
| that similarly scales with acceptable latency.
| samcat116 wrote:
| Latency is quite poor; I wouldn't recommend running
| high-performance database workloads there.
| louwrentius wrote:
| From my dated experience, Ceph is absolutely amazing but
| latency is indeed a relative weak spot.
|
| Everything has a trade-off, and with Ceph you get a ton of
| capability, but latency is one such trade-off. Databases -
| depending on requirements - may be better off on regular NVMe
| and not on Ceph.
| antongribok wrote:
| Not sure about putting DBs on CephFS directly, but Ceph RBD can
| definitely run RDBMS workloads.
|
| You need to pay attention to the kind of hardware you use, but
| you can definitely get Ceph down to 0.5-0.6 ms latency on block
| workloads doing single thread, single queue, sync 4K writes.
|
| Source: I run Ceph at work doing pretty much this.
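|
| A minimal sketch of that kind of measurement using the Python
| librbd bindings (the pool name "rbd" and a pre-created test image
| "latency-test" are assumptions; an fio 4K, queue-depth-1 sync
| write job is the more usual tool for this):
|
|       import time, rados, rbd
|
|       cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
|       cluster.connect()
|       ioctx = cluster.open_ioctx('rbd')         # assumed pool name
|       image = rbd.Image(ioctx, 'latency-test')  # assumed test image
|       data = b'\x00' * 4096
|       lat = []
|       for i in range(1000):
|           t0 = time.perf_counter()
|           # 4 KiB writes at rotating offsets; assumes the image is
|           # at least a few MiB and 4 KiB-aligned in size
|           image.write(data, (i * 4096) % image.size())
|           image.flush()            # push the write past the librbd cache
|           lat.append(time.perf_counter() - t0)
|       lat.sort()
|       print('p50 %.2f ms  p99 %.2f ms'
|             % (lat[len(lat) // 2] * 1e3, lat[int(len(lat) * 0.99)] * 1e3))
|       image.close(); ioctx.close(); cluster.shutdown()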
| patrakov wrote:
| It is important to specify which kind of latency percentile
| this is. Checking on a customer's cluster (made from 336 SATA
| SSDs in 15 servers, so not the best one in the world):
|             50th percentile = 1.75 ms
|             90th percentile = 3.15 ms
|             99th percentile = 9.54 ms
|
| That's with 700 MB/s of reads and 200 MB/s of writes, or
| approximately 7000 reads IOPS and 9000 writes IOPS.
| louwrentius wrote:
| These numbers may be good enough for your use case, but relative
| to what's possible with SSDs they aren't great. Please note, I
| mean well. Still a cool setup.
|
| I'd like to see much more latency consistency, with the 99th
| percentile even sub-millisecond. You might want to set a latency
| target with fio and see how much load is possible before the 99th
| percentile hits 1 ms.
|
| However, it's all about context, and depending on the workload
| your figures may be totally fine.
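|
| (For what it's worth, fio has latency_target, latency_window and
| latency_percentile options for exactly this kind of search --
| roughly, it probes increasing queue depths and reports the
| highest load at which the chosen percentile still meets the
| target, e.g. latency_target=1ms with latency_percentile=99.)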
| rafaelturk wrote:
| I'm playing a lot with MicroCeph. It's an opinionated, low-TOS,
| friendly setup of Ceph. Looking forward to additional comments.
| Planning to use it in production and replace lots of NAS servers.
| louwrentius wrote:
| I think Ceph can be fine for NAS use cases, but be wary of
| latency and do some benchmarking. You may need more nodes/osds
| than you think to reach latency and throughput targets.
| hinkley wrote:
| Sure would be nice if you defined some acronyms.
| nghnam wrote:
| My old company ran public and private clouds with OpenStack and
| Ceph. We had 20 Supermicro storage nodes (24 disks per server)
| with a total capacity of 3PB. We learned some lessons, especially
| that a flapping disk degraded the performance of the whole
| system. The solution was to remove the disk with bad sectors as
| soon as possible.
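|
| (For anyone hitting the same issue: the usual sketch, assuming
| the failing drive backs OSD 42, is to watch per-OSD latency with
| "ceph osd perf", mark the suspect out with "ceph osd out 42" so
| its placement groups backfill elsewhere, and then purge the OSD
| once it has drained.)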
| amadio wrote:
| Nice article! We've also recently reached the mark of 1TB/s at
| CERN, but with EOS (https://cern.ch/eos), not ceph:
| https://www.home.cern/news/news/computing/exabyte-disk-stora...
|
| Our EOS clusters have a lot more nodes, however, and use mostly
| HDDs. CERN also uses ceph extensively.
| theyinwhy wrote:
| Great! What's your take on ceph? Is the idea to migrate to EOS
| long term?
| amadio wrote:
| EOS and ceph have different use cases at CERN. EOS holds
| physics data and user data in CERNBox, while ceph is used for
| a lot of the rest (e.g. storage for VMs, and other
| applications). So both will continue to be used as they are
| now. CERN has over 100PB on ceph.
| ComputerGuru wrote:
| Is there a reason you run both and don't converge on one or
| the other?
| brobinson wrote:
| I'm curious what the performance difference would be on a modern
| kernel.
| PiratesScorn wrote:
| For context, I've been leading the work on this cluster client-
| side (not the engineer that discovered the IOMMU fix) with
| Clyso.
|
| There was no significant difference in testing between the latest
| HWE kernel on Ubuntu 20.04 and kernel 6.2 on Ubuntu 22.04. In
| both cases we ran into the same IOMMU behaviour. Our tooling is
| all very much built around Ubuntu, so testing newer kernels
| with other distros just wasn't feasible in the timescale we had
| to get this built. The plan was < 2 months from initial design
| to completion.
|
| Awesome to see this on HN. We're a pretty under-the-radar
| operation so there's not much more I can say, but I'm proud to
| have worked on this!
| kylegalbraith wrote:
| This is a fascinating read. We run a Ceph storage cluster for
| persisting Docker layer cache [0]. We went from using EBS to Ceph
| and saw a massive difference in throughput. Went from a write
| throughput of 146 MB/s and 3,000 IOPS to 900 MB/s and 30,000
| IOPS.
|
| The best part is that it pretty much just works. Very little
| babysitting with the exception of the occasional fs trim or
| something.
|
| It's been a massive improvement for our caching system.
|
| [0] https://depot.dev/blog/cache-v2-faster-builds
| guywhocodes wrote:
| Did something very similar almost 10 years ago: EBS cost 10x+
| what a Ceph cluster with the same performance on the nodes' local
| disks did. Eventually we switched to our own racks and cut costs
| almost tenfold again. We developed the in-house expertise for how
| to do it and we were free.
| louwrentius wrote:
| I wrote an intro to Ceph[0] for those who are new to it.
|
| It was briefly featured in a recent Jeff Geerling video :-)
|
| [0]: Understanding Ceph: open-source scalable storage
| https://louwrentius.com/understanding-ceph-open-source-scala...
| justinclift wrote:
| Has anything important changed since 2018, when you wrote that?
| :)
| louwrentius wrote:
| Conceptually not as far as I know.
| mobilemidget wrote:
| Cool and interesting benchmark; however, it would have read a lot
| better if abbreviations were explained at first use. Not
| everybody is familiar with all the terminology used in the post.
| Nonetheless, congrats on the results.
| markhpc wrote:
| Thanks (truly) for the feedback! I'll try to remember for
| future articles. It's easy to forget how much jargon we use
| after being in the field for so long.
| up2isomorphism wrote:
| This is an insanely expensive cluster built to show a benchmark.
| A 68-node cluster serving only 15TB of storage in total.
| PiratesScorn wrote:
| The purpose of the benchmarking was to validate the design of
| the cluster and to identify any issues before going into
| production, so it achieved exactly that objective. Without
| doing this work a lot of performance would have been left on
| the table before the cluster could even get out the door.
|
| As per the blog, the cluster is now in a 6+2 EC configuration
| for production which gives ~7PiB usable. Expensive yes, but
| well worth it if this is the scale and performance required.
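|
| (Quick sanity check on that figure, assuming the article's spec
| of 68 nodes with 10 x 15.36 TB NVMe drives each:
|
|       raw_tb = 68 * 10 * 15.36         # ~10,445 TB raw
|       usable_tb = raw_tb * 6 / 8       # 6+2 EC keeps 6 of every 8 chunks
|       print(usable_tb * 1e12 / 2**50)  # ~7.0 PiB
|
| which lines up with the ~7PiB quoted above, before any other
| overhead.)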
| francoismassot wrote:
| Does someone know how Ceph compares to other object storage
| engines like MinIO/Garage/...?
|
| I would love to see some benchmarks there.
| matesz wrote:
| It would be great to have a universal benchmark of all the
| available open source solutions for self-hosting. Links
| appreciated!
___________________________________________________________________
(page generated 2024-01-20 23:01 UTC)