[HN Gopher] Cassandra at Apple: 1000s of Clusters, 300k Nodes, 1...
___________________________________________________________________
Cassandra at Apple: 1000s of Clusters, 300k Nodes, 100 PB
Author : mfiguiere
Score : 72 points
Date : 2022-10-07 17:46 UTC (1 day ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| notacoward wrote:
| The operational complexity of managing thousands of clusters must
| be mind boggling. I've been on two projects managing dozens of
| storage clusters, the second with more data than this in some
| individual clusters and adding up almost this many total nodes.
| There were technical problems with scaling up to 10K nodes per
| cluster, but the _operational_ issues mostly scaled according to
| number of clusters. For example, how many alerts per hour/day
| can you stand? Too many and you're overwhelmed; too few and you
| miss stuff. Walking that fine line became successively more
| difficult as clusters were added. Same thing with graphs and
| dashboards. Also, when your storage and IOPS are siloed this much
| you have no elasticity, so you're going to be chasing capacity or
| load problems much more often. On the plus side, this probably
| means each tenant has their own cluster, so you don't have so
| many worries about them affecting each other.
|
| The big question I'd have is: how many _people_ (including on
| client teams) does it take to manage this much sprawl?
| jhgg wrote:
| At this scale you automate away 99.9% of the things you respond
| to.
|
| We are not even near this scale yet at my place of work, and we
| are moving towards this methodology of strong automation to
| orchestrate a cluster. We hire software engineers to operate
| our database clusters, and the expectation is to somewhat be
| selfishly motivated to write programs to remediate issues so
| you don't get paged constantly. We do not expect to grow our
| headcount proportionally to the number of clusters or nodes we
| operate.
|
| You must treat your nodes like cattle not pets. If a node
| fails, automation kicks in and re-bootstraps it. It is not
| worth figuring out how to nurse it back to health. When you are
| performing rolling or scale up operations on the cluster you
| are just invoking automation to do everything for you.
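As a rough illustration of the "cattle not pets" approach described above, here is a minimal sketch of a remediation planner. The `Node` type, the `plan_remediation` function, and the action names are all hypothetical and not taken from any real orchestration tool; the sketch only decides what to do and executes nothing.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool


def plan_remediation(nodes, desired_count):
    """Return a list of (action, target) steps; nothing is executed here."""
    # Unhealthy nodes are replaced outright rather than debugged in place.
    actions = [("replace", n.name) for n in nodes if not n.healthy]
    # If the cluster is below its desired size, bootstrap the difference.
    missing = max(0, desired_count - len(nodes))
    actions += [("bootstrap", f"new-{i}") for i in range(missing)]
    return actions


cluster = [Node("a", True), Node("b", False)]
plan = plan_remediation(cluster, desired_count=3)
print(plan)
```

A real system would also need rate limits and safety checks (for example, never replacing several replicas of the same data at once), which this sketch leaves out.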
| redanddead wrote:
| They for sure have in-house software to handle these nodes. The
| software may be pretty good at what it does, considering they
| know what their own DBs are susceptible to. That might reduce a
| whole bunch of the human management, so the team they need may
| be not that big but likely highly skilled.
| magnawave wrote:
| One of the hard parts of quantifying that is that you have people
| who wear many hats. So sure you have Cassandra gurus, and
| probably a decent number of them. But this is in the league
| where really hardcore automation kicks in to keep staffing
| sane, and operations possible. But outside that, how do you
| count datacenter folks, client folks, networking folks, etc who
| only spend a little fraction of time on the database parts.
|
| But I think I can say with sufficient knowledge, all things
| considered still, "way fewer than you might think".
| iampims wrote:
| I'd be curious to know what the orchestration platform looks like
| for running 300,000 nodes and 1,000s of clusters.
| jackblemming wrote:
| Unrelated but does anyone else get tired of the fetishization of
| BIG NUMBER? I don't care if Facebook has billions of users if
| it's a hot pile of garbage. I don't care if some game has
| millions of players if it's bad. When did BIG NUMBER overtake
| quality and can we go back?
| eurleif wrote:
| Quantity of users is social proof (more accurately, social
| evidence) of quality. The claim "we have high quality" is
| cheap: anyone can make it, and it's subjective enough that it
| can't be disproven in any absolute sense. But if you lie about
| having billions of users, you can be called out on that pretty
| easily; and if you say it honestly, it implies that billions of
| people like the quality of your product or service enough to
| use it. Billions of people can be wrong, but saying you have
| billions of users is still much better evidence of quality than
| saying you have quality.
| api wrote:
| > Quantity of users is social proof (more accurately, social
| evidence) of quality.
|
| It can also be evidence of founder effects (JavaScript), lock
| in (Windows), network effects (most social media), etc.
| osigurdson wrote:
| It is a bit like music. One could state that all of the popular
| music is garbage (I might even state that myself some days),
| but in the end, the music purchasing public have spoken.
| Maursault wrote:
| > I don't care if Facebook has billions of users if it's a hot
| pile of garbage.
|
| Shockingly, Apache Cassandra was initially developed at
| Facebook, so at least a hot pile of garbage is good for
| something. Plus, it'll keep you warm, so that's two things.
| chomp wrote:
| BIG NUMBER implies a dedication to the care and feeding. It's a
| nudge and a wink for "come work for us"
|
| Number of users advertisement is more for investors, and maybe
| a broadcast for FOMO.
|
| There is, however, a tendency in our field to look for systems
| design patterns that handle big N numbers, and apply those to
| little N platforms, but I believe deep down that this is
| business and management dysfunction, as systems refactoring is
| tolerated even less than code refactoring.
| latchkey wrote:
| ¯\_(ツ)_/¯, I ran over 100k GPUs... it was fun. You sound
| jelly.
| redanddead wrote:
| Sounds fun, do tell
| faizshah wrote:
| Interesting, I thought they were trying to switch over to
| FoundationDB but looks like their Cassandra usage keeps growing.
| onesociety2022 wrote:
| Apple acquired FoundationDB but from what I have heard it's not
| really used much. The FoundationDB founders have left Apple and
| are working on other things. Cassandra is the main datastore
| for iCloud data.
| api wrote:
| I keep hearing people talk about how hard Cassandra is to run
| properly and how many people get it wrong etc. Is there anything
| to this or is it just FUD and people who genuinely don't know
| what they are doing?
| [deleted]
| jhgg wrote:
| Our biggest issue with running Cassandra was related to
| pathological read / write patterns by some tenants on our
| system causing outsized availability impact due to triggering
| garbage collection pressure that would cause whole node GC STW
| pauses and severe tail latency / query degradation.
|
| We have solved these issues in a few ways, mainly:
|
| - working with the relevant product teams to implement
| appropriate rate limiting or improving data modeling.
|
| - introducing our own query layer, written in Rust that sits in
| front of Cassandra that uses a form of micro-caching called
| read coalescing, and also other forms of query throttling/load
| shedding to reduce work the database must do for hot
| keys/pathological patterns of access. We expose a gRPC
| interface from this, which lets us centralize control of
| the client driver and tune it appropriately, while also getting
| to leverage the ever-growing open source gRPC traffic routing
| solutions (Envoy, etc...)
|
| and ultimately,
|
| - switching to ScyllaDB, a C++ rewrite of Cassandra which is of
| course devoid of any garbage collection issues, and features
| faster overall performance and lower latencies.
|
| Scylla, however, is not without its own set of issues - and
| somewhat strict hardware requirements[0] thanks to the seastar
| engine it is built on top of. Their team however has been
| delightful to work with, and our platform is markedly more
| stable in current year than it was in years past thanks to the
| above factors.
|
| Operationally, however, Scylla and Cassandra are quite easy to
| run; the trickiest part is repairs. Operations such as cluster
| expansion or node replacement are so common that they are at
| this point mundane. Be wary, however, of the read/write
| amplification issues inherent to LSM-tree databases: choosing
| the correct compaction strategy and tuning it appropriately can
| be quite key. Additionally, tombstones can be quite bad for
| performance.
|
| In current day we offer a new more generic solution that sits
| on top of scylla (it would work with Cassandra too) that
| provides a simple interface to query KKV based data, without
| having to worry too much about problems like large partitions,
| hot keys, or tombstones! With a design like this, the
| underlying cluster thus far has been issue free and very easy
| to operate.
|
| [0]: https://discord.com/blog/how-discord-supercharges-network-di...
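The read coalescing mentioned in the comment above can be sketched roughly as follows. This is not the commenter's implementation (theirs is a Rust service exposing gRPC); it is a small in-process Python illustration, and `ReadCoalescer`, `slow_fetch`, and the key name are hypothetical. The idea shown: while one read for a key is in flight, concurrent readers of the same key wait for that result instead of issuing their own query.

```python
import threading
import time
from concurrent.futures import Future


class ReadCoalescer:
    """Collapse concurrent reads of the same key into one backend call."""

    def __init__(self, fetch):
        self._fetch = fetch          # function(key) -> value; hits the database
        self._lock = threading.Lock()
        self._in_flight = {}         # key -> Future shared by all waiters

    def get(self, key):
        with self._lock:
            fut = self._in_flight.get(key)
            leader = fut is None
            if leader:
                # First caller for this key becomes the leader.
                fut = Future()
                self._in_flight[key] = fut
        if leader:
            try:
                fut.set_result(self._fetch(key))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                # Later reads start a fresh fetch; nothing is cached long term.
                with self._lock:
                    del self._in_flight[key]
        return fut.result()


calls = {"n": 0}


def slow_fetch(key):
    # Stand-in for a database query that takes a while.
    calls["n"] += 1
    time.sleep(0.3)
    return f"value-for-{key}"


coalescer = ReadCoalescer(slow_fetch)
barrier = threading.Barrier(8)
results = []


def worker():
    barrier.wait()                   # release all readers at about the same time
    results.append(coalescer.get("hot-key"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("backend calls:", calls["n"], "responses:", len(results))
```

Because the entry is removed as soon as the fetch completes, this only protects against bursts of simultaneous reads for a hot key; it is not a general-purpose cache.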
| achillean wrote:
| We store a few PB of data in Cassandra, have used it for nearly
| 10 years and in my opinion it's not that hard. Operationally
| it's way easier to manage than Elastic and most other databases
| (e.g. PostgreSQL, MongoDB), plus there's a ton of documentation
| available to help you debug/benchmark your cluster. Note that
| even though CQL looks similar to SQL, it's important to
| understand the differences; as with any new technology,
| there's a learning curve. I would strongly recommend checking
| out C* if you need a database with high write throughput and
| that needs to scale out.
| pmcf wrote:
| It's a distributed system, and if you have been a DBA for a
| single system like Oracle or MySQL there are a lot of new
| competencies to learn. That being said, completely doable and
| it's typical to see small teams running massive amounts of
| Cassandra. At the same conference, Bloomberg talked about their
| large Cassandra footprint with only 4 people. If you want to
| run Cassandra in K8s there is the K8ssandra project that
| automates a lot. It's a fast growing project as a result.
| (http://k8ssandra.io) If you want to use Cassandra and not run
| it, http://astra.datastax.com. One click and a few seconds, and
| you get a completely serverless version of Cassandra where you
| only pay for what you use. I'm sure we will hear a lot more of these
| stories at Cassandra Summit in March
| (http://cassandrasummit.org)
| [deleted]
| [deleted]
| rektide wrote:
| I'd love info on how much Apple contributed to Cassandra!
| _benedict wrote:
| I'm sure Scott's talk went into detail about this, but I can
| safely say that his team contributes a great deal to Cassandra
| tpmx wrote:
| 300k Cassandra nodes seems a bit over the top even for a company
| with as many active devices as Apple.
|
| https://www.theverge.com/2022/1/28/22906071/apple-1-8-billio...
|
| 1.8B active devices / 300k nodes = (just) 6k devices per
| Cassandra node
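The back-of-the-envelope division above can be checked in a couple of lines. The inputs are the figures quoted in the thread (1.8B active devices from the linked article, 300k nodes from the headline); the variable names are just for illustration.

```python
active_devices = 1_800_000_000   # 1.8B, as cited in the comment above
cassandra_nodes = 300_000        # from the headline

devices_per_node = active_devices // cassandra_nodes
print(devices_per_node)          # 6000
```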
| daniel-grigg wrote:
| Or it tells us something of how much data is being scooped up
| per device. Certainly when I look through the raw health data
| collected it's quite alarming and I'm sure that's just a drop
| in the ocean.
| ezfe wrote:
| Well, Health data can be uploaded to iCloud (CloudKit), but
| it's End-to-End encrypted so not really a concern.
|
| Unlike other data in iCloud, if you lose your devices you
| lose your HealthKit data. This is not true for photos or
| emails, for example - which you keep if you lose your
| devices.
| mwint wrote:
| Why do you think the raw health data is getting sucked off
| your device? That would be totally off brand for them.
|
| Apple does have a separate opt-in "Research" program to
| facilitate this kind of thing.
| faeriechangling wrote:
| Regardless of their current brand, Apple is the next big
| advertising giant and no amount of brand purity is going to
| change this. The data of Apple's users is simply of too
| high value for Apple to ignore forever.
| tpmx wrote:
| Makes me think of that first decade (98-08) when Google
| actually wasn't being evil. Yeah, it's inevitable that
| Apple will turn to this when they can't grow any more
| simply by raising the prices of their devices. Perhaps
| they have reached that point about now...
| smoldesu wrote:
| It's also off-brand for Apple to join PRISM and comply with
| thousands of annual requests for supposedly-inaccessible
| iCloud data. Neither of you will ever be proven right until
| we look inside those servers though, so making _any_
| conclusive statements is a mistake. Apple designed
| Schrödinger's datacenter.
| prange wrote:
| threeseed wrote:
| Apple has a lot more data than just a list of devices.
|
| There is everything from Weather to Siri to Store Purchases
| etc.
|
| And companies will syndicate data sets to different teams for
| performance and security reasons, i.e. lots of duplication.
| tpmx wrote:
| > Apple has a lot more data than just a list of devices.
| [...]
|
| Of course. That is not the point here.
| echelon wrote:
| Perhaps you'd be better convinced with a service breakdown.
|
| Breaking monoliths into service boundaries yields easier
| ownership, maintenance, migration, and resilience.
|
| One "tiny" company with a few verticals can be comprised of
| thousands of microservices, each handling their own
| dedicated objective. Authentication, reverse proxy, API
| gateway, SMS, email, customer list, marketing email
| gateway, CMS for marketers on product X, feature flags,
| transaction histories, GDPR compliance handling, billing
| intelligence, various risk models, offline ML risk
| enrichment, etc. etc. Each will have its own data needs and
| replication / availability needs.
|
| This Apple number might seem crazy, but I'm not fazed by
| it. I can picture it.
| tpmx wrote:
| I can also picture it, but not really in the way you're
| outlining it.
|
| It's a sad and very inefficient picture though. Apple
| does not _need_ this much data processing. It's a
| grotesque amount per device. Or maybe they're just
| wasting insane amounts of energy doing lots and lots
| of stupid analytics...
| echelon wrote:
| Sometimes things have to be built as layered abstractions
| in order for humans to reason about them at scale.
|
| See also the natural stochastic gradient ascent that
| produced our crazy complicated metabolic pathways (and
| all of biology).
| [deleted]
| Luker88 wrote:
| Couple of things as always:
|
| Cassandra works really badly with fat nodes (lots of data on one
| node), and much much better with a lot of small nodes, and 100PB
| with 300K nodes confirms this. Scylla scales better vertically,
| but I don't know how much.
|
| Some comments are already comparing this to pgsql/mysql/whatever.
| Please don't. You can't make the same queries even though the
| language seems to support it.
|
| Cassandra is good at ingesting data, bad at deleting, really
| really bad at anything remotely relational. Errors are almost
| pointless.
|
| I'm going to point at an older comment of mine on cassandra:
| https://news.ycombinator.com/item?id=20430925#20432564
|
| The takeaway should be: Yes, cassandra/scylla can be really fast
| and scale a lot. But it is also very probably unusable for your
| use case. Don't trust what the CQL language says you can do.
| Don't get me started on how bad the CQL language is, either.
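To put a number on the "small nodes" point above, here is the average data per node implied by the headline figures. This is a sketch under assumptions: decimal units (1 PB = 10^15 bytes), and the 100 PB treated as the total spread evenly over all 300K nodes. The thread does not say whether that figure includes replicas, and real clusters will not be evenly loaded.

```python
PB = 10**15
GB = 10**9

total_bytes = 100 * PB   # headline figure
nodes = 300_000          # headline figure

avg_gb_per_node = total_bytes / nodes / GB
print(round(avg_gb_per_node))   # 333
```

Roughly a third of a terabyte per node on average, which is small by the standards of current storage servers.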
| bluedino wrote:
| Over 2PB per cluster, thousands of clusters, but only 100's of PB
| of data.
|
| What do they use this for? iCloud storage related stuff?
| candiddevmike wrote:
| I thought Cassandra was bad at storing big files?
| riku_iki wrote:
| you can easily chunk big files.
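A minimal sketch of the chunking idea above: split a blob into fixed-size pieces keyed by (file id, chunk index), which maps naturally onto a partition key plus clustering key. The chunk size, the `to_chunk_rows` and `from_chunk_rows` helpers, and the `"photo-123"` id are illustrative assumptions; no database driver is involved here.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB per chunk, an arbitrary illustrative value


def to_chunk_rows(file_id, data, chunk_size=CHUNK_SIZE):
    """Split a blob into rows of (file_id, chunk_index, chunk_bytes)."""
    return [
        (file_id, index, data[offset:offset + chunk_size])
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


def from_chunk_rows(rows):
    """Reassemble the blob by concatenating chunks in index order."""
    return b"".join(chunk for _, _, chunk in sorted(rows, key=lambda r: r[1]))


blob = bytes(range(256)) * 10_000          # 2,560,000 bytes
rows = to_chunk_rows("photo-123", blob)
print(len(rows), from_chunk_rows(rows) == blob)
```

Note that putting every chunk of a very large file under one partition key would recreate the large-partition problem discussed elsewhere in the thread, so a real schema would probably bucket chunks across several partitions.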
| magnawave wrote:
| Correct, that's what object stores are for. However, metadata
| on said files, is probably very handy to have in a database.
|
| I'm quite sure not all this Cassandra capacity is just
| file/photo metadata storage either.
| xvector wrote:
| They use it for storing iCloud Photos without E2EE while
| heavily marketing privacy
| cassonmars wrote:
| They were moving towards E2EE when everyone freaked out about
| the on-device perceptual hashing trade off.
| smoldesu wrote:
| They should have wheeled out a better marketing spiel than
| "trust us ;)" then.
| AmericanChopper wrote:
| Well... you don't actually need to make a computing device
| automatically report its owner to the authorities for a
| serious crime based on a provably flawed automated process,
| prior to implementing E2EE for a cloud storage
| service. That was simply the strategy that Apple chose to
| pursue. Blaming the users for reacting poorly to this
| strictly anti-user approach is very backwards.
| sneak wrote:
| Or they could just deploy e2e without turning our devices
| into things that spy on us. It's a false dichotomy.
| MBCook wrote:
| Yup. The knee-jerk privacy reaction _cost us_ privacy.
| Gigachad wrote:
| I don't think it's fair to say we need to accept either
| option. Yes, the crime they are trying to stop is
| horrific and something must be done, but that doesn't
| justify unlimited technological spyware.
|
| And the scope for abuse is so large. People in the UK are
| getting arrested for retweeting mean memes; it's pretty
| easy to imagine Google and Apple adding offensive images
| to their scanning and you getting arrested for saving
| something that goes against the current political agenda.
|
| There's also the case where Google locked the account of a
| parent who had taken photos to send to a medical expert.
| redanddead wrote:
| I'm under the belief that Redis is much faster than Cassandra. Am
| I crazy to think that Apple, or any company really, should have a
| transition plan? Why isn't Redis used more?
| mplewis wrote:
| Cassandra solves different problems than Redis is typically
| used for.
| zeristor wrote:
| How comes iTunes movies takes minutes to display a list on Apple
| TV, feels like minutes all the time?
| jleahy wrote:
| Normally you post a question and someone posts an answer. Here
| someone has posted the answer and you have posted the question.
| Are we playing Jeopardy?
| tmpz22 wrote:
| Or to just say it - because the system has absurd complexity
| as proven by the hardware needed to run it.
| ezfe wrote:
| While it doesn't take that long on my device, it probably is
| related to the fact that the iTunes store is over 20 years old
| and has the tech debt to prove it.
| reaperducer wrote:
| _How comes iTunes movies takes minutes to display a list on
| Apple TV, feels like minutes all the time?_
|
| I just fired up the iTunes Movies app on my AppleTV for the
| first time so no cache (I only watch my DVD/Blu-Ray rips), and
| the app started and loaded a full list of movies in a little
| under 3.5 seconds.
|
| If it takes you minutes, it sounds like PEBKAC.
| rkwasny wrote:
| Can be replaced with 300 servers with ScyllaDB :-)
| leetrout wrote:
| Because of no JVM? Or because of its different architecture
| (different caching and such)?
|
| I would expect it to still require more for high availability
| but from what I have heard around ScyllaDB it does seem there
| is a benefit to it over Cassandra.
| neeh0 wrote:
| Relevant post: https://instagram-engineering.com/open-sourcing-a-10x-reduct...
| riku_iki wrote:
| will ScyllaDB shrink data 1k times using some magic?
| [deleted]
| pclmulqdq wrote:
| You might need about 1500. I think 64 TB of flash is the
| standard for a 2U database server these days.
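The estimate above follows from dividing the headline 100 PB by the 64 TB of flash per server quoted in the comment. A quick check, assuming decimal units and ignoring replication, free-space headroom for compaction, and throughput limits (none of which the thread quantifies):

```python
PB = 10**15
TB = 10**12

total_bytes = 100 * PB          # headline figure
flash_per_server = 64 * TB      # per-server capacity quoted in the comment

# Ceiling division: a partially filled server still counts as a server.
servers_needed = -(-total_bytes // flash_per_server)
print(servers_needed)           # 1563

# For comparison, the capacity each machine would need with only 300 servers.
tb_per_server_at_300 = total_bytes / 300 / TB
print(round(tb_per_server_at_300))   # 333
```

So "about 1500" is in line with raw capacity alone, while 300 servers would each need roughly 333 TB.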
___________________________________________________________________
(page generated 2022-10-08 23:00 UTC)