[HN Gopher] Scaling Databases at Activision [pdf]
___________________________________________________________________
Scaling Databases at Activision [pdf]
Author : ksec
Score : 158 points
Date : 2023-04-21 16:09 UTC (6 hours ago)
(HTM) web link (static.sched.com)
(TXT) w3m dump (static.sched.com)
| erksch wrote:
| It would have been interesting to know what alternatives they
| considered. Perhaps this would have been easy to solve using
| Nomad and TiDB.
|
| So it's a story of people who used Kubernetes and MySQL and
| continued to use Kubernetes and MySQL. The end.
| kokizzu2 wrote:
| yup lots of companies do agree https://github.com/kokizzu/list-
| of-tech-migrations
| bickett wrote:
| This is an interesting discussion on database performance and
| scaling. Thanks for sharing!
| achanda358 wrote:
| What kind of queries per second did you see while running on
| kubernetes, compared to VMs?
| rektide wrote:
| 3 months from start to roll out for tbeir large scale deploy is
| hella impressive.
|
| Just trying to do basic scaling or other simple tasks sounded
| like an utter nightmare on their old system. Having autonomic
| computing at their back seems like an obvious win. Tell it your
| desired state & let the controller do the job.
|
| The new paradigm for computer operations is so wonderful.
| endur wrote:
| The timing slide caught my attention as well. 3 months for
| anything is fast...
| cpressland wrote:
| Ahh, I wanted to go see this talk at KubeCon yesterday but the
| room was completely full. Glad to see the finer points got posted
| here.
| brycethornton wrote:
| They have a PlanetScale logo on the last slide. Is this just to
| say "Thanks for your work on Vitess!" or are they using the
| PlanetScale service for some of this?
| jtchang wrote:
| I wonder what game this was for.
|
| To me this is validation that going the SQL route in almost 99%
| of new apps is the right way to go. It will be a rare case that
| you won't be able to scale out given how mature some of these
| technologies are.
| chillydawg wrote:
| Demonware handle the online services for the CoD franchise.
| VirusNewbie wrote:
| Github had DAYS of an outage because they unsuccessfully tried
| to scale out SQL rather than dealing with NoSql.
|
| There are very few things that structured NoSql can't do if you
| ignore reporting. Once you scale out traditional SQL, you
| aren't going to be using it for reporting either way though.
| mh- wrote:
| This is a gross mischaracterization of what caused that
| outage, even just based on the public information from the
| (excellent) blog post [0] about it.
|
| Automating _master failover_ on MySQL without a human in the
| loop - _in the topology GitHub used back then_ - is risky.
|
| 0: https://github.blog/2018-10-30-oct21-post-incident-
| analysis/
| LunaSea wrote:
| Anyone have more blog posts about video game backend tech like
| this one?
| geenat wrote:
| These numbers make more sense on hardware from 5 years ago (to be
| fair, what they were dealing with), but postgres, on modern
| consumer hardware (Yes, consumer! Ex: KC3000 + Ryzen 7950x),
| already does over 50k QPS without being overburdened. You could
| probably squeeze out 100k QPS with faster NVME's.
|
| That's like 10 tall servers for their peak QPS? (500k QPS on
| slide 19)
|
| Lots of natural sharding points here too: company, game, usage of
| data, etc.
|
| This is assuming you avoid or go light on expensive features like
| foreign keys (as Vitess already does).
|
| The "scale magic" in Spanner (Big Table) inspired DB's are just
| hidden automation of traditional sharding-
|
| CockroachDB: Sharded indexes, then runs map reduce for you.
|
| Scylla/Cassanda: Sharded indexes again, but more limitations for
| speedups: Eventually consistent. You don't have fast delete- only
| update (discord uses tombstones). JOIN's are "in app" only.
|
| Vitess: Proxy that dismantles/routes your query to the correct
| server. This scales, but is eventually consistent. JOIN's on co-
| located data, or "in app".... in simple terms, Vitess is like an
| externally managed "in app" query parser/router.
| winrid wrote:
| Mongo's sharding also uses routers like Vitess but can be
| strongly consistent depending on connection and write settings.
|
| Data is divided into chunks and the routers route queries based
| on where your query lands on chunk ranges.
| aeyes wrote:
| I see 500k qps with 30TB data doubling every 2 years on slide
| 19.
|
| In my experience me this isn't close to being possible with a
| single Postgres box. There is no way around sharding this
| workload.
|
| Vitess has different consistency models, it's not as eventually
| consistent as you describe.
| geenat wrote:
| You'd certainly want to shard from the start in any case, but
| it's an order of magnitude less postgres boxes for the QPS.
| leetrout wrote:
| Wonder what the tradeoff on connection overhead would be?
| I've never been around a connection pooler that would be
| anywhere near this load.
| leetrout wrote:
| When you give up one of the major components of relational
| databases like foreign keys you pick up lots of performance
| opportunity.
|
| It is amazing what Vitess / PS will help you accomplish but they
| don't talk a lot about tradeoffs you make to get there (which are
| similar to what you face with sharding MySQL without additional
| tooling).
| berkle4455 wrote:
| Foreign keys are so overrated. Write good code and you don't
| have to rely on the DB to enforce referential integrity. It's
| the equivalent of running test cases with every db write.
| AdrianB1 wrote:
| From experience you are very right. Also from experience,
| most teams are not good enough to write the code with that
| quality that you can give up to referential integrity.
| vecter wrote:
| I'm curious, how would you obviate the need for foreign keys
| with "good" code? Can you provide a toy example or a
| reference to an article so I can understand better? I've used
| NoSQL databases a long time ago and currently rely on on good
| ole PostgreSQL, but I'm having a hard time understanding how
| "good" code can be a better solution for managing
| relationships between data than a foreign key.
| Yoric wrote:
| I guess you can use indices instead of foreign keys? And
| somehow implement all the `ON DELETE CASCADE` manually
| within any transaction that removes the original row? Not
| sure how it's "good" code but it could be faster.
| DasIch wrote:
| A foreign key doesn't necessarily imply an index. If you
| are using postgres, you would have to add an index in
| addition to the foreign key, if you want one.
|
| On delete cascade, depending on how many rows it cascades
| to, can be problematic because it's a very long running
| blocking operation. That's something one might want to do
| as a background operation and in batches. Although that
| won't make it faster.
| sroussey wrote:
| Faster in aggregate throughput can often be different
| from faster for a specific operation.
|
| Personally, I find delete on cascade dangerous. I mean,
| lots of fun for a pen tester, sure...
| berkle4455 wrote:
| > for managing relationships between data than a foreign
| key.
|
| You're conflating the concept of a normalized database with
| insanely slow DB-enforced referential integrity/foreign
| keys.
|
| Toy example? Sure. Take a well-formed 3NF schema and
| disable foreign key constraints.
| AlisdairO wrote:
| How does this stay correct in the presence of concurrent
| activity?
| berkle4455 wrote:
| Using transactions or UUID/ULIDs though maybe I'm
| misunderstandingyour question. How do foreign key
| constraints help with concurrency?
| AlisdairO wrote:
| Table User: userid, etc
|
| Table Resources: resourceid, userid, etc
|
| If I want to restrict deletion of a user to only be
| possible after all the resources are deleted, I'm forced
| into using higher-than-default isolation levels in most
| DBs. This has significant performance implications. It's
| also much easier to make a mistake - for example, if when
| creating a resource I check that the user exists prior to
| starting the transaction, then start the tran, then do
| the work, it will allow insertion of data into a
| nonexistent user.
| vecter wrote:
| Sorry, assume I'm dense.
|
| > Take a well-formed 3NF schema and disable foreign key
| constraints.
|
| I'm familiar with 3NF, but can you expand on how 3NF
| enables you to remove foreign keys? Or feel free to point
| me to an article/blog, I don't want to waste your time if
| it's too much to explain. I did some googling but wasn't
| sure where to proceed from your post.
| Philip-J-Fry wrote:
| They're not saying 3NF enables you to remove foreign
| keys. They are talking about removing foreign key
| constraints from your RDBMS of choice.
|
| Something like SQL Server can enforce foreign key
| constraints if you explicitly tell it your relationships
| between tables. The downside is that having this
| referential integrity costs you performance as the
| database has to check your relations when
| inserting/updating/deleting rows. E.g. checking that a
| foreign key is pointing at a valid primary key, checking
| that you aren't leaving invalid foreign keys when
| deleting a primary key, etc. This is to prevent you
| inserting bad data into the database.
|
| You can delete these constraints and still have the exact
| same behavior _so long as your code is correct_. It just
| means that the database isn 't going to stop you writing
| bad data.
| vecter wrote:
| That makes sense. For workloads where write performance
| isn't very important but read performance is, this sounds
| like it may still be a worthwhile tradeoff to have that
| extra level of data integrity.
|
| My sense is that many typical CRUD apps aren't writing
| gargantuan volumes of data or making very complex edits,
| and if they do, it's ok if it takes a second longer.
| Usually read speed is more of a bottleneck for user-
| facing applications, but I'm sure there are probably some
| examples where this tradeoff is worth it.
| berkle4455 wrote:
| Say you have a super basic setup: user =
| {user_id, email} order = {order_id, user_id}
|
| order.user_id is a foreign key to user.user_id. That's a
| perfectly valid and reasonable way to organize things.
|
| Enabling RDBMS-enforced foreign key constraints is the
| issue. It slows everything down dramatically.
| vecter wrote:
| I see, thanks. Basically just store the foreign keys
| yourself as columns in relevant tables and perform the
| joins in SQL without having the DB enforce FK integrity
| with every insertion/update/delete. Would it be fair to
| say that read speeds are unaffected by this?
| berkle4455 wrote:
| Yes. Quite literally the only difference is not enabling
| foreign key constraints. Read speeds unaffected correct.
| kgeist wrote:
| "Write good code and you won't have bugs" :)
|
| I think it's good practice to enforce consistency rules both
| in the DB and in code. If you make a mistake in your code,
| the DB won't allow it, and vice versa.
| vp8989 wrote:
| In theory it is good practice, but in practice enforcing
| referential integrity is not free and it makes satisfying
| other non-functional requirements harder. Engineering is
| about tradeoffs. Foreign keys are overkill in modern
| architectures where DBs/tables are typically only written
| to by a single application. They solve a problem that lots
| of people don't really have anymore.
| kgeist wrote:
| In the legacy project our team inherited, there's a lot
| of data consistency issues stemming from the lack of
| foreign keys (we're adding them back now). Data
| consistency is important because we deal with financial
| data. Typically, code and data change more often than DB
| schemas, so every release has a chance that someone will
| forget to update all relations manually in code,
| introducing dangling references, for example. Foreign
| keys help as an additional safety measure, because they
| don't change as often. I think you should care about the
| performance of foreign keys only when speed is more
| important than data consistency. It's a good default.
| sroussey wrote:
| Personally, I thinks foreign keys are great in dev and
| bad at any scale that you have many db servers.
|
| And as you point out, there are exceptions, like
| financial data. But not marketing funnels where you might
| throw everything away.
|
| You should probably have different types of engineers
| working on such different projects as well.
| baronvonsp wrote:
| The exact analogy that came to mind for me.
|
| Plus, relational databases don't just sit under a single
| application. There's usually multiple applications/services
| talking to them. Worse, humans connect to them an do all
| sorts of things they shouldn't do. That's the whole point
| of managing referential integrity in the DBMS, since you
| can only control "just write good code" across so many
| application domains.
|
| Of course whether the performance tradeoff is worth it is a
| complicated decision for many of the reasons people have
| mentioned. But in 20 years of working with relational
| databases at big companies, I've seen few examples where
| the performance win exceeded the business risk.
| candiddevmike wrote:
| Things can be really fast when you throw out referential
| integrity
| _a_a_a_ wrote:
| Databases are my interest, so if you don't mind... How would
| removal of foreign keys speed up read access?
|
| I'm also going to disagree with you a little, again on the
| issue of reading, if you have foreign keys then you can do
| optimisations like cutting out chunks of joins. Adding
| constraints means you know more which you can feed to the
| optimiser which means often you can get better performance
| (read performance, that is).
| robmccoll wrote:
| Not OP, but they said that you pick up performance, not
| specifically in reading. The performance you gain is larger
| aggregate write throughput (and read generally - although
| reads that require spreading queries across shards and
| aggregating results will likely perform a little worse). You
| also gain scale in that you have turned something that was
| limited by scale up into something limited by scale out
| (while giving up some data guarantees and performance on some
| queries).
|
| As to why you can't really have foreign keys, basically,
| sharding schemes like this sit at an intermediate layer
| between the application and the various DB clusters that are
| the shards. You CAN have strong data consistency (and useful
| foreign keys) within a shard because it's all inside a single
| database; however across shards, you CAN NOT. The sharding
| layer doesn't perform checks for you, so if you have two
| logical tables that are partitioned differently across the
| shards, you can't have a foreign key that will be enforced
| correctly as the foreign key of a given row in one table may
| live on a different shard in the other table. The local
| database within the shard would reject the insert.
| Transactions across shards can also be tricky to impossible.
| _a_a_a_ wrote:
| In my experience most databases are read-heavy so that's
| what I was asking about. That said, thanks for a clear
| explanation.
| samlambert wrote:
| We have foreign keys coming soon!
| angarg12 wrote:
| Interesting slide deck, but I feel we are missing some of the
| most interesting details out. Particularly, slides 24 and 25
| could be an entire article on their own.
|
| * How did you tune Vitess for resiliency? What were the tradeoffs
| and how was the performance?
|
| * How did you migrate from shared in app config to single DB
| endpoint? (this is an issue we are facing right now)
|
| * What do you mean by some queries being too shard aware? How did
| you optimize queries for efficient routing with Vitess?
| super256 wrote:
| I assume Activision _actually_ knows how to scale databases, but
| has "join queues" for hype? I think everyone knows the infamous
| Blizzard queues in release week, where you have to wait an hour
| with only 5000 people in front of you. Happened recently with the
| OW2 release.
| mh- wrote:
| Databases are not the only contended-for resource in online
| games.
|
| Those queues are usually the result of intentional capacity
| planning. Overprovisioning is expensive.
| super256 wrote:
| > Those queues are usually the result of intentional capacity
| planning. Overprovisioning is expensive.
|
| I'd imagine they could spin up new servers on demand.
| Arrath wrote:
| Could, but after near 20 years of WoW and however many
| expansion launches with big ole wait queues each time, they
| really don't seem to bother.
| rejectfinite wrote:
| Gaming is insane that it even works online in near realtime. Even
| moreso before. You can buy old wow servers. That was a world.
| alecco wrote:
| > * SQL query compatibility > * Minimal changes to the
| application > * Runs MySQL in backend > *
| Kubernetes native > * Provides Kubernetes operator
| > We evaluated multiple candidates and chose Vitess
|
| https://github.com/vitessio/vitess
|
| But it's very light on numbers and doesn't show trade-offs or
| anything.
| edf13 wrote:
| > Vitess has been a core component of YouTube's database
| infrastructure since 2011, and has grown to encompass tens of
| thousands of MySQL nodes.
|
| Would love to hear more about this implementation
| tbarn wrote:
| KubeCon talks should be out in the next month or so and it has
| more details.
| Thaxll wrote:
| Vitess was used for all of Youtube, they moved to Spanner
| couple of years ago, Vitess is proven and scalable tech.
___________________________________________________________________
(page generated 2023-04-21 23:00 UTC)