hngopher.com

       [HN Gopher] Scaling Databases at Activision [pdf]
       ___________________________________________________________________
        
       Scaling Databases at Activision [pdf]
        
       Author : ksec
       Score  : 158 points
       Date   : 2023-04-21 16:09 UTC (6 hours ago)
        
 (HTM) web link (static.sched.com)
 (TXT) w3m dump (static.sched.com)
        
       | erksch wrote:
       | It would have been interesting to know what alternatives they
       | considered. Perhaps this would have been easy to solve using
       | Nomad and TiDB.
       | 
       | So it's a story of people who used Kubernetes and MySQL and
       | continued to use Kubernetes and MySQL. The end.
        
         | kokizzu2 wrote:
         | yup lots of companies do agree https://github.com/kokizzu/list-
         | of-tech-migrations
        
       | bickett wrote:
       | This is an interesting discussion on database performance and
       | scaling. Thanks for sharing!
        
       | achanda358 wrote:
       | What kind of queries per second did you see while running on
       | kubernetes, compared to VMs?
        
       | rektide wrote:
       | 3 months from start to roll out for tbeir large scale deploy is
       | hella impressive.
       | 
       | Just trying to do basic scaling or other simple tasks sounded
       | like an utter nightmare on their old system. Having autonomic
       | computing at their back seems like an obvious win. Tell it your
       | desired state & let the controller do the job.
       | 
       | The new paradigm for computer operations is so wonderful.
        
         | endur wrote:
         | The timing slide caught my attention as well. 3 months for
         | anything is fast...
        
       | cpressland wrote:
       | Ahh, I wanted to go see this talk at KubeCon yesterday but the
       | room was completely full. Glad to see the finer points got posted
       | here.
        
       | brycethornton wrote:
       | They have a PlanetScale logo on the last slide. Is this just to
       | say "Thanks for your work on Vitess!" or are they using the
       | PlanetScale service for some of this?
        
       | jtchang wrote:
       | I wonder what game this was for.
       | 
       | To me this is validation that going the SQL route in almost 99%
       | of new apps is the right way to go. It will be a rare case that
       | you won't be able to scale out given how mature some of these
       | technologies are.
        
         | chillydawg wrote:
         | Demonware handle the online services for the CoD franchise.
        
         | VirusNewbie wrote:
         | Github had DAYS of an outage because they unsuccessfully tried
         | to scale out SQL rather than dealing with NoSql.
         | 
         | There are very few things that structured NoSql can't do if you
         | ignore reporting. Once you scale out traditional SQL, you
         | aren't going to be using it for reporting either way though.
        
           | mh- wrote:
           | This is a gross mischaracterization of what caused that
           | outage, even just based on the public information from the
           | (excellent) blog post [0] about it.
           | 
           | Automating _master failover_ on MySQL without a human in the
           | loop - _in the topology GitHub used back then_ - is risky.
           | 
           | 0: https://github.blog/2018-10-30-oct21-post-incident-
           | analysis/
        
       | LunaSea wrote:
       | Anyone have more blog posts about video game backend tech like
       | this one?
        
       | geenat wrote:
       | These numbers make more sense on hardware from 5 years ago (to be
       | fair, what they were dealing with), but postgres, on modern
       | consumer hardware (Yes, consumer! Ex: KC3000 + Ryzen 7950x),
       | already does over 50k QPS without being overburdened. You could
       | probably squeeze out 100k QPS with faster NVME's.
       | 
       | That's like 10 tall servers for their peak QPS? (500k QPS on
       | slide 19)
       | 
       | Lots of natural sharding points here too: company, game, usage of
       | data, etc.
       | 
       | This is assuming you avoid or go light on expensive features like
       | foreign keys (as Vitess already does).
       | 
       | The "scale magic" in Spanner (Big Table) inspired DB's are just
       | hidden automation of traditional sharding-
       | 
       | CockroachDB: Sharded indexes, then runs map reduce for you.
       | 
       | Scylla/Cassanda: Sharded indexes again, but more limitations for
       | speedups: Eventually consistent. You don't have fast delete- only
       | update (discord uses tombstones). JOIN's are "in app" only.
       | 
       | Vitess: Proxy that dismantles/routes your query to the correct
       | server. This scales, but is eventually consistent. JOIN's on co-
       | located data, or "in app".... in simple terms, Vitess is like an
       | externally managed "in app" query parser/router.
        
         | winrid wrote:
         | Mongo's sharding also uses routers like Vitess but can be
         | strongly consistent depending on connection and write settings.
         | 
         | Data is divided into chunks and the routers route queries based
         | on where your query lands on chunk ranges.
        
         | aeyes wrote:
         | I see 500k qps with 30TB data doubling every 2 years on slide
         | 19.
         | 
         | In my experience me this isn't close to being possible with a
         | single Postgres box. There is no way around sharding this
         | workload.
         | 
         | Vitess has different consistency models, it's not as eventually
         | consistent as you describe.
        
           | geenat wrote:
           | You'd certainly want to shard from the start in any case, but
           | it's an order of magnitude less postgres boxes for the QPS.
        
             | leetrout wrote:
             | Wonder what the tradeoff on connection overhead would be?
             | I've never been around a connection pooler that would be
             | anywhere near this load.
        
       | leetrout wrote:
       | When you give up one of the major components of relational
       | databases like foreign keys you pick up lots of performance
       | opportunity.
       | 
       | It is amazing what Vitess / PS will help you accomplish but they
       | don't talk a lot about tradeoffs you make to get there (which are
       | similar to what you face with sharding MySQL without additional
       | tooling).
        
         | berkle4455 wrote:
         | Foreign keys are so overrated. Write good code and you don't
         | have to rely on the DB to enforce referential integrity. It's
         | the equivalent of running test cases with every db write.
        
           | AdrianB1 wrote:
           | From experience you are very right. Also from experience,
           | most teams are not good enough to write the code with that
           | quality that you can give up to referential integrity.
        
           | vecter wrote:
           | I'm curious, how would you obviate the need for foreign keys
           | with "good" code? Can you provide a toy example or a
           | reference to an article so I can understand better? I've used
           | NoSQL databases a long time ago and currently rely on on good
           | ole PostgreSQL, but I'm having a hard time understanding how
           | "good" code can be a better solution for managing
           | relationships between data than a foreign key.
        
             | Yoric wrote:
             | I guess you can use indices instead of foreign keys? And
             | somehow implement all the `ON DELETE CASCADE` manually
             | within any transaction that removes the original row? Not
             | sure how it's "good" code but it could be faster.
        
               | DasIch wrote:
               | A foreign key doesn't necessarily imply an index. If you
               | are using postgres, you would have to add an index in
               | addition to the foreign key, if you want one.
               | 
               | On delete cascade, depending on how many rows it cascades
               | to, can be problematic because it's a very long running
               | blocking operation. That's something one might want to do
               | as a background operation and in batches. Although that
               | won't make it faster.
        
               | sroussey wrote:
               | Faster in aggregate throughput can often be different
               | from faster for a specific operation.
               | 
               | Personally, I find delete on cascade dangerous. I mean,
               | lots of fun for a pen tester, sure...
        
             | berkle4455 wrote:
             | > for managing relationships between data than a foreign
             | key.
             | 
             | You're conflating the concept of a normalized database with
             | insanely slow DB-enforced referential integrity/foreign
             | keys.
             | 
             | Toy example? Sure. Take a well-formed 3NF schema and
             | disable foreign key constraints.
        
               | AlisdairO wrote:
               | How does this stay correct in the presence of concurrent
               | activity?
        
               | berkle4455 wrote:
               | Using transactions or UUID/ULIDs though maybe I'm
               | misunderstandingyour question. How do foreign key
               | constraints help with concurrency?
        
               | AlisdairO wrote:
               | Table User: userid, etc
               | 
               | Table Resources: resourceid, userid, etc
               | 
               | If I want to restrict deletion of a user to only be
               | possible after all the resources are deleted, I'm forced
               | into using higher-than-default isolation levels in most
               | DBs. This has significant performance implications. It's
               | also much easier to make a mistake - for example, if when
               | creating a resource I check that the user exists prior to
               | starting the transaction, then start the tran, then do
               | the work, it will allow insertion of data into a
               | nonexistent user.
        
               | vecter wrote:
               | Sorry, assume I'm dense.
               | 
               | > Take a well-formed 3NF schema and disable foreign key
               | constraints.
               | 
               | I'm familiar with 3NF, but can you expand on how 3NF
               | enables you to remove foreign keys? Or feel free to point
               | me to an article/blog, I don't want to waste your time if
               | it's too much to explain. I did some googling but wasn't
               | sure where to proceed from your post.
        
               | Philip-J-Fry wrote:
               | They're not saying 3NF enables you to remove foreign
               | keys. They are talking about removing foreign key
               | constraints from your RDBMS of choice.
               | 
               | Something like SQL Server can enforce foreign key
               | constraints if you explicitly tell it your relationships
               | between tables. The downside is that having this
               | referential integrity costs you performance as the
               | database has to check your relations when
               | inserting/updating/deleting rows. E.g. checking that a
               | foreign key is pointing at a valid primary key, checking
               | that you aren't leaving invalid foreign keys when
               | deleting a primary key, etc. This is to prevent you
               | inserting bad data into the database.
               | 
               | You can delete these constraints and still have the exact
               | same behavior _so long as your code is correct_. It just
               | means that the database isn 't going to stop you writing
               | bad data.
        
               | vecter wrote:
               | That makes sense. For workloads where write performance
               | isn't very important but read performance is, this sounds
               | like it may still be a worthwhile tradeoff to have that
               | extra level of data integrity.
               | 
               | My sense is that many typical CRUD apps aren't writing
               | gargantuan volumes of data or making very complex edits,
               | and if they do, it's ok if it takes a second longer.
               | Usually read speed is more of a bottleneck for user-
               | facing applications, but I'm sure there are probably some
               | examples where this tradeoff is worth it.
        
               | berkle4455 wrote:
               | Say you have a super basic setup:                 user =
               | {user_id, email}       order = {order_id, user_id}
               | 
               | order.user_id is a foreign key to user.user_id. That's a
               | perfectly valid and reasonable way to organize things.
               | 
               | Enabling RDBMS-enforced foreign key constraints is the
               | issue. It slows everything down dramatically.
        
               | vecter wrote:
               | I see, thanks. Basically just store the foreign keys
               | yourself as columns in relevant tables and perform the
               | joins in SQL without having the DB enforce FK integrity
               | with every insertion/update/delete. Would it be fair to
               | say that read speeds are unaffected by this?
        
               | berkle4455 wrote:
               | Yes. Quite literally the only difference is not enabling
               | foreign key constraints. Read speeds unaffected correct.
        
           | kgeist wrote:
           | "Write good code and you won't have bugs" :)
           | 
           | I think it's good practice to enforce consistency rules both
           | in the DB and in code. If you make a mistake in your code,
           | the DB won't allow it, and vice versa.
        
             | vp8989 wrote:
             | In theory it is good practice, but in practice enforcing
             | referential integrity is not free and it makes satisfying
             | other non-functional requirements harder. Engineering is
             | about tradeoffs. Foreign keys are overkill in modern
             | architectures where DBs/tables are typically only written
             | to by a single application. They solve a problem that lots
             | of people don't really have anymore.
        
               | kgeist wrote:
               | In the legacy project our team inherited, there's a lot
               | of data consistency issues stemming from the lack of
               | foreign keys (we're adding them back now). Data
               | consistency is important because we deal with financial
               | data. Typically, code and data change more often than DB
               | schemas, so every release has a chance that someone will
               | forget to update all relations manually in code,
               | introducing dangling references, for example. Foreign
               | keys help as an additional safety measure, because they
               | don't change as often. I think you should care about the
               | performance of foreign keys only when speed is more
               | important than data consistency. It's a good default.
        
               | sroussey wrote:
               | Personally, I thinks foreign keys are great in dev and
               | bad at any scale that you have many db servers.
               | 
               | And as you point out, there are exceptions, like
               | financial data. But not marketing funnels where you might
               | throw everything away.
               | 
               | You should probably have different types of engineers
               | working on such different projects as well.
        
             | baronvonsp wrote:
             | The exact analogy that came to mind for me.
             | 
             | Plus, relational databases don't just sit under a single
             | application. There's usually multiple applications/services
             | talking to them. Worse, humans connect to them an do all
             | sorts of things they shouldn't do. That's the whole point
             | of managing referential integrity in the DBMS, since you
             | can only control "just write good code" across so many
             | application domains.
             | 
             | Of course whether the performance tradeoff is worth it is a
             | complicated decision for many of the reasons people have
             | mentioned. But in 20 years of working with relational
             | databases at big companies, I've seen few examples where
             | the performance win exceeded the business risk.
        
         | candiddevmike wrote:
         | Things can be really fast when you throw out referential
         | integrity
        
         | _a_a_a_ wrote:
         | Databases are my interest, so if you don't mind... How would
         | removal of foreign keys speed up read access?
         | 
         | I'm also going to disagree with you a little, again on the
         | issue of reading, if you have foreign keys then you can do
         | optimisations like cutting out chunks of joins. Adding
         | constraints means you know more which you can feed to the
         | optimiser which means often you can get better performance
         | (read performance, that is).
        
           | robmccoll wrote:
           | Not OP, but they said that you pick up performance, not
           | specifically in reading. The performance you gain is larger
           | aggregate write throughput (and read generally - although
           | reads that require spreading queries across shards and
           | aggregating results will likely perform a little worse). You
           | also gain scale in that you have turned something that was
           | limited by scale up into something limited by scale out
           | (while giving up some data guarantees and performance on some
           | queries).
           | 
           | As to why you can't really have foreign keys, basically,
           | sharding schemes like this sit at an intermediate layer
           | between the application and the various DB clusters that are
           | the shards. You CAN have strong data consistency (and useful
           | foreign keys) within a shard because it's all inside a single
           | database; however across shards, you CAN NOT. The sharding
           | layer doesn't perform checks for you, so if you have two
           | logical tables that are partitioned differently across the
           | shards, you can't have a foreign key that will be enforced
           | correctly as the foreign key of a given row in one table may
           | live on a different shard in the other table. The local
           | database within the shard would reject the insert.
           | Transactions across shards can also be tricky to impossible.
        
             | _a_a_a_ wrote:
             | In my experience most databases are read-heavy so that's
             | what I was asking about. That said, thanks for a clear
             | explanation.
        
         | samlambert wrote:
         | We have foreign keys coming soon!
        
       | angarg12 wrote:
       | Interesting slide deck, but I feel we are missing some of the
       | most interesting details out. Particularly, slides 24 and 25
       | could be an entire article on their own.
       | 
       | * How did you tune Vitess for resiliency? What were the tradeoffs
       | and how was the performance?
       | 
       | * How did you migrate from shared in app config to single DB
       | endpoint? (this is an issue we are facing right now)
       | 
       | * What do you mean by some queries being too shard aware? How did
       | you optimize queries for efficient routing with Vitess?
        
       | super256 wrote:
       | I assume Activision _actually_ knows how to scale databases, but
       | has  "join queues" for hype? I think everyone knows the infamous
       | Blizzard queues in release week, where you have to wait an hour
       | with only 5000 people in front of you. Happened recently with the
       | OW2 release.
        
         | mh- wrote:
         | Databases are not the only contended-for resource in online
         | games.
         | 
         | Those queues are usually the result of intentional capacity
         | planning. Overprovisioning is expensive.
        
           | super256 wrote:
           | > Those queues are usually the result of intentional capacity
           | planning. Overprovisioning is expensive.
           | 
           | I'd imagine they could spin up new servers on demand.
        
             | Arrath wrote:
             | Could, but after near 20 years of WoW and however many
             | expansion launches with big ole wait queues each time, they
             | really don't seem to bother.
        
       | rejectfinite wrote:
       | Gaming is insane that it even works online in near realtime. Even
       | moreso before. You can buy old wow servers. That was a world.
        
       | alecco wrote:
       | > * SQL query compatibility         > * Minimal changes to the
       | application         > * Runs MySQL in backend         > *
       | Kubernetes native         > * Provides Kubernetes operator
       | > We evaluated multiple candidates and chose Vitess
       | 
       | https://github.com/vitessio/vitess
       | 
       | But it's very light on numbers and doesn't show trade-offs or
       | anything.
        
         | edf13 wrote:
         | > Vitess has been a core component of YouTube's database
         | infrastructure since 2011, and has grown to encompass tens of
         | thousands of MySQL nodes.
         | 
         | Would love to hear more about this implementation
        
         | tbarn wrote:
         | KubeCon talks should be out in the next month or so and it has
         | more details.
        
         | Thaxll wrote:
         | Vitess was used for all of Youtube, they moved to Spanner
         | couple of years ago, Vitess is proven and scalable tech.
        
       ___________________________________________________________________
       (page generated 2023-04-21 23:00 UTC)