[HN Gopher] An update on recent service disruptions
___________________________________________________________________
An update on recent service disruptions
Author : todsacerdoti
Score : 91 points
Date : 2022-03-23 20:39 UTC (2 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| renewiltord wrote:
| Huh, one guy on HN _did_ say it was the DB that was the problem
| earlier on. Neat!
| speedgoose wrote:
| It's interesting to read that so many systems and activities are
| dependent on a single point of failure : the main primary MySQL
| node at GitHub.
| eatonphil wrote:
| I can imagine what they're in a rush to refactor.
| cyberpunk wrote:
| I mean, they have $MEGABUCKS, they could probably get 1/2 the
| team who maintains mariadb to come in and work for them if they
| wanted, and they still have a giant single db node doing writes
| and struggle to fail it over.
|
| We're doomed >_<
|
| You would think it wouldn't be _THAT_ hard to shard something
| like GitHub effectively.
|
| I mean, all user accounts/repos starting with the letter 'a' go
| to the 'a' cluster and so on seems not exactly science-fiction
| levels of technology.
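A minimal sketch of that idea (hypothetical cluster names, Python purely for illustration), next to the hash-based routing sharded systems usually prefer, since first-letter routing skews load toward popular letters:

```python
# Hypothetical shard routers, for illustration only.
import hashlib

def route_by_letter(username: str) -> str:
    """The comment's scheme: accounts starting with 'a' go to the
    'a' cluster. Simple, but shard load follows name popularity."""
    first = username[0].lower()
    return f"cluster-{first}" if first.isalpha() else "cluster-other"

def route_by_hash(username: str, n_shards: int = 26) -> str:
    """What sharded deployments usually do instead: hash the key so
    rows spread roughly evenly across shards."""
    digest = hashlib.md5(username.encode()).hexdigest()
    return f"cluster-{int(digest, 16) % n_shards}"

print(route_by_letter("alice"))  # cluster-a
print(route_by_hash("alice"))    # deterministic hash-based shard
```

Either way, the routing function is the easy part; the hard parts are cross-shard queries and moving existing rows without downtime.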
| [deleted]
| throwusawayus wrote:
| it's profoundly strange that github has not properly sharded
| yet. essentially all large social networks are using or have
| used sharded mysql successfully, this is not rocket science
|
| livejournal, facebook, twitter, linkedin, tumblr, pinterest
| all use (or formerly used) sharded mysql and most of these
| are at larger db size than github
|
| i will also repeat my comment from another recent thread: i
| just cannot understand how 20+ former github db and infra
| people recently left to join a db sharding company. this
| makes no sense whatsoever in light of github's lack of
| successful sharding. wtf is going on in the tech world these
| days
| samlambert wrote:
| This is just a gross simplification of the situation. You're
| commenting from the outside without much context. GitHub is a
| 14-year-old Rails app that is very complex; doing large
| migrations of database platforms can be very difficult and
| take a long time.
| throwusawayus wrote:
| You seem totally fine with making gross simplifications
| about your competitors at aws:
| https://news.ycombinator.com/item?id=30459692
| ketzo wrote:
| This is a super, super annoying type of comment, and
| against HN guidelines. It adds absolutely nothing to the
| conversation. Don't go digging through people's comment
| history to throw mud.
| drewbug01 wrote:
| > i just cannot understand how 20+ former github db and
| infra people recently left to join a db sharding company.
| this makes no sense whatsoever in light of github's lack of
| successful sharding.
|
| I believe you have the chain of causality backwards here.
| In fact, I think it suggests that the talent that went to
| PlanetScale is perhaps not the issue.
| throwusawayus wrote:
| i'm not suggesting that these departures are the cause of
| github's issue. rather, i'm saying i don't understand why
| such a large group from github was hired by planetscale
| if they did not have experience successfully sharding or
| successfully leveraging vitess
|
| this is like if you were building a high-rise condo,
| would you hire the architects or management company from
| the building that collapsed in surfside florida? sure,
| they know what NOT to do next time, but that doesn't mean
| they do know what TO do
| samlambert wrote:
| Hi, former GitHubber here and CEO of PlanetScale. I came
| to PlanetScale after seeing the incredible impact that
| Vitess had at GitHub. The team we have hired here
| successfully sharded parts of GitHub's very large
| platform. We continue to shard very large customers at
| PlanetScale.
|
| Platform migrations take a very long time and are very
| complicated, especially with decade-old codebases. I will
| say the current team at GitHub are nothing but outstanding
| people and engineers with the difficult task of managing a
| very large deployment.
| drewbug01 wrote:
| I understand what you're getting at... but why is it you
| presume that the employees now at planetscale are the
| reason GH couldn't shard out their databases?
|
| Like, there's another angle here: management, yeah?
|
| Another way of reframing it is "maybe the folks hiring at
| planetscale know the inside baseball about GH
| infrastructure". For example:
| https://www.linkedin.com/in/isamlambert
| throwusawayus wrote:
| this is my exact point: they hired the former head of GH
| infrastructure - _literally the person directly
| responsible for all this at github for years_ - and made
| him their ceo
|
| github should have sharded _years ago_ , every other
| large mysql user did so much earlier in their growth
| trajectory
| drewbug01 wrote:
| Ah, that's a more specific point than what you seemed to
| have been making before.
| sillysaurusx wrote:
| > they could probably get 1/2 the team who maintains mariadb
| to come in and work for them if they wanted
|
| _The Mythical Man Month_ has a few things to say about that.
|
| (It's tempting to feel that the information is outdated, but
| in my experience it still seems true.)
| cyberpunk wrote:
| My point was that the reason they've not done that is that
| it wouldn't help.
|
| This is an architectural problem; even if they had the
| massive expensive brains behind something like mysql on
| their team, they couldn't fix it.
|
| (at least, I'm guessing, I think this kinda architecture
| doesn't scale even if they could kick the can down the road
| a few times..)
| sillysaurusx wrote:
| Oh, my apologies! Yes, that makes sense.
| prepend wrote:
| My fear is that this is a cover excuse for moving off
| MySQL. The bug will be too hard and they'll move off. They
| will choose SQL Server, take a long time to convert, and
| then have even more outages.
| drewbug01 wrote:
| At GitHub's scale, you don't just "move off" a database. At
| best it would be a gradual project that would take _years_
| for the company to complete, and likely trigger additional
| incidents along the way.
| samlambert wrote:
| Correct.
| nimbius wrote:
| 100% agreed. a lift-and-shift migration to Galera and modern
| MariaDB wouldn't be hard, but knowing MS there are middle
| managers waiting in the wings to swoop in and drive this into
| the ground with Azure/SQL Server, the former of which posted 8
| outages in the past 90 days alone.
|
| this is classic Microsoft. spend a ton of money for something
| very valuable -- in this case virtually all developer
| marketshare -- and then casually run it into the ground
| while you lie about the KPIs to C-levels (IIS marketshare on
| netcraft as a function of parked websites at GoDaddy to
| dominate over Apache) and keep it on life support with other
| revenue streams (Xbox) for the next 16 quarters until it
| becomes a repulsive enough carbuncle to shareholders that it
| gets the axe (Microsoft phone). then in a year, limp into the
| barn with another product nobody else but you could afford to
| buy (Minecraft) and slowly turn it into a KPI farm for
| Microsoft account metrics to drive some other failing product
| (Azure) and keep the C-level happy while you alienate
| virtually every player with mechanics or requirements they
| hate.
| throwusawayus wrote:
| galera is not a solution for scaling out writes, full stop
|
| galera has lower max writes/sec than a traditional async
| single master because it's a cluster. the other members of
| the cluster need to ack the writes, and all members are
| doing all the writes, so adding machines does not increase
| your max writes
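A back-of-envelope model of that argument (illustrative numbers, not benchmarks): in a synchronous cluster every member must certify and apply every write, so write capacity stays roughly flat as nodes are added, while read capacity does scale:

```python
# Toy capacity model for a synchronous replication cluster.
# The per-node throughput figures are assumptions, for illustration.

def cluster_capacity(nodes: int, node_writes_per_s: int,
                     node_reads_per_s: int) -> tuple[int, int]:
    # Writes: replicated to all members, each of which must apply
    # them, so the cluster is bounded by a single node's capacity.
    max_writes = node_writes_per_s          # flat regardless of node count
    # Reads: any node can serve them independently.
    max_reads = nodes * node_reads_per_s    # grows linearly
    return max_writes, max_reads

for n in (1, 3, 5):
    w, r = cluster_capacity(n, node_writes_per_s=10_000,
                            node_reads_per_s=50_000)
    print(f"{n} nodes: ~{w} writes/s, ~{r} reads/s")
```

In practice the write ceiling can even drop slightly with more nodes, since commits wait on certification across the group.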
| prepend wrote:
| It's funny you mention Minecraft. My kids recently said "I
| hate Microsoft. All they did is ruin Mojang."
|
| They don't know Microsoft for anything other than ruining
| Minecraft. They didn't know Microsoft made the Xbox or even
| Windows.
|
| They made this statement after Microsoft forced them to
| migrate the account they'd had for 5 years to a Microsoft
| account. That broke their computer for a few days and reset
| their games. For no useful reason.
| speedgoose wrote:
| I bet on Azure Cosmos DB.
| protomyth wrote:
| Why would Microsoft not migrate to SQL Server? MySQL is owned
| by Oracle. Microsoft cannot be happy about using a product
| from Oracle. SQL Server is a pretty good product and the
| conversion will give them even more tools and expertise for
| their consulting wing to do it for other companies.
| Ygg2 wrote:
| They do seem to be acting like they are cool and OSS
| friendly.
| wincent wrote:
| Linked in the article is this other one, "Partitioning GitHub's
| relational databases to handle scale"
| (https://github.blog/2021-09-27-partitioning-githubs-
| relation...). That describes how there isn't just one "main
| primary" node; there are multiple clusters, of which `mysql1`
| is just one (the original one -- since then, many others have
| been partitioned off).
| throwusawayus wrote:
| from that article it sounds like they are mostly doing
| "functional partitioning" (moving tables off to other db
| primary/replica clusters) rather than true sharding
| (splitting up tables by ranges of data)
|
| functional partitioning is a band-aid. you do it when your
| main cluster is exploding but you need to buy time. it
| ultimately is a very bad thing, because generally your whole
| site is dependent on every single functional partition being
| up. it moves you from 1 single point of failure to N single
| points of failure!
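The "1 single point of failure to N" point can be sketched numerically (99.9% is an assumed per-cluster availability, purely illustrative): if the site needs every functional partition up at once, their availabilities multiply:

```python
# Illustrative availability math for N required functional partitions.

def site_availability(per_cluster: float, n_partitions: int) -> float:
    # The site is up only when all N partitions are up at once
    # (assuming independent failures), so availabilities multiply.
    return per_cluster ** n_partitions

for n in (1, 5, 20):
    print(f"{n} partitions: {site_availability(0.999, n):.4%} site uptime")
```

Twenty partitions at 99.9% each yields roughly 98% site-level availability, which is the sense in which functional partitioning multiplies single points of failure rather than removing them.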
| wincent wrote:
| Towards the end:
|
| > In addition to vertical partitioning to move database
| tables, we also use horizontal partitioning (aka sharding).
| This allows us to split database tables across multiple
| clusters, enabling more sustainable growth.
| throwusawayus wrote:
| yes i know, the fact that this is a small blurb at the
| bottom of the article (which is largely about functional
| partitioning) exactly proves my point
| egberts1 wrote:
| The hardest part is drafting a series of questions for the end-
| user to understand and answer before we get those "MAGIC-PRESTO-
| BLAM-WHIZ" configuration files that just work.
|
| I blame the program providers.
|
| Some Debian maintainers are trying to do this simple querying of
| complex configurations (`dpkg-reconfigure <package-name>`). And I
| applaud their limited inroads there, because no one else has
| seemed to bother.
| paxys wrote:
| Outage checklist
|
| - Was it DNS?
|
| - Was it a bad config update?
|
| - Was it an overloaded single point of failure?
|
| There's rarely a #4
| doublerabbit wrote:
| #5 Is it plugged in?
|
| Built and configured the server, drove it a few hours down to
| the DC, racked it. Got back home, tried to access it. No luck.
| Turned out I had forgotten to connect power and turn it on.
| HL33tibCe7 wrote:
| A proposed fourth: was it BGP?
|
| Though most of those fall under "bad config update" (which
| likewise applies to DNS).
| jenny91 wrote:
| When has BGP caused a serious outage at a website?
| karlding wrote:
| The big Facebook outage from last fall [0]?
|
| [0] https://blog.cloudflare.com/october-2021-facebook-outage/
| jenny91 wrote:
| BGP wasn't the root cause though; it was just caught in the
| middle of a big mess, which made it very hard to undo a
| botched config change.
| [deleted]
| karmakaze wrote:
| Or an actual DDoS, not self-inflicted.
| longcommonname wrote:
| Our internal monitoring has seen more outages than they listed
| here. There have been 4 full days where GitHub Actions has been
| mixed between completely broken and degraded status.
|
| It's nice to finally get some comms, but this is incredibly late
| and incomplete.
| calcifer wrote:
| TLDR: They still don't know why this is happening.
| rvz wrote:
| Oh dear. So beyond those double outages, I should expect at
| the very least one GitHub outage every month, rather than 3
| or 5 a month?
|
| I don't think we are going to see GitHub be up for a full month
| without an incident anytime soon.
|
| I guess my entire comment chain [0] on GitHub's situation
| has aged well for two straight years in a row then, especially
| yesterday's one: [1].
|
| [0] https://news.ycombinator.com/item?id=30779275
|
| [1] https://news.ycombinator.com/item?id=30767821
___________________________________________________________________
(page generated 2022-03-23 23:00 UTC)