[HN Gopher] Improving how we deploy GitHub
___________________________________________________________________
Improving how we deploy GitHub
Author : todsacerdoti
Score : 95 points
Date : 2021-01-25 18:06 UTC (4 hours ago)
(HTM) web link (github.blog)
(TXT) w3m dump (github.blog)
| zoobab wrote:
| Didn't GitHub's source code leak recently?
| aszen wrote:
| Kind of sad to see GitHub doesn't use GitHub itself to deploy and
| monitor their releases.
| WJW wrote:
| That seems like an extremely good idea actually, since if you
| dogfood your own releasing service then you can't fix it
| anymore if you accidentally bring down the service.
| notwhereyouare wrote:
| I did a short stint at wayfair and about 1-2 months in, there
| was a deploy that somehow got past the test flow and when
| deployed took down their entire site. So badly that they
| couldn't even deploy the fix
| xxpor wrote:
| That's usually solved with a parallel stack deployment: use
| the other stack if something is broken.
| paxys wrote:
| If the "other stack" isn't regularly used then you can
| assume it will be broken when needed
| cpascal wrote:
| You just run the previous version of the production stack
| in your "dogfood/operations" stack. Once you've fully
| rolled out production and have vetted it, you can upgrade
| the other one to match production.
| Xorlev wrote:
| That also means when it does go wrong, it takes much longer
| to fix. Good operational practice is to decrease MTTR, not
| make it worse.
| illnewsthat wrote:
| I was surprised to read that they are using Slack, since it
| competes directly with Teams from Microsoft (GitHub's parent
| company).
| kuschkufan wrote:
| Are you expecting them to use Windows everywhere as well?
| dubcanada wrote:
| No, but why would you use a product that is $7 or whatever
| times the number of employees (so let's say 200, so $1400 a
| month) when you can use a free one?
| maccard wrote:
| Speaking from experience, just because you work for a
| company doesn't mean you can use all of their products
| (or that you'll even get favorable pricing on them).
| lostapathy wrote:
| I'd love to hear this story! Seems crazy ... but we live
| in a crazy world.
| scott_w wrote:
| Unrelated to software but the company my dad works for
| (motor repair) has to buy all its parts from its own
| distribution arm, at the marked up price. He then has to
| turn a profit on those parts as well as pricing the
| labour.
|
| If cost price is £5 and the markup is 20%, he has to pay
| £6 to get the part, then charge £7.20 on the invoice to
| the customer. I'll let you guess what that does to tender
| bids ;-)
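
The compounding in the comment above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is hypothetical, amounts are held in whole pence to avoid floating-point rounding, and the second 20% step is inferred from the quoted invoice figure rather than stated as a rule.

```python
def apply_markup(price_pence: int, markup_percent: int) -> int:
    """Return the price after adding a percentage markup, in whole pence."""
    return price_pence * (100 + markup_percent) // 100


cost = 500                                        # cost price of 5.00, in pence
internal_price = apply_markup(cost, 20)           # what the repair shop pays its own distribution arm
invoice_price = apply_markup(internal_price, 20)  # what the customer is charged
effective_markup_percent = (invoice_price - cost) * 100 // cost

print(internal_price, invoice_price, effective_markup_percent)
```

Two stacked 20% markups come out at 44% over cost, which is the disadvantage in tender bids the comment alludes to.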
| names_are_hard wrote:
| At Microsoft if you build a product using Azure (and if
| you want to use the cloud you MUST use Azure, you're not
| going to get approval to write a check to AWS) the costs
| come out of your budget. And it's taken seriously, to the
| point where teams will very much emphasize managing costs
| (what will this new feature cost on our Azure bill? Can
| we build it more efficiently? Oh wow, that refactor saved
| us 100k/month in cloud costs, don't forget that when we
| start talking about promotions...)
| lostapathy wrote:
| That makes sense since the amount you could use is
| variable. I was thinking more like somebody couldn't get
| a free Word license at a MS subsidiary or something.
| vulcan01 wrote:
| When I worked at MS Azure, we had to pay for Azure
| servers! (I believe our team had a $5k/month Azure bill.)
| It's part of internal budgeting, so that people within MS
| don't splurge on expensive things (because it does cost
| MS money for each person on Teams).
| names_are_hard wrote:
| Did you drop a k? What can you do with 50 dollars?
| vulcan01 wrote:
| Yes, thank you, it should be $5k. Edited.
| josephg wrote:
| My uncle used to work at Compaq (back before they got
| bought by HP). When their computers broke, his team had
| to pay their support staff to get them fixed. (Via
| internal budgeting). But the support team knew internal
| customers would call them anyway and it was still
| Compaq's money, so they charged several times more for
| internal support calls than normal support calls.
|
| My uncle's team was having none of that, so they paid an
| external computer repair service to fix their computers.
| The external repair service subcontracted to Compaq's
| internal people anyway, so when their computers broke
| they called up (and paid) external consultants. Who in
| turn called Compaq's internal support team, who came
| downstairs and fixed their computers at a competitive
| price.
| theshrike79 wrote:
| On the other hand sometimes it means you MUST use the
| company products.
|
| Consulted for a sub-sub-sub-subsidiary of Toshiba. All
| computer equipment _had_ to be from Toshiba - the closest
| place to get Toshiba laptops was two COUNTRIES over.
|
| They even had to tape over non-Toshiba branding from
| external displays that would be visible.
| paxys wrote:
| $1400 a month is less than a rounding error for a company
| that size. If you can get even the tiniest bit of extra
| developer productivity from the software then it is worth
| it.
|
| And GitHub will definitely still have to "pay" for Teams,
| whether that is internal accounting or actual money being
| exchanged.
| names_are_hard wrote:
| My understanding of Microsoft policy is that it's easier to
| buy MacBooks for your developers than it is to buy Slack.
| Which makes sense, because they're currently going head to
| head with Slack for market share, while a few MacBooks
| don't threaten their credibility when selling Windows.
|
| My guess is that GitHub was using Slack before they were
| bought and inertia is a thing. I'm sure there are people
| within the parent company that would like to see them
| transition, but I'm sure there's a ton of resistance,
| especially "on the ground" at GitHub. Buyouts are a
| delicate thing; they don't want to ruin GitHub by trying to
| force it to change too quickly.
| dubcanada wrote:
| Probably because Teams is the worst.
|
| More than likely it's because that's what they used before
| they got bought, and they haven't been forced to migrate over
| yet. They also seem to have bots, which can't simply be copied
| and pasted into MS Teams, and converting over likely isn't a
| high priority.
| jen20 wrote:
| IIRC GitHub used to use Campfire and it took a long time to
| switch to Slack - a switch to Teams would no doubt take a
| long time too!
| paxys wrote:
| Easy to switch a chat application, hard to switch your entire
| chatops ecosystem. This blog post shows the perfect example
| of that.
| jules2689 wrote:
| GitHub is used for some of it, but as others stated, we don't
| want to create a circular dependency on ourselves in case we
| deploy something that is broken.
| hoprocker wrote:
| This is generally a good flow, but something that absolutely
| baffles me is that GitHub changes the commit SHAs when branches
| are rebase-merged from PRs[0]. This totally breaks a fundamental
| notion in Git that the same work, based on the same commits, has
| the same hash. It also makes it incredibly difficult to determine
| which PR branches have been merged into master.
|
| [0] https://docs.github.com/en/github/collaborating-with-
| issues-...
| KinesisMagic wrote:
| Can anyone explain why they might go with a Slack-based
| deployment system as opposed to something more robust like
| CircleCI or Jenkins? Is it mainly about the simplicity of it?
| jules2689 wrote:
| It's mainly the simplicity of the deployment system as it's
| inline and visible, coupled with habit. In all actuality, that
| is just what _can_ trigger the deploy; the actual deploy is
| based on an internal deploy application and deploys can be
| triggered from there as well.
| mrdonbrown wrote:
| My team recently put in automation so that we use CircleCI for
| the staging deployment, have it wait for manual approval, then
| deploy to production. However, we can also give the Slack
| staging deployment message a +1 reaction, which will
| automatically approve the production deployment for CircleCI.
| This way, we get an easy dev UX but all the CI features of
| CircleCI.
| pronoiac wrote:
| There's easy transparency amongst multiple teams, without
| having accounts for the other teams on CircleCI or Jenkins.
| This is while the deploy is in flight, and it can provide
| timestamped logs if there's an incident, and it could be useful
| for tracking history. It's also clear who kicked off the
| deploy.
| zug_zug wrote:
| As a devops person myself, I am super skeptical that there is
| any good reason to do a chatops deploy. My guess is "new toys
| are cool" / "Want this on my resume"
|
| To be clear, it's hopefully just some connector that does Slack
| message -> triggers Jenkins job.
|
| But from a security, compliance, reliability, debuggability,
| auditability perspective I think it's inferior. Not to mention
| an inferior interface.
| swagonomixxx wrote:
| Chatops deploys aren't really new toys; a place I was at was
| doing them around 2013/14.
|
| We liked it because the chat history you see is essentially a
| deploy history, no need to log in to some other website to
| check some obscure logs page to see who did what. We did end
| up having to debug the service that processed the chat
| messages maybe once, but never ran into an issue when we had
| to deploy a hotfix.
| alexchamberlain wrote:
| That's pretty awesome to go from nothing to full production in 15
| minutes. I would like to encourage others to bear in mind that
| simply adding more time wouldn't significantly decrease the risk
| of things going wrong.
| cytzol wrote:
| Something I found surprising is that a change to the GitHub
| codebase will be run in canary, get deployed to production, and
| _then_ merged. I would have expected the PR to be merged first
| before it gets served to the public, so even if you have to `git
| revert` and undeploy it, you still have a record of every version
| that was seen by actual users, even momentarily.
|
| Does anyone know the pros and cons of GitHub's approach?
| halukakin wrote:
| I think this method seems to get more popular by the day. IMHO,
| previously master was the branch you merge before the deploy
| process. But today this is reversed.
|
| The main benefit is, other developers can rely on the master
| branch even more. They will know there will not be a revert on
| the master branch they just pulled one hour ago and already
| started coding on.
| Kwpolska wrote:
| A `git revert` creates a new commit. To a developer, a revert
| commit appearing on master has the same effect as a pull
| request (or ten) being merged into it. If the revert affects
| code you're working on, you will need to resolve conflicts,
| just like you would need to if a merged PR affected the same
| code.
| bswinnerton wrote:
| This is known as "GitHub Flow"
| (https://guides.github.com/introduction/flow/). I was pretty
| surprised by it when I first joined GitHub but I've grown to
| love it. It makes rolling back changes much faster than having
| to open up a revert branch, get it approved, and deploy it.
| When something goes sideways, just deploy master / main, which
| is meant to always be in a safe state.
| sandGorgon wrote:
| > _GitHub.com is deployed primarily through chatops_
|
| What is the best chatops right now? I don't see a lot of
| popularity around chatops. It's most usually some version of
| GitHub-based triggers.
|
| It's funny that GitHub themselves use chatops. I think that's a
| very nice take - especially for early stage startups. Anyone else
| use anything like it?
| paxys wrote:
| I'm guessing they are using Hubot (https://hubot.github.com/)
| swagonomixxx wrote:
| A place I was at used Hubot as well. It gets the job done, we
| never really ran into a fuss. Easily extensible as well.
| jules2689 wrote:
| This is correct :)
| icey wrote:
| We're just starting beta, but my friend Phil and I both worked
| together at GitHub and are building what we hope to be a better
| Hubot at https://ab.bot right now.
|
| It's missing some of the chatops stuff that is mentioned in the
| blog post but since we support a lot more languages than Hubot
| we're hoping it's a matter of time before someone in our
| community builds a better replacement deployment script (or
| we'll do it while building out sample scripts :))
|
| (Also, hi GitHub friends!)
| Xorlev wrote:
| I was surprised to see their canary stages are just 5 minutes.
| Many problems take longer to manifest. That seems like a fairly
| risky release process.
| jules2689 wrote:
| It's actually longer than 5 minutes. There is the duration of
| the 2% canary deploy where we start to see pickup of traffic,
| a 5 minute wait, then a 20% "deploy", and a 5 minute wait. All
| in all this comes out to around 10-15ish minutes in canary.
| This is a stage where we can almost instantly shut off traffic
| to the canary deploy.
|
| Could we reduce risk by lengthening the process? Maybe, but you
| also make deploys longer which means less stuff can get through
| in a day. This makes devs respond with larger PRs, for example,
| which increases the risk profile.
|
| So we need to balance time and duration. In my experience, at
| our scale of user base, large problems will typically manifest
| quickly; problems that take a lot longer to detect are
| generally more minor.
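
The staged rollout described in the comment above (2% of traffic, a 5 minute wait, 20% of traffic, another 5 minute wait) could be modelled roughly as follows. This is an illustrative sketch, not GitHub's deploy tooling: the `Stage` class, the stage names, and the `is_healthy` callback are all hypothetical, and the real waits are represented as data rather than actually slept through.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str             # hypothetical label for the stage
    traffic_percent: int  # share of traffic routed to the new build
    wait_minutes: int     # observation window before moving on


# Percentages and waits taken from the comment; the names are made up.
CANARY_STAGES = [
    Stage("canary-2", 2, 5),
    Stage("canary-20", 20, 5),
]


def minimum_canary_minutes(stages: Sequence[Stage]) -> int:
    """Sum of the observation windows, excluding the time the deploys themselves take."""
    return sum(stage.wait_minutes for stage in stages)


def run_canary(stages: Sequence[Stage], is_healthy: Callable[[Stage], bool]) -> Tuple[bool, str]:
    """Walk the stages in order and stop at the first unhealthy one.

    Returns (promoted, name of the last stage reached).
    """
    for stage in stages:
        if not is_healthy(stage):
            return False, stage.name  # caller would shut off canary traffic here
    return True, stages[-1].name


print(minimum_canary_minutes(CANARY_STAGES))
print(run_canary(CANARY_STAGES, lambda stage: stage.traffic_percent < 20))
```

The two waits add up to 10 minutes; the rest of the "10-15ish minutes" quoted above would be the time the canary deploys take to pick up traffic.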
| wdb wrote:
| Yeah, wouldn't you need some sort of minimum amount of traffic
| to be able to use canary deployment?
| paxys wrote:
| The problems that don't immediately manifest could very well
| take hours or days or longer. There has to be a limit, and 5
| minutes is as good as any.
| closeparen wrote:
| A lot of alerts use moving averages or sustain times to
| squelch transient noise. You have to wait for the max sustain
| time to pass before you can conclude that lack of alert =
| lack of problem.
|
| That time could very well be 5 minutes but the two need to be
| coordinated.
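
The coordination point in the comment above can be expressed as a simple check: any alert whose sustain time exceeds the canary wait cannot fire before the stage is promoted. The sketch below assumes made-up alert names and sustain times; none of them come from the blog post or the thread.

```python
from typing import Dict, List


def alerts_too_slow(canary_wait_minutes: int, sustain_minutes_by_alert: Dict[str, int]) -> List[str]:
    """Return the alerts whose sustain time is longer than the canary wait.

    For these alerts, silence during the canary window says nothing about health.
    """
    return sorted(
        name
        for name, sustain in sustain_minutes_by_alert.items()
        if sustain > canary_wait_minutes
    )


# Hypothetical alert configuration, for illustration only.
sustain_times = {"error_rate": 2, "p99_latency": 5, "queue_depth": 10}

uncovered = alerts_too_slow(5, sustain_times)
print(uncovered)
```

With a 5 minute wait, only the alert with a 10 minute sustain time is flagged, so either the wait has to grow or that alert's sustain time has to shrink.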
| bomdo wrote:
| I'd love to learn more about their canary rollouts. Is there any
| more info from either them or similar large sites about this?
|
| For example, what usually has to happen for a dev to trigger a
| rollback? Or how do they handle stateful changes such as database
| schema changes?
| t3rabytes wrote:
| Re db migrations: they've built their own DB management tooling
| (https://github.com/openark/orchestrator) and online migration
| tooling (https://github.com/github/gh-ost)
| jules2689 wrote:
| We mainly monitor Datadog dashboards, exceptions, and other
| metrics, as well as smoke testing the application.
___________________________________________________________________
(page generated 2021-01-25 23:01 UTC)