[HN Gopher] Epic Games certificate expiration incident report
___________________________________________________________________
Epic Games certificate expiration incident report
Author : gwtabn
Score : 81 points
Date : 2021-04-16 18:29 UTC (3 hours ago)
(HTM) web link (www.epicgames.com)
(TXT) w3m dump (www.epicgames.com)
| [deleted]
| wolverine876 wrote:
| For internal services, why not use self-signed certs with
| expiration dates in the 22nd century (if the technology allows
| that)? You don't need public trust and arguably your own
| authentication of the cert is more trustworthy than a third party's.
|
| I can imagine exceptions, such as when code requires a publicly-
| signed cert, but I suspect I'm missing something obvious here.
| NovemberWhiskey wrote:
| Typical certificate management practices for internal PKI are
| just absolutely set up to cause outages like this. The
| certificates get issued for a year, or two years, or whatever.
| This is infrequent enough that it doesn't feel like it makes
| sense to automate the process, and then it becomes a run-book
| that only ever comes out once a year. It's way too easy to add
| additional services without remembering to monitor which
| certificates you deployed to them etc.
|
| Start from the idea that you're going to issue certificates valid
| for 24 hours, and think how different your environment would need
| to look.
| marcosdumay wrote:
| It's worse than one not feeling it's worth it. Automating rare
| events is basically futile, because the next time your
| automation runs, everything will have changed and it will
| break.
| cwkoss wrote:
| If your automation has good logging and you have good
| alerting on logs, isn't it much better to see the automated
| process fail as a notification it needs to be done manually
| rather than relying on it being remembered?
|
| (Ideally, you'd remember and never set the alert off, but
| still great to have that extra layer.)
| cortesoft wrote:
| If that is the benefit, then why not just send an auto-
| reminder notification and skip the automation part?
| marcosdumay wrote:
| It's not much different from a notification telling you the
| activity is due. The difference is mostly a matter of what
| kind of notifications your organization ignores, and well,
| I've seen both cases.
|
| Anyway, the best is to shorten the certificates' validity.
| The way Let's Encrypt recommends is perfect: run it often and
| require several failures before anything breaks.
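|
| A minimal sketch of why "run it often" tolerates several failures:
| it counts how many renewal attempts fit before the certificate
| expires. The 90-day lifetime, day-60 start, and retry intervals are
| illustrative assumptions, and `attempts_before_expiry` is a
| hypothetical helper, not part of any real tool.

```python
def attempts_before_expiry(lifetime_days, renew_after_days, interval_days):
    """Count renewal attempts that occur strictly before the cert expires.

    Attempts happen at renew_after_days, then every interval_days after that.
    """
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    window = lifetime_days - renew_after_days
    if window <= 0:
        return 0
    # Ceiling division: the attempt at the very start of the window counts.
    return -(-window // interval_days)


# Assumed numbers: 90-day cert, first attempt on day 60.
print(attempts_before_expiry(90, 60, 1))  # daily retries
print(attempts_before_expiry(90, 60, 7))  # weekly retries
```

| Every one of those attempts has to fail before the certificate
| actually expires, which is the margin being described.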
| de6u99er wrote:
| Not if you treat automation as a first class citizen.
| qwertox wrote:
| The 3 month limit from Let's Encrypt was a blessing to me, as
| it forced me to monitor and automate all the renewals.
|
| I renew once a month, and if things should break, I have a two
| month window to fix the issues.
|
| Before that I would receive a Comodo SSL Certificate once a
| year via email, and by then I had always forgotten what I had
| to do with it. What an unnecessary pain.
| atkbrah wrote:
| I suppose it's the "green lock" that drives people to still
| use certificates issued in a non-automated way.
| lights0123 wrote:
| What browsers still show a green lock, or anything to
| differentiate EV certs? Only IE?
| athorax wrote:
| Ugh, I feel this in my bones.
| darkwater wrote:
| I know that automation is key here, and all the benefits that
| doing things very often brings, BUT in the specific case of
| certificate issuance, if you have short-lived certs you now have
| to ensure your CA system works perfectly, as its uptime is now
| the uptime of your whole platform. Yeah you can outsource it to
| AWS ACM and the like, or use Hashicorp Vault but still, it's
| something that before the change was totally static and now is
| an extra moving part.
|
| I'm not advocating against it, just exposing the whole story.
| NovemberWhiskey wrote:
| It depends on your organization, but in many ways the
| enterprise PKI CA is one of the _easiest_ services to run at
| high availability. There are hardly any shared-data
| dependencies, so it's easy to scale; it's almost completely
| CPU bound with highly predictable demand, etc.
|
| Pretending it's "totally static" is exactly the problem.
| There are only two kinds of things in the software world -
| things that can stay the same until your next release, and
| things that need automation. "Almost completely static" is
| how your post mortem ends up on the front page of HN.
|
| A consideration of the full story also needs to include the
| risks associated with long-lived certificates. If you lose
| control of the private key associated with one, what do you
| do? Are you actually operating a CRL? Are any of your HTTPS
| clients actually _checking_ the CRL? What would you do if a
| severe compromise were discovered that affected the signature
| algorithm you're using?
| ArchOversight wrote:
| With a short expiry, let's say 90 days, you should be
| renewing 30 days ahead of time, so at the 60 day mark you
| attempt to renew.
|
| This grants you 30 days to fix any problems and get the
| system back up.
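|
| The arithmetic above can be sketched as follows. The issue date is
| made up for illustration, and `renewal_schedule` is a hypothetical
| helper rather than anything from a real ACME client.

```python
from datetime import datetime, timedelta, timezone


def renewal_schedule(issued_at, lifetime_days=90, renew_before_days=30):
    """Return (renew_at, expires_at) for a certificate issued at issued_at."""
    expires_at = issued_at + timedelta(days=lifetime_days)
    # Start renewing renew_before_days ahead of expiry (day 60 of 90 here).
    renew_at = expires_at - timedelta(days=renew_before_days)
    return renew_at, expires_at


# Assumed issue date, purely for illustration.
issued = datetime(2021, 1, 1, tzinfo=timezone.utc)
renew_at, expires_at = renewal_schedule(issued)
print("renew from:", renew_at.date())
print("expires:   ", expires_at.date())
print("days to fix a broken renewal:", (expires_at - renew_at).days)
```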
| shuntress wrote:
| Then you have to automate handling the 60-day renewal
| failure warnings which adds _another_ moving part.
| johncolanduoni wrote:
| If you can't reliably automate sending an unignorable
| message to some set of humans when something fails to
| happen, you're going to have a tough time keeping
| anything actively developed online.
| erwald wrote:
| That is some impressively fast mitigation for an unexpected
| problem. 6 minutes to start the incident process, another 6
| minutes to identify the issue, and another 25 minutes to start
| rolling out the solution.
| de6u99er wrote:
| Not sure if internal services necessarily require valid
| certificates. Most of them don't even require encryption.
| Encryption, decryption, signing, and validating signatures will
| only cost CPU cycles and increase total power consumption.
|
| Looking at what Epic is doing, I would encrypt customer data and
| everything that involves money. IMO only communication between
| data centers, with external payment providers, and with users
| must be encrypted and require valid certificates.
| Hamuko wrote:
| I'm running Let's Encrypt certificates on services that are
| only accessible in my home network. And I live alone.
|
| I mean, why not?
|
| (Granted my certs actually failed earlier this week since my
| automation had broken)
| xtracto wrote:
| One problem is that Let's Encrypt certs only work for public
| domains.
| the8472 wrote:
| Modern CPUs come with instructions that make symmetric crypto
| very cheap. And if you err in the other direction you end up
| with "SSL added and removed here! :^)"
| aluminussoma wrote:
| Cert Expiration is a problem that needs a better solution when a
| company does not renew it. These were internal certificates.
| Still important but not user-facing.
|
| One possible solution might be having the client introduce an
| artificial delay of 10 seconds or some other time when it
| encounters an expired cert, or adds an additional second of delay
| for every day it is expired. This degrades the connection but
| does not immediately break anything.
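|
| A rough sketch of the two delay policies proposed above (a fixed
| 10 seconds, or one extra second per day past expiry). The function
| name and parameters are hypothetical; no TLS library is known to
| offer such a hook, so this only shows the arithmetic.

```python
def expired_cert_delay(days_expired, base_seconds=0.0, per_day_seconds=1.0):
    """Seconds a client would wait before proceeding with an expired cert."""
    if days_expired <= 0:
        return 0.0  # certificate still valid: no artificial delay
    return base_seconds + per_day_seconds * days_expired


# Per-day variant: one second for every day past expiry.
print(expired_cert_delay(3))
# Fixed variant: a flat 10 seconds once the cert has expired.
print(expired_cert_delay(3, base_seconds=10.0, per_day_seconds=0.0))
```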
| NovemberWhiskey wrote:
| Oh please no; give me a hard fail I can localize and fix rather
| than some kind of awful brownout where various parts of the
| system just go slow and break things just as badly anyway.
|
| Plus you'd need to be way in the guts of the TLS implementation
| to achieve this; if you're already there, start generating
| noise a week ahead of the expiration instead.
|
| Or better, none of the above and automate.
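|
| For the "noise a week ahead" fallback, a minimal sketch of the check
| might look like this. It assumes the expiry is available as the
| `notAfter` string that Python's `ssl.getpeercert()` returns; the
| 7-day threshold comes from the comment above, and the sample date
| and helper names are made up.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 7  # "a week ahead of the expiration"


def days_until_expiry(not_after, now):
    """Days left before expiry; not_after looks like 'Apr 16 00:00:00 2021 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after, now, warn_days=WARN_DAYS):
    """Classify a certificate as 'expired', 'warn', or 'ok'."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_days:
        return "warn"
    return "ok"


# Hypothetical certificate expiring 2021-04-16, checked six days earlier.
now = datetime(2021, 4, 10, tzinfo=timezone.utc)
print(expiry_status("Apr 16 00:00:00 2021 GMT", now))
```

| In practice the "warn" result would feed whatever alerting channel
| people actually read, which is the hard part.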
| macintux wrote:
| Concur. From working at Basho, one key takeaway with
| distributed systems is that a hard failure is much easier to
| remediate than a slow machine.
|
| We _wanted_ a database server to fail hard. Running slowly
| just caused cascading failures.
|
| Of course, in this case you're effectively talking about the
| entire cluster crashing hard, but that's still easier to cope
| with than every system responding at a snail's pace.
| aluminussoma wrote:
| I agree that automation is ideal. But let's face it: most
| companies haven't.
|
| The goal of a business is not to have perfect engineering
| practices. It is to fulfill customer requests. When there is
| an outage in the middle of the night, I'd argue that a
| degraded system buys time to address the issue.
|
| Regardless of the mechanism, having a sudden, complete
| breakage is not ideal for a business.
| qwertox wrote:
| No thank you. An artificial delay of 10 seconds is already
| broken. And adding an additional second a day doesn't improve
| anything, nor does it help.
|
| If you plan to implement something like this, then do it right
| and have the service catch the exception and notify an
| administrator.
| magicalhippo wrote:
| We integrated with a government service. It uses a government
| supplied authentication service[1] for machine-to-machine
| communication, based on OpenID IIRC (OAuth2++).
|
| For this, our customers need an EV certificate. Most of our
| customers are small, and don't have their own IT. It's a mess,
| most don't understand what it is, don't understand the difference
| between the two or three certificate files they get, a lot can't
| even figure out how to extract the files (inside a password
| protected pdf of all things), password? What password? ...
|
| And then of course the certificates expire. Just like that.
| Poof. And the person who ordered them last time has moved on to a
| new job, and so we're back to scratch.
|
| We spend so... much... time... on hand holding this for our
| customers. Didn't take us long to figure out we need to remind
| them about certificate expiry, but the rest is just such a PITA.
|
| Technically it's a pretty nice solution, but boy it is not made
| for normal people.
|
| [1]: https://www.digdir.no/digitale-
| felleslosninger/maskinporten/...
| thrower123 wrote:
| It makes me feel better when everybody is fucking up the simple
| things all the time also.
| arbirk wrote:
| Apple did it
| [deleted]
| Snoozle wrote:
| In my experience, certificate issues are a huge tell about the
| organization and treatment of IT folks. Every place I've worked
| which has had issues with last minute certificate changes or
| expiring certificates without renewal has had a systemic problem
| with underpaid and understaffed IT department.
|
| This is not a new problem, organizations will always choose
| guaranteed profits over possible loss of business unless the loss
| of business is catastrophic, I just wish that in this case
| instead of trying to make it seem like a big deal by writing an
| entire multipage excuse, a company for once would be honest and
| say 'The risk percentage did not fall in our favor this time, but
| we're not going to do anything about it because it didn't really
| impact our profits.'
| mdoms wrote:
| This seems a bit presumptuous. Epic's Glassdoor reviews[0]
| don't seem to list pay or staffing as systemic issues.
|
| [0] https://www.glassdoor.co.nz/Reviews/Epic-Games-
| Reviews-E2669...
| dijit wrote:
| I really don't know one way or the other, though as mentioned
| in another thread: I'm a devops in games and it pays less but
| not as poorly as it does for programmers.
|
| That said; Glassdoor is a terrible metric and has been widely
| criticised as a source of information due to the fact that
| bad reviews can be removed for payment; though "officially"
| they don't accept payment to delete reviews; it's part of one
| of their packages to clean up a company's image.
|
| It has also been gamed by employers- but that is obviously a
| problem for all review sites of this kind.
|
| https://www.reddit.com/r/sysadmin/comments/8tfhxv/glassdoor_...
| bartread wrote:
| Yeah, that's why Epic Games have been transparent enough to
| post this incident report: not to provide some explanation to
| their customers, or some information that the rest of us might
| be able to learn something from, but so that people on HN can
| make entirely unfounded accusations about the state of their
| organisation based on (at best) weakly correlated behaviours
| and symptoms.
|
| Be reasonable: you know nothing about how Epic Games treats
| their IT staff or whether or not the team is adequately
| resourced. I wouldn't say certificate expiry is something that
| happens particularly often, but I have seen it happen, and it's
| been simply an oversight rather than an indication of some
| serious systemic issue.
| Groxx wrote:
| The fact that a company can't deal with a scheduled-far-in-
| advance, highly-public-if-failed event does tell you some
| things about their priorities / how well they do things they
| need to do.
| SilasX wrote:
| Reminder: Mozilla failed this way too.
|
| https://news.ycombinator.com/item?id=19823701
| Groxx wrote:
| And in recent years they've been crippling extensions
| more and more, and even completely dropped support for
| them from their primary mobile browser for over a year
| now.
|
| So yes, I think this is one of many signs that they're
| not paying enough attention to extensions, not a totally
| isolated "accidents happen" event. Were I an extension
| author, I'd see that event as reason to be more
| concerned.
| bartread wrote:
| OK, fine, I'll bite: what specifically are those things it
| tells you that you can verifiably claim are true about Epic
| Games, again, _specifically_?
| Groxx wrote:
| That they apparently sometimes fail to do these things.
|
| You can't verify _anything_ internal unless you're
| internal or it has already failed publicly, so you of
| course have to draw on patterns seen elsewhere. Critical-
| process failures in one area correlate heavily with
| failures in others.
|
| Plus, Epic has not exactly shown themselves to be
| producing consistent quality in anything related to their
| store, or many internet-connected properties. If they
| were, this might be more attributable to "accidents
| happen, it's impossible to prevent them all". It could
| still be an abnormality, but they're edging further
| towards "... maybe not though" territory.
|
| ---
|
| Edit: let's add a concrete "kinda example, kinda counter-
| example". Google is a tech company that is pretty good at
| consistently renewing its _many_ certificates. They
| recently failed to do so for Google Voice:
| https://www.bleepingcomputer.com/news/google/recent-
| google-v...
|
| I think there's a reasonable argument to be made that
| this reinforces claims that Google Voice is low priority
| / at higher risk of future issues due to lack of care,
| i.e. systemic issues, compared to other Google
| properties. I have no proof, but that doesn't mean it's
| automatically unreasonable.
| bartread wrote:
| Sure, but you can't actually use an example from Google
| to deduce what's going on at Epic Games.
|
| Don't get me wrong: I'm not saying there aren't problems
| at Epic Games (most companies have them). What I'm saying
| is, we're just speculating: how is that helpful? Either
| to them or to this discussion?
|
| We're either casting vague and hand-wavy aspersions or
| citing more specific examples where we actually have no
| idea whether they have any relevance to Epic Games.
|
| It's just noise because, as you've pointed out, we're not
| internal.
| matkoniecz wrote:
| > Sure, but you can't actually use an example from Google
| to deduce what's going on at Epic Games.
|
| It was an illustration of a thought process - that seems to
| make sense to me.
|
| > It's just noise because, as you've pointed out, we're
| not internal.
|
| Yes, it is noisier than direct info from inside but you
| may learn _something_.
| machello13 wrote:
| Are you arguing that the internal workings of a company
| can't be visible at all to outsiders? Or that there's no
| correlation between the rate of public, easily
| preventable failures and technical incompetence? Or just
| that it's not "helpful" somehow to point these things
| out?
| kortilla wrote:
| Epic games published this as a PR move. Nothing more, nothing
| less. Customers got mad because Epic fucked up so they had to
| say something to make it seem complex and totally reasonable.
|
| "We made a bad bet on certs not being that important, it
| backfired" doesn't sound good but it's the truth.
|
| The same thing happened when Delta got wiped out by a power
| outage. "We made a bad bad bet on geo redundancy not being
| important, it backfired" wasn't good enough for them either
| so they pontificated just like Epic did here.
|
| It's obvious that Epic doesn't take certificates very
| seriously here. This is cert management 101. No need to read
| into it much further.
| PragmaticPulp wrote:
| > Every place I've worked which has had issues with last minute
| certificate changes or expiring certificates without renewal
| has had a systemic problem with underpaid and understaffed IT
| department.
|
| I've seen the opposite: Organizations who spent so much on the
| department that everyone was getting promoted to manager and
| hiring someone underneath themselves to manage things.
| Responsibilities being shuffled around as the department is
| constantly reorganized, until no one really understands who's
| responsible for what any more, but there are enough low-level
| employees to blame when things go wrong.
|
| I've seen enough variations of organizational dysfunction that
| I no longer pretend to be able to guess what's going on behind
| the scenes.
| psanford wrote:
| > Every place I've worked which has had issues with last minute
| certificate changes or expiring certificates without renewal
| has had a systemic problem with underpaid and understaffed IT
| department.
|
| That's an interesting anecdote, but it's quite easy to find
| examples of companies with well respected, well paid
| engineering teams that still have an occasional certificate
| expire. Microsoft[0], Spotify[1], Facebook[2], Apple[3] have
| all had embarrassing outages due to certificates expiring.
|
| [0]: https://www.theverge.com/2020/2/3/21120248/microsoft-
| teams-d...
|
| [1]: https://www.theverge.com/2020/8/19/21375032/spotify-down-
| son...
|
| [2]: https://www.theverge.com/2018/3/7/17092084/oculus-rift-
| heads...
|
| [3]: https://www.theverge.com/2015/11/12/9721108/apple-mac-app-
| st...
| xtracto wrote:
| Handling certificates correctly is one of those chores that,
| particularly in a startup, is easy to overlook. If the average
| tenure for employees is 2 years, and a certificate can be
| bought for a bit more than 2 years, then normally someone will put
| it in their calendar and leave the company before it expires,
| so the new employee will be welcomed by an expired
| certificate and a not-so-clear list of places where to place
| it.
|
| That's why things like AWS certificate manager + ELB kind of
| things are useful, so that they are mostly auto-renewed.
|
| It is a chore that has bitten most of the places where I have
| worked.
| machello13 wrote:
| The Apple issue was not a case of forgetting to renew a
| certificate, certain 3rd party apps just weren't handling the
| upgrade correctly. So maybe not quite as easy to find
| examples after all.
| renewiltord wrote:
| This is so overfit. Unbelievable anyone goes along with it. I
| was paid north of $400k total comp when I made this error last.
| Easy mistake to make.
| spondyl wrote:
| We had this issue a few times at the place I previously worked.
|
| At first, it wasn't clear whose responsibility it was since
| back in the operations day, emails would go to someone's
| specific address or even a mailing group, where most of the
| employees who were on it had left while new employees weren't
| added to the list since they didn't know about it.
|
| After it happened once or twice, metrics were set up to track
| expiring certificates (they were mostly all migrated to AWS
| Cert Manager I believe) while a few key ones couldn't be.
|
| As a bit of background, we also follow the Google-esque model
| of not having a phone number for customer support and requiring
| customers to submit a ticket. We do have outgoing calls but no
| incoming phone number.
|
| I say that because those key certificates would generate an
| email that said something like "Press this button and we'll
| call you to confirm you want to renew" so as you can imagine,
| my first thought was "Well, how the fuck is shit gonna work?"
|
| I think in the end we just ended up calling the certificate
| provider to say we don't have a phone number and then we
| managed to get them migrated to DNS-based validation after some
| time.
|
| This too wasn't a case of being underpaid but rather having a
| lack of knowledge. It's the sort of task that some particular
| person did for a long time but then left so none of us newer
| folks even knew where these things were provisioned from.
| Additionally, you don't feel like you have the authority to, e.g.,
| call up some multi-national provider and be like "Hi, we own
| this thing but umm, I have no idea how to go about renewing
| it". It feels like being a teenager calling up about a first
| job haha.
|
| It's just one of the casualties of "high growth" businesses
| mixed with humans being bad at seeing cause and effect when the
| gap between the two is super wide. Cause being people leaving
| and effect being "I forgot to ask how to do X or Y"
|
| I guess I would clarify that we were following a devops model
| but had transitioned from a classic dev/ops split so it's quite
| literally a generational thing where you conceptually don't
| know how to go about, e.g., renewing a certificate on the phone
| because you've entered the industry in the time of DNS
| validation via Let's Encrypt (and because there literally are no
| phones anymore in the business).
| whalesalad wrote:
| In my experience there are a lot of IT departments full of
| people who know how to click around and hack shit together but
| aren't what you'd call classically trained experts.
|
| Kinda like "I'll get my nephew to make my website"
| nickysielicki wrote:
| Video game developers are underpaid because they have an
| undying love for video games and are willing to work for less
| than they could make elsewhere.
|
| I suspect this becomes a problem in the context of hiring devops
| people, because whereas you can make the argument that writing
| game engines and working on game logic is more fun and
| justifies working for less, it's hard to make the argument that
| a devops job at Epic running game servers and websites is any
| more exciting than running servers and websites anywhere else.
|
| This puts Epic in the situation of having to pay market rate to
| attract devops people, but below market rate for attracting
| developers, which fucks up their pay scaling completely. What
| ends up happening is they just don't adjust their pay scale at
| all, which means they're hiring cheap devops people.
| joana035 wrote:
| I interviewed with Epic Games and got all the questions
| answered, though I used generic terms to describe each AWS
| product and drilled down into specifics/fundamentals of the
| questions, protocols, configuration gotchas, etc. Got
| rejected with "no experience with aws".
|
| Now seeing this I'm sure I dodged a bullet.
| [deleted]
| Impossible wrote:
| Epic doesn't pay below market rate. They can't offer stock
| because they aren't public but offer cash bonuses 2x-4x
| salary.
|
| I do agree with OP that (some) game developers undervalue IT.
| Oculus had a similar issue and its pay rate was equal to FAANG
| (because it is FAANG!), so it came from culture, not pay.
| hsbauauvhabzb wrote:
| Can you elaborate more on the bonuses? Is bonus based
| income common for these sorts of companies?
| vmception wrote:
| I've never worked in a place that gave more than a few
| thousand dollars in bonuses :/
|
| Yeah I know FAANGs and investment banks can be impressive
| on the bonus front too
|
| But the prevalence of this just seems disconnected from
| what is considered normal or bragworthy in the rest of the
| private sector and world
| rootsudo wrote:
| This. I interviewed for a role with a game studio and the pay
| was 40% lower than what I'm making.
|
| I was just curious since they approached me and I had fun
| with the experience and saying I didn't play their games/had
| no idea.
|
| The recruiter had no idea of local wages.
| bcrosby95 wrote:
| > Video game developers are underpaid because they have an
| undying love for video games and are willing to work for less
| than they could make elsewhere.
|
| I'm not sure that love lasts forever though. I'm childhood
| friends with a lot of people that went into games and left by
| their 30s because they couldn't justify the pay difference.
| That said, maybe the games industry doesn't need these
| experienced people.
| dijit wrote:
| I work as a devops in the games industry. It's true that it's
| underpaid, and by quite a bit. But it's not as bad as the
| programming teams, IME devops pays more.
| Thaxll wrote:
| Epic pays very well compared to the rest of the industry and
| way above your average dev company. Your comment is out of touch
| with reality because it's definitely not gameplay devs who manage
| certificates; they have a central team like Google does.
|
| When people say video games don't pay well, it does not
| apply to the likes of EA, Activision, Epic, Unity etc ...
___________________________________________________________________
(page generated 2021-04-16 22:01 UTC)