[HN Gopher] Epic Games certificate expiration incident report
       ___________________________________________________________________
        
       Epic Games certificate expiration incident report
        
       Author : gwtabn
       Score  : 81 points
       Date   : 2021-04-16 18:29 UTC (3 hours ago)
        
 (HTM) web link (www.epicgames.com)
 (TXT) w3m dump (www.epicgames.com)
        
       | [deleted]
        
       | wolverine876 wrote:
       | For internal services, why not use self-signed certs with
       | expiration dates in the 22nd century (if the technology allows
       | that)? You don't need public trust and arguably your own
       | authentication of the cert is more trustworthy than a third
       | party's.
       | 
       | I can imagine exceptions, such as when code requires a publicly-
       | signed cert, but I suspect I'm missing something obvious here.
        
       | NovemberWhiskey wrote:
       | Typical certificate management practices for internal PKI are
       | just absolutely set up to cause outages like this. The
       | certificates get issued for a year, or two years, or whatever.
       | This is infrequent enough that it doesn't feel like it makes
       | sense to automate the process, and then it becomes a run-book
       | that only ever comes out once a year, it's way too easy to add
       | additional services without remembering to monitor which
       | certificates you deployed to them etc.
       | 
       | Start from the idea that you're going to issue certificates valid
       | for 24 hours, and think how different your environment would need
       | to look.
        
         | marcosdumay wrote:
         | It's worse than one not feeling it's worth it. Automating rare
         | events is basically futile, because the next time your
         | automation runs, everything will have changed and it will
         | break.
        
           | cwkoss wrote:
           | If your automation has good logging and you have good
           | alerting on logs, isn't it much better to see the automated
           | process fail as a notification it needs to be done manually
           | rather than relying on it being remembered?
           | 
           | (Ideally, you'd remember and never set the alert off, but
           | still great to have that extra layer.)
        
             | cortesoft wrote:
             | If that is the benefit, then why not just send an auto-
             | reminder notification and skip the automation part?
        
             | marcosdumay wrote:
             | It's not much different from a notification telling you the
             | activity is due. The difference is mostly a matter of what
             | kind of notifications your organization ignores, and well,
             | I've seen both cases.
             | 
             | Anyway, the best is to shorten the certificates' validity.
             | The way Let's Encrypt recommends is perfect: run it often
             | and require several failures before anything breaks.
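
A rough way to put numbers on "require several failures before anything breaks". This is only a sketch: the function name and parameters are made up for illustration, and it assumes renewal is retried at a fixed interval once the first attempt is due.

```python
import math

def attempts_before_expiry(lifetime_days, first_attempt_day, interval_days):
    """Count the renewal attempts that happen strictly before expiry.

    Attempts are assumed to run on day first_attempt_day and then every
    interval_days after that (hypothetical schedule, not any CA's rule).
    """
    window = lifetime_days - first_attempt_day
    if window <= 0:
        return 0  # the first attempt is already at or after expiry
    return math.ceil(window / interval_days)

# 90-day cert, daily attempts starting on day 60
print(attempts_before_expiry(90, 60, 1))
# 90-day cert, renewed once a month starting on day 30
print(attempts_before_expiry(90, 30, 30))
# 1-year cert handled by hand on the day it expires
print(attempts_before_expiry(365, 365, 1))
```

The more attempts that fit inside the window, the more consecutive failures it takes before an expired certificate reaches production.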
        
           | de6u99er wrote:
           | Not if you treat automation as a first class citizen.
        
         | qwertox wrote:
         | The 3 month limit from Let's Encrypt was a blessing to me, as
         | it forced me to monitor and automate all the renewals.
         | 
         | I renew once a month, and if things should break, I have a two
         | month window to fix the issues.
         | 
         | Before that I would receive a Comodo SSL Certificate once a
         | year via email and until then I always had forgotten what I had
         | to do with it. What an unnecessary pain.
        
           | atkbrah wrote:
             | I suppose it's the "green lock" that drives people to still
             | use certificates issued in a non-automated way.
        
             | lights0123 wrote:
             | What browsers still show a green lock, or anything to
             | differentiate EV certs? Only IE?
        
         | athorax wrote:
         | Ugh, I feel this in my bones.
        
         | darkwater wrote:
         | I know that automation is key here, and all the benefits that
         | doing things very often brings, BUT in the specific case of
         | certificate issuance, if you have short-lived certs you now
         | have to ensure your CA system works perfectly, as its uptime is
         | now the uptime of your whole platform. Yeah, you can outsource
         | it to AWS ACM and the like, or use HashiCorp Vault, but still,
         | it's something that before the change was totally static and
         | now is an extra moving part.
         | 
         | I'm not advocating against it, just exposing the whole story.
        
           | NovemberWhiskey wrote:
           | It depends on your organization, but in many ways the
           | enterprise PKI CA is one of the _easiest_ services to run at
           | high availability. There are hardly any shared-data
           | dependencies, so it's easy to scale; it's almost completely
           | CPU bound with highly predictable demand, etc.
           | 
           | Pretending it's "totally static" is exactly the problem.
           | There are only two kinds of things in the software world -
           | things that can stay the same until your next release, and
           | things that need automation. "Almost completely static" is
           | how your post mortem ends up on the front page of HN.
           | 
           | A consideration of the full story also needs to include the
           | risks associated with long-lived certificates. If you lose
           | control of the private key associated with one, what do you
           | do? Are you actually operating a CRL? Are any of your HTTPS
           | clients actually _checking_ the CRL? What would you do if a
           | severe compromise were discovered that affected the signature
           | algorithm you're using?
        
           | ArchOversight wrote:
           | With a short expiry, let's say 90 days, you should be
           | renewing 30 days ahead of time, so at the 60 day mark you
           | attempt to renew.
           | 
           | This grants you 30 days to fix any problems and get the
           | system back up.
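
The 90/60/30 arithmetic above, written out as dates. A minimal sketch: the function names and default values are assumptions taken from the numbers in the comment, not from any particular CA or ACME client.

```python
from datetime import datetime, timedelta, timezone

def renewal_schedule(issued_at, lifetime_days=90, renew_before_days=30):
    """Return (renew_at, expires_at) for a certificate issued at issued_at."""
    expires_at = issued_at + timedelta(days=lifetime_days)
    renew_at = expires_at - timedelta(days=renew_before_days)
    return renew_at, expires_at

def should_renew(now, issued_at, lifetime_days=90, renew_before_days=30):
    """True once the renewal window has opened."""
    renew_at, _ = renewal_schedule(issued_at, lifetime_days, renew_before_days)
    return now >= renew_at

# Hypothetical issue date, used only to show the calculation.
issued = datetime(2021, 1, 1, tzinfo=timezone.utc)
renew_at, expires_at = renewal_schedule(issued)
print("renew from:", renew_at.date())
print("expires on:", expires_at.date())
print("days to fix a broken renewal:", (expires_at - renew_at).days)
```

Everything between renew_at and expires_at is slack: a failed renewal in that window is a warning, not an outage.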
        
             | shuntress wrote:
             | Then you have to automate handling the 60-day renewal
             | failure warnings which adds _another_ moving part.
        
               | johncolanduoni wrote:
               | If you can't reliably automate sending an unignorable
               | message to some set of humans when something fails to
               | happen, you're going to have a tough time keeping
               | anything actively developed online.
        
       | erwald wrote:
       | That is some impressively fast mitigation for an unexpected
       | problem. 6 minutes to start the incident process, another 6
       | minutes to identify the issue, and another 25 minutes to start
       | rolling out the solution.
        
       | de6u99er wrote:
       | Not sure if internal services necessarily require valid
       | certificates. Most of them don't even require encryption.
       | Encryption, decryption, signing, and validating signatures
       | only cost CPU cycles and increase total power consumption.
       | 
       | Looking at what Epic is doing, I would encrypt customer data and
       | everything that involves money. IMO only communication between
       | data centers, with external payment providers, and with users
       | must be encrypted and require valid certificates.
        
         | Hamuko wrote:
         | I'm running Let's Encrypt certificates on services that are
         | only accessible in my home network. And I live alone.
         | 
         | I mean, why not?
         | 
         | (Granted my certs actually failed earlier this week since my
         | automation had broken)
        
           | xtracto wrote:
           | One problem is that Let's Encrypt certs only work for public
           | domains.
        
         | the8472 wrote:
         | Modern CPUs come with instructions that make symmetric crypto
         | very cheap. And if you err in the other direction you end up
         | with "SSL added and removed here! :^)"
        
       | aluminussoma wrote:
       | Cert Expiration is a problem that needs a better solution when a
       | company does not renew it. These were internal certificates.
       | Still important but not user-facing.
       | 
       | One possible solution might be having the client introduce an
       | artificial delay of 10 seconds or some other time when it
       | encounters an expired cert, or adds an additional second of delay
       | for every day it is expired. This degrades the connection but
       | does not immediately break anything.
        
         | NovemberWhiskey wrote:
         | Oh please no; give me a hard fail I can localize and fix rather
         | than some kind of awful brownout where various parts of the
         | system just go slow and break things just as badly anyway.
         | 
         | Plus you'd need to be way in the guts of the TLS implementation
         | to achieve this; if you're already there, start generating
         | noise a week ahead of the expiration instead.
         | 
         | Or better, none of the above and automate.
        
           | macintux wrote:
           | Concur. From working at Basho, one key takeaway with
           | distributed systems is that a hard failure is much easier to
           | remediate than a slow machine.
           | 
           | We _wanted_ a database server to fail hard. Running slowly
           | just caused cascading failures.
           | 
           | Of course, in this case you're effectively talking about the
           | entire cluster crashing hard, but that's still easier to cope
           | with than every system responding at a snail's pace.
        
           | aluminussoma wrote:
           | I agree that automation is ideal. But let's face it: most
           | companies haven't.
           | 
           | The goal of a business is not to have perfect engineering
           | practices. It is to fulfill customer requests. When there is
           | an outage in the middle of the night, I'd argue that a
           | degraded system buys time to address the issue.
           | 
           | Regardless of the mechanism, having a sudden, complete
           | breakage is not ideal for a business.
        
         | qwertox wrote:
         | No thank you. An artificial delay of 10 seconds is already
         | broken. And adding an additional second a day doesn't improve
         | anything, nor does it help.
         | 
         | If you plan to implement something like this, then do it right
         | and have the service catch the exception and notify an
         | administrator.
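
One way to sketch the "notify an administrator" idea, combined with the earlier suggestion to start making noise a week ahead of expiry. The function names, the 7-day threshold, and the notify callback are all assumptions for illustration. The date string is in the format of the 'notAfter' field that Python's ssl.SSLSocket.getpeercert() returns.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Days left before a cert expires (negative if already expired).

    not_after looks like 'Apr 16 18:29:00 2022 GMT'.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400

def check_certificate(not_after, now, warn_days=7, notify=print):
    """Classify a cert as 'ok', 'warning' or 'expired' and call notify
    (a stand-in for whatever paging or email hook you actually use)."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        notify(f"certificate expired {-remaining:.1f} days ago")
        return "expired"
    if remaining <= warn_days:
        notify(f"certificate expires in {remaining:.1f} days")
        return "warning"
    return "ok"

# Hypothetical dates, only to exercise the three outcomes.
now = datetime(2021, 4, 16, tzinfo=timezone.utc)
print(check_certificate("Jun 20 00:00:00 2021 GMT", now))
print(check_certificate("Apr 20 00:00:00 2021 GMT", now))
print(check_certificate("Apr 10 00:00:00 2021 GMT", now))
```

Run from a scheduler against every deployed certificate, this still fails hard at expiry but gives a week of warning first.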
        
       | magicalhippo wrote:
       | We integrated with a government service. It uses a government
       | supplied authentication service[1] for machine-to-machine
       | communication, based on OpenID IIRC (OAuth2++).
       | 
       | For this, our customers need an EV certificate. Most of our
       | customers are small, and don't have their own IT. It's a mess,
       | most don't understand what it is, don't understand the difference
       | between the two or three certificate files they get, a lot can't
       | even figure out how to extract the files (inside a password
       | protected pdf of all things), password? What password? ...
       | 
       | And then of course the certificates expire. Just like that.
       | Poof. And the person who ordered them last time has moved on to a
       | new job, and so we're back to scratch.
       | 
       | We spend so... much... time... on hand holding this for our
       | customers. Didn't take us long to figure out we need to remind
       | them about certificate expiry, but the rest is just such a PITA.
       | 
       | Technically it's a pretty nice solution, but boy it is not made
       | for normal people.
       | 
       | [1]: https://www.digdir.no/digitale-
       | felleslosninger/maskinporten/...
        
       | thrower123 wrote:
       | It makes me feel better when everybody is fucking up the simple
       | things all the time also.
        
       | arbirk wrote:
       | Apple did it
        
       | [deleted]
        
       | Snoozle wrote:
       | In my experience, certificate issues are a huge tell about an
       | organization and its treatment of IT folks. Every place I've
       | worked which has had issues with last minute certificate changes
       | or expiring certificates without renewal has had a systemic
       | problem with an underpaid and understaffed IT department.
       | 
       | This is not a new problem: organizations will always choose
       | guaranteed profits over possible loss of business unless the loss
       | of business is catastrophic. I just wish that in this case,
       | instead of trying to make it seem like a big deal by writing an
       | entire multipage excuse, a company for once would be honest and
       | say 'The risk percentage did not fall in our favor this time, but
       | we're not going to do anything about it because it didn't really
       | impact our profits.'
        
         | mdoms wrote:
         | This seems a bit presumptuous. Epic's Glassdoor reviews[0]
         | don't seem to list pay or staffing as systemic issues.
         | 
         | [0] https://www.glassdoor.co.nz/Reviews/Epic-Games-
         | Reviews-E2669...
        
           | dijit wrote:
           | I really don't know one way or the other, though as mentioned
           | in another thread: I'm a devops in games and it pays less but
           | not as poorly as it does for programmers.
           | 
           | That said; Glassdoor is a terrible metric and has been widely
           | criticised as a source of information due to the fact that
           | bad reviews can be removed for payment; though "officially"
           | they don't accept payment to delete reviews; it's part of one
           | of their packages to clean up a company's image.
           | 
           | It has also been gamed by employers- but that is obviously a
           | problem for all review sites of this kind.
           | 
           | https://www.reddit.com/r/sysadmin/comments/8tfhxv/glassdoor_.
           | ..
        
         | bartread wrote:
         | Yeah, that's why Epic Games have been transparent enough to
         | post this incident report: not to provide some explanation to
         | their customers, or some information that the rest of us might
         | be able to learn something from, but so that people on HN can
         | make entirely unfounded accusations about the state of their
         | organisation based on (at best) weakly correlated behaviours
         | and symptoms.
         | 
         | Be reasonable: you know nothing about how Epic Games treats
         | their IT staff or whether or not the team is adequately
         | resourced. I wouldn't say certificate expiry is something that
         | happens particularly often, but I have seen it happen, and it's
         | been simply an oversight rather than an indication of some
         | serious systemic issue.
        
           | Groxx wrote:
           | The fact that a company can't deal with a scheduled-far-in-
           | advance, highly-public-if-failed event does tell you some
           | things about their priorities / how well they do things they
           | need to do.
        
             | SilasX wrote:
             | Reminder: Mozilla failed this way too.
             | 
             | https://news.ycombinator.com/item?id=19823701
        
               | Groxx wrote:
               | And in recent years they've been crippling extensions
               | more and more, and even completely dropped support for
               | them from their primary mobile browser for over a year
               | now.
               | 
               | So yes, I think this is one of many signs that they're
               | not paying enough attention to extensions, not a totally
               | isolated "accidents happen" event. Were I an extension
               | author, I'd see that event as reason to be more
               | concerned.
        
             | bartread wrote:
             | OK, fine, I'll bite: what specifically are those things it
             | tells you that you can verifiably claim are true about Epic
             | Games, again, _specifically_?
        
               | Groxx wrote:
               | That they apparently sometimes fail to do these things.
               | 
               | You can't verify _anything_ internal unless you're
               | internal or it has already failed publicly, so you of
               | course have to draw on patterns seen elsewhere. Critical-
               | process failures in one area correlate heavily with
               | failures in others.
               | 
               | Plus, Epic has not exactly shown themselves to be
               | producing consistent quality in anything related to their
               | store, or many internet-connected properties. If they
               | were, this might be more attributable to "accidents
               | happen, it's impossible to prevent them all". It could
               | still be an abnormality, but they're edging further
               | towards "... maybe not though" territory.
               | 
               | ---
               | 
               | Edit: let's add a concrete "kinda example, kinda counter-
               | example". Google is a tech company that is pretty good at
               | consistently renewing its _many_ certificates. They
               | recently failed to do so for Google Voice:
               | https://www.bleepingcomputer.com/news/google/recent-
               | google-v...
               | 
               | I think there's a reasonable argument to be made that
               | this reinforces claims that Google Voice is low priority
               | / at higher risk of future issues due to lack of care,
               | i.e. systemic issues, compared to other Google
               | properties. I have no proof, but that doesn't mean it's
               | automatically unreasonable.
        
               | bartread wrote:
               | Sure, but you can't actually use an example from Google
               | to deduce what's going on at Epic Games.
               | 
               | Don't get me wrong: I'm not saying there aren't problems
               | at Epic Games (most companies have them). What I'm saying
               | is, we're just speculating: how is that helpful? Either
               | to them or to this discussion?
               | 
               | We're either casting vague and hand-wavy aspersions or
               | citing more specific examples where we actually have no
               | idea whether they have any relevance to Epic Games.
               | 
               | It's just noise because, as you've pointed out, we're not
               | internal.
        
               | matkoniecz wrote:
               | > Sure, but you can't actually use an example from Google
               | to deduce what's going on at Epic Games.
               | 
               | It was an illustration of a thought process - that seems
               | to make sense to me.
               | 
               | > It's just noise because, as you've pointed out, we're
               | not internal.
               | 
               | Yes, it is noisier than direct info from inside but you
               | may learn _something_.
        
               | machello13 wrote:
               | Are you arguing that the internal workings of a company
               | can't be visible at all to outsiders? Or that there's no
               | correlation between the rate of public, easily
               | preventable failures and technical incompetence? Or just
               | that it's not "helpful" somehow to point these things
               | out?
        
           | kortilla wrote:
           | Epic games published this as a PR move. Nothing more, nothing
           | less. Customers got mad because Epic fucked up so they had to
           | say something to make it seem complex and totally reasonable.
           | 
           | "We made a bad bet on certs not being that important, it
           | backfired" doesn't sound good but it's the truth.
           | 
           | The same thing happened when Delta got wiped out by a power
           | outage. "We made a bad bet on geo redundancy not being
           | important, it backfired" wasn't good enough for them either
           | so they pontificated just like Epic did here.
           | 
           | It's obvious that Epic doesn't take certificates very
           | seriously here. This is cert management 101. No need to read
           | into it much further.
        
         | PragmaticPulp wrote:
         | > Every place I've worked which has had issues with last minute
         | certificate changes or expiring certificates without renewal
         | has had a systemic problem with underpaid and understaffed IT
         | department.
         | 
         | I've seen the opposite: Organizations who spent so much on the
         | department that everyone was getting promoted to manager and
         | hiring someone underneath themselves to manage things.
         | Responsibilities being shuffled around as the department is
         | constantly reorganized, until no one really understands who's
         | responsible for what any more, but there are enough low-level
         | employees to blame when things go wrong.
         | 
         | I've seen enough variations of organizational dysfunction that
         | I no longer pretend to be able to guess what's going on behind
         | the scenes.
        
         | psanford wrote:
         | > Every place I've worked which has had issues with last minute
         | certificate changes or expiring certificates without renewal
         | has had a systemic problem with underpaid and understaffed IT
         | department.
         | 
         | That's an interesting anecdote, but it's quite easy to find
         | examples of companies with well respected, well paid
         | engineering teams that still have an occasional certificate
         | expire. Microsoft[0], Spotify[1], Facebook[2], Apple[3] have
         | all had embarrassing outages due to certificates expiring.
         | 
         | [0]: https://www.theverge.com/2020/2/3/21120248/microsoft-
         | teams-d...
         | 
         | [1]: https://www.theverge.com/2020/8/19/21375032/spotify-down-
         | son...
         | 
         | [2]: https://www.theverge.com/2018/3/7/17092084/oculus-rift-
         | heads...
         | 
         | [3]: https://www.theverge.com/2015/11/12/9721108/apple-mac-app-
         | st...
        
           | xtracto wrote:
           | Handling certificates right is one of those chores that,
           | particularly in a startup, is easy to overlook. Suppose the
           | average tenure for employees is 2 years, and a certificate
           | can be bought for a bit more than 2 years. Normally someone
           | will put it in their calendar and leave the company before it
           | expires, so the new employee will be welcomed by an expired
           | certificate and a not-so-clear list of places to install it.
           | 
           | That's why things like AWS Certificate Manager + ELB are
           | useful: certificates are mostly auto-renewed.
           | 
           | It is a chore that has bitten most of the places where I have
           | worked.
        
           | machello13 wrote:
           | The Apple issue was not a case of forgetting to renew a
           | certificate; certain 3rd party apps just weren't handling the
           | upgrade correctly. So maybe not quite as easy to find
           | examples after all.
        
         | renewiltord wrote:
         | This is so overfit. Unbelievable anyone goes along with it. I
         | was paid north of $400k total comp when I made this error last.
         | Easy mistake to make.
        
         | spondyl wrote:
         | We had this issue a few times at the place I previously worked.
         | 
         | At first, it wasn't clear whose responsibility it was since
         | back in the operations day, emails would go to someone's
         | specific address or even a mailing group, where most of the
         | employees who were on it had left while new employees weren't
         | added to the list since they didn't know about it.
         | 
         | After it happened once or twice, metrics were set up to track
         | expiring certificates (they were mostly all migrated to AWS
         | Cert Manager I believe) while a few key ones couldn't be.
         | 
         | As a bit of background, we also follow the Google-esque model
         | of not having a phone number for customer support and requiring
         | customers to submit a ticket. We do have outgoing calls but no
         | incoming phone number.
         | 
         | I say that because those key certificates would generate an
         | email that said something like "Press this button and we'll
         | call you to confirm you want to renew" so as you can imagine,
         | my first thought was "Well, how the fuck is shit gonna work?"
         | 
         | I think in the end we just ended up calling the certificate
         | provider to say we don't have a phone number and then we
         | managed to get them migrated to DNS-based validation after some
         | time.
         | 
         | This too wasn't a case of being underpaid but rather having a
         | lack of knowledge. It's the sort of task that some particular
         | person did for a long time but then left so none of us newer
         | folks even knew where these things were provisioned from.
         | Additionally, you don't feel like you have the authority to, e.g.,
         | call up some multi-national provider and be like "Hi, we own
         | this thing but umm, I have no idea how to go about renewing
         | it". It feels like being a teenager calling up about a first
         | job haha.
         | 
         | It's just one of the casualties of "high growth" businesses
         | mixed with humans being bad at seeing cause and effect when the
         | gap between the two is super wide. Cause being people leaving
         | and effect being "I forgot to ask how to do X or Y"
         | 
         | I guess I would clarify that we were following a devops model
         | but had transitioned from a classic dev/ops split so it's quite
         | literally a generational thing where you conceptually don't
         | know how to go about, e.g., renewing a certificate on the phone
         | because you've entered the industry in the time of DNS
         | validation via Let's Encrypt (and because there literally are no
         | phones anymore in the business).
        
         | whalesalad wrote:
         | In my experience there are a lot of IT departments full of
         | people who know how to click around and hack shit together but
         | aren't what you'd call classically trained experts.
         | 
         | Kinda like "I'll get my nephew to make my website"
        
         | nickysielicki wrote:
         | Video game developers are underpaid because they have an
         | undying love for video games and are willing to work for less
         | than they could make elsewhere.
         | 
         | I suspect this becomes a problem in the context of hiring devops
         | people, because whereas you can make the argument that writing
         | game engines and working on game logic is more fun and
         | justifies working for less, it's hard to make the argument that
         | a devops job at Epic running game servers and websites is any
         | more exciting than running servers and websites anywhere else.
         | 
         | This puts Epic in the situation of having to pay market rate to
         | attract devops people, but below market rate for attracting
         | developers, which fucks up their pay scaling completely. What
         | ends up happening is they just don't adjust their pay scale at
         | all, which means they're hiring cheap devops people.
        
           | joana035 wrote:
           | I interviewed with Epic Games and got all the questions
           | answered, though I used generic terms to describe each AWS
           | product and drilled down into specifics/fundamentals of the
           | questions, protocols, configuration gotchas, etc. Got
           | rejected with "no experience with AWS".
           | 
           | Now seeing this I'm sure I dodged a bullet.
        
           | [deleted]
        
           | Impossible wrote:
           | Epic doesn't pay below market rate. They can't offer stock
           | because they aren't public but offer cash bonuses 2x-4x
           | salary.
           | 
           | I do agree with OP that (some) game developers undervalue IT.
           | Oculus had a similar issue and its pay rate was equal to FAANG (because
           | it is FAANG!), so it came from culture, not pay.
        
             | hsbauauvhabzb wrote:
             | Can you elaborate more on the bonuses? Is bonus based
             | income common for these sorts of companies?
        
             | vmception wrote:
             | I've never worked in a place that gave more than a few
             | thousand dollars in bonuses :/
             | 
             | Yeah I know FAANGs and investment banks can be impressive
             | on the bonus front too
             | 
             | But the prevalence of this just seems disconnected from
             | what is considered normal or bragworthy in the rest of the
             | private sector and world
        
           | rootsudo wrote:
           | This. I interviewed for a role with a game studio and the pay
           | was 40% lower than what I'm making.
           | 
           | I was just curious since they approached me and I had fun
           | with the experience and saying I didn't play their games/had
           | no idea.
           | 
           | The recruiter had no idea of local wages.
        
           | bcrosby95 wrote:
           | > Video game developers are underpaid because they have an
           | undying love for video games and are willing to work for less
           | than they could make elsewhere.
           | 
           | I'm not sure that love lasts forever though. I'm childhood
           | friends with a lot of people that went into games and left by
           | their 30s because they couldn't justify the pay difference.
           | That said, maybe the games industry doesn't need these
           | experienced people.
        
           | dijit wrote:
           | I work as a devops in the games industry. It's true that it's
           | underpaid, and by quite a bit. But it's not as bad as the
           | programming teams, IME devops pays more.
        
           | Thaxll wrote:
           | Epic pays very well compared to the rest of the industry and
           | way above your average dev company. Your comment is out of
           | touch with reality because it's definitely not gameplay devs
           | that manage certificates; they have a central team like
           | Google does.
           | 
           | When people say video games don't pay well, that does not
           | apply to the likes of EA, Activision, Epic, Unity etc ...
        
       ___________________________________________________________________
       (page generated 2021-04-16 22:01 UTC)