[HN Gopher] Google Outage in Europe
       ___________________________________________________________________
        
       Google Outage in Europe
        
       Author : vanburen
       Score  : 255 points
       Date   : 2021-11-12 10:12 UTC (12 hours ago)
        
 (HTM) web link (www.google.com)
 (TXT) w3m dump (www.google.com)
        
       | natch wrote:
       | EU to AI:
       | 
       | Here's a EUR2.4 billion fine for you.
       | 
       | AI to EU:
       | 
       | OK, human.
        
       | elp wrote:
       | The Google imap service has also been unstable in South Africa
       | this morning.
        
       | rzr wrote:
       | Let me crosslink to invidious tracker:
       | 
       | https://github.com/iv-org/invidious/issues/2577
        
       | markusgammer wrote:
       | WoW
        
       | ac50hz wrote:
       | I guess the network admin who skipped the classes on Network
       | Fundamentals, has moved on from their recent placement in OVH, to
       | Facebook, then to Google.
       | 
       | Next stop AWS, firewall maintenance division?!
        
       | ulzeraj wrote:
       | Didn't even notice. Work doesn't use any GCloud components and
       | personally it's been a while since I've degoogled myself. No
       | Google docs, mail or search. Not even the quad 8 nameservers.
        
         | rossdavidh wrote:
         | Downvoted from people who are jealous, perhaps?
        
         | yosito wrote:
         | Not sure why you got downvoted. I'm also a degoogled European
         | and didn't even notice.
        
       | dredmorbius wrote:
       | Also reported / dupe here:
       | https://news.ycombinator.com/item?id=29197479
        
       | everydaypanos wrote:
       | reCaptcha and translate are also down. Even search is affected;
       | it is super slow.
        
         | jacquesm wrote:
         | And meet. Meetings starting is super slow, sometimes it times
         | out.
        
         | jbverschoor wrote:
         | Yeah had recaptcha issues too
        
       | mschuster91 wrote:
       | Google Forms _definitely_ has issues, too. I just had an issue
       | where I couldn't fill out a form.
        
       | synchton wrote:
       | Is it me or are big provider outages getting more common these
       | days?
        
         | sva_ wrote:
         | Almost like bad weather. Perhaps we'll have cloud outage
         | forecasts at some point.
        
         | yunohn wrote:
         | There's always this question on every "X is down" post on HN.
         | 
         | Quite boring commentary tbh.
        
         | afc wrote:
         | I think it's just you, probably recency bias. I recall the days
         | of Gmail being globally down for hours, more than once (3
         | times, IIRC), back in 2009.
        
           | The-Bus wrote:
           | I also can't think of the last time I saw Twitter's
           | Failwhale, which almost became their default mascot at one
           | point.
           | 
           | (Yes, I know they discontinued it, but I haven't seen the
           | service "over capacity" in some time. I say this as a casual
           | web user of it, not an active account holder or app user).
        
             | riffic wrote:
             | Twitter had a global outage this year (April 16).
             | 
             | Facebook had an October outage.
             | 
             | This happens quite regularly with distributed and complex
             | systems.
        
           | [deleted]
        
         | that_guy_iain wrote:
         | I think they're reasonably common; it's just becoming more of a
         | thing to notice.
        
         | scrollaway wrote:
         | As time passes, the more big cloud providers there are, and the
         | more complex they individually get (more products).
         | 
         | Assuming the chance of an outage is actually static. If there
         | is one (highly reliable and trusted) provider with five
         | products one year, and three providers with ten products the
         | next year, the chance of you seeing an outage has gone way up
         | because of surface area.
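           | 
           | A rough sketch of that arithmetic, assuming every product
           | fails independently with the same yearly probability. The 2%
           | figure is made up purely for illustration.

```python
def p_any_outage(per_product_p: float, products: int) -> float:
    """Chance that at least one of `products` independent products has an outage."""
    return 1 - (1 - per_product_p) ** products

p = 0.02  # assumed yearly outage chance per product (hypothetical, not measured)

year_one = p_any_outage(p, 1 * 5)    # one provider with five products
year_two = p_any_outage(p, 3 * 10)   # three providers with ten products each

print(f"year one: {year_one:.1%}, year two: {year_two:.1%}")
```

           | The per-product chance never changes between the two years;
           | only the number of things that can break does.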
        
           | deepstack wrote:
           | More reason to promote self-hosting, and decentralised
           | systems like smtp, matrix.org, ActivityPub. The idea of
           | having all the data and servers concentrated in a few players
           | such as google, amazon, ccp is not reliable for the digital
           | operation of the planet.
        
             | altgoogler wrote:
             | Disclaimer: am Googler.
             | 
             | Self-hosting is a good option, especially when mixed with
             | multi-cloud offerings.
             | 
             | The big rub is that it's really hard to approach the same
             | level of availability the cloud offerings already give you.
             | Depending on work-load, self-hosting is typically more
             | expensive.
             | 
             | This is why the SRE book talks about availability budget.
             | You can't have 100% uptime, so how much do you want to pay
             | to get close to it?
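             | 
             | As a back-of-the-envelope sketch (my own arithmetic, not
             | taken from the book), here is how much downtime per year
             | a given uptime target leaves you, assuming a 365-day
             | year:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a service may be down while still meeting the target."""
    return (1 - availability) * HOURS_PER_YEAR

for target in (0.99, 0.999, 0.9999, 0.99999):
    hours = allowed_downtime_hours(target)
    print(f"{target:.3%} uptime -> {hours:.2f} h/year of downtime")
```

             | Each extra nine shrinks the allowance by a factor of ten,
             | which is where the cost question comes from.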
        
             | kertoip_1 wrote:
             | The question is: should we trust small, underfunded,
             | hobbyist servers more than large corporations that have a
             | money-driven reason to maintain high quality of services?
        
               | riffic wrote:
               | for the average end user: heck no.
               | 
               | for those with budgets to hire server administrators (or
               | pay for third-party managed services)? yes.
               | 
               | second category includes almost anyone with a large
               | following like institutional users.
               | 
               | hell the incumbent social media service operators can
               | white-label their existing software and sell this as a
               | service.
        
               | fragmede wrote:
               | Yes. Like some sort of workspace? Google could make one.
               | I wonder what they'd call it?
        
               | scrollaway wrote:
               | I'm sure Google wonders the same thing, given that it's
               | had like five different names over the years.
        
           | querez wrote:
           | Except that Google services don't run within GCP, but apart
           | from it (mostly on Borg I'd guess).
        
             | scrollaway wrote:
             | The above applies to cloud as understood by techies just as
             | it does to b2c cloud as understood by laypeople.
        
             | jsty wrote:
             | Their internal platform is likely analogous in terms of
             | accumulating complexity (and a bit of cruft) over time
             | though
        
             | nefitty wrote:
             | If anyone else is curious, Borg is Google's cluster
             | management software:
             | https://kubernetes.io/blog/2015/04/borg-predecessor-to-
             | kuber...
        
         | riffic wrote:
         | ask anybody who researches complex and distributed systems and
         | see what they say.
        
       | bla3 wrote:
       | Looks like it might be resolved now.
        
       | nudpiedo wrote:
       | Seems that google is not anymore the company it used to be
       | advertised. The myth prevails though.
       | 
       | It's a pity as some of its services are frankly among the best,
       | and I depend on them.
        
         | southerntofu wrote:
         | > Seems that google is not anymore the company it used to be
         | advertised.
         | 
         | Do you mean about the "Don't be evil" slogan it started with?
         | Even back then, it was a pretty obvious move for any movie
         | villain.
         | 
         | Or do you mean you actually bought into the reliability
         | promises of cloud providers thinking downtime would not exist
         | anymore?
        
       | 1cvmask wrote:
       | And many people blame their internet provider as google outages
       | are rare.
        
         | ratww wrote:
         | Now that I think of it, I might have blamed my provider or at
         | least restarted my modem if Hacker News was ever down...
        
           | rablackburn wrote:
           | It's the perfect connection test.
           | 
           | 9 times out of 10 if you can't connect to hn the problem is
           | you, not them. Probably even 99/100
        
             | paganel wrote:
             | I usually check the HN website to see if my internet is
             | down, both on my mobile and on my work computer. I used to
             | just type "test" in the browser address bar hoping for the
             | related Google SERPs but checking for HN seems way faster
             | nowadays.
        
             | garaetjjte wrote:
             | HN is down pretty frequently, but I think it shows custom
             | error page served from Cloudflare.
        
         | veltas wrote:
         | People and software: when Facebook was down, my mobile thought
         | the internet was unavailable.
        
         | vanderZwan wrote:
         | I don't think it's that so much as the fact that half of
         | the internet services that rely on Google along the line
         | somewhere (but not visibly to the users) are down, at least for
         | me. And yet the Google search page can be reached (again,
         | speaking from my European perspective, sample size one). To
         | make it even more confusing: these issues are limited to my
         | WiFi. If I use my phone connection as a hotspot the problems
         | disappear.
         | 
         | So from the perspective of a regular person it is very
         | reasonable to conclude that it is their internet provider that
         | is down.
        
           | kaba0 wrote:
           | And I think no internet provider has such a good name where
           | an outage on their part would be unheard of. Much more so
           | than google.
        
       | blauditore wrote:
       | ITT: "My raspberry has been up for 2 years with almost no
       | downtime, what is up with these suckers?"
        
         | remram wrote:
         | "For a Linux user, you can already build a Google yourself
         | quite trivially with wget -R and grep."
        
         | Nextgrid wrote:
         | It's valid criticism for projects that don't need the scale or
         | features that cloud-based solutions offer.
         | 
         | A single machine won't give you the same level of scaling and
         | management features, but also won't have hundreds of
         | distributed moving parts that could break and take your service
         | down as a side-effect.
        
           | Spivak wrote:
           | I feel like you are selling the cloud short for SMBs. The
           | majority of my career has been spent working for shops that
           | have their own datacenters or colo. It's so freaking cheap to
           | run a huge service it's not even funny. And it's a dream to
           | have so much compute and storage floating around that you can
           | "waste" it without even thinking -- it's already paid for!
           | 
           | However, when you're small the economics flip. You want to run
           | as little as possible as cheap as possible. And a load
           | balancer pointing at two ASGs in different availability
           | zones is by far the cheapest way to get something production
           | ready on the internet. You can go from your garage to a 100M
           | business and probably never need to graduate from this
           | architecture.
        
       | sebow wrote:
       | Good. Maybe these outages will make people realize how dependent
       | they are on a few monopolistic companies. Not only will these
       | give rise to competitors, but hopefully they will also make
       | people's information more independent.
        
       | intunderflow wrote:
       | I felt a great disturbance in the force, as if millions of pagers
       | suddenly cried out in terror
        
       | [deleted]
        
       | iso1210 wrote:
       | I do laugh at people that say "Using the cloud means less
       | downtime than your own server", then things like this come along
       | :D
        
         | HatchedLake721 wrote:
         | It's still less downtime than your own server and you have
         | hundreds of engineers 24/7 ready to diagnose and fix issues
        
           | tester34 wrote:
           | >It's still less downtime than your own server
           | 
           | what makes you think so, actually?
        
             | michaelt wrote:
             | In my experience, it's very easy to achieve high uptime
             | through luck. If I only run a single server, and it has a
             | 10% chance of failing in any given year, I have a 90%
             | chance of achieving 100% uptime in a given year.
             | 
             | In my experience it's _also_ very easy to _think_ you've
             | got all your bases covered when actually you haven't. I'm
             | protected against mains power failures by a UPS and a
             | generator - but a UPS switchgear fault can cut my power
             | even without a mains power outage. My server has dual power
             | supplies and network cards - but that won't help me if a
             | clumsy worker sent to replace the server above mine unplugs
             | mine by mistake. And so on.
             | 
             | If I _think_ I'm doing a better job than a billion-dollar
             | corporation with hundreds of thousands of servers, does
             | that mean I am? Or is it more likely I'm fooling myself?
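             | 
             | To put numbers on the luck part, a small sketch using the
             | 10% figure from above and assuming each year is
             | independent of the last:

```python
def p_perfect_uptime(p_fail_per_year: float, years: int) -> float:
    """Chance of zero failures across `years` independent years."""
    return (1 - p_fail_per_year) ** years

# 10% yearly failure chance, as in the single-server example above
for years in (1, 2, 5, 10):
    print(years, "years:", round(p_perfect_uptime(0.10, years), 3))
```

             | A perfect record over a year or two says very little
             | about the setup; the odds only start to drop off over
             | longer spans.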
        
               | ihumanable wrote:
               | This so much.
               | 
               | And an excellent corollary to this is that when you have
               | lucky 100% uptime there is no incentive to optimize mean
               | time to recovery.
               | 
               | Sure the raspberry pi in your closet has been running
               | fine for years, 100% uptime, but then a component fails.
               | Do you have a replacement on hand? Are you continuously
               | monitoring it to know it went down? The component failed
               | at 3am, did it page you? Did you hop right out of bed to
               | rush to fix it?
               | 
               | Single systems can have really nice uptime until they
               | don't. Then you are hoping that the people on hand can
               | repair what's going on after months or years of never
               | having to do that. Mean time to recovery might be a week
               | while you wait for new hardware or a few hours while you
               | google some error message you've never encountered.
               | 
               | People can run their own systems if they want to, but
               | they shouldn't confuse good luck with rigorous
               | engineering.
        
               | lostmsu wrote:
               | Do you have some external ping test running continuously?
               | If not, you really have no idea.
        
               | iso1210 wrote:
               | Of course, how else would I know the exact failure times,
               | my service monitoring only polls every few minutes. That
               | doesn't help me with my knowing how my service is
               | performing though, I'm not offering a ping service.
               | 
               | More importantly, I do know I had an error loading
               | twitter at Thu 11 Nov 04:17:01 GMT 2021, however my
               | websites (and google and hackernews) were working. At
               | 17:53:01 GMT Twitter took over 2 seconds to load its
               | first http page, far beyond the normal 400ms. Google took
               | 23 seconds to load this morning, BBC News just 0.071.
               | 
               | On the other hand I also need to provide services which
               | can't cope with outages measured in milliseconds. Good
               | luck with complaining to a cloud provider that your
               | traffic vanished for 4 seconds. Those services thus have
               | multiple connections on independent hardware and circuits
               | with no single point of failure
        
               | anthony_r wrote:
               | Indeed. People do not realize just how advanced the
               | reliability infrastructure of those services is. Things
               | like diesel power generators have been baked into cloud
               | datacenters for what, a decade now? Probably longer. Show
               | me your alternative power source when the power goes out
               | (and power does disappear, everywhere, eventually).
        
               | iso1210 wrote:
               | Diesel generators have been baked into my on prem
               | equipment room for at least 40 years
               | 
               | For your average person working in an average office if
               | the power goes out you're not going to be working anyway,
               | so it doesn't matter if your server is offline too.
        
               | remram wrote:
               | I also have a personal dedicated server that never
               | crashed or restarted this year. However I am not sure how
               | much it was actually available. I know for a fact that
               | there were multiple network issues at OVH. I also know
               | that had my server been home, it would have been worse
               | (Optimum residential is awful).
               | 
               | The server not failing is not the only outage mode.
        
           | iso1210 wrote:
           | Really, downtime of google so far this year: More than 0
           | minutes. Downtime of my own server: 0 minutes.
           | 
           | Of course I'm not trying to run a massively scalable service
           | coping with millions of customers, because I don't need that.
        
             | mekkkkkk wrote:
             | You think that's how statistics works? Obviously it's
             | possible to keep your own server running with 0 downtime.
             | But you are exposed to a much higher risk of severe
             | downtimes longer than the cloud provider would likely be.
             | Be it hardware failure, grid, ISP, whatever.
        
               | iso1210 wrote:
               | Claim was
               | 
               | "It's still less downtime than your own server"
               | 
               | Claim disproven
        
               | mekkkkkk wrote:
               | That claim was a direct snippet from your original
               | comment. So unless you meant that you laugh at people who
               | are talking specifically about your personal server, I'd
               | say you are being selectively literal.
        
               | iso1210 wrote:
               | Cloud kool-aid drinkers say "The cloud is always better"
               | 
               | I say that in many cases your own server is better. My
               | company runs its own on-prem confluence, it's been taken
               | down for updates on a regular basis at a known
               | maintenance time. That's far better than losing it
               | because a cloud based one was hosted on GCP this morning
               | when we actually need the data
               | 
               | Obviously in many other cases the cloud is better. You
               | wouldn't want to serve a million customers across the
               | world from a single server in your own basement. That's
               | not the only model.
        
               | mekkkkkk wrote:
               | I agree completely with that added nuance. Cloud infra is
               | definitely overused and is sometimes seen as the only
               | option these days, even for the simplest of projects.
               | Having your own metal is often times much more convenient
               | and cheaper. Plus it's much more enjoyable to work with.
               | 
               | My only gripe was with your gross over-simplification
               | that read like "hurr durr, my server hasn't gone down
               | this year, so self-hosting has better uptime than
               | Google". It's such an unnecessary and baseless argument.
        
               | iso1210 wrote:
               | Vast vast majority of arguments you hear on HN is that
               | it's impossible to run your own equipment. That's as
               | ridiculous as saying you could run dropbox off a
               | raspberry pi, but tends to get pushed to the top, and the
               | schadenfreude when these events inevitably come along
               | is too good to miss.
               | 
               | Every few months another major outage of another cloud
               | provider hits the headlines, meanwhile millions of small
               | companies have no problem with the uptime of their
               | 'legacy' services.
               | 
               | I was at a farm a few weeks ago, the farmer had a server
               | in a closet. It did break on occasion, when there was a
               | power outage. His desktop and internet broke too, so what
               | would be the point in his server working?
               | 
               | If it was hosted on google it wouldn't have been working
               | this morning, despite his computer being fine.
               | 
               | If you build your business processes around accepting
               | failure, it's not a problem. It's far easier to keep at
               | least one out of 3 machines online for 99.999% of the
               | time than to keep a single service running for the same
               | time.
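               | 
               | Rough sketch of why, assuming the three machines fail
               | independently (no shared power, network or bad deploy)
               | and each one is up 99% of the time. The 99% is only an
               | illustrative number.

```python
def p_at_least_one_up(per_machine_availability: float, machines: int) -> float:
    """Chance that at least one of `machines` independent machines is up."""
    return 1 - (1 - per_machine_availability) ** machines

single = 0.99  # assumed availability of each machine (hypothetical)
combined = p_at_least_one_up(single, 3)

print(f"one machine: {single:.5%}, at least one of three: {combined:.5%}")
```

               | Under those assumptions three mediocre machines clear
               | five nines between them. Correlated failures are what
               | break the math in practice.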
        
               | remram wrote:
               | To be fair, this is exactly how statistics work. The
               | _average_ downtime of self-hosted servers might be higher
               | (arguably) but many people are under Google's yearly
               | downtime (aka _variance_ is high).
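               | 
               | A toy model of that, reusing numbers floated elsewhere
               | in this thread (10% yearly failure chance, a week to
               | recover). The one hour of cloud downtime per year is my
               | own placeholder, not a real figure.

```python
p_fail = 0.10          # assumed chance a self-hosted server fails in a year
repair_hours = 7 * 24  # assumed recovery time when it does fail (one week)
cloud_hours = 1.0      # assumed yearly cloud downtime (placeholder value)

# Expected downtime per self-hosted server, averaged over all servers
self_hosted_mean = p_fail * repair_hours

# Share of self-hosted servers with zero downtime, i.e. beating the cloud
share_beating_cloud = 1 - p_fail

print(f"mean self-hosted downtime: {self_hosted_mean:.1f} h/year")
print(f"cloud downtime:            {cloud_hours:.1f} h/year")
print(f"self-hosters beating cloud this year: {share_beating_cloud:.0%}")
```

               | In this model the average self-hoster is worse off, yet
               | most individual self-hosters honestly report a better
               | year than the cloud had.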
        
           | southerntofu wrote:
           | This is so misinformed by industry propaganda. Modern cloud
           | services are very often unavailable from a region without any
           | apparent reasons. Or the service appears to be available but
           | some specific feature doesn't work. Or it's available but
           | just really slow or dropping packets.
           | 
           | When you have a "simple" (already quite complex)
           | BGP(TCP(HTTP)) tunnel, chances are things just work and it's
           | easy enough to diagnose issues. Between anycast, auto-
           | scaling, and WAFs (among others) you have added so many
           | layers of complexity that the chance of a "random" error
           | somewhere in the stack has dramatically increased and
           | diagnosis is close to impossible.
           | 
           | It used to be simple that web services are either up or down.
           | Now it's not so obvious anymore, and that's definitely not
           | being reflected in the 5 9's SLAs falsely promised by cloud
           | providers. I can say with confidence most selfhosted systems
           | i've been close to have much better uptime than modern cloud
           | services and are much cheaper to service and maintain.
        
             | exikyut wrote:
             | I agree with specific dimensions of this sentiment.
             | 
             | Firstly, I think that the GP comment ever so slightly
             | misapplies the localized awareness available in smaller
             | environments to the cloud. Yes, there are tons more
             | engineers to look at stuff, but those engineers are a)
             | already bogged down keeping up with internal infra,
             | politics and bureaucratic machinery, and b) quite some
             | distance further away from actual errors occurring on the
             | ground because everything's aggregated to the hilt so the
             | stats remain comprehensible at the higher scale.
             | 
             | I also agree that the newer stuff that is less mature has
             | an order of magnitude more intrinsic leaky abstractions
             | than old designs, which I do think were more hygienic and
             | espoused more effective separation of concerns than what is
             | used today.
             | 
             | Also, not to nitpick, but the "classical" canonical
             | definition would probably look closer to
             | BGP(TCP(TLSv1.3(HTTP/1.1))), with the modern equivalent
             | being BGP(UDP(HTTP2)) and the future being BGP(UDP(QUIC)).
             | I do agree that the rapid consolidation of HTTP and TLS,
             | without wide-scale awareness and slow, methodological
             | development of general introspection tooling, does make
             | things net worse in general. I suspect infra will still be
             | using HTTP/1.1 for a long time into the future until this
             | materially changes.
        
         | i5heu wrote:
         | Right?
         | 
         | My private website on my RPI is running now 2 Years without a
         | problem and only minimal downtime due to rebooting for the new
         | kernels.
         | 
         | It is amazing how much uptime you can achieve with a 5$
         | Computer in comparison to a 1730000000000$ (1,73 tera $)
         | Company. Even if you compensate for dynamic content.
        
           | exikyut wrote:
           | (Google is currently processing multiple terabits of traffic
           | per second using several billion dollars' worth of
           | distributed infrastructure.)
        
             | iso1210 wrote:
             | Actually it's not, and that's the problem
        
               | exikyut wrote:
               | FWIW I vaguely recall some kind of website that showed
               | approximate bandwidth use... it was something like a
               | cross between statcounter and alexa except for network
               | stats and the like.
               | 
               | I think it purported Google to be hovering around 15Gbps.
               | That sounded humongously wrong, like I'd expect global
               | traffic <-> Google to at least be a couple terabits,
               | right?
               | 
               | I'd be very curious if there's a way to actually ballpark
               | the number. Or maybe it is in fact possible to implement
               | services that reliably track this sort of thing, and my
               | futile searches earlier just weren't finding them...
        
             | Nextgrid wrote:
             | And yet it doesn't mean his RPi-powered website needs to
             | ever sustain that kind of traffic, so he doesn't have to
             | take the additional risk of running a distributed system.
             | 
             | In contrast, in the cloud you typically take on the risk of
             | the underlying distributed platform (optimized for managing
             | thousands of VMs, etc) even though you only need a single
             | machine for yourself.
        
           | [deleted]
        
           | remram wrote:
           | Maybe Google's servers are also running without a problem and
           | you can't reach it for a different reason. Self-reported
           | uptime is not a good measure of server availability.
           | 
           | Was your home internet available all of the time? How many
           | times did you reboot your modem?
        
       | kroolik wrote:
       | There were outages around the same time last year. Somebody in
       | the HN thread commented back then that the employees evaluation
       | and promotion window ends around december/eoy, thus more releases
       | are made.
       | 
       | https://en.m.wikipedia.org/wiki/Google_services_outages
        
         | inglor wrote:
         | I would be very interested in any research about code quality
         | with relation to promotion packets. Ideally academic.
         | 
         | I am not sure where to look for it
        
           | rajin444 wrote:
           | Is there an objective way to measure code quality?
           | 
           | We barely understand the systems we're writing code for - I
           | don't see how you could objectively judge the code to manage
           | those systems without fully understanding the systems first.
           | 
           | You could hand-wave some metrics like number of bugs or test
           | coverage, but I can't think of any (aforementioned included)
           | that aren't subject to tons of confounding variables.
        
         | trhway wrote:
         | November 11 is a public holiday in the US. And it seems it has
         | in recent years become a company holiday in many companies too.
         | Maybe that break in the cadence of workdays and holidays,
         | established over many years, plays a role here, like say
         | affecting smoothness of support transition between US-EMEA-APJ.
        
           | mikelward wrote:
           | Veterans Day is not a day off for US Googlers.
        
         | delroth wrote:
         | Perf has been over for almost a month now, and the evaluation
         | period was over more than two months ago.
        
           | kroolik wrote:
           | Then it might, indeed, be unrelated. Thanks for
           | clarification!
        
           | samhw wrote:
           | Could it not be knock-on effects of earlier changes? Or
           | indeed post-review changes from scared/precarious engineers?
        
             | mucle6 wrote:
             | I think we are exactly mid way between the end year 2021
             | perf and mid year 2022 perf, so it's not clear to me that
             | this could be to blame
        
               | skinkestek wrote:
               | That explains:
               | 
               | The people who did the stuff have either gotten their
               | promotions and moved on or didn't get a promotion and
               | have given up.
               | 
               | :-)
        
           | dspillett wrote:
           | _> and the evaluation period was over more than two months
           | ago_
           | 
           | Excellent time to slack off a bit, make a few mistakes, then
           | come next eval you can point to a marked improvement over the
           | intervening six-to-nine months!
        
             | roenxi wrote:
             | The line of logic this thread has followed so far suggests
             | there will be a reason for a Google outage every month of
             | the year.
        
               | NikxDa wrote:
               | Couldn't have phrased it any better.
        
               | shuckles wrote:
               | I believe the term of art is "just so story."
        
         | [deleted]
        
         | de6u99er wrote:
         | Some people are definitely not getting a raise.
        
           | jankeymeulen wrote:
           | More likely someone actually will. A blameless postmortem
           | will be written, and the people that will fix the bug or
           | systems issue will have something to work on with high
           | visibility and high impact, which tends to translate into
           | good performance ratings.
           | 
           | (Googler, opinions are my own)
        
         | deanCommie wrote:
         | And this is why nobody serious should build on GCP.
         | 
         | Google's more interested in placating their prima donna
         | engineers than solving customer problems.
        
         | bamboozled wrote:
         | There is also a lot of pressure on engineers around Black
         | Friday and Cyber Monday to get things done before any change
         | embargoes come into place. This is coming from someone who just
         | worked 16 hours straight to scale things before a change
         | freeze.
         | 
         | I don't know how or if this would impact Google, but I'm sure
         | someone there has at least thought about those dates.
        
         | kixiQu wrote:
         | Far more likely IMO to coincide with trying to get Exciting New
         | Features out before release freezes for some conference or
         | other.
        
         | oblio wrote:
         | He he. It's kind of hilarious. Maybe they should stagger
         | reviews to ensure high availability? :-D
        
           | kroolik wrote:
           | Nothing wrong with it. People make errors, distributed
           | systems aren't easy. More frequent changes - more likely a
           | bug was introduced.
           | 
           | My post is just speculation; let's wait for the actual
           | technical details doc from Google.
        
             | oblio wrote:
             | Well, there are various whitepapers that reformulate this
             | exact truth you've highlighted: issues tend to happen more
             | when changes are made.
             | 
             | Regarding the technical details doc, Google will never
             | state that outright in individual postmortems. And they
             | will definitely not draw this to the logical conclusion
             | regarding the spiky yearly activity.
        
               | UncleMeat wrote:
               | It can't be logical because it is based on bad facts.
               | 
               | We wrote evals in August. Anybody racing to get things
               | launched before perf did so months ago. The timing
               | described in the top post is just false.
               | 
               | We write the next one in February. We are almost exactly
               | half way between perf cycles. Your hunch that this lines
               | up with the end of the year is false.
        
               | yunohn wrote:
               | > logical conclusion regarding the spiky yearly activity
               | 
               | Why is it logical? There's tons of changes being deployed
               | at all times, at all large companies.
               | 
               | Across products, verticals, everything - hundreds of
               | changes at any given point. Some of these changes can
               | introduce hard-to-predict bugs in globally distributed
               | systems. Most of the time, external users don't even
               | notice before they're fixed.
               | 
               | Like another commenter said, performance reviews do not
               | coincide with the year-end at Google and other
               | companies.
        
               | oblio wrote:
               | > Like another commenter said, performance reviews do not
               | coincide with the year-end at Google and other
               | companies.
               | 
               | When are they? My bet is that they're roughly around the
               | same time period, even if they don't start or close on
               | December 31st.
        
               | michaelt wrote:
               | Hypothetically speaking? 11 months a year there's no
               | incentive to cut corners and only do 2 weeks of testing
               | on something that really needs 3. If one month a year
               | rushing things out with a bit less testing is rewarded, I
               | can believe some people would respond to that.
               | 
               | Of course, I wouldn't go as far as to call this a
               | "logical conclusion" as the evidence I've seen is very
               | slim.
        
               | fragmede wrote:
               | Except that there are two a year at Google, so it's 10
               | months of the year, 5 and 5, and Covid's made everything
               | weird.
        
         | moffkalast wrote:
         | Then they get fired instead, well done ha.
        
           | evercast wrote:
           | Comments like this make me wonder if people really expect
           | engineers to be fired because of an outage? I do not work at
           | Google, but none of my workplaces would fire engineers
           | because of a failure. Mistakes happen. As long as they are
           | not repeated, everything is good.
           | 
           | If your company fires people in situations like this, run
           | away and never look back.
        
             | ddalex wrote:
             | Googler here, not speaking on the behalf of the company, my
             | opinions are my own
             | 
             | People do absolutely NOT get fired over incidents. Making
             | mistakes is human. An incident will prompt a review of the
             | systems and safeguards in place to prevent such an
             | incident, much like an airline incident investigation -
             | 
             | basically "somebody fat-fingered it" is never the answer,
             | postmortems are always blameless
             | 
             | EDIT: now that I think of it, the opposite thing happens
             | after a major incident - a systemic failure should be
             | identified, people are being hired to fix it :)
        
               | samhw wrote:
               | Yeah, this 'blameless' ethos has definitely trickled down
               | from FAANG to decently-sized decently-reputed places I've
               | worked at - and certainly to #EngTwitter.
               | 
               | I think it's a _bit_ over-applied in some cases. Does it
               | not commit you to the theorem that every process can be
               | made so perfect as to be completely invulnerable to one
               | human being making a mistake? (At least, in the form
               | exemplified by the common tweets to the effect that
               | "your processes are to blame for $incident, not your
               | interns/engineers/etc".)
               | 
               | Even if you required two-person auth for every single
               | thing, two people will make a mistake now and then, and
               | in reality - due to our being social animals - the two
               | probabilities are not truly independent.
               | 
               | I just don't see how this is feasible in reality. A more
               | realistic principle feels like: "people _will_
               | infrequently make mistakes, and that's of course natural
               | and human and forgivable, but _far fewer_ incidents
               | should be vulnerable to human error than currently are".
        
               | assbuttbuttass wrote:
               | I of course agree that mistakes are inevitable. That
               | being said, the point of blameless culture is not to make
               | a process invulnerable to mistakes. Instead during a
               | post-mortem, we look at how to prevent _that particular
               | incident_ from happening again.
        
               | xcambar wrote:
               | > Googler here, not speaking on the behalf of the
               | company, my opinions are my own
               | 
               | Why are employees at big tech names (FAANG et al.) so
               | often so cautious as to include this as a foreword
               | everywhere? Twitter bios are full of that, for instance.
               | 
               | It is crazy to me; who would expect anything other than
               | your opinions being your own and nothing more? Who would
               | expect that your word (with all due respect) is worth
               | anything with regards to the company's PR?
               | 
               | Is there an actual risk in the US? Have there been trials
               | or anything that push people to add such statements?
        
               | fshbbdssbbgdd wrote:
               | They want to mention their employer to gain authority in
               | the discussion, but since mentioning their employer is a
               | legal/PR risk, they need to follow it up with a
               | disclaimer (this only partially mitigates the risk, but
               | it's worth it to get the brag in).
        
               | Rebelgecko wrote:
               | Part of it is because the company asks us to. Part of it
               | is because I think it's reasonable to tell people your
               | biases, and it can avoid the situation where substantive
               | conversation gets derailed by "gotchas". If I make a
               | comment about how I think Google Meet has the best noise
               | cancellation of any video chat software, even though I
               | don't work on Meet or anything adjacent to it, it's still
               | a bad look if someone can dig through my comment history
               | and pull out a previous comment about how I work for
               | Google.
        
               | b0afc375b5 wrote:
               | It's in the spirit of full disclosure, which some,
               | including me, appreciate.
        
               | secondaryacct wrote:
               | It's because they only hire idiots at Google. I'm from a
               | big company, I just don't name it and assume humans
               | understand my opinions are my own :D
        
               | LadyCailin wrote:
               | I work at a large tech company, and they do mention in
               | the onboarding materials that we represent the company,
               | so we should be careful in our social media profiles. My
               | solution to this is to not associate my social media
               | profiles with my employer. This is technically not really
               | what we're supposed to do, and I might have to change
               | that approach at some point if I move high enough in the
               | org to start getting attention from people, but this
               | works for me better than disclaimers on all my posts.
        
               | riffic wrote:
               | if they wanted to control this so bad they'd provision
               | you a managed account like how email addresses are
               | managed.
        
               | secondaryacct wrote:
               | Yet all these googlers breached it immediately in the
               | first sentence by naming their company...
        
               | xcambar wrote:
               | Sarcastically, "I'm a googler, opinions my own" reads a
               | lot like "I'm a googler, just so you know".
               | 
               | I didn't want to emphasize that in the first comment,
               | yet to be honest, I find it pedantic because it's
               | pointless, legally speaking.
        
               | ddalex wrote:
               | Can't speak for other companies, but this is covered in
               | basic training at Google - if you're not authorized to
               | speak on behalf of the company, you must make it clear
               | when your writings may be mistaken or construed to
               | represent the company.
               | 
               | Basically the company has specially trained people that
               | speak on behalf of the company, and that message should
               | not be confounded by personal opinions of other
               | employees. For example, on the recent FB outage, there
               | was an employee posting inside information on reddit -
               | media companies just took it at face value and ran
               | with it, reporting it as if it were what FB itself was
               | saying about the outage.
               | 
               | I'm not aware of any actual risks in the US, but then
               | again I'm not in the US. For me this seems a minor point,
               | and I actually enjoy separating my public persona from
               | the company for which I work, be it Google or a small
               | startup.
        
               | DnDGrognard wrote:
               | It's the same at all large companies - it's CYA
               | boilerplate.
               | 
               | Though I almost got to be the official spokesperson for
               | British Telecom responding on the alt.2600 news group
               | about the Met police VMB hack - press office was cool
               | but the internal security was not.
        
               | secondaryacct wrote:
               | Every company says that, the obvious solution is to never
               | name it when you speak. Why do these people need to say
               | "I'm a googler" and then immediately "but forget it, I
               | speak on my own"... obviously there's value in the fact
               | they're at Google and it will color their discourse which
               | is already probably forbidden.
               | 
               | Don't name your company if you intend to speak for
               | yourself.
        
               | jefftk wrote:
               | _> On the recent FB outage, there was an employee posting
               | inside information on reddit - media companies just took
               | it at face value and ran with it, reporting it as if it
               | were what FB itself was saying about the outage._
               | 
               | To be fair, I think the media would have done that even
               | with a "speaking only for myself" disclaimer.
        
               | xcambar wrote:
               | Would you say: "I live in [city], opinions are my own",
               | or "I am married to [person], opinions are my own"?
               | 
               | If not, why are you doing this for the company that pays
               | you for your skills and time?
        
               | null_object wrote:
               | > Why are employees at big tech names (FAANG et al.) so
               | often so cautious as to include this as a foreword
               | everywhere?
               | 
               | This isn't just 'big tech' - I work at a relatively small
               | tech company, but I'd never want anything I say about the
               | company to be mistaken as some sort of 'official
               | statement' especially if it related to an incident that
               | possibly had a financial impact on external parties, and
               | could conceivably be misused in that context in the
               | future.
               | 
               | I go as far as never writing private emails from my work
               | mail for the same sort of reasons - although that is
               | possibly from an overabundance of caution.
        
               | gsich wrote:
               | Especially when it's not an opinion.
        
               | samhw wrote:
               | In fairness, 'opinion' is a horrendously vague and ill-
               | defined word. It does double duty as (i) 'normative
               | value' and (ii) 'personal understanding of the
               | descriptive facts', which two senses are constantly
               | confused - for example right here.
               | 
               | That's why we constantly get "it's just my opinion" used
               | in reference to type-ii opinions (personal understanding
               | of descriptive fact), when it's only really appropriate
               | to type-i opinions (normative value).
               | 
               | Many conversations would be far clearer if it were
               | abandoned in favour of more precise language, IMPUOTDF.
        
               | not1ofU wrote:
               | "Many conversations would be far clearer if it were
               | abandoned in favour of more precise language,
               | IMPUOTDF"... um, what's IMPUOTDF? I did try to google it,
               | but only this post was found.
        
             | fragmede wrote:
             | You're totally right and the SRE book by Google goes over
             | this - the company's culture does not allow firing people
             | for outages. If you're somewhere this still happens, run
             | away (or you'd better be getting paid more than top-end ICs
             | at Google).
        
           | hdjjhhvvhga wrote:
           | It makes no sense at all. After the outage you have not only
           | a review of the causes and appropriate remedies, but also
           | more experienced people who are now more aware of possible
           | consequences of seemingly unrelated actions and will take
           | extra care not to make these things happen in the future.
           | 
           | Also, such cases are rarely the "fault" of a single person.
           | Or, the direct/immediate cause is often not the main one.
        
           | kroolik wrote:
           | Why would you fire an engineer you have just spent millions
           | to train?
        
             | goldcd wrote:
             | I'd guess because Europe-wide outages are costing more than
             | millions
        
               | kreeben wrote:
               | But now you have someone in your team who will never,
               | ever make that same mistake again and should be your new
               | go-to guy for all X related changes (X being DNS or what-
               | have-you). Firing someone with that type of experience
               | does not lead to success.
               | 
               | 100% of all devs make huge mistakes, at least once.
        
               | KronisLV wrote:
               | > But now you have someone in your team who will never,
               | ever make that same mistake again and that should be your
               | new go-to guy for all DNS related changes.
               | 
               | I'm not entirely sure that's _always_ true. For example,
               | i've seen people introduce N+1 issues into a codebase,
               | spend evenings fixing them and refactoring code to fix
               | production issues... just to later introduce those very
               | same types of issues.
               | 
               | Sure, you can learn from mistakes, have post-mortems and
               | so on (provided that your org even does those and that
               | anyone listens and cares about the conclusions from
               | those), but to me it feels like the most foolproof way is
               | to ensure that no-one can make these mistakes again, be
               | it with a checklist (which tend to be ignored, honestly),
               | or better yet, an automated CI step or a new test suite.
               | 
               | In my eyes, it's basically the same as with unit tests -
               | everyone agrees that you need them, but people rarely
               | write enough of them. So if you introduce something to
               | prevent them from not doing what they should, e.g. a
               | quality gate within a CI step which will disallow a merge
               | once the coverage falls below a set margin, suddenly
               | things are a lot better in the long run.
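
A coverage quality gate of the kind described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coverage_gate`, its parameters, and the 80% default threshold are hypothetical and not taken from any particular CI product or from the thread.

```python
def coverage_gate(covered_lines, total_lines, minimum_percent=80.0):
    """Return (passed, percent) for a hypothetical CI coverage check.

    covered_lines / total_lines would come from a coverage report;
    minimum_percent is the assumed threshold below which a merge is blocked.
    """
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    percent = 100.0 * covered_lines / total_lines
    return percent >= minimum_percent, percent


# A change that keeps coverage at 85% passes the gate ...
passed, percent = coverage_gate(850, 1000)
print(f"coverage {percent:.1f}% -> {'merge allowed' if passed else 'merge blocked'}")

# ... while one that drops it to 70% is rejected.
passed, percent = coverage_gate(700, 1000)
print(f"coverage {percent:.1f}% -> {'merge allowed' if passed else 'merge blocked'}")
```

In a real pipeline the CI step would exit non-zero when the gate fails, so the merge is blocked automatically instead of relying on a checklist.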
        
               | kreeben wrote:
               | N+1 issues aren't nearly as devastating as N^2. Commend
               | them for not putting your systems to a complete halt,
               | then teach them how to reason about this properly.
               | 
               | >> a quality gate
               | 
               | Yes, this, also.
        
               | KronisLV wrote:
               | > N+1 issues aren't nearly as devastating as N^2.
               | 
               | Depends on the project, i guess: if you're unlucky enough
               | to be working on a monolith and suddenly a page takes
               | 5'000 SQL queries to load as opposed to 100, because
               | someone thought that initializing data through service/DB
               | calls in a loop is "easier" than writing views in the DB,
               | it might still kill the entire system anyways, depending
               | on the count of users.
               | 
               | And once this data initialization is sufficiently
               | complicated and convoluted for you not to be able to
               | rewrite it and them not wanting to rewrite it, all while
               | "the business" is breathing down your neck, you might
               | either want to introduce caching (and possibly run into
               | cache invalidation problems down the road), or just
               | freshen up your CV.
               | 
               | I guess i'd also like to expand on the previous
               | suggestion and advise others to consider performance/load
               | testing as well, especially when coupled with APM
               | solutions like Skywalking or even Matomo analytics, both
               | of which can allow you to aggregate the historical page
               | load times, CPM and overall performance of your
               | applications, to figure out what went wrong when.
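
For readers unfamiliar with the N+1 pattern discussed above, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (`author`, `post`), the `CountingConnection` wrapper, and all function names are assumptions made up for illustration. The loop version issues one query for the list plus one per row, while the JOIN version issues a single query for the same result.

```python
import sqlite3


class CountingConnection:
    """Thin illustrative wrapper that counts the queries it runs."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(n_authors=100):
    """Build an in-memory database with one post per author."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(i, f"author{i}") for i in range(n_authors)],
    )
    conn.executemany(
        "INSERT INTO post (author_id, title) VALUES (?, ?)",
        [(i, f"post by {i}") for i in range(n_authors)],
    )
    return CountingConnection(conn)


def titles_n_plus_one(db):
    """N+1: one query for the authors, then one more query per author."""
    result = {}
    for author_id, name in db.query("SELECT id, name FROM author"):
        rows = db.query("SELECT title FROM post WHERE author_id = ?", (author_id,))
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(db):
    """Same data fetched with a single JOIN."""
    result = {}
    rows = db.query(
        "SELECT a.name, p.title FROM author a JOIN post p ON p.author_id = a.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


slow_db, fast_db = make_db(), make_db()
assert titles_n_plus_one(slow_db) == titles_joined(fast_db)
print("loop version queries:", slow_db.count)
print("join version queries:", fast_db.count)
```

The query count of the loop version grows linearly with the number of rows, which is why a page can go from 100 queries to thousands once the data set grows.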
        
               | kroolik wrote:
               | Still, that engineer (if it's the engineer's fault) is
               | extremely unlikely to make that mistake again. IMHO, the
               | problem is systemic: why does the system allow such
               | errors (if it's human error) to happen? Given Google's
               | scale I think a lot of the generally known scenarios are
               | covered and what you see is tens of services interacting
               | in not obvious ways. Those unobvious scenarios manifest
               | in situations like this.
        
               | cranekam wrote:
               | But firing someone doesn't undo an incident. It just
               | introduces other weird incentives. People become afraid
               | to change things for fear of breaking something, or when
               | something does break they try to hide it rather than
               | feeling like they can immediately ask for and receive
               | help to fix it.
               | 
               | The only time someone should be fired for causing an
               | outage is if they're negligent or sloppy or mess things
               | up all the time. This is rare. Almost always outages in
               | large systems are the combination of many factors --
               | latent bugs, design flaws, abnormal load, etc, any one or
               | two of which wouldn't take the site down. But when they
               | combine in a perfect storm that nobody foresaw, things
               | fall over.
        
             | [deleted]
        
             | chrisseaton wrote:
             | What tech company spends millions on training anyone?
        
               | 1ibsq wrote:
               | The other comments already explained it, but I'm
               | wondering how you haven't come across this 'saying'
               | before. It's so overused and also cheesy in my opinion.
        
               | chrisseaton wrote:
               | Lol maybe if people actually _did_ spend millions
               | training their people up front we could do better?
        
               | exikyut wrote:
               | 139,995 employees at Google * 1,000,000 =
               | $139,995,000,000
               | 
               | $140 billion. On training.
               | 
               | On the one hand... you know what, I'd _love_ to work in
               | an environment like that. Seriously.
               | 
               | On the other hand... what's the argument you make to the
               | CFO in support of this? Honest question, interested to
               | hear answers.
        
               | chrisseaton wrote:
               | I work part time in the Army. In the Army when you go
               | from their equivalent of junior to mid-level they take
               | you out of your job for _eight months_ of dedicated
               | personal development, before you start your first mid-
               | level job. When you go to their equivalent of senior they
               | take you out for a _year_.
               | 
               | I don't know how much that costs all-in, including the
               | salaries, instructors, facilities, but might be starting
               | to approach a million.
               | 
               | That's valuing training!
        
               | secondaryacct wrote:
               | But the army is a cost center! The workers have some
               | money shaved off their salary to pay for an army that
               | allows delinquents and half-disabled to pew-pew guns in
               | the forest, leaving them in peace. It's not comparable to
               | a productive enterprise that needs to build things for
               | people or perish.
               | 
               | For instance, if Google fails and can't profit, it can't
               | just shoot at its clients until they pay. Your
               | organisation can.
        
               | chrisseaton wrote:
               | > It's not comparable to a productive enterprise that
               | needs to build things for people or perish.
               | 
               | Well we had to build an Army to win against fascism in
               | the Second World War or we all would have perished.
               | 
               | And 'perished' means literally dead or subject to
               | fascism, not just going out of business.
        
               | confidantlake wrote:
               | WW2 was 70+ years ago. The USSR fell 30+ years ago.
               | Military budgets are still incredibly high. The army as
               | it is today does not need to be that efficient. And even
               | during WW2 times the US did not face a credible threat of
               | invasion. The last time the US faced an invasion on the
               | mainland was during the war of 1812.
        
               | chrisseaton wrote:
               | > The army as it is today does not need to be that
               | efficient.
               | 
               | You want the Army to be... less efficient? Spend more for
               | less capability?
        
               | confidantlake wrote:
               | No I mean that for the US army today the downside to the
               | army being inefficient is that money is wasted. Not
               | great, but not a disaster. For a different country that
               | could mean the country gets invaded or the government
               | collapses (like Afghanistan).
        
               | handrous wrote:
               | Middle and upper management get there via connections and
               | picking things up on their own. It's unsurprising that
               | they don't want to be subjected to competition with
               | lesser people who can "merely" be trained to do their job
               | as well, or better.
        
               | dolmen wrote:
               | Some jobs can be learned on the job.
               | 
               | But I'm glad that learning to kill people (military) is
               | not taught that way.
        
               | chrisseaton wrote:
               | If your job is something like a staff officer in a
               | Brigade, you _could_ learn that on the job, the Google
               | way, because exercising is also 'the job', but they
               | don't get you to do that - instead they take the time to
               | fully pull you out of all work commitments for dedicated
               | personal development. These periods of personal
               | development are about personal skills rather than combat
               | training, which you've already done by this point.
        
               | stackbutterflow wrote:
               | People are born every day. Every day tens of thousands of
               | people will hear about hacker news, the pyramids, Darth
               | Vader being someone's father, for the first time.
        
               | pfarrell wrote:
               | The GP is a reference to the anecdote about IBM's Thomas
               | Watson not firing an executive who had made an error
               | costing the company a substantial amount of money.
        
               | l33t2328 wrote:
               | The implication is that millions were spent training the
               | person who made the mistake when they cost the company
               | millions by making that mistake.
        
               | Ueland wrote:
               | An outage can be pretty expensive, but it is training for
               | those who triggered it and/or those who fix it.
        
               | pestaa wrote:
               | It's reframing lost revenue, not talking about literal
               | training cost.
        
       | yosito wrote:
       | Degoogled European here. Genuinely didn't even notice. Google can
       | fuck off.
        
         | [deleted]
        
       | peteri wrote:
       | Docs was a bit flaky earlier on.
        
       | tobyhinloopen wrote:
       | We just released a rewrite of a (web)app today, so lots of
       | support calls and mails. Really unfortunate timing, hah
        
         | rossdavidh wrote:
         | So what you're saying is, you brought down Google with your
         | rewrite. :)
        
       | tomudding wrote:
       | Probably related to the Google Cloud "service disruption" that
       | started earlier this morning:
       | https://status.cloud.google.com/incidents/1xkAB1KmLrh5g3v9ZE...
       | 
       | It was very frustrating to see DNS working perfectly fine (since
       | DNS is always the problem) and the connections just timing out.
        
       | demosito666 wrote:
       | I was also getting 500 errors in the GCP console
        
       | ddek wrote:
       | Yep, we're hit by that right now. It's not a total outage, we're
       | losing about 10-15% of our calls to bigquery from within
       | cloudfunctions. What we have using a VPC connector is ok.
       | Fortunately areas affected are ancillary (mainly monitoring,
       | ironically), our service is still running.
        
       | bob229 wrote:
       | Good, then the cancer machine is down for a while
        
       | intsunny wrote:
       | Outage reports in UTC, amazing!
       | 
       | ( _glares at AWS_ )
        
         | NaturalPhallacy wrote:
         | I want to smack whoever invented time zones.
         | 
         | _Everyone_ is bad at them. Trading desks, international trade
         | offices, space related offices all have a set of clocks on the
         | wall because even really smart people just suck at figuring out
         | time zones. I even use https://everytimezone.com/ in lieu of a
         | set of clocks.
         | 
         | What a needless complication. I wish we could all just switch
         | to UTC and stop daylight saving time.
        
           | judge2020 wrote:
           | I can't imagine anyone taking kindly to a change as big as
           | eradicating time zones. Get rid of DST, yes, but only if the
           | U.S. goes up an hour (the extra sunlight at 6pm during DST is
           | nice).
        
             | NaturalPhallacy wrote:
             | I can imagine hordes of programmers everywhere taking
             | kindly to it. It's a simpler system.
        
         | tex0 wrote:
         | Hahaha. This one made me laugh hard.
        
           | andrelaszlo wrote:
           | No it will make you laugh in about 30 minutes. Oh wait, DST.
           | Never mind.
        
       | gabkins wrote:
       | I do wonder how many companies and businesses got affected by the
       | outage.
        
       | hulitu wrote:
       | Maybe they are just upset because they got fined. It's common with
       | big companies.
        
         | southerntofu wrote:
         | Or maybe this is unrelated and it's just one of the many cloud
         | outages contradicting their uptime promises. Welcome to the
         | reality of cloud computing, where everything's foggy and the
         | hallucinogenic fumes made us believe downtime was a thing of
         | the past.
        
       ___________________________________________________________________
       (page generated 2021-11-12 23:02 UTC)