[HN Gopher] Google Outage in Europe
___________________________________________________________________
Google Outage in Europe
Author : vanburen
Score : 255 points
Date : 2021-11-12 10:12 UTC (12 hours ago)
(HTM) web link (www.google.com)
(TXT) w3m dump (www.google.com)
| natch wrote:
| EU to AI:
|
| Here's a EUR2.4 billion fine for you.
|
| AI to EU:
|
| OK, human.
| elp wrote:
| The Google IMAP service has also been unstable in South Africa
| this morning.
| rzr wrote:
| Let me crosslink to invidious tracker:
|
| https://github.com/iv-org/invidious/issues/2577
| markusgammer wrote:
| WoW
| ac50hz wrote:
| I guess the network admin who skipped the classes on Network
| Fundamentals has moved on from their recent placement at OVH, to
| Facebook, then to Google.
|
| Next stop AWS, firewall maintenance division?!
| ulzeraj wrote:
| Didn't even notice. Work doesn't use any GCloud components, and
| personally it's been a while since I degoogled myself. No
| Google docs, mail or search. Not even the quad 8 nameservers.
| rossdavidh wrote:
| Downvoted by people who are jealous, perhaps?
| yosito wrote:
| Not sure why you got downvoted. I'm also a degoogled European
| and didn't even notice.
| dredmorbius wrote:
| Also reported / dupe here:
| https://news.ycombinator.com/item?id=29197479
| everydaypanos wrote:
| reCaptcha and Translate are also down. Even search is affected;
| it is super slow.
| jacquesm wrote:
| And Meet. Starting meetings is super slow; sometimes it times
| out.
| jbverschoor wrote:
| Yeah had recaptcha issues too
| mschuster91 wrote:
| Google Forms _definitely_ has issues, too. I just had an issue
| where I couldn't fill out a form.
| synchton wrote:
| Is it me or are big provider outages getting more common these
| days?
| sva_ wrote:
| Almost like bad weather. Perhaps we'll have cloud outage
| forecasts at some point.
| yunohn wrote:
| There's always this question on every "X is down" post on HN.
|
| Quite boring commentary tbh.
| afc wrote:
| I think it's just you, probably recency bias. I recall the days
| of Gmail being globally down for hours, more than once (3
| times, IIRC), back in 2009.
| The-Bus wrote:
| I also can't think of the last time I saw Twitter's
| Failwhale, which almost became their default mascot at one
| point.
|
| (Yes, I know they discontinued it, but I haven't seen the
| service "over capacity" in some time. I say this as a casual
| web user of it, not an active account holder or app user).
| riffic wrote:
| Twitter had a global outage this year (April 16).
|
| Facebook had an October outage.
|
| This happens quite regularly with distributed and complex
| systems.
| [deleted]
| that_guy_iain wrote:
| I think they're reasonably common, it's just becoming more of a
| thing to notice.
| scrollaway wrote:
| As time passes, there are more big cloud providers, and each
| individually gets more complex (more products).
|
| Assume the chance of any single product having an outage is
| actually static. If there is one (highly reliable and trusted)
| provider with five products one year, and three providers with
| ten products each the next year, the chance of you seeing an
| outage has gone way up purely because of surface area.
| deepstack wrote:
| More reason to promote self-hosting and decentralised
| systems like SMTP, matrix.org, ActivityPub. The idea of
| having all the data and servers concentrated in a few players
| such as Google, Amazon, CCP is not reliable for the digital
| operation of the planet.
| altgoogler wrote:
| Disclaimer: am Googler.
|
| Self-hosting is a good option, especially when mixed with
| multi-cloud offerings.
|
| The big rub is that it's really hard to approach the same
| level of availability the cloud offerings already give you.
| Depending on workload, self-hosting is typically more
| expensive.
|
| This is why the SRE book talks about error budgets.
| You can't have 100% uptime, so how much do you want to pay
| to get close to it?
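|
| (The arithmetic behind that budget, as a minimal Python
| sketch:)
|
|     # Downtime allowed per 30-day month at a given
|     # availability target.
|     MINUTES_PER_MONTH = 30 * 24 * 60
|
|     for target in (0.99, 0.999, 0.9999):
|         budget = MINUTES_PER_MONTH * (1 - target)
|         print(f"{target:.2%} -> {budget:.1f} min/month of budget")
|     # 99.00% -> 432.0, 99.90% -> 43.2, 99.99% -> 4.3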
| kertoip_1 wrote:
| The question is: should we trust small, underfunded,
| hobbyist servers more than large corporations that have a
| money-driven reason to maintain high quality of services?
| riffic wrote:
| for the average end user: heck no.
|
| for those with budgets to hire server administrators (or
| pay for third-party managed services)? yes.
|
| second category includes almost anyone with a large
| following like institutional users.
|
| hell the incumbent social media service operators can
| white-label their existing software and sell this as a
| service.
| fragmede wrote:
| Yes. Like some sort of workspace? Google could make one.
| I wonder what they'd call it?
| scrollaway wrote:
| I'm sure Google wonders the same thing, given that it's
| had like five different names over the years.
| querez wrote:
| Except that Google services don't run within GCP, but apart
| from it (mostly on Borg I'd guess).
| scrollaway wrote:
| The above applies to the cloud as understood by techies just
| as much as to the b2c cloud as understood by laypeople.
| jsty wrote:
| Their internal platform is likely analogous in terms of
| accumulating complexity (and a bit of cruft) over time
| though
| nefitty wrote:
| If anyone else is curious, Borg is Google's cluster
| management software:
| https://kubernetes.io/blog/2015/04/borg-predecessor-to-
| kuber...
| riffic wrote:
| ask anybody who researches complex and distributed systems and
| see what they say.
| bla3 wrote:
| Looks like it might be resolved now.
| nudpiedo wrote:
| Seems that Google is no longer the company it used to be
| advertised as. The myth prevails, though.
|
| It's a pity as some of its services are frankly among the best,
| and I depend on them.
| southerntofu wrote:
| > Seems that google is not anymore the company it used to be
| advertised.
|
| Do you mean about the "Don't be evil" slogan it started with?
| Even back then, it was a pretty obvious move for any movie
| villain.
|
| Or do you mean you actually bought into the reliability
| promises of cloud providers thinking downtime would not exist
| anymore?
| 1cvmask wrote:
| And many people blame their internet provider, as Google
| outages are rare.
| ratww wrote:
| Now that I think of it, I might have blamed my provider or at
| least restarted my modem if Hacker News was ever down...
| rablackburn wrote:
| It's the perfect connection test.
|
| 9 times out of 10 if you can't connect to hn the problem is
| you, not them. Probably even 99/100
| paganel wrote:
| I usually check the HN website to see if my internet is
| down, both on my mobile and on my work computer. I used to
| just type "test" in the browser address bar hoping for the
| related Google SERPs but checking for HN seems way faster
| nowadays.
| garaetjjte wrote:
| HN is down pretty frequently, but I think it shows a custom
| error page served from Cloudflare.
| veltas wrote:
| People and software: when Facebook was down, my mobile thought
| the internet was unavailable.
| vanderZwan wrote:
| I don't think it's that so much as the fact that half of the
| internet services that rely on Google somewhere along the line
| (but not visibly to the users) are down, at least for
| me. And yet the Google search page can be reached (again,
| speaking from my European perspective, sample size one). To
| make it even more confusing: these issues are limited to my
| WiFi. If I use my phone connection as a hotspot the problems
| disappear.
|
| So from the perspective of a regular person it is very
| reasonable to conclude that it is their internet provider that
| is down.
| kaba0 wrote:
| And I think no internet provider has such a good reputation
| that an outage on their part would be unheard of, much less
| so than Google.
| blauditore wrote:
| ITT: "My Raspberry Pi has been up for 2 years with almost no
| downtime, what is up with these suckers?"
| remram wrote:
| "For a Linux user, you can already build a Google yourself
| quite trivially with wget -R and grep."
| Nextgrid wrote:
| It's valid criticism for projects that don't need the scale or
| features that cloud-based solutions offer.
|
| A single machine won't give you the same level of scaling and
| management features, but also won't have hundreds of
| distributed moving parts that could break and take your service
| down as a side-effect.
| Spivak wrote:
| I feel like you are selling the cloud short for SMBs. The
| majority of my career has been spent working for shops that
| have their own datacenters or colo. It's so freaking cheap to
| run a huge service it's not even funny. And it's a dream to
| have so much compute and storage floating around that you can
| "waste" it without even thinking -- it's already paid for!
|
| However, when you're small the economics flip. You want to run
| as little as possible as cheaply as possible. And a load
| balancer pointing at two ASGs in different availability
| zones is by far the cheapest way to get something production-
| ready on the internet. You can go from your garage to a $100M
| business and probably never need to graduate from this
| architecture.
| sebow wrote:
| Good. Maybe these outages will make people realize how
| dependent they are on a few monopolistic companies. Not only
| will this give rise to competitors, but hopefully it will also
| make people's information more independent.
| intunderflow wrote:
| I felt a great disturbance in the force, as if millions of pagers
| suddenly cried out in terror
| [deleted]
| iso1210 wrote:
| I do laugh at people that say "Using the cloud means less
| downtime than your own server", then things like this come along
| :D
| HatchedLake721 wrote:
| It's still less downtime than your own server and you have
| hundreds of engineers 24/7 ready to diagnose and fix issues
| tester34 wrote:
| >It's still less downtime than your own server
|
| what makes you think so, actually?
| michaelt wrote:
| In my experience, it's very easy to achieve high uptime
| through luck. If I only run a single server, and it has a
| 10% chance of failing in any given year, I have a 90%
| chance of achieving 100% uptime in a given year.
|
| In my experience it's _also_ very easy to _think_ you've
| got all your bases covered when actually you haven't. I'm
| protected against mains power failures by a UPS and a
| generator - but a UPS switchgear fault can cut my power
| even without a mains power outage. My server has dual power
| supplies and network cards - but that won't help me if a
| clumsy worker sent to replace the server above mine unplugs
| mine by mistake. And so on.
|
| If I _think_ I'm doing a better job than a billion-dollar
| corporation with hundreds of thousands of servers, does
| that mean I am? Or is it more likely I'm fooling myself?
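|
| A minimal sketch of that luck math in Python, taking the 10%
| annual failure chance above as the only assumption:
|
|     # Probability of zero failures, i.e. 100% uptime by luck.
|     p_fail_per_year = 0.10
|
|     for years in (1, 3, 5, 10):
|         p_perfect = (1 - p_fail_per_year) ** years
|         print(f"{years} year(s): {p_perfect:.0%} chance of no failure")
|     # 1: 90%, 3: 73%, 5: 59%, 10: 35%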
| ihumanable wrote:
| This so much.
|
| And an excellent corollary to this is that when you have
| lucky 100% uptime there is no incentive to optimize mean
| time to recovery.
|
| Sure the raspberry pi in your closet has been running
| fine for years, 100% uptime, but then a component fails.
| Do you have a replacement on hand? Are you continuously
| monitoring it to know it went down? The component failed
| at 3am, did it page you? Did you hop right out of bed to
| rush to fix it?
|
| Single systems can have really nice uptime until they
| don't. Then you are hoping that the people on hand can
| repair what's going on after months or years of never
| having to do that. Mean time to recovery might be a week
| while you wait for new hardware or a few hours while you
| google some error message you've never encountered.
|
| People can run their own systems if they want to, but
| they shouldn't confuse good luck with rigorous
| engineering.
| lostmsu wrote:
| Do you have some external ping test running continuously?
| If not, you really have no idea.
| iso1210 wrote:
| Of course, how else would I know the exact failure times?
| My service monitoring only polls every few minutes. That
| doesn't help me with knowing how my service is
| performing though, as I'm not offering a ping service.
|
| More importantly, I do know I had an error loading
| Twitter at Thu 11 Nov 04:17:01 GMT 2021, however my
| websites (and Google and Hacker News) were working. At
| 17:53:01 GMT Twitter took over 2 seconds to load its
| first HTTP page, far beyond the normal 400ms. Google took
| 23 seconds to load this morning, BBC News just 0.071.
|
| On the other hand I also need to provide services which
| can't cope with outages measured in milliseconds. Good
| luck with complaining to a cloud provider that your
| traffic vanished for 4 seconds. Those services thus have
| multiple connections on independent hardware and circuits
| with no single point of failure
| anthony_r wrote:
| Indeed. People do not realize just how advanced the
| reliability infrastructure of those services is. Things
| like diesel power generators have been baked into cloud
| datacenters for what, a decade now? Probably longer. Show
| me your alternative power source when the power goes out
| (and power does disappear, everywhere, eventually).
| iso1210 wrote:
| Diesel generators have been baked into my on prem
| equipment room for at least 40 years
|
| For your average person working in an average office, if
| the power goes out you're not going to be working anyway,
| so it doesn't matter if your server is offline too.
| remram wrote:
| I also have a personal dedicated server that never
| crashed or restarted this year. However I am not sure how
| much it was actually available. I know for a fact that
| there were multiple network issues at OVH. I also know
| that had my server been home, it would have been worse
| (Optimum residential is awful).
|
| The server not failing is not the only outage mode.
| iso1210 wrote:
| Really, downtime of Google so far this year: more than 0
| minutes. Downtime of my own server: 0 minutes.
|
| Of course I'm not trying to run a massively scalable service
| coping with millions of customers, because I don't need that.
| mekkkkkk wrote:
| You think that's how statistics works? Obviously it's
| possible to keep your own server running with 0 downtime.
| But you are exposed to a much higher risk of severe
| downtime, longer than the cloud provider would likely
| suffer. Be it hardware failure, grid, ISP, whatever.
| iso1210 wrote:
| Claim was
|
| "It's still less downtime than your own server"
|
| Claim disproven
| mekkkkkk wrote:
| That claim was a direct snippet from your original
| comment. So unless you meant that you laugh at people who
| are talking specifically about your personal server, I'd
| say you are being selectively literal.
| iso1210 wrote:
| Cloud kool-aid drinkers say "The cloud is always better"
|
| I say that in many cases your own server is better. My
| company runs its own on-prem Confluence; it's been taken
| down for updates on a regular basis at a known
| maintenance time. That's far better than losing it
| because a cloud-based one was hosted on GCP this morning
| when we actually needed the data.
|
| Obviously in many other cases the cloud is better. You
| wouldn't want to serve a million customers across the
| world from a single server in your own basement. That's
| not the only model.
| mekkkkkk wrote:
| I agree completely with that added nuance. Cloud infra is
| definitely overused and is sometimes seen as the only
| option these days, even for the simplest of projects.
| Having your own metal is often times much more convenient
| and cheaper. Plus it's much more enjoyable to work with.
|
| My only gripe was with your gross over-simplification
| that read like "hurr durr, my server hasn't gone down
| this year, so self-hosting has better uptime than
| Google". It's such an unnecessary and baseless argument.
| iso1210 wrote:
| The vast, vast majority of arguments you hear on HN are that
| it's impossible to run your own equipment. That's as
| ridiculous as saying you could run Dropbox off a
| raspberry pi, but tends to get pushed to the top, and the
| schadenfreude when these events inevitably come along
| is too good to miss.
|
| Every few months another major outage of another cloud
| provider hits the headlines; meanwhile millions of small
| companies have no problem with the uptime of their
| 'legacy' services.
|
| I was at a farm a few weeks ago; the farmer had a server
| in a closet. It did break on occasion, when there was a
| power outage. His desktop and internet broke too, so what
| would be the point in his server working?
|
| If it was hosted on Google it wouldn't have been working
| this morning, despite his computer being fine.
|
| If you build your business processes around accepting
| failure, it's not a problem. It's far easier to keep at
| least one out of 3 machines online for 99.999% of the
| time than to keep a single service running for the same
| time.
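|
| The arithmetic behind that, as a minimal Python sketch (the
| 99% per-machine availability is an assumed figure, and it
| treats failures as independent, which real failures often
| are not):
|
|     # Availability of "at least one of n machines up",
|     # assuming independent failures -- the optimistic case.
|     def at_least_one_up(n: int, per_machine: float) -> float:
|         return 1 - (1 - per_machine) ** n
|
|     print(at_least_one_up(1, 0.99))  # 0.99       (two nines)
|     print(at_least_one_up(3, 0.99))  # ~0.999999  (six nines)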
| remram wrote:
| To be fair, this is exactly how statistics work. The
| _average_ downtime of self-hosted servers might be higher
| (arguably) but many people are under Google's yearly
| downtime (aka _variance_ is high).
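|
| A minimal simulation of that mean-vs-variance point in Python
| (every rate and duration here is invented purely to
| illustrate the shape of the argument):
|
|     import random
|
|     # Self-hosting: most years zero downtime, rare long outages.
|     def self_hosted_hours() -> float:
|         return 48.0 if random.random() < 0.10 else 0.0
|
|     # Cloud: a little downtime every single year.
|     def cloud_hours() -> float:
|         return random.uniform(2.0, 6.0)
|
|     years = 100_000
|     sh = [self_hosted_hours() for _ in range(years)]
|     cl = [cloud_hours() for _ in range(years)]
|
|     # Self-hosting averages *more* downtime (~4.8h vs ~4h/year),
|     # yet ~90% of self-hosted years beat every cloud year.
|     print(sum(sh) / years, sum(cl) / years)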
| southerntofu wrote:
| This is so misinformed by industry propaganda. Modern cloud
| services are very often unavailable from a region without any
| apparent reason. Or the service appears to be available but
| some specific feature doesn't work. Or it's available but
| just really slow or dropping packets.
|
| When you have a "simple" (already quite complex)
| BGP(TCP(HTTP)) tunnel, chances are things just work and it's
| easy enough to diagnose issues. Between anycast, auto-
| scaling, and WAFs (among others) you have added so many
| layers of complexity that the chance of a "random" error
| somewhere in the stack has dramatically increased and
| diagnosis is close to impossible.
|
| It used to be simple: web services were either up or down.
| Now it's not so obvious anymore, and that's definitely not
| reflected in the 5 9's SLAs falsely promised by cloud
| providers. I can say with confidence that most self-hosted
| systems I've been close to have much better uptime than
| modern cloud services and are much cheaper to service and
| maintain.
| exikyut wrote:
| I agree with specific dimensions of this sentiment.
|
| Firstly, I think that the GP comment ever so slightly
| misapplies the localized awareness available in smaller
| environments to the cloud. Yes, there are tons more
| engineers to look at stuff, but those engineers are a)
| already bogged down keeping up with internal infra,
| politics and bureaucratic machinery, and b) quite some
| distance further away from actual errors occurring on the
| ground because everything's aggregated to the hilt so the
| stats remain comprehensible at the higher scale.
|
| I also agree that the newer stuff that is less mature has
| an order of magnitude more intrinsic leaky abstractions
| than old designs, which I do think were more hygienic and
| espoused more effective separation of concerns than what is
| used today.
|
| Also, not to nitpick, but the "classical" canonical
| definition would probably look closer to
| BGP(TCP(TLSv1.3(HTTP/1.1))), with the modern equivalent
| being BGP(TCP(TLS(HTTP/2))) and the future being
| BGP(UDP(QUIC(HTTP/3))). I do agree that the rapid
| consolidation of HTTP and TLS, without wide-scale awareness
| and slow, methodical development of general introspection
| tooling, does make things net worse in general. I suspect
| infra will still be using HTTP/1.1 for a long time into the
| future until this materially changes.
| i5heu wrote:
| Right?
|
| My private website on my RPi has been running for 2 years now
| without a problem, with only minimal downtime from rebooting
| for new kernels.
|
| It is amazing how much uptime you can achieve with a $5
| computer in comparison to a $1,730,000,000,000 (1.73 tera-$)
| company. Even if you compensate for dynamic content.
| exikyut wrote:
| (Google is currently processing multiple terabits of traffic
| per second using several billion dollars' worth of
| distributed infrastructure.)
| iso1210 wrote:
| Actually it's not, and that's the problem
| exikyut wrote:
| FWIW I vaguely recall some kind of website that showed
| approximate bandwidth use... it was something like a
| cross between statcounter and alexa except for network
| stats and the like.
|
| I think it claimed Google was hovering around 15Gbps.
| That sounded humongously wrong, like I'd expect global
| traffic <-> Google to at least be a couple terabits,
| right?
|
| I'd be very curious if there's a way to actually ballpark
| the number. Or maybe it is in fact possible to implement
| services that reliably track this sort of thing, and my
| futile searches earlier just weren't finding them...
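|
| A crude Fermi estimate in Python (every input is an assumed
| round number, not a measurement):
|
|     # YouTube alone, with invented but plausible-order figures.
|     concurrent_viewers = 30_000_000  # assumed
|     avg_stream_mbps = 5              # assumed per-viewer bitrate
|
|     tbps = concurrent_viewers * avg_stream_mbps / 1_000_000
|     print(f"~{tbps:.0f} Tbps")  # ~150 Tbps -- so 15Gbps is way off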
| Nextgrid wrote:
| And yet it doesn't mean his RPi-powered website needs to
| ever sustain that kind of traffic, so he doesn't have to
| take the additional risk of running a distributed system.
|
| In contrast, in the cloud you typically take on the risk of
| the underlying distributed platform (optimized for managing
| thousands of VMs, etc) even though you only need a single
| machine for yourself.
| [deleted]
| remram wrote:
| Maybe Google's servers are also running without a problem and
| you can't reach it for a different reason. Self-reported
| uptime is not a good measure of server availability.
|
| Was your home internet available all of the time? How many
| times did you reboot your modem?
| kroolik wrote:
| There were outages around the same time last year. Somebody in
| the HN thread commented back then that the employee evaluation
| and promotion window ends around December/EOY, thus more
| releases are made.
|
| https://en.m.wikipedia.org/wiki/Google_services_outages
| inglor wrote:
| I would be very interested in any research about code quality
| in relation to promotion packets. Ideally academic.
|
| I am not sure where to look for it.
| rajin444 wrote:
| Is there an objective way to measure code quality?
|
| We barely understand the systems we're writing code for - I
| don't see how you could objectively judge the code to manage
| those systems without fully understanding the systems first.
|
| You could wave around some metrics like number of bugs or test
| coverage, but I can't think of any (aforementioned included)
| that aren't subject to tons of confounding variables.
| trhway wrote:
| November 11 is a public holiday in the US, and it seems in
| recent years it has become a company holiday at many companies
| too. Maybe that break in the cadence of workdays and holidays
| established over many years plays a role here, like say
| affecting the smoothness of support transitions between
| US-EMEA-APJ.
| mikelward wrote:
| Veterans Day is not a day off for US Googlers.
| delroth wrote:
| Perf has been over for almost a month now, and the evaluation
| period was over more than two months ago.
| kroolik wrote:
| Then it might, indeed, be unrelated. Thanks for the
| clarification!
| samhw wrote:
| Could it not be knock-on effects of earlier changes? Or
| indeed post-review changes from scared/precarious engineers?
| mucle6 wrote:
| I think we are exactly midway between the end-of-year 2021
| perf and mid-year 2022 perf, so it's not clear to me that
| this could be to blame.
| skinkestek wrote:
| That explains:
|
| The people who did the stuff have either gotten their
| promotions and moved on or didn't get a promotion and
| have given up.
|
| :-)
| dspillett wrote:
| _> and the evaluation period was over more than two months
| ago_
|
| Excellent time to slack off a bit, make a few mistakes, then
| come next eval you can point to a marked improvement over the
| intervening six-to-nine months!
| roenxi wrote:
| The line of logic this thread has followed so far suggests
| there will be a reason for a Google outage every month of
| the year.
| NikxDa wrote:
| Couldn't have phrased it any better.
| shuckles wrote:
| I believe the term of art is "just-so story."
| [deleted]
| de6u99er wrote:
| Some people are definitely not getting a raise.
| jankeymeulen wrote:
| More likely someone actually will. A blameless postmortem
| will be written, and the people that will fix the bug or
| systems issue will have something to work on with high
| visibility and high impact, which tends to translate into
| good performance ratings.
|
| (Googler, opinions are my own)
| deanCommie wrote:
| And this is why nobody serious should build on GCP.
|
| Google's more interested in placating their prima donna
| engineers than solving customer problems.
| bamboozled wrote:
| There is also a lot of pressure on engineers around Black
| Friday and Cyber Monday to get things done before any change
| embargoes come into place. This is coming from someone who just
| worked 16 hours straight to scale things before a change
| freeze.
|
| I don't know how or if this would impact Google, but I'm sure
| someone there has at least thought about those dates.
| kixiQu wrote:
| Far more likely IMO to coincide with trying to get Exciting New
| Features out before release freezes for some conference or
| other.
| oblio wrote:
| He he. It's kind of hilarious. Maybe they should stagger
| reviews to ensure high availability? :-D
| kroolik wrote:
| Nothing wrong with it. People make errors, distributed
| systems aren't easy. More frequent changes - more likely a
| bug was introduced.
|
| My post is just speculation; let's wait for the actual
| technical details doc from Google.
| oblio wrote:
| Well, there are various whitepapers that reformulate this
| exact truth you've highlighted: issues tend to happen more
| when changes are made.
|
| Regarding the technical details doc, Google will never
| state that outright in individual postmortems. And they
| will definitely not draw this to the logical conclusion
| regarding the spiky yearly activity.
| UncleMeat wrote:
| It can't be logical because it is based on bad facts.
|
| We wrote evals in August. Anybody racing to get things
| launched before perf did so months ago. The timing
| described in the top post is just false.
|
| We write the next one in February. We are almost exactly
| half way between perf cycles. Your hunch that this lines
| up with the end of the year is false.
| yunohn wrote:
| > logical conclusion regarding the spiky yearly activity
|
| Why is it logical? There's tons of changes being deployed
| at all times, at all large companies.
|
| Across products, verticals, everything - hundreds of
| changes at any given point. Some of these changes can
| introduce hard-to-predict bugs in globally distributed
| systems. Most of the time, external users don't even
| notice before they're fixed.
|
| Like another commenter said, performance reviews do not
| coincide with the year-end at Google and other
| companies.
| oblio wrote:
| > Like another commenter said, performance reviews do not
| coincide with the year-end at Google and other
| companies.
|
| When are they? My bet is that they're roughly around the
| same time period, even if they don't start or close on
| December 31st.
| michaelt wrote:
| Hypothetically speaking? 11 months a year there's no
| incentive to cut corners and only do 2 weeks of testing
| on something that really needs 3. If one month a year
| rushing things out with a bit less testing is rewarded, I
| can believe some people would respond to that.
|
| Of course, I wouldn't go as far as to call this a
| "logical conclusion" as the evidence I've seen is very
| slim.
| fragmede wrote:
| Except that there are two a year at Google, so it's 10
| months of the year, 5 and 5, and Covid's made everything
| weird.
| moffkalast wrote:
| Then they get fired instead, well done ha.
| evercast wrote:
| Comments like this make me wonder if people really expect
| engineers to be fired because of an outage? I do not work at
| Google, but none of my workplaces would fire engineers
| because of a failure. Mistakes happen. As long as they are
| not repeated, everything is good.
|
| If your company fires people in situations like this, run
| away and never look back.
| ddalex wrote:
| Googler here, not speaking on the behalf of the company, my
| opinions are my own
|
| People do absolutely NOT get fired over incidents. Making
| mistakes is human. An incident will prompt a review of the
| systems and safeguards in place to prevent such an
| incident, much like an airline incident investigation -
|
| basically "somebody fat-fingered it" is never the answer,
| postmortems are always blameless
|
| EDIT: now that I think of it, the opposite thing happens
| after a major incident - a systemic failure should be
| identified, people are being hired to fix it :)
| samhw wrote:
| Yeah, this 'blameless' ethos has definitely trickled down
| from FAANG to decently-sized decently-reputed places I've
| worked at - and certainly to #EngTwitter.
|
| I think it's a _bit_ over-applied in some cases. Does it
| not commit you to the theorem that every process can be
| made so perfect as to be completely invulnerable to one
| human being making a mistake? (At least, in the form
| exemplified by the common tweets to the effect that
| "your processes are to blame for $incident, not your
| interns/engineers/etc".)
|
| Even if you required two-person auth for every single
| thing, two people will make a mistake now and then, and
| in reality - due to our being social animals - the two
| probabilities are not truly independent.
|
| I just don't see how this is feasible in reality. A more
| realistic principle feels like: "people _will_
| infrequently make mistakes, and that's of course natural
| and human and forgivable, but _far fewer_ incidents
| should be vulnerable to human error than currently are".
| assbuttbuttass wrote:
| I of course agree that mistakes are inevitable. That
| being said, the point of blameless culture is not to make
| a process invulnerable to mistakes. Instead during a
| post-mortem, we look at how to prevent _that particular
| incident_ from happening again.
| xcambar wrote:
| > Googler here, not speaking on the behalf of the
| company, my opinions are my own
|
| Why are employees at big tech names (FAANG et al.) so
| often so cautious as to include this as a foreword
| everywhere? Twitter bios are full of it, for instance.
|
| It is crazy to me; who would expect anything other than
| your opinions being your own and nothing more? Who would
| expect that your word (with all due respect) is worth
| anything with regards to the company's PR?
|
| Is there an actual risk in the US? Have there been trials
| or anything that push people to add such statements?
| fshbbdssbbgdd wrote:
| They want to mention their employer to gain authority in
| the discussion, but since mentioning their employer is a
| legal/PR risk, they need to follow it up with a
| disclaimer (this only partially mitigates the risk, but
| it's worth it to get the brag in).
| Rebelgecko wrote:
| Part of it is because the company asks us to. Part of it
| is because I think it's reasonable to tell people your
| biases, and it can avoid the situation where substantive
| conversation gets derailed by "gotchas". If I make a
| comment about how I think Google Meet has the best noise
| cancellation of any video chat software, even though I
| don't work on Meet or anything adjacent to it, it's still
| a bad look if someone can dig through my comment history
| and pull out a previous comment about how I work for
| Google.
| b0afc375b5 wrote:
| It's in the spirit of full disclosure, which some,
| including me, appreciate.
| secondaryacct wrote:
| It's because they only hire idiots at Google. I'm from a
| big company, I just don't name it and assume humans
| understand my opinions are my own :D
| LadyCailin wrote:
| I work at a large tech company, and they do mention in
| the onboarding materials that we represent the company,
| so we should be careful in our social media profiles. My
| solution to this is to not associate my social media
| profiles with my employer. This is technically not really
| what we're supposed to do, and I might have to change
| that approach at some point if I move high enough in the
| org to start getting attention from people, but this
| works for me better than disclaimers on all my posts.
| riffic wrote:
| if they wanted to control this so bad they'd provision
| you a managed account like how email addresses are
| managed.
| secondaryacct wrote:
| Yet all these googlers breached it immediately in the
| first sentence by naming their company...
| xcambar wrote:
| Sarcastically, "I'm a googler, opinions my own" reads a
| lot like "I'm a googler, just so you know".
|
| I didn't want to emphasize that in the first comment; but to
| be honest, I find it pedantic because it's pointless, legally
| speaking.
| ddalex wrote:
| Can't speak for other companies, but this is covered in
| basic training at Google - if you're not authorized to
| speak on behalf of the company, you must make it clear
| when your writings may be mistaken or constructed to
| represent the company.
|
| Basically the company has specially trained people that
| speak on behalf of the company, and that message should
| not be confounded by personal opinions of other
| employees. For example, on the recent FB outage, there
| was an employee posting inside information on reddit -
| media companies just took it at face value and ran around
| with it reporting as it was what FB itself was saying
| about the outage.
|
| I'm not aware of any actual risks in the US, but then
| again I'm not in the US. For me this seems a minor point,
| and I actually enjoy separating my public persona from
| the company for which I work, being it Google or a small
| startup.
| DnDGrognard wrote:
| It's the same at all large companies - it's CYA boilerplate.
|
| Though I almost got to be the official spokesperson for
| British Telecom responding on the alt.2600 newsgroup
| about the Met police VMB hack - the press office was cool
| but internal security was not.
| secondaryacct wrote:
| Every company says that; the obvious solution is to never
| name it when you speak. Why do these people need to say
| "I'm a googler" and then immediately "but forget it, I
| speak on my own"... obviously there's value in the fact
| they're at Google and it will color their discourse, which
| is already probably forbidden.
|
| Don't name your company if you intend to speak for
| yourself.
| jefftk wrote:
| _> On the recent FB outage, there was an employee posting
| inside information on reddit - media companies just took
| it at face value and ran around with it reporting as it
| was what FB itself was saying about the outage._
|
| To be fair, I think the media would have done that even
| with a "speaking only for myself" disclaimer.
| xcambar wrote:
| Would you say: "I live in [city], opinions are my own",
| or "I am married to [person], opinions are my own"?
|
| If no, why are you doing this for the company you're
| trading skills and time against money?
| null_object wrote:
| > Why are employees at big tech names (FAANG et al.)
| so often so cautious as to include this as a foreword
| everywhere?
|
| This isn't just 'big tech' - I work at a relatively small
| tech company, but I'd never want anything I say about the
| company to be mistaken as some sort of 'official
| statement' especially if it related to an incident that
| possibly had a financial impact on external parties, and
| could conceivably be misused in that context in the
| future.
|
| I go as far as never writing private emails from my work
| mail for the same sort of reasons - although that is possibly
| from an over-abundance of caution.
| gsich wrote:
| Especially when it's not an opinion.
| samhw wrote:
| In fairness, 'opinion' is a horrendously vague and ill-
| defined word. It does double duty as (i) 'normative
| value' and (ii) 'personal understanding of the
| descriptive facts', which two senses are constantly
| confused - for example right here.
|
| That's why we constantly get "it's just my opinion" used
| in reference to type-ii opinions (personal understanding
| of descriptive fact), when it's only really appropriate
| to type-i opinions (normative value).
|
| Many conversations would be far clearer if it were
| abandoned in favour of more precise language, IMPUOTDF.
| not1ofU wrote:
| "Many conversations would be far clearer if it were
| abandoned in favour of more precise language,
| IMPUOTDF"... um whats IMPUOTDF? I did try to google it,
| but only this post was found.
| fragmede wrote:
| You're totally right, and the SRE book by Google goes over
| this - the company's culture does not allow firing people
| for outages. If you're somewhere this still happens, run
| away (or you'd better be getting paid more than top-end ICs
| at Google).
| hdjjhhvvhga wrote:
| It makes no sense at all. After the outage you have not only
| a review of the causes and appropriate remedies, but also
| more experienced people who are now more aware of possible
| consequences of seemingly unrelated actions and will take
| extra care not to make these things happen in the future.
|
| Also, such cases are rarely the "fault" of a single person.
| Or, the direct/immediate cause is often not the main one.
| kroolik wrote:
| Why would you fire an engineer you have just spent millions
| to train?
| goldcd wrote:
| I'd guess because Europe-wide outages are costing more than
| millions
| kreeben wrote:
| But now you have someone in your team who will never,
| ever make that same mistake again and should be your new
| go-to guy for all X related changes (X being DNS or what-
| have-you). Firing someone with that type of experience
| does not lead to success.
|
| 100% of all devs make huge mistakes, at least once.
| KronisLV wrote:
| > But now you have someone in your team who will never,
| ever make that same mistake again and that should be your
| new go-to guy for all DNS related changes.
|
| I'm not entirely sure that's _always_ true. For example,
| I've seen people introduce N+1 issues into a codebase,
| spend evenings fixing them and refactoring code to fix
| production issues... just to later introduce those very
| same types of issues.
|
| Sure, you can learn from mistakes, have post-mortems and
| so on (provided that your org even does those and that
| anyone listens and cares about the conclusions from
| those), but to me it feels like the most foolproof way is
| to ensure that no-one can make these mistakes again, be
| it with a checklist (which tend to be ignored, honestly),
| or better yet, an automated CI step or a new test suite.
|
| In my eyes, it's basically the same as with unit tests -
| everyone agrees that you need them, but people rarely
| write enough of them. So if you introduce something to
| prevent them from not doing what they should, e.g. a
| quality gate within a CI step which will disallow a merge
| once the coverage falls below a set margin, suddenly
| things are a lot better in the long run.
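|
| To make the N+1 shape concrete, a minimal sketch (Python with
| sqlite3; the schema is hypothetical, purely for illustration):
|
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     conn.executescript("""
|         CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
|         CREATE TABLE orders (id INTEGER PRIMARY KEY,
|                              user_id INTEGER, total REAL);
|     """)
|
|     # N+1: one query for the users, then one more per user.
|     users = conn.execute("SELECT id, name FROM users").fetchall()
|     for user_id, _name in users:
|         conn.execute("SELECT total FROM orders WHERE user_id = ?",
|                      (user_id,))
|
|     # Batched: the same data in a single round trip.
|     conn.execute("""
|         SELECT u.id, u.name, o.total
|         FROM users u LEFT JOIN orders o ON o.user_id = u.id
|     """)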
| kreeben wrote:
| N+1 issues aren't nearly as devastating as N^2. Commend
| them for not bringing your systems to a complete halt,
| then teach them how to reason about this properly.
|
| >> a quality gate
|
| Yes, this, also.
| KronisLV wrote:
| > N+1 issues aren't nearly as devastating as N^2.
|
| Depends on the project, I guess: if you're unlucky enough
| to be working on a monolith and suddenly a page takes
| 5'000 SQL queries to load as opposed to 100, because
| someone thought that initializing data through service/DB
| calls in a loop is "easier" than writing views in the DB,
| it might still kill the entire system anyways, depending
| on the count of users.
|
| And once this data initialization is sufficiently
| complicated and convoluted for you not to be able to
| rewrite it and them not wanting to rewrite it, all while
| "the business" is breathing down on your neck, you might
| either want to introduce caching (and possibly run into
| cache invalidation problems down the road), or just
| freshen up your CV.
|
| I guess I'd also like to expand on the previous
| suggestion and advise others to consider performance/load
| testing as well, especially when coupled with APM
| solutions like Skywalking or even Matomo analytics, both
| of which can allow you to aggregate the historical page
| load times, CPM and overall performance of your
| applications, to figure out what went wrong when.
| kroolik wrote:
| Still, that engineer (if it's the engineer's fault) is
| extremely unlikely to make that mistake again. IMHO, the
| problem is systemic: why does the system allow such
| errors (if it's human error) to happen? Given Google's
| scale I think a lot of the generally known scenarios are
| covered, and what you see is tens of services interacting
| in non-obvious ways. Those non-obvious scenarios manifest
| in situations like this.
| cranekam wrote:
| But firing someone doesn't undo an incident. It just
| introduces other weird incentives. People become afraid
| to change things for fear of breaking something, or when
| something does break they try to hide it rather than
| feeling like they can immediately ask for and receive
| help to fix it.
|
| The only time someone should be fired for causing an
| outage is if they're negligent or sloppy or mess things
| up all the time. This is rare. Almost always outages in
| large systems are the combination of many factors --
| latent bugs, design flaws, abnormal load, etc, any one or
| two of which wouldn't take the site down. But when they
| combine in a perfect storm that nobody foresaw, things
| fall over.
| [deleted]
| chrisseaton wrote:
| What tech company spends millions on training anyone?
| 1ibsq wrote:
| The other comments already explained it, but I'm
| wondering how you haven't come across this 'saying'
| before. It's so overused and also cheesy in my opinion.
| chrisseaton wrote:
| Lol maybe if people actually _did_ spend millions
| training their people up front we could do better?
| exikyut wrote:
| 139,995 employees at Google * 1,000,000 =
| $139,995,000,000
|
| $140 billion dollars. On training.
|
| On the one hand... you know what, I'd _love_ to work in
| an environment like that. Seriously.
|
| On the other hand... what's the argument you make to the
| CFO in support of this? Honest question, interested to
| hear answers.
| chrisseaton wrote:
| I work part time in the Army. In the Army when you go
| from their equivalent of junior to mid-level they take
| you out of your job for _eight months_ of dedicated
| personal development, before you start your first mid-
| level job. When you go to their equivalent of senior they
| take you out for a _year_.
|
| I don't know how much that costs all-in, including the
| salaries, instructors, facilities, but might be starting
| to approach a million.
|
| That's valuing training!
| secondaryacct wrote:
| But the army is a cost center! The workers have some
| money shaved off their salary to pay for an army that
| allows delinquents and the half-disabled to pew-pew guns in
| the forest, leaving them in peace. It's not comparable to
| a productive enterprise that needs to build things for
| people or perish.
|
| For instance, if Google fails and can't profit, it can't
| just shoot at their clients until they pay. Your
| organisation can.
| chrisseaton wrote:
| > It's not comparable to a productive entreprise that
| needs to build things for people or perish.
|
| Well we had to build an Army to win against fascism in
| the Second World War or we all would have perished.
|
| And 'perished' means literally dead or subject to
| fascism, not just going out of business.
| confidantlake wrote:
| WW2 was 70+ years ago. The USSR fell 30+ years ago.
| Military budgets are still incredibly high. The army as
| it is today does not need to be that efficient. And even
| during WW2 times the US did not face a credible threat of
| invasion. The last time the US faced an invasion on the
| mainland was during the War of 1812.
| chrisseaton wrote:
| > The army as it is today does not need to be that
| efficient.
|
| You want the Army to be... less efficient? Spend more for
| less capability?
| confidantlake wrote:
| No I mean that for the US army today the downside to the
| army being inefficient is that money is wasted. Not
| great, but not a disaster. For a different country that
| could mean the country gets invaded or the government
| collapses (like Afghanistan).
| handrous wrote:
| Middle and upper management get there via connections and
| picking things up on their own. It's unsurprising that
| they don't want to be subjected to competition with
| lesser people who can "merely" be trained to do their job
| as well, or better.
| dolmen wrote:
| Some jobs can be learned on the job.
|
| But I'm glad that learning to kill people (military) is
| not taught that way.
| chrisseaton wrote:
| If your job is something like a staff officer in a
| Brigade, you _could_ learn that on the job, the Google
| way, because exercising is also 'the job', but they
| don't get you to do that - instead they take the time to
| fully pull you out of all work commitments for dedicated
| personal development. These periods of personal
| development are about personal skills rather than combat
| training, which you've already done by this point.
| stackbutterflow wrote:
| People are born every day. Every day tens of thousands of
| people will hear about hacker news, the pyramids, Darth
| Vader being someone's father, for the first time.
| pfarrell wrote:
| The GP is a reference to the anecdote about IBM's Thomas
| Watson not firing an executive who had made an error
| costing the company a substantial amount of money.
| l33t2328 wrote:
| The implication is that millions were spent training the
| person who made the mistake when they cost the company
| millions by making that mistake.
| Ueland wrote:
| An outage can be pretty expensive, but it is training for
| those who triggered it and/or those who fix it.
| pestaa wrote:
| It's reframing lost revenue, not talking about literal
| training cost.
| yosito wrote:
| Degoogled European here. Genuinely didn't even notice. Google can
| fuck off.
| [deleted]
| peteri wrote:
| Docs was a bit flaky earlier on.
| tobyhinloopen wrote:
| We just released a rewrite of a (web)app today, so lots of
| support calls and emails. Really unfortunate timing, hah.
| rossdavidh wrote:
| So what you're saying is, you brought down Google with your
| rewrite. :)
| tomudding wrote:
| Probably related to the Google Cloud "service disruption" that
| started earlier this morning:
| https://status.cloud.google.com/incidents/1xkAB1KmLrh5g3v9ZE...
|
| It was very frustrating to see DNS working perfectly fine (since
| DNS is always the problem) and the connections just timing out.
| demosito666 wrote:
| I was also getting 500 errors in the GCP console.
| ddek wrote:
| Yep, we're hit by that right now. It's not a total outage, we're
| losing about 10-15% of our calls to bigquery from within
| cloudfunctions. What we have using a VPC connector is ok.
| Fortunately the areas affected are ancillary (mainly
| monitoring, ironically); our service is still running.
| bob229 wrote:
| Good, then the cancer machine is down for a while.
| intsunny wrote:
| Outage reports in UTC, amazing!
|
| ( _glares at AWS_ )
| NaturalPhallacy wrote:
| I want to smack whoever invented time zones.
|
| _Everyone_ is bad at them. Trading desks, international trade
| offices, space-related offices all have a set of clocks on the
| wall because even really smart people just suck at figuring out
| time zones. I even use https://everytimezone.com/ in lieu of a
| set of clocks.
|
| What a needless complication. I wish we could all just switch
| to UTC and stop daylight saving time.
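|
| A minimal illustration of the bookkeeping in Python (the
| meeting time is an arbitrary example; requires Python 3.9+
| for zoneinfo):
|
|     from datetime import datetime
|     from zoneinfo import ZoneInfo
|
|     # One unambiguous UTC instant...
|     meeting = datetime(2021, 11, 12, 15, 0,
|                        tzinfo=ZoneInfo("UTC"))
|
|     # ...is a different wall-clock time in every office, and
|     # shifts again whenever DST kicks in or out.
|     for tz in ("America/New_York", "Europe/Berlin", "Asia/Tokyo"):
|         print(tz, meeting.astimezone(ZoneInfo(tz)).strftime("%H:%M"))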
| judge2020 wrote:
| I can't imagine anyone taking kindly to a change as big as
| eradicating time zones. Get rid of DST, yes, but only if the
| U.S. goes up an hour (the extra sunlight at 6pm during DST is
| nice).
| NaturalPhallacy wrote:
| I can imagine hordes of programmers everywhere taking
| kindly to it. It's a simpler system.
| tex0 wrote:
| Hahaha. This one made me laugh hard.
| andrelaszlo wrote:
| No it will make you laugh in about 30 minutes. Oh wait, DST.
| Never mind.
| gabkins wrote:
| I do wonder how many companies and businesses got affected by the
| outage.
| hulitu wrote:
| Maybe they are just upset because they got fined. It's common
| with big companies.
| southerntofu wrote:
| Or maybe this is unrelated and it's just one of the many cloud
| outages contradicting their uptime promises. Welcome to the
| reality of cloud computing, where everything's foggy and the
| hallucinogenic fumes made us believe downtime was a thing of
| the past.
___________________________________________________________________
(page generated 2021-11-12 23:02 UTC)