[HN Gopher] Google's infamous internal 2010 "I just want to serv...
       ___________________________________________________________________
        
       Google's infamous internal 2010 "I just want to serve 5TB" video
       now public
        
       Author : raldi
       Score  : 708 points
       Date   : 2021-11-02 14:44 UTC (8 hours ago)
        
 (HTM) web link (www.youtube.com)
 (TXT) w3m dump (www.youtube.com)
        
       | sergiotapia wrote:
       | Much more infamous.
       | 
       | "Mongo DB Is Web Scale" -
       | https://www.youtube.com/watch?v=b2F-DItXtZs
        
         | keymone wrote:
         | s/in//
        
           | shirleyquirk wrote:
           | famous ternal?
        
             | tusharsadhwani wrote:
             | that would be s/in//g :)
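
The difference between the two substitutions joked about above can be sketched in Python, assuming the expression is applied to the phrase "infamous internal" from the title. Here `re.sub` with `count=1` stands in for sed's `s/in//` and the unlimited form stands in for `s/in//g`; the variable names are illustrative only.

```python
import re

title = "infamous internal"

# Like sed's s/in// : remove only the first match.
first_only = re.sub("in", "", title, count=1)

# Like sed's s/in//g : remove every match.
every_match = re.sub("in", "", title)

print(first_only)   # famous internal
print(every_match)  # famous ternal
```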
        
       | jboggan wrote:
       | Eagerly awaiting the "GoogFellas" video leak.
        
       | loxias wrote:
       | And now, in 2021, Google has inflicted their "clarity" on the
       | rest of the world. I miss jobs from the 2000s, the jobs where you
       | were paid to write software for a living.
       | 
       | You know, engineering! Given a task, or set of requirements,
       | develop software on your computer, software which eventually runs
       | on the customer's computer, where it's used to solve the
       | customer's problem.
       | 
       | My most recent full time employment a year ago was at a _great_
       | company. Healthy culture, some of the most talented coworkers
       | I've ever had the pleasure to work alongside.
       | 
       | Over the year I lasted there, I used for the first time: Docker,
       | Golang, Kubernetes, Terraform, Gitlab, Saltstack, Prometheus,
       | (and probably other middleware that my brain has GCd to free
       | space). I was barely able to get _anything_ done. At least, it
       | always felt that way.
       | 
       | Maybe I'm just an idiot, I don't know. I'd accept it if true!
       | What I do know is that I used to be able to _build_ things for
       | people, be compensated well for it, and get _satisfaction_ from a
       | customer liking what I've built. It was simple.
       | 
       | In this brave new world, with containers, pods and this and that
       | and the other thing, where it can take months before one even
       | understands enough _primitives_ to do a "hello world".... how
       | can anything ever get done?? How can anything inventive,
       | creative, or experimental emerge from our industry when the
       | develop/test/improve cycle has gone from minutes to weeks or
       | months?
       | 
       | I don't know what the future looks like, but the present strikes
       | me as unsustainable in the long run.
       | 
       | (edit: Wow! I expected this to be downvoted to oblivion, not my
       | highest rated comment on the site...)
       | 
       | <tiny>(Forgive this shameless self promotion: if, dear reader-
       | who-is-a-hiring-manager, you have a paid role for a lowly but
       | experienced systems engineer who doesn't know anything about
       | "web" or "apps" or "social" but is quite adroit with C/C++ (and a
       | few others), most "sciencey/mathy" type problems, signal
       | processing, firmware, network protocols, automation/scripting,
       | and more, ... email is welcome!)</>
        
         | mwcampbell wrote:
         | > software which eventually runs on the customer's computer
         | 
         | I gather then that you're not a fan of SaaS. True, one can
         | cynically explain the rise of SaaS as rent seeking. But there's
         | undeniably value in selling whatever functionality your
         | software provides without burdening the customer with having to
         | run it on their own computer(s). And when we do that, it's our
         | responsibility to make the service reliable, which is what a
         | lot of these tools are trying to do.
        
           | loxias wrote:
           | > I gather then that you're not a fan of SaaS.
           | 
           | I'm neutral, I think? "I don't quite see the point of it"
           | would be more accurate. I don't think I've used any SaaS in my
           | personal life (other than streaming services. Which I'd
           | prefer as a local app anyway, and I still do, for music, but
           | not video)
           | 
           | I'm sure it's a matter of opinion, not something with an
           | objective answer, but "burden of running software on their
           | own computers" genuinely confused me as I read your comment,
           | I thought "burden? what burden?".
           | 
           | As a user:
           | 
           | If software is designed properly (and most isn't...) you
           | download it once, and it runs. Is the burden the time it
           | takes to do the download? Compared to the noticeable burden
           | of using a webapp, with problems like crappy and frustrating
           | responsiveness, an inability to work without an internet
           | connection, and frequent inability to handle tasks of real
           | complexity, I'd choose a local program any day.
           | 
           | As an employee:
           | 
           | Heck yes SaaS! $/month >>> $$$/customer :D Of _course_ it's
           | rent seeking, and I take (and give) no shame in that.
        
         | [deleted]
        
         | thrashh wrote:
         | Perhaps we need more specialization but I remember the time
         | before these kinds of tools and I hated it.
         | 
         | I'm a lazy person and I absolutely love tools. Tools like Docker
         | helped me never have to solve other people's complex
         | environment problems again. I love metric reporting tools like
         | Prometheus because it helps me front problems before they
         | become weekend emergencies. I use a paid Git GUI so I can fix
         | complex Git problems without ever making a mistake.
        
           | loxias wrote:
           | I'm also a lazy person! Which is why tools like this are a
           | PITA to me.
           | 
           | The one exception is Docker. It's not a regular part of my
           | workflow, because of how it makes things both harder to get
           | started (making a working Dockerfile takes a bit of time),
           | harder to debug, and slower to build (I just changed one
           | line! Now I have to rebuild the whole image to see if it
           | fixed the problem... &c).
           | 
           | However, for _deployment_ of the final product? I agree
           | Docker's GREAT. But, consider, in that respect it offers
           | nothing I didn't have at the start of my career 20 years ago.
           | Static linking for interpreted languages. :)
        
           | pm90 wrote:
           | Same. I do not have any nostalgia for when you had to ssh
           | into machines and run scripts. Please no.
        
         | mathteddybear wrote:
         | Reminder that the 2000s jobs where you were paid to write
         | software for a living also included J2EE and CORBA projects.
        
         | VHRanger wrote:
         | In my team, we often deploy internal "services" as cronjobs on
         | an EC2 service. This hasn't run into any issues in 24 months.
         | 
         | One of these we decided to move to a more serious
         | infrastructure (a set of AWS lambdas). It's failed three times
         | in 6 months since, and we're moving it back to be a good old
         | cronjob on a server.
         | 
         | Simple is good.
        
           | cheeze wrote:
           | What does the cronjob do? Start some service that listens for
           | inbound connections? Or are you talking more about daemons
           | that do some set of work every interval?
        
           | SamuelAdams wrote:
           | Just curious what your cost differences are between a
           | dedicated EC2 instance and lambdas. For our organization an
           | EC2 instance was at best 8-10 times more expensive than
           | lambdas.
        
             | azmodeus wrote:
             | The question is also what's the cost of troubleshooting the
             | lambda service going down 3 times. 10x more expensive and
             | reliable can be a good trade.
        
             | x0x0 wrote:
             | likely dwarfed by eng time, both in dollars and opportunity
             | cost
        
             | callmeal wrote:
             | AWS Lightsail instances are pretty affordable ($10/mo and
             | up)
        
               | SanchoPanda wrote:
               | $3.5 USD and up
        
         | rvnx wrote:
         | I tend to agree with the green dude :|
         | 
         | It's normal to have a production service replicated on 2
         | availability regions.
         | 
         | The green guy is annoying, because reality is annoying, and
         | reliability is not about luck, but is about a properly
         | calibrated and tested process.
         | 
         | Yes, you need to write monitoring, you cannot run only with
         | "hope".
         | 
         | Yes it sucks that a DC can go down. Your particular service is
         | not important if it's down, but having a copy of the production
         | data is essential in case of a catastrophe.
         | 
         | Except for the tests that are probably unnecessary, everything
         | else seems to make sense.
         | 
         | The peer bonuses are an issue though.
        
           | ridaj wrote:
           | I think the challenge is you'd expect a company like Google
           | to have more of the setup be automated. If replication and
           | monitoring are such universally good ideas, then why don't
           | they come out of the box?
        
           | midasuni wrote:
           | Depends what the problem you're trying to solve is. In my
           | experience the vast majority of business problems do not need
           | that kind of reliability, and if they do they don't need it
           | deployed in such a Byzantine way.
        
             | zaphar wrote:
             | You say that but then the system goes down and the CTO is
             | walking up to your desk asking why it's down and exactly
             | when can they expect it to be up and don't you know we are
             | bleeding money right now?
             | 
             | What you call byzantine an SRE calls necessary complexity
             | to meet the needs of your business.
        
           | novok wrote:
           | Green guy should be making all of that a one-click process
           | to start up a service shell that does all of that for you,
           | though. Then as you write it up an automatic linting &
           | rules engine will highlight what is missing before you make a
           | final pull request to get the necessary human approvals,
           | ONCE.
        
           | loxias wrote:
           | All of this is true, but I'd wager there's 10, at MOST,
           | entities on the planet that are large enough to warrant this
           | level of ... "architectural overkill". The other 99.99% of us
           | don't need it.
           | 
           | I CERTAINLY don't debate that Google, or Amazon, or Facebook,
           | or Netflix, or the phone system, or anything else that
           | touches a noticeable percentage of the human race needs
           | architecture like this to provide "5 9s".
           | 
           | But, just like when "big data" became a buzzword, and many
           | people thought their problem needed "big data" approaches to
           | solve, the thought that all but a small minority of entities
           | need this is Simply Wrong.
           | 
           | I am reminded of a client doing something with genomics about
           | 9 years ago. They had some over-complicated "new tool"
           | infested approach to solve their "big data" problem, but the
           | run times were taking too long. I was brought in as a
           | consultant to improve it. After I was done, a data processing
           | run that used to take hours (causing employees to run them
           | overnight) took minutes, or seconds. What did I do? I got rid of
           | all the complexity. I replaced their expensive cluster with
           | one studly provisioned machine. I replaced their collection
           | of networked Java microservices with 1 non-networked
           | multithreaded C program. I replaced their XML based format
           | for data at rest with something I whipped up, tuned to what
           | they actually needed.
           | 
           | Once their "we need big data!" >10TB data set could fit in a
           | single machine's memory, the rest was easy. What used to
           | "require" a cluster of machines and overnight processing
           | could be done interactively, and quickly enough for the
           | scientists to get into a much more productive "flow", doing
           | dozens of runs per day.
           | 
           | tl;dr: unless you're google (or google scale) you don't need
           | all this crap. :)
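
As a rough illustration of what the "5 9s" mentioned above implies, the arithmetic can be sketched as below. This assumes a 365-day year and treats availability as a simple fraction of wall-clock time; the function name is hypothetical and not taken from the thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10.0 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in (2, 3, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.1f} min/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, while two nines leaves several days.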
        
           | gopher_space wrote:
           | A lot of it feels like premature optimization. Like I'm
           | laying down a heavy infrastructure to support change but it's
           | already locking me into certain ways of looking at problems.
        
             | Jensson wrote:
             | It isn't premature though. A service is only as robust as its
             | weakest link, so if you let people write crappy services
             | that easily go down and are hard to get back when they do,
             | then you will get a huge number of outages in major
             | services since they depend on so many small ones.
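
The "depends on so many small ones" argument above can be made concrete with a small sketch. It assumes, purely for illustration, that a major service is up only when every dependency is up and that dependencies fail independently; the numbers and the function name are hypothetical.

```python
from math import prod


def composite_availability(dependency_availabilities):
    """Availability of a service that needs every dependency to be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(dependency_availabilities)


# Thirty small services, each individually at "three nines".
thirty_small = [0.999] * 30
combined = composite_availability(thirty_small)
print(f"{combined:.4f}")
```

Under these assumptions, thirty dependencies at 99.9% each leave the combined service at roughly 97%, which is the sense in which small unreliable services drag down a large one.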
        
             | loxias wrote:
             | Great observation. Perhaps the term "premature
             | infrastructure architecture optimization".
        
         | ethbr0 wrote:
         | The current landscape (optimizing for hyperscale, at the cost
         | of complexity) seems like a natural extension of relatively few
         | giant corporations funding the majority of programmers. To such
         | an organization, efficiency & time to market are more important
         | than simplicity.
        
           | [deleted]
        
           | iamstupidsimple wrote:
           | But at least in the FAANG example, time to market is much
           | slower because of said complexity.
        
             | thiagocsf wrote:
             | I believe it's now called MANGA.
        
               | tester756 wrote:
               | not MAGMA?
        
           | [deleted]
        
       | dpryden wrote:
       | Non-Googler: What do all those words mean?
       | 
       | Noogler: Haha, this video is so funny!
       | 
       | L4 SWE: (Crying because the video is so true)
       | 
       | L5 SWE: Haha, this video is so funny! I should show it to my
       | interns, this will be a good training for them.
       | 
       | L6+ SWE: Why do people think this is funny? This Broccoli Man guy
       | makes some really good points...
        
         | [deleted]
        
         | throw1234651234 wrote:
         | Non-Googler: What do all those words mean?
         | 
         | Exactly. This wasn't too relatable, even though I have the GCP
         | Certified Architect cert.
        
           | dpryden wrote:
           | I can't tell if this comment is implying that my comment is
           | unclear, or if you're agreeing with the first line of my
           | comment.
           | 
           | In either case, though, it's an inside joke precisely because
           | it's more relatable to those who are (or were) inside. In
           | particular, I think it would be most funny to someone who was
           | at Google about a decade ago; when I left Google in 2017
           | things had already changed enough that this didn't ring quite
           | as true for new hires.
           | 
           | That said, GCP is not very representative of what the
           | internal platform looked like circa 2010. (Or even of what
           | the internal platform looks like now, as far as I know.)
        
             | [deleted]
        
           | NikolaeVarius wrote:
           | Why would internal tooling mean anything to you? And why
           | would GCP knowledge be useful in any way?
           | 
           | It's fairly simple to extract the gist of what these systems
           | do from the script.
        
           | jjtheblunt wrote:
           | As an ex Apple person, i'd say it means there's way too much
           | hierarchy at Google? not sure i'm reading it right though
        
             | flatiron wrote:
             | we still had our processes though. Radar was my least
             | favorite, but they replaced the ant eater app with one that
             | was at least partially usable right before i left.
        
               | jjtheblunt wrote:
               | we'd say, about spoken mad scientist style requests, if
               | it's not in radar, it never existed. :)
        
             | q3k wrote:
             | IMO/IME it's the clash between tooling, systems and
             | processes designed for running long-term highly scalable
             | and reliable services maintained by teams in multiple
             | geographical locations and used by billions of people; and
             | greenfield projects that just want to get things done at an
             | early stage.
             | 
             | Requiring multi-cluster/region, the quota/resource economy
             | system, handling PCRs, code review, readability approval
             | for complex configuration languages (and the existence of
             | such complex languages in the first place) ... all of that
             | makes sense in a vacuum and all were built to handle real
             | problems and are likely written in the blood of a near-miss
             | outage. But it also all comes crashing down on you when
             | you're doing things from scratch for a relatively simple
             | usecase that no-one really designed for.
        
         | vanderZwan wrote:
         | Not sure if "SWE" stands for software engineer, or "Sweden" as
         | in Stockholm Syndrome
        
           | [deleted]
        
           | frakkingcylons wrote:
           | Oh definitely Sweden.
        
           | keville wrote:
           | Why not both? :sob:
        
           | praptak wrote:
           | Random synapse activation:
           | 
           | A few years ago there was a Swedish tourist at a hotel where
           | I was on vacation. He had a blue-yellow hat with "SWE"
           | written on it in Courier font. I felt an urge to steal his
           | hat because it looked better than most of the Google-branded
           | swag I got as a Google SWE :)
        
         | belter wrote:
         | L7+ SWE
         | 
         | My life is a waste but the money is too good...
        
           | kubb wrote:
           | This applies to every level, particularly the lower levels.
        
             | ikiris wrote:
             | it ain't much, but it's honest crying into piles of money
        
           | tandr wrote:
           | How good?
        
             | riknos314 wrote:
             | https://www.levels.fyi/company/Google/salaries/Software-
             | Engi...
        
         | cperciva wrote:
         | Where does "these are really good points, but why don't we have
         | tooling which sets everything up automatically?" fit on the
         | scale?
        
           | nuerow wrote:
           | > _Where does "these are really good points, but why don't we
           | have tooling which sets everything up automatically?" fit on
           | the scale?_
           | 
             | My guess is it fits nowhere because the L5s don't have the
           | ability to automate it, and the L6s think it's trivial and as
           | it's done sparingly then it doesn't justify the work to do
           | things differently.
           | 
           | And this is why we can't have nice things.
        
             | azornathogron wrote:
             | And yet it's been a decade since this video and practically
             | everything it mentions is a non-problem now.
             | 
             | No one is spinning up new borgmon instances. Spanner is
             | replicated by default. Only very low level services need to
             | care about PCRs. If you use one of the approved frameworks
             | it will set up practically all the production configuration
             | for you. Basic alerting for your service is automated, just
             | turn it on, picking cells to run in is automated, scaling
             | your service is automated, etc.
             | 
             | Actually getting quota remains a problem... :-p
             | 
             | Anyway I would argue we can and do have nice things, and
             | that has happened precisely through the efforts of a huge
             | number of people at all levels.
             | 
             | Edit to add: of course, there are always new problems to
             | complain about! It's the march of progress after all.
        
               | compiler-guy wrote:
               | Yes. If someone were to make this video today, it
               | wouldn't be about production jobs and PCRs, it would be
               | about privacy reviews and branding approvals.
               | 
               | But the quota issues haven't changed a bit.
        
             | ikiris wrote:
             | More like you aren't going to get promoted for automating
             | someone else's toil. Also, now who's going to support it,
             | better deprecate it since the library changed / got
             | deprecated / it's tuesday.
        
               | Jensson wrote:
               | > More like you aren't going to get promoted for
               | automating someone else's toil.
               | 
               | Lots of people were promoted for automating these things.
               | They built easy to use services, got extra headcount
               | since they became important and climbed the ranks. So not
               | sure why you'd think that.
               | 
               | It may be different at other companies, but at Google
               | building stuff that many other engineers depend on is a
               | major way to get promoted. Of course if you automate
               | something and nobody uses your automation tooling then
               | you won't get promoted, but if your work gets used by
               | basically every new engineer you'll climb the ranks
               | quickly.
        
           | SilasX wrote:
           | Yeah, that was my reaction. I get the need for all this
           | reliability/failover, but it's a _horrible_ failure of
           | abstraction/separation of concerns.
           | 
           | There's no reason the serving team should have to learn how
           | to do all of those things on the checklist, since it can be
           | done by anyone who's already learned the infra. You're
           | expecting them to learn all kinds of stuff outside of their
           | specialty, when they should be able to kick the app over the
           | wall and let infra ensure that the app is deployed in two
           | separate PCR zones with the failover plan etc, which should
           | itself be mostly automated.
        
             | q3k wrote:
             | > when they should be able to kick the app over the wall
             | and let infra ensure that the app is deployed in two
             | separate PCR zones with the failover plan etc, which should
             | itself be mostly automated
             | 
             | Not entirely - the developers should actively participate
             | in designing the actual failover scenario and making sure
             | the application can handle that (anything from being okay
             | with some downtime due to the failover happening to
             | designing an actual multi-region multi-master application).
             | Making assumptions like 'infra will handle it' is a great
             | way to not only get unexpected outages (because the
             | developers assumed there would be no downtime because
             | failover is magic, or that writes will never be lost) but
             | to also introduce tensions between teams (because you now
             | have an outside team having to wrangle an application into
             | reliability when the original authors don't give a crap
             | about it).
             | 
             | I get and agree with your point, the tooling and processes
             | should definitely be simplified/automated when possible,
             | and developers deserve a working platform that just works.
             | The whole point of a platform team is to abstract away the
             | mundane to let people do their job. But reliability is
               | everyone's job, not just the infra team's, and developers
             | must understand the tradeoffs and technology involved in
             | order to not design broken systems.
        
               | SilasX wrote:
               | If that's the point:
               | 
               | A) It's doing a horrible job conveying it. A dev _does_
               | need to be concerned on how to handle failover, but only
               | at a certain abstraction level. They should be required
               | to specify something in the form "given server A fails
               | and has to pass to B, what do you do?" That does _not_
               | require you to know the terminology about PCRs and how to
               | make decisions about which cells (or whatever) to pick on
               | deployment, or avoiding the "gotcha" about making sure
               | the two servers are in different PCR zones.
               | 
               | At that point, it's just following a checklist that needs
               | no knowledge of the specifics of the app, and, to the
               | extent that it's accurately representing how Google was,
               | _is_ indicative of bad processes.
               | 
               | B) Many things _should_ be infra's job, as they're
               | cleanly orthogonal to what devs are doing. For example,
               | how to apply a security patch to a DB. That's unrelated
               | to the operation of the app.
               | 
               | I do get your point though, and I wouldn't say something
               | like this about e.g. testing (which was the short,
               | "reasonable" part of the video!) -- the devs have
               | intimate knowledge of what counts as passing and failing
               | and should be writing tests, and not 100% passing it over
               | to QA. But that's precisely _because_ such concerns are
               | deeply tied in to the thing they _are_ concerned with.
               | "SQL 3.4.1 vs 3.4.2" is not.
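
The "gotcha" described above, making sure two servers do not land in the same zone, is the kind of rule that can be checked mechanically rather than left to a checklist. A minimal sketch follows, assuming a deployment is described as a mapping from replica name to zone name. All names here are hypothetical; this is not a representation of any Google-internal tool.

```python
from collections import Counter


def zones_with_multiple_replicas(placement):
    """Return zones that host more than one replica.

    `placement` maps a replica name to its zone name (both hypothetical).
    An empty result means no two replicas share a zone.
    """
    counts = Counter(placement.values())
    return sorted(zone for zone, n in counts.items() if n > 1)


good = {"server-a": "zone-1", "server-b": "zone-2"}
bad = {"server-a": "zone-1", "server-b": "zone-1"}

print(zones_with_multiple_replicas(good))  # []
print(zones_with_multiple_replicas(bad))   # ['zone-1']
```

A check like this could run before deployment, so the app team only states how many replicas it wants and never has to pick zones by hand.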
        
               | q3k wrote:
               | Yeah, it seems like we agree :).
        
             | lumost wrote:
             | Mega-Caps suffer from the following problem:
             | 
             | 1. There are more engineers making more divergent
             | architectural solutions such that there is never a single
             | place where you can make changes across the group.
             | 
             | 2. Failures keep happening, so process is instituted with
             | many checkboxes for engineers to work through.
             | 
             | 3. Engineers on the small scale stuff get stack ranked
             | against the engineers on the big scale stuff. Everyone
             | needs to show that they can do the work and are "fungible".
             | This leads to small internal systems having the same
             | operational standard as large public facing systems.
        
               | SilasX wrote:
               | I don't see what that's replying to. Nothing in that list
               | would justify demanding that the app's team have
               | knowledge or preference about which PCR zones to pick and
               | which will just have to be corrected when they inevitably
               | pick the wrong one.
        
               | lumost wrote:
               | The point is that every team gets to set their own
               | failure modes. I know of multiple tier-1 services which
               | diverge from at least one best practice.
               | 
               | Think of the scenario where a cloud provider needs to
               | evacuate an az. There is no API which would allow the
               | compute team to force migrate tens of thousands of apps
               | and guarantee that they both are not affected and
               | maintain their redundancy guarantees.
               | 
               | Internal services at google are in the same boat. However
               | google knows about the hard edges and forces everyone to
               | deal with all of that complexity - there is no api which
               | the serving team could plug into which will avoid this
               | overhead.
        
             | dustingetz wrote:
             | Because you have to get it working before you can make it
             | better. Abstraction is quite secondary
        
               | SilasX wrote:
               | Yes but the video is in the context of a mega-scale mega-
               | corp that _should_ have been able to set up clean
               | abstraction boundaries by now.
        
               | [deleted]
        
               | Jensson wrote:
               | They already have done that, this video is 11 years old,
               | at that point Google was half the age it is now and a
               | fraction the size.
        
           | omreaderhn wrote:
           | That would be 'Xoogler' because Google's engineering and
           | broader corporate culture does not reward work like that and
           | so when you realize that, you leave.
           | 
           | In general, Googlers have very little idea how far behind the
           | rest of the industry they are when it comes to tooling.
           | 
           | I am a Xoogler.
        
             | mwcampbell wrote:
             | I got the impression, based on a blog post by Eric Lawrence
             | [1], that Google's developer tooling was top-notch (except
             | for devs working on open-source projects like Chromium).
             | Did it get worse since 2017, or are you talking about a
             | different kind of tooling?
             | 
             | [1] https://textslashplain.com/2017/02/01/google-chrome-
             | one-year...
        
               | throwawayfgg wrote:
               | Google's developer tooling is top-notch and amazing and
               | constantly improving.
        
           | raldi wrote:
           | Or: These are really good points for a visibly-user-facing
           | post-alpha service, but isn't it a bit overengineered for an
           | experimental internal service whose clients can tolerate the
           | risk of occasional downtime?
        
             | nostrademons wrote:
             | L5 Xoogler who left for a startup.
        
           | nostrademons wrote:
           | L9+
        
       | nojvek wrote:
       | The Borgmon readability approvers make me chuckle.
       | 
       | At Stripe, there were language approvers. Only those blessed
       | could approve PRs. Even XML had a set of approvers. I had a fun
       | time getting hold of an XML approver.
        
       | sunyc wrote:
       | I actually have Borgmon readability! Peer bonus pls.
        
       | dang wrote:
       | Recent and related:
       | 
       |  _I don't know how to count that low_ -
       | https://news.ycombinator.com/item?id=28988281 - Oct 2021 (259
       | comments)
       | 
       | especially this comment:
       | https://news.ycombinator.com/item?id=29032656
        
       | mlindner wrote:
       | Wow I saw this somewhere a long time ago. But I don't remember
       | where and in what context.
        
       | Zababa wrote:
       | There's almost some kind of irony in uploading that to YouTube, a
       | feeling of "why can't I deploy my service as easily as people can
       | upload videos to YouTube?".
        
       | silentsea90 wrote:
       | Actively working towards being a Xoogler so I don't have to live
       | in this dystopia.
        
       | devnull3 wrote:
       | At 2:05 the green dude asks if you think your users are scum and
       | do you hate them.
       | 
       | The funny thing is Google as an org ends up hating their users
       | "accidentally" anyway because of their history of pulling the rug
       | out from under services/APIs etc.
        
         | nunez wrote:
         | even more ironic given that google+ came out four years later
        
         | munk-a wrote:
         | If the users had properly set up a PCR notification about the
         | change and registered it to a bigdata instance then they would
         | never be surprised about service discontinuations. The moral of
         | the story is that you can't fix stupid users. /s
        
       | bluefox wrote:
       | BitTorrent has existed since 2001. Get with the times.
        
       | martini333 wrote:
       | Google: The Sunk Cost Fallacy
        
       | GauntletWizard wrote:
       | Holy crap! I've been asking for this for forever[1]. Thank you to
       | the leaker!
       | 
       | [1] https://news.ycombinator.com/item?id=21786729
        
         | benley wrote:
         | You're welcome <3
         | 
         | Here's hoping Google doesn't get mad about it - though after
         | 12(ish) years there's really nothing secret in that video.
        
       | Spivak wrote:
       | I'm so confused, isn't this just like basic highly available
       | infrastructure mixed with a toxic SRE culture?
       | 
       | I want to serve 5TB!
       | 
       | Okay grab two instances in different patching zones, create a
       | bucket in our replicated RADOS storage that can hold your data or
       | create a table/db in our Postgres cluster, write your app with
       | tests, add an entry to the load balancer, add an entry in our
       | big ole distributed job scheduler if you need cron, and submit a
       | PR against the infra repo to add Prometheus metrics and alerts.
       | 
       | And when you're done with that, set up CI/CD because you shouldn't
       | assume that instances are reliable and if you don't give us the
       | code to do a deploy we can't recreate your app when the VM goes
       | belly up and we'll have to page you.
       | 
       | Are people not used to what it really takes to "just run some
       | code?"
        
         | svachalek wrote:
         | It totally makes sense for Gmail, but at Google "serve 5TB"
         | means something like sort your manager's inbox, something that
         | someone somewhere has an interest in doing, or trying, but of
         | no real consequence for failure.
        
         | gliese1337 wrote:
         | I am used to it, but
         | 
         | 1. It is rare for the details of how to actually accomplish
         | each of those steps to be both documented and the documentation
         | made accessible.
         | 
         | 2. If you can describe it that succinctly, it really ought to
         | be automated. If it can't be automated... then you left
         | something out of your instructions, which goes back to point
         | (1).
        
           | Spivak wrote:
           | Like the steps to do all of this are automated, but we can't
           | read your mind. All of this basically boils down to submitting
           | a PR against some repo that says "there shall be two
           | instances in these regions, there shall be a database in this
           | cluster, there shall be a bucket with this name, etc etc"
           | that the SRE team reviews and merges, which triggers an infra
           | deploy.
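
(Editor's sketch, not part of the comment above.) The "there shall be two
instances in these regions" PR can be pictured as a small declarative spec
plus the kind of check a reviewer or CI job might run before merging. This
is a minimal sketch under stated assumptions: `ServiceSpec`, `validate`,
the field names, and the zone/cluster/bucket values are all hypothetical
and do not describe any real Google or third-party tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical declarative description of what a service needs."""
    name: str
    instance_zones: Tuple[str, ...]  # one entry per requested instance
    database_cluster: str            # existing cluster to create a DB in
    bucket: str                      # name of the storage bucket to create


def validate(spec: ServiceSpec) -> List[str]:
    """Return a list of review problems; an empty list means 'mergeable'."""
    problems = []
    if len(spec.instance_zones) < 2:
        problems.append("need at least two instances")
    if len(set(spec.instance_zones)) != len(spec.instance_zones):
        problems.append("instances must be in different zones")
    if not spec.bucket:
        problems.append("bucket name is required")
    return problems


# Hypothetical example values, for illustration only.
good = ServiceSpec("serve-5tb", ("zone-a", "zone-b"), "pg-main", "serve-5tb-data")
bad = ServiceSpec("serve-5tb", ("zone-a", "zone-a"), "pg-main", "serve-5tb-data")

print(validate(good))
print(validate(bad))
```

In this picture the reviewers only read and merge the spec; the actual
provisioning is whatever automation the merge triggers, which the sketch
deliberately leaves out.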
        
         | q3k wrote:
         | People with HA production experience can easily vibe with
         | points made by Broccoli Man. Yes, these things make a lot of
         | sense if you actually want to get code running reliably,
         | especially at scale (organizational and userbase).
         | 
         | But we must not forget how this can look from the point of view
         | of someone who hasn't had to deal with a page due to an entire
         | datacenter going offline, who's not aware of all the hundreds
         | of small things that can go wrong by doing the 'obvious' thing.
         | I think the video is more of a way to poke fun at the optics of
         | this (and some of the overly arcane stuff involved), rather
         | than at the idea of high availability being useless. At least
         | that's how I've always felt about it, a way to remind SREs to
         | respect their internal users (simplify! automate! document!)
         | and that what makes sense to them might look ridiculous to
         | others.
        
       | cromwellian wrote:
       | As a Googler, it's often easier for me to set up a GCP consumer
       | account, AWS, or Heroku account to demo something, compared to
       | using anything internal. I remember the most annoying situation
       | was like 10 years ago when me and other engineers ported Quake 2
       | to run in Chrome, we were in a time crunch to demo it
       | multiplayer, and I ended up setting up an AWS account to serve
       | it. But then I left it running and forgot and ended up getting a
       | few hundred dollars billed to me because the Quake2 server was
       | chewing CPU.
        
         | bamboozled wrote:
         | I could imagine you're violating some pretty strict policies
         | doing this?
         | 
         | You're taking proprietary code and running it on a competitor's
         | platform?
         | 
         | I like to think I'm pretty open minded about stuff like this,
         | and I've actually done something similar, but I'd be surprised
         | if you didn't get your ass handed to you for that type of
         | thing?
        
           | cromwellian wrote:
           | No, no proprietary code was used, the port was done from the
           | Open source Java clone Jake2:
           | https://en.wikipedia.org/wiki/Jake2
           | 
           | We ported it by using Google Web Toolkit Java->JS compiler,
           | and replaced OpenGL with WebGL, and all of the other bits
           | with Web APIs (websocket, pointer-events, fullscreen-api,
           | filesystem api, etc)
           | 
           | The assets (proprietary artwork, levels, etc) were not hosted
           | on AWS, it simply downloaded the EXE file from ID servers and
           | extracted it in the browser.
        
             | bamboozled wrote:
             | You said this:
             | 
             | > As a _Googler_, it's often easier for me to set up a GCP
             | consumer account, AWS, or Heroku account to demo something,
             | compared to using anything internal.
             | 
             | I get you're trying to make a point of saying you can do
             | something easier elsewhere, but then why even throw in
             | the "as a Googler" bit without clarifying that you're not
             | really working on anything of consequence where you'd be
             | actually asked to host things internally?
             | 
             | You're basically hosting open source projects on AWS.
        
               | breakfastduck wrote:
               | Looks like someone is desperate to get into an argument
        
           | cromwellian wrote:
           | For those interested: The original project
           | https://code.google.com/archive/p/quake2-gwt-port/
           | 
           | GitHub (Stefan Haustein is my genius teammate who did all of
           | the heavy lifting on the OpenGL -> WebGL piece)
           | https://github.com/stefanhaustein/quake2-playn-port
           | 
           | You can still play it here, on AppEngine
           | http://quake2playn.appspot.com/
        
           | mabbo wrote:
           | Why? Do you think there's a risk of Amazon stealing code from
           | a customer?
           | 
           | No matter what code they took, the cost would never be worth
           | it for them.
        
             | bamboozled wrote:
             | If I took my company's code and hosted it anywhere except
             | where I was authorized to do so, I'd expect flak for it.
             | I'm less worried about Amazon stealing it, but it seems
             | like a silly place to put it nevertheless.
        
           | BoorishBears wrote:
           | This is what separates companies that get things done from
           | companies that pay a lot of people to hopefully maybe get
           | things done.
           | 
           | A server for a multiplayer Quake port...
           | 
           | Who is Google paying to hand their ass to them over that?
           | 
           | Who has both the authority to hand their ass to them, and a
           | lack of discretion to not let it end at "well in general we
           | don't do that, but I see why you did it and there's little to
           | no risk"
           | 
           | -
           | 
           | At some companies yes, someone is paid to go "I caught
           | someone putting our proprietary code up on a competitor's
           | platform!!!!" and no one will actually think critically about
           | what exactly was proprietary, so someone putting up a Quake
           | demo might as well have put up the Coke secret formula.
           | 
           | And now OP who actually generated some value at little to no
           | risk gets their ass handed to them and someone who simply
           | lacked the skills to realize the risk profile gets a notch on
           | their "I add value and earn my paycheck" badge.
        
           | potatoman22 wrote:
           | Can't let the Quake 2 source code escape Google
        
       | didip wrote:
       | What is "Borgmon readability" and why was it important? I think
       | that's one of the punch lines of the video.
        
         | yeputons wrote:
         | If you change source files in language X, someone proficient in
         | language X (aka "has X readability") should approve that it
         | corresponds to Google Code Style in language X.
         | 
         | You start without it and may obtain it once you've written a bunch
         | of code in language X.
         | 
         | I'm not sure if there is really a Borgmon readability. But if
         | there is, it seems like Borgmon configuration files are both
         | common (so that there is a readability requirement) and
         | uncommon (so that there are very few people with readability).
        
       | nunez wrote:
       | still hilarious
        
       | birken wrote:
       | Hey... those of us that worked on Google's internal Bigtable
       | service worked very hard so you _didn't_ have to file a ticket
       | to set up replication between your Bigtable cells.
       | 
       | The rest does seem about accurate though.
        
       | gcampos wrote:
       | What exactly are these "peer bonuses"? Is it real? Is it what I
       | think it is? Do people actually use them as bargain chips?
        
         | B-Con wrote:
         | You can send a small, semi-official "thanks for a job well
         | done" to someone else and it comes with a few bucks attached.
         | People joke about using them nefariously (as people tend to
         | joke), but I've only seen them used appropriately.
        
         | advisedwang wrote:
         | Yes they are real, it's in the low hundreds of bucks range.
         | They must be approved by the recipient's manager. The number an
         | employee can send is also limited (but the limit is fairly
         | high). There is also "kudos" which comes with no money, but has
         | no limits or approvals required.
         | 
         | They are intended to be used for going above and beyond, not
         | for stuff that falls within the scope of one's job. Using them
         | as a bargaining chip is explicitly against policy.
        
         | nunez wrote:
         | you get some money ($150/bonus, IIRC) for helping someone out,
         | assuming manager approval
         | 
         | akin to the usual corporate "thank you" gift card, but more
         | money and generally easier to distribute
        
         | q3k wrote:
         | > What exactly are these "peer bonuses"? Is it real? Is it what
         | I think it is?
         | 
         | Each month, you can nominate another employee for a small
         | bonus. This is designed to be given to coworkers who have gone
         | above and beyond what was expected from them.
         | 
         | > Do people actually use them as bargain chips?
         | 
         | From my experience it's so over-the-top absurd that it would be
         | difficult to have someone interpret such an offer as anything
         | other than a joke or a meta-joke.
         | 
         | https://blog.bonus.ly/a-look-at-googles-peer-to-peer-bonus-s...
        
           | tazjin wrote:
           | It's not each month. You can send a lot of them. There's a
           | theoretical limit and a bunch of restrictions but in practice
           | they're unenforced.
        
             | guyzero wrote:
             | Each one has to be manually approved by the recipient's
             | manager, so this can't happen. It's a joke.
        
         | compiler-guy wrote:
         | People don't use them as bargaining chips most of the time--it
         | is explicitly against policy. I'm sure it happens sometimes.
         | 
         | What they do do is send one when someone else does something
         | nice (like fix a bug from a project they have left or whatever
         | else). If you ever need something similar again, the person you
         | peer bonused has warm fuzzies about the experience and a hint
         | that they might get it again.
         | 
         | People also use peer bonuses during perf time to demonstrate
         | that the work they are doing impacts other people enough for a
         | somewhat uncommon thank you.
        
       | dekhn wrote:
       | Thank you, whoever did this! I asked for it in a comment
       | recently.
       | 
       | This video basically is making fun of a common situation of
       | Google at the time, where a person wants to serve up some data
       | for analytics, but the sysadmins expect the person to follow a
       | process intended for much more complex and high availability
       | services run by teams of skilled engineers.
       | 
       | It parodies SRE as a BOFH sysadmin, even though in general SRE
       | are quite easygoing and helpful.
       | 
       | It helped poke fun at a number of overly stuffy processes and
       | also helped push people to make hosting modest datasets (like
       | this 5TB one) easier.
        
         | smartician wrote:
         | It's not much different today. Nowadays you'll also need
         | privacy review, accessibility review, security review, and
         | diversity & inclusion review.
        
           | ntaylor wrote:
           | _diversity & inclusion review_
        
             | cynicalkane wrote:
             | Assuming this is sarcasm, you realize Google has a massive
             | userbase all over the globe from all walks of life, right?
             | Does it make business sense to accidentally exclude certain
             | people? Or ethical sense?
        
               | killerstorm wrote:
               | Businesses exclude people all the time. E.g. many videos
               | are geoblocked, and there's no way to view or purchase
               | them in some countries.
               | 
               | Here are some other examples: I can use free version of
               | Google Colab from Ukraine, but I can't pay for Pro
               | version. (I can pay for Google Cloud, though.)
               | 
               | OpenAI blocks API dashboard access to IP addresses from
               | Ukraine. (But it is OK if I use VPN LOL.)
               | 
               | So it seems blocking ppl is the norm. I guess "diversity
               | and inclusion" is mostly about social topics within US,
               | not about not excluding people.
        
               | londons_explore wrote:
               | In general it's about not _accidentally_ excluding
               | people. All the cases you propose are deliberate blocks
               | for various (mostly legal) reasons. The deliberate blocks
               | are considered in the review, and as long as there is a
               | sound business case for launching with the exclusion, it
               | goes ahead.
        
               | dustintrex wrote:
               | You're running into US sanctions issues (Crimea), not
               | woke Google policy.
        
               | bbarnett wrote:
               | Nothing is all inclusive. Nothing.
        
               | bufferoverflow wrote:
               | Death and taxes.
        
               | oriki wrote:
               | Is your argument here supposed to be "Nothing is all
               | inclusive, therefore we shouldn't even bother trying"? If
               | so, I'd argue that's a lot more ridiculous than a review
               | process designed to help catch major inclusivity issues
               | before they become problems.
        
               | mikepurvis wrote:
               | Sure, but that's not a reason to not even ask the
               | question. Maybe not every DI initiative turns out to be
               | helpful or productive, but as someone who's privileged on
               | pretty much every axis there is, I'd be grateful for the
               | kind of internal support system that could give me an
               | early warning sign for "hey, this design decision that
               | made sense to you and your team has the potential to
               | alienate user base X and there's a real possibility that
               | if we launch in this state it's going to explode into a
               | minor Twitter scandal."
        
               | brailsafe wrote:
               | Isn't this just called user testing? Also this is in the
               | context of a fucking dataset. If data needs to go through
               | DI in case something blows up on Twitter, I guess it's
               | a sad state we're in.
        
               | davidcbc wrote:
               | If, for example, the dataset only contains white faces
               | and is intended to train facial recognition then yes, it
               | needs to go through some kind of DI review.
        
               | brailsafe wrote:
               | Wouldn't this review be done on the data collection and
               | planning side, rather than at point of publishing though?
               | Surely you can publish datasets of just white faces or
               | just black faces if during planning that's what you
               | intended to do for some reason?
        
               | davidcbc wrote:
               | I mean, maybe, but you still might need it to be
               | reviewed. You don't have to wait until you're about to
               | launch to start these kinds of reviews and if you know
               | that some kind of DI review is necessary for your project
               | you should start talking to the reviewers as early as
               | possible, especially if you are making a potentially
               | controversial design choice.
        
               | Volundr wrote:
               | Does it? Seems to me data is a prime place for exclusion
               | to occur. Example: a dataset of tagged photos for
               | training a neural net to analyze facial expressions. All
               | the photos are of white faces.
        
               | pilsetnieks wrote:
               | Perfect is the enemy of good.
        
               | teawrecks wrote:
               | Science is always wrong. Always.
        
               | xmprt wrote:
               | I agree with you but it sometimes seems like Google
               | doesn't care at all about it when they have the kind of
               | customer support processes that they have.
        
               | kevingadd wrote:
               | Customer support is after the fact, reviews are before
               | the fact. It's very cheap to do these reviews before
               | launch and then you can point at those to say "we're
               | trying!" while not providing any customer support.
        
               | protomyth wrote:
               | Google can talk when they stop using a license by a
               | domain squatting org who revised their history and has a
               | pretty offensive line on their front page. _COMMUNITY-LED
               | DEVELOPMENT "THE APACHE WAY"_ indeed. Worse, most of the
               | links on Google search point to the org and not the
               | actual tribes.
        
               | brailsafe wrote:
               | Does it make sense to serve a dataset without approval
               | that it's inclusive enough? Yes, because that's typically
               | how things in the world work.
        
               | fyd6gexygsydy wrote:
               | I don't understand this line of reasoning since it
               | assumes inclusion training actually promotes inclusion.
               | My experience has been that it usually means
               | racial/gender intersectionalism training that everyone
               | gets to swallow regardless of culture or belief because
               | it's what white people in the us tech industry are
               | passionate about right now.
        
               | tester756 wrote:
               | >accidentally exclude certain people?
               | 
               | e.g how? could you provide some examples e.g two?
               | 
               | there's a lot of talk about this stuff when it comes to
               | MAGMA, yet docs still use some auto-generated
               | translations which suck.
        
               | davidcbc wrote:
               | https://sitn.hms.harvard.edu/flash/2020/racial-
               | discriminatio...
               | 
               | https://futurism.com/delphi-ai-ethics-racist
               | 
               | https://www.nytimes.com/2019/04/25/lens/sarah-lewis-
               | racial-b...
        
               | tester756 wrote:
               | It seems like these kinds of problems occur mostly within
               | some specific areas, meanwhile OP seems to suggest that
               | this kind of review should be applied for everything.
        
               | kukx wrote:
               | By the same logic we can justify any [social issue]
               | division. The sad thing is that the rules are arbitrary
               | and do not help in solving the issue. Actually it is in
               | the interest of the division to create or exaggerate
               | problems to justify its existence.
        
             | pangolinplayer wrote:
             | Based
        
               | sayhar wrote:
               | Hello, I wasn't aware we were on /r/politicalcompassmemes
        
             | ranger_danger wrote:
             | have to make sure there's no trans jokes in there.
        
           | throw10920 wrote:
           | > diversity & inclusion review
           | 
           | Is this tongue-in-cheek, or are you serious? Poe's law and
           | all that.
        
             | smartician wrote:
             | Partly tongue-in-cheek. These review processes exist, but
             | whether they're required or not depends on the product area
             | and type of project.
        
             | kevingadd wrote:
             | If you're publishing a dataset in the terabytes it does
             | actually make sense to at least do a pass over it and make
             | sure the data you're using isn't skewed in any undesirable
             | way that would cause problems down the road. For example,
             | if you're releasing 5tb of face photos for training facial
             | recognition nets, it would certainly be a problem if all
             | the faces are white women or asian men - the result would
             | probably be over-fit and not perform as well for people in
             | other categories. It would be correct to call that a
             | diversity/inclusion issue.
             | 
             | Privacy and accessibility reviews serve similar purposes
             | there, you're reducing risk by checking for these various
             | problems and ideally they also spot ways to improve the
             | quality of your outcomes.
        
               | murph-almighty wrote:
               | It's common in fintech for data/ML models to go through
               | similar overview. If you happen to disenfranchise a set
               | of people because your model said not to lend to them,
               | you risk legal jeopardy.
               | 
               | To clarify, I think it's good that this is a practice.
        
               | londons_explore wrote:
               | A review doesn't necessarily mean you need to resolve all
               | diversity/inclusion issues. It can merely require that
               | you _identify_ the issues and understand the risks of not
               | resolving them.
        
               | dekhn wrote:
               | the 5tb was performance data collected from servers
        
               | kevingadd wrote:
               | Sounds like the reviewer would glance at it for 5 seconds
               | and say 'ok'
        
             | rodgerd wrote:
             | Perhaps Google don't want to be in the news for identifying
             | dark-skinned people as monkeys again?
        
               | jjeaff wrote:
               | I can't remember which company it was that launched a
               | camera with face identification features, but that didn't
               | recognize any face that wasn't lily-white like every
               | single engineer that worked at that company. They could
               | have probably benefited from a diversity and inclusion
               | review. Heck, employing a single brown engineer or even
               | QA engineer probably would have been enough to notice
               | that before launch.
        
           | dekhn wrote:
           | having launched some product at Google in my day, I know
           | quite well how to skate through that process (although D&I
           | was not part of it when I filled out my forms). Sadly for my
           | friends in privacy and security, it's not hard for product
           | teams to exploit Google's propensity to launch and override
           | privacy and security concerns.
        
       | Wonnk13 wrote:
       | One of the few things I miss about my time there...
       | 
       | Never did get Java readability :(
        
       | jazzyjackson wrote:
       | what am I looking at here
       | 
       | EDIT: it has been explained to me:
       | https://rachelbythebay.com/w/2021/10/30/5tb/
        
       | opinion-is-bad wrote:
       | The multiple repetitions of "This is Google" hit home for me. I
       | never worked as a software engineer so much of the rest is out of
       | scope to my experience, but the constant idolization of Google,
       | and by proxy each other for working at such a place, eventually
       | changed from feeling coy to cultish.
        
       | xiphias2 wrote:
       | I wish I would have had this video before 2010. I got paged at
       | night every time there was a PCR failover, and I didn't know what
       | to do with it. This video is better than all the extensive
       | documentation that we had.
        
         | [deleted]
        
       | cletus wrote:
       | Ah, this takes me back (disclaimer: Xoogler, 2010-2017). It's
       | painful and funny because it's true. Or was true.
       | 
       | Rumour had it that the Borgmon readability requirement was
       | removed when Sergey saw this video. I don't know if this is true
       | but that's what I heard.
        
         | DaiPlusPlus wrote:
         | Pray tell, what is/was Borgmon?
        
           | twinge wrote:
           | A system for alerting based on time-series data, with its own
           | rule language. The language (along with many others) required
           | that authors demonstrate they can adhere to the style guide
           | by going through a process to obtain readability.
           | 
           | https://sre.google/sre-book/practical-alerting/
        
           | [deleted]
        
           | sleepydog wrote:
           | It's a language and supporting infrastructure for collecting
           | and querying time series data for monitoring.
           | 
           | It was replaced a long time ago by a new system called
           | monarch, but a few holdouts will probably continue using
           | borgmon until the heat death of the universe.
        
           | dekhn wrote:
           | prometheus 0.1
        
             | jensensbutton wrote:
             | This is the correct answer.
        
         | leg wrote:
         | It is true that Borgmon readability went away due to this
         | video. It wasn't Sergey, it was an eng director.
        
           | ikiris wrote:
           | "no one has borgmon readability". years later and i still die
           | laughing.
        
       | sbpayne wrote:
       | I'm so glad I can see this again. I forgot how much I missed
       | this.
        
       | metanonsense wrote:
       | Move fast and break things! And while you are at it, please,
       | don't break anything.
        
       | mseepgood wrote:
       | This monotonous speech synthesis is annoying to listen to. The
       | delivery of the jokes is awful. Who can sit through a 3 min video
       | like that?
        
         | drannex wrote:
         | That's what makes this even more funny.
        
         | pas wrote:
         | https://m.youtube.com/watch?v=b2F-DItXtZs
        
         | zaphar wrote:
         | I think it mostly works best when you've lived it. Which I did.
         | And the resurfacing of that video brought back a lot of
         | memories.
        
         | zucked wrote:
         | This was an output of a free (now defunct) service that used to
         | accept transcripts and pump out these videos with TTS audio. It
         | led to some hilarious results, usually within niche
         | communities. Around 2010 these things were everywhere.
        
         | kgin wrote:
         | Somehow it makes it funnier to me
        
       | nunez wrote:
       | this is making me miss memegen
       | 
       | google had its downs, but wasting hours on memegen was not one of
       | them
        
       | slac wrote:
       | I have the t-shirt!
        
       | jamestimmins wrote:
       | As an external user who has found Google's services to be
       | incomprehensible, it's nice to know it is (was) equally as
       | painful internally.
        
       | frakkingcylons wrote:
       | For anyone else who'd rather read than watch this video, here's
       | the transcript (from YT's auto-generated captions):
       | https://pastebin.com/8UrFftM6
        
       | throwaway20371 wrote:
       | These kinds of organizational problems happen everywhere; that
       | doesn't bug me. What bugs me is when leadership knows about it
       | and doesn't care. After low-level engineers stick their
       | professional neck out to complain in internal town halls and
       | through feedback forms, and leadership gives some bullshit answer
       | that doesn't address or even acknowledge the problem. It would be
       | less infuriating if they just said "I don't give a shit." It's
       | the weasel words and pretending the problem doesn't exist that
       | infuriates me. A lot of the time it doesn't even take much work
       | at all to begin addressing the issue, like a working group for
       | continuous improvement of highly-painful high-value processes.
       | You don't even have to solve it. Just _attempt_ to address it.
        
         | TideAd wrote:
         | My team has issues deploying builds to test machines. It's like
         | 15 steps and takes an hour. The tooling is atrocious and
         | recently got even worse.
         | 
         | We eventually found the team responsible for this (the org
         | structure is hard to penetrate because no one answers emails).
         | They said they had no idea anyone was dissatisfied. Then they
         | said that it was a low priority so they didn't care and nothing
         | would be done.
         | 
         | In my experience, you can usually convince an engineer that
         | their stuff has a problem and they need to fix it. But it's
         | often impossible to convince management if they aren't on the
         | hook for user satisfaction.
        
         | ts4z wrote:
         | To be fair, they did, and many things have improved. And this
         | video was used as an uncomfortable reminder to make some of
         | those changes.
        
         | calmlynarczyk wrote:
         | I work at a global corporation with 50,000 employees. Even
         | though I've never been at Google I felt every pain point this
         | video was getting at because our company is trying to implement
         | all of this stuff right now.
         | 
         | "Oh you want to go to production? Here's a list from A-XX
         | stating what you need to accomplish that." Thing is I thought
         | they actually handled this gracefully when I started because
         | lots of requirements were tiered with various criteria you had
         | to meet to move up (mostly for brownie points).
         | 
         | But then one day the Tech Execs lose their minds and decide
         | "everything needs to meet all criteria for every single
         | process." You want to create an S3 bucket to store data? That
         | will be a week of submitting paperwork and another month of
         | meetings and approvals from various teams you've never heard
         | of. Plus you have to register your schema, implement data
         | quality checks, unit tests, regression tests, get a PR and CO
         | approved for your central config change, remediate any CVEs in
         | the tooling that you used, and build all of this using our in-
         | house CI/CD platform we created because we're just soooo
         | special. Now you're allowed to launch. Oh wait, NO because
         | we've put the entire corporation on hold from launching new
         | systems for the last calendar year because we're still trying
         | to agree on the final process everyone needs to follow to go to
         | production.
         | 
         | It's surreal how universally so many orgs make the same
         | mistake of trying to throw more and more process at problems.
        
           | unethical_ban wrote:
           | In my previous role, the secdevops groups (matrixed teams)
           | were building custom terraform modules for our devs to use in
           | order to easily deploy compliant AWS infrastructure - and
           | devs could _only_ deploy via terraform /CI-CD. While TF
           | specifically states that custom modules are not meant to be
           | used as wrappers, I thought it was a clever way to try
           | getting security "out of the way" while still enforcing best
           | practices.
        
             | darkwater wrote:
             | > While TF specifically states that custom modules are not
             | meant to be used as wrappers
             | 
             | What do you mean with this?
        
           | acdha wrote:
           | > It's surreal how universally so many orgs makes the same
           | mistake of trying to throw more and more process at problems.
           | 
           | Followed by the inevitable ranting about "shadow IT", AKA the
           | requirements gathering they really should have done.
        
         | m0zg wrote:
         | At Google back then "leadership" might as well not even show
         | up. It was super bottom-up, and _you_, not "leadership" were
         | supposed to identify and fix issues. No "leadership" would stop
         | you, either, at least in most cases. I don't believe that in
         | all my years there anyone ever told me what to do. It was very
         | easy to start projects, shut down projects, get headcount, get
         | resources (if your business case is sufficiently persuasive to
         | others). Not a complete free for all, but certainly _a lot_
         | more freedom than you'd normally see in companies of that size.
         | And (IMO) people used that freedom and autonomy pretty well.
         | 
         | That kinda deteriorated over time, culminating with Sundar
         | "McKinsey" Pichai, and then went rapidly downhill from there,
         | and now I flat out reject their recruiters, based on the
         | feedback from friends still employed there.
        
       | Imnimo wrote:
       | What I don't get is why they wouldn't just use MongoDB. MongoDB
       | is web-scale.
        
         | hinkley wrote:
         | /dev/null is also web-scale
        
           | DeepYogurt wrote:
           | Is /dev/null fast? I will use /dev/null if it is fast.
        
             | flatiron wrote:
             | does it support sharding?
        
               | closeparen wrote:
               | It supports sharting: https://github.com/dcramer/mangodb
        
           | sondr3 wrote:
           | And available as a SaaS: https://devnull-as-a-service.com/
        
         | nostrademons wrote:
         | That was a major impetus for this video, IIRC. The "MongoDB is
         | web-scale" video went around Google about a month before
         | Broccoli Man and some enterprising Googler figured they could
         | use the same software to make a satire of Google's internal
         | tools.
        
           | hedgehog wrote:
           | Link for the Mongo video:
           | https://www.youtube.com/watch?v=b2F-DItXtZs
           | 
           | And bonus lean startup video:
           | https://www.youtube.com/watch?v=3J9KhpgYVB0
        
           | fragmede wrote:
           | MongoDB is web-scale:
           | https://www.youtube.com/watch?v=b2F-DItXtZs
           | 
           | NSFWish; it gets a bit personal around 3:11
        
             | alexjplant wrote:
             | I had a similar conversation with a heavily-intoxicated
             | MongoDB sales guy in a diner at 1AM after the second day of
             | KubeCon 2019. My concerns were primarily around data
             | consistency issues during denormalization and lack of
             | schema. His pitch was essentially "Who cares?! I'm getting
             | [three-letter agency] to move _everything_ to Mongo because
             | it's so cheap and easy! It's all just JSON! Why does it
             | need a schema?!"
             | 
             | He probably made more than I did that year so maybe he has
             | a point ¯\_(ツ)_/¯
        
             | mlindner wrote:
             | I miss 2010.
        
               | vinay_ys wrote:
               | Ah, 2010 - when web scale and its secret sauce - sharding
               | was all the rage.
        
         | vorticalbox wrote:
         | Maybe because mongodb had been out less than a year in 2010?
        
           | gnabgib wrote:
           | I think you missed the /s from GP.
        
             | vorticalbox wrote:
             | quite likely.
        
         | 323 wrote:
         | But is it planet-scale?
        
           | swalsh wrote:
           | That's out of date, we're now in the days of IPFS.
        
       | anshumankmr wrote:
       | What is this exactly?
        
         | ts4z wrote:
         | Xtranormal was a video service that would animate scripts with
         | some stock characters.
         | 
         | Someone made a bit of internal-only snark, and "I just want to
         | serve 5TB" became an in-joke for turning easy problems into
         | exercises in frustration.
         | 
         | Some of these things have, actually, been addressed.
        
       | SilasX wrote:
       | Wow, kind of funny that Xtranormal now lives on in the few viral
       | videos that were made with it.
       | 
       | Here's where the company is now (the original domain is used for
       | something else now):
       | 
       | https://en.wikipedia.org/wiki/Nawmal
        
       | raldi wrote:
       | Background: https://rachelbythebay.com/w/2021/10/30/5tb/
       | 
       | This video was hugely influential on changing the way Google does
       | internal tools and operations.
        
         | quelltext wrote:
         | How did things change?
        
           | vechagup wrote:
           | There's been a big investment in server platforms that strive
           | to enable SWEs to build a new service that follows Best
           | Practices with as little knowledge and handholding as
           | possible. These consist of conformance tests that yell at you
           | while you're coding if you are trying something generally
           | thought to be bad, and semi-automated workflows that help you
           | bring your code to production. When everything works as
           | intended, the production workflows set up a decent set of
           | alerts, acquire resources, configure CI/CD pipelines, and
           | launch your jobs with just a few button presses on your part.
           | (In practice, one of the steps will probably require
           | debugging, but eh, it seems way better than the broccoli man
           | video.)
        
           | scottlamb wrote:
           | I think you can read about some of these changes in Google's
           | SRE and SWE books (even if they don't mention this video in
           | particular), at least the ones most likely to be interesting
           | to someone outside Google.
           | 
           | But dropping Borgmon readability was the most immediate and
           | obvious. It was basically true that no one had Borgmon
           | readability. The policy was a catch-22: you couldn't get
           | readability for the simple/formulaic Borgmon macro
           | invocations that were encouraged and often sufficient. You
           | could only get it for doing something "clever". I got it by
           | writing fancy borgmon rules to paper over a problem that (in
           | hindsight) I should have solved elsewhere.
           | 
           | Another was easing quota management. IMHO the most
           | unbelievable thing in the video was that after Broccoli Man
           | told Panda Woman to get quota in two cells, she just said
           | "done". Besides the hassle in transcribing what you needed
           | into the request system [1], various types of quota were
           | chronically unavailable where you needed them, even in tiny
           | amounts. In 2010, I kept a critical infrastructure service
           | running by regularly IMing major clients' on-calls asking
           | them to donate 0.1 cpu(!) of their quota in some cell or
           | another when I didn't have quite enough to grow. There was a
           | "gray market" mailing list where people would trade resources
           | they couldn't get through the primary system. But eventually,
           | they built a system that for small services would make the
           | quota just happen for you.
           | 
           | Overall, it was a kick in the pants for the most basic
           | infrastructure teams that made them see how unnecessarily
           | hard this is for their internal customers, prompting them to
           | make small things just happen while keeping large things
           | possible. In any large organization, it's healthy to get this
           | kind of feedback regularly. The actual specific changes and
           | technologies are pretty specific to Google in 2010...
           | 
           | [1] Many people managed this very very tediously with
           | spreadsheets. I eventually wrote a tool to generate the
           | requests based on comparing your intended production config
           | with your current quota.
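
The footnote above describes a tool that generated quota requests by comparing an intended production config with current quota. The real tool is not shown in the thread, so the following is only a minimal sketch of that comparison. It assumes quota is a single CPU number per cell; `QuotaByCell` and `ComputeRequests` are hypothetical names, not anything from Google's tooling.

```cpp
#include <map>
#include <string>

// Hypothetical model (an assumption for illustration): CPU quota per
// cell, keyed by cell name.
using QuotaByCell = std::map<std::string, double>;

// Returns the additional CPU to request in each cell where the intended
// config needs more than the current quota provides. Cells that already
// have enough quota are left out of the result.
QuotaByCell ComputeRequests(const QuotaByCell& intended,
                            const QuotaByCell& current) {
  QuotaByCell requests;
  for (const auto& [cell, wanted] : intended) {
    const auto it = current.find(cell);
    // A cell with no entry in `current` is treated as having zero quota.
    const double have = (it == current.end()) ? 0.0 : it->second;
    if (wanted > have) {
      requests[cell] = wanted - have;
    }
  }
  return requests;
}
```

A real system would presumably track several resource types per cell (CPU, RAM, disk) and priorities; the single number here just keeps the diffing idea visible.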
        
             | jeffbee wrote:
             | Production priority quota horse trading in the days before
             | it was easy was a real skill. But non-production quota was
             | free and virtually infinite, even in those days.
        
           | compiler-guy wrote:
           | The most obvious change that came from this video is that
           | Google abandoned the Borgmon readability requirement. At
           | Google, every change needs approval from someone who has
           | passed a detailed style-guide review process in the given
           | language.
           | 
           | Readability is now earned over multiple changes; it used to
           | require one fairly big one. It's still a pain in the languages
           | that require it, which is all the main ones but very few of
           | the niche ones.
           | 
           | Many other things changed as well. Much of what the video
           | complains about got automated and better documented. But the
           | company has grown so much, and the product lines have
           | diversified so dramatically, that there are still plenty of
           | places to complain about the overhead.
        
             | m0zg wrote:
             | I've proudly managed to avoid Borgmon in favor of Monarch.
             | Which was new at the time, but worked all right even back
             | then. I have a lot fewer gray hairs because of that. They
             | should have kept and rigidly enforced the Borgmon
             | readability requirement to force people to migrate off that
             | convoluted, idiosyncratic piece of shit.
        
             | StillBored wrote:
             | I've never understood places with rigid style guides
             | policed by people. It's idiotic, because we have computers
             | and in places like google presumably a fair number of them
             | know basic parsing/lex sufficiently that if they can't make
             | a tool like clang-format that automatically reformats on
             | save/commit/whatever then they can use a tool like clang-
             | tidy to toss warnings during a development/CI/whatever
             | phase.
             | 
             | Putting people in charge of formatting/style is just an
             | excuse for wasting time bikeshedding, either the code is
             | wrong and a tool can tell you, or it's not wrong.
        
               | btilly wrote:
               | The hypothetical discussion about readability is
               | pointless.
               | 
               | Let's make it specific. Read
               | https://google.github.io/styleguide/cppguide.html for
               | readability for a language, namely C++. All the things
               | that can be automated, automatic tools have been written
               | for. But, for example, you can't automate "Prefer to use
               | a struct instead of a pair or a tuple whenever the
               | elements can have meaningful names." Because what does it
               | mean for a name to be meaningful?
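
To make the quoted style-guide rule concrete, here is a small sketch of the same data written once as a pair and once as a struct. The types and function names (`EndpointPair`, `Endpoint`, `FormatPair`, `FormatEndpoint`) are made up for illustration and do not come from the style guide.

```cpp
#include <string>
#include <utility>

// Pair version: callers have to remember that .first is the host and
// .second is the port.
using EndpointPair = std::pair<std::string, int>;

// Struct version: the field names carry the meaning.
struct Endpoint {
  std::string host;
  int port;
};

// Both functions build the same "host:port" string; only the
// readability of the member access differs.
std::string FormatPair(const EndpointPair& e) {
  return e.first + ":" + std::to_string(e.second);
}

std::string FormatEndpoint(const Endpoint& e) {
  return e.host + ":" + std::to_string(e.port);
}
```

A linter can flag every `std::pair` it sees, but deciding whether `host` and `port` are meaningful names for this particular data is the judgment call the comment is pointing at.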
        
               | StillBored wrote:
               | "Because what does it mean for a name to be meaningful? "
               | 
               | Are you optimizing for someone who already knows all the
               | project lingo, or someone who doesn't know any of it?
               | 
               | Are your engineers native English speakers?
               | 
               | There are a whole bunch of things which make the perfect
               | variable name frequently less than perfect, and putting
               | project insiders in charge likely yields the opposite
               | result.
               | 
               | Take:
               | https://elixir.bootlin.com/linux/latest/source/mm/khugepaged...
               | 
               | If you don't know what a vma, pte, pfn, compound_page,
               | young pte, huge page, lru, etc. are, you're going to be unable
               | to even begin to understand what that code is doing, despite
               | those all being pretty reasonable variable names and
               | actually fairly industry standard concepts. It gets worse
               | as you move to more esoteric topics. Expanding pte to
               | PageTableEntry might help some subset of users, but at
               | the expense of those that work on the code daily. So who
               | do you optimize for? Is it readable if the only people
               | that can read it already know what it does?
        
               | gravypod wrote:
               | Formatting and readability are two separate concepts (as
               | other replies have pointed out). I'd like to specifically
               | point to a fantastic example of what we mean when we say
               | "readability":
               | https://www.youtube.com/watch?v=wf-BqAjZb8M
               | 
               | Someone with readability in a language, who keeps up with
               | the style recommendations, will generally produce code
               | that is easier to read by other engineers.
        
               | kccqzy wrote:
               | That's not what readability is. There are plenty of
               | automated tools that will give you results from running
               | lint, ClangTidy and other tools. Readability is mostly
               | about structuring your code well to be easily read. It's
               | about architecting your code within a single file. It's
               | about telling a junior SWE who reinvented the wheel to use
               | a library function he/she didn't know about instead.
        
               | StillBored wrote:
               | So the rules can be codified sufficiently to test people
               | on, but they can't be codified for a computer?
               | 
               | The only one that sounds more difficult to codify is
               | telling people of the existence of duplicate functions.
               | But as someone who contributes to the linux kernel, I can
               | tell you right now that the only way that works reliably
               | is to have a very large pool of reviewers. Very
               | experienced engineers frequently miss what people are
               | doing in other parts of the source base, the name might
               | not be what they expect, etc, etc, etc. In the case of
               | linux there are a fair number of duplicates, or similar
               | functions, and people write coccinelle patches to replace
               | them on a fairly regular basis after they have been in
               | the kernel for years.
               | 
               | So, I doubt giving someone a formal gatekeeper flag,
               | really helps vs just having wider change review.
        
               | compiler-guy wrote:
               | I know of no automated tools available today that can
               | determine if an identifier is accurately and usefully
               | named. They can all tell if you are using the proper
               | case, but that doesn't really tell you anything.
               | 
               | No tool like that tells you if returning a bool instead
               | of an enum is appropriate here, or that a reference vs a
               | pointer makes more sense given the rest of the code.
               | 
               | I'm sure a clever machine learning algorithm could figure
               | that out with a corpus as large as Google's. Maybe. But
               | no tool like that works today.
               | 
               | And not strangely at all, Google does accept "what
               | clang-format does" as the canonical way of formatting text.
               | But readability at Google is far more than just formatting.
               | 
               | Readability is frustrating and annoying, but more than
               | just lint.
        
               | gravypod wrote:
               | > So the rules can be codified sufficiently to test
               | people on, but they can't be codified for a computer?
               | 
               | Small note: readability isn't a test or quiz you take
               | (asterisk). It's obtained by merging code in the language
               | you want readability for. If you merge code for a
               | language often and the reviewers have very few style-
               | based questions for the code then you will get
               | readability fairly quickly.
               | 
               | > The only one that sounds more difficult to codify is
               | telling people of the existence of duplicate functions.
               | But as someone who contributes to the linux kernel, I can
               | tell you right now that the only way that works reliably
               | is to have a very large pool of reviewers. Very
               | experienced engineers frequently miss what people are
               | doing in other parts of the source base, the name might
               | not be what they expect, etc, etc, etc.
               | 
               | A better example would be knowing when you should use
               | `const std::string&`, `std::string_view` or `char*`.
               | Example: https://abseil.io/tips/1
               | 
               | The best readability advice I have received has been:
               | 
               | 1. Direct "I was confused by X" or "The recommended way
               | to do A is using B", etc
               | 
               | 2. Reasoned: "std::string_view is more efficient and
               | clearer in intention than char ptr, it also improves type
               | safety as it is read only and clear about ownership"
               | 
               | 3. Linked to source material where examples are given (a
               | TotW or other examples in the code).
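
The `std::string_view` point can be shown with a short sketch (C++17 or later). `CountChar` and `CountCharStr` are hypothetical helpers written for this illustration; they are not taken from the linked tip.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Counts occurrences of `c` in `text`. Taking std::string_view lets
// callers pass a std::string or a string literal without copying the
// characters. The view does not own the data, so it must not outlive
// the string it refers to.
std::size_t CountChar(std::string_view text, char c) {
  std::size_t n = 0;
  for (char ch : text) {
    if (ch == c) ++n;
  }
  return n;
}

// The same interface taking const std::string&. Passing a string
// literal here first constructs a temporary std::string.
std::size_t CountCharStr(const std::string& text, char c) {
  return CountChar(text, c);
}
```

Both versions give the same answer; the review comment a human would leave is about which parameter type states the intent (read-only, non-owning) more clearly at this call site.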
        
               | kccqzy wrote:
               | > Very experienced engineers frequently miss what people
               | are doing in other parts of the source base, the name
               | might not be what they expect, etc, etc
               | 
               | Very true. Readability can't help with that, nor is it
               | designed to. It's mostly there to help novices and new
               | hires. Experienced engineers already have readability
               | themselves so they don't need this extra review.
        
               | iamstupidsimple wrote:
               | Readability is not about formatting, that's an orthogonal
               | issue. It's possible to have terrible code that's
               | perfectly formatted.
               | 
               | It's more about good usage of idiomatic language
               | constructs, which still requires good human judgement to
               | evaluate.
        
               | StillBored wrote:
               | And I take it google has done wide ranging scientific
               | studies about the variations in coding styles and
               | language constructs that it is a secret advantage that
               | they know how to write "readable" code? Implying they
               | tried a bunch of different ways until settling on the one
               | true way that allows a diverse set of people with diverse
               | experiences to read it?
               | 
               | Ever heard of COBOL?
               | 
               | Because readability has always been in the eye of the
               | beholder, and codifying it makes it even worse.
        
               | iamstupidsimple wrote:
               | What counts for readability is not set in stone by some
               | language czar as the One True Way. Everyone knows the
               | style guides can't be perfect which is why they're
               | relatively mutable.
               | 
               | In any case, readability will comment on stuff that
               | cannot easily be quantified, such as when to use a
               | certain object hierarchy or dependency injection, etc...
        
               | compiler-guy wrote:
               | I'm not a fan of readability exactly the way Google does
               | it, but I'm pretty happy that Google insists on various
               | aspects of it, like good identifier names.
               | 
               | I don't know of any research off hand, but I'm pretty
               | sure the industry consensus is that good identifier names
               | improve the quality of the code (Go style
               | notwithstanding.) Readability is one way of training
               | engineers to do it.
        
               | joshuamorton wrote:
               | > Because readability has always been in the eye of the
               | beholder, and codifying it makes it even worse.
               | 
               | This is empirically false. Consistency, even if it is
               | unfavorable to your preferences, is superior to
               | inconsistency. So a codified set of best practices is
               | better than none at all.
               | 
               | There are parts of Google's style guides that I would
               | change if I could, but I also prefer having a style guide
               | (and one that goes beyond things that are lintable) than
               | none at all, because consistency across the codebase
               | means that I can usually understand code at a glance, or
               | if not, know at a glance that something unusual is
               | happening. (this is in fact precisely the argument in
               | favor of autoformatters like gofmt/black/prettier, but
               | extended to softer concepts that can't always be
               | formatted: consistent style, even if it isn't your
               | favorite, is superior to inconsistent style).
        
               | StillBored wrote:
               | Consistency is what you get when you have a defined rule
               | set programmatically enforced. If you're looking for
               | "readability" via human judgment, then you get a very
               | different result.
        
               | joshuamorton wrote:
               | A programmatically enforced set of rules is certainly one
               | way to get consistency, but it isn't the only way. You
               | can achieve consistency through culture and training too,
               | and sometimes that's the only way.
               | 
               | Edit: you can look at Google's C++ style guide for some
               | examples:
               | https://google.github.io/styleguide/cppguide.html#Structs_vs...
               | 
               | It isn't possible to statically analyze if a class/struct
               | is a POD or if the methods enforce invariants. But it's
               | often very easy to do so with a human eye. And there's
               | value in the distinction!
               | 
               | Similarly, forcing someone to justify using a power-
               | feature (operator overloading, templates, metaclasses,
               | whatever) can only be done by a human. There may be cases
               | where the power feature is warranted and the benefits
               | outweigh the cost, but a linter can't know that. (and
               | ultimately all of this comes back to: things look
               | consistent, and when things are inconsistent, that's a
               | strong signal that something unusual is happening and you
               | should pay close attention)
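
As a rough illustration of the struct-versus-class distinction mentioned above, the sketch below contrasts passive data with a type that protects an invariant. `Point` and `Range` are hypothetical examples, and the `Create` factory is just one way to enforce the invariant, chosen here as an assumption.

```cpp
#include <optional>

// Passive data: any combination of field values is valid, so a plain
// struct with public fields is enough.
struct Point {
  double x = 0;
  double y = 0;
};

// Invariant: lo <= hi always holds. The constructor is private, so the
// only way to obtain a Range is through Create(), which checks it.
class Range {
 public:
  // Returns std::nullopt instead of building an invalid Range.
  static std::optional<Range> Create(int lo, int hi) {
    if (lo > hi) return std::nullopt;
    return Range(lo, hi);
  }

  int lo() const { return lo_; }
  int hi() const { return hi_; }
  int length() const { return hi_ - lo_; }

 private:
  Range(int lo, int hi) : lo_(lo), hi_(hi) {}

  int lo_;
  int hi_;
};
```

A reviewer can see at a glance that `Range` guards something and `Point` does not, which is the kind of signal the comment says consistent style provides.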
        
             | dekhn wrote:
             | Context on borgmon:
             | https://sre.google/sre-book/practical-alerting/
             | 
             | borgmon was a truly weird system.
        
               | mikelward wrote:
               | It still is, but it used to be, too.
        
           | mathteddybear wrote:
           | Broadly speaking, there are tools to automate this or that,
           | some technologies are getting deprecated and replaced by new
           | ones
           | 
           | Also probably the privacy review could be a bigger bottleneck
           | these days ;-)
        
         | justicezyx wrote:
         | I worked at TI, Planet and later Borg, I did not feel much
         | influence of this video other than a chuckle. Or I might be
         | too low level to perceive.
        
           | jeffbee wrote:
           | I think it was a very common perception among application-
           | level SWEs and SREs that TI, platforms, and Borg did not
           | themselves use the stack enough to perceive its flaws.
        
         | sicromoft wrote:
         | See also the recent discussion here of "I Don't Know How To
         | Count That Low": https://news.ycombinator.com/item?id=28988281
        
         | jrockway wrote:
         | I feel like people forgot after about five years. I remember
         | wasting a week filling out various pieces of paperwork and
         | submitting byzantine configuration CLs so that some contractor
         | would have permission to view a certain webpage through the
         | corporate proxy. (I think what made me most mad is that regular
         | employees could view the website with no additional
         | configuration. I can understand if I was filling out tickets to
         | get approval, or a security review, but the actual
         | configuration of the proxy had to change to allow this, in
         | addition to getting all of those approvals!) My team didn't
         | make the website, and the contractor didn't work on my team, so
         | I'm honestly not sure why I was involved. I just remember being
         | annoyed about it. I'm sure there are some memes about it in the
         | archive.
        
           | ikiris wrote:
           | Contractor access is its own hellscape.
        
             | q3k wrote:
             | ... socially, too. That part sucked.
        
               | abustamam wrote:
               | Separation of the classes and all
        
             | davidw wrote:
             | I did a stint contracting for Goldman Sachs a while back. I
             | can relate. Don't think I can say anything more without a
             | team descending on my house from a black helicopter,
             | though.
        
               | servytor wrote:
               | At night all helicopters are black.
        
               | twinkletwinkle_ wrote:
               | I once worked at a tiny startup where we were trying to
               | sell a dataset to GS. Before we could even send a sample,
               | they sent over some boilerplate forms for us to sign. I
               | remember two distinct stipulations - anything we sent
               | them was immediately and forever their property, AND they
               | had the right to drug test any of our employees. We ended
               | up not signing so there was no deal. My boss said it was
               | their way of getting rid of us.
        
           | hnov wrote:
           | This is paradoxically because historically everything was
           | wide open to anyone so access-control and such isn't super
           | fleshed out for most apps behind the proxy. Random internal
           | app X could have been conceived and built with very little
           | oversight and opening it up to a rotating cast of temporary
           | workers is seen as an unnecessary risk. Broadly used apps
           | (e.g. the bug ticket system) tend to have app-level security-
           | controls and are not blocked by the proxy for contractors.
        
         | inoffensivename wrote:
         | It was hugely influential in identifying the frustration of
         | getting things done at Google. In my experience it's even more
         | true now than it was back then, the number of things you have
         | to deal with has just grown. I've been at Google since 2006 and
         | I feel like I'm losing my mind with all the complexity.
        
           | jez wrote:
           | Out of genuine curiosity, what keeps you at Google for 15
           | years despite perceived increase in complexity to getting
           | things done? I'm wondering whether the answer is like
           | "there's a lot of complexity, but I like the work I do more"
           | or "I like the people more" or some other reason.
        
             | Dangeranger wrote:
             | I think they call them "golden handcuffs".
        
               | oblio wrote:
               | Heh, that's <<if>> they want to leave.
               | 
               | If they don't, they call it "comfy job with awesome
               | paycheck and not a lot of pressure" :-p
        
               | dTal wrote:
               | (this comment now obsolete)
        
               | oblio wrote:
               | There, I fixed it!
               | 
               | https://cheezburger.com/5821507840/his-and-hers-shower-head
        
               | dTal wrote:
               | You know, I've made similar remarks on HN before, but you
               | are the first person to actually edit their comment.
               | Amazing. Now I feel compelled to edit mine...
        
               | marcyb5st wrote:
               | Googler here.
               | 
               | I think the technical term is Golden cage :)
        
       | treebog wrote:
       | What does "serve 5TB" refer to? They expect 5TB of network
       | bandwidth over some time period (a month?)? Or their database
       | takes up 5TB on disk?
        
         | raldi wrote:
         | It's a joke that's sort of open to interpretation.
         | 
         | The most straightforward is, "I just want to do this incredibly
         | simple thing; why is it so hard?"
         | 
         | But there's also the level of, "Googlers are so engineeringly
         | pampered that they think serving 5 terabytes is the equivalent
         | of Hello World."
         | 
         | And then there's another level of, "Well, isn't it? After all,
         | this is Google and this is $YEAR."
        
         | rachelbythebay wrote:
         | Imagine it as "I want to have a http://foo/~me/ type path where
         | I can park 5 TB of stuff and other people can fetch from it
         | when they feel like it".
         | 
         | 5 TB of data made available, not 5 TB of
         | transfer/bandwidth/etc.
        
         | metalliqaz wrote:
         | i think it just means to put 5TB of data online
        
         | drjasonharrison wrote:
         | If you watch the video, it doesn't matter. It's just something
         | they want to serve.
        
       | shoeshoeshoey wrote:
       | Facebook had its own meme: "Pusher I need a hotfix"
        
       | w0mbat wrote:
       | When I first started at Google I got things done a lot faster
       | because I didn't know all those rules existed and nobody stopped
       | me. My service was still plenty fast & reliable. Eventually it
       | all got rewritten by other people to do things properly like the
       | video says.
        
         | dekhn wrote:
         | I managed to deploy a whole system at google that had the
         | ability to take down all of google globally by DoS'ing the
         | network, and ran it casually (i.e., starting and stopping it
         | when I felt like it, at the capacity I felt like, with the
         | binary versions I wanted) for 3 years.
         | 
         | In retrospect, this was absolutely crazy! The actual visible
         | outcomes were: 1 cluster drained due to heat rising so fast the
         | alerting thought there was a fire, 1 page to an engineer in the
         | middle of the night (sorry discovery-service) and a whole bunch
         | of complaints about CPU stealing that weren't my fault.
         | 
         | Those were the good old days.
        
         | robocat wrote:
         | Rachel said something similar: "My own 'solution' to it after
         | far too much thrashing was just to say 'we cannot get all N
         | types of quota in the same place so we are at the mercy of
         | whatever happens to be available, and if that dries up, we stop
         | running'. Granted, this was for some internal stuff that was
         | seven or eight levels removed from anything that anyone on the
         | outside might ever see, but still, it was stupid and made me
         | feel so dirty. I'm sure my non-solution probably bit someone
         | later. Sorry, whoever." --
         | https://rachelbythebay.com/w/2021/10/30/5tb/
        
       | kgin wrote:
       | Only if you think your users are scum. Do you think your users
       | are scum? Do you? Why do you hate your users?
        
       | jbverschoor wrote:
       | omg so toxic
        
       | jamestimmins wrote:
       | As a friend of mine explained why she left Google a year or so
       | ago, "I got tired of emailing 30 people to try to figure out who
       | owned a single variable."
        
         | nostrademons wrote:
         | A TL I worked with once had a simple but effective strategy for
         | that:
         | 
         | "Remove it and see who complains."
         | 
         | I did that (with the impenetrably named "PrefetchExperiment",
         | last touched by a branch that lost previous file history in
         | 2007). Turned out it was the source data for Google's DNS to
         | figure out how to route queries to the lowest-latency
         | datacenter, based on their geographic location. In about a
         | month, it would've taken down all Google services. Oops.
         | 
         | It was a very effective way of figuring out who owned the
         | variable and writing a big long comment explaining what it's
         | there for and which team to contact before changing it, though.
        
           | joshuamorton wrote:
           | Scream tests are always fun. ("break it and see who screams")
        
           | AlexanderTheGr8 wrote:
           | LMAO, isn't this very similar to FaceBook's recent DNS
           | problem?
           | 
           | Also I love the idea of removing it and seeing who complains.
        
             | _3u10 wrote:
             | It also works great for a product / bug backlog. Just
             | delete the entire thing. If it's a real bug / feature it
             | will get recreated.
        
             | zamadatix wrote:
             | Facebook's recent "DNS problem" was that a process for
             | checking routing failover capacity on the backbone for
             | maintenance ended up taking down the backbone links. As a
             | result of the servers being disconnected from the backbone
             | they pulled their BGP advertisements since they considered
             | their location to be unhealthy (no connection to the
             | backbone).
             | 
             | FB's problem was the lack of routing reachability on its
             | backbone triggering the lack of routing reachability
             | information being sent to the larger internet, this in turn
             | caused problems for DNS not the other way around.
        
           | ikiris wrote:
           | The hilarious thing is I know exactly what file you're
           | talking about here.
        
         | raldi wrote:
         | The best and worst parts of being a Google engineer: Impossible
         | things are merely very hard, but on the other hand, easy things
         | are also very hard.
        
       | taldo wrote:
       | Ah, the laughs (Xoogler since 2020). It was a lot easier, at
       | least last year: you'd use "flex" quota from your PA pool
       | (product area) for Spanner and Borg, write some code for your
       | server, a few configs here and there, and you'd be ready and
       | serving.
        
         | bhickey wrote:
         | About six years ago I had a resource manager deny me a database
         | instance the very same day it became available for flex in
         | another product area. I tried to "Hey Mister" resources from
         | someone in that group to no avail. Eventually I wrote a high-
         | durability key-value store on top of our source control system
         | and told them they could give me my database or I'd be
         | deploying to prod.
        
         | dilyevsky wrote:
         | That video came out a few years before flex appeared _I think_
         | at a time they were having a sort of "resource crunch" on the
         | heels of a growth spurt following the GFC.
        
           | willidiots wrote:
           | Flex was available for certain things (Colossus IIRC, gave
           | you a ton of flex quota) but for others it wasn't. Because
           | This is Google.
        
             | the-rc wrote:
             | It was easier to mint and carve out Colossus quota than
             | e.g. Bigtable. I seem to remember that flex for Borg
             | existed, but only in a few locations with enough capacity
             | to back it. You couldn't just retrofit it in clusters where
             | existing, large customers were already granted and using
             | most of the quota.
        
           | the-rc wrote:
           | It wasn't that there was a crunch -- that had always existed.
           | There just wasn't all the tooling to implement anything like
           | flex. At least this video was made after "buying Borg quota"
           | was a normal thing. Before it, you had to "buy" regular
           | machines and donate/assimilate them into Borg. Then after X
           | days you'd receive your quota, minus a Borg "tax" of 10% to
           | cover borglet and system daemons' overhead.
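To put numbers on the Borg "tax" described above, here is a rough sketch of the arithmetic. Only the 10% figure comes from the comment; the function name, the example machine count, and the assumption that the tax is a flat fraction of the donated capacity are illustrative guesses, not a description of how the real accounting worked.

```python
def quota_after_tax(donated_machines: float, tax_rate: float = 0.10) -> float:
    """Return usable quota, in machine-equivalents, after the overhead tax.

    Assumption: the tax is a flat fraction of whatever was donated.
    """
    if not 0.0 <= tax_rate < 1.0:
        raise ValueError("tax_rate must be in [0, 1)")
    return donated_machines * (1.0 - tax_rate)


if __name__ == "__main__":
    # Under this assumption, donating 100 machines leaves roughly 90
    # machines' worth of quota for your own jobs.
    print(quota_after_tax(100))
```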
        
         | dekhn wrote:
         | you left out monitoring for reliability which is a major part
         | of this video
        
       | mikewarot wrote:
       | I didn't realize this was made by Google in the first place when
       | I saw it a few days ago. I hope things are simpler now, but I
       | doubt it.
        
       | jedberg wrote:
       | The same conversation at Netflix 10 years ago:
       | 
       | I want to serve 5TB of data.
       | 
       | Ok, spin up an instance in AWS and put it there.
       | 
       | I want it production ready.
       | 
       | Ok, replicate it to a second instance. If it breaks we'll page
       | you to fix it.
       | 
       | The funny thing is, for important stuff, we ended up doing
       | similar things to what you see in this video, but for unimportant
       | things, we didn't. I think it was a better system, and it was
       | amusing when we hired people from Google who were confused by the
       | lack of process and approvals.
        
         | ignoramous wrote:
         | > I want to serve 5TB of data. Ok, spin up an instance in AWS
         | and put it there... it was amusing when we hired people from
         | Google who were confused by the lack of process and approvals.
         | 
         | Quoting from _Velocity in Software Engineering_
         | https://queue.acm.org/detail.cfm?id=3352692:
         | 
         |  _In 2003, at a time in Amazon's history when we were
         | particularly frustrated by our speed of software engineering,
         | we turned to Matt Round, an engineering leader who was a most
         | interesting squeaky wheel in that his team appeared to get more
         | done than any other, yet he remained deeply impatient and
         | complained loudly and with great clarity about how hard it was
         | to get anything done. He wrote a six-pager that had a great
         | hook in the first paragraph: "To many of us Amazon feels more
         | like a tectonic plate than an F-16."_
         | 
         |  _Matt's paper had many recommendations... including the
         | maximization of autonomy for teams and for the services
         | operated by those teams by the adoption of REST-style
         | interfaces, platform standardization, removal of roadblocks or
         | gatekeepers (high-friction bureaucracy), and continuous
         | deployment of isolated components. He also called... for an
         | enduring performance indicator based on the percentage of their
         | time that software engineers spent building rather than doing
         | other tasks. Builders want to build, and Matt's timely
         | recommendations influenced the forging of Amazon's technology
         | brand as "the best place where builders can build."_
         | 
         | ...leading up to the creation of AWS.
        
           | jll29 wrote:
           | > we turned to Matt Round, an engineering leader who was a
           | most interesting squeaky wheel in that his team appeared to
           | get more done than any other
           | 
           | Matt went on to study theology, and he's started a church
           | community in Scotland: https://www.linkedin.com/in/mattround/
           | 
           | "Leader Company Name: Hope City Church Edinburgh Dates
           | Employed: Sep 2017 - Present Driving a new church start-up."
        
           | ryandrake wrote:
           | The "approval paralysis" thing happens at a lot of companies,
           | large and small, not just GiantTech. It creeps up on you
           | slowly:
           | 
           | 1. A big problem happens that gains the attention of
           | leadership.
           | 
           | 2. The problem is root-caused to some risky thing an employee
           | did trying to accomplish XYZ.
           | 
           | 3. To correct this, a _process_ is put in place that must be
           | followed when one wants to do XYZ, and (critically)
           | gatekeepers are anointed who must approve the activity.
           | 
           | 4. These gatekeepers are inevitably senior already-busy people
           | who become bottlenecks. Now we can't do this critical thing
           | without hounding approvers.
           | 
           | 5. Some other big problem happens and the above cycle starts
           | all over again.
           | 
           | Before you know it, every even slightly risky task you need
           | to do through the course of your job requires the blessing of
           | approvers who are well-intentioned, but all so overloaded
           | they don't even answer their E-mail or chats. They sometimes
           | need to be physically grabbed in the hallway in order to
           | unblock your project. Progress grinds to a halt and it still
           | has not stopped production problems--just those particular
           | classes of problems that the approval processes caught.
           | 
           | EDIT: Not sure what the right solution is, but it must be one
           | that doesn't rely on a particular overloaded human doing
           | something. Maybe an automated approval system that produces a
           | paper trail (to help with postmortem and corrective action
           | later) and ensuring all changes can be rolled back
           | effortlessly. Easier said than done, obviously.
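The "automated approval with a paper trail and easy rollback" idea in the EDIT above could look something like the following minimal sketch. Everything here (the `Change` and `AuditedChanges` names, the in-memory trail, the undo callbacks) is hypothetical and only meant to show the shape of the idea, not any real system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Change:
    """A change that knows how to apply itself and how to undo itself."""
    author: str
    description: str
    apply: Callable[[], None]
    undo: Callable[[], None]


@dataclass
class AuditedChanges:
    """Applies changes without a human gatekeeper, but records everything."""
    trail: List[str] = field(default_factory=list)
    applied: List[Change] = field(default_factory=list)

    def _log(self, action: str, change: Change) -> None:
        # The paper trail: who did what, and when (UTC timestamps).
        stamp = datetime.now(timezone.utc).isoformat()
        self.trail.append(
            f"{stamp} {action} by {change.author}: {change.description}"
        )

    def submit(self, change: Change) -> None:
        change.apply()
        self.applied.append(change)
        self._log("applied", change)

    def rollback_last(self) -> None:
        # Undo the most recently applied change and record that too.
        change = self.applied.pop()
        change.undo()
        self._log("rolled back", change)
```

The trade-off in this sketch is that nobody blocks the change up front; the trail and the undo callback are what make that tolerable for the postmortem afterwards.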
        
             | david422 wrote:
             | What is the solution?
             | 
             | I've worked at big companies that are mired in process
             | because they would rather spend more time crossing i's and
             | dotting t's than risk breaking something. I can see why.
             | 
             | And I've worked at smaller companies where the clients are
             | small and it's easy to fix things that break. Move fast and
             | break things at a small scale maybe.
             | 
             | But how do you grow to be a big company and still operate
             | like a small company? I can't seem to see an answer.
        
               | native_samples wrote:
               | There are many, but the problems are more subtle than
               | this video really gives credit for.
               | 
               | I worked at Google at the time this video was made, and
               | empathized (in fact I had been an SRE for years by that
               | point). Nonetheless, there are flip sides that the video
               | maker obviously didn't consider.
               | 
               | Firstly, why did everything at Google have to be
               | replicated up the wazoo? Why was so much time spent
               | talking about PCRs? The reason is, Google had consciously
               | established a culture up front in which individual
               | clusters were considered "unreliable" and everyone had to
               | engineer around that. This was a move specifically
               | intended to increase the velocity of the datacenter
               | engineering groups, by ensuring they did _not_ have to
               | get a billion approvals to do changes. Consider how slow
               | it'd be to get approval from every user of a Google
               | cluster, let alone an entire datacenter, to take things
               | offline for maintenance. These things had tens of
               | thousands of machines _per cluster_ and that was over a
               | decade ago. They'd easily be running hundreds of
               | thousands of processes, managed by dozens of different
               | groups. Getting them all to synchronize and approve
               | things would be impossible. So Google said - no approvals
               | are necessary. If the SRE/NetOps/HWOPS teams want to take
               | a cluster or even entire datacenter offline then they
               | simply announce they're going to do it in advance, and,
               | everyone else has to just handle it.
               | 
               | This was fantastic for Google's datacenter tech velocity.
               | They had incredibly advanced facilities years ahead of
               | anyone else, partly due to the frenetic pace of upgrades
               | this system allowed them to achieve. The downside:
               | software engineers have to run their services in >1
               | cluster, unless they're willing to tolerate downtime.
               | 
               | Secondly, why couldn't cat woman just run a single
               | replica and accept some downtime? Mostly because Google
               | had a brand to maintain. When she "just" wanted to serve
               | 5TB, that wasn't really true. She "just" wanted to do it
               | under the Google brand, advertised as a Google service,
               | with all the benefits that brought her. One of the
               | aspects of that brand that we take for granted is
               | Google's insane levels of reliability. Nobody, and I mean
               | nobody, spends serious time planning for "what if Google
               | is down", even though massive companies routinely
               | outsource all their corporate email and other critical
               | infrastructure to them.
               | 
               | Now imagine how hard it'd be to maintain that brand if
               | random services kept going offline for long periods
               | without Google employees even noticing? They could say,
               | sure, this particular service just wasn't important
               | enough for us to replicate or monitor and the DC is under
               | maintenance, we think it'll be back in 3 days, sorry. But
               | customers and users would freak out, and rightly so. How
               | on earth could they guess what Google would or would not
               | find worthy of proper production quality? That would be
               | opaque to them, yet Google has thousands of services.
               | It'd destroy the brand to have some parts that are
               | reliable and others not according to basically random
               | factors nobody outside the firm can understand. The only
               | solution is to ensure every externally visible service is
               | reliable to the same very high degree.
               | 
               | Indeed, "don't trust that service because Google might
               | kill it" is one of the worst problems the brand has, and
               | that's partly due to efforts to avoid corporate slowdown
               | and launch bureaucracy. Google signed off on a lot of dud
               | uncompetitive services that had serious problems,
               | specifically because they hated the idea of becoming a
               | slow moving behemoth that couldn't innovate. Yet it
               | trashed their brand in the end.
               | 
               | A lot of corporate process engineering is like this. It
               | often boils down to tradeoffs consciously made by
               | executives that the individual employee may not care
               | about or value or even know about, but which is good for
               | the group as a whole. Was Google wrong to take an
               | unreliable-DC-but-reliable-services approach? I don't
               | know but I really doubt it. Most of the stuff that SWEs
               | were super impatient to launch and got bitchy about
               | bureaucracy wasn't actually world changing stuff, and a
               | lot ended up not achieving any kind of escape velocity.
        
               | edude03 wrote:
               | This is a great explanation, thank you.
               | 
               | (I've never worked at google, and maybe this isn't a
               | problem anymore however) It seems like the "solution"
               | here would be to do for Infra what Go did for Concurrency
               | - build an abstraction with sane defaults, and rubber
               | stamp anything that doesn't stray from those defaults.
               | Anything that does - requires further scrutiny.
               | 
               | For example, at the companies where I've been responsible
               | for infrastructure (admittedly much smaller than google)
               | I've done exactly that (with Kubernetes specific things
               | like PodDisruptionBudgets and defaulting to 2 replicas),
               | and if users use the default helm chart values, they can
               | ship their service by themselves.
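The "rubber stamp anything that doesn't stray from the defaults" approach above can be sketched as a simple comparison against a known-good baseline. The setting names and values in `DEFAULTS` are made up for illustration; a real setup would read them from chart values or similar.

```python
# Hypothetical sane defaults; names and values are illustrative only.
DEFAULTS = {
    "replicas": 2,
    "min_available": 1,
    "cpu_limit": "500m",
}


def deviations_from_defaults(requested: dict) -> list:
    """Return a list of reasons the request needs human review.

    An empty list means the request matches the defaults and can be
    auto-approved.
    """
    reasons = []
    for key, value in requested.items():
        if key not in DEFAULTS:
            reasons.append(f"{key}: not a known setting")
        elif value != DEFAULTS[key]:
            reasons.append(
                f"{key}: {value!r} differs from default {DEFAULTS[key]!r}"
            )
    return reasons


def auto_approve(requested: dict) -> bool:
    """True when nothing strays from the defaults."""
    return not deviations_from_defaults(requested)
```

A request that only sets values equal to the defaults (or sets nothing) ships without review; anything else comes back with the specific settings that need a second look.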
        
               | Ao7bei3s wrote:
               | Self-service approvals.
               | 
               | Instead of appointing a senior eng to be approver, task
               | the same senior eng with writing down his decision
               | criteria (as text or where it makes sense even as code).
               | 
               | This has advantages for everyone:
               | 
               | 1. It lets the engineers who need approval move at their
               | own speed, and plan time for it as a predictable work
               | item like any other, instead of depending on an approver
               | for whom the approvals will usually be at a lower
               | priority and mid-sprint.
               | 
               | 2. For the approval policy writer, it turns this into a
               | one time effort with a defined scope that can be planned
               | and prioritized in his/her own backlog, instead of open
               | ended toil that can come at any time, take any time, and
               | not clearly relate to their own current priorities.
               | 
               | 3. For the company, writing down the policy brings
               | consistent decision making.
               | 
               | Obviously this requires trust that employees can and will
               | say "no, can't do" when they're tasked with something
               | that is not approvable, which can be culturally difficult
               | (business and otherwise). Checklists (literally a list of
               | checkboxes to click on, "I confirm that...") can help
               | with this.
               | 
               | (As an example of writing down the policy as code: that's
               | any CI/CD pipeline. But it's not limited to engineering
               | decision making - for example, we're using a well-known
               | open source license management tool that promises auto-
               | approval for open source library use depending on
               | policies configured by legal. This works moderately not
               | so well because this particular tool is not great; the
               | idea is sound. We still made it work: now legal wrote
               | down their policies, trained a large number of engineers
               | on them and those are now empowered to make decisions.)
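As a small illustration of "writing down the decision criteria as code" from the comment above, here is a sketch of a license policy check. The two sets are invented for the example and do not reflect any real legal policy or the tool the commenter mentions; only the SPDX-style identifiers themselves are real.

```python
# Hypothetical policy, as legal might write it down. The classification
# below is invented for illustration.
AUTO_APPROVED = {"MIT", "BSD-3-Clause", "Apache-2.0"}
NEEDS_LEGAL_REVIEW = {"GPL-3.0-only", "AGPL-3.0-only"}


def decide(license_id: str) -> str:
    """Return 'approved', 'escalate', or 'unknown' for a license id."""
    if license_id in AUTO_APPROVED:
        return "approved"
    if license_id in NEEDS_LEGAL_REVIEW:
        return "escalate"
    # Anything the written policy does not cover goes back to a human,
    # and ideally the policy gets extended afterwards.
    return "unknown"


def review_dependencies(licenses: dict) -> dict:
    """Map each dependency name to a decision, given name -> license id."""
    return {name: decide(lic) for name, lic in licenses.items()}
```

The engineer gets an answer immediately for the common cases, and the approver only sees the cases the written policy does not already settle.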
        
               | ignoramous wrote:
               | Autonomy.
               | 
               | Solution to such org woes, in part, is discussed by
               | Clayton Christensen in his work, _The Innovator's
               | Solution_
               | http://web.mit.edu/6.933/www/Fall2000/teradyne/clay.html:
               | _Even after correctly identifying potentially disruptive
               | technologies, firms still must circumvent its hierarchy
               | and bureaucracy that can stifle the free pursuit of
               | creative ideas. Christensen suggests that firms need to
               | provide experimental groups within the company a freer
               | rein. "With a few exceptions, the only instances in which
               | mainstream firms have successfully established a timely
               | position in a disruptive technology were those in which
               | the firms' managers set up an autonomous organization
               | charged with building a new and independent business
               | around the disruptive technology." This autonomous
               | organization will then be able to choose the customers it
               | answers to, choose how much profit it needs to make, and
               | how to run its business._
               | 
               | ---
               | 
               | Amazon and Cloudflare are good examples of big-orgs
               | trying their best to implement late Prof. Christensen's
               | ideas.
               | 
               | Andy Jassy on Amazon's approach to innovation:
               | https://www.hbs.edu/forum-for-growth-and-
               | innovation/podcasts...: _And then if we like the answers
               | to those first four elements, then we ask, can we put a
               | group of single threaded focused people on this
               | initiative, even if it seems like they're overwhelming
               | it with strong senior people, if you try to add really
               | busy people do the existing business and the big new
               | idea, they will always favor the existing business
               | because it's surer bet. So we want to peel people away
               | from the existing business and put them just on the new
               | initiative._
               | 
               | Pace of innovation at Cloudflare
               | https://blog.cloudflare.com/the-secret-to-cloudflare-
               | pace-of...: _...it is not unusual for an initial product
               | idea to start with a team small enough to split a pack of
               | Twinkies and for the initial proof of concept to go from
               | whiteboard to rolled out in days. We intentionally staff
               | and structure our teams and our backlogs so that we have
               | flexibility to pivot and innovate. Our Emerging
               | Technology and Incubation team is a small group of
               | product managers and engineers solely dedicated to
               | exploring new products for new markets. Our Research team
               | is dedicated to thinking deeply and partnering with
               | organizations across the globe to define new standards
               | and new ways to tackle some of the hardest challenges._
               | 
               | ---
               | 
               | Also read: Clayton Christensen and Stephen Kaufman on
               | "Resources, Process, and Priorities": https://personal.ut
               | dallas.edu/~chasteen/Christensen%20-%202n...
        
               | [deleted]
        
             | bostonsre wrote:
             | Automate as much as possible. Approval gates are there to
             | prevent obvious issues from continuing down the pipeline.
             | If you can automate checks for known issues that you want
             | to prevent from happening, then you should be able to add
             | it as a test step. Then in the catch, log why it failed and
             | point the dev at documentation.
             | 
             | Manual processes suck for everyone involved.
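The "add it as a test step, then in the catch, log why it failed and point the dev at documentation" pattern above might look roughly like this. The check itself (a minimum replica count), the `CheckFailed` exception, and the docs URL are placeholders chosen for the example.

```python
import logging

logger = logging.getLogger("pipeline")

# Placeholder URL; a real pipeline would link to its own documentation.
DOCS_URL = "https://example.com/docs/replication"


class CheckFailed(Exception):
    """Raised by an automated check that replaces a manual approval."""


def check_replicas(config: dict) -> None:
    # Example of a known issue turned into an automated check.
    if config.get("replicas", 1) < 2:
        raise CheckFailed("service must run at least 2 replicas")


def run_gate(config: dict) -> bool:
    """Run the check as a pipeline step; return True if it passes."""
    try:
        check_replicas(config)
    except CheckFailed as err:
        # Say why it failed and where to read more, instead of
        # making the developer chase an approver.
        logger.error("check failed: %s (see %s)", err, DOCS_URL)
        return False
    return True
```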
        
             | jeffbee wrote:
             | You cannot have both an organization that fastidiously
             | protects the privacy and security of user data, and one
             | that requires no process to build and launch software. It's
             | just not possible.
             | 
             | Anyway the video is just a joke. I've never worked anywhere
             | where it was as easy to just serve 5TB of static data as at
             | Google. Googlers who want to just host junk under their own
             | authority do not need to shop for quota, set up borgmon,
             | etc.
        
               | joshuamorton wrote:
               | Right, looking back, they're setting up a production,
               | user-facing service. If I want to just store a 5TB blob
               | somewhere, I think that fits in freebie CNS, so I don't
               | even have to provision resources, I just cat the file or
               | whatever (granted, 5TB was a bit bigger 10 years ago).
               | 
               | Having a rule that "your user-facing service needs to be
               | replicated" is a good rule. Replication being difficult
               | was the problem.
        
             | Zababa wrote:
             | I've read on HN that "processes are organizational scar
             | tissue". I think it applies here.
        
               | riknos314 wrote:
               | Yep. A wise engineer once told me "Runbooks [written
               | SOPs] are just solving bugs with people instead of code"
        
               | rShergold wrote:
               | That's an excellent phrase. It reminds me of the navy
               | saying "regulations are written in blood"
        
               | MauranKilom wrote:
               | It's actually super related, given that (at least in the
               | medical software sector) you won't get anything approved
               | by the FDA before spelling out the entire software
               | development operation in processes.
        
             | strictfp wrote:
             | In change management they argue that companies tend to
             | purposely slow down change over time to become more
             | predictable and lock in on the "successful route". That
             | certainly mirrors my experience. The only thing I don't
             | understand is why you hire so many people when you let a
             | handful of people gate everything. You might just as well
             | fire 80% of the workforce.
        
       ___________________________________________________________________
       (page generated 2021-11-02 23:00 UTC)