[HN Gopher] Google's infamous internal 2010 "I just want to serv...
___________________________________________________________________
Google's infamous internal 2010 "I just want to serve 5TB" video
now public
Author : raldi
Score : 708 points
Date : 2021-11-02 14:44 UTC (8 hours ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| sergiotapia wrote:
| Much more infamous.
|
| "Mongo DB Is Web Scale" -
| https://www.youtube.com/watch?v=b2F-DItXtZs
| keymone wrote:
| s/in//
| shirleyquirk wrote:
| famous ternal?
| tusharsadhwani wrote:
| that would be s/in//g :)
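The difference between the two substitutions above can be checked with a small sketch. Python's `re.sub` stands in for sed here (an assumption for illustration; the thread itself uses sed syntax): `count=1` mimics `s/in//`, and the default behaviour mimics `s/in//g`.

```python
import re

title = "infamous internal"

# Like sed's s/in// : remove only the first match.
first_only = re.sub("in", "", title, count=1)

# Like sed's s/in//g : remove every match.
every_match = re.sub("in", "", title)

print(first_only)   # famous internal
print(every_match)  # famous ternal
```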
| jboggan wrote:
| Eagerly awaiting the "GoogFellas" video leak.
| loxias wrote:
| And now, in 2021, Google has inflicted their "clarity" on the
| rest of the world. I miss jobs from the 2000s, the jobs where you
| were paid to write software for a living.
|
| You know, engineering! Given a task, or set of requirements,
| develop software on your computer, software which eventually runs
| on the customer's computer, where it's used to solve the
| customer's problem.
|
| My most recent full time employment a year ago was at a _great_
| company. Healthy culture, some of the most talented coworkers
| I've ever had the pleasure to work alongside.
|
| Over the year I lasted there, I used for the first time: Docker,
| Golang, Kubernetes, Terraform, Gitlab, Saltstack, Prometheus,
| (and probably other middleware that my brain has GCd to free
| space). I was barely able to get _anything_ done. At least, it
| always felt that way.
|
| Maybe I'm just an idiot, I don't know. I'd accept it if true!
| What I do know is that I used to be able to _build_ things for
| people, be compensated well for it, and get _satisfaction_ from a
| customer liking what I've built. It was simple.
|
| In this brave new world, with containers, pods and this and that
| and the other thing, where it can take months before one even
| understands enough _primitives_ to do a "hello world".... how
| can anything ever get done?? How can anything inventive,
| creative, or experimental emerge from our industry when the
| develop/test/improve cycle has gone from minutes to weeks or
| months?
|
| I don't know what the future looks like, but the present strikes
| me as unsustainable in the long run.
|
| (edit: Wow! I expected this to be downvoted to oblivion, not my
| highest rated comment on the site...)
|
| <tiny>(Forgive this shameless self promotion: if, dear reader-
| who-is-a-hiring-manager, you have a paid role for a lowly but
| experienced systems engineer who doesn't know anything about
| "web" or "apps" or "social" but is quite adroit with C/C++ (and a
| few others), most "sciencey/mathy" type problems, signal
| processing, firmware, network protocols, automation/scripting,
| and more, ... email is welcome!)</>
| mwcampbell wrote:
| > software which eventually runs on the customer's computer
|
| I gather then that you're not a fan of SaaS. True, one can
| cynically explain the rise of SaaS as rent seeking. But there's
| undeniably value in selling whatever functionality your
| software provides without burdening the customer with having to
| run it on their own computer(s). And when we do that, it's our
| responsibility to make the service reliable, which is what a
| lot of these tools are trying to do.
| loxias wrote:
| > I gather then that you're not a fan of SaaS.
|
| I'm neutral, I think? I don't quite see the point of it would
| be more accurate. I don't think I've used any SaaS in my
| personal life (other than streaming services. Which I'd
| prefer as a local app anyway, and I still do, for music, but
| not video)
|
| I'm sure it's a matter of opinion, not something with an
| objective answer, but "burden of running software on their
| own computers" genuinely confused me as I read your comment,
| I thought "burden? what burden?".
|
| As a user:
|
| If software is designed properly (and most isn't...) you
| download it once, and it runs. Is the burden the time it
| takes to do the download? Compared to the noticeable burden
| of using a webapp, with problems like crappy and frustrating
| responsiveness, an inability to work without an internet
| connection, and frequent inability to handle tasks of real
| complexity, I'd choose a local program any day.
|
| As an employee:
|
| Heck yes SaaS! $/month >>> $$$/customer :D Of _course_ it's
| rent seeking, and I take (and give) no shame in that.
| [deleted]
| thrashh wrote:
| Perhaps we need more specialization but I remember the time
| before these kinds of tools and I hated it.
|
| I'm a lazy person and I absolutely love tools. Tools like Docker
| helped me never have to solve other people's complex
| environment problems again. I love metric reporting tools like
| Prometheus because it helps me front problems before they
| become weekend emergencies. I use a paid Git GUI so I can fix
| complex Git problems without ever making a mistake.
| loxias wrote:
| I'm also a lazy person! Which is why tools like this are a
| PITA to me.
|
| The one exception is Docker. It's not a regular part of my
| workflow, because of how it makes things both harder to get
| started (making a working Dockerfile takes a bit of time),
| harder to debug, and slower to build (I just changed one
| line! Now I have to rebuild the whole image to see if it
| fixed the problem... &c).
|
| However, for _deployment_ of the final product? I agree
| Docker's GREAT. But, consider, in that respect it offers
| nothing I didn't have at the start of my career 20 years ago.
| Static linking for interpreted languages. :)
| pm90 wrote:
| Same. I do not have any nostalgia for when you had to ssh
| into machines and run scripts. Please no.
| mathteddybear wrote:
| Reminder that jobs from the 2000s that you were paid to write
| software for a living include also J2EE and CORBA projects.
| VHRanger wrote:
| In my team, we often deploy internal "services" as cronjobs on
| an EC2 service. This hasn't run into any issues in 24 months.
|
| One of these we decided to move to a more serious
| infrastructure (a set of AWS lambdas). It's failed three times
| in 6 months since, and we're moving it back to be a good old
| cronjob on a server.
|
| Simple is good.
| cheeze wrote:
| What does the cronjob do? Start some service that listens for
| inbound connections? Or are you talking more about daemons
| that do some set of work every interval?
| SamuelAdams wrote:
| Just curious what your cost differences are between a
| dedicated EC2 instance and lambdas. For our organization an
| EC2 instance was at best 8-10 times more expensive than
| lambdas.
| azmodeus wrote:
| The question is also what's the cost of troubleshooting the
| lambda service going down 3 times. 10x more expensive and
| reliable can be a good trade.
| x0x0 wrote:
| likely dwarfed by eng time, both in dollars and opportunity
| cost
| callmeal wrote:
| AWS Lightsail instances are pretty affordable ($10/mo and
| up)
| SanchoPanda wrote:
| $3.5 USD and up
| rvnx wrote:
| I tend to agree with the green dude :|
|
| It's normal to have a production service replicated on 2
| availability regions.
|
| The green guy is annoying, because reality is annoying, and
| reliability is not about luck, but is about a properly
| calibrated and tested process.
|
| Yes, you need to write monitoring, you cannot run only with
| "hope".
|
| Yes it sucks that a DC can go down. Your particular service is
| not important if it's down, but having a copy of the production
| data is essential in case of a catastrophe.
|
| Except for the tests that are probably unnecessary, everything
| else seems to make sense.
|
| The peer bonuses are an issue though.
| ridaj wrote:
| I think the challenge is you'd expect a company like Google
| to have more of the setup be automated. If replication and
| monitoring are such universally good ideas, then why don't
| they come out of the box?
| midasuni wrote:
| Depends what the problem you're trying to solve is. In my
| experience the vast majority of business problems do not need
| that kind of reliability, and if they do they don't need it
| deployed in such a byzantine way.
| zaphar wrote:
| You say that but then the system goes down and the CTO is
| walking up to your desk asking why it's down and exactly
| when can they expect it to be up and don't you know we are
| bleeding money right now?
|
| What you call byzantine an SRE calls necessary complexity
| to meet the needs of your business.
| novok wrote:
| Green guy should be making all of that a one-click process
| to start up a service shell that does all of that for you,
| though. Then as you write it up an automatic linting &
| rules engine will highlight what is missing before you make a
| final pull request to get the necessary human approvals,
| ONCE.
| loxias wrote:
| All of this is true, but I'd wager there's 10, at MOST,
| entities on the planet that are large enough to warrant this
| level of ... "architectural overkill". The other 99.99% of us
| don't need it.
|
| I CERTAINLY don't debate that Google, or Amazon, or Facebook,
| or Netflix, or the phone system, or anything else that
| touches a noticeable percentage of the human race needs
| architecture like this to provide "5 9s".
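For a sense of what "5 9s" asks for, the arithmetic can be sketched as below. This is only an illustration: it assumes a 365-day year and treats availability as a simple fraction of total time, which is one common convention rather than anything stated in the thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    return MINUTES_PER_YEAR * (1.0 - availability)


for label, availability in [
    ("3 nines", 0.999),
    ("4 nines", 0.9999),
    ("5 nines", 0.99999),
]:
    minutes = downtime_minutes_per_year(availability)
    print(f"{label}: {minutes:.1f} minutes/year")
```

Five nines leaves a budget of roughly five minutes of downtime per year, which is why it tends to imply replication and automated failover rather than a human being paged.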
|
| But, just like when "big data" became a buzzword, and many
| people thought their problem needed "big data" approaches to
| solve, the thought that all but a small minority of entities
| need this is Simply Wrong.
|
| I am reminded of a client doing something with genomics about
| 9 years ago. They had some over-complicated "new tool"
| infested approach to solve their "big data" problem, but the
| run times were taking too long. I was brought in as a
| consultant to improve it. After I was done, a data processing
| run that took hours (causing employees to run them overnight)
| before took minutes, or seconds. What did I do? I got rid of
| all the complexity. I replaced their expensive cluster with
| one studly provisioned machine. I replaced their collection
| of networked Java microservices with 1 non-networked
| multithreaded C program. I replaced their XML based format
| for data at rest with something I whipped up, tuned to what
| they actually needed.
|
| Once their "we need big data!" >10TB data set could fit in a
| single machine's memory, the rest was easy. What used to
| "require" a cluster of machines and overnight processing
| could be done interactively, and quickly enough for the
| scientists to get into a much more productive "flow", doing
| dozens of runs per day.
|
| tl;dr: unless you're google (or google scale) you don't need
| all this crap. :)
| gopher_space wrote:
| A lot of it feels like premature optimization. Like I'm
| laying down a heavy infrastructure to support change but it's
| already locking me into certain ways of looking at problems.
| Jensson wrote:
| It isn't premature though, a service is as robust as its
| weakest link, so if you let people write crappy services
| that easily go down and are hard to get back when they do
| then you will get a huge amount of outages in major
| services since they depend on so many small ones.
| loxias wrote:
| Great observation. Perhaps the term "premature
| infrastructure architecture optimization".
| ethbr0 wrote:
| The current landscape (optimizing for hyperscale, at the cost
| of complexity) seems like a natural extension of relatively few
| giant corporations funding the majority of programmers. To such
| an organization, efficiency & time to market are more important
| than simplicity.
| [deleted]
| iamstupidsimple wrote:
| But at least in the FAANG example, time to market is much
| slower because of said complexity.
| thiagocsf wrote:
| I believe it's now called MANGA.
| tester756 wrote:
| not MAGMA?
| [deleted]
| dpryden wrote:
| Non-Googler: What do all those words mean?
|
| Noogler: Haha, this video is so funny!
|
| L4 SWE: (Crying because the video is so true)
|
| L5 SWE: Haha, this video is so funny! I should show it to my
| interns, this will be a good training for them.
|
| L6+ SWE: Why do people think this is funny? This Broccoli Man guy
| makes some really good points...
| [deleted]
| throw1234651234 wrote:
| Non-Googler: What do all those words mean?
|
| Exactly. This wasn't too relatable, even though I have the GCP
| Certified Architect cert.
| dpryden wrote:
| I can't tell if this comment is implying that my comment is
| unclear, or if you're agreeing with the first line of my
| comment.
|
| In either case, though, it's an inside joke precisely because
| it's more relatable to those who are (or were) inside. In
| particular, I think it would be most funny to someone who was
| at Google about a decade ago; when I left Google in 2017
| things had already changed enough that this didn't ring quite
| as true for new hires.
|
| That said, GCP is not very representative of what the
| internal platform looked like circa 2010. (Or even of what
| the internal platform looks like now, as far as I know.)
| [deleted]
| NikolaeVarius wrote:
| Why would internal tooling mean anything to you? And why
| would GCP knowledge be useful in any way?
|
| It's fairly simple to extract the gist of what these systems
| do from the script.
| jjtheblunt wrote:
| As an ex Apple person, i'd say it means there's way too much
| hierarchy at Google? not sure i'm reading it right though
| flatiron wrote:
| we still had our processes though. Radar was my least
| favorite, but they replaced the ant eater app with one that
| was at least partially usable right before i left.
| jjtheblunt wrote:
| we'd say, about spoken mad scientist style requests, if
| it's not in radar, it never existed. :)
| q3k wrote:
| IMO/IME it's the clash between tooling, systems and
| processes designed for running long-term highly scalable
| and reliable services maintained by teams in multiple
| geographical locations and used by billions of people; and
| greenfield projects that just want to get things done at an
| early stage.
|
| Requiring multi-cluster/region, the quota/resource economy
| system, handling PCRs, code review, readability approval
| for complex configuration languages (and the existence of
| such complex languages in the first place) ... all of that
| makes sense in a vacuum and all were built to handle real
| problems and are likely written in the blood of a near-miss
| outage. But it also all comes crashing down on you when
| you're doing things from scratch for a relatively simple
| usecase that no-one really designed for.
| vanderZwan wrote:
| Not sure if "SWE" stands for software engineer, or "Sweden" as
| in Stockholm Syndrome
| [deleted]
| frakkingcylons wrote:
| Oh definitely Sweden.
| keville wrote:
| Why not both? :sob:
| praptak wrote:
| Random synapse activation:
|
| A few years ago there was a Swedish tourist at a hotel where
| I was on vacation. He had a blue-yellow hat with "SWE"
| written on it in Courier font. I felt an urge to steal his
| hat because it looked better than most of the Google-branded
| swag I got as a Google SWE :)
| belter wrote:
| L7+ SWE
|
| My life is a waste but the money is too good...
| kubb wrote:
| This applies to every level, particularly the lower levels.
| ikiris wrote:
| it ain't much, but it's honest crying into piles of money
| tandr wrote:
| How good?
| riknos314 wrote:
| https://www.levels.fyi/company/Google/salaries/Software-
| Engi...
| cperciva wrote:
| Where does "these are really good points, but why don't we have
| tooling which sets everything up automatically?" fit on the
| scale?
| nuerow wrote:
| > _Where does "these are really good points, but why don't we
| have tooling which sets everything up automatically?" fit on
| the scale?_
|
| My guess is it fits nowhere because the L5s don't have the
| ability to automate it, and the L6s think it's trivial and as
| it's done sparingly then it doesn't justify the work to do
| things differently.
|
| And this is why we can't have nice things.
| azornathogron wrote:
| And yet it's been a decade since this video and practically
| everything it mentions is a non-problem now.
|
| No one is spinning up new borgmon instances. Spanner is
| replicated by default. Only very low level services need to
| care about PCRs. If you use one of the approved frameworks
| it will set up practically all the production configuration
| for you. Basic alerting for your service is automated, just
| turn it on, picking cells to run in is automated, scaling
| your service is automated, etc.
|
| Actually getting quota remains a problem... :-p
|
| Anyway I would argue we can and do have nice things, and
| that has happened precisely through the efforts of a huge
| number of people at all levels.
|
| Edit to add: of course, there are always new problems to
| complain about! It's the march of progress after all.
| compiler-guy wrote:
| Yes. If someone were to make this video today, it
| wouldn't be about production jobs and PCRs, it would be
| about privacy reviews and branding approvals.
|
| But the quota issues haven't changed a bit.
| ikiris wrote:
| More like you aren't going to get promoted for automating
| someone else's toil. Also, now who's going to support it,
| better deprecate it since the library changed / got
| deprecated / it's tuesday.
| Jensson wrote:
| > More like you aren't going to get promoted for
| automating someone else's toil.
|
| Lots of people were promoted for automating these things.
| They built easy to use services, got extra headcount
| since they became important and climbed the ranks. So not
| sure why you'd think that.
|
| It may be different at other companies, but at Google
| building stuff that many other engineers depend on is a
| major way to get promoted. Of course if you automate
| something and nobody uses your automation tooling then
| you won't get promoted, but if your work gets used by
| basically every new engineer you'll climb the ranks
| quickly.
| SilasX wrote:
| Yeah, that was my reaction. I get the need for all this
| reliability/failover, but it's a _horrible_ failure of
| abstraction/separation of concerns.
|
| There's no reason the serving team should have to learn how
| to do all of those things on the checklist, since it can be
| done by anyone who's already learned the infra. You're
| expecting them to learn all kinds of stuff outside of their
| specialty, when they should be able to kick the app over the
| wall and let infra ensure that the app is deployed in two
| separate PCR zones with the failover plan etc, which should
| itself be mostly automated.
| q3k wrote:
| > when they should be able to kick the app over the wall
| and let infra ensure that the app is deployed in two
| separate PCR zones with the failover plan etc, which should
| itself be mostly automated
|
| Not entirely - the developers should actively participate
| in designing the actual failover scenario and making sure
| the application can handle that (anything from being okay
| with some downtime due to the failover happening to
| designing an actual multi-region multi-master application).
| Making assumptions like 'infra will handle it' is a great
| way to not only get unexpected outages (because the
| developers assumed there would be no downtime because
| failover is magic, or that writes will never be lost) but
| to also introduce tensions between teams (because you now
| have an outside team having to wrangle an application into
| reliability when the original authors don't give a crap
| about it).
|
| I get and agree with your point, the tooling and processes
| should definitely be simplified/automated when possible,
| and developers deserve a working platform that just works.
| The whole point of a platform team is to abstract away the
| mundane to let people do their job. But reliability is
| everyone's job, not just the infra team's, and developers
| must understand the tradeoffs and technology involved in
| order to not design broken systems.
| SilasX wrote:
| If that's the point:
|
| A) It's doing a horrible job conveying it. A dev _does_
| need to be concerned on how to handle failover, but only
| at a certain abstraction level. They should be required
| to specify something in the form "given server A fails
| and has to pass to B, what do you do?" That does _not_
| require you to know the terminology about PCRs and how to
| make decisions about which cells (or whatever) to pick on
| deployment, or avoiding the "gotcha" about making sure
| the two servers are in different PCR zones.
|
| At that point, it's just following a checklist that needs
| no knowledge of the specifics of the app, and, to the
| extent that it's accurately representing how Google was,
| _is_ indicative of bad processes.
|
| B) Many things _should_ be infra's job, as they're
| cleanly orthogonal to what devs are doing. For example,
| how to apply a security patch to a DB. That's unrelated
| to the operation of the app.
|
| I do get your point though, and I wouldn't say something
| like this about e.g. testing (which was the short,
| "reasonable" part of the video!) -- the devs have
| intimate knowledge of what counts as passing and failing
| and should be writing tests, and not 100% passing it over
| to QA. But that's precisely _because_ such concerns are
| deeply tied in to the thing they _are_ concerned with.
| "SQL 3.4.1 vs 3.4.2" is not.
| q3k wrote:
| Yeah, it seems like we agree :).
| lumost wrote:
| Mega-Caps suffer from the following problem:
|
| 1. There are more engineers making more divergent
| architectural solutions such that there is never a single
| place where you can make changes across the group.
|
| 2. Failures keep happening, so process is instituted with
| many checkboxes for engineers to work through.
|
| 3. Engineers on the small scale stuff get stack ranked
| against the engineers on the big scale stuff. Everyone
| needs to show that they can do the work and are "fungible".
| This leads to small internal systems having the same
| operational standard as large public facing systems.
| SilasX wrote:
| I don't see what that's replying to. Nothing in that list
| would justify demanding that the app's team have
| knowledge or preference about which PCR zones to pick and
| which will just have to be corrected when they inevitably
| pick the wrong one.
| lumost wrote:
| The point is that every team gets to set their own
| failure modes. I know of multiple tier-1 services which
| diverge from at least one best practice.
|
| Think of the scenario where a cloud provider needs to
| evacuate an az. There is no API which would allow the
| compute team to force migrate tens of thousands of apps
| and guarantee that they both are not affected and
| maintain their redundancy guarantees.
|
| Internal services at google are in the same boat. However
| google knows about the hard edges and forces everyone to
| deal with all of that complexity - there is no api which
| the serving team could plug into which will avoid this
| overhead.
| dustingetz wrote:
| Because you have to get it working before you can make it
| better. Abstraction is quite secondary
| SilasX wrote:
| Yes but the video is in the context of a mega-scale mega-
| corp that _should_ have been able to set up clean
| abstraction boundaries by now.
| [deleted]
| Jensson wrote:
| They already have done that, this video is 11 years old,
| at that point Google was half the age it is now and a
| fraction the size.
| omreaderhn wrote:
| That would be 'Xoogler' because Google's engineering and
| broader corporate culture does not reward work like that and
| so when you realize that, you leave.
|
| In general, Googlers have very little idea how far behind the
| rest of the industry they are when it comes to tooling.
|
| I am a Xoogler.
| mwcampbell wrote:
| I got the impression, based on a blog post by Eric Lawrence
| [1], that Google's developer tooling was top-notch (except
| for devs working on open-source projects like Chromium).
| Did it get worse since 2017, or are you talking about a
| different kind of tooling?
|
| [1] https://textslashplain.com/2017/02/01/google-chrome-one-year...
| throwawayfgg wrote:
| Google's developer tooling is top-notch and amazing and
| constantly improving.
| raldi wrote:
| Or: These are really good points for a visibly-user-facing
| post-alpha service, but isn't it a bit overengineered for an
| experimental internal service whose clients can tolerate the
| risk of occasional downtime?
| nostrademons wrote:
| L5 Xoogler who left for a startup.
| nostrademons wrote:
| L9+
| nojvek wrote:
| The Borgmon readability approvers make me chuckle.
|
| At Stripe, there were language approvers. Only those blessed
| could approve PRs. Even XML had a set of approvers. I had a fun
| time getting hold of an XML approver.
| sunyc wrote:
| I actually have Borgmon readability! Peer bonus pls.
| dang wrote:
| Recent and related:
|
| _I don't know how to count that low_ -
| https://news.ycombinator.com/item?id=28988281 - Oct 2021 (259
| comments)
|
| especially this comment:
| https://news.ycombinator.com/item?id=29032656
| mlindner wrote:
| Wow I saw this somewhere a long time ago. But I don't remember
| where and in what context.
| Zababa wrote:
| There's almost some kind of irony in uploading that to youtube, a
| feeling of "why can't I deploy my service as easily as people can
| upload videos to youtube?".
| silentsea90 wrote:
| Actively working towards being a Xoogler so I don't have to live
| in this dystopia.
| devnull3 wrote:
| At 2:05 the green dude asks if you think your users are scum and
| do you hate them.
|
| The funny thing is Google as an org ends up hating their users
| "accidentally" anyway because of their history of pulling the rug
| out from under the services/APIs etc.
| nunez wrote:
| even more ironic given that google+ came out four years later
| munk-a wrote:
| If the users had properly set up a PCR notification about the
| change and registered it to a bigdata instance then they would
| never be surprised about service discontinuations. The moral of
| the story is that you can't fix stupid users. /s
| bluefox wrote:
| BitTorrent has existed since 2001. Get on with the times.
| martini333 wrote:
| Google: The Sunk Cost Fallacy
| GauntletWizard wrote:
| Holy crap! I've been asking for this for forever[1]. Thank you to
| the leaker!
|
| [1] https://news.ycombinator.com/item?id=21786729
| benley wrote:
| You're welcome <3
|
| Here's hoping Google doesn't get mad about it - though after
| 12(ish) years there's really nothing secret in that video.
| Spivak wrote:
| I'm so confused, isn't this just like basic highly available
| infrastructure mixed with a toxic SRE culture?
|
| I want to serve 5TB!
|
| Okay grab two instances in different patching zones, create a
| bucket in our replicated RADOS storage that can hold your data or
| create a table/db in our Postgres cluster, write your app with
| tests, add an entry into the load balancer, add an entry in our
| big ole distributed job scheduler if you need cron, and submit a
| PR against the infra repo to add Prometheus metrics and alerts.
|
| And when you're done with that, set up CI/CD because you shouldn't
| assume that instances are reliable and if you don't give us the
| code to do a deploy we can't recreate your app when the VM goes
| belly up and we'll have to page you.
|
| Are people not used to what it really takes to "just run some
| code?"
| svachalek wrote:
| It totally makes sense for Gmail, but at Google "serve 5TB"
| means something like sort your manager's inbox, something that
| someone somewhere has an interest in doing, or trying, but of
| no real consequence for failure.
| gliese1337 wrote:
| I am used to it, but
|
| 1. It is rare for the details of how to actually accomplish
| each of those steps to be both documented and the documentation
| made accessible.
|
| 2. If you can describe it that succinctly, it really ought to
| be automated. If it can't be automated... then you left
| something out of your instructions, which goes back to point
| (1).
| Spivak wrote:
| Like the steps to do all of this are automated, but we can't
| read your mind. All of this basically boils down to submitting
| a PR against some repo that says "there shall be two
| instances in these regions, there shall be a database in this
| cluster, there shall be a bucket with this name, etc etc"
| that the SRE team reviews and merges, which triggers an infra
| deploy.
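The "submit a PR that declares what shall exist" idea above can be sketched as a small declarative spec plus the kind of check a reviewer or linter might run before merging. Everything here is hypothetical and for illustration only: `ServiceSpec`, `review`, and the region, cluster, and bucket names are invented, not taken from any real infra repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceSpec:
    """Hypothetical declarative request, one per service."""
    name: str
    instance_regions: List[str]  # "there shall be two instances in these regions"
    database_cluster: str        # "there shall be a database in this cluster"
    bucket: str                  # "there shall be a bucket with this name"


def review(spec: ServiceSpec) -> List[str]:
    """Return objections a reviewer might raise; an empty list means mergeable."""
    problems = []
    if len(spec.instance_regions) < 2:
        problems.append("need at least two instances")
    if len(set(spec.instance_regions)) != len(spec.instance_regions):
        problems.append("instances must be in different regions")
    if not spec.database_cluster:
        problems.append("no database cluster named")
    if not spec.bucket:
        problems.append("no bucket named")
    return problems


good = ServiceSpec("serve-5tb", ["region-a", "region-b"], "pg-main", "serve-5tb-data")
bad = ServiceSpec("serve-5tb", ["region-a", "region-a"], "pg-main", "")

print(review(good))
print(review(bad))
```

The point of the sketch is that the requester states intent once, and the rules that would otherwise live in a human checklist are applied mechanically before a person ever looks at the change.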
| q3k wrote:
| People with HA production experience can easily vibe with
| points made by Broccoli Man. Yes, these things make a lot of
| sense if you actually want to get code running reliably,
| especially at scale (organizational and userbase).
|
| But we must not forget how this can look from the point of view
| of someone who hasn't had to deal with a page due to an entire
| datacenter going offline, who's not aware of all the hundreds
| of small things that can go wrong by doing the 'obvious' thing.
| I think the video is more of a way to poke fun at the optics of
| this (and some of the overly arcane stuff involved), rather
| than at the idea of high availability being useless. At least
| that's how I've always felt about it, a way to remind SREs to
| respect their internal users (simplify! automate! document!)
| and that what makes sense to them might look ridiculous to
| others.
| cromwellian wrote:
| As a Googler, it's often easier for me to set up a GCP consumer
| account, AWS, or Heroku account to demo something, compared to
| using anything internal. I remember the most annoying situation
| was like 10 years ago when me and other engineers ported Quake 2
| to run in Chrome, we were in a time crunch to demo it
| multiplayer, and I ended up setting up an AWS account to serve
| it. But then I left it running and forgot and ended up getting a
| few hundred dollars billed to me because the Quake2 server was
| chewing CPU.
| bamboozled wrote:
| I could imagine you're violating some pretty strict policies
| doing this?
|
| You're taking proprietary code and running it on a competitor's
| platform?
|
| I like to think I'm pretty open minded about stuff like this,
| and I've actually done something similar, but I'd be surprised
| if you didn't get your ass handed to you for that type of
| thing?
| cromwellian wrote:
| No, no proprietary code was used, the port was done from the
| Open source Java clone Jake2:
| https://en.wikipedia.org/wiki/Jake2
|
| We ported it by using Google Web Toolkit Java->JS compiler,
| and replaced OpenGL with WebGL, and all of the other bits
| with Web APIs (websocket, pointer-events, fullscreen-api,
| filesystem api, etc)
|
| The assets (proprietary artwork, levels, etc) were not hosted
| on AWS, it simply downloaded the EXE file from ID servers and
| extracted it in the browser.
| bamboozled wrote:
| You said this:
|
| > As a _Googler_, it's often easier for me to set up a GCP
| consumer account, AWS, or Heroku account to demo something,
| compared to using anything internal.
|
| I get you're trying to make a point of saying you can do
| something easier elsewhere, but then why even throw in
| the "as a Googler" bit without clarifying that you're not
| really working on anything of consequence where you'd be
| actually asked to host things internally.
|
| You're basically hosting open source projects on AWS.
| breakfastduck wrote:
| Looks like someone is desperate to get into an argument
| cromwellian wrote:
| For those interested: The original project
| https://code.google.com/archive/p/quake2-gwt-port/
|
| GitHub (Stefan Haustein is my genius teammate who did all of
| the heavy lifting on the OpenGL -> WebGL piece)
| https://github.com/stefanhaustein/quake2-playn-port
|
| You can still play it here, on AppEngine
| http://quake2playn.appspot.com/
| mabbo wrote:
| Why? Do you think there's a risk of Amazon stealing code from
| a customer?
|
| No matter what code they took, the cost would never be worth
| it for them.
| bamboozled wrote:
| If I took my company's code and hosted it anywhere except
| where I was authorized to do so, I'd expect flak for it.
| I'm less worried about Amazon stealing it, but it seems
| like a silly place to put it nevertheless.
| BoorishBears wrote:
| This is what separates companies that get things done from
| companies that pay a lot of people to hopefully maybe get
| things done.
|
| A server for a multiplayer Quake port...
|
| Who is Google paying to hand their ass to them over that?
|
| Who has both the authority to hand their ass to them, and a
| lack of discretion to not let it end at "well in general we
| don't do that, but I see why you did it and there's little to
| no risk"
|
| -
|
| At some companies yes, someone is paid to go "I caught
| someone putting our proprietary code up on a competitor's
| platform!!!!" and no one will actually think critically about
| what exactly was proprietary, so someone putting up a quake
| demo might as well have put up the Coke secret formula
|
| And now OP who actually generated some value at little to no
| risk gets their ass handed to them and someone who simply
| lacked the skills to realize the risk profile gets a notch on
| their "I add value and earn my paycheck" badge.
| potatoman22 wrote:
| Can't let the Quake 2 source code escape Google
| didip wrote:
| What is "Borgmon readability" and why was it important? I think
| that's one of the punch lines of the video.
| yeputons wrote:
| If you change source files in language X, someone proficient in
| language X (aka "has X readability") should approve that it
| corresponds to Google Code Style in language X.
|
| You start without it and may obtain once you've written a bunch
| of code in language X.
|
| I'm not sure if there is really a Borgmon readability. But if
| there is, it seems like Borgmon configuration files are both
| common (so that there is a readability requirement) and
| uncommon (so that there are very few people with readability).
| nunez wrote:
| still hilarious
| birken wrote:
| Hey... those of us that worked on Google's internal Bigtable
| service worked very hard so you _didn't_ have to file a ticket
| to set up replication between your Bigtable cells.
|
| The rest does seem about accurate though.
| gcampos wrote:
| What exactly are these "peer bonuses"? Is it real? Is it what I
| think it is? Do people actually use them as bargain chips?
| B-Con wrote:
| You can send a small, semi-official "thanks for a job well
| done" to someone else and it comes with a few bucks attached.
| People joke about using them nefariously (as people tend to
| joke), but I've only seen them used appropriately.
| advisedwang wrote:
| Yes, they are real; it's in the low hundreds of bucks range.
| They must be approved by the recipient's manager. There is also
| a limit on how many an employee can send (but the limit is fairly
| high). There is also "kudos" which comes with no money, but has
| no limits or approvals required.
|
| They are intended to be used for going above and beyond, not
| for stuff that falls within the scope of one's job. Using them
| as a bargaining chip is explicitly against policy.
| nunez wrote:
| you get some money ($150/bonus, IIRC) for helping someone out,
| assuming manager approval
|
| akin to the usual corporate "thank you" gift card, but more
| money and generally easier to distribute
| q3k wrote:
| > What exactly are these "peer bonuses"? Is it real? Is it what
| I think it is?
|
| Each month, you can nominate another employee for a small
| bonus. This is designed to be given to coworkers who have gone
| above and beyond what was expected from them.
|
| > Do people actually use them as bargain chips?
|
| From my experience it's so over-the-top absurd that it would be
| difficult to have someone interpret such an offer as anything
| other than a joke or a meta-joke.
|
| https://blog.bonus.ly/a-look-at-googles-peer-to-peer-bonus-s...
| tazjin wrote:
| It's not each month. You can send a lot of them. There's a
| theoretical limit and a bunch of restrictions but in practice
| they're unenforced.
| guyzero wrote:
| Each one has to be manually approved by the recipient's
| manager, so this can't happen. It's a joke.
| compiler-guy wrote:
| People don't use them as bargaining chips most of the time--it
| is explicitly against policy. I'm sure it happens sometimes.
|
| What they do do is send one when someone else does something
| nice (like fix a bug from a project they have left or whatever
| else). If you ever need something similar again, the person you
| peer bonused has warm fuzzies about the experience and a hint
| that they might get it again.
|
| People also use peer bonuses during perf time to demonstrate
| that the work they are doing impacts other people enough for a
| somewhat uncommon thank you.
| dekhn wrote:
| Thank you, whoever did this! I asked for it in a comment
| recently.
|
| This video basically is making fun of a common situation of
| Google at the time, where a person wants to serve up some data
| for analytics, but the sysadmins expect the person to follow a
| process intended for much more complex and high availability
| services run by teams of skilled engineers.
|
| It parodies SRE as a BOFH sysadmin, even though in general SRE
| are quite easygoing and helpful.
|
| It helped poke fun at a number of overly stuffy processes and
| also helped push people to make hosting modest datasets (like
| this 5TB one) easier.
| smartician wrote:
| It's not much different today. Nowadays you'll also need
| privacy review, accessibility review, security review, and
| diversity & inclusion review.
| ntaylor wrote:
| _diversity & inclusion review_
| cynicalkane wrote:
| Assuming this is sarcasm, you realize Google has a massive
| userbase all over the globe from all walks of life, right?
| Does it make business sense to accidentally exclude certain
| people? Or ethical sense?
| killerstorm wrote:
| Businesses exclude people all the time. E.g. many videos
| are geoblocked, and there's no way to view or purchase
| them in some countries.
|
| Here are some other examples: I can use free version of
| Google Colab from Ukraine, but I can't pay for Pro
| version. (I can pay for Google Cloud, though.)
|
| OpenAI blocks API dashboard access to IP addresses from
| Ukraine. (But it is OK if I use VPN LOL.)
|
| So it seems blocking ppl is the norm. I guess "diversity
| and inclusion" is mostly about social topics within US,
| not about not excluding people.
| londons_explore wrote:
| In general it's about not _accidentally_ excluding
| people. All the cases you propose are deliberate blocks
| for various (mostly legal) reasons. The deliberate blocks
| are considered in the review, and as long as there is a
| sound business case for launching with the exclusion, it
| goes ahead.
| dustintrex wrote:
| You're running into US sanctions issues (Crimea), not
| woke Google policy.
| bbarnett wrote:
| Nothing is all inclusive. Nothing.
| bufferoverflow wrote:
| Death and taxes.
| oriki wrote:
| Is your argument here supposed to be "Nothing is all
| inclusive, therefore we shouldn't even bother trying"? If
| so, I'd argue that's a lot more ridiculous than a review
| process designed to help catch major inclusivity issues
| before they become problems.
| mikepurvis wrote:
| Sure, but that's not a reason to not even ask the
| question. Maybe not every DI initiative turns out to be
| helpful or productive, but as someone who's privileged on
| pretty much every axis there is, I'd be grateful for the
| kind of internal support system that could give me an
| early warning sign for "hey, this design decision that
| made sense to you and your team has the potential to
| alienate user base X and there's a real possibility that
| if we launch in this state it's going to explode into a
| minor Twitter scandal."
| brailsafe wrote:
| Isn't this just called user testing? Also this is in the
| context of a fucking dataset. If data needs to go through
| DI in case something blows up on Twitter, I guess it's
| sad state we're in.
| davidcbc wrote:
| If, for example, the dataset only contains white faces
| and is intended to train facial recognition then yes, it
| needs to go through some kind of DI review.
| brailsafe wrote:
| Wouldn't this review be done on the data collection and
| planning side, rather than at point of publishing though?
| Surely you can publish datasets of just white faces or
| just black faces if during planning that's what you
| intended to do for some reason?
| davidcbc wrote:
| I mean, maybe, but you still might need it to be
| reviewed. You don't have to wait until you're about to
| launch to start these kinds of reviews and if you know
| that some kind of DI review is necessary for your project
| you should start talking to the reviewers as early as
| possible, especially if you are making a potentially
| controversial design choice.
| Volundr wrote:
| Does it? Seems to me data is a prime place for exclusion
| to occur. Example: a dataset of tagged photos for
| training a neural net to analyze facial expressions. All
| the photos are of white faces.
| pilsetnieks wrote:
| Perfect is the enemy of good.
| teawrecks wrote:
| Science is always wrong. Always.
| xmprt wrote:
| I agree with you but it sometimes seems like Google
| doesn't care at all about it when they have the kind of
| customer support processes that they have.
| kevingadd wrote:
| Customer support is after the fact, reviews are before
| the fact. It's very cheap to do these reviews before
| launch and then you can point at those to say "we're
| trying!" while not providing any customer support.
| protomyth wrote:
| Google can talk when they stop using a license by a
| domain squatting org who revised their history and has a
| pretty offensive line on their front page. _COMMUNITY-LED
| DEVELOPMENT "THE APACHE WAY_ indeed. Worse, most of the
| links on Google search point to the org and not the
| actual tribes.
| brailsafe wrote:
| Does it make sense to serve a dataset without approval
| that it's inclusive enough? Yes, because that's typically
| how things in the world work.
| fyd6gexygsydy wrote:
| I don't understand this line of reasoning since it
| assumes inclusion training actually promotes inclusion.
| My experience has been that it usually means
| racial/gender intersectionalism training that everyone
| gets to swallow regardless of culture or belief because
| it's what white people in the us tech industry are
| passionate about right now.
| tester756 wrote:
| >accidentally exclude certain people?
|
| e.g how? could you provide some examples e.g two?
|
| there's a lot of talk about this stuff when it comes to
| MAGMA, yet docs still use some auto-generated
| translations which suck.
| davidcbc wrote:
| https://sitn.hms.harvard.edu/flash/2020/racial-
| discriminatio...
|
| https://futurism.com/delphi-ai-ethics-racist
|
| https://www.nytimes.com/2019/04/25/lens/sarah-lewis-
| racial-b...
| tester756 wrote:
| It seems like these kinds of problems occur mostly within
| some specific areas, meanwhile OP seems to suggest that
| this kind of review should be applied to everything.
| kukx wrote:
| By the same logic we can justify any [social issue]
| division. The sad thing is that the rules are arbitrary
| and do not help in solving the issue. Actually it is in
| the interest of the division to create or exaggerate
| problems to justify its existence.
| pangolinplayer wrote:
| Based
| sayhar wrote:
| Hello, I wasn't aware we were on /r/politicalcompassmemes
| ranger_danger wrote:
| have to make sure there's no trans jokes in there.
| throw10920 wrote:
| > diversity & inclusion review
|
| Is this tongue-in-cheek, or are you serious? Poe's law and
| all that.
| smartician wrote:
| Partly tongue-in-cheek. These review processes exist, but
| whether they're required or not depends on the product area
| and type of project.
| kevingadd wrote:
| If you're publishing a dataset in the terabytes it does
| actually make sense to at least do a pass over it and make
| sure the data you're using isn't skewed in any undesirable
| way that would cause problems down the road. For example,
| if you're releasing 5tb of face photos for training facial
| recognition nets, it would certainly be a problem if all
| the faces are white women or asian men - the result would
| probably be over-fit and not perform as well for people in
| other categories. It would be correct to call that a
| diversity/inclusion issue.
|
| Privacy and accessibility reviews serve similar purposes
| there, you're reducing risk by checking for these various
| problems and ideally they also spot ways to improve the
| quality of your outcomes.
| murph-almighty wrote:
| It's common in fintech for data/ML models to go through
| similar overview. If you happen to disenfranchise a set
| of people because your model said not to lend to them,
| you risk legal jeopardy.
|
| To clarify, I think it's good that this is a practice.
| londons_explore wrote:
| A review doesn't necessarily mean you need to resolve all
| diversity/inclusion issues. It can merely require that
| you _identify_ the issues and understand the risks of not
| resolving them.
| dekhn wrote:
| the 5tb was performance data collected from servers
| kevingadd wrote:
| Sounds like the reviewer would glance at it for 5 seconds
| and say 'ok'
| rodgerd wrote:
| Perhaps Google don't want to be in the news for identifying
| dark-skinned people as monkeys again?
| jjeaff wrote:
| I can't remember which company it was that launched a
| camera with face identification features, but that didn't
| recognize any face that wasn't lily-white like every
| single engineer that worked at that company. They could
| have probably benefited from a diversity and inclusion
| review. Heck, employing a single brown engineer or even
| QA engineer probably would have been enough to notice
| that before launch.
| dekhn wrote:
| having launched some product at Google in my day, I know
| quite well how to skate through that process (although D&I
| was not part of it when I filled out my forms). Sadly for my
| friends in privacy and security, it's not hard for product
| teams to exploit Google's propensity to launch and override
| privacy and security concerns.
| Wonnk13 wrote:
| One of the few things I miss about my time there...
|
| Never did get Java readability :(
| jazzyjackson wrote:
| what am I looking at here
|
| EDIT: it has been explained to me:
| https://rachelbythebay.com/w/2021/10/30/5tb/
| opinion-is-bad wrote:
| The multiple repetitions of "This is Google" hit home for me. I
| never worked as a software engineer so much of the rest is out of
| scope to my experience, but the constant idolization of Google,
| and by proxy each other for working at such a place, eventually
| changed from feeling coy to cultish.
| xiphias2 wrote:
| I wish I would have had this video before 2010. I got paged at
| night every time there was a PCR failover, and I didn't know what
| to do with it. This video is better than all the extensive
| documentation that we had.
| [deleted]
| cletus wrote:
| Ah, this takes me back (disclaimer: Xooger, 2010-2017). It's
| painful and funny because it's true. Or was true.
|
| Rumour had it that the Borgmon readability requirement was
| removed when Sergey saw this video. I don't know if this is true
| but that's what I heard.
| DaiPlusPlus wrote:
| Pray tell, what is/was Borgmon?
| twinge wrote:
| A system for alerting based on time-series data, with its own
| rule language. The language (along with many others) required
| that authors demonstrate they can adhere to the style guide
| by going through a process to obtain readability.
|
| https://sre.google/sre-book/practical-alerting/
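|
| To make "alerting based on time-series data" concrete, here is a
| minimal Python sketch of the general idea. It is not Borgmon's
| rule language; the names Sample, Rule, and firing are
| hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Sample:
    """One point in a time series (hypothetical type for this sketch)."""
    timestamp: float  # seconds
    value: float


@dataclass(frozen=True)
class Rule:
    """An alert rule: a predicate applied to the mean over a trailing window."""
    name: str
    window_seconds: float
    predicate: Callable[[float], bool]


def mean_over_window(series: Sequence[Sample], now: float,
                     window_seconds: float) -> Optional[float]:
    """Average the samples whose timestamps fall in [now - window, now]."""
    recent = [s.value for s in series
              if now - window_seconds <= s.timestamp <= now]
    if not recent:
        return None  # no data in the window, so nothing to evaluate
    return sum(recent) / len(recent)


def firing(rule: Rule, series: Sequence[Sample], now: float) -> bool:
    """Return True when the rule's predicate holds for the windowed mean."""
    mean = mean_over_window(series, now, rule.window_seconds)
    return mean is not None and rule.predicate(mean)
```

| A rule such as Rule("HighErrorRate", 120, lambda m: m > 0.05)
| would then be evaluated against a series of error-rate samples
| each time new data arrives.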
| [deleted]
| sleepydog wrote:
| It's a language and supporting infrastructure for collecting
| and querying time series data for monitoring.
|
| It was replaced a long time ago by a new system called
| monarch, but a few holdouts will probably continue using
| borgmon until the heat death of the universe.
| dekhn wrote:
| prometheus 0.1
| jensensbutton wrote:
| This is the correct answer.
| leg wrote:
| It is true that Borgmon readability went away due to this
| video. It wasn't Sergey, it was an eng director.
| ikiris wrote:
| "no one has borgmon readability". years later and i still die
| laughing.
| sbpayne wrote:
| I'm so glad I can see this again. I forgot how much I missed
| this.
| metanonsense wrote:
| Move fast and break things! And while you are at it, please,
| don't break anything.
| mseepgood wrote:
| This monotonous speech synthesis is annoying to listen to. The
| delivery of the jokes is awful. Who can sit through a 3 min video
| like that?
| drannex wrote:
| That's what makes this even more funny.
| pas wrote:
| https://m.youtube.com/watch?v=b2F-DItXtZs
| zaphar wrote:
| I think it mostly works best when you've lived it. Which I did.
| And the resurfacing of that video brought back a lot of
| memories.
| zucked wrote:
| This was an output of a free (now defunct) service that used to
| accept transcripts and pump out these videos with TTS audio. It
| led to some hilarious results, usually within niche
| communities. Around ~2010 these things were everywhere.
| kgin wrote:
| Somehow it makes it funnier to me
| nunez wrote:
| this is making me miss memegen
|
| google had its downs, but wasting hours on memegen was not one of
| them
| slac wrote:
| I have the t-shirt!
| jamestimmins wrote:
| As an external user who has found Google's services to be
| incomprehensible, it's nice to know it is (was) equally as
| painful internally.
| frakkingcylons wrote:
| For anyone else who'd rather read than watch this video, here's
| the transcript (from YT's auto-generated captions):
| https://pastebin.com/8UrFftM6
| throwaway20371 wrote:
| These kinds of organizational problems happen everywhere; that
| doesn't bug me. What bugs me is when leadership knows about it
| and doesn't care. After low-level engineers stick their
| professional neck out to complain in internal town halls and
| through feedback forms, and leadership gives some bullshit answer
| that doesn't address or even acknowledge the problem. It would be
| less infuriating if they just said "I don't give a shit." It's
| the weasel words and pretending the problem doesn't exist that
| infuriates me. A lot of the time it doesn't even take much work
| at all to begin addressing the issue, like a working group for
| continuous improvement of highly-painful high-value processes.
| You don't even have to solve it. Just _attempt_ to address it.
| TideAd wrote:
| My team has issues deploying builds to test machines. It's like
| 15 steps and takes an hour. The tooling is atrocious and
| recently got even worse.
|
| We eventually found the team responsible for this (the org
| structure is hard to penetrate because no one answers emails).
| They said they had no idea anyone was dissatisfied. Then they
| said that it was a low priority so they didn't care and nothing
| would be done.
|
| In my experience, you can usually convince an engineer that
| their stuff has a problem and they need to fix it. But it's
| often impossible to convince management if they aren't on the
| hook for user satisfaction.
| ts4z wrote:
| To be fair, they did, and many things have improved. And this
| video was used as an uncomfortable reminder to make some of
| those changes.
| calmlynarczyk wrote:
| I work at a global corporation with 50,000 employees. Even
| though I've never been at Google I felt every pain point this
| video was getting at because our company is trying to implement
| all of this stuff right now.
|
| "Oh you want to go to production? Here's a list from A-XX
| stating what you need to accomplish that." Thing is I thought
| they actually handled this gracefully when I started because
| lots of requirements were tiered with various criteria you had
| to meet to move up (mostly for brownie points).
|
| But then one day the Tech Execs lose their minds and decide
| "everything needs to meet all criteria for every single
| process." You want to create an S3 bucket to store data? That
| will be a week of submitting paperwork and another month of
| meetings and approvals from various teams you've never heard
| of. Plus you have to register your schema, implement data
| quality checks, unit tests, regression tests, get a PR and CO
| approved for your central config change, remediate any CVEs in
| the tooling that you used, and build all of this using our in-
| house CI/CD platform we created because we're just soooo
| special. Now you're allowed to launch. Oh wait, NO because
| we've put the entire corporation on hold from launching new
| systems for the last calendar year because we're still trying
| to agree on the final process everyone needs to follow to go to
| production.
|
| It's surreal how universally so many orgs make the same
| mistake of trying to throw more and more process at problems.
| unethical_ban wrote:
| In my previous role, the secdevops groups (matrixed teams)
| were building custom terraform modules for our devs to use in
| order to easily deploy compliant AWS infrastructure - and
| devs could _only_ deploy via terraform /CI-CD. While TF
| specifically states that custom modules are not meant to be
| used as wrappers, I thought it was a clever way to try
| getting security "out of the way" while still enforcing best
| practices.
| darkwater wrote:
| > While TF specifically states that custom modules are not
| meant to be used as wrappers
|
| What do you mean with this?
| acdha wrote:
| > It's surreal how universally so many orgs make the same
| mistake of trying to throw more and more process at problems.
|
| Followed by the inevitable ranting about "shadow IT", AKA the
| requirements gathering they really should have done.
| m0zg wrote:
| At Google back then "leadership" might as well not even show
| up. It was super bottom-up, and _you_, not "leadership" were
| supposed to identify and fix issues. No "leadership" would stop
| you, either, at least in most cases. I don't believe that in
| all my years there anyone ever told me what to do. It was very
| easy to start projects, shut down projects, get headcount, get
| resources (if your business case is sufficiently persuasive to
| others). Not a complete free for all, but certainly _a lot_
| more freedom than you'd normally see in companies of that size.
| And (IMO) people used that freedom and autonomy pretty well.
|
| That kinda deteriorated over time, culminating with Sundar
| "McKinsey" Pichai, and then went rapidly downhill from there,
| and now I flat out reject their recruiters, based on the
| feedback from friends still employed there.
| Imnimo wrote:
| What I don't get is why they wouldn't just use MongoDB. MongoDB
| is web-scale.
| hinkley wrote:
| /dev/null is also web-scale
| DeepYogurt wrote:
| Is /dev/null fast? I will use /dev/null if it is fast.
| flatiron wrote:
| does it support sharding?
| closeparen wrote:
| It supports sharting: https://github.com/dcramer/mangodb
| sondr3 wrote:
| And available as a SaaS: https://devnull-as-a-service.com/
| nostrademons wrote:
| That was a major impetus for this video, IIRC. The "MongoDB is
| web-scale" video went around Google about a month before
| Broccoli Man and some enterprising Googler figured they could
| use the same software to make a satire of Google's internal
| tools.
| hedgehog wrote:
| Link for the Mongo video:
| https://www.youtube.com/watch?v=b2F-DItXtZs
|
| And bonus lean startup video:
| https://www.youtube.com/watch?v=3J9KhpgYVB0
| fragmede wrote:
| MongoDB is web-scale:
| https://www.youtube.com/watch?v=b2F-DItXtZs
|
| NSFWish; it gets a bit personal around 3:11
| alexjplant wrote:
| I had a similar conversation with a heavily-intoxicated
| MongoDB sales guy in a diner at 1AM after the second day of
| KubeCon 2019. My concerns were primarily around data
| consistency issues during denormalization and lack of
| schema. His pitch was essentially "Who cares?! I'm getting
| [three-letter agency] to move _everything_ to Mongo because
| it's so cheap and easy! It's all just JSON! Why does it
| need a schema?!"
|
| He probably made more than I did that year so maybe he has
| a point ¯\_(ツ)_/¯
| mlindner wrote:
| I miss 2010.
| vinay_ys wrote:
| Ah, 2010 - when web scale and its secret sauce - sharding
| was all the rage.
| vorticalbox wrote:
| Maybe because mongodb had been out less than a year in 2010?
| gnabgib wrote:
| I think you missed the /s from GP.
| vorticalbox wrote:
| Quite likely.
| 323 wrote:
| But is it planet-scale?
| swalsh wrote:
| That's out of date, we're now in the days of IPFS.
| anshumankmr wrote:
| What is this exactly?
| ts4z wrote:
| Xtranormal was a video service that would animate scripts with
| some stock characters.
|
| Someone made a bit of internal-only snark, and "I just want to
| serve 5TB" became an in-joke for turning easy problems into
| exercises in frustration.
|
| Some of these things have, actually, been addressed.
| SilasX wrote:
| Wow, kind of funny that Xtranormal now lives on in the few viral
| videos that were made with it.
|
| Here's where the company is now (the original domain is used for
| something else now):
|
| https://en.wikipedia.org/wiki/Nawmal
| raldi wrote:
| Background: https://rachelbythebay.com/w/2021/10/30/5tb/
|
| This video was hugely influential on changing the way Google does
| internal tools and operations.
| quelltext wrote:
| How did things change?
| vechagup wrote:
| There's been a big investment in server platforms that strive
| to enable SWEs to build a new service that follows Best
| Practices with as little knowledge and handholding as
| possible. These consist of conformance tests that yell at you
| while you're coding if you are trying something generally
| thought to be bad, and semi-automated workflows that help you
| bring your code to production. When everything works as
| intended, the production workflows set up a decent set of
| alerts, acquire resources, configure CI/CD pipelines, and
| launch your jobs with just a few button presses on your part.
| (In practice, one of the steps will probably require
| debugging, but eh, it seems way better than the broccoli man
| video.)
| scottlamb wrote:
| I think you can read about some of these changes in Google's
| SRE and SWE books (even if they don't mention this video in
| particular), at least the ones most likely to be interesting
| to someone outside Google.
|
| But dropping Borgmon readability was the most immediate and
| obvious. It was basically true that no one had Borgmon
| readability. The policy was a catch-22: you couldn't get
| readability for the simple/formulaic Borgmon macro
| invocations that were encouraged and often sufficient. You
| could only get it for doing something "clever". I got it by
| writing fancy borgmon rules to paper over a problem that (in
| hindsight) I should have solved elsewhere.
|
| Another was easing quota management. IMHO the most
| unbelievable thing in the video was that after Broccoli Man
| told Panda Woman to get quota in two cells, she just said
| "done". Besides the hassle in transcribing what you needed
| into the request system [1], various types of quota were
| chronically unavailable where you needed them, even in tiny
| amounts. In 2010, I kept a critical infrastructure service
| running by regularly IMing major clients' on-calls asking
| them to donate 0.1 cpu(!) of their quota in some cell or
| another when I didn't have quite enough to grow. There was a
| "gray market" mailing list where people would trade resources
| they couldn't get through the primary system. But eventually,
| they built a system that for small services would make the
| quota just happen for you.
|
| Overall, it was a kick in the pants for the most basic
| infrastructure teams that made them see how unnecessarily
| hard this is for their internal customers, prompting them to
| make small things just happen while keeping large things
| possible. In any large organization, it's healthy to get this
| kind of feedback regularly. The actual specific changes and
| technologies are pretty specific to Google in 2010...
|
| [1] Many people managed this very very tediously with
| spreadsheets. I eventually wrote a tool to generate the
| requests based on comparing your intended production config
| with your current quota.
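|
| The comparison described in [1] can be sketched roughly as
| follows. This is not the original tool; the function name
| quota_requests, the (cell, resource) keys, and the sample cell
| names are assumptions made up for illustration.

```python
from typing import Dict, Tuple

# (cell name, resource name) -> amount; a hypothetical shape for this sketch.
Quota = Dict[Tuple[str, str], float]


def quota_requests(intended: Quota, current: Quota) -> Quota:
    """Return the shortfall per (cell, resource) that would need requesting.

    Only positive shortfalls are reported; resources where the current
    quota already covers the intended config are omitted.
    """
    requests: Quota = {}
    for key, needed in intended.items():
        have = current.get(key, 0.0)  # no entry means no quota there yet
        shortfall = needed - have
        if shortfall > 0:
            requests[key] = shortfall
    return requests
```

| Given an intended config that needs 4.0 cpu in "cell-a" and a
| current quota of 3.5 there, the sketch would report a request
| for the missing 0.5 cpu and nothing for resources that are
| already covered.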
| jeffbee wrote:
| Production priority quota horse trading in the days before
| it was easy was a real skill. But non-production quota was
| free and virtually infinite, even in those days.
| compiler-guy wrote:
| The most obvious change that came from this video is that
| Google abandoned the Borgmon readability requirement. At
| Google, every change needs approval from someone who has
| passed a detailed style-guide review process in the given
| language.
|
| Now readability is granted over multiple changes; it used to
| require one fairly big one. It's still a pain in the languages
| that require it-- which is all the main ones, but very few of
| the niche ones.
|
| Many other things changed as well. Much of what the video
| complains about got automated and better documented. But the
| company has grown so much, and the product lines have
| diversified so dramatically, that there are still plenty of
| places to complain about the overhead.
| m0zg wrote:
| I've proudly managed to avoid Borgmon in favor of Monarch.
| Which was new at the time, but worked all right even back
| then. I have a lot fewer gray hairs because of that. They
| should have kept and rigidly enforced the Borgmon
| readability requirement to force people to migrate off that
| convoluted, idiosyncratic piece of shit.
| StillBored wrote:
| I've never understood places with rigid style guides
| policed by people. It's idiotic, because we have computers
| and in places like google presumably a fair number of them
| know basic parsing/lex sufficiently that if they can't make
| a tool like clang-format that automatically reformats on
| save/commit/whatever then they can use a tool like clang-
| tidy to toss warnings during a development/CI/whatever
| phase.
|
| Putting people in charge of formatting/style is just an
| excuse for wasting time bikeshedding; either the code is
| wrong and a tool can tell you, or it's not wrong.
| btilly wrote:
| The hypothetical discussion about readability is
| pointless.
|
| Let's make it specific. Read
| https://google.github.io/styleguide/cppguide.html for
| readability for a language, namely C++. All the things
| that can be automated, automatic tools have been written
| for. But, for example, you can't automate "Prefer to use
| a struct instead of a pair or a tuple whenever the
| elements can have meaningful names." Because what does it
| mean for a name to be meaningful?
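|
| As an illustration of that guideline, here is the same idea
| transposed to Python (the guide itself is about C++ structs vs.
| std::pair/std::tuple). The Endpoint type and both parse
| functions are hypothetical examples, not taken from the guide.

```python
from typing import NamedTuple, Tuple


def parse_endpoint_tuple(text: str) -> Tuple[str, int]:
    """Positional result: callers must remember that [0] is host, [1] is port."""
    host, _, port = text.rpartition(":")
    return host, int(port)


class Endpoint(NamedTuple):
    """Named fields document themselves at every call site."""
    host: str
    port: int


def parse_endpoint(text: str) -> Endpoint:
    """Same parsing, but the result carries meaningful field names."""
    host, _, port = text.rpartition(":")
    return Endpoint(host=host, port=int(port))
```

| A formatter can check that Endpoint is spelled in the right
| case; whether "host" and "port" are the right names for the
| problem at hand is the part that still takes a human reviewer.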
| StillBored wrote:
| "Because what does it mean for a name to be meaningful? "
|
| Are you optimizing for someone who already knows all the
| project lingo, or someone who doesn't know any of it?
|
| Are your engineers native English speakers?
|
| There are a whole bunch of things which make the perfect
| variable name frequently less than perfect, and putting
| project insiders in charge likely yields the opposite
| result.
|
| Take: https://elixir.bootlin.com/linux/latest/source/mm/k
| hugepaged...
|
| If you don't know what a vma, pte, pfn, compound_page,
| young pte, huge page, lru, etc. are, you're going to be unable to
| even begin to understand what that code is doing, despite
| those all being pretty reasonable variable names and
| actually fairly industry standard concepts. It gets worse
| as you move to more esoteric topics. Expanding pte to
| PageTableEntry might help some subset of users, but at
| the expense of those that work on the code daily. So who
| do you optimize for? Is it readable if the only people
| that can read it already know what it does?
| gravypod wrote:
| Formatting and readability are two separate concepts (as
| other replies have pointed out). I'd like to specifically
| point to a fantastic example of what we mean when we say
| "readability": https://www.youtube.com/watch?v=wf-BqAjZb8M
|
| Someone with readability in a language, who keeps up with
| the style recommendations, will generally produce code
| that is easier to read by other engineers.
| kccqzy wrote:
| That's not what readability is. There are plenty of
| automated tools that will give you results from running
| lint, ClangTidy and other tools. Readability is mostly
| about structuring your code well to be easily read. It's
| about architecting your code within a single file. It's
| about telling a junior SWE who reinvented the wheel to use
| a library function he/she didn't know about instead.
| StillBored wrote:
| So the rules can be codified sufficiently to test people
| on, but they can't be codified for a computer?
|
| The only one that sounds more difficult to codify is
| telling people of the existence of duplicate functions.
| But as someone who contributes to the linux kernel, I can
| tell you right now that the only way that works reliably
| is to have a very large pool of reviewers. Very
| experienced engineers frequently miss what people are
| doing in other parts of the source base, the name might
| not be what they expect, etc, etc, etc. In the case of
| linux there are a fair number of duplicates, or similar
| functions, and people write coccinelle patches to replace
| them on a fairly regular basis after they have been in
| the kernel for years.
|
| So, I doubt giving someone a formal gatekeeper flag,
| really helps vs just having wider change review.
| compiler-guy wrote:
| I know of no automated tools available today that can
| determine if an identifier is accurately and usefully
| named. They can all tell if you are using the proper
| case, but that doesn't really tell you anything.
|
| No tool like that tells you if returning a bool instead
| of an enum is appropriate here, or that a reference vs a
| pointer makes more sense given the rest of the code.
|
| I'm sure a clever machine learning algorithm could figure
| that out with a corpus as large as Google's. Maybe. But
| no tool like that works today.
|
| And not strangely at all, Google does accept "what clang-
| tidy does" as the canonical way of formatting text. But
| readability at Google is far more than just formatting.
|
| Readability is frustrating and annoying, but more than
| just lint.
| gravypod wrote:
| > So the rules can be codified sufficiently to test
| people on, but they can't be codified for a computer?
|
| Small note: readability isn't a test or quiz you take
| (asterisk). It's obtained by merging code in the language
| you want readability for. If you merge code for a
| language often and the reviewers have very few style-
| based questions for the code then you will get
| readability fairly quickly.
|
| > The only one that sounds more difficult to codify is
| telling people of the existence of duplicate functions.
| But as someone who contributes to the linux kernel, I can
| tell you right now that the only way that works reliably
| is to have a very large pool of reviewers. Very
| experienced engineers frequently miss what people are
| doing in other parts of the source base, the name might
| not be what they expect, etc, etc, etc.
|
| A better example would be knowing when you should use
| `const std::string&`, `std::string_view` or `char*`.
| Example: https://abseil.io/tips/1
|
| The best readability advice I have received has been:
|
| 1. Direct "I was confused by X" or "The recommended way
| to do A is using B", etc
|
| 2. Reasoned: "std::string_view is more efficient and
| clearer in intention than char ptr, it also improves type
| safety as it is read only and clear about ownership"
|
| 3. Linked to source material where examples are given
| (totw or other examples in the code).
| kccqzy wrote:
| > Very experienced engineers frequently miss what people
| are doing in other parts of the source base, the name
| might not be what they expect, etc, etc
|
| Very true. Readability can't help with that, nor is it
| designed to. It's mostly there to help novices and new
| hires. Experienced engineers already have readability
| themselves so they don't need this extra review.
| iamstupidsimple wrote:
| Readability is not about formatting, that's an orthogonal
| issue. It's possible to have terrible code that's
| perfectly formatted.
|
| It's more about good usage of idiomatic language
| constructs, which still requires good human judgement to
| evaluate.
| StillBored wrote:
| And I take it google has done wide ranging scientific
| studies about the variations in coding styles and
| language constructs that it is a secret advantage that
| they know how to write "readable" code? Implying they
| tried a bunch of different ways until settling on the one
| true way that allows a diverse set of people with diverse
| experiences to read it?
|
| Ever heard of COBOL?
|
| Because readability has always been in the eye of the
| beholder, and codifying it makes it even worse.
| iamstupidsimple wrote:
| What counts for readability is not set in stone by some
| language czar as the One True Way. Everyone knows the
| style guides can't be perfect which is why they're
| relatively mutable.
|
| In any case, readability will comment on stuff that
| cannot easily be quantified, such as when to use a
| certain object hierarchy or dependency injection, etc...
| compiler-guy wrote:
| I'm not a fan of readability exactly the way Google does
| it, but I'm pretty happy that Google insists on various
| aspects of it, like good identifier names.
|
| I don't know of any research off hand, but I'm pretty
| sure the industry consensus is that good identifier names
| improve the quality of the code (Go style
| notwithstanding.) Readability is one way to train
| engineers to do it.
| joshuamorton wrote:
| > Because readability has always been in the eye of the
| beholder, and codifying it makes it even worse.
|
| This is empirically false. Consistency, even if it is
| unfavorable to your preferences, is superior to
| inconsistency. So a codified set of best practices is
| better than none at all.
|
| There are parts of Google's style guides that I would
| change if I could, but I also prefer having a style guide
| (and one that goes beyond things that are lintable) than
| none at all, because consistency across the codebase
| means that I can usually understand code at a glance, or
| if not, know at a glance that something unusual is
| happening. (this is in fact precisely the argument in
| favor of autoformatters like gofmt/black/prettier, but
| extended to softer concepts that can't always be
| formatted: consistent style, even if it isn't your
| favorite, is superior to inconsistent style).
| StillBored wrote:
| Consistency is what you get when you have a defined rule
| set programmatically enforced. If you're looking for
| "readability" via human judgment, then you get a very
| different result.
| joshuamorton wrote:
| A programmatically enforced set of rules is certainly one
| way to get consistency, but it isn't the only way. You
| can achieve consistency through culture and training too,
| and sometimes that's the only way.
|
| Edit: you can look at Google's C++ style guide for some
| examples, https://google.github.io/styleguide/cppguide.ht
| ml#Structs_vs...
|
| It isn't possible to statically analyze if a class/struct
| is a POD or if the methods enforce invariants. But it's
| often very easy to do so with a human eye. And there's
| value in the distinction!
|
| Similarly, forcing someone to justify using a power-
| feature (operator overloading, templates, metaclasses,
| whatever) can only be done by a human. There may be cases
| where the power feature is warranted and the benefits
| outweigh the cost, but a linter can't know that. (and
| ultimately all of this comes back to: things look
| consistent, and when things are inconsistent, that's a
| strong signal that something unusual is happening and you
| should pay close attention)
| dekhn wrote:
| Context on borgmon: https://sre.google/sre-book/practical-
| alerting/
|
| borgmon was a truly weird system.
| mikelward wrote:
| It still is, but it used to be, too.
| mathteddybear wrote:
| Broadly speaking, there are tools to automate this or that,
| some technologies are getting deprecated and replaced by new
| ones
|
| Also probably the privacy review could be a bigger bottleneck
| these days ;-)
| justicezyx wrote:
| I worked at TI, Planet and later Borg, I did not feel much
| influence of this video other than a chuckle. Or I might be
| too low level to perceive.
| jeffbee wrote:
| I think it was a very common perception among application-
| level SWEs and SREs that TI, platforms, and Borg did not
| themselves use the stack enough to perceive its flaws.
| sicromoft wrote:
| See also the recent discussion here of "I Don't Know How To
| Count That Low": https://news.ycombinator.com/item?id=28988281
| jrockway wrote:
| I feel like people forgot after about five years. I remember
| wasting a week filling out various pieces of paperwork and
| submitting byzantine configuration CLs so that some contractor
| would have permission to view a certain webpage through the
| corporate proxy. (I think what made me most mad is that regular
| employees could view the website with no additional
| configuration. I can understand if I was filling out tickets to
| get approval, or a security review, but the actual
| configuration of the proxy had to change to allow this, in
| addition to getting all of those approvals!) My team didn't
| make the website, and the contractor didn't work on my team, so
| I'm honestly not sure why I was involved. I just remember being
| annoyed about it. I'm sure there are some memes about it in the
| archive.
| ikiris wrote:
| Contractor access is its own hellscape.
| q3k wrote:
| ... socially, too. That part sucked.
| abustamam wrote:
| Separation of the classes and all
| davidw wrote:
| I did a stint contracting for Goldman Sachs a while back. I
| can relate. Don't think I can say anything more without a
| team descending on my house from a black helicopter,
| though.
| servytor wrote:
| At night all helicopters are black.
| twinkletwinkle_ wrote:
| I once worked at a tiny startup where we were trying to
| sell a dataset to GS. Before we could even send a sample,
| they sent over some boilerplate forms for us to sign. I
| remember two distinct stipulations - anything we sent
| them was immediately and forever their property, AND they
| had the right to drug test any of our employees. We ended
| up not signing so there was no deal. My boss said it was
| their way of getting rid of us.
| hnov wrote:
| This is paradoxically because historically everything was
| wide open to anyone so access-control and such isn't super
| fleshed out for most apps behind the proxy. Random internal
| app X could have been conceived and built with very little
| oversight and opening it up to a rotating cast of temporary
| workers is seen as an unnecessary risk. Broadly used apps
| (e.g. the bug ticket system) tend to have app-level security-
| controls and are not blocked by the proxy for contractors.
| inoffensivename wrote:
| It was hugely influential in identifying the frustration of
| getting things done at Google. In my experience it's even more
| true now than it was back then, the number of things you have
| to deal with has just grown. I've been at Google since 2006 and
| I feel like I'm losing my mind with all the complexity.
| jez wrote:
| Out of genuine curiosity, what keeps you at Google for 15
| years despite perceived increase in complexity to getting
| things done? I'm wondering whether the answer is like
| "there's a lot of complexity, but I like the work I do more"
| or "I like the people more" or some other reason.
| Dangeranger wrote:
| I think they call them "golden handcuffs".
| oblio wrote:
| Heh, that's <<if>> they want to leave.
|
| If they don't, they call it "comfy job with awesome
| paycheck and not a lot of pressure" :-p
| dTal wrote:
| (this comment now obsolete)
| oblio wrote:
| There, I fixed it!
|
| https://cheezburger.com/5821507840/his-and-hers-shower-
| head
| dTal wrote:
| You know, I've made similar remarks on HN before, but you
| are the first person to actually edit their comment.
| Amazing. Now I feel compelled to edit mine...
| marcyb5st wrote:
| Googler here.
|
| I think the technical term is Golden cage :)
| treebog wrote:
| What does "serve 5TB" refer to? They expect 5TB of network
| bandwidth over some time period (a month?)? Or their database
| takes up 5TB on disk?
| raldi wrote:
| It's a joke that's sort of open to interpretation.
|
| The most straightforward is, "I just want to do this incredibly
| simple thing; why is it so hard?"
|
| But there's also the level of, "Googlers are so engineeringly
| pampered that they think serving 5 terabytes is the equivalent
| of Hello World."
|
| And then there's another level of, "Well, isn't it? After all,
| this is Google and this is $YEAR."
| rachelbythebay wrote:
| Imagine it as "I want to have a http://foo/~me/ type path where
| I can park 5 TB of stuff and other people can fetch from it
| when they feel like it".
|
| 5 TB of data made available, not 5 TB of
| transfer/bandwidth/etc.
| metalliqaz wrote:
| i think it just means to put 5TB of data online
| drjasonharrison wrote:
| If you watch the video, it doesn't matter. It's just something
| they want to serve.
| shoeshoeshoey wrote:
| Facebook had its own meme: "Pusher I need a hotfix"
| w0mbat wrote:
| When I first started at Google I got things done a lot faster
| because I didn't know all those rules existed and nobody stopped
| me. My service was still plenty fast & reliable. Eventually it
| all got rewritten by other people to do things properly like the
| video says.
| dekhn wrote:
| I managed to deploy a whole system at google that had the
| ability to take down all of google globally by DoS'ing the
| network, and ran it casually (i.e., starting and stopping it when
| I felt like it, at the capacity I felt like, with the binary
| versions I wanted) for 3 years.
|
| In retrospect, this was absolutely crazy! The actual visible
| outcomes were: 1 cluster drained due to heat rising so fast the
| alerting thought there was a fire, 1 page to an engineer in the
| middle of the night (sorry discovery-service) and a whole bunch
| of complaints about CPU stealing that weren't my fault.
|
| Those were the good old days.
| robocat wrote:
| Rachel said something similar "My own 'solution' to it after
| far too much thrashing was just to say 'we cannot get all N
| types of quota in the same place so we are at the mercy of
| whatever happens to be available, and if that dries up, we stop
| running'. Granted, this was for some internal stuff that was
| seven or eight levels removed from anything that anyone on the
| outside might ever see, but still, it was stupid and made me
| feel so dirty. I'm sure my non-solution probably bit someone
| later. Sorry, whoever." --
| https://rachelbythebay.com/w/2021/10/30/5tb/
| kgin wrote:
| Only if you think your users are scum. Do you think your users
| are scum? Do you? Why do you hate your users?
| jbverschoor wrote:
| omg so toxic
| jamestimmins wrote:
| As a friend of mine explained why she left Google a year or so
| ago, "I got tired of emailing 30 people to try to figure out who
| owned a single variable."
| nostrademons wrote:
| A TL I worked with once had a simple but effective strategy for
| that:
|
| "Remove it and see who complains."
|
| I did that (with the impenetrably named "PrefetchExperiment",
| last touched by a branch that lost previous file history in
| 2007). Turned out it was the source data for Google's DNS to
| figure out how to route queries to the lowest-latency
| datacenter, based on their geographic location. In about a
| month, it would've taken down all Google services. Oops.
|
| It was a very effective way of figuring out who owned the
| variable and writing a big long comment explaining what it's
| there for and which team to contact before changing it, though.
| joshuamorton wrote:
| Scream tests are always fun. ("break it and see who screams")
| AlexanderTheGr8 wrote:
| LMAO, isn't this very similar to Facebook's recent DNS
| problem?
|
| Also I love the idea of removing it and seeing who complains.
| _3u10 wrote:
| It also works great for a product / bug backlog. Just
| delete the entire thing. If it's a real bug / feature it
| will get recreated.
| zamadatix wrote:
| Facebook's recent "DNS problem" was that a process for checking
| routing failover capacity on the backbone for maintenance
| ended up taking down the backbone links. As a result of the
| servers being disconnected from the backbone they pulled
| their BGP advertisements since they considered their
| location to be unhealthy (no connection to the backbone).
|
| FB's problem was the lack of routing reachability on its
| backbone triggering the lack of routing reachability
| information being sent to the larger internet; this in turn
| caused problems for DNS, not the other way around.
| ikiris wrote:
| The hilarious thing is I know exactly what file you're
| talking about here.
| raldi wrote:
| The best and worst parts of being a Google engineer: Impossible
| things are merely very hard, but on the other hand, easy things
| are also very hard.
| taldo wrote:
| Ah, the laughs (Xoogler since 2020). It was a lot easier, at
| least last year: you'd use "flex" quota from your PA pool
| (product area) for Spanner and Borg, write some code for your
| server, a few configs here and there, and you'd be ready and
| serving.
| bhickey wrote:
| About six years ago I had a resource manager deny me a database
| instance the very same day it became available for flex in
| another product area. I tried to "Hey Mister" resources from
| someone in that group to no avail. Eventually I wrote a high-
| durability key-value store on top of our source control system
| and told them they could give me my database or I'd be
| deploying to prod.
| dilyevsky wrote:
| That video came out a few years before flex appeared _I think_
| at a time they were having a sort of "resource crunch" on the
| heels of growth spurt following the GFC.
| willidiots wrote:
| Flex was available for certain things (Colossus IIRC, gave
| you a ton of flex quota) but for others it wasn't. Because
| This is Google.
| the-rc wrote:
| It was easier to mint and carve out Colossus quota than
| e.g. Bigtable. I seem to remember that flex for Borg
| existed, but only in a few locations with enough capacity
| to back it. You couldn't just retrofit it in clusters where
| existing, large customers were already granted and using
| most of the quota.
| the-rc wrote:
| It wasn't that there was a crunch -- that had always existed.
| There just wasn't all the tooling to implement anything like
| flex. At least this video was made after "buying Borg quota"
| was a normal thing. Before it, you had to "buy" regular
| machines and donate/assimilate them into Borg. Then after X
| days you'd receive your quota, minus a Borg "tax" of 10% to
| cover borglet and system daemons' overhead.
| dekhn wrote:
| you left out monitoring for reliability which is a major part
| of this video
| mikewarot wrote:
| I didn't realize this was made by Google in the first place when
| I saw it a few days ago. I hope things are simpler now, but I
| doubt it.
| jedberg wrote:
| The same conversation at Netflix 10 years ago:
|
| I want to serve 5TB of data.
|
| Ok, spin up an instance in AWS and put it there.
|
| I want it production ready.
|
| Ok, replicate it to a second instance. If it breaks we'll page
| you to fix it.
|
| The funny thing is, for important stuff, we ended up doing
| similar things to what you see in this video, but for unimportant
| things, we didn't. I think it was a better system, and it was
| amusing when we hired people from Google who were confused by the
| lack of process and approvals.
| ignoramous wrote:
| > I want to serve 5TB of data. Ok, spin up an instance in AWS
| and put it there... it was amusing when we hired people from
| Google who were confused by the lack of process and approvals.
|
| Quoting from _Velocity in Software Engineering_
| https://queue.acm.org/detail.cfm?id=3352692:
|
| _In 2003, at a time in Amazon's history when we were
| particularly frustrated by our speed of software engineering,
| we turned to Matt Round, an engineering leader who was a most
| interesting squeaky wheel in that his team appeared to get more
| done than any other, yet he remained deeply impatient and
| complained loudly and with great clarity about how hard it was
| to get anything done. He wrote a six-pager that had a great
| hook in the first paragraph: "To many of us Amazon feels more
| like a tectonic plate than an F-16."_
|
| _Matt's paper had many recommendations... including the
| maximization of autonomy for teams and for the services
| operated by those teams by the adoption of REST-style
| interfaces, platform standardization, removal of roadblocks or
| gatekeepers (high-friction bureaucracy), and continuous
| deployment of isolated components. He also called... for an
| enduring performance indicator based on the percentage of their
| time that software engineers spent building rather than doing
| other tasks. Builders want to build, and Matt's timely
| recommendations influenced the forging of Amazon's technology
| brand as "the best place where builders can build."_
|
| ...leading up to the creation of AWS.
| jll29 wrote:
| > we turned to Matt Round, an engineering leader who was a
| most interesting squeaky wheel in that his team appeared to
| get more done than any other
|
| Matt went on to study theology, and he's started a church
| community in Scotland: https://www.linkedin.com/in/mattround/
|
| "Leader Company Name: Hope City Church Edinburgh Dates
| Employed: Sep 2017 - Present Driving a new church start-up."
| ryandrake wrote:
| The "approval paralysis" thing happens at a lot of companies,
| large and small, not just GiantTech. It creeps up on you
| slowly: 1. A big problem happens that gains the attention of
| leadership. 2. The problem is root-caused to some risky thing
| an employee did trying to accomplish XYZ. 3. To correct this,
| a _process_ is put in place that must be followed when one
| wants to do XYZ, and (critically) gatekeepers are anointed
| who must approve the activity. 4. These gatekeepers are
| inevitably senior already-busy people who become bottlenecks.
| Now we can't do this critical thing without hounding
| approvers. 5. Some other big problem happens and the above
| cycle starts all over again.
|
| Before you know it, every even slightly risky task you need
| to do through the course of your job requires the blessing of
| approvers who are well-intentioned, but all so overloaded
| they don't even answer their E-mail or chats. They sometimes
| need to be physically grabbed in the hallway in order to
| unblock your project. Progress grinds to a halt and it still
| has not stopped production problems--just those particular
| classes of problems that the approval processes caught.
|
| EDIT: Not sure what the right solution is, but it must be one
| that doesn't rely on a particular overloaded human doing
| something. Maybe an automated approval system that produces a
| paper trail (to help with postmortem and corrective action
| later) and ensuring all changes can be rolled back
| effortlessly. Easier said than done, obviously.
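|
| A minimal sketch of that idea, assuming a change can be
| expressed as a "do" and an "undo" callable. ChangeLog,
| AuditEntry and their fields are hypothetical names, not any
| real system's API. No human approves anything; the log is
| append-only so the paper trail survives a rollback.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class AuditEntry:
    who: str
    what: str
    when: datetime
    undo: Callable[[], None]
    rolled_back: bool = False


@dataclass
class ChangeLog:
    entries: List[AuditEntry] = field(default_factory=list)

    def apply(self, who: str, what: str,
              do: Callable[[], None], undo: Callable[[], None]) -> None:
        """Run the change, then record who did what and how to undo it."""
        do()
        self.entries.append(
            AuditEntry(who, what, datetime.now(timezone.utc), undo))

    def rollback_last(self) -> str:
        """Undo the newest change that is still live; keep its log entry."""
        for entry in reversed(self.entries):
            if not entry.rolled_back:
                entry.undo()
                entry.rolled_back = True
                return entry.what
        raise LookupError("nothing to roll back")
```

| The hard part the comment alludes to is writing a correct
| undo for every kind of change; the sketch simply assumes one
| exists.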
| david422 wrote:
| What is the solution?
|
| I've worked at big companies that are mired in process
| because they would rather spend more time crossing i's and
| dotting t's than risk breaking something. I can see why.
|
| And I've worked at smaller companies where the clients are
| small and it's easy to fix things that break. Move fast and
| break things at a small scale maybe.
|
| But how do you grow to be a big company and still operate
| like a small company? I can't seem to see an answer.
| native_samples wrote:
| There are many, but the problems are more subtle than
| this video really gives credit for.
|
| I worked at Google at the time this video was made, and
| empathized (in fact I had been an SRE for years by that
| point). Nonetheless, there are flip sides that the video
| maker obviously didn't consider.
|
| Firstly, why did everything at Google have to be
| replicated up the wazoo? Why was so much time spent
| talking about PCRs? The reason is, Google had consciously
| established a culture up front in which individual
| clusters were considered "unreliable" and everyone had to
| engineer around that. This was a move specifically
| intended to increase the velocity of the datacenter
| engineering groups, by ensuring they did _not_ have to
| get a billion approvals to do changes. Consider how slow
| it'd be to get approval from every user of a Google
| cluster, let alone an entire datacenter, to take things
| offline for maintenance. These things had tens of
| thousands of machines _per cluster_ and that was over a
| decade ago. They'd easily be running hundreds of
| thousands of processes, managed by dozens of different
| groups. Getting them all to synchronize and approve
| things would be impossible. So Google said - no approvals
| are necessary. If the SRE/NetOps/HWOPS teams want to take
| a cluster or even entire datacenter offline then they
| simply announce they're going to do it in advance, and,
| everyone else has to just handle it.
|
| This was fantastic for Google's datacenter tech velocity.
| They had incredibly advanced facilities years ahead of
| anyone else, partly due to the frenetic pace of upgrades
| this system allowed them to achieve. The downside:
| software engineers have to run their services in >1
| cluster, unless they're willing to tolerate downtime.
|
| Secondly, why couldn't cat woman just run a single
| replica and accept some downtime? Mostly because Google
| had a brand to maintain. When she "just" wanted to serve
| 5TB, that wasn't really true. She "just" wanted to do it
| under the Google brand, advertised as a Google service,
| with all the benefits that brought her. One of the
| aspects of that brand that we take for granted is
| Google's insane levels of reliability. Nobody, and I mean
| nobody, spends serious time planning for "what if Google
| is down", even though massive companies routinely
| outsource all their corporate email and other critical
| infrastructure to them.
|
| Now imagine how hard it'd be to maintain that brand if
| random services kept going offline for long periods
| without Google employees even noticing? They could say,
| sure, this particular service just wasn't important
| enough for us to replicate or monitor and the DC is under
| maintenance, we think it'll be back in 3 days, sorry. But
| customers and users would freak out, and rightly so. How
| on earth could they guess what Google would or would not
| find worthy of proper production quality? That would be
| opaque to them, yet Google has thousands of services.
| It'd destroy the brand to have some parts that are
| reliable and others not according to basically random
| factors nobody outside the firm can understand. The only
| solution is to ensure every externally visible service is
| reliable to the same very high degree.
|
| Indeed, "don't trust that service because Google might
| kill it" is one of the worst problems the brand has, and
| that's partly due to efforts to avoid corporate slowdown
| and launch bureaucracy. Google signed off on a lot of dud
| uncompetitive services that had serious problems,
| specifically because they hated the idea of becoming a
| slow moving behemoth that couldn't innovate. Yet it
| trashed their brand in the end.
|
| A lot of corporate process engineering is like this. It
| often boils down to tradeoffs consciously made by
| executives that the individual employee may not care
| about or value or even know about, but which is good for
| the group as a whole. Was Google wrong to take an
| unreliable-DC-but-reliable-services approach? I don't
| know but I really doubt it. Most of the stuff that SWEs
| were super impatient to launch and got bitchy about
| bureaucracy wasn't actually world changing stuff, and a
| lot ended up not achieving any kind of escape velocity.
| edude03 wrote:
| This is a great explanation, thank you.
|
| (I've never worked at google, and maybe this isn't a
| problem anymore however) It seems like the "solution"
| here would be to do for Infra what Go did for Concurrency
| - build an abstraction with sane defaults, and rubber
| stamp anything that doesn't stray from those defaults.
| Anything that does - requires further scrutiny.
|
| For example, at the companies where I've been responsible
| for infrastructure (admittedly much smaller than google)
| I've done exactly that (with Kubernetes specific things
| like PodDisruptionBudgets and defaulting to 2 replicas),
| and if users use the default helm chart values, they can
| ship their service by themselves.
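|
| Roughly, the "rubber stamp the defaults" rule could look
| like the sketch below. This is an illustration under
| assumptions, not the actual tooling: DEFAULTS, deviations
| and review are hypothetical names, and the key names are
| made up (only the 2 replicas and the PodDisruptionBudget
| come from the paragraph above).

```python
from typing import Any, Dict, List

# Hypothetical platform defaults; key names are invented for illustration.
DEFAULTS: Dict[str, Any] = {
    "replicas": 2,
    "pod_disruption_budget": True,
}


def deviations(values: Dict[str, Any]) -> List[str]:
    """List every setting that overrides a default or is not a known default."""
    return sorted(
        key for key, value in values.items()
        if key not in DEFAULTS or DEFAULTS[key] != value
    )


def review(values: Dict[str, Any]) -> str:
    """Rubber-stamp default configs; route anything else to a human."""
    changed = deviations(values)
    if not changed:
        return "auto-approved"
    return "needs review: " + ", ".join(changed)
```

| Unknown keys are treated as deviations on purpose, so a new
| knob gets a human look the first time someone uses it.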
| Ao7bei3s wrote:
| Self-service approvals.
|
| Instead of appointing a senior eng to be approver, task
| the same senior eng with writing down his decision
| criteria (as text or where it makes sense even as code).
|
| This has advantages for everyone:
|
| 1. It lets the engineers who need approval move at their
| own speed, and plan time for it as a predictable work
| item like any other, instead of depending on an approver
| for whom the approvals will usually be at a lower
| priority and mid-sprint.
|
| 2. For the approval policy writer, it turns this into a
| one time effort with a defined scope that can be planned
| and prioritized in his/her own backlog, instead of open
| ended toil that can come at any time, take any time, and
| not clearly relate to their own current priorities.
|
| 3. For the company, writing down the policy brings
| consistent decision making.
|
| Obviously this requires trust that employees can and will
| say "no, can't do" when they're tasked with something
| that is not approvable, which can be culturally difficult
| (business and otherwise). Checklists (literally a list of
| checkboxes to click on, "I confirm that...") can help
| with this.
|
| (As an example of writing down the policy as code: that's
| any CI/CD pipeline. But it's not limited to engineering
| decision making - for example, we're using a well-known
| open source license management tool that promises auto-
| approval for open source library use depending on
| policies configured by legal. This works moderately not
| so well because this particular tool is not great; the
| idea is sound. We still made it work: now legal wrote
| down their policies, trained a large number of engineers
| on them and those are now empowered to make decisions.)
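|
| As a sketch of "decision criteria written down as code",
| assuming the license case above: LicensePolicy and POLICY
| are hypothetical, and the license lists are placeholders
| for illustration only, not anyone's real legal policy and
| not how the unnamed tool works.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class LicensePolicy:
    auto_approve: FrozenSet[str]
    forbidden: FrozenSet[str]

    def decide(self, license_id: str) -> str:
        """Apply the written-down criteria; escalate only the unknown cases."""
        if license_id in self.forbidden:
            return "rejected"
        if license_id in self.auto_approve:
            return "approved"
        return "ask legal"


# Placeholder lists: in practice legal would own and maintain these.
POLICY = LicensePolicy(
    auto_approve=frozenset({"MIT", "Apache-2.0", "BSD-3-Clause"}),
    forbidden=frozenset({"AGPL-3.0-only"}),
)
```

| Only the "ask legal" branch still costs an approver's time,
| and each such case is a prompt to extend the policy.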
| ignoramous wrote:
| Autonomy.
|
| Solution to such org woes, in part, is discussed by
| Clayton Christensen in his work, _The Innovator's
| Solution_
| http://web.mit.edu/6.933/www/Fall2000/teradyne/clay.html:
| _Even after correctly identifying potentially disruptive
| technologies, firms still must circumvent its hierarchy
| and bureaucracy that can stifle the free pursuit of
| creative ideas. Christensen suggests that firms need to
| provide experimental groups within the company a freer
| rein. "With a few exceptions, the only instances in which
| mainstream firms have successfully established a timely
| position in a disruptive technology were those in which
| the firms' managers set up an autonomous organization
| charged with building a new and independent business
| around the disruptive technology." This autonomous
| organization will then be able to choose the customers it
| answers to, choose how much profit it needs to make, and
| how to run its business._
|
| ---
|
| Amazon and Cloudflare are good examples of big-orgs
| trying their best to implement the late Prof. Christensen's
| ideas.
|
| Andy Jassy on Amazon's approach to innovation:
| https://www.hbs.edu/forum-for-growth-and-innovation/podcasts...:
| _And then if we like the answers
| to those first four elements, then we ask, can we put a
| group of single threaded focused people on this
| initiative, even if it seems like they're overwhelming
| it with strong senior people, if you try to add really
| busy people do the existing business and the big new
| idea, they will always favor the existing business
| because it's surer bet. So we want to peel people away
| from the existing business and put them just on the new
| initiative._
|
| Pace of innovation at Cloudflare
| https://blog.cloudflare.com/the-secret-to-cloudflare-pace-of...:
| _...it is not unusual for an initial product
| idea to start with a team small enough to split a pack of
| Twinkies and for the initial proof of concept to go from
| whiteboard to rolled out in days. We intentionally staff
| and structure our teams and our backlogs so that we have
| flexibility to pivot and innovate. Our Emerging
| Technology and Incubation team is a small group of
| product managers and engineers solely dedicated to
| exploring new products for new markets. Our Research team
| is dedicated to thinking deeply and partnering with
| organizations across the globe to define new standards
| and new ways to tackle some of the hardest challenges._
|
| ---
|
| Also read: Clayton Christensen and Stephen Kaufman on
| "Resources, Process, and Priorities":
| https://personal.utdallas.edu/~chasteen/Christensen%20-%202n...
| [deleted]
| bostonsre wrote:
| Automate as much as possible. Approval gates are there to
| prevent obvious issues from continuing down the pipeline.
| If you can automate checks for known issues that you want
| to prevent from happening, then you should be able to add
| it as a test step. Then in the catch, log why it failed and
| point the dev at documentation.
|
| Manual processes suck for everyone involved.
| jeffbee wrote:
| You cannot have both an organization that fastidiously
| protects the privacy and security of user data, and one
| that requires no process to build and launch software. It's
| just not possible.
|
| Anyway the video is just a joke. I've never worked anywhere
| where it was as easy to just serve 5TB of static data as at
| Google. Googlers who want to just host junk under their own
| authority do not need to shop for quota, set up borgmon,
| etc.
| joshuamorton wrote:
| Right. Looking back, they're setting up a production,
| user-facing service. If I want to just store a 5TB blob
| somewhere, I think that fits in freebie CNS, so I don't
| even have to provision resources; I just cat the file or
| whatever (granted, 5TB was a bit bigger 10 years ago).
|
| Having a rule that "your user-facing service needs to be
| replicated" is a good rule. Replication being difficult
| was the problem.
| Zababa wrote:
| I've read on HN that "processes are organizational scar
| tissue"; I think it applies here.
| riknos314 wrote:
| Yep. A wise engineer once told me "Runbooks [written
| SOPs] are just solving bugs with people instead of code"
| rShergold wrote:
| That's an excellent phrase. It reminds me of the navy
| saying "regulations are written in blood"
| MauranKilom wrote:
| It's actually super related, given that (at least in the
| medical software sector) you won't get anything approved
| by the FDA before spelling out the entire software
| development operation in processes.
| strictfp wrote:
| In change management they argue that companies tend to
| purposely slow down change over time to become more
| predictable and lock in on the "successful route". That
| certainly mirrors my experience. The only thing I don't
| understand is why you hire so many people when you let a
| handful of people gate everything. You might just as well
| fire 80% of the workforce.
___________________________________________________________________
(page generated 2021-11-02 23:00 UTC)