[HN Gopher] The Value of In-House Expertise
___________________________________________________________________
The Value of In-House Expertise
Author : ingve
Score : 168 points
Date : 2021-09-29 09:08 UTC (13 hours ago)
(HTM) web link (danluu.com)
(TXT) w3m dump (danluu.com)
| awinter-py wrote:
| > Another reason to have in-house expertise in various areas is
| that they easily pay for themselves ... If, in the lifetime of
| the specialist team like the kernel team, a single person found
| something that persistently reduced TCO by 0.5%, that would pay
| for the team in perpetuity ... people will also find
| configuration issues, etc., that have that kind of impact.
|
| _KEY_ observation that we forget when we wear many hats at small
| companies. This is the satisfying core reason 'why we debug'.
| Our deep dives matter. (Though not as much as they would at
| BigCos)
| qaq wrote:
| Hmm reducing TCO by 0.5% at Twitter would pay for the team in
| perpetuity. Reducing TCO by 0.5% at small Co. will pay for a
| beer once a month.
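The scale argument above can be written out as arithmetic. This is only a sketch: the TCO and team-cost figures are hypothetical placeholders, not numbers from the article or from Twitter, and `annual_saving` and `pays_for_team` are made-up helper names.

```python
def annual_saving(total_cost_of_ownership: float, reduction: float = 0.005) -> float:
    """Yearly saving from a persistent fractional TCO reduction (0.5% by default)."""
    return total_cost_of_ownership * reduction


def pays_for_team(total_cost_of_ownership: float, team_cost: float,
                  reduction: float = 0.005) -> bool:
    """True if the yearly saving covers the yearly cost of the team."""
    return annual_saving(total_cost_of_ownership, reduction) >= team_cost


# Hypothetical yearly figures, chosen only to show the effect of scale.
BIG_CO_TCO = 1_000_000_000   # assumed large-company infrastructure spend
SMALL_CO_TCO = 100_000       # assumed small-company infrastructure spend
TEAM_COST = 3_000_000        # assumed fully loaded cost of a specialist team

print(pays_for_team(BIG_CO_TCO, TEAM_COST))    # same 0.5%, large base
print(pays_for_team(SMALL_CO_TCO, TEAM_COST))  # same 0.5%, small base
```

The percentage is identical in both cases; only the base it applies to changes, which is the whole of the economy-of-scale point.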
| awinter-py wrote:
| yes -- at smaller places, the goal is usually fixing
| something urgently broken and the economics are harder to
| justify
|
| which is really frustrating and leads old hands at small
| places to become obsessed with conservative build/buy
| decisions and low TCO tech
|
| nice to be reminded that it's not like that everywhere, there
| are places where debugging labor pays consistent dividends
| feoren wrote:
| > which is really frustrating and leads old hands at small
| places to become obsessed with conservative build/buy
| decisions and low TCO tech
|
| Can you clarify what you mean here? Are you saying that
| when people cut their teeth in big companies and then move
| to small startups, they tend to be overly conservative and
| favor low TCO tech, when instead they should be spending
| more money for, what, more growth? Should they be building
| more in-house, or buying more off the shelf? I'm just not
| clear which direction you're going here.
| awinter-py wrote:
| ah no, I'm saying that small co experience leads to
| visceral fear of tech that is non-mature / difficult to
| debug / unpredictable
|
| (whether or not you're ex-big-5)
|
| sorry, by 'low TCO' I wasn't just thinking about $/month
| -- I was thinking of technology that is non-experimental,
| easy to hire for and manage, that doesn't take one of
| your senior people a week per quarter to keep alive. TCO
| is the wrong word for that.
| handrous wrote:
| Why hello my old friend, _economy of scale_. We meet again.
| lordnacho wrote:
| > a single person found something that persistently reduced TCO
| by 0.5%, that would pay for the team in perpetuity,
|
| There are several problems with this line of thinking, although
| as I will mention, it's not actually crazy, just problematic.
|
| Attribution is one: how will you know if the team actually did do
| something that increased profits? There are many teams
| involved in a business. It's not at all simple to say he-did-
| this-and-she-did-that. In fact much of office politics is exactly
| that, pie-cutting.
|
| You also don't know whether a non-specialist might have figured
| out the problem. There's a lot of smart people at Twitter, right?
| Surely some of them work in adjacent areas and have occasional
| time to look at other things? If a non-specialist might have
| solved it, what else might he have solved? Couldn't he also
| collect 0.5% slices of profit for the company?
|
| How do big businesses ever lose money? They must have a load of
| specialists, right? And some of those will during any period be
| doing that thing that makes them pay for themselves in
| perpetuity? The big danger is you justify every expenditure this
| way, "they just need to find one thing". Security says they
| stopped a cyberattack that would have shut down the company for
| two days, kernel says they reduced runtime by 0.5%, sales claims
| to have raised prices by 0.5%. In the end there's a fair chance
| that all the claimed gains don't add up to your bottom line.
|
| I remember as a quant trader we could buy any book we wanted.
| Programming, Linear Algebra, finance, whatever. "If you find just
| one good idea, that will pay for all the books." Hard to argue
| with considering the sums involved, but it's also hard to know
| exactly what ideas we got out of the books.
|
| Finally, if someone claims to be making you money, they will also
| claim the money. Especially if it's clearly agreed (Yep AWS cost
| is lower by 0.5% exclusively because of a kernel team action). So
| them saving 0.5% won't necessarily mean the company gets that
| extra profit. They may feel they deserve a raise, or new
| headcount to spread the work. Or you will decide not to pay them
| and they will leave.
| bob1029 wrote:
| A corollary for this could be:
|
| The value of minimizing external complexities [0]
|
| For instance, if you design an application as HTML+PWA instead of
| native mobile apps, you just need a web developer who understands
| responsive CSS techniques and maybe someone with time to test a
| bunch of different devices all day. With native, you usually need
| 1 fairly-specialized developer per native target unless you have
| a lot of time to go to market (or a very simple app).
|
| Another example could be designing your product to run on a
| single, bare-ass VM so you don't need to hire legions of level 30
| kubernetes wizards to sort out your go-to-market strategy or
| accountants to manage the byzantine nightmare that is
| AWS/Azure/et al. billing.
|
| The fewer things you have to worry about, the less expertise you
| need to maintain.
|
| [0] What I mean by "external complexities" - Anything that is
| external to the problem domain for which the solution is
| originally being built. If you have a banking product, an internal
| complexity would be state management around account or customer
| activities. An external complexity would be a 3rd party vendor,
| reporting system, database, file, network, hardware, operating
| system, or any other non-domain types residing within the software
| product itself.
| km3r wrote:
| One fairly specialized developer is a lot harder to replace
| than a generic level 30 kubernetes wizard. You lose that
| developer as a company and the ramp up time for a replacement
| could be >year. In addition, more standardized approaches have
| better defined practices, tooling, and security.
| awinter-py wrote:
| hmm I mean yes but
|
| author mentions doing science because 'scala is slower than java'
| -- they're talking about in prod, but build times are also
| slower. why not just use better tools?
|
| heard a one liner once about react I think which was like 'FB is
| hiring rocket scientists to get to par with 90s web performance'
|
| twitter is a cool site but it isn't curing cancer, it isn't
| feeding people, it's solving rails bugs caused by a celeb selfie
| during the Grammys.
|
| is the subtext here 'hiring rocket scientists in organizational
| sea caves because you can't hire rocket scientists to run the
| company and impose good practices top down'
| pantulis wrote:
| > a single person found something that persistently reduced TCO
| by 0.5%, that would pay for the team in perpetuity,
|
| This means that when you are operating at hyperscale you need a
| world class team. But the tricky question is calculating when
| that happens!
| mcot2 wrote:
| The section about Apple is quite wrong. Their products were held
| back by semiconductor technology as far back as I can remember.
| Some examples were never getting a laptop running with a G5 and
| for sure with the early phone and tablet prototypes they wanted
| to focus on efficiency in a small package. Buying PA-Semi was
| integral to their product roadmap.
|
| The stuff about twitter I am not sure. You certainly do not need
| kernel expertise or to use the JVM at all to build a similar
| product these days. It seems they could be just held back by
| legacy technology choices that don't really benefit the product
| and which competitors probably won't need the cost of supporting.
|
| Companies should always be evaluating how critical their in house
| technology is to their business and/or future product roadmap.
| winkeltripel wrote:
| > Companies should always be evaluating how critical their in
| house technology is to their business and/or future product
| roadmap.
|
| For these sorts of deep-expertise support teams, they would have
| to be considered over very long time horizons, since a kernel
| or JVM team might be needed to deal with a serious bug at any
| time, but there may not have been such a bug this year.
| MrBuddyCasino wrote:
| > Despite a lot of claims otherwise, Scala uses more memory and
| is significantly slower than Java
|
| Yup. Sometimes the advanced abstractions are worth it (Spark,
| maybe Akka), but those are niche cases.
| metropolisdelaq wrote:
| akka is never the answer
| chubot wrote:
| Nice article, but I wish these would have a date on them!
|
| Joel addressed the build vs. buy argument here (20 years ago).
|
| _If it's a core business function -- do it yourself, no matter
| what._
|
| https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...
|
| I guess what's not obvious to many people is that maintaining and
| optimizing the kernel and JVM is a core business function for
| Twitter (but probably not writing a kernel!). Likewise CPU design
| is now a core business function for Apple. Anything "down" your
| dependency stack can be.
|
| On the other hand, software for employees to file expense reports
| likely isn't, etc.
| sbierwagen wrote:
| The article list has dates: https://danluu.com/
| hyperpape wrote:
| I think Dan addresses this in the post. "Core business" just
| isn't well enough defined to be useful. Why is optimizing the
| kernel Twitter's core business, but not writing it? Because the
| ROI on the former is high for Twitter, but the ROI on the
| latter is low.
|
| If you're going to stretch things to call the kernel
| maintenance and optimization Twitter's core business, then you
| have the consequence that you don't know what your core
| business is until you spend a lot of time exploring which
| things are going to be effective uses of your money. Imo,
| that's too much of a stretch.
| chubot wrote:
| Yeah now I see it even has the phrase "core business", which
| he says is vague. I don't think it's that vague if you just
| add the qualification about "down the stack".
|
| It's hard for me to think of anything that should be brought in
| house that isn't down the stack, i.e. on the critical path to
| serving Twitter. It's just not obvious to outsiders what is
| down the stack for a given company (the "I could write
| Twitter in a weekend" people), but it should be more obvious
| to people working there.
|
| e.g. the example of expense reports is obviously not down the
| stack -- if it goes down or is generally terrible, you can
| still serve your customers. As another example, in the old
| days, the big tech companies used to actually hire chefs and
| kitchen staff themselves. These days I believe it's all done
| through contractors.
|
| Also I'd say the post assumes that saving money is always
| important... in the earlier stages of a company, they are
| frequently really wasteful and prefer to grow the business.
| An example is Dropbox starting on AWS and then building their
| own data centers to become cost efficient.
| tomerv wrote:
| Lots of things are "down the stack". The servers, the
| building they sit in, the electricity that runs the
| building, and the coal/water/solar/wind that produces that
| electricity. Should Twitter run a power plant? Maybe they
| should optimize the design of current power plants? Also,
| what about the toilets that their programmers use - they
| could probably be optimized as well!
| claudiulodro wrote:
| There may be an ROI for it, but I don't think it's part of
| their core business. If we extrapolate it out and Twitter
| becomes the best kernel optimizer on the planet, they still
| can't really sell kernel optimization as a product.
| TeMPOraL wrote:
| But one of Dan's points is that it _could become_ a part of
| their core business - they could launch e.g. a kernel
| optimization consulting service, much like Apple suddenly
| expanded into making their own CPUs.
| g051051 wrote:
| I'm not surprised that Twitter has those teams, I'm always
| surprised that more places don't. About 22 years ago I was the
| only person at the company I worked for at the time who could
| analyze Solaris core dumps, and understood enough about the JVM
| to diagnose deep problems. In the 5 years until the rest of the
| engineering staff caught up to where they stopped needing me for
| every incident, I probably saved enough money to pay my salary
| 100x over.
|
| Never saw a dime of that. The one time someone offered me a spot
| bonus to come solve a problem, they reneged on it: A manager in
| charge of a project came to me for help. We're crashing all the
| time, he says. IBM and the team can't figure it out, he says. If
| you can solve this, I'll give you a $5000 spot bonus, he says.
|
| I would have done it anyway, because it's my, you know, job? But
| whatever, I won't turn down free money.
|
| So I wander over to the team that's been looking at this and get
| the lowdown. They keep getting out of memory errors.
|
| Me: So what does the heapanalyzer output look like?
| Team: Huh?
|
| Me: You...you've been having out of memory errors and haven't
| looked at the heap?
| Team: Buh?
|
| So I get the heapdump and look at it. Immediately it's clear that
| the system is overflowing with http session objects.
|
| Me: Anything in the log files related to sessions?
| Team: Just these messages about null pointer exceptions during
| session cleanup...do you think they're related somehow?
| Me: <Bangs head on desk>
|
| A little more research reveals that there were two issues at
| play. The first is that we had a custom HttpSessionListener that
| was doing some cleanup when sessions were unbound. It would
| sometimes throw an exception. We were using IBM WAS, and it
| turned out that when a sessionDestroyed method threw an
| exception, WAS would abort all session cleanup. So we'd wind up
| in a cycle: the session cleanup thread would start, process a few
| sessions, and hit one that threw an exception on cleanup, which
| would abort cleaning up any other sessions.
|
| We did a quick fix of wrapping all the code in the
| sessionDestroyed method with a blanket try/catch and logging the
| exception for later fixing, and IBM later released a patch for
| WAS that fixed the session cleanup code to continue even if
| sessionDestroyed threw an exception.
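The failure mode and the quick fix described above can be illustrated with a small simulation. This is a Python sketch, not the servlet API or WAS internals: `cleanup_all`, `fragile_listener`, `guarded` and `make_sessions` are hypothetical names standing in for the container's cleanup pass, the custom HttpSessionListener, the blanket try/catch, and some expired sessions.

```python
import logging
from typing import Any, Callable, Dict, List

log = logging.getLogger("session-cleanup")

Session = Dict[str, Any]
Listener = Callable[[Session], None]


def make_sessions() -> List[Session]:
    """Three expired sessions; the second has a None cart, like a null field."""
    return [
        {"id": 1, "cart": []},
        {"id": 2, "cart": None},
        {"id": 3, "cart": []},
    ]


def fragile_listener(session: Session) -> None:
    """Stand-in for sessionDestroyed: fails when the cart is None."""
    session["cart"].clear()  # AttributeError on None, akin to an NPE


def guarded(listener: Listener) -> Listener:
    """The quick fix: catch and log everything so cleanup can continue."""
    def wrapper(session: Session) -> None:
        try:
            listener(session)
        except Exception:
            log.exception("cleanup failed for session %s", session["id"])
    return wrapper


def cleanup_all(expired: List[Session], listener: Listener) -> List[Session]:
    """Simulated cleanup pass that aborts on the first listener exception.

    Returns the sessions that were never cleaned up.
    """
    remaining = list(expired)
    try:
        while remaining:
            listener(remaining[0])
            remaining.pop(0)
    except Exception:
        pass  # the whole pass is abandoned, as described for the container
    return remaining


leaked = cleanup_all(make_sessions(), fragile_listener)
fixed = cleanup_all(make_sessions(), guarded(fragile_listener))
print(len(leaked), len(fixed))
```

In the unguarded run the bad session stays at the head of the queue, so every later pass stalls at the same place and the backlog only grows. The guarded listener trades that for a logged error per bad session, which matches the "log it now, fix it later" approach in the story.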
|
| So, I very quickly solved this problem and waited for my $5000
| spot bonus. And waited. And waited...
|
| I went back to the manager and asked him about it. Over the next
| few weeks, he proceeded to tell me the following series of
| stories:
|
| * It was in the works, and I'd have it soon.
|
| * He had to get approval from his superiors.
|
| * Because so many people had worked on the problem, it was
| decided that it should be split among the group, and that I'd
| have to share it with the people that couldn't fix it.
|
| * No bonus.
|
| So even though it was his idea to try to bribe me to fix a
| problem, they still failed to follow through on it.
|
| Another story: We had an issue once where they finally brought me
| in after a year of problems. One of our Java systems was failing
| intermittently, and the development team had given up and
| couldn't figure out what was wrong. The boss told me it was now
| my problem, that I was to dedicate myself 100% of the time to
| solving the problem, and I could rewrite as much of the
| system as needed, basically total freedom (and responsibility).
| About halfway through the spiel where they were talking about the
| architecture and implementation, someone mentioned that the
| system was dumping core. I immediately stopped them right there.
|
| Me: You realize that if it's a coredump, it's not our fault,
| right?
| Boss: Huh?
|
| Me: If a Java program coredumps, it's either a bug in a 3rd party
| JNI library, a bug in the JVM, or a bug in the OS. What did the
| coredump show?
| Boss: Wha?
|
| Me: You guys have had this problem for a year and haven't looked
| at the coredumps?
| Boss: Blurgh?
|
| So I fire up dbx and take a look at the last few coredumps.
| Pretty much instantly I can see the problem is in a JDBC type 2
| (JNI native code) driver for DB2. We contact IBM, and after a
| bunch of hemming and hawing they admit there's a problem that's
| fixed in the latest driver patch. We upgrade the driver and poof!
| the problem is gone.
|
| We had a year of failures, causing problems for customers, as
| well as all the wasted man hours trying to fix something in our
| code that simply could not have been fixed that way, all because
| the main dev team for this product had no idea how to debug it. I
| had an answer within 30 minutes of being brought in to the
| problem, and the solution was deployed within days.
| wccrawford wrote:
| While it wasn't $5k, I was in a situation where I was told that
| I'd get something specific if I fixed a particular bug, and
| then didn't get it.
|
| OTOH, I have gotten bonuses after the fact that weren't talked
| about at all beforehand. IMO, that works out a lot better for
| everyone.
|
| I've decided that promised bonuses mean _nothing_ and are a
| sign of deceit (why bribe me to do my job?) and to absolutely
| ignore them.
|
| I do very much appreciate the expected bonuses, though. I've
| had them both in cash and time off and I'm not even sure which
| I prefer.
| wccrawford wrote:
| Ugh. I meant _unexpected_ bonuses there at the end. That one
| mistake changes that whole sentence.
| toast0 wrote:
| At a smaller scale, you don't necessarily need a dedicated
| team, you can just have a couple people who know to look for
| core dumps and know how to look at them (although gdb exe
| exe.dump ... bt gets you 90% of the way there 90% of the time),
| and whatever their real job is, it can be deprioritized to deal
| with urgent issues elsewhere without much fuss.
|
| If you get to the point where the fixers are always mostly busy
| fixing things, you can give their real jobs away.
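The "gdb, then bt" step mentioned above is easy to script for triage. A minimal sketch, assuming gdb is installed and on PATH; `backtrace_command` and `print_backtrace` are hypothetical helper names, and the executable and core paths are whatever your system produced. `--batch` and `-ex` are standard gdb command-line options.

```python
import subprocess
from typing import List


def backtrace_command(executable: str, core_file: str) -> List[str]:
    """Build a non-interactive gdb invocation that prints a backtrace and exits."""
    return ["gdb", "--batch", "-ex", "bt", executable, core_file]


def print_backtrace(executable: str, core_file: str) -> str:
    """Run gdb on the core file and return whatever it wrote to stdout."""
    result = subprocess.run(
        backtrace_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero on a damaged core; keep its output
    )
    return result.stdout
```

Calling `print_backtrace("./exe", "./exe.dump")` would return the same backtrace you would get by typing `bt` in an interactive session, which makes it easy to attach to an incident ticket.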
| Darkphibre wrote:
| Please tell me you found better pastures! A place where you
| were compensated and appreciated. :)
|
| I've been at the current job for 10 years, receiving the
| highest level of performance review for all but the last year
| (the global pandemic was one of four life upheavals... I'm glad
| I pulled off a 'great' review vs. 'exceptional'), doing similar
| style work of solving problems people can't seem to
| comprehend... yet I just can't seem to break into the next
| job title tier because "We don't need a principal."
|
| ...
|
| I just found out my lead was just promoted to principal. Once
| the divorce is off my plate, I plan on taking more risk and
| jumping ship.
| ghoward wrote:
| > yet I just can't seem to break into the next job title
| tier because "We don't need a principal."
|
| > I just found out my lead was just promoted to principal.
|
| Brutal. Good luck.
|
| Got a plan?
| gknoy wrote:
| I wonder if that's a situation where the bosses above your
| boss think that there should be a proper "shape" to a team's
| rankings (e.g., can't have a team with too many principals, or
| all SWE 3s, etc). Even when one's manager is willing to go to
| bat for you, sometimes they get told they can only promote
| one person.
| g051051 wrote:
| I didn't mind so much, as I was well compensated in general.
| It was only in the last few years that Agile cultists took
| over and managed to ruin the place completely. After a long,
| painful decline, my entire group was laid off, from senior
| managers down to new hires. Fortunately, I immediately found
| a better job.
|
| Oh, and I had been there for 22 years.
| geodel wrote:
| What amazing stories you have shared! Glad you changed
| jobs. Totally agree on Agile cultists. I am seeing similar
| decline here at work.
|
| Obsession with hour tracking takes priority over any other
| thing. Work that one good engineer would do in 4-6 weeks
| on a single ticket is now broken into 100 little JIRA
| shitlets for 6 person-months divided among 5 developers. At
| the end of it no one really has a clue what has really been
| achieved. But since 100 JIRA stories have been completed,
| it must be a great milestone.
| g051051 wrote:
| The obsession with Agile, Scrum, sprints, and points
| ground our entire organization to a halt.
| xorcist wrote:
| I hear you. Tracing and core dumps are just part of being a
| programmer. At least it should be.
|
| Try to have a conversation about this with anyone for whom modern
| architecture means not only that there is nothing to analyze, but
| that all software everywhere should just be restarted on
| every failure and all state thrown away because "stateless" and
| "cloud". The best environment is when no one can log in
| anywhere.
|
| It's not a problem that no one can analyze anything because
| nothing is ever analyzed anyway. Software simply _shouldn't_
| be fixed, it should be built upon.
|
| It seems to be pretty much everyone's state of mind these
| days, and I feel completely powerless about it.
| oaw-bct-ar-bamf wrote:
| Sometimes the root cause for problems is buried one or two
| abstraction layers deeper than the "responsible team" is
| comfortable to work at. This is where the "lower level" expert
| comes into play.
|
| At my place the abstractions start at high level user facing
| code, reach down into the hardware interface (driver) code and
| down to the actual electronic circuit design.
|
| Something can go wrong EVERYWHERE. If the circuit
| deteriorates too fast, you need to get the materials analysts to
| "debug" it.
|
| Rule of thumb is: there is always one layer of abstraction
| below you where stuff can break which feels like magic to you
| but is fixed with a glance from the lower level guy
| g051051 wrote:
| I've just always considered understanding and being able to
| debug your environment to be a standard part of the job. As a
| professional software developer working in a Solaris
| environment, knowing how Solaris works, how to use the shell,
| tools, and other stuff is just basic.
|
| Back in the 90's I worked at a company doing development on
| HP Apollo workstations. They were X based and used CDE for
| the desktop environment. When I started there, I invested
| some time to learn how it all hung together, how to leverage
| ToolTalk, and customized my system to do cool stuff.
|
| There was another developer who had started there a month
| before me, and had the same workstation. They had a problem
| one day that they wanted my help with, so I went to their
| desk for the first time. I found their screen had the default
| X stipple pattern, and a single terminal window 80 characters
| wide, maybe 2/3rds of the screen tall, and placed somewhat
| off center in the screen.
|
| I thought it slightly odd, but was distracted by the problem
| at hand. So at one point when they had some code up in the
| single window in vi, I asked them to open another terminal so
| we could do some other stuff. They just sort of sat there, so
| I asked again. They got agitated and snapped that they didn't
| know how to do that.
|
| This was a professionally employed C programmer, with several
| years of experience prior to this, who had been working with
| this equipment for several months, and _didn't know how to
| open a second terminal window_. They didn't have CDE running,
| because when you log in you can select your desktop, and they
| kept picking just a plain, raw X session with the default
| xterm. They were completely and utterly uninterested in the
| capabilities of the hardware, OS, and desktop environment.
|
| The same goes for the issue with the crashing JNI driver in
| my other comment. Maybe you don't know how to write your own
| JNI stuff, but I just expected that any developer who'd been
| using Java for any length of time would at least _know_ about
| how the JVM works in general, and what a coredump means in
| the context of a Java program. Specifically, that it's not
| your software that's causing it, and that trying to fix it by
| rewriting and debugging your Java code is a waste of time.
| toast0 wrote:
| > I've just always considered understanding and being able
| to debug your environment to be a standard part of the job.
| As a professional software developer working in a Solaris
| environment, knowing how Solaris works, how to use the
| shell, tools, and other stuff is just basic.
|
| If this was a standard part of the job, it wouldn't be my
| superpower. Working in a team where everyone (or almost
| everyone) can debug is amazing!
| djoldman wrote:
| > I'm not going to do a team-by-team breakdown of teams that pay
| for themselves many times over because there are so many of them,
| even if I limit the scope to "teams that people are surprised
| that Twitter has".
|
| I assume Dan means that "teams that pay for themselves" are teams
| where the total cost of employing the team is <= the decrease in
| company expenditure that can reasonably be attributed to the
| team.
|
| If that is the case, two things come to mind:
|
| 1. What is the likelihood that that company expenditure would
| have decreased without the team? (over time bugs/improvements are
| fixed/implemented by other employees or outside parties (open
| source))
|
| 2. If instead of spending money on the team that decreased
| expenditure, the company had spent money on a team that increased
| revenue, what would the relative difference be for profits?
|
| This may be complex because the cost to process a byte of
| information using the same process almost always gets cheaper
| over time (am I wrong?).
| feoren wrote:
| > 2. If instead of spending money on the team that decreased
| expenditure, the company had spent money on a team that
| increased revenue, what would the relative difference be for
| profits?
|
| Why not both?
| droopyEyelids wrote:
| The trick is to realize this is a gray area that can't be
| measured precisely, and >as a leader, your organization will
| find evidence to support whichever argument you favor<
|
| It takes a special type of MBA brain to imagine you can project
| the future enough to foresee the outcome of outsourcing a "cost
| center" and putting that budget into a team that will define
| their metrics in increased revenue.
| bluGill wrote:
| > What is the likelihood that that company expenditure would
| have decreased without the team? (over time bugs/improvements
| are fixed/implemented by other employees or outside parties
| (open source))
|
| Don't forget the time part of this. If the issue is fixed, but
| not for 2 years, that needs to count for having the team.
| Nobody is giving real numbers, but reading between the lines
| you can guess that some of these changes are saving
| $100,000/month to the company, so 2 years is $2.4 million,
| and you are assuming you get the change at all.
|
| Most of the above savings are on the electric bill: computers
| use less power, and therefore need less AC to cool them. Some
| of it is also that the company buys fewer computers.
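The cost-of-delay arithmetic in this comment, written out. The $100,000/month figure is the commenter's own between-the-lines guess rather than a reported number, and `cost_of_delay` is a hypothetical helper name.

```python
def cost_of_delay(monthly_saving: int, months_delayed: int) -> int:
    """Money left on the table if a fix arrives `months_delayed` months later."""
    return monthly_saving * months_delayed


# The comment's guess: $100,000/month, with the fix landing 2 years later.
print(cost_of_delay(100_000, 24))
```

That amount counts toward the value of having the team even if someone else would eventually have fixed the same issue.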
| RandomLensman wrote:
| It's a shame that this gets forgotten so often. Most of the very
| high value extracting places pay a huge premium to recruit and
| retain the best people for anything that is remotely considered
| to be value-add.
|
| Similarly overlooked: vertical integration is rarely (only) about
| costs but precisely about removing the conflicts and challenges
| inherent in outsourcing.
| zwieback wrote:
| For larger companies that have the luxury of even contemplating
| developing in-house expertise the next big question is: buy or
| hire external expertise or develop in-house. These aren't easy
| questions since the investments are large and take a long time to
| pay off. It's easy to praise the good decisions some companies
| have made or laugh at the failures but someone has to make these
| big calls at some point and it's usually not the engineers and
| developers.
| bluGill wrote:
| As engineers we should be thinking about this and telling
| management.
|
| A few years back we decided to change the format all our
| graphics were stored in - which in turn meant calling new APIs
| to draw them. After a few rounds of meetings to figure out how
| many graphics there were and how much this would cost I
| realized this wasn't something we should do. I continued to
| estimate, but I sent a strong email to my boss "As a tech lead
| I forbid all in house engineers from doing this work, there is
| no long term value in learning how to do it, and we can hire third
| party contractors who know the new API better than us. Also we
| need to combine the contract with other divisions needing to do
| this same thing as it isn't worth scripting things for just us
| but a large contract will write a script for some of the work
| saving time and money". Immediately the whole tone of
| conversations changed around the company - managers (and I
| assume their technical people) realized I was right and all got
| together to get one contract to get the job done.
___________________________________________________________________
(page generated 2021-09-29 23:02 UTC)