[HN Gopher] Lucky Lotto, chaos engineering but for teams
___________________________________________________________________
Lucky Lotto, chaos engineering but for teams
Author : delebe
Score : 128 points
Date : 2021-07-01 06:58 UTC (16 hours ago)
(HTM) web link (danlebrero.com)
(TXT) w3m dump (danlebrero.com)
| physicsgraph wrote:
| I have a different approach for creating a similar outcome:
| everyone rotates projects on a periodic basis. The period of
| change is longer than a sprint duration so as to give people time
| to acclimate to the new-to-them project.
|
| [0] https://graphthinking.blogspot.com/2021/06/periodic-
| rotation...
| GuB-42 wrote:
| Your approach is lawful, the one in the article is chaotic.
|
| Your approach is planned, with well defined onboarding and
| offboarding and with a report in the end. Personally, I hate it
| just for the idea of having to write a report.
|
| The article's approach is unpredictable. At the last moment, you
| know you are not part of the team anymore, and all
| communications are cut. People have to take over even if you
| are in the middle of something. It also involves managers. I
| prefer this, if anything, just because there is no report to
| write except if things go wrong (i.e. you break the "no
| communication" rule).
|
| The chaotic version of your solution would involve periodically
| drawing two team members, including managers and having them
| switch teams immediately. For each individual, the period of
| change would follow an average but be random. And no reports :)
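|
| A minimal sketch of the chaotic variant described above, assuming
| teams are plain lists of names and that "follow an average but be
| random" is modelled as an exponential waiting time. The names
| next_swap_delay, swap_two and the teams layout are hypothetical,
| not from the article.

```python
import random


def next_swap_delay(mean_days, rng):
    """Random wait until the next swap, averaging mean_days over many draws.

    Assumption: an exponential distribution is one simple way to get
    "random, but with a known average"; the comment names no distribution.
    """
    return rng.expovariate(1.0 / mean_days)


def swap_two(teams, rng):
    """Pick two different teams, one member from each, and swap them in place."""
    team_a, team_b = rng.sample(sorted(teams), 2)
    person_a = rng.choice(teams[team_a])
    person_b = rng.choice(teams[team_b])
    # Replace each person with the other so team sizes stay the same.
    teams[team_a][teams[team_a].index(person_a)] = person_b
    teams[team_b][teams[team_b].index(person_b)] = person_a
    return person_a, person_b


# Hypothetical example data: managers are drawn like everyone else.
teams = {
    "payments": ["ana", "bo", "manager_p"],
    "search": ["cy", "di", "manager_s"],
}
rng = random.Random()
moved = swap_two(teams, rng)
delay = next_swap_delay(30, rng)
```

| Because the swap is immediate and the delay is random, nobody can
| prepare a handover in advance, which is the point of the chaotic
| version.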
| cableshaft wrote:
| This is extremely clever, and something I've never even heard of
| before, bravo. The closest I've ever seen to this sort of
| practice in the wild is periodically shifting people around so
| that more than one person knows how a system works. But more
| often I've seen no attempts at this, and a mad scramble with KT
| meetings when someone gives their two weeks notice.
|
| I'm expecting a similar mad scramble when I put in my two weeks
| at my current company (hopefully within the next two months,
| interviewing with various companies now).
| dchoi315 wrote:
| This is a great idea! It sounds similar to controlled chaos
| and/or engineering serendipity, where you intentionally create
| chaos in order to uncover some sort of new skill or increase
| productivity
| jgerrish wrote:
| Could you imagine being part of this? Told the company was
| running DiRT testing or whatever, and then getting fired or
| punished for listening and following instructions?
|
| Could you imagine afterwards being lectured to use common sense?
| Also, we have secret corporate policies you need to know about
| and follow. Resolve that catch-22. Repeat for a decade.
|
| You'd be justified in never trusting them with your life or
| a loved one's again.
| detaro wrote:
| ... did you intend to reply here, or on some other post?
| jgerrish wrote:
| Oh, also, really clever! I love studying resiliency in complex
| systems!
| mingusrude wrote:
| I have used similar discussion points as in this article to argue
| that parental leaves, long vacations and other similar employee
| benefits build resilient teams and products. If you have to plan
| for people going missing (without the possibility for immediate
| replacement) you are forced to spread knowledge in the
| organization and you can't be too dependent on individuals.
| insomniacity wrote:
| I like this - we've had "reduce our bus factor" on the to-do list
| for a while, but not found a way to do it yet.
| Cthulhu_ wrote:
| The challenge there is that you need to hire and train more
| people than you have right now, and hiring is difficult. I mean
| probably not so much if you have a lot of money.
|
| I'm currently in a high bus factor job, I'm the only developer
| on the UI - our CTO can do some small jobs here and there, but
| nobody's touched or even looked at the new UI I'm building (Go
| + React).
|
| I want to be able to leave, but the way things are going - and
| the way recruiters are spinning up again - I'm afraid I'll have
| to bring them the bad news that I got an offer I can't refuse
| (and that they can't match; we're talking up to 150% pay rise /
| benefits).
|
| What my company needs is a big bag of money so we can hire
| contractors to fast forward this project. Which will be a short
| term solution, but still. But this company isn't eager, it's
| run by mid-late career veterans who are happy with things
| bumbling along and a decent 15%/year growth. I can kinda
| respect that, but for my project that's not good enough. And
| it's an unsexy company, so they struggle to hire anyone.
| ckdarby wrote:
| They'll survive. It is common to overvalue yourself when
| leaving the company but the reality is the company will adapt
| just like it always has and if it doesn't it would have died
| even with you there from choices that didn't allow it to
| adapt.
|
| Just accept the offer and move on. It is the best thing you
| can do for a company like this.
| eru wrote:
| > I'm currently in a high bus factor job, [...]
|
| That's actually a low bus factor, isn't it?
|
| https://en.wikipedia.org/wiki/Bus_factor
| retzkek wrote:
| From the article you linked: "There is a rare alternative
| definition for the bus factor, namely: the number of people
| who are indispensable for the project. In other words, it
| is the minimum number of people who are a single point of
| failure. If using this definition, then a high bus factor
| is considered a bad thing (since the loss of any person
| included destroys the project), and zero is considered the
| ideal bus factor."
|
| Perhaps "bus risk" would be a better term for this usage?
| layoutIfNeeded wrote:
| Ummm, _reducing_ the bus factor is probably the opposite of
| what you would want.
| jweather wrote:
| Reducing the impact of the bus factor, then. Kind of like
| turning down the air conditioning... which way does the
| thermostat go?
| metters wrote:
| I guess he/she is using the alternative definition of
| the bus factor:
|
| > There is a rare alternative definition for the bus factor,
| namely: the number of people who are indispensable for the
| project. In other words, it is the minimum number of people
| who are a single point of failure. If using this definition,
| then a high bus factor is considered a bad thing (since the
| loss of any person included destroys the project), and zero
| is considered the ideal bus factor.
|
| Source: https://en.wikipedia.org/wiki/Bus_factor
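|
| To make the two definitions discussed above concrete, here is a
| small sketch. It assumes a simplified model in which the project
| stalls as soon as any component has nobody left who knows it; the
| knowledge map and function names are hypothetical.

```python
def bus_factor(knowledge):
    """Usual definition: the fewest people whose loss stalls the project.

    Under the simplifying assumption above, that is the size of the
    smallest group of people covering any single component.
    """
    return min(len(people) for people in knowledge.values())


def indispensable_people(knowledge):
    """Alternative definition: people who are the only one covering a component."""
    return {next(iter(people)) for people in knowledge.values() if len(people) == 1}


# Hypothetical team: component -> set of people who can work on it.
knowledge = {
    "ui": {"alice"},
    "api": {"alice", "bob"},
    "db": {"bob", "carol"},
}

print(bus_factor(knowledge))            # 1: losing alice alone stalls "ui"
print(indispensable_people(knowledge))  # {'alice'}
```

| With the usual definition a higher number is better; with the
| alternative one (the size of the second result) zero is ideal,
| which is why "reduce the bus factor" reads both ways in this
| thread.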
| kriro wrote:
| The most practical solution: Hire new people. Let the low-bus
| factor people mentor the new people to become them 2.0 (long
| term).
|
| Downside: Takes time (and removes mentor from tasks deemed more
| important short term), costs money, people might need to get
| used to becoming mentors (I found most people enjoy this even
| if they cannot imagine doing it the first time around)
|
| Upsides: Long term thinking and investing in employees is
| rarely wrong (imo)
| newswasboring wrote:
| I don't think this is a mechanism to reduce the bus factor, to
| me it seems like a mechanism to make people realize what the
| bus factor is for the team. The actual solutions were not
| described in the article but I guess that is very team
| specific.
| GauntletWizard wrote:
| Aligning the pain with the people who can solve it is the
| first rule of getting things done. Make the developers who
| will have to take on the bus-factor workload realize what the
| bus factor is, and they'll figure out how to reduce their
| dependencies real quickly.
|
| Another good take on this is the "Wheel of Misfortune"; Take
| a real incident (or synthesize one in testing) that paged
| someone inconveniently. They've already solved it - They're
| running the exercise. Now have everyone else on the team,
| individually or as a group, figure out how to solve it. Not
| from the debugging steps or postmortem of the incident, but
| before that's been shared - Have them all suggest what they'd
| look for, and how to fix it. Their most important resource is
| their instructor/adversary for this, so give them all the
| data, but no map to it. Further, have the instructor figure
| out how to respond to all the unknowns - What else was broken
| because of the incident? What would have happened if they
| tried different debugging steps? Builds team knowledge real
| fast.
| insomniacity wrote:
| Yes, good point, I guess the first step is realising you have
| a problem and proving it to management!
| tharkun__ wrote:
| Maybe you didn't read the whole article then. It totally
| describes how to do it. The bottom has a nice summary but
| I'll quote the solution parts of it here:
| > The winner will work on some side project. Still work.
|
| Not a solution to the bus-factor of people but a good
| solution to the "we never get to work on tech debt / platform
| work because features" problem. This alone is awesome about
| this.
|
| > Everybody, including product managers, gets one ticket every
| week, even if you don't want it.
|
| This, if done right, will result in Product Managers
| providing a vision and consistent answers to similar type
| questions. Thus the team can learn to anticipate their
| answers for minor things (which helps even in weeks where the
| PM isn't the winner) and in weeks in which he is on vacation
| (or wins again) the team isn't stuck waiting for a week for
| an important question that blocks development.
|
| > Team should avoid delaying the work for a week. Try to bring
| one of your colleagues to do the task with you or under your
| supervision.
|
| This is what to do, when you need to break "rule 3" (which
| states you have to be completely unavailable). It's a soft
| rule and I think the point is actually to break it a lot in
| the beginning. The "under supervision" part means, you are
| teaching another team member part of what made you the one
| with the bad bus factor. As time goes on, having to break
| this rule will become less frequent (which is why they wanted
| everyone to write down when they have to break rule 3 if you
| ask me)
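|
| The rules quoted above can be sketched roughly as follows. This is
| an illustration only, assuming a uniform draw (one ticket each)
| and a simple in-memory log of rule 3 breaks; draw_winner and
| record_rule3_break are hypothetical names, not anything the
| article provides.

```python
import random


def draw_winner(team, rng):
    """Everyone holds exactly one ticket, so each member is equally likely to win."""
    return rng.choice(team)


def record_rule3_break(log, week, winner, task, helper):
    """Note each time the winner had to be contacted, and who did the task with them."""
    log.append({"week": week, "winner": winner, "task": task, "helper": helper})
    return log


# Hypothetical team; product managers get a ticket too.
team = ["dev_a", "dev_b", "dev_c", "product_manager"]
rng = random.Random()
winner = draw_winner(team, rng)

breaks = []
# Example entry: the winner could not stay unavailable, so a colleague
# did the task under their supervision.
record_rule3_break(breaks, week=1, winner=winner, task="release deploy", helper="dev_b")
```

| If the log shrinks week over week, knowledge is spreading; the
| same task or the same winner showing up repeatedly points at where
| the bus factor is worst.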
| newswasboring wrote:
| Keeping aside the insulting nature of suggesting someone
| did not read the article, here is my response.
|
| I like to differentiate between mechanism to make people
| realize a problem and actual solutions to the problem. It's
| very easy to say mentor someone to do your job but if that
| was doable and easy they would be already doing it. The
| problem is precisely that there is no real good way to do
| KTs. Forcing someone to solve the problem is one way to KT,
| but is that the most efficient way? I would rather gather
| data about what breaks down and come up with a more
| efficient KT mechanism.
| tharkun__ wrote:
| FWIW I operate under the assumption that HN is just as
| bad as Slashdot with regards to commenting without
| reading the article :) and the solutions were pretty
| clear to me, even without the nice summary at the end.
| Sorry if it wasn't worded softly enough, no insult meant,
| more an observation.
|
| The parts I mentioned are not the 'realization' part.
| They are the solution parts. They aren't the
| 'implementation details' of the solution, I would agree,
| but they are the solution.
|
| If you ask me the problem is not usually that there is no
| good way to do KT or that it just isn't doable at all.
| The problem is that in most businesses due to their
| 'culture' (for lack of a better word - not to start
| "that" culture discussion) you do not actually get to do
| it. A good way to mentor and transfer knowledge for
| example is to do pair programming. Not many places allow
| for that and will look at you funny for even suggesting
| it (and I'm not even personally on the extreme end of
| that like some companies, where everyone literally pair
| programs 100% of the time - I like a sort of hybrid
| model, where people pair for as long as it makes sense to
| them, which could be sitting there "designing" for a
| couple of hours together, maybe dividing things up after
| a basic structure is in place and then working on their
| own for the rest of the day w/ some quick 5 minute sync
| ups and questions going back and forth from time to
| time). Another very doable way to do knowledge transfer
| is to specifically not let the one guy that wrote
| that part of the code and knows it inside and out work on
| the next ticket that will need changes to that part. But
| many Product Managers/businesses/team leads will not
| allow that because it would mean that the task will be
| delivered slower.
|
| The beauty of this approach is that you don't need to
| actively gather data, make a decision etc. Gathering this
| data is usually very error prone in that you can fill out
| forms and skills matrices and such all you want (been
| there done that), you always forget about something or it
| doesn't really tell you the whole story (skills matrices
| are particularly bad)
|
| With this, it just happens! It's the self organizing way
| of dealing with the problem. I think you might be putting
| too much emphasis on the "completely unavailable" part,
| whereas I see the "it's a soft rule" part bigger. Instead
| of waiting for someone to go on vacation and then you
| find out that he was the only one that can do X (there's
| your data point that you missed during collection and
| analysis) and now you're in deep trouble, because he's
| touring the Amazon rain forest (i.e. definitely no cell
| reception there), you get to figure it out because he won
| the lottery and he's actually at work and can tutor
| someone through it all.
| jedberg wrote:
| I heard a good presentation about this. They did it for a team at
| Google. But it was daily, and you weren't allowed to tell anyone
| else. You just didn't respond to any requests for that day.
|
| But even worse, you could also be assigned the "liar" task. In
| which case you were supposed to reply to emails but give
| intentionally wrong information some of the time. But in that
| case you would tell people that you were the liar and that your
| answers aren't to be trusted.
|
| It seemed like a good way to make sure at least two people could
| do everything and that documentation was solid enough that you
| could recognize when someone was wrong.
| geoduck14 wrote:
| > But it was daily, and you weren't allowed to tell anyone
| else. You just didn't respond to any requests for that day.
|
| I'm 100% certain I've worked with people that did this. I never
| realized they were DR visionaries
| eru wrote:
| When at Google, we did the simpler version (of not replying)
| for our DiRT week exercises. DiRT stands for disaster recovery
| testing.
| edenhyacinth wrote:
| I wonder if this encouraged people to be more open with their
| ideas also.
|
| If I could say things knowing that if I made a mistake, someone
| might kindly correct me with "Ah, you're the liar today, S3
| doesn't support that format", I'd be happier to make them.
| treeman79 wrote:
| How are people afterwards?
|
| I refuse to use email at work anymore because of the constant
| phishing tests. Sick of the gotcha mentality.
| welcome_dragon wrote:
| I agree. We have a security score at work and mine had been
| zero for a long time. I realized that you need to report the
| emails as spam or phishing and not just ignore/delete like I
| usually do.
| newswasboring wrote:
| I am genuinely fascinated how they managed to piss you off so
| much with phishing tests. (For me email is the backbone of
| all permanent office communication). Was the frequency too
| high (reducing the signal to noise ratio too much), or just
| the fact that they doubt your ability to avoid falling for such a
| thing?
| treeman79 wrote:
| 2-3 a month. Often spoofing other coworkers, so I need to be
| cautious with all emails.
|
| I use it now only to see if notifications came in from
| calendar or such. Even then it's just looking at subject
| lines.
|
| I then go straight to the app and check for actual message
| / event.
| noneeeed wrote:
| I was wondering that too. I'm guessing it's frequency. We
| get them, but they are once every few months. The only
| annoying thing about them is I have to open outlook to
| actually report them as I normally use Airmail.
| madhadron wrote:
| Banks require a planned version of this: you have to take so many
| contiguous days off at least once a year. But it's an anti-fraud
| measure, as most of the frauds you can run internally require you
| to be there to juggle things, and you make the number of
| contiguous days long enough that such a scheme will come
| crashing down.
| thedougd wrote:
| Some stopped in the last few years, or at least for certain job
| categories. It was a wonderful thing when it was around because
| you could take a vacation with absolutely no remote access.
| toss1 wrote:
| Great idea, similar to what I understand is done in accounting.
| The idea is to force everyone in the actg dept to take regular
| vacations of several weeks, requiring others to cover their
| tasks. This is so that others can uncover financial sleight-of-
| hand to prevent embezzling (or at least structure it such that
| successful embezzling requires a larger conspiracy, increasing
| the likelihood of getting caught).
___________________________________________________________________
(page generated 2021-07-01 23:02 UTC)