[HN Gopher] Ask HN: Do you find working on large distributed sys...
___________________________________________________________________
Ask HN: Do you find working on large distributed systems
exhausting?
I've been working on large distributed systems for the last 4-5 years,
with teams owning a few services or having different responsibilities
to keep the system up and running. We run into very interesting
problems due to scale (billions of requests per month for our main
public apis) and the large amount of data we deal with. I think it
has progressed my career and expanded my skills but I feel it's
pretty damn exhausting to manage all this even when following a lot
of the best-practices and working with other highly skilled
engineers. I've been wondering recently if others feel this kind
of burnout (for lack of a better word). Is the expectation that
your average engineer should now be able to handle all this?
Author : wreath
Score : 262 points
Date : 2022-02-19 11:58 UTC (11 hours ago)
| nixgeek wrote:
| Quite the opposite, interestingly, I'm usually in "Platform"-ish
| roles which touch or influence all aspects of the business, inc.
| building and operating services which do a couple orders of
| magnitude more than OP's referenced scale (in the $job[current]
| case, O(100B - 1T) requests per day) and while I agree with the
| "Upside" (career progression, intellectual interest, caliber of
| people you work with), I haven't experienced the burnout and in
| 2022 am actually the most energized I've been in a few years.
|
| I expect you can hit burnout building services and systems for
| any scale and that's more reflective of the local environment --
| the job and the day to day, people you work with, formalized
| progression and career development conversations, the attitude to
| taking time off and decompressing, attitudes to oncall,
| compensation, other facets.
|
| That said, mental health and well-being are real and IMO need to
| be taken very seriously; if you're feeling burnout, figuring out
| why and fixing that is critical. There have been too many
| tragedies both during COVID and before :-(
| jedberg wrote:
| I find it exhilarating, but you have to have a well architected
| distributed system. Some key points:
|
| - Your microservice should be able to run independently. No
| shared data storage, no direct access into other microservices'
| storage.
|
| - Your service should protect itself from other services,
| rejecting requests before it becomes overloaded.
|
| - Your service should be lenient on the data it accepts from
| other services, but strict about what it sends.
|
| - Your service should be a good citizen, employing good backoffs
| when other services it is calling appear overloaded (a minimal
| sketch of this appears at the end of this comment).
|
| - The API should be the contract and fully describe your
| service's relationship to the other services. You should
| absolutely collaborate with the engineers who make other
| services, but at the end of the day anything you agree on should
| be built into the API.
|
| Generally if you follow these best practices, you shouldn't have
| to maintain a huge working knowledge of the system, only detailed
| knowledge of your part, which should be small enough to fit into
| your mental model.
|
| There will be a small team of people responsible for the entire
| system and how it fits together, but ideally if everyone is
| following these practices, they won't need to know details of any
| system, only how to read the APIs and the call graph and how the
| pieces fit together.
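|
| A minimal, illustrative sketch of the backoff and overload points
| above (in Python; call_downstream and the limits are hypothetical
| placeholders, not anything from a real codebase):
|
|     import random
|     import time
|
|     def call_with_backoff(call_downstream, max_attempts=5,
|                           base_delay=0.1):
|         # Retry an overloaded dependency with exponential backoff
|         # plus full jitter, then give up.
|         for attempt in range(max_attempts):
|             try:
|                 return call_downstream()
|             except TimeoutError:
|                 if attempt == max_attempts - 1:
|                     raise
|                 time.sleep(random.uniform(0, base_delay * 2 ** attempt))
|
|     class LoadShedder:
|         # Reject inbound work once the service is saturated, so it
|         # protects itself before it falls over.
|         def __init__(self, max_in_flight=100):
|             self.max_in_flight = max_in_flight
|             self.in_flight = 0
|
|         def try_acquire(self):
|             if self.in_flight >= self.max_in_flight:
|                 return False  # caller should respond 429 / retry later
|             self.in_flight += 1
|             return True
|
|         def release(self):
|             self.in_flight -= 1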
| bofaGuy wrote:
| Worked on a team at BofA, our application would handle 800
| million events per day. The logic we had for retry and failure
| was solid. We also had redundancy across multiple DCs. I think we
| processed like 99.9999999% of all events successfully. (Basically
| all of them, last year we lost about 2,000 events total) I didn't
| find it very stressful at all. We built in JMX Utica for our
| production support teams to be able to handle practically anything
| they would need to.
| [deleted]
| bofaGuy wrote:
| Utils*
| tristor wrote:
| I think I understand what you mean, but it's hard for me to
| contextualize, because I'm still working through some of my own
| past to identify where some of my burn out began.
|
| For my part, I love working at global scale on highly distributed
| systems, and find deep enjoyment in diving into the complexity
| that brings with it. What I didn't enjoy was dealing with
| unrealistic expectations from management, mostly management
| outside my chain, for what the operations team I led should be
| responsible for. This culminated in an incident I won't detail,
| but suffice to say I hadn't left the office in more than 72 hours
| continuous, and the aftermath was I stopped giving a shit about
| what anyone other than my direct supervisor and my team thought
| about my work.
|
| It's not limited to operations or large systems, but every /job/
| dissatisfaction I've had has been in retrospect caused by a
| disconnect between what I'm being held accountable for vs what I
| have control over. As long as I have control over what I'm
| responsible for, the complexity of the technology is a cakewalk
| in comparison to dealing with the people in the organization.
|
| Now I've since switched careers to PM and I've literally taken on
| the role of doing things and being held responsible for things I
| have no control over and getting them done through influencing
| people rather than via direct effort. Pretty much the exact thing
| that made my life hell as an engineer is now my primary job.
|
| Making that change made me realize a few things that helped
| actually ease my burn out and excite me again. Firstly, the
| system mostly reflects the organization rather than the
| organization reflecting the system. Secondly, the entire cultural
| balance in an organization is different for engineers vs
| managers, which has far-reaching consequences for WLB, QoL, and
| generally the quality of work. Finally, I realized that if you
| express yourself well you can set boundaries in any healthy
| organization which allows you to exert a sliding scale of control
| vs responsibility which is reasonable.
|
| My #1 recommendation for you OP is to take all of your PTO
| yearly, and if you find work intruding into your time off realize
| you're not part of a healthy organization and leave for greener
| pastures. Along the way, start taking therapy because it's
| important to talk through this stuff and it's really hard to find
| people who can understand your emotional context who aren't mired
| in the same situation. Most engineers working on large scale
| systems I know are borderline alcoholics (myself too back then),
| and that's not a healthy or sustainable coping strategy. Therapy
| can be massively helpful, including in empowering you to quit
| your job and go elsewhere.
| lr4444lr wrote:
| Yes. But remember, with tools and automation getting better, this
| is a major source of value add that you bring as a software
| engineer which is likely to have long term career viability.
| z3t4 wrote:
| Often when I hear stories of billions of requests per second it's
| self inflicted because of over complicated architecture where all
| those requests are generated only by a few thousand customers...
| So it's usually a question of how the company operates: do you
| constantly fight fires, or do you spend your time implementing
| stuff that has high value for the company and its customers?
| Fighting fires can get you burned out (no pun intended), while
| feeling that you deliver a lot of value will make you feel great.
| primeletter wrote:
| > billions of requests per second
|
| Op said "billions of requests per month".
|
| That's ~thousands of qps.
| Kiro wrote:
| That's nothing.
| bob1029 wrote:
| > We run into very interesting problems due to scale (billions of
| requests per month for our main public apis) and the large amount
| of data we deal with.
|
| So, if you are handling 10 billion requests per month, that would
| average out to about 4k per second.
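| (Roughly: 10e9 requests / (30 days x 86,400 s/day) ~ 3,860
| requests per second on average.)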
|
| Are these API calls data/compute intensive, or is this more
| pedestrian data like logging or telemetry?
|
| Any time I see someone having a rough time with a distributed
| system, I ask myself if that system had to be distributed in the
| first place. There is usually a valuable lesson to be learned by
| probing this question.
| petters wrote:
| Yes! A single machine can handle tons of traffic in many cases.
| kortex wrote:
| It definitely can be. I'm constantly trying to push our stack
| away from anti-patterns and towards patterns that work well, are
| robust, and reduce cognitive load.
|
| It starts by watching Simple Made Easy by Rich Hickey. And then
| making every member of your team watch it. Seriously, it is _the_
| most important talk in software engineering.
|
| https://www.infoq.com/presentations/Simple-Made-Easy/
|
| Exhausting patterns:
|
| - Mutable shared state
|
| - distributed state
|
| - distributed, mutable, shared state ;)
|
| - opaque state
|
| - nebulosity, soft boundaries
|
| - dynamicism
|
| - deep inheritance, big objects, wide interfaces
|
| - objects/functions which mix IO/state with complex logic
|
| - code that needs creds/secrets/config/state/AWS just to run
| tests
|
| - CI/CD deploy systems that don't actually tell you if they
| successfully deployed or not. I've had AWS task deploys that time
| out but actually worked, and ones that seemingly take, but
| destabilize the system.
|
| ---
|
| Things that help me stay sane(r):
|
| - pure functions
|
| - declarative APIs/datatypes
|
| - "hexagonal architecture" - stateful shell, functional core
|
| - type systems, linting, autoformatting, autocomplete, a good IDE
|
| - code does primarily either IO, state management, or logic, but
| minimal of the other ops
|
| - push for unit tests over integration/system tests wherever
| possible
|
| - dependency injection
|
| - ability to run as much of the stack locally (in docker-compose)
| as possible
|
| - infrastructure-as-code (terraform as much as possible)
|
| - observability, telemetry, tracing, metrics, structured logs
|
| - immutable event streams and reducers (vs mutable tables)
|
| - make sure your team takes time periodically to refactor, design
| deliberately, and pay down tech debt.
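|
| A tiny, illustrative sketch of the "stateful shell, functional
| core" item above (Python; the names are invented for the example):
|
|     import json
|     import sys
|
|     def summarize(events):
|         # Functional core: pure logic, no IO, no shared state,
|         # trivially unit-testable.
|         total = sum(e.get("amount", 0) for e in events)
|         return {"count": len(events), "total": total}
|
|     def main():
|         # Stateful/imperative shell: all IO lives here.
|         events = json.load(sys.stdin)
|         print(json.dumps(summarize(events)))
|
|     if __name__ == "__main__":
|         main()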
| islandert wrote:
| I agree with most of your points, but the one that stands out is
| "push for unit tests over integration/system tests wherever
| possible".
|
| By integration/system tests, do you mean tests that you cannot
| run locally?
| solididiot wrote:
| I only read the transcript, but I'm not getting most of it. I mean,
| it starts with a bunch of aphorisms we all agree with, but when
| it should be getting more concrete it goes on with statements
| that are kind of vague.
|
| E.g. what exactly does it mean to:
|
| >> Don't use an object to handle information. That's not what
| objects were meant for. We need to create generic constructs that
| manipulate information. You build them once and reuse them.
| Objects raise complexity in that area.
|
| What kind of generic constructs?
| LoveGracePeace wrote:
| Most of that I agree with, I'm curious why you'd recommend unit
| tests over integration tests? It seems at odds with the
| direction of overall software engineering best practices.
| arielweisberg wrote:
| Not at all? Stuff is usually fixable.
|
| Org and people are not.
| rthomas6 wrote:
| https://youtu.be/y8OnoxKotPQ
| a_code wrote:
| You are right, I work for a FAANG on one such system and it's
| hard.
| phuff wrote:
| I think there are a lot of strategies for dealing with the kinds
| of issues you're working with, but a lot of them involve building
| a good engineering culture and building a disciplined engineering
| practice that can adapt and find best scalability practices at
| that level.
|
| We do billions of requests a day on one of the teams that I
| manage at work, and that team alone has sole operational and
| development responsibility for a large number of subsystems to be
| able to manage the complexity that a sustained QPS of that level
| requires. But those subsystems are in turn dependent on a whole
| suite of other subsystems which other teams own and maintain.
|
| It requires a lot of coordination with a spirit of good-will and
| trust among the parties in order to be able to develop the
| organizational discipline and rigor needed to be able to handle
| those kinds of loads without things falling over terribly all the
| time and everybody pointing fingers at each other.
|
| But! There are lots of great people out there who have spent a
| lot of time figuring out how to do these things properly and who
| have come up with general principles that can be applied in your
| specific circumstances (whatever they may be). And when executed
| properly I would argue that these principles can be used to
| mitigate the burnout you're talking about. It's possible to make
| it through those rough spots in an organization (that frequently,
| though not always, come from quick business scaling -- i.e. we
| grew from 1000 customers to 10,000 last year) etc.
|
| If you're feeling this way and the organization isn't
| taking steps to work on it, then there are things you can do as
| an IC to help, too. But this is all a much longer conversation :)
| timka wrote:
| I think it's more likely Zeitgeist. You see, someone else finds
| working in Data Science frustrating, another person nearing his
| 40s says he's anxious about his career, another guy says he's
| worried that it's too late to do something about big tech
| messing up the field, etc.
|
| I've had similar issues recently working at a demanding position
| I didn't really like even though my achievements may look
| impressive in my resume. I tried working in a shop somewhere in
| between aerospace and academia but just didn't fit at all. I
| ended up joining a small team that I enjoy working with so far
| and feel much better now.
|
| At a higher level, we're hitting the limits of the current
| paradigm in many ways, including the monetary system (debt), the
| environment (pollution) and natural resources, ideology (creativity
| and innovation), and technology (complexity).
|
| The good news is that this year the current monetary system will
| cease to exist. This will eventually restructure the economy to a
| healthier balance. Unfortunately, this will have severe social
| consequences as the standard of living will change dramatically
| (to somewhere around the 60's level). This will basically destroy
| the middle class and thus change the structure of consumption.
| Obviously, this will mostly affect services and other non-
| essential stuff we got used to. On the other hand, this will blow
| away all the bloat, like the insane market caps of big tech. That
| is, working in IT may become fun again, like 20 years back :)
| late2part wrote:
| The end is near?
| yodsanklai wrote:
| This post resonates with me. I recently joined a big organisation
| and a team owning such a system. The oncalls are very stressful
| to me. Our systems aren't that robust and we don't have control
| over all the dependencies. So things fail all the time. At the same
| time, management is consistently pushing for new features. As a
| consequence, work-life balance is bad and turnover is high.
|
| My hope is that I'll learn to manage the stress and gain more
| expertise.
| jmyeet wrote:
| It's hard to answer this because you don't specify what exactly
| you find exhausting. Is it oncall? Deployment? Performance
| issues? Dealing with different teams? Failures and recovery? The
| right hand not knowing what the left hand is doing? Too many
| services? Something else?
|
| It's not even clear how big your service is. You mention billions
| of requests per month. Every 1B requests/month translates to ~400
| QPS, which isn't even that large. Like, that's single server
| territory. Obviously spikiness matters. I'd also be curious what
| you mean by "large amount of data".
| wreath wrote:
| > Every 1B requests/month translates to ~400 QPS, which isn't
| even that large
|
| I said billions not one billion.
|
| I guess what I find exhausting is the long feedback cycle. For
| example, writing a simple script that makes two calls to
| different APIs requires tons of wiring for telemetry,
| monitoring, logging, error handling, integrating w/ two APIs,
| setting up the proper kubernetes manifests, setting up the
| required permissions to run this thing and have them available
| to k8s. I find all this to be exhausting. We're not even
| talking about operating this thing yet (on call, running into
| issues with the APIs owned by other teams, etc.)
| bfung wrote:
| Automate that process that you find tedious; if you find it
| tedious, ask your coworkers if they do as well. Make the
| right time/automation trade offs. https://xkcd.com/1205/
|
| Yes, work is tedious.
| sangnoir wrote:
| This sounds like your team/organization needs to invest in
| tooling. Processes that take long should ideally be automated
| and done async, with a notification of the result generated some
| time later, freeing up some of your time.
| [deleted]
| mistaPockets wrote:
| Arrezz wrote:
| I think our field is so broad that it is somewhat nebulous to
| talk about the average engineer. But from my experience, taking
| care of such a large system with a large amount of requests and
| complexity is outside of what is expected of an average engineer.
| I think that there is an eventual limit for how much complexity a
| single engineer can handle for several years.
| sillysaurusx wrote:
| Jobs aren't exhausting. Teams are. If you find yourself feeling
| this way, consider that the higher ups may be mismanaging.
|
| There's often not a lot of organizational pressure to change
| anything. So the status quo stays static. But the services change
| over time, so the status quo needs to change with them.
| softwarebeware wrote:
| Agree with this. Conway's Law will always hold. If a company
| does not organize its teams into units that actually hold full
| responsibility and full control/agency over that
| responsibility, those teams will burn out.
|
| When getting anything done requires constant meetings, placing
| tickets, playing politics, and doing anything and everything to
| get other teams to accept that they need to work with you and
| prioritize your tasks so that you can get them done, you will
| burn out.
| faangiq wrote:
| Yes the complexity and scale of these systems is far beyond what
| companies understand. The salaries of engineers on these systems
| need to double asap or they risk collapse.
| benlivengood wrote:
| Google's SRE books cover a lot of the things that large teams
| managing large distributed systems encounter and how to tackle them
| in a way that doesn't burn out engineers. Depending on
| organization size/spread, follow-the-sun oncall schedules
| drastically reduce burnout and apprehension about outages.
| Incident management procedures give confidence when outages do
| happen. Blameless postmortems provide a pathway to understanding
| and fixing the root causes of troublesome outages. Automation
| reduces manual toil. Google SRE has been keeping a lot of things
| running for a decade or more and has learned a lot of lessons. I
| did that from 2014 to 2018 and it seemed like a pretty mature
| organizational approach, and the books document essentially that
| era.
| m_herrlich wrote:
| I love building and developing software, and despite the fun and
| interesting challenges presented at my last job I quit because of
| the operations component. We adopted DevOps and it felt like
| "building" got replaced with "configuring" and managing complex
| configurations does not tickle my brain at all. Week-long on-call
| shifts are like being under house arrest 24/7.
|
| I understand the value that developers bring to operational
| roles, and to some extent making developers feel the pain of
| their screwups is appropriate. But when DevOps is 80% Ops, you
| need a fundamentally different kind of developer.
| throwhauser wrote:
| After-hours on-call is a thing that needs to be destroyed. A
| company that is sufficiently large that the CEO doesn't get
| woken up for emergencies needs to have shifts in other
| timezones to handle them. I don't know why people put up with
| it.
| bckr wrote:
| Part of it is a culture that discourages complaining about
| after hours work.
|
| There's an expectation that everyone is a night owl and that
| night time emergency work is fun, and that these fires are to
| be expected.
|
| Finally, engineers seem to get this feeling of being
| important because they wake up and work at night. It's really
| a form of insanity.
| daneel_w wrote:
| I find it very draining and vexing to work on systems that have
| all of their _components_ distributed left and right without clear
| boundaries, instead of being more coalesced. Distribution in the
| typical sense - identical spares working in parallel for the sake
| of redundancy - doesn't faze me very much.
| gorgoiler wrote:
| My number one requirement for a distributed system is that the
| code all be in one place.
|
| There are good reasons for wanting multiple services talking
| through APIs. Perhaps you have a Linux scheduler that is
| marshalling test suites running on Android, Windows, macOS and
| iOS?
|
| If all these systems originate from a single repository,
| preferably with the top level written in a dynamic language that
| runs from its own source code, then life can be much easier.
| Being able to _change multiple parts of the infrastructure in a
| single commit_ is a powerful proposition.
|
| You also stand a chance of being able to model your distributed
| system locally, maybe even in a single Python process, which can
| help when you want to test new infrastructure ideas without
| needing the whole distributed environment.
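|
| A toy, illustrative sketch of that local-modelling idea (Python;
| the service names are invented for the example - in production
| these would be separate processes talking over RPC):
|
|     class InventoryService:
|         def __init__(self):
|             self.stock = {"widget": 3}
|
|         def reserve(self, item):
|             if self.stock.get(item, 0) > 0:
|                 self.stock[item] -= 1
|                 return True
|             return False
|
|     class OrderService:
|         def __init__(self, inventory):
|             self.inventory = inventory  # an RPC client in production
|
|         def place_order(self, item):
|             return ("accepted" if self.inventory.reserve(item)
|                     else "rejected")
|
|     orders = OrderService(InventoryService())
|     assert orders.place_order("widget") == "accepted"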
|
| Your development velocity will be faster and less painful.
| Changes being slow and painful are what burn people out and grind
| progress to a halt.
| wreath wrote:
| > My number one requirement for a distributed system is that
| the code all be in one place.
|
| This is a major source of frustration. Having to touch multiple
| repositories and syncing and waiting for their
| deployment/release (if it's a library) just to add a small
| feature easily wastes a few hours of the day and most
| importantly drains cognitive ability by context switching.
| dudul wrote:
| Let's set aside the "distributed" aspect. To effectively scale a
| team and a code base you need some concept of "modularization"
| and "ownership". It is unrealistic to expect engineers to know
| everything about the entire system.
|
| The problem is that this division of the code base is really
| hard. It is really hard to find the time and the energy to
| properly section your code base in proper domains and APIs.
| Especially with the constantly moving target of what needs to be
| delivered next. Even in a monorepo it is exhausting.
|
| Now, put on top of that the added burden brought by a distributed
| system (deployment, protocol, network issues, etc) and you have
| something that becomes even more taxing on your energy.
| throwaway984393 wrote:
| It's exhausting when the business does not give you the support
| you need and leans on you to do too much work. Find another place
| to work where they do things without stress (ask them in the
| interview about their stress levels and workload). Make sure
| leadership are actively prioritizing work that shores up
| fundamental reliability and continuously improves response to
| failure.
|
| When things aren't a tire fire, people will still ask you to do
| too much work. The only way to deal with it without stress is to
| create a funnel.
|
| Require all new requests come as a ticket. Keep a meticulously
| refined backlog of requests, weighted by priorities, deadlines
| and blockers. Plan out work to remove tech debt and reduce toil.
| Dedicate time every quarter to automation that reduces toil and
| enables development teams to do their own operations. Get used to
| saying "no" intelligently; your backlog is explanation enough for
| anyone who gets huffy that you won't do something out of the blue
| immediately.
| tonto wrote:
| I don't really work on distributed systems but I do often worry
| about performance and reliability and even if I get some wins
| sometimes the anxiety of not performing right is stressful....
| thebackup wrote:
| My experience is that the expectations on what your average
| engineer should be able to handle have grown enormously during the
| last 10 years or so. Working both with large distributed systems
| and medium size monolithic systems I have seen the expectations
| become a lot higher in both.
|
| When I started my career the engineers at our company were
| assigned a very specific part of the product that they were
| experts on. Usually there were 1 or 2 engineers assigned to a
| specific area and they knew it really well. Then we went
| Agile(tm) and the engineers were grouped into 6 to 9 person teams
| that were assigned features that spanned several areas of the
| product. The teams also got involved in customer interaction,
| planning, testing and documentation. The days when you could
| focus on a single part of the system and become really good at it
| were gone.
|
| Next big change came when the teams moved from being feature
| teams to devops teams. None of the previous responsibilities were
| removed but we now became responsible also for setting up and
| running the (cloud) infrastructure and deploying our own
| software.
|
| In some ways I agree that these changes have empowered us. But it
| is also, as you say, exhausting. Once I was simply a programmer;
| now I'm a domain expert, project manager, programmer, tester,
| technical writer, database admin, operations engineer, and so on.
| ithkuil wrote:
| > None of the previous responsibilities were removed but we now
| became responsible also for setting up and running the (cloud)
| infrastructure and deploying our own software
|
| On the flipside, in the olden days when one set of people were
| churning features and another set of people were given a black
| box to run and be responsible for keeping it running, it was very
| hard to get the damn thing to work reliably and the only
| recourse you often had was to "just be more careful", which
| often meant release aversion and multi-year release cycles.
|
| Hence, some companies explored alternatives, found ways to make
| them work, wrote about their success but a lot of people copied
| only half of the picture and then complained that it didn't
| work.
| bckr wrote:
| > only half of the picture
|
| Can you please share some details about what you think is
| missing from most "agile"/devops teams?
| ithkuil wrote:
| Proper staffing
| bckr wrote:
| Ah excellent. Yes. In my experience there's this idea of
| "scale at all costs"--a better way would probably be to
| limit scaling until the headcount is scaled. Although
| then you probably need more VC money.
| altacc wrote:
| It sounds like whoever shaped your teams & responsibilities
| didn't take into account the team's cognitive load. I find it's
| often overlooked, especially by those who think agile means
| "everyone does everything". The trick is to become agile whilst
| maintaining a division of responsibilities between teams.
|
| If you look up articles about Team Topologies by Matthew
| Skelton and Manuel Pais, they outline a team structure that
| works for large, distributed systems.
| thebackup wrote:
| I'll have a look at the book. Thanks!
| ebbp wrote:
| It'd be interesting to know - what are the expectations made of
| you? In this environment, I'd expect there to be dedicated
| support for teams operating their services - i.e.
| SRE/DevOps/Platform teams who should be looking to abstract away
| some of the raw edges of operating at scale.
|
| That said, I do think there's a psychological overhead when
| working on something that serves high levels of production
| traffic. The stakes are higher (or at least, they feel that way),
| which can affect different people in different ways. I definitely
| recognise your feeling of exhaustion, but I wonder if it maybe
| comes from a lack of feeling "safe" when you deploy - either from
| insufficient automated testing or something else.
|
| (For context - I'm an SRE who has worked in quite a few places
| exactly like this)
| jsiaajdsdaa wrote:
| It's only exhausting when you know deep in your heart that this
| could run on one t2.large box.
| dekhn wrote:
| it's exhausting but can be fun if you have a competent team to
| support you. I like nothing more than being told "one TPU chip in
| this data center is bad. Find it efficiently at priority 0."
| bravetraveler wrote:
| I did at first, but then learning config management and taking
| smaller bites helped.
|
| I started out as a systems administrator and it's evolved into
| doing that more and faster. The tooling helps me get there, but I
| did have to learn how to give better estimates.
| axegon_ wrote:
| Depends. Not the systems themselves but more the scope of the
| work and how it is being done. If the field is boring or the
| design itself is bad (with no ability to make it better, whether
| it's simply by design, code quality or whatever), my motivation,
| will and desire to work teleport to a different dimension - it's a
| fine line between exhaustion and frustration I guess. If it is
| something interesting, I can work on it for days straight without
| sleeping. Lately I've been working on a personal project and
| every time I have to do anything else I feel depressed for having
| to set it aside.
| macksd wrote:
| Mental / emotional burnout is certainly not uncommon in tech
| (probably in most other careers, I'd bet). Most people in Silicon
| Valley are changing jobs more often than 4-5 years. I don't like
| to constantly be the new guy, but there is a refreshing feeling
| to starting on something new and not carrying years of technical
| debt on your emotions. Maybe it's time to try something new, take
| a bigger vacation than usual, or talk to someone about new
| approaches you can try in your professional or personal life. But
| certainly don't let the fact that you feel like this add to the
| load - you're not alone, and it's not permanent.
| ok123456 wrote:
| Yes. That's why you avoid building them unless you absolutely
| need to, and build libraries instead.
| guilhas wrote:
| Recently I was asked to work on an older project for enterprise
| customers. And we are always wary of working on old unmaintained
| code
|
| But it just felt like a breath of fresh air
|
| All code in the same repository: UI, back-end, SQL, MVC style.
| Fast from feature request to delivery in production. Change, test,
| fix bugs, deploy. We were happy and the customers were too
|
| No cloud apps, buckets, secrets, no oauth, little configuration,
| no docker, no microservices, no proxies, no CI/CD. It does look
| like somewhere along the way we overcomplicated things
| BatteryMountain wrote:
| 100% agree with you. OAuth + Docker/Kubernetes + massive
| configs just to make things build sucks the life out of every
| project for me that has them. And when it uses a non-git
| version control system.
| helsinki wrote:
| I find working on single services / components more exhausting.
| hliyan wrote:
| The first ten years of my career, I worked with distributed
| systems built on this stack: C++, Oracle, Unix (and to some
| extent, MFC and Qt). There were hundreds of instances of dozens
| of different type of processes (we would now call these
| microservices) connected via TCP links, running on hundreds of
| servers. I seldom found this exhausting.
|
| The second ten years of my career, I worked with (and continue to
| work on) much simpler systems, but the stack looks like
| this: React/Angular/Vue.js, Node.js/SpringBoot,
| MongoDB/MySQL/PostgreSQL, Elasticsearch, Redis, AWS (about a
| dozen services right here), Docker, Kubernetes. _This_ is
| exhausting.
|
| When you spend so much time wrangling a zoo of commercial
| products, each with its own API and often its own vocabulary for what
| should be industry standards (think TCP/IP, ANSI, ECMA, SQL), and
| being constantly obsoleted by competing "latest" products, that
| you don't have enough time to focus on code, then yes, it can be
| exhausting.
| softwarebeware wrote:
| You know what? This is a really great point. When I reflect
| back on my career experience (at companies like Expedia, eBay,
| Zillow, etc.) the best distributed systems experience I had was
| at companies that standardized on languages and frameworks and
| drew a pretty strong boundary around those choices.
|
| It wasn't that you technically couldn't choose another stack
| for a project, but to do so you had to justify the cost/benefit
| with hard data, and the data almost never bore out more benefit
| than cost.
| asymmetric wrote:
| Reminds me of http://boringtechnology.club/
| alecbz wrote:
| Can you say more? What specifically is exhausting?
|
| Exhaustion/burnout isn't uncommon but without more context it's
| hard to say if it's a product of the type of work or your
| specific work environment.
| readingnews wrote:
| This is on point... You also give no actual numerical context.
| Are you saying you are working 40 hours a week and leave work
| exhausted? Are you saying you work 40 at work, and are on
| call/email/remote terminals for 40 more hours coordinating
| teams, putting out fires, designing architecture?
|
| Even then, I would ask you to be more specific. I have a normal
| 40 hour a week uni job as a sysadmin, but it typically takes
| somewhat more or less (hey, sometimes I can get it done in 35,
| sometimes it's 50 hours). However, for the last several years we
| have been so shorthanded, faculty wise, that I teach (at a
| minimum) two senior level computer science classes every
| semester (I was a professor at another uni). About mid
| semester, things will break, professors will make unreasonable
| demands of building out new systems/software/architecture, and
| I find myself doing (again at a minimum) 80 hours a week. On
| the other hand, I am _not_ exhausted, as I enjoy teaching quite
| a bit, and I have been a sysadmin for many years and also enjoy
| that work.
| alecbz wrote:
| As you imply towards the end, I think things like numbers of
| hours worked are generally not relevant for stuff like this.
| I've been incredibly engaged working 12+ hour days and I've
| been burnt out barely getting 2-3 hours of real work in a
| day. It has more to do with the nature of the work.
| lozenge wrote:
| Even though you only did 2-3 hours of "real work", how much
| actual time investment was in your job? I don't see how
| somebody can burn out working just 2-3 hours in a day.
| Maybe emotionally burnt out if you're a therapist or
| something, but not as a software engineer.
| solatic wrote:
| What do you find exhausting?
|
| One anti-pattern I've found is that most orgs ask a single team
| to handle on-call around the clock for their service. This rarely
| scales well, from a human standpoint. If you're getting paged at
| 2:00 in the morning on a regular basis you _will_ start to resent
| it. There's not much you can do about that so long as only one
| team is responsible for uptime 24/7.
|
| The solution is to hire operations teams globally, and then set up
| follow-the-sun operations whereby the people being paged are
| always naturally awake at that hour, which allows them to work
| normal eight-hour shifts. But this requires companies to, _gasp_,
| have specialized developers and specialized operators
| _collaborate_ before allowing new feature work into production,
| to ensure that the operations teams understand what the services
| are supposed to do and keep it all online. It requires (oh, the
| horror!) actually maintaining production standards, runbooks, and
| other documentation.
|
| So naturally, many orgs would prefer to burn out their engineers
| instead.
| NationalPark wrote:
| I don't think this is a stable long term solution. The "on
| call" teams end up frustrated with the engineers who ship bugs
| and this results in added process that delays deploys,
| arbitrary demands for test coverage, capricious error budgets,
| etc. It's much better to have the engineers who wrote the code
| be responsible for running it, and if their operational burden
| becomes too high, to staff up the dev team to empower them to
| go after root causes. Plus the engineers who wrote the code
| _always_ have better context than reliability people who tend
| to be systems experts but lack the business logic intuition to
| spot errors at a glance.
| Hermitian909 wrote:
| I don't think the parent was implying you're never on call
| for your code, just only on call during working hours.
|
| One of the challenges for larger companies in trying to make
| teams on-call 24/7 is that your most senior engineers often
| have enough money that they don't have to take on-call. Some
| variation of the following conversation happens in Big Tech
| more than most people seem to anticipate:
|
| "hey, so I have 7 mil in the bank, a house, and kids; so I'm
| not taking on-call anymore"
|
| "I understand on-call is a burden, but the practice is a big
| part of how we maintain operational excellence"
|
| "Alright, I quit"
|
| "Woah woah woah, uh, ok, what about we work on transitioning
| you out of on call over the next 6 months?"
|
| "Nah, I'm done"
|
| "This is going to be really disruptive to the team!"
|
| "Yeah man it sucks, I really feel for you"
|
| My understanding is a few famous outages at large cloud
| providers are a direct result of management not anticipating
| these conversations and assuming 24/7 on-call from a single
| geographically centered team of high powered engineers was
| sustainable.
| [deleted]
| kqr wrote:
| Correct. Throwing software over the wall to "other people"
| and letting them deal with the problems of running the
| software is guaranteed to lead to low quality, inefficient
| processes, or usually both.
| solatic wrote:
| > The "on call" teams end up frustrated with the engineers
| who ship bugs and this results in added process that delays
| deploys, arbitrary demands for test coverage, capricious
| error budgets, etc.
|
| This is poor operations culture. Software is no different
| from industrial manufacturing. You QA before you ship product
| to customers and you QA your raw materials before you start
| to process them. Operations is responsible for catching show-
| stopper bugs before they hit production. This means that
| operations is responsible for pushing to staging, not
| developers; operations stakeholders need to be looped into
| feature planning to ensure that feature work will easily
| integrate into the operations culture (somebody's got to tell
| the developers they can't adopt MySQL if it's a PostgreSQL
| shop, etc.). Fundamentally, Ops needs to be able to say No to
| Dev. The SRE take on it is to "hand the pager back to Dev",
| but the actual method of saying No is different from Ops
| culture to Ops culture.
|
| > reliability people who tend to be systems experts but lack
| the business logic intuition to spot errors at a glance
|
| If Dev didn't build the monitoring, the observability, put
| proper logging in place, etc., then honestly, Dev isn't going
| to spot the errors at a glance. Customer Service will when
| customers complain. @jedberg seems to think that Developers
| should write code to auto-solve their operations issues. If
| Developers can write code to auto-solve their operations
| issues, and Developers obviously anyway need to add telemetry
| etc., then why, pray tell, should it be so unreasonable to
| expect Developers to be able to succinctly add the kind of
| telemetry and documentation that explains the business logic,
| according to an Operations standard, such that Operations can
| thus keep the system running?
| grogers wrote:
| The solution to getting paged a lot at off hours is rarely to hire
| additional teams to cover those times for you, at least not
| long term. For things you can control, you should fix the root
| causes of those issues. For things you can't control you should
| spend effort on making them within your control (eg
| architecture improvement). This takes time, so follow-the-sun
| rotation might be a stopgap solution, but you need to make
| sure it doesn't cover over the real problems without them
| getting any better.
| ahelwer wrote:
| From experience, it's really hard to fix the root causes of
| issues when you were woken up three times the night before
| and had two more of the same incident occur during the
| workday. In my case I struggled along for a couple years but
| the best thing to do was just leave and let it be someone
| else's problem.
| kqr wrote:
| Best thing for what? Surely not software quality and
| customer satisfaction.
| ahelwer wrote:
| If they cared about that they would either pay me so much
| money I'd be insane to walk away or they would hire
| people in other time zones to cover the load. Instead
| they chose to pay for their customer satisfaction with my
| burnout. The thing about that strategy is... eventually
| the thing holding their customer satisfaction together
| gets burnt out. So I leave. And even then they're still
| getting the better half of the bargain.
| kqr wrote:
| Sorry, I accidentally made it sound like you did the wrong thing
| by leaving. That wasn't my intention. Of course, leaving was
| the right choice for you.
|
| What I meant was the company you were working for does
| not get the best quality or customer satisfaction by
| overworking you to the point where you have to leave. It
| would have been better for their software quality to
| handle things differently.
| notacoward wrote:
| This. Absolutely this. Working on large distributed system can
| be both exhilarating and exhausting. The two often go hand in
| hand. However, working on such systems _without diligence_ tips
| the scales toward exhausting. If your testing and your
| documentation and your communication (both internal and with
| consumers) suck, you're in for a world of pain.
|
| "But writing documentation is a waste of time because the code
| evolves so fast."
|
| Yeah, I hear that, but there's also a lot of time lost to
| people harried during their on-call and still exhausted for a
| week afterward, to training new people because the old ones
| burned out or just left for greener pastures, to maintaining
| old failed experiments because customers (perhaps at your
| insistence) still rely on them and backing them out would be
| almost as much work than adding them was, and so on.
|
| That's not _really_ moving fast. That's just flailing. You can
| _actually_ go further faster if you maintain a bit of
| discipline. Yes, there will still be some "wasted" time, but
| it'll be a bounded, controlled waste like the ablative tiles on
| a re-entry vehicle - not the uncontrolled explosion of
| complexity and effort that seems common in many of the younger
| orgs building/maintaining such systems nowadays.
| bckr wrote:
| > That's not really moving fast. That's just flailing.
|
| Yes, a million times yes. This is moving me. Where do I find
| a team that understands this wisdom?
| jedberg wrote:
| I would respectfully say that you are wrong. I speak from
| experience. At Netflix we tried to hire for around the clock
| coverage. But what ended up working much better was taking that
| same team and having each person on call for a week at a time,
| all based in Pacific Time.
|
| Yes, you would get calls at 2am, sometimes multiple days in a
| row. But you were only on call once every six to eight weeks,
| and we scheduled out well in advance so you could plan your
| life accordingly.
|
| As a bonus, for the five weeks you weren't on call, you were
| highly incentivized (and had the time) to build tools or submit
| patches to fix the problems that woke you at 2am.
|
| > It requires (oh, the horror!) actually maintaining production
| standards, runbooks, and other documentation.
|
| I disagree with this too. Documentation and runbooks are
| useless in an outage. Instead of runbooks, write code to do the
| thing. Instead of documentation, comment the code and build
| automation to make the documentation unnecessary, or at least
| surface the right information automatically if you can't
| automate it.
| solatic wrote:
| > you were highly incentivized (and had the time) to build
| tools or submit patches to fix the problems that woke you at
| 2am.
|
| Ah, so you worked on a team where the SRE needs were
| prioritized over the feature requests? Because in most
| companies where I've worked, Product + Customer Service +
| Sales + Marketing + Executives don't really have time or
| patience for the engineers to get their diamond polishing
| cloths out. They want to see _feature development_. They 're
| willing to be forced to prioritize exactly which feature
| they'll get soonest, and they understand that engineering
| needs time to keep the systems running, but in most
| businesses I've worked, the business comes first.
|
| > Documentation and runbooks are useless in an outage.
| Instead of runbooks, write code to do the thing. Instead of
| documentation, comment the code and build automation to make
| the documentation unnecessary
|
| We do that too. If you could write code to Solve All The
| Problems then you'd never need to page a human in the first
| place ;)
|
| I'll give you a simple example of where you can't write code
| to solve this sort of thing. Let's say that you have an
| autoscaler that will scale your server group up to X servers.
| You define an alert to page you if the autoscaler hits the
| maximum. The page goes off. Do you really want to write code
| to arbitrarily increase the autoscaler maximum whenever it
| hits the maximum? Why do you have the maximum in the first
| place? The _entire reason_ why the autoscaler maximum exists
| is to prevent cost overruns from autoscaling run amok. You
| _want_ a human being, not code, to look at the autoscaler and
| make the decision. Do you have steady-slow growth up to the
| maximum? Maybe it should be raised, if it represents natural
| growth. Maybe it shouldn't, if you just raised it last week
| and it shouldn't be anywhere near this busy. Do you have
| hockey-stick growth? Maybe the maximum is working as
| expected, looks like a resource leak hit production. Or maybe
| you have a massive traffic hit and you actually _do_ want to
| increase the maximum. Maybe you'd prefer to take the outage
| from the traffic hit, let the 429s cool everyone off. But
| good luck trying to write code to handle that automatically,
| and _correctly_ for you!
|
| > or at least surface the right information automatically if
| you can't automate it.
|
| Ah, well, that's exactly what the dedicated operations staff
| are doing, because when you have three follow-the-sun teams,
| you need _standards_, not three sets of people who each
| somehow telepathically share the same tribal knowledge?
|
| Don't get me wrong, I'm not anti-automation or something. If
| your operations folks are click-clicking in consoles all day
| long, the same click-clicking every day, probably something's
| wrong. But the SRE model asks for operations automation to
| stick within _operations_ teams, not development teams.
| jedberg wrote:
| > Ah, so you worked on a team where the SRE needs were
| prioritized over the feature requests?
|
| Yes, it was an SRE team. All we do is write tools to make
| operations better, but more importantly we write tools to
| make it easier for the dev teams to operate their own
| systems better. But yes, we had product teams that would
| push back on our requests because they had product to
| deliver, and that was fine. We'd either figure out how to
| do the work for them, or figure out a workaround.
|
| > We do that too. If you could write code to Solve All The
| Problems then you'd never need to page a human in the first
| place ;)
|
| Well yes, that's the idea. You can't get to 5 9s of
| reliability unless it's all automated. :)
|
| > I'll give you a simple example of where you can't write
| code to solve this sort of thing.
|
| I could easily write code to solve the thing. Step one,
| double the limit to alleviate immediate customer pain. Step
| two, page someone to wake up and look at the graphs and
| figure out what the better medium term solution is to get
| us through until the morning, including links to said
| relevant graphs.
|
| You're not gonna have a cost overrun doubling the limit for
| one night. And if there is a big problem, the person will
| get paged again a few hours later and have more information
| to make a better decision.
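|
| (A rough, illustrative sketch of that two-step idea in Python;
| get_group_max, set_group_max and page_oncall are hypothetical
| stand-ins, not a real autoscaler or paging API:)
|
|     def on_autoscaler_max_reached(group, get_group_max,
|                                   set_group_max, page_oncall,
|                                   hard_cap=1000):
|         # Step 1: relieve immediate customer pain by doubling the
|         # limit, bounded by a hard cap so costs can't run away.
|         current = get_group_max(group)
|         new_max = min(current * 2, hard_cap)
|         set_group_max(group, new_max)
|         # Step 2: page a human with context so they can pick the
|         # right medium-term fix once they look at the graphs.
|         page_oncall(
|             f"{group}: autoscaler max raised {current} -> {new_max}; "
|             "check traffic graphs and decide on a permanent limit."
|         )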
|
| > But the SRE model asks for operations automation to stick
| within operations teams, not development teams.
|
| Yes, but I'm not sure I see why that's bad. I don't see any
| purpose for a dedicated operations team, especially a
| follow the sun team. If you're Google and you already have
| offices all around the world, sure, it will be better. But
| it makes no sense to hire an around the world team _just_
| for operations if the rest of your company is in one time
| zone.
| deanc wrote:
| This is the same approach as night shifts for nurses.
|
| There's a lot of evidence to suggest that this infrequent but
| consistent disturbance to their circadian rhythms causes all
| kinds of physiological damage. One example
| [1]. We have to do better. I think the original suggestion of
| finding specialised night workers or those in other timezones
| is more humane.
|
| [1] https://blogs.cdc.gov/niosh-science-
| blog/2021/04/27/nightshi...
| jedberg wrote:
| That article is about night shift work, not day shift work
| that occasionally makes you work an hour or two at night
| every six weeks.
| toomuchtodo wrote:
| Here is a reference that is a bit more applicable to
| the on-call experience. There is a tangible human cost to
| after hours responses during an on call rotation. I
| personally do not recommend on call roles to any
| technology professional who can avoid them due to these
| health consequences of an on call requirement.
|
| https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5449130/
|
| > Sleep plays a vital role in brain function and systemic
| physiology across many body systems. Problems with sleep
| are widely prevalent and include deficits in quantity and
| quality of sleep; sleep problems that impact the
| continuity of sleep are collectively referred to as sleep
| disruptions. Numerous factors contribute to sleep
| disruption, ranging from lifestyle and environmental
| factors to sleep disorders and other medical conditions.
| Sleep disruptions have substantial adverse short- and
| long-term health consequences. A literature search was
| conducted to provide a nonsystematic review of these
| health consequences (this review was designed to be
| nonsystematic to better focus on the topics of interest
| due to the myriad parameters affected by sleep). _Sleep
| disruption is associated with increased activity of the
| sympathetic nervous system and hypothalamic-pituitary-
| adrenal axis, metabolic effects, changes in circadian
| rhythms, and proinflammatory responses. In otherwise
| healthy adults, short-term consequences of sleep
| disruption include increased stress responsivity, somatic
| pain, reduced quality of life, emotional distress and
| mood disorders, and cognitive, memory, and performance
| deficits._ For adolescents, psychosocial health, school
| performance, and risk-taking behaviors are impacted by
| sleep disruption. Behavioral problems and cognitive
| functioning are associated with sleep disruption in
| children. Long-term consequences of sleep disruption in
| otherwise healthy individuals include hypertension,
| dyslipidemia, cardiovascular disease, weight-related
| issues, metabolic syndrome, type 2 diabetes mellitus, and
| colorectal cancer. All-cause mortality is also increased
| in men with sleep disturbances. For those with underlying
| medical conditions, sleep disruption may diminish the
| health-related quality of life of children and
| adolescents and may worsen the severity of common
| gastrointestinal disorders. As a result of the potential
| consequences of sleep disruption, health care
| professionals should be cognizant of how managing
| underlying medical conditions may help to optimize sleep
| continuity and consider prescribing interventions that
| minimize sleep disruption.
| d0gsg0w00f wrote:
| Even as a dedicated operations team for a product, we did
| this too. The on-call person worked tickets and took calls for
| one week at a time, the rest of the team worked on ways to
| make on-call suck less. For an eight person team it worked
| well for about three years until bigger stuff happened in the
| org and we all parted ways.
| magicalhippo wrote:
| > But what ended up working much better was taking that same
| team and having each person on call for a week at a time, all
| based in Pacific Time.
|
| Our support team does the same, and they seem to be quite
| happy with it. They also get the following Friday off (in
| addition to compensation).
|
| They do their best to shield us developers from after-hours
| calls; usually one can get things moving enough that it can
| be handled properly in the morning.
| toast0 wrote:
| At the end of the day, there's a human cost to responding to
| pages, and there's a human cost to collaboration.
|
| Both of those can drive burn out. Personally, I find all that
| collaboration work very hard and stressful, so I work better in
| a situation where I get pages for the services I control; but
| that would change if pages were frequent and mostly related to
| dependencies outside of my control. It also helps to have been
| working in organizations that prioritize a working service over
| features. Getting frequent overnight issues that can't be
| resolved without third party effort that's not going to happen
| anytime soon is a major problem that I see reports of in
| threads like this.
|
| I can also get behind a team that can manage the base
| operations issues like ram/storage/cpu faults on nodes and
| networking. The runbooks for handling those issues are usually
| pretty short and don't need much collaboration.
| [deleted]
| Ozzie_osman wrote:
| I'd argue that timezone is just part of the problem. If you're
| responsible for a high oncall load, you are subjected to a
| steady, unpredictable stream of interrupts requiring you to act
| to minimize downtime or degradation. Obviously it's worse if
| you get these at night, but it's still bad during the day.
|
| I think the anti-pattern is having one team responsible for
| another's burden. You want teams to both be responsible for
| fixing their own systems when they break, AND be empowered to
| build/fix their broken systems to minimize oncall incidents.
| seanwilson wrote:
| It's okay to prefer working on small single server systems with
| small teams for example. I do this while contracting quite often
| and enjoy how much control you get to make big changes with
| minimal bureaucracy.
|
| Sometimes it feels like everyone is focused on eventually working
| with Google scale systems and following best practices that are
| more relevant towards that scale but you can pick your own path.
| primeletter wrote:
| I can see how it'd be exhausting to have to deal with the
| responsibility for the entirety of a few services.
|
| A key part of scaling at an org-level is continuously simplifying
| systems.
|
| At a certain level of maturity, it's common for companies to
| introduce a horizontal infra team (that may or may not be
| embedded in each vertical team).
| hughrr wrote:
| Yes it's horrible. I actually miss the early 00's when I did
| infra and code for small web design agencies. I actually could
| complete work back then.
| Simon_O_Rourke wrote:
| It's not so much the systems, but the organizations which create
| systems in their own image so to speak. If making changes is
| hard, either in the organization or within teams, you better
| believe any changes to a distributed system will be equally tough
| to implement.
| qxmat wrote:
| I've found that external tech requirements are horrible to work
| with, especially when the underlying stack simply doesn't support
| it. Normally these are pushed by certified cloud consultants or
| by an intrepid architect who found another "best practice blog."
|
| It begins with small requirements such as coming up with a
| disaster recovery plan only for it to be rejected because your
| stack must "automatically heal" and devs can't be trusted to
| restore a backup during an emergency.
|
| Blink and you're implementing redundant networking (cross-AZ
| route tables, DNS failover, SDN via gateways/load balancers), a
| ZooKeeper ensemble with >= 3 nodes in 3 AZs, per-service health
| checks, EFS/FSx network mounts for the persistent data that an
| expensive enterprise app insists on storing on disk, and some
| kind of HA database/multi-master SQL cluster.
|
| ... months and months of work because a 2 hour manual restore
| window is unacceptable. And when the dev work is finally complete
| after 20 zero-downtime releases over 6 months (bye weekend!) how
| does it perform? Abysmally - DNS caching left half the stack
| unreachable (partial data loss) and the mission critical Jira
| Server fail-over node has the wrong next-sequence id because Jira
| uses an actual fucking sequence table (fuck you Atlassian - fuck
| you!).
|
| If only the requirement was for a DR run-book + regular fire
| drills.
| theptip wrote:
| I think this highlights the importance of actually analyzing
| your RP/RT (recovery point/recovery time) requirements through
| the lens of business value, and being honest about the ROI of
| buying that extra 9 in uptime.
|
| It may be the case that 2 hours of downtime is completely
| unacceptable for the business, and paying $Xmm extra per year
| to maintain it is the right call. Or it may be that the
| business would be horrified to learn how many dollars are being
| spent to avert a level of downtime that no customer would
| notice or care about.
|
| If the requirement is just being set by engineering, then it's
| more about finding the equilibrium where the resource spent on
| automation balances the cost of the manual toil and the
| associated morale impact on the team. Nobody wants to work on a
| team where everything is on fire all the time, and it's
| time/money well spent to avert that situation.
| fullstackchris wrote:
| ...how is the JIRA server mission critical? is it tied to CI/CD
| somehow?
| qxmat wrote:
| In the enterprise you'll find that Jira is used for general
| workflow management, not just CI/CD. I've encountered teams of
| analysts who spend their working day moving and editing work
| items. It's the Quicken of workflow management solutions.
|
| Jira Server is deliberately hobbled by the sequence table + no
| Aurora support, and it's now EOL (no security updates 1 year
| after purchase!). The DC edition scales horizontally if you have
| 100k.
|
| Jira in general is a poorly thought out product (looking at you,
| customfield_3726!) but it's held in such high regard by users
| that it's impossible to avoid.
| hogrider wrote:
| Pre-COVID I would have laughed at this. But now, no one knows
| what a user story should be unless you can read it off Jira,
| and there are no backups of course.
| odonnellryan wrote:
| Gives me a fun idea: a program that randomly deletes items
| out of your backlog.
| Gwypaas wrote:
| "Chaos engineering for your backlog"
| angarg12 wrote:
| I don't find it exhausting, I find it *exhilarating*.
|
| After years of proving myself, earning trust and strategic
| positioning I am finally leading a system that will support
| millions of requests per second. I love my job and this is the
| most intellectually stimulating activity I have done in a long
| while.
|
| I think this is far from the expectation of the average engineer.
| You can find many random companies with very menial and low-stakes
| work. However if you work at certain companies you sign up for
| this.
|
| BTW I don't think this is unreasonable. This is precisely why
| programmers get paid big bucks, definitely in the US. We have a
| set of skills that require a lot of talent and effort, and
| we are rewarded for it.
|
| Bottom line this isn't for everyone, so if you feel you are done
| with it that's fair. Shop around for jobs and be deliberate about
| where you choose to work, and you will be fine.
| bob1029 wrote:
| > I am finally leading a system that will support millions of
| requests per second.
|
| This is the difference. Millions of things _per second_ is a
| super hard problem to get right in any reality. Pulling this
| off with any technology at all is rewarding.
|
| Most distributed systems are not facing this degree of
| realistic challenge. In most shops, the challenge is synthetic
| and self-inflicted. For whatever reason, people seem to think
| saying things like "we do billions of x per _month_ " somehow
| justifies their perverse architectures.
| jacquesm wrote:
| That question probably needs more information.
|
| But your 'average engineer' is probably better served by asking
| themselves the question whether the system really needed to be
| that large and distributed rather than if working on them is
| exhausting. The vast bulk of the websites out there doesn't need
| that kind of overkill architecture, typically the non-scalable
| parts of the business preclude needing such a thing to begin
| with. If the work is exhausting that sounds like a mismatch
| between architecture choice and size of the workforce responsible
| for it.
|
| If you're an average (or even sub-average) engineer in a mid-
| sized company, stick to what you know best and how to make that
| work to your advantage: KISS. A well tuned non-distributed system
| with sane platform choices will outperform a distributed system
| put together by average engineers any day of the week, and will
| be easier to maintain and operate.
| ublaze wrote:
| Yeah, large-scale systems are often boring in my experience,
| because the scale limits what features you can add to make things
| better. Each and every decision has to take scale into account,
| and it's tricky to try experimenting.
|
| I think it has to do with the kind of engineer you are. Some
| engineers love iterating and improving such systems to be more
| efficient, more scalable, etc. But it can be limiting due to the
| slower release cycles, hyper focus on availability, and other
| necessary constraints.
| harshaw wrote:
| I don't think they are boring, but it depends very much on the
| kind of engineer you are. At AWS I try to encourage people who like
| the problem space and at the very least appreciate it, but can
| totally understand that you don't want to do your entire career
| on it. Many of our younger folks have never felt the speed and
| joy you can get with hammering out a simple app (web, python,
| ML) that doesn't have to work at scale.
| revskill wrote:
| Yes, a bit. But it's fun. And that kind of fun is hard to
| find in a big monolithic system.
| [deleted]
| systematical wrote:
| I have 15 years xp in dev but all of that was in smaller projects
| and a small team. I recently took a gig in a bigger org with a
| distributed system, on call, etc. It's exhausting and an
| information overload. I'll give myself more time to acclimate but
| if I feel like this still after a year I'm out.
| heisenbit wrote:
| In these large scale systems the boundaries are usually not well
| defined (there are APIs but data flowing through the APIs is
| another matter, as are operational and non-functional
| requirements).
|
| Stress is often caused by a mismatch of what you feel responsible
| and accountable for and what you really control. The more you
| know the more you feel responsible for but you are rarely able to
| expand control as much or as fast as your knowledge. It helps to
| be very clear about where you have ultimate say (accountability)
| or control within some framework (responsibility) or simply know
| and contribute. Be clear in your mind, to others, and to your
| boss. Look at
| areas outside your responsibility with curiosity and willingness
| to offer support but know that you are not responsible and others
| need to worry.
| tylerrobinson wrote:
| This is spot on. Feeling frustrated working on large
| distributed systems could be generalized as "feeling frustrated
| working in a large organization" because the same limitations
| apply. You learn about things you cannot control, and it is
| important to see the difference between what you can control
| and contribute and what you can't.
| ChrisMarshallNY wrote:
| TLDR; Yes, it is exhausting, but I have found ways to mitigate
| it.
|
| I don't develop stuff that runs billions of queries. More like
| thousands.
|
| It is, however, important infrastructure, on which thousands of
| people around the world rely, and, in some cases, it's not
| hyperbole to say that lives depend on its integrity and uptime.
|
| One fairly unique feature of my work, is that it's almost all
| "hand-crafted." I generally avoid relying on dependencies out of
| my direct control. I tend to be the dependency, on which _other_
| people rely. This has earned me quite a few sneers.
|
| I have issues...
|
| These days, I like to confine myself to frontend work, and avoid
| working on my server code, as monkeying with it is always
| stressful.
|
| My general posture is to do the highest Quality work possible;
| way beyond "good enough," so that I don't have to go back and
| clean up my mess. That seems to have worked fairly well for me,
| in at least the last fifteen years, or so. Also, I document the
| living bejeezus[0] out of my work, so, when I inevitably have to
| go back and tweak or fix, in six months, I can find my way
| around.
|
| [0] https://littlegreenviper.com/miscellany/leaving-a-legacy/
| zaphirplane wrote:
| Front end and no dependencies, tell us more
| ChrisMarshallNY wrote:
| Feel free to see for yourself. I have quite a few OS projects
| out there. My GH ID is the same as my HN one.
|
| My frontend work is native Swift work, using the built-in
| Apple frameworks (I ship classic AppKit/UIKit/WatchKit, using
| storyboards and MVC, but I will be moving onto the newer
| stuff, as it matures).
|
| My backend work has chiefly been PHP. It works quite well,
| but is not where I like to spend most of my time.
| glintik wrote:
| The most undervalued thing, which even highly skilled engineers
| forget, is the KISS principle. That's why you are burning out
| supporting such systems.
| jeffrallen wrote:
| Yes, it's amazing how much one modern high spec system running
| good code can do. Turn off all the distributed crap and just
| use a pair in leader/follower config with short-TTL DNS to
| choose the leader and manual failover scripts. If your
| app/company/industry cannot accept the compromises from such a
| simple config, quit and work in one which can.
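| As a rough illustration of how small that tooling can be, the
| manual failover can amount to flipping one low-TTL DNS record
| over to the follower. A minimal sketch in Python, assuming
| Route 53 via boto3 as the DNS provider and made-up zone,
| record, and IP values:
|
|     import boto3
|
|     # Hypothetical values -- substitute your own zone and hosts.
|     ZONE_ID = "Z123EXAMPLE"
|     RECORD = "db-leader.example.com."
|     FOLLOWER_IP = "10.0.0.2"
|
|     def promote_follower():
|         """Point the short-TTL leader record at the follower."""
|         route53 = boto3.client("route53")
|         route53.change_resource_record_sets(
|             HostedZoneId=ZONE_ID,
|             ChangeBatch={
|                 "Comment": "manual failover to follower",
|                 "Changes": [{
|                     "Action": "UPSERT",
|                     "ResourceRecordSet": {
|                         "Name": RECORD,
|                         "Type": "A",
|                         # low TTL so clients re-resolve quickly
|                         "TTL": 60,
|                         "ResourceRecords": [{"Value": FOLLOWER_IP}],
|                     },
|                 }],
|             },
|         )
|
|     if __name__ == "__main__":
|         promote_follower()
|
| The point is not this exact script; it's that the whole failover
| story can stay small enough to read in one sitting.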
| BatteryMountain wrote:
| Good code? Where?
|
| This whole thread feels like therapy since I face the same
| monsters on the systems I work on. Partly due to bad platform
| & code, partly due to bad organization structure (Conway's
| 100% for us).
|
| My pet projects at home are the only thing keeping me sane,
| mostly because they are simple.
| asim wrote:
| Yup. Spent more than a decade doing it. Got so frustrated that I
| started a company to try to abstract it all away for everyone else.
| It's called M3O https://m3o.com. Everyone ends up building the
| same thing over and over. A platform with APIs either built in
| house or an integration to external public APIs. If we reuse
| code, why not APIs?
|
| I should say, I've been a sysadmin, SRE, software engineer, open
| source creator, maintainer, founder and CEO. Worked at Google,
| bootstrapped startups, VC funded companies, etc. My general
| feeling: the cloud is too complex and I'm tired of waiting for
| others to fix it.
| randomsilence wrote:
| >Consume public APIs as simpler programmable building blocks
|
| Is the 'r' in 'simpler' intentional? In what way are the
| building blocks more simple than simple blocks?
| lumost wrote:
| Handling scale is a technically challenging problem; if you enjoy
| it, then take advantage! However, sometimes taking a break to
| work on something else can be more satisfying.
|
| Typically on a "high scale" service spanning hundreds or
| thousands of servers you'll have to deal with problems like: "How
| much memory does this object consume?", "how many ms will adding
| this regex/class into the critical path use?", "We need to add
| new integ/load/unit tests for X to prevent outage Y from
| recurring", and "I wish I could try new technique Y, but I have
| 90% of my time occupied on upkeep".
|
| It can be immensely satisfying to flip to a low-scale, low-ops
| problem space and find that you can actually bang out 10x the
| features/impact when you're not held back by scale.
|
| Source: Worked on stateful services handling 10 Million TPS, took
| a break to work on internal analytics tools and production ML
| modeling, transitioning back to high scale services shortly.
| qaq wrote:
| Let's say for argument's sake it's 50 billion a month; that's
| about 20k/sec. There is zero need for a fancy setup at this scale.
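| A quick back-of-the-envelope check of that figure:
|
|     # 50 billion requests spread over a 30-day month
|     per_month = 50_000_000_000
|     seconds_per_month = 30 * 24 * 60 * 60   # 2,592,000 seconds
|     print(per_month / seconds_per_month)    # ~19,290/sec, i.e. ~20k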
| cookiengineer wrote:
| I am not sure you are aware that server load is never evenly
| distributed. And that's the exact problem OP is talking about.
|
| If everybody got a ticket number and made requests exactly when
| they were supposed to, we wouldn't need load balancers.
| qaq wrote:
| This is orthogonal to what causes the pain point. All the
| pain comes from distributed state, and at these load levels,
| even if you peak at 800K requests per second, you don't need
| distributed state. So most of this pain is self-inflicted.
| rhacker wrote:
| Humans GET simplicity from extreme hyper complexity.
|
| Take a gas generator. Easy, add oil and gas and get electricity
| and these days they even come in a smoothed over plastic shell
| that makes it look like a toy. Inside, very complex, spark plugs,
| engine, coils, inverter. A hundred years of inventions packed
| into a 1.5' x 1.5' box.
|
| It's the same thing for complicated systems. Front end to back.
| No matter how ugly or how much you wish it was refactored - some
| exec knows it as a box where you put something in and magical
| inference comes out. Maybe that box actually causes real change
| in the physical world - like billions of packages being sent out
| all over the world.
|
| In the days of castles you would have similar systems managed by
| people. People that drag wooden carts of shit out of a castle.
| Carrying water around. Manually husking corn and wheat and what
| have you.
|
| No matter how far into the future we go, we will continue to get
| simple out of monstrous complexity.
|
| That's not the answer to your question - but it's just that the
| world will always lean towards going that way.
| the_gipsy wrote:
| If you're burnt out, you're most likely being suckered.
| softwarebeware wrote:
| I find it "exhilirating," not "exhausting." But I also don't
| think that "...your average engineer should now be able to handle
| all this." That is where we went completely wrong as an industry.
| It used to be said that what we work on is complex, and you can
| either improve your tools or you can improve your people. I've
| always held that you will have to improve your people. But clever
| marketing of "the cloud" has held out the false promise that
| anyone can do it.
|
| Lies, lies, and damn lies, I say!
|
| Unless you have bright and experienced people at the top of a
| large distributed systems company, who have actually studied and
| built distributed systems at scale, your experience of working in
| such a company is going to suck, plain and simple. The only cure
| is a strong continuous learning culture, with experienced people
| around to guide and improve the others.
| artiscode wrote:
| Your story is close to home. I was part of a team that integrated
| our newly-acquired startup with a massive, complex and needlessly
| distributed enterprise system that burned me out.
|
| Being forced to do things that absolutely did not make sense
| (CS-wise) was what I found to be most exhausting. Having no other way
| than writing shitty code or copying functionality into our app
| led me to an eventual burnout. My whole career felt pointless as
| I was unable to apply any of my skills and expertise that I
| learned over all these years, because everything was designed in
| a complex way. Getting a single property into an internal API is
| not a trivial task and requires coordination from different teams
| as there are a plethora of processes in place. However I helped
| to build a monstrous integration layer and everything wrong with
| it is partly my doing. Hindsight is 20/20 and I now see there
| really was no other, better way to do it, which feels nice in a
| schadenfreude kind of way.
|
| I sympathise with your point about not understanding what is
| expected of an average engineer nowadays. Should you take
| initiative and help manage things? Are you allowed to simply
| write code? What should you expect from others? These were amongst
| my pain points. I certainly did not feel rewarded for going the
| extra mile, but somehow felt obliged because of my "senior"
| title.
|
| I took therapy, worked on side projects and I'm now trying out a
| manager role. My responsibilities are pretty much the same, but I
| don't have to write code anymore. It feels empowering to close my
| laptop after my last Zoom meeting and not think about bugs, code,
| CI or merging tomorrow morning because it's release day tomorrow.
|
| But hey, grass is always greener on the other side! I think
| taking therapy was one of my life's best decisions after being
| put through the wringer. Perhaps it will help you as well!
| unnouinceput wrote:
| I wrote such a system. 6+ years, from the end of '07 to the beginning
| of '14. It grew organically, with more and more end points as
| time went by, and when I exited the project it had over 250 end
| points, each handling hundreds of thousands of user requests per
| day. By your measurement, that would mean the system I wrote
| would've handled a total of 250 (end points) x 30
| (days) x ~400k (requests per day) == 3B user requests in a month.
|
| To my knowledge the system is still used to this day and I think
| it grew 10x meanwhile, so I think it's serving over 30B requests
| each month.
|
| That being said, to answer your question - Yes! I got tired of
| it, started to plateau and felt I was lagging behind in terms of
| keeping up with technology around me. So I exited but at the same
| time I also started to get involved in other projects as well. So
| in the end I was overworked and I ditched the biggest project of
| my entire career as a freelancer because the pay wasn't worth it
| anymore. I wanted to feel excited, and the additional projects
| eventually made up for it in terms of money, but boy oh boy! The
| variety is what kept me from burning out. Nowadays if I feel
| another project is going that route, I discuss with the client
| replacing me with a team once I deliver the project in a stable
| state, ready for horizontal scaling.
| asdfman123 wrote:
| Relevant comedy video:
|
| https://www.youtube.com/watch?v=y8OnoxKotPQ
|
| This recent video they put out is pretty good, too:
|
| https://www.youtube.com/watch?v=kHW58D-_O64
| karmakaze wrote:
| I'm trying to relate this to my experiences. The best I can make
| of it is that burnout comes from dealing with either the same
| types of problems, or new problems at a rate that's higher than
| old problems get resolved.
|
| I've been in those situations. My solution was to ensure that
| there was enough effort put into systematically resolving long-known
| issues in a way that not only solves them but also reduces the
| number of new similar issues. If the strategy is instead to
| perform predominantly firefighting with 'no capacity' available
| for working on longer-term solutions, there is no end in sight
| unless/until you lose users or requests.
|
| I am curious what the split is of problems being related to:
|
| 1. error rates, how many 9s per end-user-action, and per service
| endpoint
|
| 2. performance, request (and per-user-action) latency
|
| 3. incorrect responses, bugs/bad-data
|
| 4. incorrect responses, stale-data
|
| 5. any other categories
|
| Another strategy that worked well was not to fix the problems
| reported but instead fix the problems known. This is like the
| physicist looking for keys under the streetlamp instead of where
| they were dropped. Tracing a bug report to a root cause and then
| fixing it is very time consuming. This of course needs to
| continue, but if sufficient effort is put into resolving known
| issues, such as latency or error rates of key endpoints, it can
| have an overall lifting effect, reducing problems in general.
|
| A specific example was how performance effort used to go toward
| average latency for the most frequently used endpoints. I redirected
| the effort instead to reducing the p99 latency of the worst
| offenders. This made the system more reliable in general and paid
| off in a trend to fewer problem reports, though it's not
| easy/possible to directly relate one to the other.
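| As a sketch of what "worst offenders by p99" means in practice,
| assuming you already have (endpoint, latency_ms) samples pulled
| from request logs (the helper below is hypothetical, not tied to
| any particular monitoring stack):
|
|     from collections import defaultdict
|     import statistics
|
|     def worst_offenders(samples, top_n=5):
|         """Rank endpoints by p99 latency; `samples` is an
|         iterable of (endpoint, latency_ms) tuples."""
|         by_endpoint = defaultdict(list)
|         for endpoint, latency_ms in samples:
|             by_endpoint[endpoint].append(latency_ms)
|         p99s = {
|             # quantiles(n=100) yields 99 cut points; [98] ~ p99
|             ep: statistics.quantiles(vals, n=100)[98]
|             for ep, vals in by_endpoint.items()
|             if len(vals) >= 100   # ignore thinly sampled endpoints
|         }
|         return sorted(p99s.items(), key=lambda kv: kv[1],
|                       reverse=True)[:top_n]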
| fxtentacle wrote:
| Yes, I used to,
|
| but No, I fixed it :)
|
| Among other things, I am team lead for a private search engine
| whose partner-accessible API handles roughly 500 mio requests per
| month.
|
| I used to feel powerless and stressed out by the complexity and
| the scale, because whenever stuff broke (and it always does at
| this scale), I had to start playing politics, asking for favors,
| or threatening people on the phone to get it fixed. Higher
| management would hold me accountable for the downtime even when
| the whole S3 AZ was offline and there was clearly nothing I could
| do except for hoping that we'll somehow reach one of their
| support engineers.
|
| But over time, management's "stand on the shoulders of giants"
| brainwashing wore off so that they actually started to read all
| the "AWS outage XY" information that we forwarded to them. They
| started to actually believe us when we said "Nothing we can do,
| call Amazon!". And then, I found a struggling hosting company
| with almost compatible tooling and we purchased them. And I moved
| all of our systems off the public cloud and onto our private
| cloud hosting service.
|
| Nowadays, people still hold me (at least emotionally) accountable
| for any issue or downtime, but I feel much better about it :)
| Because now it actually is within my circle of power. I have
| root on all relevant servers, so if shit hits the fan, I can fix
| things or delegate to my team.
|
| Your situation sounds like you will constantly take the blame for
| other people's faults. I would imagine that to be disheartening
| and extremely exhausting.
| aeyes wrote:
| I feel that your problems aren't even remotely related to my
| problems with large distributed systems.
|
| My problems are all about convincing the company that I need
| 200 engineers to work on extremely large software projects
| before we hit a scalability wall. That wall might be 2 years in
| the future so usually it is next to impossible to convince
| anyone to take engineers out of product development. Even more
| so because working on this changes absolutely nothing for the
| end user, it is usually some internal system related to data
| storage or processing which can't cope anymore.
|
| Imagine that you are Amazon and for some scalability reason you
| have to rewrite the storage layer of your product catalog.
| Immediately you have a million problems like data migration,
| reporting, data ingestion, making it work with all the related
| systems like search, recommendations, reviews and so on.
|
| And even if you get the ball rolling you have to work across
| dozens of different teams which can be hard because naturally
| people resist change.
|
| Why do large sites like Facebook, Amazon, Twitter and Instagram
| all essentially look the same after 10 years but some of them
| now have 10x the amount of engineers? I think they have so much
| data and so many dependencies between parts of the system that
| any fundamental change is extremely hard to pull off. They even
| cut back on features like API access. But I am pretty sure that
| most of them have rewritten the whole thing at least 3 times.
| notimetorelax wrote:
| I usually move on to a different project/team/company when it
| gets to this. E.g. my new team builds a new product that
| grows like crazy and has its own set of challenges. I prefer
| to deliver immediate customer value vs. long-term work that is
| hard to sell and whose value is hard to project.
| andai wrote:
| Heh, I _wish_ they still looked the same. They added an order
| of magnitude of HTML and JS bloat while _removing_
| functionality.
| dasil003 wrote:
| I don't know your specifics, but I have worked on some large
| scale architecture changes, and 200 engineers + 2 year
| feature freeze is generally not a reasonable ask. In practice
| you need to find an incremental path with validation and
| course correction along the way to limit the amount of
| concurrent change in flight at any moment. If you don't do
| this, you run a very high risk of the entire initiative collapsing
| under its own weight.
|
| Assuming your estimation is more or less correct and it
| really is a 400 eng-year project, then you also need
| political capital as well as technical leadership to make it
| happen. There are lots of companies where a smart engineer
| can see a potential path out of a local maximum, but the org
| structure and lack of technical leadership in the highest
| ranks means that the problem is effectively intractable.
| trhway wrote:
| >I need 200 engineers to work on extremely large software
| projects before we hit a scalability wall. That wall might be
| 2 years in the future
|
| sounds like a typical massive rewrite project. They almost
| never succeed, many fail outright and most hardly even reach
| the functionality/performance/etc. level of the stuff the
| rewrite was supposed to replace. 2-4 years is typical for
| such a glorious attempt before it is closed or folded into
| something else. Management in general likes such projects,
| and they usually declare victory around the 2-year mark and move
| on on the wave of the supposed success before reality hits
| the fan.
|
| >to convince anyone to take engineers out of product
| development.
|
| that means raiding someone's budget. Not happening :) A new
| glorious effort needs a new glorious budget; that is what
| management likes, not doing much more on the same budget
| as you're basically suggesting (i.e. I'm sure you'll get much
| more traction if you restate your proposal as "to hire 200
| more engineers ..." because that way you'll be laying a
| serious technical foundation for some mid-managers to grow on
| :). You're approaching this as an engineer and thus failing
| at what is a management game (or, as Sun Tzu pointed
| out, one has to understand the enemy).
| ClumsyPilot wrote:
| "That wall might be 2 years in the future so usually it is
| next to impossible to convince anyone to take engineers out
| of product development. Even more so because working on this
| changes absolutely nothing for the end user"
|
| It seems to be the same story in the fields of infrastructure
| maintenance, aircraft design (Boeing MAX), and mortgage CDOs
| (2008). Was it always like this, or does the new management
| not care until something explodes?
| imachine1980_ wrote:
| A manufacturing company is designed from the ground up to work
| with machines, but it isn't the same with software. It is hard to
| understand that triple the data isn't only triple the servers but
| a totally different software stack, and exponentially more
| complex; it's not just adding more factories, like in textiles.
| fragmede wrote:
| There are still order-of-magnitude-change analogies to real-
| world processes, _if_ people are willing to listen (which
| is the hard part). Use something that everybody can
| understand, like making pancakes or waffles or an omelet.
| Going from making 1 by hand, every 4 minutes at home for
| your family, to 1,000 pancakes per minute at a factory is
| obviously going to take a better system. You can scale
| horizontally, and do the equivalent of putting more VMs
| behind the load balancer, and hire 4,000+ people to cook,
| but you still need to have/make that load balancer in
| the first place for even that to work.
|
| That's the tip of the iceberg when going from 1 per 4 minutes
| to 1,000 per minute though. How do you make and
| distribute enough batter for that system, and plating and
| serving that is going to take a pub/sub bus, err,
| conveyor belt to support the cooks' output. Again though,
| you still gotta make that kafka queue, err, conveyor
| belt, plus the maintenance for that is going to take a team of
| people if you need the conveyor belt to operate 24/7/52.
| If your standards are _so_ high that the system can never
| go down for more than 52.6 minutes per year or 13.15
| minutes per quarter, then that team needs to consist of
| highly-trained and smart (read: expensive) people to call
| when the system breaks in the middle of the night.
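| Those downtime numbers fall straight out of a four-nines
| availability target; a quick sanity check:
|
|     availability = 0.9999                  # "four nines"
|     minutes_per_year = 365.25 * 24 * 60
|     budget = (1 - availability) * minutes_per_year
|     print(budget, budget / 4)   # ~52.6 min/year, ~13.1 min/quarter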
| pojzon wrote:
| Had that issue in my previous job.
|
| Higher management decided to migrate our proprietary, vendor-
| locked platform from one cloud provider to the other one.
| The majority of the migration fell on a single platform team
| that was constantly struggling with attrition.
|
| Unfortunately, neither I nor our architects were able to
| explain to the higher-ups that we needed a bigger team and
| overall way more resources to pull that off.
|
| Hope that someone that comes after me will be able to make
| the miracle happen.
| fxtentacle wrote:
| My impression has always been that FAANG need lots of
| engineers because the 10xers refuse to work there. I've seen
| plenty of really scalable systems being built by a small core
| team of people who know what they are doing. FAANG instead
| seem to be more into chasing trends, inventing new
| frameworks, rewriting to another more hip language, etc.
|
| I would have no idea how to coordinate 200 engineers. But
| then again, I have never worked on a project that truly
| needed 50+ engineers.
|
| "Imagine that you are Amazon and for some scalability reason
| you have to rewrite the storage layer of your product
| catalog." Probably that's 4 friends in a basement, similar to
| the core Android team ;)
| danny_taco wrote:
| Your impression comes from the fact that you have not
| worked at larger teams, as you said so yourself. It's
| relatively easy to build something scalable from the
| beginning if you know what you need to build and if you are
| not already handling large amounts of traffic and data.
|
| It's a whole different ballgame to build on top of an
| existing complex system already in production that was made
| to satisfy the needs at the time it was built but it now
| needs to support other features, bug fixes and supporting
| existing features but at scale while having 50+ engineers
| not step on each other and not break each others code in
| the process. 4 friends in the basement will not achieve
| more than 50+ engineers in this scenario, even when
| considering the inefficiencies of the difficulty in
| communication that come along with so many minds working on
| the same thing.
| ratww wrote:
| GP said they have never worked on something that _truly_
| needed 50+ engineers. Truly being the keyword here IMO.
|
| I have worked on a 1000+ engineer project and another
| that was 500+, but I'm in the same boat as GP. Both of
| those didn't need 50+, and the presence of the extra
| 950/450 caused several communication, organisational and
| architectural issues that became impossible to fix in the
| long term.
|
| So I can definitely see where they're coming from.
| exikyut wrote:
| I've long wondered what I might be able to keep an eye
| out for during onboarding/transfer that would help me
| tell overstuffed kitchens apart from optimally-calibrated
| engineering caves from a distance.
|
| I'm also admittedly extremely curious what (broadly) had
| 1000 (and 500) engineers dedicated to it, when arguably
| only 50 were needed. Abstractly speaking that sounds a
| lot like coordinational/planning micromanagement, where
| the manglement had final say on how much effort needed to
| be expended where instead of allowing engineering to own
| the resource allocation process :/
|
| (Am I describing the patently impossible? Not yet had
| experience in these types of environments)
| ethbr0 wrote:
| > _what I might be able to keep an eye out for during
| onboarding /transfer that would help me tell overstuffed
| kitchens apart from optimally-calibrated engineering
| caves from a distance_
|
| The biggest thing I've been able to correlate are command
| styles: imperative vs declarative.
|
| I.e. is management used to telling engineering _how_ to
| do the work? Or communicating a desired end result and
| letting engineering figure it out?
|
| I think fundamentally this is correlated with bloat vs
| lean because the kind of organizations that hire
| headcount thoughtlessly inevitably attempt to manage the
| chaos by pulling back more control into the PM role.
| Which consequently leads to imperative command styles: my
| boss tells me what to do, I tell you, you do it.
|
| The quintessential quote from a call at a bad job was a
| manager saying "We definitely don't want to deliver
| anything they didn't ask for." This after having to
| cobble together 3/4 of the spec during the project,
| because so much functionality was missed.
|
| Or in interview question form posed to the interviewer:
| "Describe how you're told what to build for a new
| project." and "Describe the process if you identify a new
| feature during implementation and want to pitch it for
| inclusion."
| ratww wrote:
| _> a lot like coordinational /planning micromanagement,
| where the manglement had final say on how much effort
| needed to be expended where instead of allowing
| engineering to own the resource allocation process_
|
| Yep, that's a fair assessment!
|
| The 1000+ one was an ERP for mid-large businesses. They
| had 10 or so flagship products (all acquired) and wanted
| to consolidate it all into a single one. The failure was
| more on trying to join the 10 teams together (and
| including lots of field-only implementation consultants
| in the bunch), rather than picking a solid foundation
| that they already owned and handpicking what was needed.
|
| The 500+ was an online marketplace. They had that many
| people because that was a condition imposed by investors.
| People ended up owning parts of a screen, so something
| that was a "two-man in a sprint" ended up being a whole
| team. It was demoralising but I still like the company.
|
| I don't think it's impossible to notice, but it's hard...
| you can ask during interviews about numbers of employees,
| what each one does, ask for examples of what each team
| does on a daily basis. Honestly 100, 500, 1000 people for
| a company is not really a lot, but 100, 500, 1000 for a
| single project is definitely a red flag for me now, and
| anyone trying to pull the "but think of the scale!!!"
| card is a bullshit artist.
| [deleted]
| aij wrote:
| > I've seen plenty of really scalable systems being built
| by a small core team of people who know what they are
| doing.
|
| There is a huge difference between building a system that
| could theoretically be scaled up and actually scaling it up
| efficiently.
|
| At small scales, it's really easy to build on the work of
| others and take things for granted without even knowing
| where the scaling limits are. For example, if I suddenly
| find I need to double my data storage capacity, I can drive
| to a store and come back with a trunk full of hard drives
| the same day. I can only do that because someone already
| build the hard drives, and someone stocked the nearby
| stores with them. If a hyperscaler needs to double their
| capacity, they need to plan it well in advance, allocating
| a substantial fraction of global hard drive manufacturing
| capacity. They can't just assume someone would have already
| built the hardware, much less have it in stock near where
| it's needed.
| ratww wrote:
| _> Why do large sites like Facebook, Amazon, Twitter and
| Instagram all essentially look the same after 10 years but
| some of them now have 10x the amount of engineers? I think
| they have so much data and so many dependencies between parts
| of the system that any fundamental change is extremely hard
| to pull off. They even cut back on features like API access.
| But I am pretty sure that most of them have rewritten the
| whole thing at least 3 times._
|
| I used to work at a unicorn a few years ago, and this hits
| close to home. From 2016 to 2020, the pages didn't change one
| single pixel; however, we had 400 more engineers working
| on the code and three stack iterations: full-stack PHP, PHP
| backend + React SSR frontend, Java backend + [redacted] SSR
| frontend (redacted because only two popular companies use
| this framework). All were rewrites, and those rewrites were
| justified because none of them was ever stable, the site was
| constantly going offline. However each rewrite just added
| more bloat and failure points. At some point the _three_ of
| them were running in tandem: PHP for legacy customers,
| another as main and another on an A/B test. (Yeah, it was a
| dysfunctional environment and I obviously quit).
| axiosgunnar wrote:
| > Yeah, it was a dysfunctional environment and I obviously
| quit
|
| What do you think management could have done better to make
| it not dysfunctional and keep people from quitting?
| [deleted]
| ratww wrote:
| I think just common sense and less bullshit
| rationalisation would have been enough.
|
| They had a billion dollars in cash to burn, so they hired
| more than they needed. They should have hired as needed,
| not as requested by Masayoshi Son.
|
| They shouldn't be so dogmatic. Some teams were too
| overworked, most were underworked (which means over-
| engineering will ensue), but no mobility was allowed
| because "ideally teams have N people".
|
| They shouldn't be so dogmatic pt 2. Services were one-
| per-team, instead of one-per-subject. So yeah, our
| internal tool for putting balloons and clowns into images
| lived together with the authentication micro-service,
| because it's the same team.
|
| Rewriting everything twice without analysis was wrong.
| The rewrites happened because previous versions were "too
| complex" and too custom-made, but the newer ones had an even
| more complex architecture; still, "this time it's right,
| software sometimes needs complexity".
|
| Admitting that some things were terrible would have gone
| a long way. The main node.js server would take 10 to 20
| minutes to launch locally, while something of the same
| complexity would often take about 2 or 3 seconds. Of course
| it would blow up in production! Maybe try to fix that
| instead of ordering another rewrite.
|
| They were good people, I miss the company and still use
| the product, but it didn't need to be like this.
| akkartik wrote:
| Favorited (https://news.ycombinator.com/favorites?id=akka
| rtik&comments=...)
| briandilley wrote:
| What i read here was "Cloud is hard, so I took on even more
| responsibility"
| fxtentacle wrote:
| What you should read is: At the monthly spend of a mid-sized
| company, it is impossible to get phone support from any
| public cloud provider.
| jqgatsby wrote:
| @fxtentacle, I was curious which private search engine this is
| for. Is the system you are describing ImageRights.com?
| fxtentacle wrote:
| No, ImageRights is much more requests and mostly images.
| Also, at ImageRights I don't have management above me that I
| would need to convince :)
|
| This one is text-only and used by influencers and brands to
| check which newspapers report about their events. As I said,
| it's internally used by a few partner companies who buy the
| API from my client and sell news alerts to their clients.
|
| BTW, I'm hoping to one day build something similar as an open
| source search engine where people pay for the data generation
| and then effectively run their own ad-free Google clone, but
| so far interest has been _very_ low:
|
| https://news.ycombinator.com/item?id=30374611 (1 upvote)
|
| https://news.ycombinator.com/item?id=30361385 (5 upvotes)
|
| EDIT: Out of curiosity I just checked and found my intuition
| wrong. The ImageRights API averages 316rps = 819mio requests
| per month. So it's not that much bigger.
| flyinglizard wrote:
| Care to share uptime metrics on AWS vs your own servers?
| fxtentacle wrote:
| That wouldn't be much help because the AWS and Heroku metrics
| are always green, no matter what. If you can't push updates
| to production, they count that as a developer-only outage and
| do not deduct it from their reported uptime.
|
| For me, the most important metric would be the time that my
| team and I spent fixing issues. And that went down
| significantly. After a year of everyone feeling burned out,
| now people can take extended vacations again.
|
| One big issue for example was the connectivity between EC2
| servers degrading, so that instead of the usual 1gbit/s they
| would only get 10mbit/s. It's not quite an outage, but it
| makes things painfully slow and that sluggishness is visible
| for end users. Getting reliable network speeds is much easier
| if all the servers are in the same physical room.
| ckdarby wrote:
| >I used to feel powerless and stressed out by the complexity
| and the scale, because whenever stuff broke (and it always does
| at this scale), I had to start playing politics, asking for
| favors, or threatening people on the phone to get it fixed.
| Higher management would hold me accountable for the downtime
| even when the whole S3 AZ was offline and there was clearly
| nothing I could do except for hoping that we'll somehow reach
| one of their support engineers.
|
| If the business can't afford to have downtime then they should
| be paying for enterprise support. You'll be able to connect to
| someone in < 10 mins and have dedicated individuals you can
| reach out to.
| jerjerjer wrote:
| You never hosted on AWS, did you?
| ckdarby wrote:
| >You never hosted on AWS, did you?
|
| Previously 2k employee company, with the entire advertising
| back office on AWS.
|
| Currently >$1M YR at AWS, you can get the idea of scale &
| what is running, here: https://www.youtube.com/playlist?lis
| t=PLf-67McbxkT6iduMWoUsh...
| 0x445442 wrote:
| In the two years I worked on serverless AWS I filed four
| support tickets. Three out of those four I came up with the
| solution or fix on my own before support could find a
| solution. The other ticket was still open when I left the
| company. But the best part is when support wanted to know
| how I resolved the issues. I always asked how much they
| were going to pay me for that information.
| phillu wrote:
| Enterprise Support has never disappointed me so far. Maybe not
| <10 minute response time, but we never felt left alone
| during an outage. But I guess this is also highly
| region/geo dependent.
| FpUser wrote:
| >"they should be paying for enterprise support"
|
| This sounds a bit arrogant. I think they found better and
| overall cheaper solution.
| ckdarby wrote:
| >This sounds a bit arrogant.
|
| The parent thread talks about how the business could not go
| down even with a triple AZ outage for S3, and I don't think
| it is arrogant to state they should be paying for
| enterprise support if that level of expectation is set.
|
| >I think they found better and overall cheaper solution.
|
| Cheaper solution does not just include the cost but also
| the time. On the time side, we need to look at what they
| spent _regardless of department_ to acquire, migrate off of
| AWS, modify the code to work for their multi-private
| cloud, etc. I'd believe it if they're willing to say they
| did this, have been running for three years, and compiled
| the numbers in Excel. It is common, if you ask internally
| whether it was worth it, to get a yes because people put their
| careers on it and want to have a "successful" project.
|
| The math doesn't work out in my experience with clients in
| the past. The scenarios that do work out are: top 30 in the
| entire tech industry, significant GPU training, egress
| bandwidth (CDN, video, assets), or businesses that are
| basically selling the infrastructure (think Dropbox,
| Backblaze, etc.).
|
| I'm sure someone will throw down some post where their
| cost, $x is less than $y at AWS, but that is _such_ a tiny
| portion that if the cost is not >50% it isn't even worth
| looking at the rest of the math. The absolute total cost of
| ownership is much harder than most clickbait articles are
| willing to go into. I have not seen any developers talk
| about how it changes the income statement & balance sheet
| which can affect total net income and how much the company
| will lose just to taxes. One argument assumes that it evens
| out after the full amortization period in the end.
|
| Here are just a handful of factors that get overlooked:
| supply chain delays, migration time, access to expertise,
| retaining staff, churn increase due to pager/call rotation,
| the opportunity cost of capital tied up in idle/spare
| inventory, and plenty more.
| FpUser wrote:
| So you're basically saying that no matter what, one should
| always stick to Amazon. I have my own experience that
| tells me exactly the opposite. To each their own. We do not
| have to agree.
| ckdarby wrote:
| >So you basically saying that no matter what one should
| always stick to Amazon.
|
| What I am saying is: given the list of exceptions I gave,
| the business should run/colocate their own gear if they're on
| the exception list, or the components that fall on the
| exception list should be moved out.
|
| >I have my own experience that tells exactly the
| opposite.
|
| You begin using AWS on your first day ever and on that
| day it has a tri-AZ outage for S3. In this example the
| experience with AWS has been terrible. Zooming out over
| 5 years, though, it wouldn't look like a terrible experience
| at all, considering outages are limited and honestly not
| that frequent.
| FpUser wrote:
| >"You begin using AWS for your first day ever"
|
| I am not talking about outages here. Bad things can
| happen. It's more about the price.
| fxtentacle wrote:
| Back then, it was enough to saturate the S3 metadata node
| for your bucket and then all AZs would be unable to
| service GET requests.
|
| And yes, this won't be financially useful in every
| situation. But if the goal is to gain operational
| control, it's worthwhile nonetheless. That said, for a
| high-traffic API, you're paying through the nose for AWS
| egress bandwidth, so it is one of those cases where it
| also very much makes financial sense.
| ckdarby wrote:
| Same fxtentacle as CTO of ImageRights? If that is the
| case, my follow-up question is: did you actually move
| everything out of AWS? Or did you just take the same
| approach as Netflix with Open Connect: 95th-percentile
| billing + unmetered peering with ISPs to reduce costs.
| BossingAround wrote:
| I don't read that as arrogant. The full statement is:
|
| > If the business can't afford to have downtime then they
| should be paying for enterprise support.
|
| It's simply stating that it's either cheaper for business
| to have downtime, or it's cheaper to pay for premium
| support. Each business owner evaluates which is it for
| them.
|
| If you absolutely can't afford downtime, chances are
| premium support will be cheaper.
| [deleted]
| ddorian43 wrote:
| What are you using for aws alternatives? Example for S3?
| ckdarby wrote:
| >What are you using for aws alternatives? Example for S3?
|
| Not OP but they're probably using Rook/Minio
| [deleted]
| fxtentacle wrote:
| docker + self-developed image management + CEPH
| mmcnl wrote:
| If you rely on public cloud infrastructure, you should
| understand both the advantages and disadvantages. Seems like
| your company forgot about the disadvantages.
| nostrebored wrote:
| You had problems with management of a cloud based api and
| executive visibility... so you bought a set of data centers to
| handle 500mio req per month?
|
| The visibility you will get after the capex when there's a
| truly disastrous outage will be interesting.
| Damogran6 wrote:
| As a security guy I HATE the loss of visibility in going to
| the cloud. Can you duplicate it? Sure. Still not as easily as
| spanning a trunk and you still have to trust what you're
| seeing to an extent.
| nostrebored wrote:
| The visibility I was mentioning in the parent comment was
| visibility from executives in your business, but I can see
| how it would be confusing.
|
| There are tradeoffs -- cloud removes much of the physical
| security risks and gives you tools to help automated
| incident detection. Things like serverless functions let
| you build out security scaffolding pretty easily.
|
| But in exchange you do have to give some trust. And I
| totally understand resistance there.
| justinclift wrote:
| > cloud removes much of the physical security risks
|
| Doesn't cloud _increase_ the physical security risks,
| rather than decrease /remove?
| NavinF wrote:
| Hmm that's only 190Hz on average, but we don't know what kind
| of search engine it is. For example if he's doing ML
| inference for every query, it would make perfect sense to get
| a few cabinets at a data center. I've done so for a _much_
| smaller project that only needs 4 GPUs and saved a ton of
| money.
| fxtentacle wrote:
| Nah, it's text-only requests returning JSON arrays of which
| newspaper article URLs mention which influencer or brand
| name keyword.
|
| The biggest hardware cost is that you need insane
| amounts of RAM so that you can mmap the bloom hash for the
| mapping from word_id to document_ids.
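| For a rough sense of why RAM plus mmap wins here, consider a toy
| lookup table with fixed-size postings slots indexed by word_id
| (a simplified sketch, not the actual on-disk format; the file
| name and slot layout are made up):
|
|     import mmap
|     import struct
|
|     SLOT_SIZE = 8 + 8 * 64   # count + up to 64 doc ids per word
|
|     def doc_ids_for(word_id, index_file="postings.bin"):
|         """Read the postings slot for word_id straight from an
|         mmap-ed file; hot pages stay in the OS page cache."""
|         with open(index_file, "rb") as f:
|             with mmap.mmap(f.fileno(), 0,
|                            access=mmap.ACCESS_READ) as mm:
|                 offset = word_id * SLOT_SIZE
|                 (count,) = struct.unpack_from("<q", mm, offset)
|                 return list(struct.unpack_from(f"<{count}q", mm,
|                                                offset + 8))
|
| A lookup is then a couple of memory reads, with no query parsing
| or network hop in the path.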
| winrid wrote:
| You could have used a sharded database like Mongo. Just
| throw up 10 shards, use "source" (influencer or brand
| name) as shard key?
| fxtentacle wrote:
| Yes, I could have used Mongo, but it would have been 100x
| to 1000x slower than an mmap-ed lookup table.
| joshuamorton wrote:
| But you don't actually need that level of performance?
| You've made this system more complex and expensive to
| achieve a requirement that doesn't matter?
| shoo wrote:
| you seem to have a deeper knowledge of the business &
| organisational context that dictate the true requirements
| than someone working there. please share these details so
| we can all learn!
| joshuamorton wrote:
| Sure: the network request time of a person making a
| request over the open internet is going to be an order of
| magnitude longer than a DB lookup (in the right style,
| with a reverse-index) on the scale of data this person is
| describing. So making the lookup 10x faster saves
| you...1% of the request latency.
|
| And at the qps they've described, it's not a throughput
| issue either. So I'm pretty confident in saying that this
| is a case of premature optimization.
|
| Like put another way, if this level of performance was
| the dominating factor
| nostrebored wrote:
| Why ever use mmap instead of sharded inverted indices of
| word-doc here, a la elasticsearch?
| winrid wrote:
| Yeah the question is what level of performance you need I
| guess... was hoping you could clarify :)
| fxtentacle wrote:
| You might be surprised. The performance equivalent of $100k
| monthly in EC2 spend fits into a 16m2 cage with 52HU racks.
| dekhn wrote:
| that cage is a liability, not an asset. How is the
| networking in that rack? What's its connection to large-
| scale storage (i.e., petabytes, since that's what I work
| with)? What happens if a meteor hits the cage? Etc.
| fxtentacle wrote:
| That depends on what contracts you have. You could have
| multiple of these cages in different locations. Also, 1
| PB is only 56 large enterprise HDDs. So you just put
| storage into the cage, too.
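| That "56" assumes roughly 18 TB enterprise drives, which is my
| own assumption rather than a quoted spec:
|
|     print(1000 / 18)   # ~55.6 drives per raw petabyte, before
|                        # any RAID/ZFS redundancy overhead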
|
| But my point wasn't about how precisely the hardware is
| managed. My point was that with a large cloud, a mid-
| sized company has effectively NO SUPPORT. So anything
| that gives you more control is an improvement.
| dekhn wrote:
| "1 pb is only 56 large enterprise hdds".
|
| umm, what happens when one fails?
|
| With large cloud my startup had excellent support. We
| negotiated a contract. That's how it works.
| fxtentacle wrote:
| Typically people use RAID or ZFS to prevent data loss
| when a few hdds fail.
| dekhn wrote:
| OK, so basically you're in a completely different class
| of expectations about how systems perform under disk loss
| and heavy load then me. A drive array is very different
| from large-scale cloud storage.
| fxtentacle wrote:
| Hard to say. My impression is:
|
| - A large ZFS pool of SSDs is much faster than any cloud
| storage.
|
| - Cloud storage failed much more often than the SSDs in
| our pool.
|
| - "Noisy neighbor" is an issue on the cloud
| qorrect wrote:
| This cracked me up. Thanks fxtentacle :D.
| benjamir wrote:
| Which costs you more than $100k monthly to operate with the
| same level of manageability and reliability.
|
| We don't use AWS, because our use cases don't require that
| level of reliability and we simply cannot afford it, but if
| I ran a company that depends on IT and generates enough
| revenue... I probably wouldn't argue about the AWS bill. For
| now, prepaid Hetzner + in-house works well enough, but
| I know what I _cannot_ offer to my users with the click of a
| button!
| Spooky23 wrote:
| This is a religious debate among many. The IT/engineering
| nerd stuff doesn't matter at all. Cloud migration
| decisions are always driven by accounting and tax factors.
|
| I run two critical apps, one on-prem and one cloud. There
| is no difference in people cost, and the cloud service
| costs about 20% more on the infrastructure side. We went
| cloud because customer uptake was unknown and making
| capital investments didn't make sense.
|
| I've had a few scenarios where we've moved workloads from
| cloud to on-prem and reverse. These things are tools and
| it doesn't pay to be dogmatic.
| sdoering wrote:
| > These things are tools and it doesn't pay to be
| dogmatic.
|
| I wish I would hear this line more often.
|
| So many things today are (pseudo-) religious now. The
| right framework/language, cloud or on-prem, x vs not x.
|
| Especially bad imho when somebody tries to tell you how
| you could do better with 'not x' instead of the x you are
| currently using, without even trying to understand the
| context this decision resides in.
|
| [Edit] typo
| qorrect wrote:
| > So many things today are (pseudo-) religious now. The
| right frsmework/language, cloud or on prem, x vs not x.
|
| Might have always been that way? We just have so many
| more tools to argue over now.
| throwaway219732 wrote:
| nikhilsimha wrote:
| I used to lead teams that owned a message bus, a stream processing
| framework and a distributed scheduler (like k8s) at Facebook.
|
| The oncall was brutal. At some point I thought I should work on
| something else, perhaps even switch careers entirely. However
| this also forced us to separate user issues and system issues
| accurately. That's only possible because we are a platform team.
| Since then I regained my love for distributed systems.
|
| Another thing is, we had to cut down on the complexity - reduce
| number of services that talked to each other to a bare minimum.
| Weigh features for their impact vs. their complexity. And
| _regularly_ rewrite stuff to reduce complexity.
|
| Now, Facebook being Facebook, it valued speed and complexity over
| stability and simplicity, especially when it comes to career
| growth discussions. So it's hard to build good infra in the
| company.
| robertlagrant wrote:
| I like that the mantra went from "move fast and break things"
| to (paraphrased) "move fast and don't break things".
| ehnto wrote:
| It's been a pretty poor mantra from the beginning anyway. How
| about we move at a realistic pace and deliver good features,
| without burning out, and without leaving a trail of half-
| baked code behind us?
| avensec wrote:
| Yes, but in a different way. I work in Quality Engineering, and
| the scope of maturity in testing distributed systems has been
| exhausting.
|
| Reading other comments from the thread, I see similar
| frustrations from teams I partner with. How to employ patterns
| like contract, hypothesis, doubles, or shape/data systems (etc.)
| typically gets conflated with system testing. Teams often
| disagree on the boundaries of the system, start leaning towards
| system testing, and end up adding additional complexity in tests
| that could be avoided.
|
| My thought is that I see the desire to control more scope
| presenting itself in test. I typically find myself doing some
| bounded context exercises to try to hone in on scope early.
| kodah wrote:
| If you're working on distributed systems scheduling and
| orchestration, then yeah it's exhausting. I did it for six years
| as a SRE-SE and am now back to being a SWE on a product team. If
| you like infrastructure stuff without having responsibility for
| the whole system the way that scheduling and orchestration makes
| you, then look at working on an infrastructure product.
| chubot wrote:
| My take is that it's exhausting because everything is so damn
| SLOW.
|
| "Back to the 70's with Serverless" is a good read:
|
| https://news.ycombinator.com/item?id=25482410
|
| The cloud basically has the productivity of a mainframe, not a
| workstation or PC. It's big and clunky.
|
| ----
|
| I quote it in my own blog post on distributed systems
|
| http://www.oilshell.org/blog/2021/07/blog-backlog-2.html
|
| https://news.ycombinator.com/item?id=27903720 - _Kubernetes is
| Our Generation 's Multics_
|
| Basically I want basic shell-like productivity -- not even an
| IDE, just reasonable iteration times.
|
| At Google I saw the issue where teams would build more and
| more abstractions and concepts without GUARANTEES. So
| basically you still have to debug the system with shell.
| It's a big tower of leaky abstractions. (One example: I had
| to turn up a service in every data center at Google, and I
| did it with shell invoking low-level tools, not the
| abstractions provided.)
|
| Compare that with the abstraction of a C compiler or Python,
| where you rarely have to dip under the hood.
|
| IMO Borg is not a great abstraction, and Kubernetes is even
| worse. And that doesn't mean I think something better exists
| right now! We don't have many design data points, and we're still
| learning from our mistakes.
|
| ----
|
| Maybe a bigger issue is incoherent software architectures.
| In particular, disagreements on where the authoritative
| state is, and a lot of incorrect caches that paper over
| issues. If everything works 99.9% of the time, well,
| multiply those probabilities together and you end up with a
| system that requires A LOT of manual work to keep running.
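|
| To put rough numbers on that (illustrative only): a request
| that has to traverse n components, each 99.9% available on
| its own, sees roughly 0.999^n end-to-end, and that drops
| off quickly.
|
|   # Illustrative arithmetic: end-to-end availability when a
|   # request must traverse n serial components, each with
|   # individual availability p.
|   def compound_availability(p: float, n: int) -> float:
|       return p ** n
|
|   for n in (5, 20, 50):
|       print(n, round(compound_availability(0.999, n), 4))
|   # 5  -> 0.995   (~3.6 hours of downtime a month)
|   # 20 -> 0.9802  (~14 hours a month)
|   # 50 -> 0.9512  (~35 hours a month)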
|
| So I think the cloud has to be more principled about state and
| correctness in order not to be so exhausting.
|
| If you ask engineers working on a big distributed system where
| the authoritative state in their system is stored, then I think
| you will get a lot of different answers...
| jeffrallen wrote:
| Yes, it's part of why I'm a stay-at-home dad who does a
| little bash-scripting sysadmin work as a side job.
|
| Everything has gotten too complicated and slow.
| SNosTrAnDbLe wrote:
| I actually love it, and the more complex the system the
| better. I have been doing it for more than 10 years now,
| and every day I learn something new from both the legacy
| system and the replacement that we work on.
| smoyer wrote:
| Using micro-services instead of monoliths is a great way for
| software engineers to reduce the complexities of their code.
| Unfortunately, it moves the complexity to operations. In an
| organization with a DevOps culture, the software engineers still
| share responsibility for resolving issues that occur between
| their micro-service and others.
|
| In other organizations, individual teams have ICDs and SLAs for
| one or more micro-services and can therefore state they're
| meeting their interface requirements as well as capacity/uptime
| requirements. In these organizations, when a system problem
| occurs, someone who's less familiar with the internals of these
| services will have to debug complex interactions. In my
| experience, once the root cause is identified, there will
| be one or more teams who get updated requirements - why not
| make them stakeholders at the system level and expedite the
| process?
| gbtw wrote:
| In all the situations where I have had to work on
| microservices, it generally means the team just works on
| all the different services, now spread out over more
| applications, doing more integration work instead of actual
| business logic. The architect wanting fancy microservices
| doesn't mean there's actually money to do it properly, or
| even to have an ops team.
|
| Also, for junior team members a lot of this stuff works via
| magic, because they can't yet see where the boundaries are
| or don't understand all the automagic configuration stuff.
|
| Also, the amount of 'works on my machine' with Docker is
| staggering, even when the developers' laptops are from the
| same batch / imaged the same way.
| ickyforce wrote:
| > Using micro-services instead of monoliths is a great way for
| software engineers to reduce the complexities of their code
|
| Could you share why you think that's true?
|
| IMO it's exactly the opposite: microservices have the
| potential to simplify operations and processes (smaller
| artifacts, independent development/deployments, isolation,
| architectural boundaries that are easier to enforce), but
| when it comes to code and internal architecture, they are
| always more complex.
|
| If you take microservices and merge them into a monolith,
| it will still work; you don't need to add code or increase
| complexity. You can actually remove code: anything related
| to network calls, data replication between components (if
| they now share a DB), etc.
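|
| A toy illustration of that (hypothetical service and
| function names): the same lookup as a cross-service HTTP
| call versus a plain in-process call after the merge. The
| timeout/retry/deserialization plumbing simply disappears.
|
|   # Hypothetical: the same lookup, remote vs. merged into one
|   # process. Only the microservice version needs the network
|   # plumbing (timeouts, retries, deserialization, errors).
|   import json
|   import time
|   import urllib.request
|
|   def get_user_remote(user_id: int) -> dict:
|       url = f"http://user-service.internal/users/{user_id}"
|       for attempt in range(3):  # crude retry on failure
|           try:
|               with urllib.request.urlopen(url, timeout=2) as r:
|                   return json.loads(r.read())
|           except OSError:
|               time.sleep(0.1 * (2 ** attempt))
|       raise RuntimeError("user-service unavailable")
|
|   # Monolith version: the boundary is just a function call.
|   def get_user_local(user_id: int, users_table: dict) -> dict:
|       return users_table[user_id]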
| eez0 wrote:
| I find it actually the other way around.
|
| As you said, a benefit of large distributed systems is that
| usually its a shared responsibility, with different teams owning
| different services.
|
| The exhaustion comes into place when those services are not
| really independent, or when the responsibility is not really
| shared, which in turn is just a worse version of a typical system
| maintained by sysadmins.
|
| One thing that helps is bring the DevOps culture into the
| company, but the right way. It's not just about "oh cool we are
| now agile and deploy a few times a day", it's all down to shared
| responsibility.
| wilde wrote:
| Without more info it's hard to say. When I felt like this, a
| manager recommended I start journaling my energy. I kept a Google
| doc with sections for each week. In each section, there's a
| bulleted list of things I did that gave me energy and a list of
| things I did that took energy.
|
| Once you have a few lists some trends become clear and you can
| work with your manager to shift where you spend time.
___________________________________________________________________
(page generated 2022-02-19 23:01 UTC)