[HN Gopher] Our journey from a Python monolith to a managed plat...
___________________________________________________________________
Our journey from a Python monolith to a managed platform
Author : mands
Score : 83 points
Date : 2021-03-16 15:00 UTC (8 hours ago)
(HTM) web link (dropbox.tech)
(TXT) w3m dump (dropbox.tech)
| eevilspock wrote:
| I just want to know when they'll switch to native,
| battery-efficient clients, especially given the daemon is always
| running and monitoring file system events.
| bps4484 wrote:
| I'm most curious what the experience has been, for both the
| Atlas team and the product teams, around this quote: "Atlas is
| "managed," which means that developers writing code in Atlas only
| need to write the interface and implementation of their
| endpoints. Atlas then takes care of creating a production cluster
| to serve these endpoints. The Atlas team owns pushing to and
| monitoring these clusters."
|
| Does this imply that the Atlas team gets into the weeds of
| understanding the business and the business logic behind these
| endpoints, in order to know their scalability and throughput
| needs? Is the autoscaler really good enough to handle this? If
| it's transparent to the product team, are they aware of their
| (potentially unexpected) usage? I imagine the Atlas team would
| have to be very large to carry these sorts of responsibilities.
|
| From a product team perspective, I imagine they are still
| responsible for database configuration and tuning? Has the daily
| auto-deployment led to unexpected breakages? Who is responsible
| for rollbacks? And is the product team responsible for, and
| capable of, hotfixes?
|
| Maybe a broader question, which all of my questions above speak
| to: how are the roles and responsibilities split between the
| Atlas team and the product engineering teams that own the code,
| and how has the transition to that system gone?
| qbasic_forever wrote:
| So what's the end game here: is Dropbox going to keep building
| out an internal Kubernetes-like platform with Atlas, or do they
| plan to eventually just move to k8s? I noticed this line in
| particular:
|
| "We evaluated using off-the-shelf solutions to run the platform.
| But in order to de-risk our migration and ensure low engineering
| costs, it made sense for us to continue hosting services on the
| same deployment orchestration platform used by the rest of
| Dropbox."
|
| It sounds like they acknowledge they're reinventing a lot of
| stuff but for now are sticking with their internal platform.
| Perhaps Atlas is a half-step, then, to get teams used to owning
| and running their code as isolated services. But everything I
| read that they built in Atlas--isolated orchestrated services,
| gRPC load balancing, canary deployments, horizontal scaling,
| etc.--is a bog-standard feature of Kubernetes today. I'd be very
| leery of maintaining a bespoke Kubernetes-like platform in 2021
| and beyond; in some ways it seems like it just shifts the
| monolith's technical debt into an internal Atlas platform team's
| technical debt. What's the plan to get rid of that debt for
| good, I wonder?
|
| This hurdle shows there are already some cracks in the idea of
| Atlas as a long-term solution, too:
|
| "While splitting up Metaserver had wins in production, it was
| infeasible to spin up 200+ Python processes in our integration
| testing framework. We decided to merge the processes back into a
| monolith for local development and testing purposes. We also
| built heavy integration with our Bazel rules, so that the merging
| happens behind the scene and developers can reference
| Atlasservlets as regular services."
|
| If I read that right, does it mean the first time a developer's
| code runs the way it will run in production is when it goes out
| to canary deployment? I.e., integration tests run against a
| local monolith instead of a mini-prod cluster. That seems a bit
| nerve-racking as a dev, having no way to really test the service
| until bits are hitting user requests. In the k8s world a ton of
| work has gone into tooling and processes to make setting up
| local clusters easy. It's a shame not to have something similar
| for Atlas.
| hayst4ck wrote:
| Counter-considerations would be: What is the delta between
| out-of-the-box solutions and the current one? What is the cost
| of the migration? For what period will two systems be supported
| simultaneously? Will development effort continue on the previous
| system while the new one is built? Will the new system actually
| solve problems the old one didn't? What happens when Kubernetes
| is insufficient for a task, or has a critical bug that only
| appears at scale? How will people be onboarded onto the new
| system? Will the team that owns how services run perform the
| migration, or will the teams that own the services? How much
| time should be spent experimenting with new things versus fixing
| bugs and adding requested features?
|
| > in some ways it seems like it's just shifting the monolith
| technical debt into an internal Atlas platform team's technical
| debt.
|
| This is a key insight into the monolith problem. How does a
| monolith become poor quality and unmaintainable? It becomes
| unmaintainable when there is no entity enforcing architectural
| simplicity. It becomes unmaintainable when there is no team
| focused solely on how the monolith functions. It becomes
| unmaintainable when there is no entity capable of saying "no" to
| a product engineer. A monolith in a company with weak leadership
| is a tragedy of the commons: everyone takes from the commons by
| adding complexity, and there is no governing entity to ensure
| that the commons remains viable.
|
| The exact point you make is the key strength of this approach.
| Where there was a vacuum of responsibility before (the
| monolith's technical debt), a team has been created with direct
| responsibility and authority: a governing force over that
| technical debt and overall complexity, and therefore an entity
| directly accountable for improving it. That is a key first step.
| Atlas appears to be a compromise solution rather than an ideal
| end state.
| spondyl wrote:
| > A monolith becomes poor quality and unmaintainable when
| there is no entity enforcing architectural simplicity.
|
| Having worked at a company where no single team "owned" the
| monolith, I heard the term "communally owned" come up a lot.
|
| It was generally understood within the platform teams that if
| everyone owns it then, in reality, no one owns it :)
| lalos wrote:
| Semi-related: when people talk about monorepos, is it implied
| that the whole project has only one version number? Why not just
| version the subprojects of the monorepo individually? That way
| you have a small vetting process when cutting a release of a
| specific subproject, and the subprojects that depend on it can
| read the release notes for breaking changes, etc.
| tudelo wrote:
| No, not at all. A monorepo with multiple backend projects might
| push each of them at different times. What this means is that
| when you design new features that span multiple services, you
| need to design them with push safety in mind and a rollout plan
| to match.
|
| For example, say you are updating service B to call a new
| endpoint on service A. First you need to make service A's
| endpoint available, and only then make service B call it (see
| the sketch below).
|
| Just because everything lives in the same repo does not mean it
| all gets shoved out at once. The downside is that you can't just
| read the code and assume the running service is doing exactly
| that, unless the revision is embedded in your build. Processes
| like automated updates and a forced update cadence (no running
| binaries over X days old), with proper canary/vetting before a
| full release, let a large org still manage this complexity.
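|
| A rough sketch of what that two-phase rollout might look like.
| Everything here is made up for illustration (the quota endpoint,
| the in-process flag dict); the point is just the ordering and
| the flag guard:
|
|   # Phase 1: service A ships the new endpoint first. Nothing
|   # calls it yet, so this deploy is safe on its own.
|   def get_user_quota(user_id: str) -> dict:
|       return {"user_id": user_id, "quota_bytes": 2 * 1024**3}
|
|   # Phase 2: only after A is fully rolled out does service B
|   # start calling it, guarded by a flag so the change can be
|   # turned off without rolling back a binary.
|   # Stand-in for a real flag service:
|   FLAGS = {"use_new_quota_endpoint": True}
|
|   def legacy_quota_lookup(user_id: str) -> int:
|       return 1 * 1024**3  # old code path, kept until cleanup
|
|   def quota_for(user_id: str) -> int:
|       if FLAGS["use_new_quota_endpoint"]:
|           return get_user_quota(user_id)["quota_bytes"]
|       return legacy_quota_lookup(user_id)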
| [deleted]
| dec0dedab0de wrote:
| To me the whole point of having a monorepo is to avoid
| versioning. But that doesn't mean you always need to deploy
| everything; you can take the newest commit hash that touched
| each project directory and only deploy the projects where that
| hash has changed (rough sketch below).
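|
| A minimal sketch of that check, assuming a plain git checkout
| (the project paths and the place the last-deployed hashes live
| are made up):
|
|   import subprocess
|
|   def head_commit(path: str) -> str:
|       # Latest commit hash that touched this directory.
|       out = subprocess.check_output(
|           ["git", "log", "-1", "--format=%H", "--", path],
|           text=True)
|       return out.strip()
|
|   # Wherever you record what was last deployed (DB, file, etc.).
|   last_deployed = {"services/metaserver": "abc123"}
|
|   for project in ("services/metaserver", "services/atlas"):
|       if head_commit(project) != last_deployed.get(project):
|           print(f"{project} changed -> redeploy")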
| reidrac wrote:
| Good question! I always assume it is the same version, mostly
| because a common pattern is to use SCM tags to track versions,
| and I haven't seen that work well in any monorepo.
|
| If you have a specific version per subproject, how do you track
| that in the repo? Different tag schemes for different
| subprojects? I have used that in a small-ish monorepo and I
| didn't especially like it.
| derekperkins wrote:
| We just use the git sha and it works great; no reason to
| overcomplicate it.
| mumblemumble wrote:
| When I'm doing it, I version subprojects independently using Git
| tags, but that's primarily because we mix and match versions in
| production, which creates a need for semantic version numbering
| (a sketch of one such tag scheme is below).
|
| If we were also doing continuous delivery, I could see there not
| being much value in messing with independent versioning, semver,
| whatever. Just make today's date the universal version number
| for all modules and move along.
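|
| One way to do the tagging (the subproject names and the exact
| scheme here are just illustrative, not necessarily what anyone
| in this thread uses) is to namespace tags by subproject, e.g.
| metaserver/v1.4.2:
|
|   import subprocess
|
|   def cut_release(subproject: str, version: str) -> None:
|       # e.g. cut_release("metaserver", "v1.4.2") creates
|       # the tag "metaserver/v1.4.2"
|       subprocess.check_call(
|           ["git", "tag", f"{subproject}/{version}"])
|
|   def releases(subproject: str) -> list:
|       # All release tags for one subproject, ignoring the
|       # rest of the repo.
|       out = subprocess.check_output(
|           ["git", "tag", "--list", f"{subproject}/v*"],
|           text=True)
|       return out.split()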
| JonAtkinson wrote:
| I'd be interested to better understand the timeline around this
| statement:
|
| "Metaserver was stuck on a deprecated legacy framework that
| unsurprisingly had poor performance and caused maintenance
| headaches due to esoteric bugs. For example, the legacy framework
| only supports HTTP/1.0 while modern libraries have moved to
| HTTP/1.1 as the minimum version."
|
| Dropbox has been around for many years and has raised a lot of
| cash; was it only recently that they could pay down this
| technical debt? Were they really so busy in other areas that
| this was allowed to fester?
| muglug wrote:
| There's normally some sort of budget for paying down technical
| debt - presumably there was more pressing technical debt.
___________________________________________________________________
(page generated 2021-03-16 23:01 UTC)