[HN Gopher] Our journey from a Python monolith to a managed platform
       ___________________________________________________________________
        
       Our journey from a Python monolith to a managed platform
        
       Author : mands
       Score  : 83 points
       Date   : 2021-03-16 15:00 UTC (8 hours ago)
        
 (HTM) web link (dropbox.tech)
 (TXT) w3m dump (dropbox.tech)
        
       | eevilspock wrote:
        | I just want to know when they'll switch to native,
        | battery-efficient clients, especially given the daemon is always
        | running and monitoring file system events.
        
       | bps4484 wrote:
        | I'm most curious about how the experience was, both for the Atlas
       | team and the product team, around this quote, "Atlas is
       | "managed," which means that developers writing code in Atlas only
       | need to write the interface and implementation of their
       | endpoints. Atlas then takes care of creating a production cluster
       | to serve these endpoints. The Atlas team owns pushing to and
       | monitoring these clusters."
       | 
        | Does this imply that the Atlas team gets into the weeds of
       | understanding the business and business logic behind these
       | endpoints to know the scalability and throughput needs? Is the
       | autoscaler really good enough to handle this? If it's transparent
       | to the product team, are they aware of their usage (potentially
        | unexpected)? I imagine the Atlas team would have to be very large
       | with these sorts of responsibilities.
       | 
        | From a product team perspective, I imagine they are still
        | responsible for database configuration and tuning? Has the daily
        | auto-deployment led to unexpected breaks? Who is responsible for
        | rollbacks? And is the product team responsible for, and capable
        | of, hotfixes?
       | 
        | Maybe a broader question, which all of my questions above speak
        | to: how are the roles and responsibilities set up between the
        | Atlas team and the product engineering team that owns the code,
        | and how has the transition to that system been?
        
       | qbasic_forever wrote:
        | So what's the end game here: is Dropbox going to keep building
       | out an internal Kubernetes-like platform with Atlas, or do they
       | plan to eventually just move to k8s? I noticed this line in
       | particular:
       | 
       | "We evaluated using off-the-shelf solutions to run the platform.
       | But in order to de-risk our migration and ensure low engineering
       | costs, it made sense for us to continue hosting services on the
       | same deployment orchestration platform used by the rest of
       | Dropbox."
       | 
       | It sounds like they acknowledge they're reinventing a lot of
       | stuff but for now are sticking to their internal platform.
       | Perhaps Atlas is a half-step then to get teams used to owning and
        | running their code as isolated services. But everything I read
        | that they built in Atlas--isolated orchestrated services, gRPC
        | load balancing, canary deployments, horizontal scaling, etc.--is
        | a bog-standard feature of Kubernetes today. I'd be very leery of
        | maintaining a bespoke Kubernetes-like platform in 2021 and
        | beyond--in some ways it seems like it's just shifting the
        | monolith technical debt into an internal Atlas platform team's
        | technical debt. What's the plan to get rid of that debt for good,
        | I wonder?
       | 
       | This hurdle shows there's already some cracks in the idea of
       | long-term Atlas too:
       | 
       | "While splitting up Metaserver had wins in production, it was
       | infeasible to spin up 200+ Python processes in our integration
       | testing framework. We decided to merge the processes back into a
       | monolith for local development and testing purposes. We also
       | built heavy integration with our Bazel rules, so that the merging
       | happens behind the scene and developers can reference
       | Atlasservlets as regular services."
       | 
        | If I read that right, does it really mean the first time a
        | developer's code runs the way it will run in production is when
        | it goes out to canary deployment? I.e. integration tests are done
        | in a local monolith instead of a mini-prod cluster. That seems a
        | bit nerve-racking as a dev: there's no way to really test the
        | service until bits are hitting user requests. In the k8s world a
        | ton of work has been put into tooling and processes to make
        | setting up local clusters easy. It's a shame not to have
        | something similar for Atlas.
        
         | hayst4ck wrote:
         | Counter considerations would be: What is the delta between out
         | of box solutions and current solutions? What is the cost of the
         | migration? For what period will two services be supported
         | simultaneously? Will development effort continue on the
         | previous service while the new one is created? Will the new
         | service successfully solve problems the old one didn't? What
         | happens when Kubernetes is insufficient for a task or has a
         | critical bug that only appears at scale? How will people be
         | onboarded into the new system? Will the team handling how
         | services run perform the migration or will the teams who own
         | services perform the migration? How much time should be spent
         | experimenting with new things compared to fixing bugs/adding
         | requested features?
         | 
         | > in some ways it seems like it's just shifting the monolith
         | technical debt into an internal Atlas platform team's technical
         | debt.
         | 
         | This is a key insight into the monolith problem. How does a
          | monolith become poor quality and unmaintainable? A monolith
          | becomes poor quality and unmaintainable when there is no entity
         | enforcing architectural simplicity. It becomes unmaintainable
         | when there is no team focusing solely on how the monolith
         | functions. It becomes unmaintainable if there is no entity
         | capable of saying "no" to a product engineer. A monolith in a
         | company with weak leadership is a tragedy of the commons where
         | everyone takes from the commons by adding complexity and there
         | is no governing entity to ensure that the commons remains
         | viable.
         | 
          | The exact statement you made is the key strength of this
          | approach. Where there was a vacuum of responsibility before
          | (monolith technical debt), a team has been created with direct
          | responsibility and authority, creating a governing force over
          | that technical debt and overall complexity, and therefore an
          | entity directly responsible for improving it. This is a key
          | first step. Atlas appears to be a compromise solution rather
          | than an ideal end state.
        
           | spondyl wrote:
           | > A monolith becomes poor quality and unmaintainable when
           | there is no entity enforcing architectural simplicity.
           | 
            | Having worked in a company where no single team "owned" the
            | monolith, I heard the term "communally owned" come up a lot.
           | 
           | It was generally understood within the platform teams that if
           | everyone owns it then in reality, no one owns it :)
        
       | lalos wrote:
        | Semi-related: when people talk about monorepos, is it implied
        | that the whole project has only one version number? Why not just
        | version subprojects of the monorepo? That way you have a small
        | vetting process when cutting a release of a specific subproject,
        | and the rest of the subprojects that depend on it can read the
        | release notes for breaking changes, etc.
        
         | tudelo wrote:
          | No. Not at all. A monorepo with multiple backend projects might
          | push them all at different times. What this means is that when
          | designing new features across multiple services, you need to
          | design them with push safety in mind and a rollout plan to
          | accomplish that.
          | 
          | For example, say you are updating service B to call a new
          | endpoint on service A. First you need to make the new service A
          | endpoint available, and only then make service B call it.
          | 
          | Just because everything exists in the same repo does not mean
          | it all gets shoved out at once. The downside is that you can't
          | just read the code and assume the running service is doing what
          | the code says, unless that's embedded in your build. Processes
          | like automated updates and a forced update cadence (no running
          | binaries over X days old), with proper canary/vetting before a
          | full release, allow a large org to still manage this
          | complexity.
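          | 
          | A minimal sketch of that two-step rollout (the endpoint paths,
          | SERVICE_A_URL, and the USE_V2_PROFILE flag are all hypothetical
          | names): service B keeps calling the old endpoint until the flag
          | is flipped, so deploys of A and B are safe in either order.
          | 
          |     import os
          |     import urllib.request
          | 
          |     BASE = os.environ.get("SERVICE_A_URL", "http://service-a")
          | 
          |     def fetch_profile(user_id: int) -> bytes:
          |         # Step 1: deploy service A with the new /v2 endpoint.
          |         # Step 2: then flip USE_V2_PROFILE for service B.
          |         if os.environ.get("USE_V2_PROFILE") == "1":
          |             url = f"{BASE}/v2/profile/{user_id}"
          |         else:
          |             url = f"{BASE}/v1/profile/{user_id}"
          |         with urllib.request.urlopen(url) as resp:
          |             return resp.read()
          | 
          | Rolling back is just flipping the flag off again; the old
          | endpoint is only removed once no running binary calls it.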
        
         | [deleted]
        
         | dec0dedab0de wrote:
          | To me the whole point of having a monorepo is to avoid
          | versioning. But that doesn't mean you always need to deploy
          | everything: you can take the newest commit hash from each
          | project directory and only deploy if that has changed.
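          | 
          | A minimal sketch of that check (the project paths and state
          | file here are hypothetical): record the last-deployed commit
          | hash per directory, and redeploy a subproject only when the
          | newest commit touching it has changed.
          | 
          |     import json
          |     import subprocess
          | 
          |     # Hypothetical layout; use your own subproject dirs.
          |     PROJECTS = ["services/metaserver", "services/atlas"]
          |     STATE_FILE = "deployed_hashes.json"
          | 
          |     def last_commit(path):
          |         # Newest commit hash that touched this directory.
          |         out = subprocess.check_output(
          |             ["git", "log", "-1", "--format=%H", "--", path])
          |         return out.decode().strip()
          | 
          |     def projects_to_deploy():
          |         try:
          |             with open(STATE_FILE) as f:
          |                 deployed = json.load(f)
          |         except FileNotFoundError:
          |             deployed = {}
          |         return [p for p in PROJECTS
          |                 if last_commit(p) != deployed.get(p)]
          | 
          | After a successful deploy you would write the new hashes back
          | to the state file, so the next run only picks up directories
          | that actually changed.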
        
         | reidrac wrote:
          | Good question! I always assume it is the same version, mostly
          | because a common pattern is to use SCM tags to track versions,
          | and I haven't seen that work well on any monorepo.
          | 
          | If you have a specific version per subproject, how do you track
          | that in the repo? Different tag schemes for different
          | subprojects? I have used that in a small-ish monorepo and I
          | didn't especially like it.
        
           | derekperkins wrote:
            | We just use the git sha and it works great, no reason to
            | overcomplicate it
        
         | mumblemumble wrote:
         | When I'm doing it, I version subprojects independently using
         | Git tags, but that's primarily because we mix and match
          | versions in production, which creates a need to make version
          | numbering semantic.
         | 
         | If we were doing continuous delivery, too, I could see there
         | not being much value in messing with independent versioning,
         | semver, whatever. Just make today's date the universal version
         | numbering system for all modules and move along.
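          | 
          | For what it's worth, a minimal sketch of one way to track
          | per-subproject tags (the "service-a/v1.2.3" naming and the
          | latest_tag helper are assumptions, not anything from this
          | thread):
          | 
          |     import subprocess
          | 
          |     def latest_tag(subproject):
          |         # Tags look like "service-a/v1.2.3" in this
          |         # hypothetical scheme; sort newest-first by version.
          |         out = subprocess.check_output(
          |             ["git", "tag", "--list",
          |              "--sort=-version:refname",
          |              f"{subproject}/v*"])
          |         tags = out.decode().split()
          |         return tags[0] if tags else None
          | 
          | Each subproject can then cut releases on its own cadence, and
          | the monorepo-wide history stays untouched.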
        
       | JonAtkinson wrote:
       | I'd be interested to better understand the timeline around this
       | statement:
       | 
       | "Metaserver was stuck on a deprecated legacy framework that
       | unsurprisingly had poor performance and caused maintenance
       | headaches due to esoteric bugs. For example, the legacy framework
       | only supports HTTP/1.0 while modern libraries have moved to
       | HTTP/1.1 as the minimum version."
       | 
       | Dropbox has been around for a lot of years, and raised a lot of
       | cash; was it only recently that they could pay down this
       | technical debt? Were they really so busy in other areas that this
       | was allowed to fester?
        
         | muglug wrote:
         | There's normally some sort of budget for paying down technical
         | debt - presumably there was more pressing technical debt.
        
       ___________________________________________________________________
       (page generated 2021-03-16 23:01 UTC)