[HN Gopher] We don't use a staging environment
___________________________________________________________________
We don't use a staging environment
Author : Chris86
Score : 196 points
Date : 2022-04-03 18:28 UTC (4 hours ago)
(HTM) web link (squeaky.ai)
(TXT) w3m dump (squeaky.ai)
| donohoe wrote:
| It seems like an April 1st troll (based on publication date), but
| I am assuming it's not.
|
| I can only say that this is a fairly poor decision from someone
| who appears knowledgeable enough to know better.
|
| They could do everything they are doing as-is in terms of
| process, and just add a rudimentary test on a Staging environment
| as it passes to Production.
|
| Over a long enough timeline it will catch enough critical issues
| to justify itself.
| winrid wrote:
| This is how we work at fastcomments... soon we will have a shard
| on each major continent and will just deploy changes to one shard,
| run e2e tests, and then roll out to the rest of the shards.
|
| But if you have a high risk system or a business that values
| absolute quality over iteration speed, then yeah you want
| dev/staging envs...
| cortesoft wrote:
| This makes some sense for a single application environment. In
| our system, however, there are dozens of interacting systems, and
| we need an integration environment to ensure that new code works
| with all the other systems.
| jokethrowaway wrote:
| A previous client was paying roughly 50% of their AWS budget
| (more than a million per year) just to keep up development and
| staging.
|
| They ran roughly 3x machines for live, 2x for staging and 1x for
| development.
|
| Trying to get rid of it didn't work politically, because we had a
| cyclical contract with AWS where we were committing to spend X
| amount in exchange for discounts. Also, a healthy amount of ego
| and managers of managers BS.
|
| In terms of what that company was doing, I'm pretty sure I could
| have matched or exceeded their environment for 2k per month on
| Hetzner (using the server auction).
| MetaWhirledPeas wrote:
| I don't have experience with the true CI he describes, but I do
| have experience with pre-production environments.
|
| > "People mistakenly let process replace accountability"
|
| I find this to be mostly true. When the code goes somewhere else
| _before_ it goes to prod, much of the burden of responsibility
| goes along with it. _Other_ people find the bugs and spoon-feed
| them back to the developers. I'm sure as a developer this is
| nice, but as a process I hate it.
| otterley wrote:
| You can have both process and accountability. Process for the
| things that can be automated or subject to business rules;
| accountability for when the process fails (either by design or
| in its implementation) or after lapses in judgment.
| adamredwoods wrote:
| > "People mistakenly let process replace accountability"
|
| Who would do this? If a bug goes into production, the one
| responsible for the deployment is the one who rolls it back and
| fixes it. Even if it becomes a sev-3 later down the line,
| they're usually the one who gets looped back in thanks to Git
| commits.
|
| I would say that a pre-prod environment allows teams to
| incorporate a larger set of accountability, such as UX
| validation, dedicated QA, translation teams (think intl ecom),
| and even verifying third-party integrations in their pre-prod
| environments.
| otterley wrote:
| The short answer appears to be "we are cheap and nobody cares
| yet."
|
| It's easy to damn the torpedoes and deploy straight into
| production if there's nobody to care about, or your paying
| customers (to the extent you have any) don't care either.
|
| Once you start gaining paying customers who really care about
| your service being reliable, your tune changes pretty quickly. If
| your customers rely on data fidelity, they're going to get pretty
| steamed when your deployment irreversibly alters or irrevocably
| loses it.
|
| Also, "staging never looks like production" looks like a cost
| tradeoff that the author made, not a Fundamental Law of
| DevOps. If you want it to look like production, you can do the
| work and develop the discipline to make it so. The cloud makes
| this easier than ever, if you're willing to pay for it.
| mr337 wrote:
| Ooof, I think I have to agree with "we are cheap and nobody
| cares yet". If we had a bad release go out that blocked
| nightly processing, for example, it was amazing how fast it
| became a ticket and the CEOs started calling.
|
| One of the things that we did really well is we had tooling
| that spun up environments. The same tooling DevOps used to stand
| up production environments also stood up environments for PRs and
| UAT. Anyone within the company could spin up an environment for
| whatever reason, be it from master or to apply a PR. When it
| works it works great; if it doesn't work, fix it and don't throw
| out the entire concept.
| rileymat2 wrote:
| I think a lot of these process type articles would be well served
| by linking to some other post about team and project structure,
| size and scope.
| bombcar wrote:
| They have a staging environment - they just run production on it.
| jeffbee wrote:
| What I infer from the article is this company does not handle
| sensitive private data, or they do but are unaware of it, or they
| are aware of it and just handle it sloppily. I infer that because
| one of the biggest advantages of a pre-prod environment is you
| can let your devs play around in a quasi-production environment
| that gets real traffic, but no traffic from outside customers.
| This is helpful because when you take privacy seriously there is
| no way for devs to just look at the production database, or to
| gain interactive shells in prod, or to attach debuggers to
| production services without invoking glass-breaking emergency
| procedures. In the pre-prod environment they can do whatever they
| want.
|
| Most of the rest of the article is not about the disadvantages of
| pre-prod, but the drawbacks of the "git flow" branching model
| compared to "trunk based development". The latter is clearly
| superior and I agree with those parts of the article.
| krm01 wrote:
| This isn't very uncommon. In fact, it is exactly what the
| article claims it isn't: a staging/pre-live
| environment. Only instead of having it deployed online, you
| keep it local.
| cinbun8 wrote:
| This strategy won't scale beyond a very small team and codebase.
| The reasons mentioned, such as parity, are worth fixing.
| wahnfrieden wrote:
| lol what is continuous deployment
| myth2018 wrote:
| _I'm assuming this is not an April Fools' joke, and my comments
| are targeted at the discussion it sparked here anyway._
|
| A flat branching model simplifies things, and the strategy they
| describe surely enables them to ship features to production
| faster. But here are the risks I see:
|
| - who decides when a feature is ready to go to production? The
| programmer who developed it? The automated tests?
|
| - features toggleable by a flag must, at least ideally, be
| double-tested -- both when turned on and off. Being in a hurry to
| deploy to production wouldn't help with that;
|
| - OK, staging environments aren't in parity with production. But
| wouldn't they still be better than the CI/CD pipeline or a
| developer's laptop, which test new features in isolation?
|
| - Talking about features in isolation: what about bugs caused by
| spurious interaction between two or more features? No amount of
| testing will find them if features are only tested in isolation.
| nvader wrote:
| Published April 1st. Ooh, nice try.
| DevKoala wrote:
| > We only merge code that is ready to go live
|
| That's a cool April fool's squeaky.ai
| mhitza wrote:
| I also like to live dangerously.
| fmakunbound wrote:
| I'm working at megacorp at the moment as a contractor. The local
| dev, cloud dev, cloud stage, cloud prod pipeline is truly glacial
| in velocity even with automation like Jenkins, Kubernetes, etc.
| It takes weeks to move from dev complete to production. It's a
| middle manager's wet dream.
|
| I used to wonder why megacorp isn't being murdered by competitors
| delivering features faster, but actually, everyone is moving
| glacially for the same reason, so it doesn't matter.
|
| I'm kinda reminded of pg's essay on which competitors to worry
| about. I might be a worried competitor if these guys are pulling
| off merging to master as production.
| rock_hard wrote:
| This is pretty common actually
|
| At Facebook too there was no staging environment. Engineers had
| their dev VM and then after PR review things just went into prod.
|
| That said, features and bug fixes were oftentimes gated by
| feature flags and rolled out slowly to understand the
| product/perf impact better.
|
| This is how we do it at my current team too...for all the same
| reasons that OP states
| aprdm wrote:
| The book Software Engineering at Google or something akin to
| that mentions the same kind of thing.
| Rapzid wrote:
| Facebook can completely break the user experience for 4.3
| million different users each day and each user would only
| experience one breakage per year.
|
| This is pretty common, but most shops employing it don't have
| 1.6bn users and 10k engineers; essentially enough scale to
| throw bodies at problems.
| abhishekjha wrote:
| That would mean controlling a lot of feature flags, given how many
| can be switched on at once. How do you manage them?
| sillysaurusx wrote:
| flag = true
|
| More seriously, at my old company they just never got
| removed. So it wasn't really about control. You just forgot
| about the ones that didn't matter after a while.
|
| If that sounds horrible, that's probably the correct
| reaction. But it's also common.
|
| Namespacing helps too. It's easier to forget a bunch of flags
| when they all start with foofeature-.
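|
| Purely as an illustration of the namespacing point (the flag
| names, the dict-backed store, and the helpers are invented, not
| from any real flag system), a lookup might look roughly like:
|
|       # Hypothetical in-memory flag store; real systems back this
|       # with a config service rather than a dict.
|       FLAGS = {
|           "foofeature-new-checkout": True,
|           "foofeature-dark-mode": False,
|       }
|
|       def is_enabled(name, default=False):
|           # Unknown or forgotten flags fall back to a safe default.
|           return FLAGS.get(name, default)
|
|       def flags_in_namespace(prefix):
|           # Namespacing makes stale flags easy to list (and delete).
|           return sorted(k for k in FLAGS if k.startswith(prefix))
|
|       if is_enabled("foofeature-new-checkout"):
|           pass  # new code path goes here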
| withinboredom wrote:
| I've seen those old flags come in handy once. Someone
| accidentally deleted a production database (typo) and we
| needed to stop all writes to restore from a backup. For
| most of it, it was just turning off the original feature
| flag, even though the feature was several years old.
| skybrian wrote:
| It can become a code maintenance issue, though, when you
| revisit the code. You need to maintain both paths when you
| never know if they are being used.
|
| Also, where flags interact, you can get a combinatorial
| explosion of cases to consider.
| mdoms wrote:
| At a previous workplace we managed flags with Launch
| Darkly. We asked developers not to create flags in LD
| directly but used Jira webhooks to generate flags from any
| Jira issues of type Feature Flag. This issue type had a
| workflow that ensured you couldn't close off an epic
| without having rolled out and then removed every feature
| flag. Flags should not significantly outlast their 100%
| rollout.
| harunurhan wrote:
| > the ones that didn't matter after a while.
|
| Ideally you have metrics for all flags and their values, so
| you can easily tell if one becomes redundant and safe to
| remove entirely after a while.
|
| I've also seen teams make it a requirement to remove a flag N
| days after the feature is completely rolled out.
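|
| A rough sketch of that "remove after N days" rule (flag names,
| dates and the metadata dict are all invented; a real setup would
| pull this from flag metadata or metrics):
|
|       from datetime import datetime, timedelta
|
|       # When each flag reached 100% rollout; None means still ramping.
|       ROLLOUT_COMPLETED = {
|           "new-billing-page": datetime(2022, 3, 1),
|           "beta-search": None,
|       }
|       MAX_AGE = timedelta(days=14)
|
|       def stale_flags(now):
|           # Flags fully rolled out for more than N days should go.
|           return [name for name, done in ROLLOUT_COMPLETED.items()
|                   if done is not None and now - done > MAX_AGE]
|
|       print(stale_flags(datetime(2022, 4, 3)))  # ['new-billing-page']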
| clintonb wrote:
| I work at a different company. Typically feature flags are
| short-lived (on the order of days or weeks), and only control
| one feature. When I deploy, I only care about my one feature
| flag because that is the only thing gating the new
| functionality being deployed.
|
| There may be other feature flags, owned by other teams, but
| it's rare to have flags that cross team/service boundaries in
| a fashion that they need to be coordinated for rollout.
| rb2k_ wrote:
| It's 7 years old by now, but there's some literature:
|
| https://research.facebook.com/publications/holistic-
| configur...
|
| You can see that there's a common backend ("configerator")
| that a lot of other systems ("sitevars", "gatekeeper", ...)
| build on top of.
|
| Just imagine that these systems have been further developed
| over the last decade :)
|
| In general, there are 'configuration change at runtime' systems
| that the deployed code usually has access to and that can
| switch things on and off in a very short time (or slowly roll
| things out). Most of these are coupled with a variety of health
| checks.
| otterley wrote:
| Was this true for the systems that related to revenue and ad
| sales as well? While I can believe that a lot of code at
| Facebook goes into production without first going through a
| staging environment, I would be extremely surprised if the same
| were true for their ads systems or anything that dealt with
| payment flows.
| zdragnar wrote:
| I don't know about Facebook, but at other companies without a
| similar setup, each git branch gets deployed to its own subdomain,
| so manual testing etc. can happen prior to a merge. Dangerous
| changes are feature flagged or gated as much as possible to
| allow prod feedback after merge before enabling the changes
| for everyone.
| Gigachad wrote:
| This is how my current place does it. The only issue we are
| having is library / dependency updates have a tendency to work
| perfectly fine locally and then fail in production due to
| either some minor difference in environment or scale.
|
| It's a problem to the point that we have 5-year-old Ruby gems,
| with no listed breaking changes, that no one is brave enough
| to bump. I had a go at it and caused a major
| production incident because the datadog gem decided to kill
| Kubernetes with too many processes.
| kgeist wrote:
| >That said features and bug fixes were often times gated by
| feature flags
|
| Sorry for maybe a silly question, but how do feature flags work
| with migrations? If your migrations run automatically on
| deploy, then feature flags can't prevent badly tested
| migrations from corrupting the DB, locking tables and other
| sorts of regressions. If you run your migrations manually each
| time, then there's a chance that someone enables a feature
| toggle without running the required migrations, which can
| result in all sorts of downtime.
|
| Another concern I have is that if a feature toggle isn't
| enabled in production for a long time (for us, several days is
| already a long time due to a tight release schedule) new
| changes to the codebase by another team can conflict with the
| disabled feature and, since it's disabled, you probably won't
| know there's a problem until it's too late?
| drewcoo wrote:
| > how do feature flags work with migrations?
|
| The idea is to have migrations that are backward compatible
| so that the current version of your code can use the db and
| so can the new version. Part of the reason people started
| breaking up monoliths is that continuous deployment with a
| db-backed monolith can be brittle. And making it work well
| requires a whole bunch of brain power that could go into
| things like making the product better for customers.
|
| > another concern
|
| Avoiding "feature flag hell" is a valid concern. It has to be
| managed. The big problem with conflict is underlying tightly
| coupled code, though. That should be fixed. Note this is also
| solved by breaking up monoliths.
|
| > tight release schedule
|
| If a release in this sense is something product-led, then
| feature flags almost create an API boundary (a good thing!)
| between product and dev. Product can determine when their
| release (meaning set of feature flags to be flipped) is ready
| and ideally toggle themselves instead of roping devs into
| release management roles.
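|
| To make the backward-compatible idea concrete, here is a toy
| expand/contract sketch (Python + sqlite3, invented table and
| columns; not how any particular shop actually does it):
|
|       import sqlite3
|
|       # Toy schema for illustration only.
|       conn = sqlite3.connect(":memory:")
|       conn.execute("CREATE TABLE orders"
|                    " (id INTEGER PRIMARY KEY, price REAL)")
|       conn.execute("INSERT INTO orders (price) VALUES (19.99)")
|
|       # Expand: additive change only, so old code reading "price"
|       # keeps working while new code reads "price_cents".
|       conn.execute("ALTER TABLE orders ADD COLUMN price_cents INTEGER")
|       conn.execute("UPDATE orders SET price_cents ="
|                    " CAST(round(price * 100) AS INTEGER)")
|
|       # Contract: a later, separate deploy drops "price" once nothing
|       # reads it any more (left commented out here on purpose).
|       # conn.execute("ALTER TABLE orders DROP COLUMN price")
|
|       print(conn.execute("SELECT price_cents FROM orders").fetchall())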
| kgeist wrote:
| >The idea is to have migrations that are backward
| compatible so that the current version of your code can use
| the db and so can the new version
|
| Well, any migration has to be backward-compatible with the
| old code because old code is still running when a migration
| is taking place.
|
| As an example of what I'm talking about: a few months ago
| we had a migration that passed all code reviews and worked
| great in the dev environment but in production it would
| lead to timeouts in requests for the duration of the
| migration for large clients (our application is sharded per
| tenant) because the table was very large for some of them
| and the migration locked it. The staging environment helped
| us find the problem before hitting production because we
| routinely clone production data (deanonymized) of the
| largest tenants to find problems like this. It's not
| practical (and maybe not very legal either) to force every
| developer to have an up-to-date copy of that database on every
| VM/laptop, and load tests in an environment very similar to
| production show more meaningful results overall. And
| feature flags wouldn't help either because they only guard
| code. So far I'm unconvinced, it sounds pretty risky to me
| to go straight to prod.
|
| I agree however that the concern about conflicts between
| feature toggles is largely a monolith problem, it's a
| communication problem when many teams make changes to the
| same codebase and are unaware of what the other teams are
| doing.
| nicoburns wrote:
| > Well, any migration has to be backward-compatible with
| the old code because old code is still running when a
| migration is taking place.
|
| This is definitely best practice, but it's not strictly
| necessary if a small amount of downtime is acceptable. We
| only have customers in one timezone and minimal traffic
| overnight, so we have quite a lot of leeway with this.
| Frankly even during business hours small amounts of
| downtime (e.g. 5 minutes) would be well tolerated: it's a
| lot better than most of the other services they are used
| to using anyway.
| withinboredom wrote:
| > Well, any migration has to be backward-compatible with
| the old code because old code is still running when a
| migration is taking place.
|
| This doesn't have to be true. You can create an entirely
| separate table with the new data. New code knows how to
| join on this table, old code doesn't and thus ignores the
| new data. It doesn't work for every kind of migration,
| but in my experience, it's preferred by some DBAs if you
| have billions and billions of rows.
|
| Example: `select user_id, coalesce(new_col2, old_col2) as
| maybe_new_data, new_col3 as new_data from old_table left
| join new_table using (user_id) limit 1`
| cmeacham98 wrote:
| I think their question was more "if I wrote a migration
| that accidentally drops the users table, how does your
| system prevent that from running on production"? That's a
| pretty extreme case, but the tl;dr is: how are you testing
| migrations if you don't have a staging environment?
| laurent123456 wrote:
| I'd think they create "append-only" migrations, that can
| only add columns or tables. Otherwise it wouldn't be
| possible to have migrations that work with both old and
| new code.
| derefr wrote:
| > Otherwise it wouldn't be possible to have migrations
| that work with both old and new code.
|
| Sure you can. Say that you've changed the type of a
| column in an incompatible way. You can, within a
| migration that executes as an SQL transaction:
|
| 1. rename the original table "out of the way" of the old
| code
|
| 2. add a new column of the new type
|
| 3. run an "INSERT ... SELECT ..." to populate the new
| column from a transformation of existing data
|
| 4. drop the old column of the old type
|
| 5. rename the new column to the old column's name
|
| 6. define a view with the name of the original table,
| that just queries through to the new (original + renamed
| + modified) table for most of the original columns, but
| which continues to serve the no-longer-existing column
| with its previous value, by computing its old-type value
| from its new-type value (+ data in other columns, if
| necessary.)
|
| Then either make sure that the new code is reading
| directly from the new table; or create a trivial
| passthrough view for the new version to use as well.
|
| (IMHO, as long as you've got writable-view support, every
| application-visible "table" _should_ really just be a
| view, with its name suffixed with the ABI-major-
| compatibility-version of the application using it. Then
| the infrastructure team -- and more specifically, a DBA,
| if you've got one -- can do whatever they like with the
| underlying tables: refactoring them, partitioning them,
| moving them to other shards and forwarding them, etc. As
| long as all the views still work, and still produce the
| same query results, it doesn't matter what's underneath
| them.)
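|
| A stripped-down sketch of that "real table behind a compatibility
| view" idea (Python + sqlite3, invented schema; it swaps to a new
| table rather than renaming, and a writable view would need
| Postgres or INSTEAD OF triggers, which are omitted here):
|
|       import sqlite3
|
|       conn = sqlite3.connect(":memory:")
|       conn.executescript("""
|           CREATE TABLE events (id INTEGER PRIMARY KEY,
|                                duration_secs INTEGER);
|           INSERT INTO events (duration_secs) VALUES (90);
|
|           -- incompatible change: store milliseconds in a new table
|           CREATE TABLE events_v2 (id INTEGER PRIMARY KEY,
|                                   duration_ms INTEGER);
|           INSERT INTO events_v2 (id, duration_ms)
|               SELECT id, duration_secs * 1000 FROM events;
|           DROP TABLE events;
|
|           -- compatibility view under the original name: old code
|           -- still sees duration_secs, computed from the new column
|           CREATE VIEW events AS
|               SELECT id, duration_ms / 1000 AS duration_secs,
|                      duration_ms
|               FROM events_v2;
|       """)
|       print(conn.execute(
|           "SELECT duration_secs FROM events").fetchall())  # [(90,)]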
| freedomben wrote:
| I wrote a blog about this for anyone who would like to
| learn more.
|
| The query strings get you around the paywall if it comes
| up:
|
| https://freedomben.medium.com/the-rules-of-clean-and-
| mostly-...
|
| If anyone doesn't know what migrations are:
|
| https://freedomben.medium.com/what-are-database-
| migrations-5...
| ninth_ant wrote:
| That is largely the case.
|
| For other, more complex cases where that is not possible,
| you migrate a portion of the userbase to a new db schema
| and codepath at the same time.
| toast0 wrote:
| > Sorry for maybe a silly question, but how do feature flags
| work with migrations? If your migrations run automatically on
| deploy
|
| Basically they don't. Database migration based on frontend
| deploy doesn't really make sense at Facebook scale, because
| deploy is nowhere close to synchronous; even feature flag
| changes aren't synchronous. I didn't work on FB databases
| while I was employed by them, but when you've got a lot of
| frontends and a lot of sharded databases, you don't have much
| choice; if your schema is changing, you've got to have a
| multiphased push:
|
| a) push frontend that can deal with either schema
|
| b) migrate schema
|
| c) push frontend that uses new schema for new feature (with
| the understanding that the old frontend code will be running
| on some nodes) --- this part could be feature flagged
|
| d) data cleanup if necessary
|
| e) push code that can safely assume all frontends are new
| feature aware and all rows are new feature ready
|
| IMHO, this multiphase push is really needed regardless of
| scale, but if you're small, you can cross your fingers and
| hope. Or if you're willing to take downtime, you can bring
| down the service, make the database changes without
| concurrent access, and bring the service back with code
| assuming the changes; most people don't like downtime though.
| kgeist wrote:
| >Basically they don't. Database migration based on frontend
| deploy doesn't really make sense at Facebook scale, because
| deploy is nowhere close to synchronous; even feature flag
| changes aren't synchronous.
|
| Our deployments aren't strictly "synchronous" either. We
| have thousands of database shards which are all migrated
| one by one (with some degree of parallelism), and new code
| is deployed only after all the shards have migrated. So
| there's a large window (sometimes up to an hour) when some
| shards see the new schema and others see the old schema
| (while still running old code). It's one click of a button,
| however, and one logical release, we don't split it into
| separate releases (so I view them as "automatic"). The
| problem still stands, though, that you can only guard code
| with feature flags; migrations can't be conditionally
| disabled. With this setup, if a poorly tested migration
| goes awry, it's even more difficult to roll back, because it
| will take another hour to roll back all the shards.
| withinboredom wrote:
| We don't have a staging environment (for the backend) at
| work either. However, depending on the size of the tables
| in question, a migration might take days. Thus, we
| usually ask DBAs for a migration days/weeks before any
| code goes live. There's usually quite a bit of
| discussion, and sometimes suggestions for an entirely
| different table with a join and/or application-only (in
| code, multiple query) join.
| [deleted]
| funfunfunction wrote:
| Infra as code + good modern automation solves the parity issue. I
| empathize with wanting to stay lean but this seems extreme.
| shoo wrote:
| different business or organisational contexts have different
| deployment patterns and different negative impacts of failure.
|
| in some contexts, failures can be isolated to small numbers of
| users, the negative impacts of failures are low, and rollback is
| quick and easy. in this kind of environment, provided you have
| good observability & deployment, it might be more reasonable to
| eliminate staging and focus more on being able to run experiments
| safely and efficiently in production.
|
| in other contexts, the negative impacts of failure are very high.
| e.g. medical devices, mars landers, software governing large
| single systems (markets, industrial machinery). in these
| situations you might prefer to put more emphasis on QA before
| production.
| user3939382 wrote:
| > Pre-live environments are never at parity with production
|
| Then you fix that particular problem. Infrastructure as code is
| one idea just off the top of my head.
| raffraffraff wrote:
| Yup. If you have 4 production data centers, I imagine they're
| different sizes (autoscaling groups, Kubernetes deployment
| scale, perhaps even database instance sizes). So just build a
| staging environment that's like those, except smaller and not
| public. If you can't do that, then I'm willing to bet you can't
| deploy a new data center very quickly either, and your DR looks
| like ass.
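|
| For illustration only (plain Python, invented sizes and instance
| names), the "same shape, smaller scale" idea is basically one
| environment template with a scale knob, whatever IaC tool
| actually renders it:
|
|       from dataclasses import dataclass
|
|       @dataclass
|       class Environment:
|           name: str
|           node_count: int
|           db_instance: str
|           public: bool
|
|       def make_env(name, scale=1.0, public=True):
|           # Example values only; a real template would cover far more.
|           return Environment(
|               name=name,
|               node_count=max(1, int(12 * scale)),  # 12 nodes in prod
|               db_instance="db.r5.xlarge" if scale >= 1.0
|                           else "db.r5.large",
|               public=public,
|           )
|
|       prod = make_env("prod")
|       staging = make_env("staging", scale=0.25, public=False)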
| crummy wrote:
| Is it possible to make staging 100% identical with prod? Load
| is one thing I can think of that is difficult to make
| identical; even if you artificially generate it, user behaviour
| will likely be different.
| user3939382 wrote:
| I don't work on systems where that factor is critical to our
| tests, but if I was I would start here (at least in my case
| since we use AWS)
| https://docs.aws.amazon.com/solutions/latest/distributed-
| loa...
| shakezula wrote:
| Easier said than done, obviously. And even with docker images
| and Infra as Code and pinned builds and virtual environments,
| it is difficult to be absolutely sure about the last 1% of the
| environment, and it requires a ton of effort and engineering
| discipline to properly maintain.
|
| Reducing the number of environments the team has to maintain
| means by definition more time for each environment.
| hpen wrote:
| Well of course you can ship faster -- But that's not the point of
| a staging environment!
| okamiueru wrote:
| My experience with their list of suppositions:
|
| > Pre-live environments are never at parity with production
|
| My experience is that it is fairly trivial to have feature parity
| with production. Whatever you do for production, just do it again
| for staging. That's what it is meant to be.
|
| > Most companies are not prepared to pay for a staging
| environment identical to production
|
| Au contraire. All companies I've been to are more than willing to
| pay this. And secondly, it is pennies compared to production
| environment costs, because it isn't expected to handle any
| significant load. And, the article does mention being able to
| handle load as being one of the things that differ. I have not
| yet found the need to use changes to staging to verify load
| scaling capabilities.
|
| > There's always a queue
|
| I don't understand this paragraph at all. It seems like an artificial
| problem created by how they handle repository changes, and has
| little to do with the purpose of a staging environment. It smells
| fishy to have local changes rely on a staging environment. The
| infrastructure I set up had a development environment be spun up
| and used for a development testing pipeline. It doesn't, and
| shouldn't, need to rely on staging.
|
| > Releases are too large
|
| Well... one of the main benefits of having a staging environment
| is to safely do frequent small deployments. So this just seems
| like the exact wrong conclusion.
|
| > Poor ownership of changes
|
| This again, is not at all how I understand code should be shipped
| to a staging environment. "I've seen people merge, and then
| forget that their changes are on staging". What does this even
| mean? Surely, staging is only ever something that is deployed to
| from the latest release branch, which also surely comes from a
| main/master? The following "and now there are multiple sets of
| changes waiting to be released", also suggests some fundamental
| misunderstanding. *Releases* are what are meant to end up in
| staging. <Multiple sets of changes> should be *a* release.
|
| > People mistakenly let process replace accountability
|
| > "By utilising a pre-production environment, you're creating a
| situation where developers often merge code and "throw it over
| the fence"
|
| Again. Staging environment isn't a place where you dump your
| shit. "Staging" is a place where releases are verified in an as
| much-as-possible-the-same-environment-as-production. So, again.
| This seems like entirely missing the point.
|
| ----
|
| It seems to me that they don't use a staging environment, because
| they don't understand what such a thing should be used for.
| That's not to say there are not reasons to not have such an
| environment. But... none of the reasons listed make any sense
| from what I've experienced.
| hutrdvnj wrote:
| It's about risk acceptance. What could go wrong without a
| staging environment, seriously?
| shaneprrlt wrote:
| Does QA just pull and test against a dev instance? Do they test
| against prod? Do engineers get prod API keys if they have to test
| an integration with a 3rd party?
| fishtoaster wrote:
| This is a pretty weird article. Their "how we do it" section
| lists:
|
| - "We only merge code that is ready to go live"
|
| - "We have a flat branching strategy"
|
| - "High risk features are always feature flagged"
|
| - "Hands-on deployments" (which, from their description, seems to
| be just a weird way of saying "we have good monitoring and
| observability tooling")
|
| ...absolutely none of which conflict with or replace having a
| staging environment. Three of my last four gigs have had all four
| of those _and_ found value in a staging environment. In fact, they
| often _help_ make staging useful: having feature-flagged features
| and ready-to-merge code means that multiple people can validate
| their features on staging without stepping on each other's toes.
| nunez wrote:
| There's a difference between permanent staging environments
| that need maintenance and disposable "staging" environments
| that are literally a clone of what's on your laptop that you
| trash once UAT/smoke is done.
|
| The former costs money and can lie to you; the latter is
| literally prod, but smaller.
| DandyDev wrote:
| This makes it sound so easy, but in my experience, permanent
| staging environments exist because setting up disposable
| staging environments is too complex.
|
| How do you deal with setting up complex infrastructure for
| your disposable staging environment when your system is more
| complex than a monolithic backend, some frontend and a
| (small) database? If your system consists of multiple
| components with complex interactions, and you can only
| meaningfully test features if there is enough data in the
| staging database and it's _the right_ data, then setting up
| disposable staging environments is not that easy.
| EsotericAlgo wrote:
| Absolutely. The answer is better integration boundaries but
| then you're paying the abstraction cost which might be
| higher.
|
| It's particularly difficult when the system under test
| includes an application that isn't designed to be set up
| ephemerally such as application-level managed services with
| only ClickOps configuration, proprietary systems where such
| a request is atypical and prevented by egregious licensing
| costs, or those that contain a physical component (e.g. a
| POS with physical peripherals).
| vasco wrote:
| It's not too complex. There's plenty of products that make
| this easy, gitlab review apps being one of them.
| tharkun__ wrote:
| Sibling here but I can talk a bit about how we do it.
|
| Through infrastructure as code. We do not have a monolithic
| backend. We have a bunch of services, some smaller, some
| bigger. Yes there's "some frontend" but it's not just one
| frontend. We have multiple different "frontend services"
| serving different parts of it. As for database, we use
| multiple different database technologies, depending on the
| service. Some service uses only one of those, while others
| use a mix that is suited best to a particular use case. For
| one of those we use sharding and while a staging or dev
| environment doesn't _need_ the sharding, these obviously
| use the only shard we create in dev/staging, but the same
| mechanism for shard lookup is used. For data it depends.
| We have a data generator that can be loaded with different
| scenarios, either generator parameters or full-fledged "db
| backup style" definitions that you _can_ use but don't
| have to. We deploy to Prod multiple times per day
| (basically relatively shortly after something hits the main
| branch).
|
| Through the exact same means we could also re-create prod
| at any time and in fact DR exercises are held for that
| regularly.
| lolinder wrote:
| Yeah, it sounds to me like OP had the former, which they've
| dropped, and haven't yet found a need for the latter.
|
| I work for a tiny company that, when I joined, had a "pet"
| prod server and a "pet" staging server. The config between
| them varied in subtle but significant ways, since both had
| been running for 5 years.
|
| I helped make the transition the article described and it was
| _huge_ for our productivity. We went from releasing once a
| quarter to releasing multiple times a week. We used to plan
| on fixing bugs for weeks after a release, now they're rare.
|
| We've since added staging back as a disposable system, but I
| understand where the author is coming from. "Pet" staging
| servers are nightmarish.
| [deleted]
| tharkun__ wrote:
| FWIW I don't think it is weird at all. Maybe a little short on
| details of what ready really means for example. While I don't
| think going completely staging-less makes a lot of sense, going
| without a shared staging environment is a good thing.
|
| It is absolutely awesome to be able to have your own "staging"
| environment for testing that is independent of everyone else.
| With the Cloud this is absolutely possible. Shared staging
| environments are really bad. Things that should take a day at
| most turn into a coordination and waiting game of weeks. And as
| pressure mounts to get things tested and out you might have
| people trying to deploy parts that "won't affect the other
| tests" going on at the same time. And then they do and you have
| no idea if it's your changes or their changes that made the
| tests fail. And since it's been 2 weeks since the change was
| made and you finally got time on that environment your devs
| have already finished working on two or more other changes in
| the meantime.
|
| FWIW we have a similar set up where devs and QA can spin up a
| complete environment that is almost the exact same as prod and
| do so independently. They can turn on and off feature flags
| individually without affecting each other. Since we don't need
| to wait (except for the few minutes to deploy or a bit longer
| to create a new env from scratch) any bugs found can be fixed
| rather quickly as devs at most have _started_ working on
| another task. The environment can be torn down once finished
| but probably will just be reused until the end of the day.
|
| (while it's almost the same as prod it isn't completely like it
| for cost reasons meaning less nodes by default and such but
| honestly for most changes that is completely irrelevant and
| when it might be relevant it's easy to spin up more nodes
| temporarily through the exact same means as one would use to
| handle load spikes in prod).
| drewcoo wrote:
| Those bullets together explain how they can avoid having a
| staging environment.
|
| There's a whole section of the article entitled "What's wrong
| with staging environments?" that explains why they don't want
| staging.
|
| They even presented their "why" before going into their "how."
| There is absolutely nothing weird about this.
|
| Well, ok, it's weird that not all so-called "software
| engineers" follow this pattern of problem-solving. But that's
| not Squeaky's fault. They're showing us how to do it better.
| lupire wrote:
| The company does some analytics on highly redundant data
| (user behavior on a website). They run a system with low
| requirements for availability, correctness, and feature churn.
| Their product is nice to have but not mission-critical on a
| daily basis. If their entire system went down for a day, or
| even 3 days a week, their customers would be only mildly
| inconvenienced. They aren't Amazon or Google. So they test in
| prod.
| Negitivefrags wrote:
| If you are saying you don't have a staging environment, what you
| are really saying is that your company doesn't have any QA
| process.
|
| If your QA process is just developers testing their own shit on
| their local machine then you are not going to get as much value
| out of staging.
| aurbano wrote:
| I've seen this before at very large companies. All testing done
| locally and very little manual smoke testing in QA by either
| the PM or other engineers.
|
| There are big tech companies that don't have QA people.
| morelisp wrote:
| No, it says if you have a QA process it doesn't include a
| staging environment.
|
| A QA process is _just a process_ - it doesn't have necessary
| parts - as long as it's finding the right balance between cost,
| velocity, and risk for your needs, it's working. Some parts
| like CI are nearly universal now that they're so cheap; some
| like feature flags managed in a distributed control plane are
| expensive; some like staging deployments are somewhere in the
| middle.
| [deleted]
| capelio wrote:
| I've worked with multiple teams where QA tests in prod behind
| feature flags, canary deploys, etc. Staging environments and QA
| don't always go hand in hand.
| jokethrowaway wrote:
| That's absolutely not true.
|
| You can just compartmentalise important changes behind feature
| flags / service architecture and test things later.
| chrisseaton wrote:
| > If you are saying you don't have a staging environment, what
| you are really saying is that your company doesn't have any QA
| process.
|
| Come on - this is nonsense.
|
| Feature flags for example?
| tshaddox wrote:
| Feature flag systems don't magically prevent a new feature
| from causing a bug for other existing features, or even
| taking the whole site down.
| morelisp wrote:
| Speaking as the guy who pushed for and built our staging
| environments, neither do staging environments. (Speaking
| also as the guy who has taken the whole site down a few
| times.)
| jasonhansel wrote:
| But you don't need to have a _single_ staging env shared by all
| QA testers. Why not create individual QA environments on an as-
| needed basis for testing specific features? Of course this
| requires you to invest in making it easy to create new
| environments, but it allows QA teams to test different things
| without interfering with each other.
| paulryanrogers wrote:
| This worked reasonably well as v-hosts per engineer, though
| it did share some production resources. QA members would then
| run through test plans against those hosts to exercise the
| code. I prefer it to a single monolithic env. Though branches
| had to be kept up to date and bigger features tested as a
| whole.
| mkl95 wrote:
| > People mistakenly let process replace accountability
|
| > We only merge code that is ready to go live.
|
| This is one of the most off-putting things I have read on HN
| lately. Having worked on several large SaaS where leadership
| claimed similar stuff, I simply refuse to believe it.
| davewritescode wrote:
| It really depends on the product and what you work on. For the
| front end this makes a ton of sense, for backend systems I'm
| less confident that this is reality.
| bob1029 wrote:
| > Pre-live environments are never at parity with production
|
| As a B2B vendor, this is a conclusion we have been forced to
| reach across the board. We have since learned how to convince our
| customers to test in production.
|
| Testing in prod is usually really easy _if_ you are willing to
| have a conversation with the other non-technical humans in the
| business. Simple measures like a restricted prod test group are
| about 80% of the solution for us.
| rubyist5eva wrote:
| marvinblum wrote:
| I use a somewhat similar approach for Pirsch [0]. It's built so
| that I can run it locally, basically as a fully fledged staging
| environment. Databases run in Docker, everything else is started
| using modd [1]. This has proven to be a good setup for quick
| iterations and testing. I can quickly run all tests on my laptop
| (Go and TypeScript) and even import data from production to see
| if the statistics are correct for real data. Of course, there are
| some things that need to be mocked, like automated backups, but
| so far it turned out to work really well.
|
| You can find more on our blog [2] if you would like to know more.
|
| [0] https://pirsch.io
|
| [1] https://github.com/cortesi/modd
|
| [2] https://pirsch.io/blog/techstack/
| midrus wrote:
| Good monitoring, logs, metrics, feature flagging (allowing for
| opening a branch of code for a % of users), blue/green deployment
| (allowing a release to handle a % of the user's traffic) and good
| tooling for quick builds/releases/rollback, in my experience, are
| far better tools than intermediate staging environments.
|
| I've had great success in the past with a custom feature flags
| system + Google's App Engine % based traffic shifting, where you
| can send just a small % of traffic to a new service, and rollback
| to your previous version quickly without even needing to
| redeploy.
|
| Now, not having those tools as a minimum, and not having a
| staging environment either, is just reckless. No
| unit/integration/whatever tests are going to make me feel safe
| about a deploy.
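|
| A minimal sketch of the percentage-rollout part (the hashing
| scheme and names are my own illustration, not how App Engine or
| any particular flag vendor actually does it):
|
|       import hashlib
|
|       def bucket(user_id, salt="new-checkout"):
|           # Stable bucket per user so they always see the same variant.
|           digest = hashlib.sha256(
|               f"{salt}:{user_id}".encode()).hexdigest()
|           return int(digest, 16) % 100
|
|       def in_rollout(user_id, percent):
|           return bucket(user_id) < percent
|
|       # Ramp from 1% to 100% by changing `percent` at runtime; set it
|       # back to 0 to "roll back" without a redeploy.
|       print(in_rollout("user-42", percent=5))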
| midrus wrote:
| And yes, you need blue/green deployments in addition to feature
| flags, as it is not easy to feature flag certain things, such
| as a language runtime version update or a third party library
| upgrade, among many other things.
| kayodelycaon wrote:
| I don't see how this works when you have multiple external
| services you don't control in critical code paths that you can't
| fully test in CI.
|
| The cost of maintaining a staging environment is peanuts compared
| to 30 minutes of downtime or data corruption.
| kingcharles wrote:
| Some places don't even have dev. It's all on production.
|
| "Fuck it, we'll do it live!"
| parksy wrote:
| This sounds like something I would write if a hypothetical gun
| was pointed at my head in a company where the most prominent
| customer complaint was that time spent in QA and testing was too
| expensive.
|
| I have zero trust in any company that deploys directly from a
| developer's laptop to production, not least starting with
| how much you trust that developer. There has to be some
| process, right?
| drewcoo wrote:
| > company that deploys directly from a developer's laptop to
| production
|
| Luckily, there's no sign of doing that here. There's no mention
| of how their CI/CD works, probably because it's out of scope
| for an already long article, but that's clearly happening.
| parksy wrote:
| "We only have two environments: our laptops, and production.
| Once we merge into the main branch, it will be immediately
| deployed to production."
|
| Maybe my reading skills have completely vanished but to me,
| this exactly says they deploy directly from their developers'
| laptops to production. Those are literally the words used.
| The rest of the article goes on to defend not having a pre
| production environment.
|
| They literally detail how they deploy from their laptops to
| production with no other environments and make arguments for
| why that's a good thing.
| clintonb wrote:
| My assumption is the process is more like this:
|
| Laptop --> pull request + CI --> merge + CI + CD --> production
|
| I don't think folks are pushing code directly via Git or SFTP.
| rio517 wrote:
| I struggle with a lot of the arguments made here. I think one key
| thing is that staging can mean different things. In the author's
| case, they say "can't merge your code because someone else is
| testing code on staging." It is important to differentiate
| between this type of staging, used for testing development
| branches, vs a staging where only what's already merged for
| deployment is automatically deployed.
|
| Many of the problems are organizational/infrastructure
| challenges, not inherent to staging environments/setups.
| Straightening out dev processes and investing in the
| infrastructure solves most of the challenges discussed.
|
| Their points:
|
| What's wrong with staging environments?
|
| * "Pre-live environments are never at parity with production" -
| resolved with proper investment in infrastructure.
|
| * "There's always a queue [for staging]" - is staging the only
| place to test pre-production code? If you need a place to test
| code that isn't in master, consider investing in disposable
| staging environments or better infrastructure so your team has
| more confidence for what they merge.
|
| * "Releases are too large" - reduced queues reduces deployment
| times. Manage releases so they're smaller.
|
| * "Poor ownership of changes" Of course this happens with all
| that queued code. address earlier challenges and this will be
| massively mitigated. Once there, good mangers's job is to ensure
| this doesn't happen.
|
| * "People mistakenly let process replace accountability" - this
| is a management problem.
|
| Solving some of the above challenges with the right investments
| creates a virtuous cycle of improvements.
|
| How we ship changes at Squeaky?
|
| * "We only merge code that is ready to go live" - This is quite
| arbitrary. How do you define/ensure this?
|
| * "We have a flat branching strategy" - Great. It then surprises
| me that they have so much queued code and such large releases. I
| find it surprising they say, "We always roll forward." I wonder
| how this impacts their recovery time.
|
| * "High risk features are always feature flagged" - do low risk
| features never cause problems?
|
| * "Hands-on deployments" - I'm not sure this is good practice.
| How much focus does it take away from your team? Wouldn't a
| hands-off deployment be better: high confidence pre-deploy,
| automated deployment, automated monitoring and alerting, while
| ensuring the team is available to respond and recover quickly?
|
| * "Allows a subset of users to receive traffic from the new
| services while we validate" is fantastic. Surprised they don't
| break this into its own thing.
| drcongo wrote:
| I don't recognise any of those "problems" with staging.
| mattm wrote:
| An important piece of context missing from the article is the
| size of their team. LinkedIn shows 0 employees and their about
| page lists the two cofounders so I assume they have a team of 2.
| It's odd that the article talks about the problems with large
| codebases and multiple people working on a codebase when it
| doesn't look like they have those problems. With only 2 people,
| of course they can ship like that.
| briandilley wrote:
| > Pre-live environments are never at parity with production
|
| Same with your laptops... and this is only true if you make it
| that way. Using things like Docker containers eliminates some of
| the problem with this too.
|
| > There's always a queue
|
| This has never been a problem for any of the teams I've been on
| (teams as large as ~80 people). Almost never do they "not want
| your code on there too". Eventually it's all got to run together
| anyway.
|
| > Releases are too large
|
| This has nothing to do with how many environments you have, and
| everything to do with your release practices. We try to do a
| release per week at a minimum, but have done multiple releases in
| a single day as well.
|
| > Poor ownership of changes
|
| Code ownership is a bad practice anyway. It allows people to
| throw their hands up and claim they're not responsible for a
| given part of the system. A down system is everyone's problem.
|
| > People mistakenly let process replace accountability
|
| Again - nothing to do with your environments here, just bad
| development practices.
| lucasyvas wrote:
| > Code ownership is a bad practice anyway. It allows people to
| throw their hands up and claim they're not responsible for a
| given part of the system. A down system is everyone's problem.
|
| Agreed with a lot of what you said up until this - this is,
| frankly, just completely wrong. If nobody has any ownership
| over anything, nobody is compelled to fix anything - I've
| experienced this first-hand on multiple occasions.
|
| There have also been several studies done to refute your point
| - higher ownership correlates with higher quality. A
| particularly well-known one is from Microsoft, which had a
| follow up study later that attempted to refute the original
| findings but failed to do so. Granted, these were conducted
| from the perspective of code quality, but it is trivial to
| apply the findings to other scenarios that demand
| accountability.
|
| [1] https://www.microsoft.com/en-us/research/wp-
| content/uploads/...
|
| [2] https://www.microsoft.com/en-us/research/wp-
| content/uploads/...
|
| Whoever sold you on the idea that ownership of _any and all
| kinds_ is bad would likely rather you be a replaceable cog than
| someone of free thought. I don't know about you, but I take
| pride in the things I'm responsible for. Most people are that
| way. I also don't give two shits about anything that I don't
| own, because there's not enough time in the day for everyone to
| care about everything. This is why we have teams in the first
| place.
|
| There is a mile of difference between toxic and productive
| ownership - Gatekeepers are bad, custodians are good.
| debarshri wrote:
| We used to believe staging environments are not important enough.
| If you believe that then I would argue that you have not crossed
| a threshold as an org where your product is critical enough for
| your consumers. The staging environment, or any gate for that
| matter, just acts as a gating mechanism to not ship crappy stuff
| to customers. If you have too many gates, you will be shipping
| late, but with too few gates you end up shipping a low quality
| product.
|
| A staging environment saves you from unnecessary midnight alerts
| and from easy-to-catch issues that might have a huge impact when a
| customer has to face them. I wouldn't be surprised if in a few
| quarters or a year or so they publish an article about why they
| decided to introduce a staging environment.
| drewcoo wrote:
| This reminds me of the "bake time" arguments I've had. There's
| some magical idea that if software "bakes" in an environment
| for some unknowable amount of time, it will be done and ready
| to deploy. Very superstitious.
|
| What is the actual value gained from staging specifically? Once
| you have a list of those, a specific list, figure out why only
| staging could do that and not testing before or after. And
| "it's caught bugs before" is not good enough.
| tilolebo wrote:
| > And "it's caught bugs before" is not good enough.
|
| Why isn't it good enough?
| debarshri wrote:
| Firstly, there is no magical idea of software "baking" in an
| environment. It is about the risk appetite of the org: how
| willing is the org to push a feature that is "half-baked" to
| their customers.
|
| I believe modern day testing infrastructure looks very
| different. I have seen products like ReleaseHub that provide
| on-demand environments for devs to test their changes out,
| which eliminates the need for a common testing env. That
| naturally means you need at least one "pre-release"
| environment holding all the changes that will eventually
| become the next release. If you don't have this "pre-
| release" environment you will never be able to capture the
| side-effects of all the parallel changes that are happening
| to the codebase.
|
| Thirdly, you have to see the context. When you have a
| microservice architecture, having a staging environment does
| not matter, as fault tolerance, circuit breaking and other
| concepts make sure that a failed deployment of one service
| does not impact others. However, when you have a monolithic
| architecture you will never know what the side-effects of
| changes are unless you have a staging environment which would
| get promoted to production.
|
| If you value customers, you should have a staging environment
| as a guardrail. The cost of not adhering to or not having a
| process like this is huge and possibly company-ending.
| WYepQ4dNnG wrote:
| I don't see how this can scale beyond a single service.
|
| Complex systems are made of several services and infrastructure
| all interconnected. Things that are impossible to run on local.
| And even if you can run on local, the setup is most likely very
| different from production. The fact that things work on local
| give a little to zero guarantees that they will work in prod.
|
| If you have a fully automated infrastructure setup (e.g:
| terraform and friends), then it is not that hard to maintain a
| staging environment that is identical to production.
|
| Create a new feature branch from main, run unit tests and
| integration tests. Changes are automatically merged into the main
| branch.
|
| From there a release is cut and deployed to staging. Run tests in
| staging and, if all is good, promote the release to production.
| drewcoo wrote:
| > Complex systems are made of several services and
| infrastructure all interconnected.
|
| Then maybe it's a forcing function to drive decoupling that
| tangle of code. That's a good thing!
| [deleted]
| sergiotapia wrote:
| > We only merge code that is ready to go live
|
| > If we're not confident that changes are ready to be in
| production, then we don't merge them. This usually means we've
| written sufficient tests and have validated our changes in
| development.
|
| Yeah I don't trust even myself with this one. Your database
| migration can fuck up your data big time in ways you didn't even
| predict. Just use staging with a copy of prod.
| https://render.com/docs/pull-request-previews
|
| Sounds like OP could benefit from review apps, he's at the point
| where one staging environment for the entire tech org slows
| everybody down.
| KaiserPro wrote:
| > We only merge code that is ready to go live
|
| Cool story, but you don't _know_ if it's ready until after.
|
| Look, staging environments are not great, for the reasons
| described. But just killing staging and having done with it isn't
| the answer either. You need to _know_ when your service is fucked
| or not performing correctly.
|
| The only way that this kind of deployment is practical _at scale_
| is to have comprehensive end-to-end testing constantly running on
| prod. This was the only real way we could be sure that our
| service was fully working within acceptable parameters. We ran
| captured real life queries constantly in a random order, at a
| random time (caching can give you a false sense of security, go
| on, ask me how I know)
|
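| For what it's worth, a bare-bones version of that kind of prod
| probe loop might look like this (the endpoint and the captured
| queries are placeholders, and the metrics push is left as a
| comment):
|
|       import random, time, urllib.request
|
|       CAPTURED_QUERIES = ["/search?q=boots", "/cart", "/checkout"]
|       BASE_URL = "https://example.com"  # placeholder
|
|       def probe_once(path):
|           start = time.monotonic()
|           with urllib.request.urlopen(BASE_URL + path, timeout=5) as r:
|               r.read()
|           return time.monotonic() - start
|
|       while True:
|           path = random.choice(CAPTURED_QUERIES)
|           try:
|               latency = probe_once(path)
|               # push `latency` to the metrics system here
|           except Exception:
|               pass  # count this as an error metric instead
|           # random timing so caches don't hide problems
|           time.sleep(random.uniform(1, 30))
|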
| At no point is monitoring strategy discussed.
|
| Unless you know how your service is supposed to behave, and you
| can describe that state using metrics, your system isn't
| monitored. Logging is too shit, slow and expensive to get
| meaningful near realtime results. Some companies expend billions
| taming logs into metrics. Don't do that, make metrics first.
|
| > You'll reduce cost and complexity in your infrastructure
|
| I mean possibly, but you'll need to spend a lot more on making
| sure that your backups work. I have had a rule for a while that
| all instances must be younger than a month in prod. This means
| that you should be able to re-build _from scratch_ all instances
| _and datastores_. Instances are trivial to rebuild, databases
| should also be, but often aren't. If you're going to fuck around
| and find out in prod, then you need good, well-practised recovery
| procedures.
|
| > If we ever have an issue in production, we always roll forward.
|
| I mean that's cute and all, but not being able to back out means
| that you're fucked, you might not think you're fucked, but that's
| because you've not been fucked yet.
|
| It's like the old adage: there are two states of system admin,
| those who are about to have data loss and those who have had
| data loss.
| aprdm wrote:
| All good advice, but do you also have a rule where your DBs have
| to be less than a month old in prod? Doesn't look very
| practical if your DB has >100s of TBs.
| KaiserPro wrote:
| > Doesn't look very practical if your DB has >100s of TBs
|
| If that's in one shard, then you've got big issues. With
| larger DBs you need to be practising rolling replacement of
| replicas, because as you scale, the chance of one of your
| shards cocking up approaches 1.
|
| Again, it depends on your use case. RDS solves 95% of your
| problems (barring high scale and expense)
|
| If you're running your own DBs then you _must_ be replacing
| part or all of the cluster regularly to make sure that your
| backup mechanisms are working.
|
| For us, when we were using Cassandra (hint: don't), we used to
| spin up a "b cluster" for large-scale performance testing of
| prod. That allowed us to do one-touch deploys from hot
| snapshots. Eventually. This saved us from a drive-by malware
| infection, which caused our instances to OOM.
| aprdm wrote:
| I work in VFX and we have 1 primary and 1 replica for the
| render farm (MySQL), and another one for an asset system.
| They both have 100s of TBs, many cores, and a lot of RAM. We
| treat them a bit like unicorn machines (they're bare
| metal), which isn't ideal, but yeah.. our failover and
| whatnot is to make the primary the replica and vice versa.
|
| I cannot imagine reprovisioning it very often. When I
| worked in startups and used RDS and other managed DBs, it
| was easier to not have to think about it.
| [deleted]
| drexlspivey wrote:
| > Last updated: April 1, 2022
| epolanski wrote:
| > If we ever have an issue in production, we always roll forward.
|
| What does it mean to roll forward?
| joshmlewis wrote:
| I believe it means that when things break, they only push fixes
| forward rather than rolling back to a previously working version.
| chrisan wrote:
| I assume rather than rollback a botched deploy, they solve the
| bug and do another push?
| mdoms wrote:
| > "We only merge code that is ready to go live"
|
| I like to go even further: I advocate only merging code that
| won't break anything. If you're feature flagging as many changes
| as possible then you can merge code that doesn't even work, as
| long as you can gate users away from it using feature flags. The
| sooner and more often you can integrate unfinished code (safely)
| into master the better.
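|
| A minimal sketch of that gating pattern (the flag store and the
| checkout functions below are made-up stand-ins; a real system
| would use something like LaunchDarkly, Unleash, or a config
| table):
|
|   # Unfinished code can live on master as long as no user can
|   # reach it without the flag being turned on.
|   FLAGS = {"new_checkout_flow": False}  # off for everyone
|
|   def is_enabled(flag: str, user_id: str) -> bool:
|       # A real flag system would evaluate per-user targeting here.
|       return FLAGS.get(flag, False)
|
|   def legacy_checkout(user_id: str) -> str:
|       return f"legacy checkout for {user_id}"
|
|   def new_checkout(user_id: str) -> str:
|       return f"new checkout for {user_id}"  # merged but unfinished
|
|   def checkout(user_id: str) -> str:
|       if is_enabled("new_checkout_flow", user_id):
|           return new_checkout(user_id)
|       return legacy_checkout(user_id)  # what every user hits today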
| ohmanjjj wrote:
| I've been shipping software for over two decades, built multiple
| successful SaaS companies, and have never in my life written a
| single unit test.
| gabrieledarrigo wrote:
| I don't feel confident at all without unit tests on my code. Do
| you rely on some other types of testing?
| davewritescode wrote:
| You must be a fantastic coder because personally I can't write
| code without unit tests.
| lapser wrote:
| Disclaimer: I worked for a major feature flagging company, but
| these opinions are my own.
|
| This article makes a lot of valid points regarding staging
| environments, but their reasoning to not use them is dubious.
| None of their reasons are good enough to take staging
| environments out of the equation.
|
| I'd be willing to bet that the likelihood of nobody ever merging
| code that isn't ready to go live is close to zero. You still need
| validate the code. Their branching strategy is (in my opinion)
| the ideal branching strategy, but again, that isn't good enough
| to take staging away.
|
| Using feature flags is probably the only reason they give that
| comes close to being okay with getting rid of staging, but
| even then, you can't always be sure that the code you've built
| works as expected. So you still need a staging environment to
| validate some things.
|
| Having hands-on deployments should always be happening anyway.
| It's not a reason to not have a staging environment.
|
| If you truly want to get rid of your staging environment, the
| minimum that you need is feature flagging of _everything_, and I
| do mean everything. That is honestly near impossible. You also
| need live preview environments for each PR/branch. This somewhat
| eliminates the need for staging because reviewers can test the
| changes on a live environment. These two things still aren't a
| good enough reason to get rid of your staging environment. There
| are still many things that can go wrong.
|
| The reason we have layered deployment systems (CI, staging etc)
| is to increase confidence that your deployment will be good. You
| can never be 100% sure. But I'll bet you, removing a staging
| environment lowers that confidence further.
|
| Having said all of this, if it works for you, then great. But the
| reasons I've read on this post, don't feel good enough to me to
| get rid of any staging environments.
| midrus wrote:
| > If you truly want to get rid of your staging environment, the
| minimum that you need is feature flagging of _everything_, and I
| do mean everything. That is honestly near impossible. You also
| need live preview environments for each PR/branch. This somewhat
| eliminates the need for staging because reviewers can test the
| changes on a live environment. These two things still aren't a
| good enough reason to get rid of your staging environment. There
| are still many things that can go wrong.
|
| This can be done very easily with many modern PaaS services. I
| had this like 6 or 7 years ago with Google App Engine, and we
| didn't have a staging environment, as each branch would be
| deployed and tested as if it were its own environment.
| bradleyjg wrote:
| How do you feature flag a refactor?
| detaro wrote:
| You copy your service into refactored_service and feature-
| flag which of the two microservices the rest of the system
| uses /s
| lapser wrote:
| Right. Hence why I said:
|
| > That is honestly near impossible.
|
| Point is, a staging environment is there to increase the
| confidence that what you are deploying won't fail. Removing
| that is doable, but I wouldn't recommend it.
| blorenz wrote:
| We duplicate the production environment and sanitize all the data
| to be anonymous. We run our automated tests on this production-
| like data to smoke test. Our tests are driven by pytest and
| Playwright. God bless, I have to say how much I love Playwright.
| It just makes sense.
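|
| For a sense of what those smoke tests look like, here is a
| minimal sketch (the URL, labels, and credentials are invented; it
| assumes the pytest-playwright plugin, which supplies the `page`
| fixture):
|
|   # test_smoke.py
|   from playwright.sync_api import Page, expect
|
|   BASE_URL = "https://preprod.example.com"  # sanitized copy of prod
|
|   def test_login_page_renders(page: Page) -> None:
|       page.goto(f"{BASE_URL}/login")
|       heading = page.get_by_role("heading", name="Sign in")
|       expect(heading).to_be_visible()
|
|   def test_login_reaches_dashboard(page: Page) -> None:
|       page.goto(f"{BASE_URL}/login")
|       page.get_by_label("Email").fill("smoke-test@example.com")
|       page.get_by_label("Password").fill("not-a-real-password")
|       page.get_by_role("button", name="Log in").click()
|       expect(page).to_have_url(f"{BASE_URL}/dashboard")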
| pigcat wrote:
| This is my first time hearing about Playwright. Curious to know
| what you like about it over other frameworks? I didn't glean a
| whole lot from the website.
| Gigachad wrote:
| How big is your production dataset? Are you duplicating this
| for each deploy? Asking this because I work on a medium size
| app with only about 80k users and the production data is
| already in the tens of terabytes.
| kuon wrote:
| How do you do QA? I mean, staging in our case is accessible by a
| lot of non-technical people who test things automated tests
| cannot test (did I say test?).
| richardfey wrote:
| Let's talk again about this after the next postmortem?
| klabb3 wrote:
| Not endorsing this point blank but.. One positive side effect of
| this is that it becomes much easier to rally folks into improving
| the fidelity of the dev environment, which has a compound
| positive impact on productivity (and the mental health of your
| engineers).
|
| In my experience at Big Tech Corp, dev environments were reduced
| to low unit-test fidelity over the years, and as a result you
| need to _iterate_ (i.e. develop) in a staging environment that is
| orders of magnitude slower (and more expensive if you're paying
| for it). It isn't unusual for waiting on integration tests to be
| the majority of your day.
|
| Now, you might say that it's too complex so there's no other way,
| and yes sometimes that's the case, but there's nuance! Engineers
| have no incentive to fix dev if staging/integration works at all
| (even if super slow) so it's impossible to tell. If you think
| slow is a mild annoyance, I will tell you that I had senior
| engineers on my team that committed around 2-3 (often small) PRs
| per month.
| sedatk wrote:
| They're not mutually exclusive. You can achieve local + staging
| environments at the same time. Stable local env + staging.
| Local is almost always the most comfortable option due to fast
| iteration times, so nobody would bother with staging by
| default. Make it good, and people will come.
| devmunchies wrote:
| One approach I'm experimenting with is that all services
| communicate via a message channel (e.g. NATS or Pub/Sub).
|
| By doing this, I can run a service locally but connect it to the
| production pubsub server and then see how it affects the system
| if I publish events to it locally.
|
| I could also subscribe to events and see real production events
| hitting my local machine.
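|
| A rough sketch of that setup with the nats-py client (the server
| URL and subject names are made up; pointing a local process at
| the production broker is exactly the part to think hard about):
|
|   import asyncio
|   import nats  # nats-py client
|
|   async def main() -> None:
|       nc = await nats.connect("nats://prod-nats.example.com:4222")
|
|       async def on_event(msg) -> None:
|           # Real production events arriving at the local machine.
|           print(f"{msg.subject}: {msg.data.decode()}")
|
|       await nc.subscribe("orders.created", cb=on_event)
|
|       # Publish a locally generated event and watch how the rest
|       # of the system reacts to it.
|       await nc.publish("orders.created", b'{"id": "local-test-1"}')
|       await asyncio.sleep(30)
|       await nc.drain()
|
|   asyncio.run(main())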
| nickelpro wrote:
| This article has some very weird trade-offs.
|
| They can't spin up test environments quickly, so they have
| windows when they cannot merge code due to release timing. They
| can't maintain parity of their staging environments with prod, so
| they forswear staging environments. These seem like
| infrastructure problems that are separate from the problem a
| staging environment is meant to solve.
|
| They're not arguing that testing or staging environments are bad,
| they're just saying their organization couldn't manage to get
| them working. If they didn't hit those roadblocks in managing
| their staging environments, presumably they would be using them.
| _3u10 wrote:
| Having staging always encourages this. It's really difficult to
| replicate prod in any non-trivial way that exceeds what can be
| created on a workstation.
|
| Eg. Even if you buy the same hardware you can't replicate
| production load anyway because it's not being used by 5 million
| people concurrently. Your cache access patterns aren't the
| same, etc.
|
| It's far better to have a fast path to prod than a staging
| environment in my opinion.
| nickelpro wrote:
| Perhaps we have different ideas about what a staging
| environment is for. I wouldn't expect a staging environment
| to give accurate performance numbers for a change, the only
| solution to that is instrumenting the production environment.
| saq7 wrote:
| I think it's too much to expect staging to match the load and
| access patterns of your prod system.
|
| I find staging to be very useful. In various teams I have
| been a part of, I have seen the following productive use
| cases for staging
|
| 1. Extended development environment - If you use a micro-
| services or serverless architecture, it becomes really useful
| to do end-to-end tests of your code on staging. Docker helps
| locally, but unless you have a $4,000 laptop, the dev
| experience becomes very poor.
|
| 2. User acceptance testing - Generally performed by QAs, PMs
| or some other businessy folks. This becomes very important
| for teams that serve a small number of customers who write big
| checks.
|
| 3. Legacy enterprise teams - Very large corporations in which
| software does not drive revenue directly, but high quality
| software drives a competitive advantage. Insurance companies
| are an example. These folks have a much lower tolerance for
| shipping software that doesn't work exactly right for
| customers.
| toast0 wrote:
| > I think it's too much to expect staging to match the load
| and access patterns of your prod system.
|
| For a lot of things, this makes staging useless, or worse.
| When production falls over, but it worked in staging, then
| staging gave unwarranted confidence. When you push to
| production without staging, you know there's danger.
|
| That said, for changes that don't affect stability (which
| can sometimes be hard to tell), staging can be useful. And
| I don't disagree with a staging environment for your
| usecases.
| _3u10 wrote:
| Dev workstations should cost at least $4000.
|
| Think how much productivity is being wasted because their
| machines are slow.
|
| $4000 workstations are cheap compared to staging.
| freedomben wrote:
| When I worked for a big corp, the reason engineering was given
| for getting $1,000 laptops was that it wasn't fair to
| accounting, HR, etc. for us to have better machines. In the
| past, people from these departments had complained quite a
| bit.
|
| The official reason (which was BS) was "to simplify IT's
| job by only having to support one model"
| rileymat2 wrote:
| Depending on your tech, staging environments can be very
| expensive, with SQL Server Enterprise licenses at $13k for 2
| cores.
| https://www.microsoft.com/en-us/sql-server/sql-server-2019-p...
| colonwqbang wrote:
| You could call that an infrastructure problem. You have built
| an expensive infrastructure which you cannot afford to scale
| to the extent you desire.
| coder543 wrote:
| If you're choosing to pay large sums of money for SQL Server
| instead of the open source alternatives, you should also
| factor in the large sums of money to have good
| development/staging environments too.
|
| All the more reason to just use Postgres or MySQL.
|
| EDIT: as someone else hinted at, it does look like the free
| Developer version of SQL Server is fully featured and
| licensed for use in any non-prod environment, which seems
| reasonable.
| rileymat2 wrote:
| Sure, different planning 20 years ago would have made a big
| difference. Or the will/resources to transition. I am just
| saying that this scenario exists.
| bob1029 wrote:
| > Depending on your tech, staging environments can be very
| expensive
|
| For our business & customers, a new staging environment means
| another phone call to IBM and a ~6 month wait before someone
| even begins to talk about how much money it's gonna cost.
| jiggawatts wrote:
| Non-prod is free.
| rileymat2 wrote:
| The developer version, yes. But I have not seen the AWS
| AMIs for the developer version:
| https://aws.amazon.com/about-aws/whats-new/2021/10/amazon-
| ec...
|
| You can't install the enterprise non-prod for free. (But
| the developer version is supposed to have all the features)
| booi wrote:
| It's pretty easy to create your own AMIs with developer
| versions. It makes sense why AWS doesn't necessarily
| provide this out of the box. But it still stands that for
| fully managed versions of licensed software, you'll pay
| for the license even if it's non-prod.
| rileymat2 wrote:
| Yes, that's not to say it is not possible to create a
| similar env, but I thought the debate was about how
| precisely you are replicating your production env.
|
| Sure, it may be "good enough", but I thought the debate
| was about precision. How would your own AMI setup, built
| from the developer version, differ from the AWS-provided
| AMI? I don't know.
|
| Trying for an identical setup in staging is expensive;
| this is just a scenario I am familiar with. I am sure
| there are a lot like this.
| nickelpro wrote:
| I was thinking about this line from the article:
|
| > More often than not, each environment uses different
| hardware, configurations, and software versions.
|
| They can't even deploy the same _software versions_ to
| their staging environment. We're a long way off from
| talking about precisely replicating load characteristics.
| [deleted]
| repler wrote:
| Exactly. "Staging never matches Prod" - well why is that? Make
| it so!!
| drewcoo wrote:
| I have never ever even heard of a place where that was
| possible.
|
| The easiest way to make that scenario happen is to take
| whatever testing you'd have done in staging and do it in
| prod. Problem solved.
| anothernewdude wrote:
| > I have never ever even heard of a place where that was
| possible.
|
| You set the CI/CD pipeline to enforce that deploys happen
| to staging, and then happen to production. That's it. It's
| not hard.
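|
| One way to sketch that enforcement outside of any particular
| CI product (illustrative Python; the ledger file and the idea
| of keying on an artifact digest are assumptions):
|
|   # deploy_gate.py - refuse a prod deploy unless the exact same
|   # artifact digest has already been deployed to staging.
|   import json
|   import sys
|   from pathlib import Path
|
|   LEDGER = Path("deploy_ledger.json")  # written by earlier CI steps
|
|   def load() -> dict:
|       return json.loads(LEDGER.read_text()) if LEDGER.exists() else {}
|
|   def record(env: str, digest: str) -> None:
|       ledger = load()
|       ledger.setdefault(env, []).append(digest)
|       LEDGER.write_text(json.dumps(ledger))
|
|   if __name__ == "__main__":
|       env, digest = sys.argv[1], sys.argv[2]
|       if env == "production" and digest not in load().get("staging", []):
|           sys.exit(f"refusing prod deploy: {digest} never hit staging")
|       record(env, digest)
|       print(f"deploying {digest} to {env}")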
| karmasimida wrote:
| It is possible.
|
| But you need infrastructure and to pay close attention
| to this problem. It is hard to define exactly what
| replicating prod means. And sometimes it might be difficult,
| e.g. prod might have an access-controlled customer data store
| that has its own problems, or it may come down to cost. But
| that doesn't necessarily mean that if you can't replicate
| perfectly it is useless; you can still catch problems with
| things that you can replicate and that do go wrong.
|
| Ofc it is impossible to catch 100% of bugs with staging;
| however, that argument goes either way.
| saq7 wrote:
| I am curious: why do you think it's impossible?
|
| I think we can establish that the database is the biggest
| culprit in making this difficult.
|
| As an independent developer, I have seen several teams that
| either back sync the prod db into the staging db OR capture
| known edge cases through diligent use of fixtures.
|
| I am not trying to counter your point necessarily, but just
| trying to understand your POV. Very possible that, in my
| limited experience, I haven't come across all the problems
| around this domain.
| lamontcg wrote:
| The variety of requests and load in pre-prod never matches
| production, along with all the messiness and jitter you
| get from requests coming from across the planet and not
| just from your own LAN. And you'll probably never build
| it out to the same scale as production and have half your
| capex dedicated to it, so you'll miss issues which depend
| on your own internal scaling factors.
|
| There's a certain amount of "best practices" effort you
| can go through in order to make your preprod environments
| sufficiently prod like but scaled down, with real data in
| their databases, running all the correct services, you
| can have a load testing environment where you hit one
| front end with a replay of real load taken from prod
| logs to look for perf regressions, etc. But ultimately
| time is better spent using feature flags and one box
| tests in prod rather than going down the rabbit hole of
| trying to simulate packet-level network failures in your
| preprod environment to try to make it look as prodlike as
| possible (although if you're writing your own distributed
| database you should probably be doing that kind of fault
| injection, but then you probably work somewhere FAANG
| scale, or you've made a potentially fatal NIH/DIY
| mistake).
| sharken wrote:
| As if this wasn't enough of a headache, GDPR regulation
| requires more safeguards before you can put your prod-
| data in a secured staging environment.
|
| Then there is the database size, which can make it hard
| and expensive to keep preprod up to date.
|
| And should you want to measure performance, then no one
| else can use preprod while that is going on.
| nickelpro wrote:
| The article doesn't talk about any of that though. The
| article says staging diffs prod because of:
|
| > different hardware, configurations, and software
| versions
|
| The hardware might be hard or expensive to get an exact
| match for in staging (but also, your stack shouldn't be
| hyper fragile to hardware changes). The latter two are
| totally solvable problems.
| Gigachad wrote:
| With modern cloud computing and containerization, it
| feels like it has never been easier to get this right.
| Start up exactly the same container/config you use for
| production on the same cloud service. It should run
| acceptably similar to the real thing. Real problem is the
| lack of users/usage.
| darkwater wrote:
| IME, when you are not webscale, the issues you will miss
| from not testing in staging are bigger than the other way
| round. But that doesn't mean that all the extra effort
| the "test in prod only" scenario requires shouldn't also
| be put in when you do have a staging env.
| quickthrower2 wrote:
| All of their "problems" with staging are fixable bathwater that
| doesn't require baby ejection.
|
| I avoid staging for solo projects but it does feel a bit dirty.
|
| For team work or complex solo projects (such as anything
| commercial) I would never!
|
| On the cloud it is too easy to stage.
|
| To the point where I have torn down and recreated the staging
| environment to save a bit of money at times, because it is so
| easy to bring back.
|
| The article says to me they're not using modern DevOps practices.
|
| It is rare that a tech-practice "hot take" post is on the money,
| and this post follows the rule, not the exception.
|
| Have a staging environment!
|
| Just the work / thinking / tech-debt payoff of making one is
| worth it for other reasons, including streamlining your
| deployment processes, both human and in code.
| andersco wrote:
| Isn't the concept of a single staging environment becoming a bit
| dated? Every recent project I've worked on uses preview branches
| or deploy previews, eg what Netlify offers
| https://docs.netlify.com/site-deploys/deploy-previews/
|
| Or am I missing something?
| smokey_circles wrote:
| I imagine you missed the same thing I did: the last update
| time.
|
| April 1st, 2022
| replygirl wrote:
| no you're right, "staging" is gradually being replaced with
| per-commit "preview". but at enterprise scale when you have
| distributed services and data, and strict financial controls,
| and uncompromising compliance standards, it can often be
| unrealistic to transition to that until a new program group
| manager comes in with permission to blow everything up
| awill wrote:
| >>>If we're not confident that changes are ready to be in
| production, then we don't merge them. This usually means we've
| written sufficient tests and have validated our changes in
| development.
|
| This made me laugh.
| kafrofrite wrote:
| - I don't always test my code but when I do, it's in production.
|
| - Everyone has a testing environment. Some people are lucky
| enough that they have a separate one for running production
|
| [INSERT ADDITIONAL JOKES HERE]
| [deleted]
| productceo wrote:
| > We only merge code that is ready to go live.
|
| In their perception, is the rest of the tech industry gambling
| in every pull request that some untested code will work in
| production?
|
| I work at a large company. We extensively test code on local
| machines. Then dev test environments. Then small roll out to just
| a few data centers in prod bed. Run small scale online flight
| experiments. Then roll out to the rest of prod bed.
|
| And I've seen code fail in each of the stages, no matter how
| extensively we tested and robustly code ran in prior stages.
| Sebguer wrote:
| Yeah, it seems like someone took RFC 9225 to heart.
| (https://www.rfc-editor.org/rfc/rfc9225.html)
| kafrofrite wrote:
| I wouldn't be surprised. I've seen colleagues reference
| April's Fools RFCs, and the reference wasn't meant to be
| taken as a joke.
| [deleted]
| joshuamorton wrote:
| Generally speaking yes, I think that if you aren't hiding stuff
| behind feature flags you're gambling.
| drewcoo wrote:
| > I've seen code fail in each of the stages
|
| How many of the failures caught in dev would have been
| legitimate problems in production? How about the ones in
| staging?
|
| If your environments are that different are you even testing
| the right things?
|
| And if yes, if you need all of those, then why not add a couple
| more environments? Because more pre-prod environments means
| more bugs caught in those, right? /s
| smokey_circles wrote:
| I dunno if I'm getting older or if this is as silly as it seems.
|
| You don't like pre-live because it doesn't have parity with
| production, so you use a developer's laptop? What???
|
| I stopped reading at that point because that's pretty indicative
| of either a specific niche or a poorly thought out
| problem/solution set
| zimbatm wrote:
| If you can, provide on-demand environments for PRs. It's mostly
| helpful to test frontend changes, but also database migrations
| and just demoing changes to colleagues.
|
| If you have that, you will see people's behaviour change. We have
| a CTO who creates "demo" PRs with features they want to show to
| customers. All the contention around staging identified in the
| article is mostly gone.
| drewcoo wrote:
| You point out another kind of use of staging I've seen. "Don't
| touch staging until tomorrow after <some time> because SoAndSo
| is giving a demo to What'sTheirFace" so a bunch of engineering
| activity gets backed up.
| adamredwoods wrote:
| We use multiple staging lambdas specifically for demos and
| QA. CICD with terraform. Works great.
| shoo wrote:
| in enterprisey environments with large numbers of integrated
| services, it's even worse if a single staging environment is
| used to do end-to-end integration testing involving many
| systems. lots of resource contention for access to the
| staging environment.
| shoo wrote:
| it depends a bit on the system architecture.
|
| if you have a relatively self-contained system with few or zero
| external dependencies, so the system can be meaningfully tested
| in isolation, then i agree that standing up a ephemeral test
| environment can be a great idea. i've done this in the past to
| spin up SQL DBs using AWS RDS to ensure each heavyweight batch
| of integration tests that runs in CI gets its own DB isolated
| from any other concurrent CI runs. amusingly, this alarmed
| people in the org's platform team ("why are you creating so
| many databases?!") until we were able to explain our
| motivation.
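|
| a rough sketch of that kind of ephemeral-DB provisioning with
| boto3 (instance sizing, naming, and cleanup policy here are
| assumptions for illustration, not what we actually ran):
|
|   import os
|   import boto3
|
|   rds = boto3.client("rds", region_name="us-east-1")
|   db_id = f"integration-tests-{os.environ.get('CI_RUN_ID', 'local')}"
|
|   rds.create_db_instance(
|       DBInstanceIdentifier=db_id,
|       Engine="postgres",
|       DBInstanceClass="db.t3.micro",
|       AllocatedStorage=20,
|       MasterUsername="ci",
|       MasterUserPassword=os.environ["CI_DB_PASSWORD"],
|       Tags=[{"Key": "purpose", "Value": "ephemeral-ci"}],
|   )
|   rds.get_waiter("db_instance_available").wait(
|       DBInstanceIdentifier=db_id)
|
|   # ... run the heavyweight integration-test batch against db_id ...
|
|   rds.delete_db_instance(DBInstanceIdentifier=db_id,
|                          SkipFinalSnapshot=True)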
|
| in contrast, if the system your team works on has a lot of
| external integrations, and those integrations in turn have
| transitive dependencies throughout some twisty enterprise
| macroservice distributed monolith, then you might find yourself
| in a situation where you'd need to sort out on-demand
| provisioning of _many_ services maintained by other teams
| before you could do nontrivial integration testing.
|
| an inability to test a system meaningfully in isolation is
| likely a symptom of architectural problems, but it's good to
| understand the context where a given pattern may or may not be
| helpful.
| chrisshroba wrote:
| Just wondering, what does this phrase mean?
|
| > If we ever have an issue in production, we always roll forward.
| aeyes wrote:
| Instead of going back to a known good version, they release a
| hotfix to prod. This will probably backfire once they encounter
| a bug which is hard to fix.
| simonw wrote:
| Without a staging environment, how do you test that large scale
| database migrations work as intended?
|
| I wouldn't feel at all comfortable shipping changes like that
| which have only been tested on laptops.
| clintonb wrote:
| How do you define a large-scale database migration? If you're
| just updating data or schema, that can be done locally via an
| integration test. No need for a separate environment.
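|
| A minimal sketch of such a local migration test (sqlite3 stands
| in for the real database, which is obviously a simplification;
| the schema and migration are invented):
|
|   import sqlite3
|
|   MIGRATION = "ALTER TABLE users ADD COLUMN display_name TEXT"
|
|   def test_migration_preserves_existing_rows() -> None:
|       conn = sqlite3.connect(":memory:")
|       conn.execute(
|           "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
|       conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
|
|       conn.execute(MIGRATION)  # the change under test
|
|       rows = conn.execute(
|           "SELECT id, email, display_name FROM users").fetchall()
|       assert rows == [(1, "a@example.com", None)]  # nothing lost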
| jasonhansel wrote:
| This is good insofar as it forces you to make local development
| possible. In my experience: it's a big red flag if your systems
| are so complex or interdependent that it's impossible to run or
| test any of them locally.
|
| That leads to people _only_ testing in staging envs, causing
| staging to constantly break and discouraging automated tests that
| prevent regression bugs. It also leads to increasing complexity
| and interconnectedness over time, since people are never
| encouraged to get code running in isolation.
| bob1029 wrote:
| > In my experience: it's a big red flag if your systems are so
| complex or interdependent that it's impossible to run or test
| any of them locally
|
| At one time this was a huge blocker for our productivity.
| Access to a reliable test environment was only possible by way
| of a specific customer's production environment. The vendor
| does maintain a shared 3rd party integration test system, but
| it's so far away from a realistic customer configuration that
| any result from that environment is more distracting than
| helpful.
|
| In order to get this sort of thing out of the way, we wrote a
| simulator for the vendor's system which approximates behavior
| across 3-4 of our customers' live configurations. It's a totally
| fake piece of shit, but it's a consistent one. Our simulated
| environment testing will get us about 90% of the way there now.
| There are still things we simply have to test in customer prod
| though.
| tedmiston wrote:
| Ehh... once your systems use more than a few pieces of cloud
| infrastructure / SaaS / PaaS / external dependencies / etc,
| purely local development of the system is just not possible.
|
| There are some (limited) simulators / emulators / etc available
| and whatnot for some services, but running a full platform that
| has cloud dependencies on a local machine is often just not
| possible.
| [deleted]
| mgfist wrote:
| Spinning up services for local dev is still in the spirit. As
| long as it's something you can do in isolation from other
| devs/users, it serves the function.
| revicon wrote:
| Forcing developers to deal with mocks right from the
| beginning is critical in my opinion. Unit testing as part of
| your CI/CD flow needs to be a first priority rather than
| something that gets thought of later on. Testing locally
| should be synonymous with running your unit test suite.
|
| Doing your integration testing deployed to a non-production
| cloud environment is always necessary but should never be a
| requirement for doing development locally.
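|
| A small sketch of what "mocks from the beginning" looks like in
| a unit test (the payments client and its method are made-up
| names standing in for whatever external dependency you call):
|
|   from unittest.mock import Mock
|
|   def charge(payments, amount_cents: int) -> str:
|       """Unit under test: depends on an external payments client."""
|       resp = payments.create_charge(amount_cents=amount_cents)
|       return resp["id"]
|
|   def test_charge_uses_payments_client() -> None:
|       payments = Mock()
|       payments.create_charge.return_value = {"id": "ch_123"}
|
|       assert charge(payments, 500) == "ch_123"
|       payments.create_charge.assert_called_once_with(amount_cents=500)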
| jasonhansel wrote:
| The answer (IMHO) is to not use services that make it
| impossible to develop locally, unless you can trivially mock
| them; the benefits of such services aren't worth it if they
| result in a system that is inherently untestable with an
| environment that's inherently unreproducible.
|
| (I can go on a rant about AWS Lambda, and how if they'd used
| a standardized interface like FastCGI it would make local
| testing trivial, but they won't do that because they need
| vendor lock-in...)
| Gigachad wrote:
| Yeah, ideally you'd only use the ones which are just
| managed versions of software you can run locally. Stuff
| like managed databases and redis.
| jasonhansel wrote:
| Agreed. And stay away from proprietary cloud services
| that lock you into a specific cloud provider. Otherwise,
| you'll end up like one of those companies that still does
| everything on MS SQL Server and various Oracle byproducts
| despite rising costs because of decisions made many years
| ago.
| higeorge13 wrote:
| They mention the database as a factor for not having a staging
| env due to its different size, but they don't mention how they
| test schema migrations or any feature that touches the data,
| which usually produces multiple issues, or even data loss.
| NorwegianDude wrote:
| Staging, tests, previews and even running code locally is for
| people who make mistakes. It's dumb and a total waste of time if
| you don't make any mistakes.
|
| No testing at all, that's what I call optimizing for success!
|
| On a more serious note: Sometimes staging is the same as local,
| and in those situations there is very limited use for staging.
| jurschreuder wrote:
| We often deploy to production directly because a customer wants
| a feature right now. I was thinking of changing the staging
| server to be called beta. Customers can use new features
| directly, but at their own risk.
| hetspookjee wrote:
| I've seen that before, but then called "acceptance", with a
| select group.
| dexwiz wrote:
| Staging environments should be separate from production
| environments. If the Beta is expected to persist data in the
| long term, then it's not staging. Staging environments should
| be nukable. You don't want a messy Beta release to corrupt
| production data or to have customers trying to sue you if you
| reset staging.
|
| I don't know about your customer but wanting a feature
| yesterday may be a sign of some dysfunctional operating
| practices. Shortening your already short deployment pipeline
| shouldn't be your answer, unless it's currently part of the
| problem. Otherwise, this should be solved with setting better
| expectations.
| jurschreuder wrote:
| It's mostly front-end features that change a lot, so there
| is not much danger in running them on the prod api and db.
| Our api is very stable because it uses event streaming.
| Mostly the front-end is different for different customers.
| jurschreuder wrote:
| What I found with customers is that they really like it if
| they talk to you about a feature, and next week it's there,
| although it's a preview version of the feature. After that
| they forget about it a bit and you've got plenty of time to
| perfect it.
| shoo wrote:
| it's a good idea to be crystal clear about which environments
| are running production workloads. if you end up with "non-
| production" environments running production workloads then it
| becomes much easier to accidentally blow away customer data,
| let alone communicate coherently. "beta" is fine provided it
| is regarded as a production environment. you may still want a
| non-production staging environment!
|
| i worked somewhere that had fallen into this kind of mess,
| where 80% of the business' production workloads were done in
| the Production environment, and 20% of the business'
| production workloads (with slightly different requirements)
| were done in a non-production test environment. it took
| calendar years to dig out of that hole.
| [deleted]
| tezza wrote:
| This reads like a Pre-Mortem.
|
| When they lose all their most important customers' data because
| the feature flags got too confusing... they can take this same
| article and say: "BECAUSE WE xxxx that led to YYYY.
|
| In future we will use a Staging or UAT environment to mitigate
| against YYYY and avoid xxxx"
|
| Saving time on authoring a Post Mortem by pre-describing your
| folly seems like an odd way to spend precious dev time.
| mianos wrote:
| This probably also depends on your core business. If your product
| does not deal with real money, crypto, or other financial
| instruments and it is not serious if something goes wrong with a
| small number of people in production, this may work for you. It
| is probably cheaper and simpler. Lots of products are not like
| that. I built a bank and work on stock exchanges. Probably not a
| good idea to save money by not testing as people get quite
| annoyed when their money goes missing.
| sedatk wrote:
| Problem TL;DR:
|
| "With staging:
|
| - There could be differences from production
|
| - Multiple people can't test at the same time
|
| - Devs don't test their code."
|
| Solution TL;DR: "Test your code, and push to production."
|
| They completely misunderstood the problem and their solution
| literally changed nothing other than making devs test their code
| now. Staging could stay as is and would provide some significant
| risk mitigation with zero additional effort.
|
| "Whenever we deploy changes, we monitor the situation
| continuously until we are certain there are no issues."
|
| I'm sure customers would stay on the site, monitoring the
| situation too. Good luck with that strategy.
| kafrofrite wrote:
| Or they could maybe use a specific OS as their golden image, use
| Ansible or Chef or Puppet or any of the hundreds of tools that
| configure machines, and keep their staging and prod in sync.
| Bonus points for introducing a service that produces mock data
| for staging.
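|
| The mock-data part can be as small as this kind of sketch (the
| Faker library and the "users" schema here are assumptions, not
| anything from the post):
|
|   from faker import Faker
|   import sqlite3
|
|   fake = Faker()
|   conn = sqlite3.connect("staging_seed.db")
|   conn.execute(
|       "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, "
|       "name TEXT, email TEXT, signed_up TEXT)")
|   conn.executemany(
|       "INSERT INTO users (name, email, signed_up) VALUES (?, ?, ?)",
|       [(fake.name(), fake.email(), fake.iso8601()) for _ in range(1000)])
|   conn.commit()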
| sedatk wrote:
| Yeah, and failing to achieve 100% parity is definitely no reason
| to throw away the benefits of, say, 80% parity.
| issa wrote:
| I have a lot of questions, but one above all the others. How do
| you preview changes to non-technical stakeholders in the company?
| Do you make sales people and CEOs and everyone else boot up a
| local development environment?
| robbiemitchell wrote:
| Also my main thought. Among other things, we sometimes use UAT
| as the place for broad QA on UX behavior a member of eng or
| data might not think to test. For quickly developed features
| that don't go through a more formal design process, we'll also
| review copy and styling.
| drewcoo wrote:
| They already said they use feature flags. Those usually allow
| betas or demos for certain groups. Just have whomever owns the
| flag system add them to the right group.
| issa wrote:
| I guess that makes sense, but it means you would have rough
| versions of your feature sitting on production, hidden by
| flags. I could certainly be wrong about the potential for
| issues there, but it would definitely make me nervous.
| nunez wrote:
| This makes sense. With a high-enough release velocity to trunk, a
| super safe release pipeline with lots of automated checks, a
| well-tested rolling update/rollback process in production, and
| aggressive observability, it is totally possible to remove
| staging in many environments. This is one of the popular talking
| points touted by advocates of trunk-based development.
|
| (Note that you can do a lot of exploratory testing in disposable
| environments that get spun up during CI. Since the code in prod
| is the same as the code in main, there's no reason to keep them
| around. That's probably how they get around what's traditionally
| called UAT.)
|
| The problem for larger companies that tend to have lots of
| staging environments is that the risk of testing in production
| vastly exceeds the benefits gained from this approach. Between
| the learning curve required to make this happen, the investment
| required to get people off of dev, the significantly larger
| amounts of money at stake, and, in many cases, stockholder
| responsibilities, it is an uphill battle to get companies to this
| point.
|
| Also, many (MANY) development teams at BigCo's don't even "own"
| their code once it leaves staging.
|
| I've found it easier to employ a more grassroots approach to
| moving people towards laptop-to-production. Every dev wants to
| work like Squeaky does (many hate dev/staging environments for
| the reasons they've outlined); they just don't feel empowered to
| do so. Work with a single team that ships something important but
| won't blow up the company if they push a bad build into prod. Let
| them be advocates internally to promote (hopefully) pseudo-viral
| spread.
| pmoriarty wrote:
| This sounds horrible unless they have a super reliable way to
| roll back changes to a consistent working state, both in their
| deployments and their databases.
| js4ever wrote:
| Agreed, this sounds crazy. One argument raised is that staging
| is often different from prod, but their laptops are even more
| different. It seems the main goal was to save money. All this
| makes sense only for a very small team and code base.
| Msw242 wrote:
| How much do you save?
|
| We spend like 3-4k/yr tops on staging.
| nsb1 wrote:
| Or, on the flip side, how much do you lose by deploying an
| 'oops', resulting in customers having a bad experience and
| posting "This thing sux!" on social media?
|
| I can sympathize with the costs in both time and money to
| maintain a staging environment, but you're going to pay for
| those bugs somehow - either in staging or in customer
| satisfaction.
| lambda_dn wrote:
| You really need to use canary deployments/feature flags with
| this style, i.e. release to production but only for a group of
| users, or be able to turn a feature off without another
| deployment.
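|
| A minimal sketch of the "only for a group of users" part:
| deterministic per-user bucketing so the rollout percentage can
| be dialled up or down from config without another deployment
| (the names and numbers below are illustrative):
|
|   import hashlib
|
|   def in_rollout(user_id: str, feature: str, percent: int) -> bool:
|       digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
|       return int(digest[:8], 16) % 100 < percent
|
|   # "percent" would come from a config store; setting it to 0
|   # turns the feature off for everyone without a deploy.
|   assert in_rollout("user-42", "new_editor", 100) is True
|   assert in_rollout("user-42", "new_editor", 0) is False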
| bradleyjg wrote:
| Apparently they never roll back, only forwards. That was
| elsewhere in the article.
|
| Sounds like a miserable idea. If you make a mistake and take
| down production, you have to debug under extreme pressure to
| find a roll-forward solution.
| karmasimida wrote:
| It really depends.
|
| Without a staging environment, your chance of finding critical
| bugs relies on offline testing. Not all bugs can be found in unit
| tests; you need load tests to detect certain bugs that don't
| break your program from a correctness perspective but hit you on
| the latency/memory-leakage front. And such tests might take a
| longer time to run.
|
| Staging slows things down, but that is intended; it creates a
| buffer in which to observe behavior. Depending on the nature of
| your service, it can be quite critical.
| pigbearpig wrote:
| > "Last updated: April 1, 2022"
|
| April Fools joke? It is the only post on their blog. Or maybe
| they don't have any customers yet?
| anothernewdude wrote:
| If they're not at parity then you are doing CI/CD wrong and
| aren't forcing deploys to staging before production. If you set
| the pipelines up correctly then you *can't* get to production
| without being at parity with pre-production.
|
| > they don't want your changes to interfere with their
| validation.
|
| Almost like those are issues you want to catch. That's the whole
| point of continuous integration!
| coldcode wrote:
| At my previous job we had a single staging environment, which was
| used by dozens of teams to test independent releases as well as
| to test our public mobile app before release. That said, it never
| matched production, so releases were always a crapshoot, as
| things no one had ever tested suddenly happened. Yes, it was dumb.
| cosmiccatnap wrote:
| This is currently how my job works and it's hell.
| midrus wrote:
| See my other comment [1]; it might be hell because you're
| missing the right tooling. With the right tooling, it's
| actually heaven.
|
| [1]
| https://news.ycombinator.com/reply?id=30900066&goto=item%3Fi...
| teen wrote:
| Imagine writing this entire blog post and being completely wrong
| about every topic you discuss. This is the most amateur content
| I've seen make it to the front page, let alone top post.
| vvpan wrote:
| Well you are not making an argument at all. But if it works for
| them then it works for them. Perhaps the description is
| somewhat sparse.
___________________________________________________________________
(page generated 2022-04-03 23:00 UTC)