[HN Gopher] How I Made a Giant Mistake with Terraform (and How A...
___________________________________________________________________
How I Made a Giant Mistake with Terraform (and How Azure Made It
Worse)
Author : todsacerdoti
Score : 49 points
Date : 2021-07-08 19:51 UTC (3 hours ago)
(HTM) web link (www.craigstuntz.com)
(TXT) w3m dump (www.craigstuntz.com)
| zug_zug wrote:
| Where I last worked, all terraform changes went through PR,
| requiring approval, after having read the plan. It was using a
| system called atlantis. It was slow, but it prevented issues like
| this.
| Pokepokalypse wrote:
| This is the way.
|
| (we use gitlab-ci's built-in review process for approval).
|
| At the end of the day, approval's still a human job, and humans
| make mistakes. Right? :D
|
| Terraform is an incredibly powerful tool, and you can make some
| monumentally huge mistakes with it.
| arlk wrote:
| Same, not atlantis but we used Gitlab-CI and Jenkins steps for
| an approval whenever there's a change in production, while
| staging changes are auto-deployed. Terraform plan was written
| to the PRs using tfnotify[0]. Normal deployments typically took
| 1 minute and 20 seconds (for each environment, in parallel)
| which I would consider very reasonable considering that we
| deployed a medium size infrastructure with only 2 terraform
| layers, so there was room for optimization.
|
| [0]: https://github.com/mercari/tfnotify
| carrja99 wrote:
| Atlantis is great. If you've grown beyond 5 or so engineers you
| should have no excuse to be running terraform apply from
| laptops.
| throwaway290232 wrote:
| It is completely embarrassing how many engineers we have and
| still apply manually from laptops. Changes are slow and
| error-prone, we don't even have them hooked up to CI/CD. I
| think it still works because we have so many damn engineers
| and we don't actually need to change infrastructure multiple
| times a day.
|
| That said, Terraform breaks so often that if we did it all
| automated, we'd have a million more Git commits from trying
| to fix broken applies.
| corty wrote:
| One of the big mistakes I see here is that the testing
| environment A doesn't look and work the same as the prod
| environment B. The closer both are, the surer you can be that you
| can catch such mistakes by testing in A. The bigger the
| differences, the more problems can sneak in undetected.
|
| So while I think it is OK for the author to be a little humbled
| by his mistake, I would actually place the blame on the
| misdesigned environments. We should try to expect human mistakes
| and make them avoidable (by noticing them early in the test
| environment) or prevent them altogether (by automatically
| testing and only then allowing the change into production, but
| that is far harder to set up).
| bwship wrote:
| We had the same issue with DynamoDB: Terraform is all too
| happy to delete a table it deems acceptable to delete, which in
| 99 out of 100 cases it is not. It is a very aggressive IaC
| tool.
| andrew_ wrote:
| Instances like this are where CDK is superior to Terraform imho.
| There are many resources which are difficult (or impossible) to
| delete or modify once created with CDK, and for good reason.
| Secrets also fall into this category.
| marcinzm wrote:
| You could tag everything you care about with lifecycle rules to
| not delete but it'd be nicer if they were the default for
| certain resources.
| notwedtm wrote:
| It's a very _powerful_ tool. Just like any other (well
| designed) tool, it's only going to do what the operator tells
| it to do.
|
| Terraform will never delete something you don't tell it to.
| anonydsfsfs wrote:
| > Terraform will never delete something you don't tell it to.
|
| Not necessarily. With some providers (e.g. Azure), Terraform
| will fail to recognize automated behind-the-scenes changes
| and try to revert them, causing serious breakage. This is why
| the "ignore_changes" meta-argument exists. See
| https://itnext.io/how-and-when-to-ignore-lifecycle-
| changes-i...
| throwaway290232 wrote:
| Actually that is impossible to know. Terraform doesn't delete
| anything at all. It runs provider plugins, which issue API
| calls (based on various criteria, some of which are not known
| until apply time) and those API servers then do various
| things in the background.
|
| Sometimes you _need_ it to delete something, and it won't.
| And sometimes you need it to remove something, and something
| else in the background gets removed by the API service
| handling the first delete.
|
| Terraform is "powerful" in the sense that a 6'11 35 year old
| man with an IQ of 78 is "powerful".
| Arnavion wrote:
| Well, Azure prevented you from creating a RG but still let you
| delete it because that was the role that the client configured
| for you. AFAIK there's no built-in role that behaves like this,
| but it's probably a very easy mistake to make - grant permissions
| on resourcegroups/$name/* , forgetting that * includes delete.
|
| re: SQL Server backups, I assume this was VMs running SQL Server
| rather than Azure SQL DB (the managed one)? If it's the latter,
| then I think the backups will be retained even if the RG is
| deleted.
| comice wrote:
| "immediately raise a ticket with Azure Support, who were able to
| grab the resources from "somewhere" (I guess when you delete a
| resource in Azure, it's still on a disk somewhere, for a while),
| and we got our database and backups back"
|
| what the hell is Azure doing pretending to delete things?
|
| How long does Azure keep your data after you think you've deleted
| it?
|
| Is there a way to ensure data stored in Azure is actually
| destroyed when you ask for it to be destroyed?
| Arnavion wrote:
| A bunch of high-profile resources have temporary soft-deletes -
| resource groups, managed SQL DBs, KeyVaults, Storage Accounts,
| and even subscriptions themselves. For some of these the
| undelete option is given to the user, for others you have to
| call support.
| comice wrote:
| ah ok, phew! I guess the OP just didn't know about that -
| they seemed surprised it was recoverable!
| Arnavion wrote:
| Well it's not documented for resource groups, so their
| surprise is expected. The fact that it applies to resource
| groups is based on empirical evidence.
| ikiris wrote:
| Encrypting it and losing the key is generally effective.
| oneepic wrote:
| Tables in Azure Storage are an example of this -- the delete is
| performed asynchronously by a background thread some time later
| (just garbage collection). The table is not immediately
| deleted. I don't think you can force Azure to immediately
| delete it, but you could possibly raise a support ticket to ask
| them to delete it immediately. Have not tried this though.
|
| Part of the reason is performance, I would assume, since a
| large delete could be hard on the overall system, and could
| slow Azure down for you + other customers. Also because some
| customers accidentally delete critical resources sometimes. (Or
| forgot to copy down any important config options from the
| resource before deleting it. I have some experience with that
| mistake.)
| NicoJuicy wrote:
| It is destroyed. But I believe there is a window during which
| Azure support can recover it for you, until the deletion is
| final.
| macintux wrote:
| I can't speak to any specifics, but asynchronous behavior,
| including deletions, is very common in large distributed
| systems, and Azure I'd wager is no exception.
|
| Update: I'd forgotten that even filesystems often do a soft
| delete. https://lwn.net/Articles/462437/
| carrja99 wrote:
| There's a song in Hamilton that I'm always reminded of when
| terraform applies run:
|
| "Blow Us All Away"
| aequitas wrote:
| You can (and always should) decouple the plan and apply steps.
| For a non terminal approach this can be done by outputting the
| binary plan to a plan file (with -out=path) and reviewing the
| plan terminal output which will show all actions Terraform will
| perform in human readable format. Then have the plan file be used
| as input for the apply step. The apply won't perform any action
| that was not in the plan file (which matches the output that was
| reviewed), and if the state of the environment it would apply to
| has changed in the meantime, it will abort without causing
| undesired changes and you can restart the process.
|
| There is also the prevent_destroy[0] meta-argument for resources
| but afaik it has no effect when you remove the resource from your
| .tf files[1], so it would not have helped in this case.
|
| [0] https://www.terraform.io/docs/language/meta-
| arguments/lifecy...
|
| [1] https://github.com/hashicorp/terraform/issues/17599
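|
| A minimal sketch of automating part of that plan review,
| assuming the saved plan has been converted to JSON with
| `terraform show -json tfplan`. It tallies the actions listed
| under `resource_changes` so a pipeline can stop when anything is
| marked for deletion. The function name and the sample resource
| addresses are hypothetical; only the
| `resource_changes[].change.actions` layout is taken from
| Terraform's JSON plan output.

```python
import json
from collections import Counter


def tally_actions(plan_json):
    """Count create/update/delete actions in `terraform show -json` output."""
    plan = json.loads(plan_json)
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        for action in rc.get("change", {}).get("actions", []):
            # "no-op" and "read" do not modify infrastructure, so skip them.
            if action in ("create", "update", "delete"):
                counts[action] += 1
    return counts


# Hypothetical plan fragment: one replacement, one plain delete, one no-op.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "azurerm_storage_account.logs",
         "change": {"actions": ["delete", "create"]}},
        {"address": "azurerm_mssql_database.main",
         "change": {"actions": ["delete"]}},
        {"address": "azurerm_virtual_network.core",
         "change": {"actions": ["no-op"]}},
    ]
})

counts = tally_actions(SAMPLE_PLAN)
print(counts["create"], counts["update"], counts["delete"])  # prints "1 0 2"
```

| A replacement appears as both a delete and a create, so it
| counts toward both totals. A pipeline could fail the job when
| the delete count is non-zero unless someone has explicitly
| signed off.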
| operatorius wrote:
| This!
|
| In addition to that, things can be made even less error-prone.
| I've done this using a YAML pipeline in Azure DevOps. The plan
| task can be used to set an output variable which indicates if
| the generated plan contains any changes. That boolean value is
| used as a condition to trigger a manual verification task, which
| basically prevents apply from running if there are any changes
| that have not been reviewed first.
|
| As the OP mentions, the generated plan is itself an artifact
| that is used in a following apply task.
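|
| One way to derive that "plan contains changes" boolean outside
| of Azure DevOps is the `-detailed-exitcode` flag of
| `terraform plan`, which exits with 0 for an empty diff, 1 for an
| error, and 2 when there are changes. The sketch below is an
| illustration under that assumption, not the pipeline described
| above; `needs_review` and `run_plan` are hypothetical names.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
NO_CHANGES, ERROR, HAS_CHANGES = 0, 1, 2


def needs_review(exit_code):
    """Map a detailed exit code to 'must a human approve this plan?'."""
    if exit_code == NO_CHANGES:
        return False
    if exit_code == HAS_CHANGES:
        return True
    raise RuntimeError(f"terraform plan failed with exit code {exit_code}")


def run_plan(workdir, plan_path="tfplan"):
    """Run the plan step, save the plan file, and report whether it has changes."""
    result = subprocess.run(
        ["terraform", "plan", "-input=false", "-detailed-exitcode",
         f"-out={plan_path}"],
        cwd=workdir,
    )
    return needs_review(result.returncode)
```

| The apply step would then consume the saved plan file only after
| the manual verification has passed.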
| thedougd wrote:
| I'm not very familiar with Azure anymore, but AWS also offers
| deletion protection on RDS and other services. This forces a
| two-phase operation with Terraform. You must first apply
| configuration to remove the deletion protection, then in the
| second phase you may delete the RDS instance.
|
| Unfortunately, I've never been able to get much mileage from
| the Terraform prevent_destroy lifecycle option because it can't
| be set from a variable. Most of my configurations use a module
| and pass different variable values for each environment. I'd
| want the lifecycle flag in production, but maybe not dev.
| Arnavion wrote:
| Yes, Azure has a similar thing for resources where you can
| set a "CanNotDelete" lock on one. You have to delete the lock
| before you can delete the resource.
| Androider wrote:
| In my opinion, databases are not cattle, and don't need to be
| automatically created (and destroyed!) in your main Terraform
| plan.
|
| It's perfectly OK to have a completely separate Terraform project
| that just configures the DB initially (or even manually, I see
| lots of places running DBs that predate Terraform with immutable
| infrastructure for everything else), and applies minor non-
| destructive changes in the future. This way you get the benefits
| of IaC, but the DB plan doesn't participate with the rest of your
| infrastructure that IS ok to blow away and re-create at will.
|
| BTW, Amazon RDS backups work the exact same way: Destroy the
| database and the backups are also destroyed. Therefore, same
| region automated RDS backups are fine for day-to-day, but in a
| true "DB goes poof" disaster you should expect that you WILL lose
| them too! You need cross-region, or even better, cross-account DB
| replication or snapshots to survive this.
| camjohnson26 wrote:
| Doesn't terraform show you how many resources are affected and
| require you to approve that when running apply?
| knownjorbist wrote:
| It does, but according to the post it was run via Azure DevOps.
| I'm not familiar with ADO but it sounds like it might not have
| been as obvious as running locally. Alternatively, perhaps it
| would have only shown deleting the Resource Group, but
| internally to Azure (and unbeknownst to Terraform) this means
| deleting everything within the RG as well.
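|
| Because a plan only lists the resources Terraform itself tracks,
| a resource group deletion can look like a single harmless line.
| A rough sketch of a guard for that case, assuming the plan is
| available as parsed `terraform show -json` output:
| `CONTAINER_TYPES` is an assumed, hand-maintained list of
| resource types whose deletion takes their contents with them,
| and `container_deletes` is a hypothetical helper.

```python
# Assumption: resource types whose deletion cascades to everything inside.
CONTAINER_TYPES = {"azurerm_resource_group"}


def container_deletes(plan, container_types=CONTAINER_TYPES):
    """Return addresses of container-like resources the plan would delete."""
    hits = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if rc.get("type") in container_types and "delete" in actions:
            hits.append(rc["address"])
    return hits


# Hypothetical parsed plan: the resource group is deleted, the vnet is updated.
SAMPLE = {
    "resource_changes": [
        {"address": "azurerm_resource_group.prod",
         "type": "azurerm_resource_group",
         "change": {"actions": ["delete"]}},
        {"address": "azurerm_virtual_network.core",
         "type": "azurerm_virtual_network",
         "change": {"actions": ["update"]}},
    ]
}

print(container_deletes(SAMPLE))  # prints ['azurerm_resource_group.prod']
```

| A non-empty result is a reasonable trigger for a louder warning
| than the usual plan summary.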
| arlk wrote:
| .. unless you add the `-auto-approve` flag.
___________________________________________________________________
(page generated 2021-07-08 23:00 UTC)