[HN Gopher] How I Made a Giant Mistake with Terraform (and How A...
       ___________________________________________________________________
        
       How I Made a Giant Mistake with Terraform (and How Azure Made It
       Worse)
        
       Author : todsacerdoti
       Score  : 49 points
       Date   : 2021-07-08 19:51 UTC (3 hours ago)
        
 (HTM) web link (www.craigstuntz.com)
 (TXT) w3m dump (www.craigstuntz.com)
        
       | zug_zug wrote:
       | Where I last worked, all Terraform changes went through a PR
       | requiring approval after the plan had been read. We used a
       | system called Atlantis. It was slow, but it prevented issues like
       | this.
        
         | Pokepokalypse wrote:
         | This is the way.
         | 
         | (we use gitlab-ci's built-in review process for approval).
         | 
         | At the end of the day, approval's still a human job, and humans
         | make mistakes. Right? :D
         | 
         | Terraform is an incredibly powerful tool, and you can make some
         | monumentally huge mistakes with it.
        
         | arlk wrote:
         | Same, not atlantis but we used Gitlab-CI and Jenkins steps for
         | an approval whenever there's a change in production, while
         | staging changes are auto-deployed. Terraform plan was written
         | to the PRs using tfnotify[0]. Normal deployments typically took
         | 1 minute and 20 seconds (for each environment, in parallel)
         | which I would consider very reasonable considering that we
         | deployed a medium-sized infrastructure with only 2 Terraform
         | layers, so there was room for optimization.
         | 
         | [0]: https://github.com/mercari/tfnotify
        
         | carrja99 wrote:
         | Atlantis is great. If you've grown beyond 5 or so engineers you
         | should have no excuse to be running terraform apply from
         | laptops.
        
           | throwaway290232 wrote:
           | It is completely embarrassing how many engineers we have and
           | still apply manually from laptops. Changes are slow and
           | error-prone, we don't even have them hooked up to CI/CD. I
           | think it still works because we have so many damn engineers
           | and we don't actually need to change infrastructure multiple
           | times a day.
           | 
           | That said, Terraform breaks so often that if we did it all
           | automated, we'd have a million more Git commits from trying
           | to fix broken applies.
        
       | corty wrote:
       | One of the big mistakes I see here is that the testing
       | environment A doesn't look and work the same as the prod
       | environment B. The closer both are, the surer you can be that you
       | can catch such mistakes by testing in A. The bigger the
       | differences, the more problems can sneak in undetected.
       | 
       | So while I think it is OK for the author to be a little humbled
       | by his mistake, I would actually place the blame on the
       | misdesigned environments. We should try to expect human mistakes
       | and make them avoidable (by noticing them early in the test
       | environment) or prevent them altogether (by automatically
       | testing and only then allowing the change to production, but that
       | is far harder to set up).
        
       | bwship wrote:
       | We had the same issue with DynamoDB: Terraform is all too
       | happy to delete a table it deems acceptable to delete, which in
       | 99 out of 100 cases it is not. It is a very aggressive
       | IaC tool.
        
         | andrew_ wrote:
         | Instances like this are where CDK is superior to Terraform imho.
         | There are many resources which are difficult (or impossible) to
         | delete or modify once created with CDK, and for good reason.
         | Secrets also fall into this category.
        
         | marcinzm wrote:
         | You could tag everything you care about with lifecycle rules to
         | not delete but it'd be nicer if they were the default for
         | certain resources.
        
         | notwedtm wrote:
         | It's a very _powerful_ tool. Just like any other (well
         | designed) tool, it's only going to do what the operator tells
         | it to do.
         | 
         | Terraform will never delete something you don't tell it to.
        
           | anonydsfsfs wrote:
           | > Terraform will never delete something you don't tell it to.
           | 
           | Not necessarily. With some providers (e.g. Azure), Terraform
           | will fail to recognize automated behind-the-scenes changes
           | and try to revert them, causing serious breakage. This is why
           | the "ignore_changes" meta-argument exists. See
           | https://itnext.io/how-and-when-to-ignore-lifecycle-
           | changes-i...
        
           | throwaway290232 wrote:
           | Actually that is impossible to know. Terraform doesn't delete
           | anything at all. It runs provider plugins, which issue API
           | calls (based on various criteria, some of which are not known
           | until apply time) and those API servers then do various
           | things in the background.
           | 
           | Sometimes you _need_ it to delete something, and it won't.
           | And sometimes you need it to remove something, and something
           | else in the background gets removed by the API service
           | handling the first delete.
           | 
           | Terraform is "powerful" in the sense that a 6'11 35 year old
           | man with an IQ of 78 is "powerful".
        
       | Arnavion wrote:
       | Well, Azure prevented you from creating a RG but still let you
       | delete it because that was the role that the client configured
       | for you. AFAIK there's no built-in role that behaves like this,
       | but it's probably a very easy mistake to make - grant permissions
       | on resourcegroups/$name/*, forgetting that * includes delete.
       | 
       | re: SQL Server backups, I assume this was VMs running SQL Server
       | rather than Azure SQL DB (the managed one)? If it's the latter,
       | then I think the backups will be retained even if the RG is
       | deleted.
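       |
       | A minimal Python sketch of the wildcard pitfall described above,
       | assuming a hypothetical custom role whose Actions list ends in a
       | trailing * on the resource group operations. The matching is a
       | simplified stand-in for how Azure evaluates role definitions, and
       | the role contents and the is_allowed helper are illustrative,
       | not taken from the article.

```python
from fnmatch import fnmatchcase

def is_allowed(operation, actions, not_actions=()):
    """Return True if `operation` matches an Actions pattern and no NotActions pattern."""
    op = operation.lower()  # compare case-insensitively
    granted = any(fnmatchcase(op, p.lower()) for p in actions)
    excluded = any(fnmatchcase(op, p.lower()) for p in not_actions)
    return granted and not excluded

RG = "Microsoft.Resources/subscriptions/resourceGroups"

# Hypothetical role: the wildcard was meant to cover day-to-day management.
wildcard_role = [f"{RG}/*"]

# The wildcard covers delete just as it covers read and write.
print(is_allowed(f"{RG}/delete", wildcard_role))

# Carving delete back out with NotActions closes the gap.
print(is_allowed(f"{RG}/delete", wildcard_role, not_actions=[f"{RG}/delete"]))
```

       | The point is only that a pattern ending in * has no notion of
       | "everything except the dangerous one"; the exclusion has to be
       | spelled out.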
        
       | comice wrote:
       | "immediately raise a ticket with Azure Support, who were able to
       | grab the resources from "somewhere" (I guess when you delete a
       | resource in Azure, it's still on a disk somewhere, for a while),
       | and we got our database and backups back"
       | 
       | what the hell is Azure doing pretending to delete things?
       | 
       | How long does Azure keep your data after you think you've deleted
       | it?
       | 
       | Is there a way to ensure data stored in Azure is actually
       | destroyed when you ask for it to be destroyed?
        
         | Arnavion wrote:
         | A bunch of high-profile resources have temporary soft-deletes -
         | resource groups, managed SQL DBs, KeyVaults, Storage Accounts,
         | and even subscriptions themselves. For some of these the
         | undelete option is given to the user, for others you have to
         | call support.
        
           | comice wrote:
           | ah ok, phew! I guess the OP just didn't know about that -
           | they seemed surprised it was recoverable!
        
             | Arnavion wrote:
             | Well it's not documented for resource groups, so their
             | surprise is expected. The fact that it applies to resource
             | groups is based on empirical evidence.
        
         | ikiris wrote:
         | Encrypt it and lose the key is generally effective.
        
         | oneepic wrote:
         | Tables in Azure Storage are an example of this -- the delete is
         | performed asynchronously by a background thread some time later
         | (just garbage collection). The table is not immediately
         | deleted. I don't think you can force Azure to immediately
         | delete it, but you could possibly raise a support ticket to ask
         | them to delete it immediately. Have not tried this though.
         | 
         | Part of the reason is performance, I would assume, since a
         | large delete could be hard on the overall system, and could
         | slow Azure down for you + other customers. Also because some
         | customers accidentally delete critical resources sometimes. (Or
         | forgot to copy down any important config options from the
         | resource before deleting it. I have some experience with that
         | mistake.)
        
         | NicoJuicy wrote:
         | It is destroyed. But I believe there is a window in which Azure
         | support can recover it for you before the deletion is final.
        
         | macintux wrote:
         | I can't speak to any specifics, but asynchronous behavior,
         | including deletions, is very common in large distributed
         | systems, and Azure I'd wager is no exception.
         | 
         | Update: I'd forgotten that even filesystems often do a soft
         | delete. https://lwn.net/Articles/462437/
        
       | carrja99 wrote:
       | There's a song in Hamilton that I'm always reminded of when
       | terraform applies run:
       | 
       | "Blow Us All Away"
        
       | aequitas wrote:
       | You can (and always should) decouple the plan and apply steps.
       | For a non-interactive approach, this can be done by writing the
       | binary plan to a plan file (with -out=path) and reviewing the
       | plan's terminal output, which shows all actions Terraform will
       | perform in human-readable format. Then use the plan file as
       | input for the apply step. The apply won't perform any action
       | that was not in the plan file (which matches the output that was
       | reviewed), and if the state of the environment it would apply to
       | has changed in the meantime, it will abort without causing
       | undesired changes, and you can restart the process.
       | 
       | There is also the prevent_destroy[0] meta-argument for resources
       | but afaik it has no effect when you remove the resource from your
       | .tf files[1], so it would not have helped in this case.
       | 
       | [0] https://www.terraform.io/docs/language/meta-
       | arguments/lifecy...
       | 
       | [1] https://github.com/hashicorp/terraform/issues/17599
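       |
       | One way to make the review step above harder to skim past is to
       | scan the saved plan for deletions before apply. This is a rough
       | Python sketch, assuming the JSON form of the plan (as printed by
       | `terraform show -json` on the plan file) with its
       | resource_changes / change.actions layout. The function name and
       | the sample plan are hypothetical.

```python
import json

def destructive_addresses(plan):
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement is reported as a delete paired with a create.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged

# Hypothetical, heavily trimmed plan JSON for illustration only.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "azurerm_resource_group.main",
     "change": {"actions": ["delete"]}},
    {"address": "azurerm_storage_account.logs",
     "change": {"actions": ["update"]}},
    {"address": "azurerm_sql_database.app",
     "change": {"actions": ["delete", "create"]}}
  ]
}
""")

flagged = destructive_addresses(sample_plan)
print(flagged)
```

       | Unlike prevent_destroy, a check like this still fires when the
       | resource block has been removed from the .tf files, because the
       | plan itself lists the delete. A CI job could fail, or require
       | extra sign-off, whenever the returned list is non-empty.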
        
         | operatorius wrote:
         | This!
         | 
         | In addition to that, things can be made even less error-prone.
         | I've done this using a YAML pipeline in Azure DevOps. The plan
         | task can be used to set an output variable which indicates if
         | the generated plan contains any changes. That boolean value is
         | used as a condition to trigger a manual verification task which
         | basically prevents apply running if there are any changes
         | without reviewing it first.
         | 
         | As the op mentions, the generated plan is an artifact itself
         | that is used in a following apply task
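         |
         | A sketch of that gating logic in Python, assuming the plan
         | step runs `terraform plan -detailed-exitcode` (exit code 0
         | for no changes, 1 for an error, 2 for changes present) and
         | that the pipeline reads Azure DevOps `task.setvariable`
         | logging commands from stdout. The variable name
         | planHasChanges and the helper names are hypothetical.

```python
import subprocess

def plan_has_changes(exit_code):
    """Map a `terraform plan -detailed-exitcode` result to a boolean."""
    if exit_code == 0:
        return False  # plan succeeded, nothing to apply
    if exit_code == 2:
        return True   # plan succeeded, changes are pending
    raise RuntimeError(f"terraform plan failed with exit code {exit_code}")

def set_output_variable(name, value):
    """Format an Azure DevOps logging command that sets an output variable."""
    return f"##vso[task.setvariable variable={name};isOutput=true]{str(value).lower()}"

def run_plan(plan_path="tfplan"):
    """Run the plan, save it to `plan_path`, and report whether it has changes."""
    result = subprocess.run(
        ["terraform", "plan", "-input=false", "-detailed-exitcode", f"-out={plan_path}"]
    )
    return plan_has_changes(result.returncode)

# In the pipeline step this would be: print(set_output_variable("planHasChanges", run_plan()))
# Here the exit code is hard-coded so the example runs without Terraform installed.
print(set_output_variable("planHasChanges", plan_has_changes(2)))
```

         | A later manual verification task can then be conditioned on
         | that variable, and the saved plan file is what gets passed to
         | the apply task.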
        
         | thedougd wrote:
         | I'm not very familiar with Azure anymore, but AWS also offers
         | deletion protection on RDS and other services. This forces a
         | two-phase operation with Terraform. You must first apply
         | configuration to remove the deletion protection, then in the
         | second phase you may delete the RDS instance.
         | 
         | Unfortunately, I've never been able to get much mileage from
         | the Terraform prevent_destroy lifecycle option because it can't
         | be set from a variable. Most of my configurations use a module
         | and pass different variable values per each environment. I'd
         | want the lifecycle flag in production, but maybe not dev.
        
           | Arnavion wrote:
           | Yes, Azure has a similar thing for resources where you can
           | set a "CanNotDelete" lock on one. You have to delete the lock
           | before you can delete the resource.
        
       | Androider wrote:
       | In my opinion, databases are not cattle, and don't need to be
       | automatically created (and destroyed!) in your main Terraform
       | plan.
       | 
       | It's perfectly OK to have a completely separate Terraform project
       | that just configures the DB initially (or even manually, I see
       | lots of places running DBs that predate Terraform with immutable
       | infrastructure for everything else), and applies minor non-
       | destructive changes in the future. This way you get the benefits
       | of IaC, but the DB plan doesn't participate with the rest of your
       | infrastructure that IS ok to blow away and re-create at will.
       | 
       | BTW, Amazon RDS backups work the exact same way: Destroy the
       | database and the backups are also destroyed. Therefore, same
       | region automated RDS backups are fine for day-to-day, but in a
       | true "DB goes poof" disaster you should expect that you WILL lose
       | them too! You need cross-region, or even better, cross-account DB
       | replication or snapshots to survive this.
        
       | camjohnson26 wrote:
       | Doesn't terraform show you how many resources are affected and
       | require you to approve that when running apply?
        
         | knownjorbist wrote:
         | It does, but according to the post it was run via Azure DevOps.
         | I'm not familiar with ADO but it sounds like it might not have
         | been as obvious as running locally. Alternatively, perhaps it
         | would have only shown deleting the Resource Group, but
         | internally to Azure (and unbeknownst to Terraform) this means
         | deleting everything within the RG as well.
        
         | arlk wrote:
         | .. unless you add the `-auto-approve` flag.
        
       ___________________________________________________________________
       (page generated 2021-07-08 23:00 UTC)