
Terraform Part 15 — Practical Patterns and Pitfalls

Terraform Series (15/15)
  1. Terraform Part 1 — What Is Terraform
  2. Terraform Part 2 — Installation and First Deploy
  3. Terraform Part 3 — HCL Syntax
  4. Terraform Part 4 — Variables and Outputs
  5. Terraform Part 5 — Providers
  6. Terraform Part 6 — Resources and Dependencies
  7. Terraform Part 7 — Data Sources and Import
  8. Terraform Part 8 — State Management
  9. Terraform Part 9 — Modules
  10. Terraform Part 10 — Loops and Conditionals
  11. Terraform Part 11 — Workspaces and Environment Separation
  12. Terraform Part 12 — Kubernetes and Helm Providers
  13. Terraform Part 13 — CI/CD Integration
  14. Terraform Part 14 — Testing and Policy
  15. Terraform Part 15 — Practical Patterns and Pitfalls

Knowing the Tool vs Using It Well

Over the course of this series we've covered most of Terraform's feature set: syntax, resources, modules, state, workspaces, CI/CD, testing. But knowing the features doesn't mean you can immediately run large infrastructure well.

What matters in practice is different: when to choose which structure, which mistakes to avoid, how to split things as scale grows. This part compiles the patterns and pitfalls that come up again and again in real operations.

Directory Structure Conventions

There’s no single right answer, but several frequently used patterns exist. Choose based on project scale and team structure.

Pattern 1: Flat structure (small scale)

infra/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars

Suited to a single environment with 20 or fewer resources, such as a simple prototype or side project. Easy to manage, but it doesn't scale.

Pattern 2: Environment separation (small to medium)

infra/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── rds/
└── envs/
    ├── dev/
    ├── staging/
    └── prod/

Modularization + environment separation. This is sufficient for most startup/mid-size teams in practice.
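
As a rough sketch, an environment root under envs/ can then wire the shared modules together. The module inputs and output names below are illustrative, not something fixed by this series:

# envs/prod/main.tf (hypothetical wiring)
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
}

module "eks" {
  source     = "../../modules/eks"
  # assumes the vpc module exposes a private_subnet_ids output
  subnet_ids = module.vpc.private_subnet_ids
}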

Pattern 3: Component separation (medium to large)

infra/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── rds/
└── envs/
    ├── dev/
    │   ├── network/      # VPC, subnets, etc.
    │   ├── cluster/      # EKS
    │   ├── database/     # RDS
    │   └── application/  # App deployment
    └── prod/
        ├── network/
        ├── cluster/
        ├── database/
        └── application/

Separate components even within environments. State is split per component, making management units smaller. The advantage is clear: changes to networking don’t affect database state. The downside is that cross-component references (output sharing) become cumbersome. This can be resolved with remote state data sources or Terragrunt’s dependency blocks.

# envs/prod/application/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "my-company-tfstate"
    key    = "envs/prod/network/terraform.tfstate"
    region = "ap-northeast-2"
  }
}

resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}

Pattern 4: Live/module separation (large scale)

terraform-modules/         # Separate repo
├── vpc/
├── eks/
└── rds/

infra-live/                # Separate repo
└── envs/
    ├── dev/
    └── prod/

Modules are completely separated into their own repo. Tag modules with SemVer, and reference them with version pins from infra-live. A natural structure when platform and service teams separate in large organizations.
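
A version pin from infra-live might look like the following; the repository URL and tag are hypothetical:

# infra-live/envs/prod/main.tf
module "vpc" {
  # Git source pinned to a SemVer tag of the separate modules repo
  source = "git::https://github.com/my-org/terraform-modules.git//vpc?ref=v1.4.2"
}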

flowchart TB
    Size["Team/infra scale"]

    Size -->|"Small"| Flat["Flat structure"]
    Size -->|"Small-medium"| Envs["Environment separation"]
    Size -->|"Medium-large"| Comp["Component separation"]
    Size -->|"Large"| Repo["Live/module repo separation"]

    Flat -.-> Flat2["Simple management,\nlimited scalability"]
    Envs -.-> Envs2["Per-environment independence,\nmixed components"]
    Comp -.-> Comp2["Granular state,\ncomplex references"]
    Repo -.-> Repo2["Clear team boundaries,\nhigher operational cost"]

Whichever pattern you start with, the most important thing is to modularize from the beginning so splitting later is easy.

Tagging Strategy

Tags may seem trivial, but they’re critical in operations. Cost analysis, resource search, policy enforcement, and ownership tracking all rely on tags.

Common tags should be applied consistently to all resources. A typical minimum set looks like this.

Tag          Meaning           Example
Environment  Environment       dev, staging, prod
Service      Service name      order-api, auth, platform
Owner        Owner / team      platform-team, alice@company.com
CostCenter   Cost center       engineering, marketing
ManagedBy    Management tool   terraform, manual
Repo         Defining repo     github.com/org/infra-repo

Setting default tags at the provider level auto-applies them to all resources.

provider "aws" {
  region = "ap-northeast-2"

  default_tags {
    tags = {
      Environment = var.environment
      Service     = var.service_name
      Owner       = "platform-team"
      ManagedBy   = "terraform"
      Repo        = "github.com/my-org/infra-repo"
    }
  }
}

default_tags is supported from AWS provider 3.38 onward. With this, you don’t need to write tags on every resource. If a resource needs additional tags, just add them in its tags block.
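
For example, per-resource tags are merged on top of the provider defaults (the bucket name and extra tag key here are hypothetical):

resource "aws_s3_bucket" "logs" {
  bucket = "my-service-logs"

  tags = {
    DataClass = "internal"  # merged with the provider's default_tags
  }
}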

Automate tag policy validation

Use OPA, as shown in the previous part, to fail the build when required tags are missing. This isn't something you can reliably check by hand every time.
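
As a minimal sketch of wiring that check into a pipeline, assuming the conftest policies from the previous part are in place:

terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json   # fails when the policy finds a missing required tag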

Common Mistake — State Loss

Losing state is one of the most horrifying incidents in Terraform. Without state, Terraform tries to create resources again. An existing RDS fails with a name collision, or a useless duplicate appears next to the original.

Causes

  1. Losing the laptop with local state
  2. Accidentally deleting the remote backend bucket
  3. Multiple people applying simultaneously, corrupting state
  4. Misusing terraform state rm and removing needed resources from state

Prevention

Manage and protect the state bucket itself with Terraform.

resource "aws_s3_bucket" "tfstate" {
  bucket = "my-company-tfstate"

  lifecycle {
    prevent_destroy = true   # Rejects even accidental destroy
  }
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_policy" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyDelete"
      Effect    = "Deny"
      Principal = "*"
      Action    = ["s3:DeleteBucket", "s3:DeleteBucketPolicy"]
      Resource  = [aws_s3_bucket.tfstate.arn]
    }]
  })
}

Recovery

What if you lose it anyway? There's only one way back: rebuild the state from scratch with import.

import {
  to = aws_vpc.main
  id = "vpc-0abc123def456"
}

Add import blocks one by one for each resource, check the diff with terraform plan, and adjust the code to match. Painful with dozens, nightmarish with hundreds. That’s why prevention is everything.

Common Mistake — Circular References

If module A uses module B’s output and module B uses module A’s output, a circular reference occurs. Terraform can’t compute the DAG and fails.

Error: Cycle detected in configuration

Typical case

# Bad example
module "sg_web" {
  source = "./modules/sg"
  ingress_from_sg = module.sg_app.security_group_id   # References app
}

module "sg_app" {
  source = "./modules/sg"
  ingress_from_sg = module.sg_web.security_group_id   # References web — circular!
}

Solution

For cases like security groups that need mutual references, extract the rules into separate resources.

module "sg_web" {
  source = "./modules/sg"
}

module "sg_app" {
  source = "./modules/sg"
}

# Mutual references without circularity
resource "aws_security_group_rule" "web_from_app" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = module.sg_web.security_group_id
  source_security_group_id = module.sg_app.security_group_id
}

resource "aws_security_group_rule" "app_from_web" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = module.sg_app.security_group_id
  source_security_group_id = module.sg_web.security_group_id
}

By separating rules into resources, the SGs themselves don’t need to know about each other. Terraform creates the SGs first, then connects the rules.

Common Mistake — Destroy Incidents

The most heart-stopping incident. Running terraform destroy on prod, or prod resources getting deleted during code refactoring.

Defenses

  1. Block terraform destroy from running in prod environments

    In CI, never execute the destroy command on the prod directory. If manual destroy is needed, require multi-step approval.

  2. prevent_destroy on critical resources

    resource "aws_db_instance" "prod_db" {
      # ...
      lifecycle {
        prevent_destroy = true
      }
    }

    If this resource becomes a destroy target, terraform plan immediately fails.

  3. Always use state mv when renaming resources

    Renaming resource "aws_instance" "old_name" to "new_name" makes Terraform try to delete the existing resource and recreate with the new name. Fatal for a production DB.

    terraform state mv aws_instance.old_name aws_instance.new_name

    Renaming only in state means the actual resource isn't touched; it's simply tracked under the new name. (A declarative alternative, the moved block, is sketched after this list.)

  4. Always review plan results

    If a plan contains any destroy operations, be suspicious. Unless the deletion is intentional, something has gone wrong.
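
For the rename case in item 3, Terraform 1.1 and later also lets you declare the move in code with a moved block, so the rename goes through review like any other change:

moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}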

flowchart LR
    Plan["terraform plan"] --> Check{"Any destroy\noperations?"}
    Check -->|"Yes"| Review["Double check"]
    Check -->|"No"| OK["apply"]
    Review --> Intended{"Intentional\ndestroy?"}
    Intended -->|"No"| Stop["Stop and investigate"]
    Intended -->|"Yes"| Confirm["Approve and apply"]

Large-Scale Project Splitting

When an infrastructure repo grows large, several problems emerge: plan and apply slow down as state grows, state locking serializes everyone's work, and any single mistake has the entire infrastructure as its blast radius.

The solution is state-level splitting: break one massive state into multiple smaller states.

Splitting criteria

  1. Change frequency — Separate rarely changing from frequently changing (VPC rarely changes, app deployments change often)
  2. Owning team — Separate by management owner
  3. Lifecycle — Group things that are created/destroyed together
  4. Blast radius — Minimize the scope of impact from a single mistake

Example:

envs/prod/
├── 01-network/          # VPC, subnets, routing (rarely changes)
├── 02-security/         # Common security groups, KMS keys
├── 03-cluster/          # EKS cluster (quarterly upgrades)
├── 04-database/         # RDS, ElastiCache (rarely changes)
├── 05-bootstrap/        # Cluster bootstrap (monthly changes)
└── 06-applications/     # App deployment (daily changes)

Numbering by dependency order makes creation/deletion order clear.

Split migration

If you need to split an already-massive state, you'll be moving resources between remote states by hand: pull both states locally and move entries between the files with terraform state mv, or remove them from the old state with terraform state rm and import them into the new one. (Terragrunt's run-all can help orchestrate the per-component commands.) The work is large and risky, so proceed incrementally.

  1. Create the new state directory’s backend first
  2. Define resources to move in HCL
  3. Move between states with terraform state mv -state-out=..., or rm entries from the old state and import them into the new one (see the sketch below)
  4. Verify no diff on both sides with plan

Always practice on dev/staging first, and secure full state backups before proceeding.
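
As a hedged sketch of step 3 using local copies of both states (the directory layout and resource address are hypothetical, and it's worth verifying the flags behave this way on a scratch project first):

# in the old component, after taking backups
cd envs/prod/01-network
terraform state pull > old.tfstate
terraform -chdir=../02-security state pull > new.tfstate

# move one resource between the local files
terraform state mv -state=old.tfstate -state-out=new.tfstate \
  aws_kms_key.main aws_kms_key.main

# push each file back to its backend, then run plan on both sides
terraform state push old.tfstate
terraform -chdir=../02-security state push new.tfstate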

Migration Strategy — Moving Existing Infrastructure to Terraform

For new projects, just start with Terraform from day one. But what if you need to move infrastructure to Terraform that's already running, built by hand in the console or with CloudFormation?

Approach 1: Incremental import

The safest method. Don’t import everything at once; start with small units.

Phase 1: Network (VPC, subnets) — Bottom layer, rarely changes
Phase 2: Data (RDS, S3) — Sensitive but rarely changes
Phase 3: Compute (EC2, ECS, EKS)
Phase 4: Application layer (app deployments, Helm charts)

At each phase, write HCL with import blocks and adjust code until plan shows zero diff.

import {
  to = aws_vpc.main
  id = "vpc-0abc123def456"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  # ... fill in while checking plan diffs
}

From Terraform 1.5 onward, you can use import blocks with the -generate-config-out flag to auto-generate an HCL draft.

terraform plan -generate-config-out=generated.tf

Refine the generated code and modularize.

Approach 2: Tools like Terraformer

Terraformer, made by Google, scans existing cloud resources and generates HCL and state simultaneously. It’s fast, but the generated code needs significant cleanup.

terraformer import aws --resources=vpc,subnet --regions=ap-northeast-2

Useful when you need to import a large number of resources at once.

Approach 3: Parallel operations

Leave existing infrastructure as-is and create only new resources with Terraform. Over time the old infrastructure reaches end of life, while everything new is built with Terraform. This takes the longest but carries the lowest risk.

Whichever approach you choose, thorough backups and dry-runs are essential. Dump state multiple times, run plan dozens of times, and only then actually apply.

Operational Essentials to Remember

Finally, here are a few principles that cut across everything covered so far.

1) Always review the plan

Reading plan results before apply is non-negotiable. “It’s probably fine” is the seed of incidents.

2) Apply small, apply often

A PR touching 50 resources at once is less safe than applying 5 at a time, ten times. Root cause analysis is easier on failure, and rollback is easier too.

3) State is your most valuable asset

Losing state makes recovery extremely difficult. Backend versioning, locking, and access control are non-negotiable.
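
A minimal sketch of that baseline for the S3 backend used earlier in this part (the bucket, key, and lock table names are hypothetical):

terraform {
  backend "s3" {
    bucket         = "my-company-tfstate"
    key            = "envs/prod/network/terraform.tfstate"
    region         = "ap-northeast-2"
    encrypt        = true                # server-side encryption of the state file
    dynamodb_table = "terraform-locks"   # DynamoDB-based state locking
  }
}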

4) Manual changes are forbidden

Changing things directly via the console creates drift and widens the gap between Terraform and reality. If you had no choice but to make an emergency fix by hand, reflect it in code immediately.

5) Modules are a public API

Think of a module’s inputs/outputs as a public API. Changing them carelessly affects all users. Manage versions with SemVer.

6) Policies as code

“Let’s be careful during PR review” doesn’t work. Automate tagging, security groups, encryption, and similar policies with OPA/conftest.

7) The tool is a means

Terraform isn’t omnipotent. Frequently changing cluster internals go to ArgoCD, passwords go to Secrets Manager, monitoring goes to observability tools. Be clear about what Terraform should and shouldn’t do.


This is the final part of the Terraform series. Starting from creating your first main.tf, we’ve come full circle through variables and state, modules, environment separation, CI/CD, testing, and real-world operations.

Terraform is a powerful tool for managing infrastructure as code, but memorizing features alone isn’t enough. The habit of always thinking “What impact will this change have?” and making small, safe changes is what builds operational skill. Treat your infrastructure code with the same care as application code, and above all, treat state with care. That’s all there is to it.

I hope that someone who started with this series will gradually strengthen their team’s infrastructure, one step at a time. At every pitfall you encounter along the way, a well-crafted line of code will be your most reliable safety net.

