Table of contents
- Knowing the Tool vs Using It Well
- Directory Structure Conventions
- Tagging Strategy
- Common Mistake — State Loss
- Common Mistake — Circular References
- Common Mistake — Destroy Incidents
- Large-Scale Project Splitting
- Migration Strategy — Moving Existing Infrastructure to Terraform
- Operational Essentials to Remember
Knowing the Tool vs Using It Well
Over the course of this series, we've covered most of Terraform's features: syntax, resources, modules, state, workspaces, CI/CD, and testing. But knowing these doesn't mean you can immediately run large infrastructure well.
What matters in practice is different: when to choose which structure, which mistakes to avoid, and how to split things as scale grows. This part compiles the patterns and pitfalls repeatedly encountered in real operations.
Directory Structure Conventions
There’s no single right answer, but several frequently used patterns exist. Choose based on project scale and team structure.
Pattern 1: Flat structure (small scale)
infra/
├── main.tf
├── variables.tf
├── outputs.tf
└── terraform.tfvars
For a single environment with 20 or fewer resources. Suitable for simple prototypes or side projects: easy to manage, but it doesn't scale.
Pattern 2: Environment separation (small to medium)
infra/
├── modules/
│ ├── vpc/
│ ├── eks/
│ └── rds/
└── envs/
├── dev/
├── staging/
└── prod/
Modularization + environment separation. This is sufficient for most startup/mid-size teams in practice.
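As a sketch, an environment directory then consumes the shared modules through relative paths; the module inputs below are hypothetical:
# envs/dev/main.tf
module "vpc" {
  source      = "../../modules/vpc"
  environment = "dev"          # hypothetical module input
  cidr_block  = "10.10.0.0/16" # hypothetical module input
}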
Pattern 3: Component separation (medium to large)
infra/
├── modules/
│ ├── vpc/
│ ├── eks/
│ └── rds/
└── envs/
├── dev/
│ ├── network/ # VPC, subnets, etc.
│ ├── cluster/ # EKS
│ ├── database/ # RDS
│ └── application/ # App deployment
└── prod/
├── network/
├── cluster/
├── database/
└── application/
Separate components even within environments. State is split per component, making management units smaller. The advantage is clear: changes to networking don’t affect database state. The downside is that cross-component references (output sharing) become cumbersome. This can be resolved with remote state data sources or Terragrunt’s dependency blocks.
# envs/prod/application/main.tf
data "terraform_remote_state" "network" {
backend = "s3"
config = {
bucket = "my-company-tfstate"
key = "envs/prod/network/terraform.tfstate"
region = "ap-northeast-2"
}
}
resource "aws_instance" "app" {
subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}
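If you use Terragrunt instead, the equivalent wiring uses a dependency block. A minimal sketch; the paths and output names below are assumptions:
# envs/prod/application/terragrunt.hcl
dependency "network" {
  config_path = "../network" # assumed relative path to the network component
}

inputs = {
  subnet_id = dependency.network.outputs.private_subnet_ids[0]
}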
Pattern 4: Live/module separation (large scale)
terraform-modules/ # Separate repo
├── vpc/
├── eks/
└── rds/
infra-live/ # Separate repo
└── envs/
├── dev/
└── prod/
Modules are completely separated into their own repo. Tag modules with SemVer, and reference them with version pins from infra-live. A natural structure when platform and service teams separate in large organizations.
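A version-pinned module reference from infra-live might look like this; the repo URL and tag are hypothetical:
module "vpc" {
  # Pin to a SemVer tag of the separate modules repo
  source = "git::https://github.com/my-org/terraform-modules.git//vpc?ref=v1.4.2"

  # inputs as defined by the module's public interface
}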
flowchart TB
Size["Team/infra scale"]
Size -->|"Small"| Flat["Flat structure"]
Size -->|"Small-medium"| Envs["Environment separation"]
Size -->|"Medium-large"| Comp["Component separation"]
Size -->|"Large"| Repo["Live/module repo separation"]
Flat -.-> Flat2["Simple management,\nlimited scalability"]
Envs -.-> Envs2["Per-environment independence,\nmixed components"]
Comp -.-> Comp2["Granular state,\ncomplex references"]
Repo -.-> Repo2["Clear team boundaries,\nhigher operational cost"]
Whichever pattern you start with, the most important thing is to modularize from the beginning so splitting later is easy.
Tagging Strategy
Tags may seem trivial, but they’re critical in operations. Cost analysis, resource search, policy enforcement, and ownership tracking all rely on tags.
Common tags should be applied consistently to all resources. A typical minimum set looks like this.
| Tag | Meaning | Example |
|---|---|---|
| Environment | Deployment environment | dev, staging, prod |
| Service | Service name | order-api, auth, platform |
| Owner | Owning team or person | platform-team, alice@company.com |
| CostCenter | Cost center | engineering, marketing |
| ManagedBy | Management tool | terraform, manual |
| Repo | Defining repository | github.com/org/infra-repo |
Setting default tags at the provider level auto-applies them to all resources.
provider "aws" {
region = "ap-northeast-2"
default_tags {
tags = {
Environment = var.environment
Service = var.service_name
Owner = "platform-team"
ManagedBy = "terraform"
Repo = "github.com/my-org/infra-repo"
}
}
}
default_tags is supported from AWS provider 3.38 onward. With this, you don’t need to write tags on every resource. If a resource needs additional tags, just add them in its tags block.
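For example, a resource that needs extra tags on top of the provider defaults only declares the additions; the bucket name and tag values here are hypothetical:
resource "aws_s3_bucket" "logs" {
  bucket = "my-company-app-logs" # hypothetical bucket name

  tags = {
    DataClass = "internal" # merged on top of the provider default_tags
  }
}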
Automate tag policy validation
Use OPA as shown in the previous part to “fail on missing required tags.” This isn’t something you can manually check every time.
Common Mistake — State Loss
Losing state is one of the most horrifying incidents in Terraform. Without state, Terraform tries to create every resource again: recreating an existing RDS instance fails with a name collision, or a useless duplicate appears next to the original.
Causes
- Losing the laptop with local state
- Accidentally deleting the remote backend bucket
- Multiple people applying simultaneously, corrupting state
- Misusing terraform state rm and removing needed resources from state
Prevention
- Ban local state: If working as a team, always use a remote backend
- Enable bucket versioning: Mandatory for S3/GCS/Azure Blob
- Protect the bucket from deletion: prevent_destroy = true on the state bucket itself
- Enable locking: DynamoDB, GCS native locking, Azure Blob lease
Code to manage and protect the state bucket itself with Terraform.
resource "aws_s3_bucket" "tfstate" {
bucket = "my-company-tfstate"
lifecycle {
prevent_destroy = true # Rejects even accidental destroy
}
}
resource "aws_s3_bucket_versioning" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_policy" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Sid = "DenyDelete"
Effect = "Deny"
Principal = "*"
Action = ["s3:DeleteBucket", "s3:DeleteBucketPolicy"]
Resource = [aws_s3_bucket.tfstate.arn]
}]
})
}
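On the consumer side, each component then points its backend at this bucket with locking enabled. A minimal sketch, assuming a DynamoDB lock table named terraform-locks with a LockID partition key:
terraform {
  backend "s3" {
    bucket         = "my-company-tfstate"
    key            = "envs/prod/network/terraform.tfstate"
    region         = "ap-northeast-2"
    encrypt        = true
    dynamodb_table = "terraform-locks" # assumed lock table (partition key: LockID)
  }
}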
Recovery
What if you lose it anyway, with no versioned copy to restore? There's only one way: rebuild state from scratch with import.
import {
to = aws_vpc.main
id = "vpc-0abc123def456"
}
Add import blocks one by one for each resource, check the diff with terraform plan, and adjust the code to match. Painful with dozens, nightmarish with hundreds. That’s why prevention is everything.
Common Mistake — Circular References
If module A uses module B’s output and module B uses module A’s output, a circular reference occurs. Terraform can’t compute the DAG and fails.
Error: Cycle detected in configuration
Typical case
# Bad example
module "sg_web" {
source = "./modules/sg"
ingress_from_sg = module.sg_app.security_group_id # References app
}
module "sg_app" {
source = "./modules/sg"
ingress_from_sg = module.sg_web.security_group_id # References web — circular!
}
Solution
For cases like security groups that need mutual references, extract the rules into separate resources.
module "sg_web" {
source = "./modules/sg"
}
module "sg_app" {
source = "./modules/sg"
}
# Mutual references without circularity
resource "aws_security_group_rule" "web_from_app" {
type = "ingress"
from_port = 80
to_port = 80
protocol = "tcp"
security_group_id = module.sg_web.security_group_id
source_security_group_id = module.sg_app.security_group_id
}
resource "aws_security_group_rule" "app_from_web" {
type = "ingress"
from_port = 8080
to_port = 8080
protocol = "tcp"
security_group_id = module.sg_app.security_group_id
source_security_group_id = module.sg_web.security_group_id
}
By separating rules into resources, the SGs themselves don’t need to know about each other. Terraform creates the SGs first, then connects the rules.
Common Mistake — Destroy Incidents
The most heart-stopping incident. Running terraform destroy on prod, or prod resources getting deleted during code refactoring.
Defenses
- Block terraform destroy from running in prod environments

  In CI, never execute the destroy command on the prod directory. If a manual destroy is needed, require multi-step approval.

- prevent_destroy on critical resources

  resource "aws_db_instance" "prod_db" {
    # ...

    lifecycle {
      prevent_destroy = true
    }
  }

  If this resource becomes a destroy target, terraform plan fails immediately.

- Always use state mv when renaming resources (see the moved block sketch after this list)

  Renaming resource "aws_instance" "old_name" to "new_name" makes Terraform try to delete the existing resource and recreate it under the new name. Fatal for a production DB.

  terraform state mv aws_instance.old_name aws_instance.new_name

  Renaming only in state means the actual resource isn't touched and is simply recognized under the new name.

- Always review plan results

  If a "- destroy" line appears in the plan output, always be suspicious. Unless it was intentional, something went wrong.
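From Terraform 1.1 onward, a moved block expresses the same rename declaratively in the code itself, so every workspace applies the state move on its next plan without anyone running state mv by hand:
moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}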
flowchart LR
Plan["terraform plan"] --> Check{"Any destroy\noperations?"}
Check -->|"Yes"| Review["Double check"]
Check -->|"No"| OK["apply"]
Review --> Intended{"Intentional\ndestroy?"}
Intended -->|"No"| Stop["Stop and investigate"]
Intended -->|"Yes"| Confirm["Approve and apply"]
Large-Scale Project Splitting
When an infrastructure repo grows large, several problems emerge.
- terraform plan takes 5 or 10 minutes
- State files grow to tens of MB, slowing operations
- Lock contention from multiple teams working simultaneously
- Change impact becomes hard to assess
The solution is state-level splitting. Break one massive state into multiple smaller states.
Splitting criteria
- Change frequency — Separate rarely changing from frequently changing (VPC rarely changes, app deployments change often)
- Owning team — Separate by management owner
- Lifecycle — Group things that are created/destroyed together
- Blast radius — Minimize the scope of impact from a single mistake
Example:
envs/prod/
├── 01-network/ # VPC, subnets, routing (rarely changes)
├── 02-security/ # Common security groups, KMS keys
├── 03-cluster/ # EKS cluster (quarterly upgrades)
├── 04-database/ # RDS, ElastiCache (rarely changes)
├── 05-bootstrap/ # Cluster bootstrap (monthly changes)
└── 06-applications/ # App deployment (daily changes)
Numbering by dependency order makes creation/deletion order clear.
Split migration
If you need to split an already-massive state, you'll reach for tools like terraform state mv run against the remote state, terraform state rm plus terraform state push, or Terragrunt run-all. The work is large and risky, so proceed incrementally:
- Create the new state directory’s backend first
- Define resources to move in HCL
- Move between states with terraform state mv -state-out=..., or rm from the old state and import into the new one (see the declarative sketch below)
- Verify no diff on both sides with plan
Always practice on dev/staging first, and secure full state backups before proceeding.
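For the rm-and-import route, newer Terraform versions let you express both sides declaratively, which keeps the migration reviewable in a PR. A minimal sketch, assuming a database resource is being moved (removed blocks require Terraform 1.7+; the resource and identifier are hypothetical):
# In the old stack: forget the resource without destroying it
removed {
  from = aws_db_instance.main

  lifecycle {
    destroy = false
  }
}

# In the new stack: adopt the same resource into the new state
import {
  to = aws_db_instance.main
  id = "prod-db-instance" # hypothetical DB instance identifier
}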
Migration Strategy — Moving Existing Infrastructure to Terraform
For new projects, just start with Terraform from day one. But what if you need to move infrastructure that's already running, built through the console or CloudFormation, to Terraform?
Approach 1: Incremental import
The safest method. Don’t import everything at once; start with small units.
Phase 1: Network (VPC, subnets) — Bottom layer, rarely changes
Phase 2: Data (RDS, S3) — Sensitive but rarely changes
Phase 3: Compute (EC2, ECS, EKS)
Phase 4: Application layer (app deployments, Helm charts)
At each phase, write HCL with import blocks and adjust code until plan shows zero diff.
import {
to = aws_vpc.main
id = "vpc-0abc123def456"
}
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
# ... fill in while checking plan diffs
}
From Terraform 1.5 onward, you can use import blocks with the -generate-config-out flag to auto-generate an HCL draft.
terraform plan -generate-config-out=generated.tf
Refine the generated code and modularize.
Approach 2: Tools like Terraformer
Terraformer, made by Google, scans existing cloud resources and generates HCL and state simultaneously. It’s fast, but the generated code needs significant cleanup.
terraformer import aws --resources=vpc,subnet --regions=ap-northeast-2
Useful when you need to import a large number of resources at once.
Approach 3: Parallel operations
Leave existing infrastructure as-is and create only new resources with Terraform. Gradually push existing infrastructure toward EOL while consistently building new with Terraform. Takes the longest but carries the lowest risk.
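New Terraform-managed resources can still attach to the legacy infrastructure through data sources while the old resources stay unmanaged. A minimal sketch, assuming the console-built VPC carries a Name tag:
# Look up the console-built VPC without managing it
data "aws_vpc" "legacy" {
  tags = {
    Name = "legacy-vpc" # hypothetical tag on the existing VPC
  }
}

# New resources are created and managed by Terraform from day one
resource "aws_subnet" "new_workload" {
  vpc_id     = data.aws_vpc.legacy.id
  cidr_block = "10.0.100.0/24"
}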
Whichever approach you choose, thorough backups and dry-runs are essential. Dump state multiple times, run plan dozens of times, and only then actually apply.
Operational Essentials to Remember
Finally, here are a few principles that cut across everything covered so far.
1) Always review the plan
Reading plan results before apply is non-negotiable. “It’s probably fine” is the seed of incidents.
2) Apply small, apply often
A PR touching 50 resources at once is less safe than applying 5 at a time, ten times. Root cause analysis is easier on failure, and rollback is easier too.
3) State is your most valuable asset
Losing state makes recovery extremely difficult. Backend versioning, locking, and access control are non-negotiable.
4) Manual changes are forbidden
Changing things directly via the console creates drift and widens the gap between Terraform and reality. If you had no choice but to make an emergency fix by hand, reflect it in code immediately.
5) Modules are a public API
Think of a module’s inputs/outputs as a public API. Changing them carelessly affects all users. Manage versions with SemVer.
6) Policies as code
“Let’s be careful during PR review” doesn’t work. Automate tagging, security groups, encryption, and similar policies with OPA/conftest.
7) The tool is a means
Terraform isn’t omnipotent. Frequently changing cluster internals go to ArgoCD, passwords go to Secrets Manager, monitoring goes to observability tools. Be clear about what Terraform should and shouldn’t do.
This is the final part of the Terraform series. Starting from creating your first main.tf, we’ve come full circle through variables and state, modules, environment separation, CI/CD, testing, and real-world operations.
Terraform is a powerful tool for managing infrastructure as code, but memorizing features alone isn’t enough. The habit of always thinking “What impact will this change have?” and making small, safe changes is what builds operational skill. Treat your infrastructure code with the same care as application code, and above all, treat state with care. That’s all there is to it.
I hope that someone who started with this series will gradually strengthen their team’s infrastructure, one step at a time. At every pitfall you encounter along the way, a well-crafted line of code will be your most reliable safety net.

